2024-06-03 00:33:36

by Kent Overstreet

Subject: [PATCH 0/5] sys_ringbuffer

New syscall for mapping generic ringbuffers for arbitrary (supported)
file descriptors.

Ringbuffers can be created either when requested or at file open time,
and can be mapped into multiple address spaces (naturally, since files
can be shared as well).

Initial motivation is for fuse, but I plan on adding support to pipes
and possibly sockets as well - pipes are a particularly interesting use
case, because if both the sender and receiver of a pipe opt in to the
new ringbuffer interface, we can make them the _same_ ringbuffer for
true zero copy IO, while being backwards compatible with existing pipes.

The ringbuffer_wait and ringbuffer_wakeup syscalls are probably going
away in a future iteration, in favor of just using futexes.

In my testing, reading/writing from the ringbuffer 16 bytes at a time is
~7x faster than using read/write syscalls - and I was testing with
mitigations off, so the real-world benefit should be even higher.

Kent Overstreet (5):
darray: lift from bcachefs
darray: Fix darray_for_each_reverse() when darray is empty
fs: sys_ringbuffer
ringbuffer: Test device
ringbuffer: Userspace test helper

MAINTAINERS | 7 +
arch/x86/entry/syscalls/syscall_32.tbl | 3 +
arch/x86/entry/syscalls/syscall_64.tbl | 3 +
fs/Makefile | 2 +
fs/bcachefs/Makefile | 1 -
fs/bcachefs/btree_types.h | 2 +-
fs/bcachefs/btree_update.c | 2 +
fs/bcachefs/btree_write_buffer_types.h | 2 +-
fs/bcachefs/fsck.c | 2 +-
fs/bcachefs/journal_io.h | 2 +-
fs/bcachefs/journal_sb.c | 2 +-
fs/bcachefs/sb-downgrade.c | 3 +-
fs/bcachefs/sb-errors_types.h | 2 +-
fs/bcachefs/sb-members.h | 3 +-
fs/bcachefs/subvolume.h | 1 -
fs/bcachefs/subvolume_types.h | 2 +-
fs/bcachefs/thread_with_file_types.h | 2 +-
fs/bcachefs/util.h | 28 +-
fs/ringbuffer.c | 474 ++++++++++++++++++++++++
fs/ringbuffer_test.c | 209 +++++++++++
{fs/bcachefs => include/linux}/darray.h | 61 +--
include/linux/darray_types.h | 22 ++
include/linux/fs.h | 2 +
include/linux/mm_types.h | 4 +
include/linux/ringbuffer_sys.h | 18 +
include/uapi/linux/futex.h | 1 +
include/uapi/linux/ringbuffer_sys.h | 40 ++
init/Kconfig | 9 +
kernel/fork.c | 2 +
lib/Kconfig.debug | 5 +
lib/Makefile | 2 +-
{fs/bcachefs => lib}/darray.c | 12 +-
tools/ringbuffer/Makefile | 3 +
tools/ringbuffer/ringbuffer-test.c | 254 +++++++++++++
34 files changed, 1125 insertions(+), 62 deletions(-)
create mode 100644 fs/ringbuffer.c
create mode 100644 fs/ringbuffer_test.c
rename {fs/bcachefs => include/linux}/darray.h (63%)
create mode 100644 include/linux/darray_types.h
create mode 100644 include/linux/ringbuffer_sys.h
create mode 100644 include/uapi/linux/ringbuffer_sys.h
rename {fs/bcachefs => lib}/darray.c (56%)
create mode 100644 tools/ringbuffer/Makefile
create mode 100644 tools/ringbuffer/ringbuffer-test.c

--
2.45.1



2024-06-03 00:33:45

by Kent Overstreet

Subject: [PATCH 1/5] darray: lift from bcachefs

Dynamic arrays - inspired by CCAN darrays, basically C++ STL vectors.

Used by thread_with_stdio, which is also being lifted from bcachefs for
xfs.
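
For readers unfamiliar with the pattern, here is a minimal userspace
sketch of the idea, not the kernel code: a typed growable array whose
capacity is rounded up to a power of two when it has to grow. The names
int_darray, int_darray_make_room, int_darray_push, int_darray_exit and
roundup_pow2 are made up for illustration, realloc() stands in for the
kernel allocator, and the preallocated inline storage is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of DARRAY(int): count, capacity, storage. */
struct int_darray {
	size_t nr, size;
	int *data;
};

/* Round up to the next power of two (assumes 1 <= n <= SIZE_MAX / 2). */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Grow only when needed: a cheap inline check, then the slow path. */
static int int_darray_make_room(struct int_darray *d, size_t more)
{
	size_t want = d->nr + more;

	if (want <= d->size)
		return 0;

	want = roundup_pow2(want);
	int *p = realloc(d->data, want * sizeof(*p));
	if (!p)
		return -1;

	d->data = p;
	d->size = want;
	return 0;
}

/* Append one element, growing the storage if required. */
static int int_darray_push(struct int_darray *d, int v)
{
	if (int_darray_make_room(d, 1))
		return -1;
	d->data[d->nr++] = v;
	assert(d->nr <= d->size);
	return 0;
}

/* Release the storage and return to the empty state. */
static void int_darray_exit(struct int_darray *d)
{
	free(d->data);
	d->data = NULL;
	d->nr = d->size = 0;
}
```

Starting from a zeroed struct, five pushes leave nr at 5 and size at 8,
since the capacity only ever takes power-of-two values.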

Signed-off-by: Kent Overstreet <[email protected]>
---
MAINTAINERS | 7 +++
fs/bcachefs/Makefile | 1 -
fs/bcachefs/btree_types.h | 2 +-
fs/bcachefs/btree_update.c | 2 +
fs/bcachefs/btree_write_buffer_types.h | 2 +-
fs/bcachefs/fsck.c | 2 +-
fs/bcachefs/journal_io.h | 2 +-
fs/bcachefs/journal_sb.c | 2 +-
fs/bcachefs/sb-downgrade.c | 3 +-
fs/bcachefs/sb-errors_types.h | 2 +-
fs/bcachefs/sb-members.h | 3 +-
fs/bcachefs/subvolume.h | 1 -
fs/bcachefs/subvolume_types.h | 2 +-
fs/bcachefs/thread_with_file_types.h | 2 +-
fs/bcachefs/util.h | 28 +-----------
{fs/bcachefs => include/linux}/darray.h | 59 ++++++++++++++++---------
include/linux/darray_types.h | 22 +++++++++
lib/Makefile | 2 +-
{fs/bcachefs => lib}/darray.c | 12 ++++-
19 files changed, 95 insertions(+), 61 deletions(-)
rename {fs/bcachefs => include/linux}/darray.h (66%)
create mode 100644 include/linux/darray_types.h
rename {fs/bcachefs => lib}/darray.c (56%)

diff --git a/MAINTAINERS b/MAINTAINERS
index d6c90161c7bf..fafa30715f66 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6010,6 +6010,13 @@ F: net/ax25/ax25_out.c
F: net/ax25/ax25_timer.c
F: net/ax25/sysctl_net_ax25.c

+DARRAY
+M: Kent Overstreet <[email protected]>
+L: [email protected]
+S: Maintained
+F: include/linux/darray.h
+F: include/linux/darray_types.h
+
DATA ACCESS MONITOR
M: SeongJae Park <[email protected]>
L: [email protected]
diff --git a/fs/bcachefs/Makefile b/fs/bcachefs/Makefile
index 66ca0bbee639..281e4a7c1f31 100644
--- a/fs/bcachefs/Makefile
+++ b/fs/bcachefs/Makefile
@@ -28,7 +28,6 @@ bcachefs-y := \
checksum.o \
clock.o \
compress.o \
- darray.o \
debug.o \
dirent.o \
disk_groups.o \
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index d63db4fefe73..7dcd015619af 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -2,13 +2,13 @@
#ifndef _BCACHEFS_BTREE_TYPES_H
#define _BCACHEFS_BTREE_TYPES_H

+#include <linux/darray_types.h>
#include <linux/list.h>
#include <linux/rhashtable.h>

#include "bbpos_types.h"
#include "btree_key_cache_types.h"
#include "buckets_types.h"
-#include "darray.h"
#include "errcode.h"
#include "journal_types.h"
#include "replicas_types.h"
diff --git a/fs/bcachefs/btree_update.c b/fs/bcachefs/btree_update.c
index f3c645a43dcb..23d52129db40 100644
--- a/fs/bcachefs/btree_update.c
+++ b/fs/bcachefs/btree_update.c
@@ -14,6 +14,8 @@
#include "snapshot.h"
#include "trace.h"

+#include <linux/darray.h>
+
static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
const struct btree_insert_entry *r)
{
diff --git a/fs/bcachefs/btree_write_buffer_types.h b/fs/bcachefs/btree_write_buffer_types.h
index 9b9433de9c36..5f248873087c 100644
--- a/fs/bcachefs/btree_write_buffer_types.h
+++ b/fs/bcachefs/btree_write_buffer_types.h
@@ -2,7 +2,7 @@
#ifndef _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
#define _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H

-#include "darray.h"
+#include <linux/darray_types.h>
#include "journal_types.h"

#define BTREE_WRITE_BUFERED_VAL_U64s_MAX 4
diff --git a/fs/bcachefs/fsck.c b/fs/bcachefs/fsck.c
index c8f57465131c..3ead927285b6 100644
--- a/fs/bcachefs/fsck.c
+++ b/fs/bcachefs/fsck.c
@@ -5,7 +5,6 @@
#include "btree_cache.h"
#include "btree_update.h"
#include "buckets.h"
-#include "darray.h"
#include "dirent.h"
#include "error.h"
#include "fs-common.h"
@@ -18,6 +17,7 @@
#include "xattr.h"

#include <linux/bsearch.h>
+#include <linux/darray.h>
#include <linux/dcache.h> /* struct qstr */

/*
diff --git a/fs/bcachefs/journal_io.h b/fs/bcachefs/journal_io.h
index 2ca9cde30ea8..2b8f458cf13c 100644
--- a/fs/bcachefs/journal_io.h
+++ b/fs/bcachefs/journal_io.h
@@ -2,7 +2,7 @@
#ifndef _BCACHEFS_JOURNAL_IO_H
#define _BCACHEFS_JOURNAL_IO_H

-#include "darray.h"
+#include <linux/darray_types.h>

void bch2_journal_pos_from_member_info_set(struct bch_fs *);
void bch2_journal_pos_from_member_info_resume(struct bch_fs *);
diff --git a/fs/bcachefs/journal_sb.c b/fs/bcachefs/journal_sb.c
index db80e506e3ab..9db57f6f1035 100644
--- a/fs/bcachefs/journal_sb.c
+++ b/fs/bcachefs/journal_sb.c
@@ -2,8 +2,8 @@

#include "bcachefs.h"
#include "journal_sb.h"
-#include "darray.h"

+#include <linux/darray.h>
#include <linux/sort.h>

/* BCH_SB_FIELD_journal: */
diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c
index 390a1bbd2567..526e2c26d1b4 100644
--- a/fs/bcachefs/sb-downgrade.c
+++ b/fs/bcachefs/sb-downgrade.c
@@ -6,12 +6,13 @@
*/

#include "bcachefs.h"
-#include "darray.h"
#include "recovery_passes.h"
#include "sb-downgrade.h"
#include "sb-errors.h"
#include "super-io.h"

+#include <linux/darray.h>
+
#define RECOVERY_PASS_ALL_FSCK BIT_ULL(63)

/*
diff --git a/fs/bcachefs/sb-errors_types.h b/fs/bcachefs/sb-errors_types.h
index 666599d3fb9d..39cae3a6a024 100644
--- a/fs/bcachefs/sb-errors_types.h
+++ b/fs/bcachefs/sb-errors_types.h
@@ -2,7 +2,7 @@
#ifndef _BCACHEFS_SB_ERRORS_TYPES_H
#define _BCACHEFS_SB_ERRORS_TYPES_H

-#include "darray.h"
+#include <linux/darray_types.h>

#define BCH_SB_ERRS() \
x(clean_but_journal_not_empty, 0) \
diff --git a/fs/bcachefs/sb-members.h b/fs/bcachefs/sb-members.h
index dd93192ec065..338275899b60 100644
--- a/fs/bcachefs/sb-members.h
+++ b/fs/bcachefs/sb-members.h
@@ -2,9 +2,10 @@
#ifndef _BCACHEFS_SB_MEMBERS_H
#define _BCACHEFS_SB_MEMBERS_H

-#include "darray.h"
#include "bkey_types.h"

+#include <linux/darray.h>
+
extern char * const bch2_member_error_strs[];

static inline struct bch_member *
diff --git a/fs/bcachefs/subvolume.h b/fs/bcachefs/subvolume.h
index afa5e871efb2..0311b8669c76 100644
--- a/fs/bcachefs/subvolume.h
+++ b/fs/bcachefs/subvolume.h
@@ -2,7 +2,6 @@
#ifndef _BCACHEFS_SUBVOLUME_H
#define _BCACHEFS_SUBVOLUME_H

-#include "darray.h"
#include "subvolume_types.h"

enum bch_validate_flags;
diff --git a/fs/bcachefs/subvolume_types.h b/fs/bcachefs/subvolume_types.h
index 9b10c8947828..3a1ee762ad61 100644
--- a/fs/bcachefs/subvolume_types.h
+++ b/fs/bcachefs/subvolume_types.h
@@ -2,7 +2,7 @@
#ifndef _BCACHEFS_SUBVOLUME_TYPES_H
#define _BCACHEFS_SUBVOLUME_TYPES_H

-#include "darray.h"
+#include <linux/darray_types.h>

typedef DARRAY(u32) snapshot_id_list;

diff --git a/fs/bcachefs/thread_with_file_types.h b/fs/bcachefs/thread_with_file_types.h
index e0daf4eec341..41990756aa26 100644
--- a/fs/bcachefs/thread_with_file_types.h
+++ b/fs/bcachefs/thread_with_file_types.h
@@ -2,7 +2,7 @@
#ifndef _BCACHEFS_THREAD_WITH_FILE_TYPES_H
#define _BCACHEFS_THREAD_WITH_FILE_TYPES_H

-#include "darray.h"
+#include <linux/darray_types.h>

struct stdio_buf {
spinlock_t lock;
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index 5d2c470a49ac..1da52a8b3914 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -5,22 +5,22 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/closure.h>
+#include <linux/darray.h>
#include <linux/errno.h>
#include <linux/freezer.h>
#include <linux/kernel.h>
-#include <linux/sched/clock.h>
#include <linux/llist.h>
#include <linux/log2.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/ratelimit.h>
+#include <linux/sched/clock.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/workqueue.h>

#include "mean_and_variance.h"

-#include "darray.h"
#include "time_stats.h"

struct closure;
@@ -626,30 +626,6 @@ static inline void memset_u64s_tail(void *s, int c, unsigned bytes)
memset(s + bytes, c, rem);
}

-/* just the memmove, doesn't update @_nr */
-#define __array_insert_item(_array, _nr, _pos) \
- memmove(&(_array)[(_pos) + 1], \
- &(_array)[(_pos)], \
- sizeof((_array)[0]) * ((_nr) - (_pos)))
-
-#define array_insert_item(_array, _nr, _pos, _new_item) \
-do { \
- __array_insert_item(_array, _nr, _pos); \
- (_nr)++; \
- (_array)[(_pos)] = (_new_item); \
-} while (0)
-
-#define array_remove_items(_array, _nr, _pos, _nr_to_remove) \
-do { \
- (_nr) -= (_nr_to_remove); \
- memmove(&(_array)[(_pos)], \
- &(_array)[(_pos) + (_nr_to_remove)], \
- sizeof((_array)[0]) * ((_nr) - (_pos))); \
-} while (0)
-
-#define array_remove_item(_array, _nr, _pos) \
- array_remove_items(_array, _nr, _pos, 1)
-
static inline void __move_gap(void *array, size_t element_size,
size_t nr, size_t size,
size_t old_gap, size_t new_gap)
diff --git a/fs/bcachefs/darray.h b/include/linux/darray.h
similarity index 66%
rename from fs/bcachefs/darray.h
rename to include/linux/darray.h
index 4b340d13caac..ff167eb795f2 100644
--- a/fs/bcachefs/darray.h
+++ b/include/linux/darray.h
@@ -1,34 +1,26 @@
/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_DARRAY_H
-#define _BCACHEFS_DARRAY_H
+/*
+ * (C) 2022-2024 Kent Overstreet <[email protected]>
+ */
+#ifndef _LINUX_DARRAY_H
+#define _LINUX_DARRAY_H

/*
- * Dynamic arrays:
+ * Dynamic arrays
*
* Inspired by CCAN's darray
*/

+#include <linux/darray_types.h>
#include <linux/slab.h>

-#define DARRAY_PREALLOCATED(_type, _nr) \
-struct { \
- size_t nr, size; \
- _type *data; \
- _type preallocated[_nr]; \
-}
-
-#define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
-
-typedef DARRAY(char) darray_char;
-typedef DARRAY(char *) darray_str;
-
-int __bch2_darray_resize(darray_char *, size_t, size_t, gfp_t);
+int __darray_resize_slowpath(darray_char *, size_t, size_t, gfp_t);

static inline int __darray_resize(darray_char *d, size_t element_size,
size_t new_size, gfp_t gfp)
{
return unlikely(new_size > d->size)
- ? __bch2_darray_resize(d, element_size, new_size, gfp)
+ ? __darray_resize_slowpath(d, element_size, new_size, gfp)
: 0;
}

@@ -69,6 +61,28 @@ static inline int __darray_make_room(darray_char *d, size_t t_size, size_t more,
#define darray_first(_d) ((_d).data[0])
#define darray_last(_d) ((_d).data[(_d).nr - 1])

+/* Insert/remove items into the middle of a darray: */
+
+#define array_insert_item(_array, _nr, _pos, _new_item) \
+do { \
+ memmove(&(_array)[(_pos) + 1], \
+ &(_array)[(_pos)], \
+ sizeof((_array)[0]) * ((_nr) - (_pos))); \
+ (_nr)++; \
+ (_array)[(_pos)] = (_new_item); \
+} while (0)
+
+#define array_remove_items(_array, _nr, _pos, _nr_to_remove) \
+do { \
+ (_nr) -= (_nr_to_remove); \
+ memmove(&(_array)[(_pos)], \
+ &(_array)[(_pos) + (_nr_to_remove)], \
+ sizeof((_array)[0]) * ((_nr) - (_pos))); \
+} while (0)
+
+#define array_remove_item(_array, _nr, _pos) \
+ array_remove_items(_array, _nr, _pos, 1)
+
#define darray_insert_item(_d, pos, _item) \
({ \
size_t _pos = (pos); \
@@ -79,10 +93,15 @@ static inline int __darray_make_room(darray_char *d, size_t t_size, size_t more,
_ret; \
})

+#define darray_remove_items(_d, _pos, _nr_to_remove) \
+ array_remove_items((_d)->data, (_d)->nr, (_pos) - (_d)->data, _nr_to_remove)
+
#define darray_remove_item(_d, _pos) \
- array_remove_item((_d)->data, (_d)->nr, (_pos) - (_d)->data)
+ darray_remove_items(_d, _pos, 1)
+
+/* Iteration: */

-#define __darray_for_each(_d, _i) \
+#define __darray_for_each(_d, _i) \
for ((_i) = (_d).data; _i < (_d).data + (_d).nr; _i++)

#define darray_for_each(_d, _i) \
@@ -106,4 +125,4 @@ do { \
darray_init(_d); \
} while (0)

-#endif /* _BCACHEFS_DARRAY_H */
+#endif /* _LINUX_DARRAY_H */
diff --git a/include/linux/darray_types.h b/include/linux/darray_types.h
new file mode 100644
index 000000000000..a400a0c3600d
--- /dev/null
+++ b/include/linux/darray_types.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * (C) 2022-2024 Kent Overstreet <[email protected]>
+ */
+#ifndef _LINUX_DARRAY_TYPES_H
+#define _LINUX_DARRAY_TYPES_H
+
+#include <linux/types.h>
+
+#define DARRAY_PREALLOCATED(_type, _nr) \
+struct { \
+ size_t nr, size; \
+ _type *data; \
+ _type preallocated[_nr]; \
+}
+
+#define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
+
+typedef DARRAY(char) darray_char;
+typedef DARRAY(char *) darray_str;
+
+#endif /* _LINUX_DARRAY_TYPES_H */
diff --git a/lib/Makefile b/lib/Makefile
index 3b1769045651..f540c84e8c08 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -48,7 +48,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
percpu-refcount.o rhashtable.o base64.o \
once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
- generic-radix-tree.o bitmap-str.o
+ generic-radix-tree.o bitmap-str.o darray.o
obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
obj-y += string_helpers.o
obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
diff --git a/fs/bcachefs/darray.c b/lib/darray.c
similarity index 56%
rename from fs/bcachefs/darray.c
rename to lib/darray.c
index ac35b8b705ae..7cb064f14b39 100644
--- a/fs/bcachefs/darray.c
+++ b/lib/darray.c
@@ -1,10 +1,14 @@
// SPDX-License-Identifier: GPL-2.0
+/*
+ * (C) 2022-2024 Kent Overstreet <[email protected]>
+ */

+#include <linux/darray.h>
#include <linux/log2.h>
+#include <linux/module.h>
#include <linux/slab.h>
-#include "darray.h"

-int __bch2_darray_resize(darray_char *d, size_t element_size, size_t new_size, gfp_t gfp)
+int __darray_resize_slowpath(darray_char *d, size_t element_size, size_t new_size, gfp_t gfp)
{
if (new_size > d->size) {
new_size = roundup_pow_of_two(new_size);
@@ -22,3 +26,7 @@ int __bch2_darray_resize(darray_char *d, size_t element_size, size_t new_size, g

return 0;
}
+EXPORT_SYMBOL_GPL(__darray_resize_slowpath);
+
+MODULE_AUTHOR("Kent Overstreet");
+MODULE_LICENSE("GPL");
--
2.45.1


2024-06-03 00:34:10

by Kent Overstreet

Subject: [PATCH 2/5] darray: Fix darray_for_each_reverse() when darray is empty
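
The commit carries no description. Reading the diff, the likely failure
mode is this: a darray that has never been allocated has data == NULL
and nr == 0, so the loop's start pointer, data + nr - 1, lands one
element below NULL. On typical hardware that wraps to a very high
address, which still compares >= data, so the loop body runs on an
empty array. The sketch below models this with unsigned integers
(pointer arithmetic on NULL is undefined in C, so uintptr_t stands in
for the pointers). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the initial value of _i: data + nr - 1, scaled by the element size. */
static uintptr_t reverse_start(uintptr_t data, size_t nr, size_t elem_size)
{
	assert(elem_size > 0);
	return data + nr * elem_size - elem_size;
}

/* Loop condition before the fix: _i >= data. */
static int cond_unguarded(uintptr_t data, size_t nr, size_t elem_size)
{
	return reverse_start(data, nr, elem_size) >= data;
}

/* Loop condition after the fix: data && _i >= data. */
static int cond_guarded(uintptr_t data, size_t nr, size_t elem_size)
{
	return data && reverse_start(data, nr, elem_size) >= data;
}
```

With data == 0 and nr == 0 the unguarded condition is true and the
guarded one is false. With a non-NULL data and nr == 0 both are false,
because the start value is then simply below data.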

Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/darray.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/darray.h b/include/linux/darray.h
index ff167eb795f2..603d6762c29a 100644
--- a/include/linux/darray.h
+++ b/include/linux/darray.h
@@ -108,7 +108,7 @@ do { \
for (typeof(&(_d).data[0]) _i = (_d).data; _i < (_d).data + (_d).nr; _i++)

#define darray_for_each_reverse(_d, _i) \
- for (typeof(&(_d).data[0]) _i = (_d).data + (_d).nr - 1; _i >= (_d).data; --_i)
+ for (typeof(&(_d).data[0]) _i = (_d).data + (_d).nr - 1; (_d).data && _i >= (_d).data; --_i)

#define darray_init(_d) \
do { \
--
2.45.1


2024-06-03 00:34:35

by Kent Overstreet

Subject: [PATCH 3/5] fs: sys_ringbuffer

Add new syscalls for generic ringbuffers that can be attached to
arbitrary (supporting) file descriptors.

A ringbuffer consists of:
- a single page for head/tail pointers, size/mask, and other ancillary
metadata, described by 'struct ringbuffer_ptrs'
- a data buffer, consisting of one or more pages mapped at
'ringbuffer_ptrs.data_offset' above the address of 'ringbuffer_ptrs'

The data buffer is always a power of two size. Head and tail pointers
are u32 byte offsets, and they are stored unmasked (i.e., they use the
full 32 bit range) - they must be masked for reading.
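
As an illustration of the unmasked-pointer scheme described above, the
arithmetic can be sketched as follows. The helper names rb_used,
rb_free and rb_offset are hypothetical and not part of the patch. The
point is that head - tail stays correct across u32 wraparound, and that
masking happens only when indexing into the data buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes currently queued; correct across u32 wraparound of head/tail. */
static uint32_t rb_used(uint32_t head, uint32_t tail)
{
	return head - tail;
}

/* Bytes that can still be written before the buffer is full. */
static uint32_t rb_free(uint32_t head, uint32_t tail, uint32_t size)
{
	assert(size && !(size & (size - 1)));	/* must be a power of two */
	return size - rb_used(head, tail);
}

/* Offset into the data buffer for an unmasked pointer. */
static uint32_t rb_offset(uint32_t ptr, uint32_t size)
{
	assert(size && !(size & (size - 1)));	/* must be a power of two */
	return ptr & (size - 1);
}
```

The buffer is empty when head == tail and full when head - tail == size.
Because the pointers are unmasked the two states are distinguishable,
so no slot has to be left unused.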

- ringbuffer(int fd, int rw, u32 size, ulong *addr)

Create or get address of an existing ringbuffer for either reads or
writes, of at least size bytes, and attach it to the given file
descriptor; the address of the ringbuffer is returned via addr.

Since files can be shared between processes in different address spaces
a ringbuffer may be mapped into multiple address spaces via this
syscall.

- ringbuffer_wait(int fd, int rw)

Wait for space to be available (on a ringbuffer for writing), or data
to be available (on a ringbuffer for reading).

todo: add parameters for timeout, minimum amount of data/space to wait for

- ringbuffer_wakeup(int fd, int rw)

Required after writing to a previously empty ringbuffer, or reading from
a previously full ringbuffer, to notify waiters on the other end.

todo - investigate integrating with futexes?
todo - add extra fields to ringbuffer_ptrs for waiting on a minimum
amount of data/space, i.e. to signal when a wakeup is required
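
To make the wakeup rule concrete, here is a self-contained sketch of
how one side of the ring could decide whether a wakeup is needed. It is
a single-producer/single-consumer model in plain C11, not the code from
this patch: struct demo_ring is a hypothetical stand-in rather than the
uapi layout, and the actual syscall invocation is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the shared state; not the uapi layout. */
struct demo_ring {
	_Atomic uint32_t head;	/* advanced by the writer, unmasked */
	_Atomic uint32_t tail;	/* advanced by the reader, unmasked */
	uint32_t size;		/* power of two */
	uint32_t mask;		/* size - 1 */
	unsigned char *data;
};

/*
 * Copy up to len bytes in. Returns the number of bytes written;
 * *need_wakeup is set when the ring was empty beforehand.
 */
static uint32_t demo_ring_write(struct demo_ring *r, const void *buf,
				uint32_t len, int *need_wakeup)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	uint32_t space = r->size - (head - tail);
	uint32_t off = head & r->mask;

	assert(r->mask == r->size - 1);
	if (len > space)
		len = space;

	/* The copy may wrap around the end of the data buffer. */
	uint32_t first = len < r->size - off ? len : r->size - off;
	memcpy(r->data + off, buf, first);
	memcpy(r->data, (const unsigned char *)buf + first, len - first);

	*need_wakeup = len && head == tail;
	/* Publish the data before the new head becomes visible. */
	atomic_store_explicit(&r->head, head + len, memory_order_release);
	return len;
}

/*
 * Copy up to len bytes out. Returns the number of bytes read;
 * *need_wakeup is set when the ring was full beforehand.
 */
static uint32_t demo_ring_read(struct demo_ring *r, void *buf,
			       uint32_t len, int *need_wakeup)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	uint32_t avail = head - tail;
	uint32_t off = tail & r->mask;

	assert(r->mask == r->size - 1);
	if (len > avail)
		len = avail;

	uint32_t first = len < r->size - off ? len : r->size - off;
	memcpy(buf, r->data + off, first);
	memcpy((unsigned char *)buf + first, r->data, len - first);

	*need_wakeup = len && avail == r->size;
	/* Finish reading before the space is handed back to the writer. */
	atomic_store_explicit(&r->tail, tail + len, memory_order_release);
	return len;
}
```

In a real caller, a set need_wakeup would be followed by the
ringbuffer_wakeup syscall on the file descriptor. Skipping the syscall
in all other cases is what keeps the common path free of kernel entries.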

Kernel interfaces:
- To indicate that ringbuffers are supported on a file, implement the
->ringbuffer() method in your file_operations; it is called with the
file and the direction (READ or WRITE) and returns the ringbuffer to use.
- To read from or write to a file's associated ringbuffer, use
ringbuffer_read_iter() or ringbuffer_write_iter().

Signed-off-by: Kent Overstreet <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 3 +
arch/x86/entry/syscalls/syscall_64.tbl | 3 +
fs/Makefile | 1 +
fs/ringbuffer.c | 474 +++++++++++++++++++++++++
include/linux/fs.h | 2 +
include/linux/mm_types.h | 4 +
include/linux/ringbuffer_sys.h | 18 +
include/uapi/linux/futex.h | 1 +
include/uapi/linux/ringbuffer_sys.h | 40 +++
init/Kconfig | 9 +
kernel/fork.c | 2 +
11 files changed, 557 insertions(+)
create mode 100644 fs/ringbuffer.c
create mode 100644 include/linux/ringbuffer_sys.h
create mode 100644 include/uapi/linux/ringbuffer_sys.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7fd1f57ad3d3..2385359eaf75 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -467,3 +467,6 @@
460 i386 lsm_set_self_attr sys_lsm_set_self_attr
461 i386 lsm_list_modules sys_lsm_list_modules
462 i386 mseal sys_mseal
+463 i386 ringbuffer sys_ringbuffer
+464 i386 ringbuffer_wait sys_ringbuffer_wait
+465 i386 ringbuffer_wakeup sys_ringbuffer_wakeup
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index a396f6e6ab5b..942602ece075 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -384,6 +384,9 @@
460 common lsm_set_self_attr sys_lsm_set_self_attr
461 common lsm_list_modules sys_lsm_list_modules
462 common mseal sys_mseal
+463 common ringbuffer sys_ringbuffer
+464 common ringbuffer_wait sys_ringbuffer_wait
+465 common ringbuffer_wakeup sys_ringbuffer_wakeup

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/Makefile b/fs/Makefile
index 6ecc9b0a53f2..48e54ac01fb1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_RINGBUFFER) += ringbuffer.o
obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FS_ENCRYPTION) += crypto/
obj-$(CONFIG_FS_VERITY) += verity/
diff --git a/fs/ringbuffer.c b/fs/ringbuffer.c
new file mode 100644
index 000000000000..82e042c1c89b
--- /dev/null
+++ b/fs/ringbuffer.c
@@ -0,0 +1,474 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "%s() " fmt "\n", __func__
+
+#include <linux/darray.h>
+#include <linux/errname.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/mman.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/ringbuffer_sys.h>
+#include <linux/syscalls.h>
+#include <linux/uio.h>
+
+#define RINGBUFFER_FS_MAGIC 0xa10a10a2
+
+static DEFINE_MUTEX(ringbuffer_lock);
+
+static struct vfsmount *ringbuffer_mnt;
+
+struct ringbuffer_mapping {
+ ulong addr;
+ struct mm_struct *mm;
+};
+
+struct ringbuffer {
+ u32 size; /* always a power of two */
+ u32 mask; /* size - 1 */
+ unsigned order;
+ wait_queue_head_t wait[2];
+ struct ringbuffer_ptrs *ptrs;
+ void *data;
+ /* hidden internal file for the mmap */
+ struct file *rb_file;
+ DARRAY(struct ringbuffer_mapping) mms;
+};
+
+static const struct address_space_operations ringbuffer_aops = {
+ .dirty_folio = noop_dirty_folio,
+#if 0
+ .migrate_folio = ringbuffer_migrate_folio,
+#endif
+};
+
+#if 0
+static int ringbuffer_mremap(struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+ struct mm_struct *mm = vma->vm_mm;
+ struct kioctx_table *table;
+ int i, res = -EINVAL;
+
+ spin_lock(&mm->ioctx_lock);
+ rcu_read_lock();
+ table = rcu_dereference(mm->ioctx_table);
+ if (!table)
+ goto out_unlock;
+
+ for (i = 0; i < table->nr; i++) {
+ struct kioctx *ctx;
+
+ ctx = rcu_dereference(table->table[i]);
+ if (ctx && ctx->ringbuffer_file == file) {
+ if (!atomic_read(&ctx->dead)) {
+ ctx->user_id = ctx->mmap_base = vma->vm_start;
+ res = 0;
+ }
+ break;
+ }
+ }
+
+out_unlock:
+ rcu_read_unlock();
+ spin_unlock(&mm->ioctx_lock);
+ return res;
+}
+#endif
+
+static const struct vm_operations_struct ringbuffer_vm_ops = {
+#if 0
+ .mremap = ringbuffer_mremap,
+#endif
+#if IS_ENABLED(CONFIG_MMU)
+ .fault = filemap_fault,
+ .map_pages = filemap_map_pages,
+ .page_mkwrite = filemap_page_mkwrite,
+#endif
+};
+
+static int ringbuffer_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ vm_flags_set(vma, VM_DONTEXPAND);
+ vma->vm_ops = &ringbuffer_vm_ops;
+ return 0;
+}
+
+static const struct file_operations ringbuffer_fops = {
+ .mmap = ringbuffer_mmap,
+};
+
+void ringbuffer_free(struct ringbuffer *rb)
+{
+ pr_debug("%px", rb);
+
+ lockdep_assert_held(&ringbuffer_lock);
+
+ darray_for_each(rb->mms, map)
+ darray_for_each_reverse(map->mm->ringbuffers, rb2)
+ if (rb == *rb2)
+ darray_remove_item(&map->mm->ringbuffers, rb2);
+
+ if (rb->rb_file) {
+ /* Kills mapping: */
+ truncate_setsize(file_inode(rb->rb_file), 0);
+
+ struct address_space *mapping = rb->rb_file->f_mapping;
+ spin_lock(&mapping->i_private_lock);
+ mapping->i_private_data = NULL;
+ spin_unlock(&mapping->i_private_lock);
+
+ fput(rb->rb_file);
+ }
+
+ free_pages((ulong) rb->data, get_order(rb->size));
+ free_page((ulong) rb->ptrs);
+ kfree(rb);
+}
+
+static int ringbuffer_alloc_inode(struct ringbuffer *rb)
+{
+ struct inode *inode = alloc_anon_inode(ringbuffer_mnt->mnt_sb);
+ int ret = PTR_ERR_OR_ZERO(inode);
+ if (ret)
+ goto err;
+
+ inode->i_mapping->a_ops = &ringbuffer_aops;
+ inode->i_mapping->i_private_data = rb;
+ inode->i_size = rb->size * 2;
+ mapping_set_large_folios(inode->i_mapping);
+
+ rb->rb_file = alloc_file_pseudo(inode, ringbuffer_mnt, "[ringbuffer]",
+ O_RDWR, &ringbuffer_fops);
+ ret = PTR_ERR_OR_ZERO(rb->rb_file);
+ if (ret)
+ goto err_iput;
+
+ struct folio *f_ptrs = page_folio(virt_to_page(rb->ptrs));
+ struct folio *f_data = page_folio(virt_to_page(rb->data));
+
+ __folio_set_locked(f_ptrs);
+ __folio_mark_uptodate(f_ptrs);
+
+ void *shadow = NULL;
+ ret = __filemap_add_folio(rb->rb_file->f_mapping, f_ptrs,
+ (1U << rb->order) - 1, GFP_KERNEL, &shadow);
+ if (ret)
+ goto err;
+ folio_unlock(f_ptrs);
+
+ __folio_set_locked(f_data);
+ __folio_mark_uptodate(f_data);
+ shadow = NULL;
+ ret = __filemap_add_folio(rb->rb_file->f_mapping, f_data,
+ 1U << rb->order, GFP_KERNEL, &shadow);
+ if (ret)
+ goto err;
+ folio_unlock(f_data);
+ return 0;
+err_iput:
+ iput(inode);
+ return ret;
+err:
+ truncate_setsize(file_inode(rb->rb_file), 0);
+ fput(rb->rb_file);
+ return ret;
+}
+
+static int ringbuffer_map(struct ringbuffer *rb, ulong *addr)
+{
+ struct mm_struct *mm = current->mm;
+ int ret = 0;
+
+ lockdep_assert_held(&ringbuffer_lock);
+
+ if (!rb->rb_file) {
+ ret = ringbuffer_alloc_inode(rb);
+ if (ret)
+ return ret;
+ }
+
+ ret = darray_make_room(&rb->mms, 1) ?:
+ darray_make_room(&mm->ringbuffers, 1);
+ if (ret)
+ return ret;
+
+ ret = mmap_write_lock_killable(mm);
+ if (ret)
+ return ret;
+
+ ulong unused;
+ struct ringbuffer_mapping map = {
+ .addr = do_mmap(rb->rb_file, 0, rb->size + PAGE_SIZE,
+ PROT_READ|PROT_WRITE,
+ MAP_SHARED, 0,
+ (1U << rb->order) - 1,
+ &unused, NULL),
+ .mm = mm,
+ };
+ mmap_write_unlock(mm);
+
+ ret = PTR_ERR_OR_ZERO((void *) map.addr);
+ if (ret)
+ return ret;
+
+ ret = darray_push(&mm->ringbuffers, rb) ?:
+ darray_push(&rb->mms, map);
+ BUG_ON(ret); /* we preallocated */
+
+ *addr = map.addr;
+ return 0;
+}
+
+static int ringbuffer_get_addr_or_map(struct ringbuffer *rb, ulong *addr)
+{
+ lockdep_assert_held(&ringbuffer_lock);
+
+ struct mm_struct *mm = current->mm;
+
+ darray_for_each(rb->mms, map)
+ if (map->mm == mm) {
+ *addr = map->addr;
+ return 0;
+ }
+
+ return ringbuffer_map(rb, addr);
+}
+
+struct ringbuffer *ringbuffer_alloc(u32 size)
+{
+ unsigned order = get_order(size);
+ size = PAGE_SIZE << order;
+
+ struct ringbuffer *rb = kzalloc(sizeof(*rb), GFP_KERNEL);
+ if (!rb)
+ return ERR_PTR(-ENOMEM);
+
+ rb->size = size;
+ rb->mask = size - 1;
+ rb->order = order;
+ init_waitqueue_head(&rb->wait[READ]);
+ init_waitqueue_head(&rb->wait[WRITE]);
+
+ rb->ptrs = (void *) __get_free_page(GFP_KERNEL|__GFP_ZERO);
+ rb->data = (void *) __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_COMP, order);
+ if (!rb->ptrs || !rb->data) {
+ ringbuffer_free(rb);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* todo - implement a fallback when high order allocation fails */
+
+ rb->ptrs->size = size;
+ rb->ptrs->mask = size - 1;
+ rb->ptrs->data_offset = PAGE_SIZE;
+ return rb;
+}
+
+/*
+ * XXX: we require synchronization when killing a ringbuffer (because no longer
+ * mapped anywhere) to a file that is still open (and in use)
+ */
+static void ringbuffer_mm_drop(struct mm_struct *mm, struct ringbuffer *rb)
+{
+ darray_for_each_reverse(rb->mms, map)
+ if (mm == map->mm) {
+ pr_debug("removing %px from %px", rb, mm);
+ darray_remove_item(&rb->mms, map);
+ }
+}
+
+void ringbuffer_mm_exit(struct mm_struct *mm)
+{
+ mutex_lock(&ringbuffer_lock);
+ darray_for_each_reverse(mm->ringbuffers, rb)
+ ringbuffer_mm_drop(mm, *rb);
+ mutex_unlock(&ringbuffer_lock);
+
+ darray_exit(&mm->ringbuffers);
+}
+
+SYSCALL_DEFINE4(ringbuffer, unsigned, fd, int, rw, u32, size, ulong __user *, ringbufferp)
+{
+ ulong rb_addr;
+
+ int ret = get_user(rb_addr, ringbufferp);
+ if (unlikely(ret))
+ return ret;
+
+ if (unlikely(rb_addr || !size || rw > WRITE))
+ return -EINVAL;
+
+ struct fd f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ struct ringbuffer *rb = f.file->f_op->ringbuffer(f.file, rw);
+ if (!rb) {
+ ret = -EOPNOTSUPP;
+ goto err;
+ }
+
+ mutex_lock(&ringbuffer_lock);
+ ret = ringbuffer_get_addr_or_map(rb, &rb_addr);
+ if (ret)
+ goto err_unlock;
+
+ ret = put_user(rb_addr, ringbufferp);
+err_unlock:
+ mutex_unlock(&ringbuffer_lock);
+err:
+ fdput(f);
+ return ret;
+}
+
+ssize_t ringbuffer_read_iter(struct ringbuffer *rb, struct iov_iter *iter, bool nonblocking)
+{
+ u32 tail = rb->ptrs->tail, orig_tail = tail;
+ u32 head = smp_load_acquire(&rb->ptrs->head);
+
+ if (unlikely(head == tail)) {
+ if (nonblocking)
+ return -EAGAIN;
+ int ret = wait_event_interruptible(rb->wait[READ],
+ (head = smp_load_acquire(&rb->ptrs->head)) != rb->ptrs->tail);
+ if (ret)
+ return ret;
+ }
+
+ while (iov_iter_count(iter)) {
+ u32 tail_masked = tail & rb->mask;
+ u32 len = min(iov_iter_count(iter),
+ min(head - tail,
+ rb->size - tail_masked));
+ if (!len)
+ break;
+
+ len = copy_to_iter(rb->data + tail_masked, len, iter);
+
+ tail += len;
+ }
+
+ smp_store_release(&rb->ptrs->tail, tail);
+
+ smp_mb();
+
+ if (rb->ptrs->head - orig_tail >= rb->size)
+ wake_up(&rb->wait[WRITE]);
+
+ return tail - orig_tail;
+}
+EXPORT_SYMBOL_GPL(ringbuffer_read_iter);
+
+ssize_t ringbuffer_write_iter(struct ringbuffer *rb, struct iov_iter *iter, bool nonblocking)
+{
+ u32 head = rb->ptrs->head, orig_head = head;
+ u32 tail = smp_load_acquire(&rb->ptrs->tail);
+
+ if (unlikely(head - tail >= rb->size)) {
+ if (nonblocking)
+ return -EAGAIN;
+ int ret = wait_event_interruptible(rb->wait[WRITE],
+ head - (tail = smp_load_acquire(&rb->ptrs->tail)) < rb->size);
+ if (ret)
+ return ret;
+ }
+
+ while (iov_iter_count(iter)) {
+ u32 head_masked = head & rb->mask;
+ u32 len = min(iov_iter_count(iter),
+ min(tail + rb->size - head,
+ rb->size - head_masked));
+ if (!len)
+ break;
+
+ len = copy_from_iter(rb->data + head_masked, len, iter);
+
+ head += len;
+ }
+
+ smp_store_release(&rb->ptrs->head, head);
+
+ smp_mb();
+
+ if ((s32) (rb->ptrs->tail - orig_head) >= 0)
+ wake_up(&rb->wait[READ]);
+
+ return head - orig_head;
+}
+EXPORT_SYMBOL_GPL(ringbuffer_write_iter);
+
+SYSCALL_DEFINE2(ringbuffer_wait, unsigned, fd, int, rw)
+{
+ int ret = 0;
+
+ if (rw > WRITE)
+ return -EINVAL;
+
+ struct fd f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ struct ringbuffer *rb = f.file->f_op->ringbuffer(f.file, rw);
+ if (!rb) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ struct ringbuffer_ptrs *rp = rb->ptrs;
+ wait_event(rb->wait[rw], rw == READ
+ ? rp->head != rp->tail
+ : rp->head - rp->tail < rb->size);
+err:
+ fdput(f);
+ return ret;
+}
+
+SYSCALL_DEFINE2(ringbuffer_wakeup, unsigned, fd, int, rw)
+{
+ int ret = 0;
+
+ if (rw > WRITE)
+ return -EINVAL;
+
+ struct fd f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ struct ringbuffer *rb = f.file->f_op->ringbuffer(f.file, rw);
+ if (!rb) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ wake_up(&rb->wait[!rw]);
+err:
+ fdput(f);
+ return ret;
+}
+
+static int ringbuffer_init_fs_context(struct fs_context *fc)
+{
+ if (!init_pseudo(fc, RINGBUFFER_FS_MAGIC))
+ return -ENOMEM;
+ fc->s_iflags |= SB_I_NOEXEC;
+ return 0;
+}
+
+static int __init ringbuffer_init(void)
+{
+ static struct file_system_type ringbuffer_fs = {
+ .name = "ringbuffer",
+ .init_fs_context = ringbuffer_init_fs_context,
+ .kill_sb = kill_anon_super,
+ };
+ ringbuffer_mnt = kern_mount(&ringbuffer_fs);
+ if (IS_ERR(ringbuffer_mnt))
+ panic("Failed to create ringbuffer fs mount.");
+ return 0;
+}
+__initcall(ringbuffer_init);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..3026f8f92d6f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1996,6 +1996,7 @@ struct offset_ctx;

typedef unsigned int __bitwise fop_flags_t;

+struct ringbuffer;
struct file_operations {
struct module *owner;
fop_flags_t fop_flags;
@@ -2004,6 +2005,7 @@ struct file_operations {
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+ struct ringbuffer *(*ringbuffer)(struct file *, int);
int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
unsigned int flags);
int (*iterate_shared) (struct file *, struct dir_context *);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 24323c7d0bd4..6e412718ce7e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@
#include <linux/mm_types_task.h>

#include <linux/auxvec.h>
+#include <linux/darray_types.h>
#include <linux/kref.h>
#include <linux/list.h>
#include <linux/spinlock.h>
@@ -911,6 +912,9 @@ struct mm_struct {
spinlock_t ioctx_lock;
struct kioctx_table __rcu *ioctx_table;
#endif
+#ifdef CONFIG_RINGBUFFER
+ DARRAY(struct ringbuffer *) ringbuffers;
+#endif
#ifdef CONFIG_MEMCG
/*
* "owner" points to a task that is regarded as the canonical
diff --git a/include/linux/ringbuffer_sys.h b/include/linux/ringbuffer_sys.h
new file mode 100644
index 000000000000..843509f72514
--- /dev/null
+++ b/include/linux/ringbuffer_sys.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RINGBUFFER_SYS_H
+#define _LINUX_RINGBUFFER_SYS_H
+
+#include <linux/darray_types.h>
+#include <linux/spinlock_types.h>
+#include <uapi/linux/ringbuffer_sys.h>
+
+struct mm_struct;
+void ringbuffer_mm_exit(struct mm_struct *mm);
+
+void ringbuffer_free(struct ringbuffer *rb);
+struct ringbuffer *ringbuffer_alloc(u32 size);
+
+ssize_t ringbuffer_read_iter(struct ringbuffer *rb, struct iov_iter *iter, bool nonblock);
+ssize_t ringbuffer_write_iter(struct ringbuffer *rb, struct iov_iter *iter, bool nonblock);
+
+#endif /* _LINUX_RINGBUFFER_SYS_H */
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index d2ee625ea189..09d94a5cb849 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -22,6 +22,7 @@
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
#define FUTEX_LOCK_PI2 13
+#define FUTEX_WAIT_GE 14

#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
diff --git a/include/uapi/linux/ringbuffer_sys.h b/include/uapi/linux/ringbuffer_sys.h
new file mode 100644
index 000000000000..a7afe8647cc1
--- /dev/null
+++ b/include/uapi/linux/ringbuffer_sys.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_RINGBUFFER_SYS_H
+#define _UAPI_LINUX_RINGBUFFER_SYS_H
+
+#include <uapi/linux/types.h>
+
+/*
+ * ringbuffer_ptrs - head and tail pointers for a ringbuffer, mapped to
+ * userspace:
+ */
+struct ringbuffer_ptrs {
+ /*
+ * We use u32s because this type is shared between the kernel and
+ * userspace - ulong/size_t won't work here, we might be 32bit userland
+ * and 64 bit kernel, and u64 would be preferable (reduced probability
+ * of ABA) but not all architectures can atomically read/write to a u64;
+ * we need to avoid torn reads/writes.
+ *
+ * head and tail pointers are incremented and stored without masking;
+ * this is to avoid ABA and differentiate between a full and empty
+ * buffer - they must be masked with @mask to get an actual offset into
+ * the data buffer.
+ *
+ * All units are in bytes.
+ *
+ * Data is emitted at head, consumed from tail.
+ */
+ __u32 head;
+ __u32 tail;
+ __u32 size; /* always a power of two */
+ __u32 mask; /* size - 1 */
+
+ /*
+ * Starting offset of data buffer, from the start of this struct - will
+ * always be PAGE_SIZE.
+ */
+ __u32 data_offset;
+};
+
+#endif /* _UAPI_LINUX_RINGBUFFER_SYS_H */
diff --git a/init/Kconfig b/init/Kconfig
index 72404c1f2157..c43d536d4898 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1673,6 +1673,15 @@ config IO_URING
applications to submit and complete IO through submission and
completion rings that are shared between the kernel and application.

+config RINGBUFFER
+ bool "Enable ringbuffer() syscall" if EXPERT
+ select XARRAY_MULTI
+ default y
+ help
+ This option adds support for generic ringbuffers, which can be
+ attached to any (supported) file descriptor, allowing for reading and
+ writing without syscall overhead.
+
config ADVISE_SYSCALLS
bool "Enable madvise/fadvise syscalls" if EXPERT
default y
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..9190a06a6365 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -103,6 +103,7 @@
#include <linux/rseq.h>
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
+#include <linux/ringbuffer_sys.h>

#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -1340,6 +1341,7 @@ static inline void __mmput(struct mm_struct *mm)
VM_BUG_ON(atomic_read(&mm->mm_users));

uprobe_clear_state(mm);
+ ringbuffer_mm_exit(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
--
2.45.1
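
The unmasked head/tail convention documented in the ringbuffer_ptrs comment in the patch above can be checked with a small userspace model. This is only a sketch: struct demo_ptrs and the demo_* helpers are hypothetical names that mirror the fields of the shared struct, and none of this is code from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in with the same index fields as struct ringbuffer_ptrs. */
struct demo_ptrs {
	uint32_t head;	/* producer position, stored without masking */
	uint32_t tail;	/* consumer position, stored without masking */
	uint32_t size;	/* always a power of two */
	uint32_t mask;	/* size - 1 */
};

void demo_init(struct demo_ptrs *p, uint32_t size)
{
	/* The masking trick only works for power-of-two sizes. */
	assert(size && (size & (size - 1)) == 0);
	p->head = 0;
	p->tail = 0;
	p->size = size;
	p->mask = size - 1;
}

/* Bytes available to read; unsigned subtraction survives u32 wraparound. */
uint32_t demo_used(const struct demo_ptrs *p)
{
	return p->head - p->tail;
}

/* Bytes free for writing. */
uint32_t demo_free(const struct demo_ptrs *p)
{
	return p->size - demo_used(p);
}

/* Largest contiguous run readable at tail before the data area ends. */
uint32_t demo_read_chunk(const struct demo_ptrs *p)
{
	uint32_t to_end = p->size - (p->tail & p->mask);
	uint32_t used = demo_used(p);

	return used < to_end ? used : to_end;
}

/* Largest contiguous run writable at head before the data area ends. */
uint32_t demo_write_chunk(const struct demo_ptrs *p)
{
	uint32_t to_end = p->size - (p->head & p->mask);
	uint32_t space = demo_free(p);

	return space < to_end ? space : to_end;
}
```

With size 4096, head == tail means empty and head - tail == 4096 means full, even though both cases give the same masked offsets. The subtraction also stays correct when head has wrapped past 2^32 and tail has not.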


2024-06-03 00:34:44

by Kent Overstreet

Subject: [PATCH 4/5] ringbuffer: Test device

This adds /dev/ringbuffer-test, which supports reading and writing a
sequence of integers, to test performance and correctness.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
fs/Makefile | 1 +
fs/ringbuffer_test.c | 209 +++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 5 ++
3 files changed, 215 insertions(+)
create mode 100644 fs/ringbuffer_test.c

diff --git a/fs/Makefile b/fs/Makefile
index 48e54ac01fb1..91061f281f0a 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_RINGBUFFER) += ringbuffer.o
+obj-$(CONFIG_RINGBUFFER_TEST) += ringbuffer_test.o
obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FS_ENCRYPTION) += crypto/
obj-$(CONFIG_FS_VERITY) += verity/
diff --git a/fs/ringbuffer_test.c b/fs/ringbuffer_test.c
new file mode 100644
index 000000000000..01aa9c55120d
--- /dev/null
+++ b/fs/ringbuffer_test.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "%s() " fmt "\n", __func__
+
+#include <linux/device.h>
+#include <linux/errname.h>
+#include <linux/fs.h>
+#include <linux/kthread.h>
+#include <linux/ringbuffer_sys.h>
+#include <linux/uio.h>
+
+struct ringbuffer_test_file {
+ struct ringbuffer_test_rw {
+ struct mutex lock;
+ struct ringbuffer *rb;
+ struct task_struct *thr;
+ } rw[2];
+};
+
+#define BUF_NR 4
+
+static int ringbuffer_test_writer(void *p)
+{
+ struct file *file = p;
+ struct ringbuffer_test_file *f = file->private_data;
+ struct ringbuffer *rb = f->rw[READ].rb;
+ u32 idx = 0;
+ u32 buf[BUF_NR];
+
+ while (!kthread_should_stop()) {
+ cond_resched();
+
+ struct kvec vec = { buf, sizeof(buf) };
+ struct iov_iter iter;
+ iov_iter_kvec(&iter, ITER_SOURCE, &vec, 1, sizeof(buf));
+
+ for (unsigned i = 0; i < ARRAY_SIZE(buf); i++)
+ buf[i] = idx + i;
+
+ ssize_t ret = ringbuffer_write_iter(rb, &iter, false);
+ if (ret < 0)
+ continue;
+ idx += ret / sizeof(buf[0]);
+ }
+
+ return 0;
+}
+
+static int ringbuffer_test_reader(void *p)
+{
+ struct file *file = p;
+ struct ringbuffer_test_file *f = file->private_data;
+ struct ringbuffer *rb = f->rw[WRITE].rb;
+ u32 idx = 0;
+ u32 buf[BUF_NR];
+
+ while (!kthread_should_stop()) {
+ cond_resched();
+
+ struct kvec vec = { buf, sizeof(buf) };
+ struct iov_iter iter;
+ iov_iter_kvec(&iter, ITER_DEST, &vec, 1, sizeof(buf));
+
+ ssize_t ret = ringbuffer_read_iter(rb, &iter, false);
+ if (ret < 0)
+ continue;
+
+ unsigned nr = ret / sizeof(buf[0]);
+ for (unsigned i = 0; i < nr; i++)
+ if (buf[i] != idx + i)
+ pr_err("read wrong data");
+ idx += ret / sizeof(buf[0]);
+ }
+
+ return 0;
+}
+
+static void ringbuffer_test_free(struct ringbuffer_test_file *f)
+{
+ for (unsigned i = 0; i < ARRAY_SIZE(f->rw); i++)
+ if (!IS_ERR_OR_NULL(f->rw[i].thr))
+ kthread_stop_put(f->rw[i].thr);
+ for (unsigned i = 0; i < ARRAY_SIZE(f->rw); i++)
+ if (!IS_ERR_OR_NULL(f->rw[i].rb))
+ ringbuffer_free(f->rw[i].rb);
+ kfree(f);
+}
+
+static int ringbuffer_test_open(struct inode *inode, struct file *file)
+{
+ static const char * const rw_str[] = { "reader", "writer" };
+ int ret = 0;
+
+ struct ringbuffer_test_file *f = kzalloc(sizeof(*f), GFP_KERNEL);
+ if (!f)
+ return -ENOMEM;
+
+ for (struct ringbuffer_test_rw *i = f->rw;
+ i < f->rw + ARRAY_SIZE(f->rw);
+ i++) {
+ unsigned idx = i - f->rw;
+
+ mutex_init(&i->lock);
+
+ i->rb = ringbuffer_alloc(PAGE_SIZE * 4);
+ ret = PTR_ERR_OR_ZERO(i->rb);
+ if (ret)
+ goto err;
+
+ i->thr = kthread_create(idx == READ
+ ? ringbuffer_test_reader
+ : ringbuffer_test_writer,
+ file, "ringbuffer_%s", rw_str[idx]);
+ ret = PTR_ERR_OR_ZERO(i->thr);
+ if (ret)
+ goto err;
+ get_task_struct(i->thr);
+ }
+
+ file->private_data = f;
+ wake_up_process(f->rw[0].thr);
+ wake_up_process(f->rw[1].thr);
+ return 0;
+err:
+ ringbuffer_test_free(f);
+ return ret;
+}
+
+static int ringbuffer_test_release(struct inode *inode, struct file *file)
+{
+ ringbuffer_test_free(file->private_data);
+ return 0;
+}
+
+static ssize_t ringbuffer_test_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct file *file = iocb->ki_filp;
+ struct ringbuffer_test_file *f = file->private_data;
+ struct ringbuffer_test_rw *i = &f->rw[READ];
+
+ ssize_t ret = mutex_lock_interruptible(&i->lock);
+ if (ret)
+ return ret;
+
+ ret = ringbuffer_read_iter(i->rb, iter, file->f_flags & O_NONBLOCK);
+ mutex_unlock(&i->lock);
+ return ret;
+}
+
+static ssize_t ringbuffer_test_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct file *file = iocb->ki_filp;
+ struct ringbuffer_test_file *f = file->private_data;
+ struct ringbuffer_test_rw *i = &f->rw[WRITE];
+
+ ssize_t ret = mutex_lock_interruptible(&i->lock);
+ if (ret)
+ return ret;
+
+ ret = ringbuffer_write_iter(i->rb, iter, file->f_flags & O_NONBLOCK);
+ mutex_unlock(&i->lock);
+ return ret;
+}
+
+static struct ringbuffer *ringbuffer_test_ringbuffer(struct file *file, int rw)
+{
+ struct ringbuffer_test_file *i = file->private_data;
+
+ BUG_ON(rw > WRITE);
+
+ return i->rw[rw].rb;
+}
+
+static const struct file_operations ringbuffer_fops = {
+ .owner = THIS_MODULE,
+ .read_iter = ringbuffer_test_read_iter,
+ .write_iter = ringbuffer_test_write_iter,
+ .ringbuffer = ringbuffer_test_ringbuffer,
+ .open = ringbuffer_test_open,
+ .release = ringbuffer_test_release,
+};
+
+static int __init ringbuffer_test_init(void)
+{
+ int ringbuffer_major = register_chrdev(0, "ringbuffer-test", &ringbuffer_fops);
+ if (ringbuffer_major < 0)
+ return ringbuffer_major;
+
+ static const struct class ringbuffer_class = { .name = "ringbuffer_test" };
+ int ret = class_register(&ringbuffer_class);
+ if (ret)
+ goto major_out;
+
+ struct device *ringbuffer_device = device_create(&ringbuffer_class, NULL,
+ MKDEV(ringbuffer_major, 0),
+ NULL, "ringbuffer-test");
+ ret = PTR_ERR_OR_ZERO(ringbuffer_device);
+ if (ret)
+ goto class_out;
+
+ return 0;
+
+class_out:
+ class_unregister(&ringbuffer_class);
+major_out:
+ unregister_chrdev(ringbuffer_major, "ringbuffer-test");
+ return ret;
+}
+__initcall(ringbuffer_test_init);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 59b6765d86b8..bb16762af575 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2957,6 +2957,11 @@ config TEST_OBJPOOL

If unsure, say N.

+config RINGBUFFER_TEST
+ bool "Test driver for sys_ringbuffer"
+ default n
+ depends on RINGBUFFER
+
endif # RUNTIME_TESTING_MENU

config ARCH_USE_MEMTEST
--
2.45.1


2024-06-03 00:35:56

by Kent Overstreet

Subject: [PATCH 5/5] ringbuffer: Userspace test helper

This adds a helper for testing the new ringbuffer syscall using
/dev/ringbuffer-test; it can do performance testing of both normal reads
and writes, and reads and writes via the ringbuffer interface.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
tools/ringbuffer/Makefile | 3 +
tools/ringbuffer/ringbuffer-test.c | 254 +++++++++++++++++++++++++++++
2 files changed, 257 insertions(+)
create mode 100644 tools/ringbuffer/Makefile
create mode 100644 tools/ringbuffer/ringbuffer-test.c

diff --git a/tools/ringbuffer/Makefile b/tools/ringbuffer/Makefile
new file mode 100644
index 000000000000..2fb27a19b43e
--- /dev/null
+++ b/tools/ringbuffer/Makefile
@@ -0,0 +1,3 @@
+CFLAGS=-g -O2 -Wall -Werror -I../../include
+
+all: ringbuffer-test
diff --git a/tools/ringbuffer/ringbuffer-test.c b/tools/ringbuffer/ringbuffer-test.c
new file mode 100644
index 000000000000..0fba99e40858
--- /dev/null
+++ b/tools/ringbuffer/ringbuffer-test.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#define READ 0
+#define WRITE 1
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
+#define __EXPORTED_HEADERS__
+#include <uapi/linux/ringbuffer_sys.h>
+
+#define BUF_NR 4
+
+typedef uint32_t u32;
+typedef unsigned long ulong;
+
+static inline struct ringbuffer_ptrs *ringbuffer(int fd, int rw, u32 size)
+{
+ ulong addr = 0;
+ int ret = syscall(463, fd, rw, size, &addr);
+ if (ret < 0)
+ errno = -ret;
+ return (void *) addr;
+}
+
+static inline int ringbuffer_wait(int fd, int rw)
+{
+ return syscall(464, fd, rw);
+}
+
+static inline int ringbuffer_wakeup(int fd, int rw)
+{
+ return syscall(465, fd, rw);
+}
+
+static ssize_t ringbuffer_read(int fd, struct ringbuffer_ptrs *rb,
+ void *buf, size_t len)
+{
+ void *rb_data = (void *) rb + rb->data_offset;
+
+ u32 head, orig_tail = rb->tail, tail = orig_tail;
+
+ while ((head = __atomic_load_n(&rb->head, __ATOMIC_ACQUIRE)) == tail)
+ ringbuffer_wait(fd, READ);
+
+ while (len && head != tail) {
+ u32 tail_masked = tail & rb->mask;
+ unsigned b = min(len,
+ min(head - tail,
+ rb->size - tail_masked));
+
+ memcpy(buf, rb_data + tail_masked, b);
+ buf += b;
+ len -= b;
+ tail += b;
+ }
+
+ __atomic_store_n(&rb->tail, tail, __ATOMIC_RELEASE);
+
+ __atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+ if (rb->head - orig_tail >= rb->size)
+ ringbuffer_wakeup(fd, READ);
+
+ return tail - orig_tail;
+}
+
+static ssize_t ringbuffer_write(int fd, struct ringbuffer_ptrs *rb,
+ void *buf, size_t len)
+{
+ void *rb_data = (void *) rb + rb->data_offset;
+
+ u32 orig_head = rb->head, head = orig_head, tail;
+
+ while (head - (tail = __atomic_load_n(&rb->tail, __ATOMIC_ACQUIRE)) >= rb->size)
+ ringbuffer_wait(fd, WRITE);
+
+ while (len && head - tail < rb->size) {
+ u32 head_masked = head & rb->mask;
+ unsigned b = min(len,
+ min(tail - head + rb->size,
+ rb->size - head_masked));
+
+ memcpy(rb_data + head_masked, buf, b);
+ buf += b;
+ len -= b;
+ head += b;
+ }
+
+ __atomic_store_n(&rb->head, head, __ATOMIC_RELEASE);
+
+ __atomic_thread_fence(__ATOMIC_SEQ_CST);
+
+ if ((s32) (rb->tail - orig_head) >= 0)
+ ringbuffer_wakeup(fd, WRITE);
+
+ return head - orig_head;
+}
+
+static void usage(void)
+{
+ puts("ringbuffer-test - test ringbuffer syscall\n"
+ "Usage: ringbuffer-test [OPTION]...\n"
+ "\n"
+ "Options:\n"
+ " --type=(io|ringbuffer)\n"
+ " --rw=(read|write)\n"
+ " -h, --help Display this help and exit\n");
+}
+
+static inline ssize_t rb_test_read(int fd, struct ringbuffer_ptrs *rb,
+ void *buf, size_t len)
+{
+ return rb
+ ? ringbuffer_read(fd, rb, buf, len)
+ : read(fd, buf, len);
+}
+
+static inline ssize_t rb_test_write(int fd, struct ringbuffer_ptrs *rb,
+ void *buf, size_t len)
+{
+ return rb
+ ? ringbuffer_write(fd, rb, buf, len)
+ : write(fd, buf, len);
+}
+
+int main(int argc, char *argv[])
+{
+ const struct option longopts[] = {
+ { "type", required_argument, NULL, 't' },
+ { "rw", required_argument, NULL, 'r' },
+ { "help", no_argument, NULL, 'h' },
+ { NULL }
+ };
+ int use_ringbuffer = false, rw = false;
+ int opt;
+
+ while ((opt = getopt_long(argc, argv, "h", longopts, NULL)) != -1) {
+ switch (opt) {
+ case 't':
+ if (!strcmp(optarg, "io"))
+ use_ringbuffer = false;
+ else if (!strcmp(optarg, "ringbuffer") ||
+ !strcmp(optarg, "rb"))
+ use_ringbuffer = true;
+ else {
+ fprintf(stderr, "Invalid type %s\n", optarg);
+ exit(EXIT_FAILURE);
+ }
+ break;
+ case 'r':
+ if (!strcmp(optarg, "read"))
+ rw = false;
+ else if (!strcmp(optarg, "write"))
+ rw = true;
+ else {
+ fprintf(stderr, "Invalid rw %s\n", optarg);
+ exit(EXIT_FAILURE);
+ }
+ break;
+ case '?':
+ fprintf(stderr, "Invalid option %c\n", opt);
+ usage();
+ exit(EXIT_FAILURE);
+ case 'h':
+ usage();
+ exit(EXIT_SUCCESS);
+ }
+ }
+
+ int fd = open("/dev/ringbuffer-test", O_RDWR);
+ if (fd < 0) {
+ fprintf(stderr, "Error opening /dev/ringbuffer-test: %m\n");
+ exit(EXIT_FAILURE);
+ }
+
+ struct ringbuffer_ptrs *rb = NULL;
+ if (use_ringbuffer) {
+ rb = ringbuffer(fd, rw, 4096);
+ if (!rb) {
+ fprintf(stderr, "Error from sys_ringbuffer: %m\n");
+ exit(EXIT_FAILURE);
+ }
+
+ fprintf(stderr, "got ringbuffer %p\n", rb);
+ }
+
+ printf("Starting test with ringbuffer=%u, rw=%u\n", use_ringbuffer, rw);
+ static const char * const rw_str[] = { "read", "wrote" };
+
+ struct timeval start;
+ gettimeofday(&start, NULL);
+ size_t nr_prints = 1;
+
+ u32 buf[BUF_NR];
+ u32 idx = 0;
+
+ while (true) {
+ struct timeval now;
+ gettimeofday(&now, NULL);
+
+ struct timeval next_print = start;
+ next_print.tv_sec += nr_prints;
+
+ if (timercmp(&now, &next_print, >)) {
+ printf("%s %u u32s, %lu mb/sec\n", rw_str[rw], idx,
+ (idx * sizeof(u32) / (now.tv_sec - start.tv_sec)) / (1UL << 20));
+ nr_prints++;
+ if (nr_prints > 20)
+ break;
+ }
+
+ if (rw == READ) {
+ int r = rb_test_read(fd, rb, buf, sizeof(buf));
+ if (r <= 0) {
+ fprintf(stderr, "Read returned %i (%m)\n", r);
+ exit(EXIT_FAILURE);
+ }
+
+ unsigned nr = r / sizeof(u32);
+ for (unsigned i = 0; i < nr; i++) {
+ if (buf[i] != idx + i) {
+ fprintf(stderr, "Read returned wrong data at idx %u: got %u instead\n",
+ idx + i, buf[i]);
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ idx += nr;
+ } else {
+ for (unsigned i = 0; i < BUF_NR; i++)
+ buf[i] = idx + i;
+
+ int r = rb_test_write(fd, rb, buf, sizeof(buf));
+ if (r <= 0) {
+ fprintf(stderr, "Write returned %i (%m)\n", r);
+ exit(EXIT_FAILURE);
+ }
+
+ unsigned nr = r / sizeof(u32);
+ idx += nr;
+ }
+ }
+
+ exit(EXIT_SUCCESS);
+}
--
2.45.1
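
The copy loops in ringbuffer_read() and ringbuffer_write() above split each transfer at the end of the data area. A rough single-process model of that logic follows, assuming a 16-byte buffer and returning 0 where the helper would call ringbuffer_wait(). struct demo_ring, DEMO_SIZE, demo_write() and demo_read() are illustrative names, and no syscall is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SIZE 16u	/* power of two, tiny so wraparound is easy to see */

struct demo_ring {
	uint32_t head;			/* unmasked producer position */
	uint32_t tail;			/* unmasked consumer position */
	unsigned char data[DEMO_SIZE];
};

static uint32_t demo_min(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Copy up to len bytes into the ring; returns bytes written, 0 if full. */
size_t demo_write(struct demo_ring *rb, const void *buf, size_t len)
{
	const unsigned char *src = buf;
	uint32_t orig_head = rb->head, head = orig_head;
	uint32_t tail = __atomic_load_n(&rb->tail, __ATOMIC_ACQUIRE);

	while (len && head - tail < DEMO_SIZE) {
		uint32_t head_masked = head & (DEMO_SIZE - 1);
		/* Limited by free space and by the end of the data area. */
		uint32_t b = demo_min(tail + DEMO_SIZE - head,
				      DEMO_SIZE - head_masked);
		if (len < b)
			b = (uint32_t) len;

		memcpy(rb->data + head_masked, src, b);
		src += b;
		len -= b;
		head += b;
	}

	assert(head - tail <= DEMO_SIZE);	/* never overrun the reader */
	/* Publish the data before the new head becomes visible. */
	__atomic_store_n(&rb->head, head, __ATOMIC_RELEASE);
	return head - orig_head;
}

/* Copy up to len bytes out of the ring; returns bytes read, 0 if empty. */
size_t demo_read(struct demo_ring *rb, void *buf, size_t len)
{
	unsigned char *dst = buf;
	uint32_t orig_tail = rb->tail, tail = orig_tail;
	uint32_t head = __atomic_load_n(&rb->head, __ATOMIC_ACQUIRE);

	while (len && head != tail) {
		uint32_t tail_masked = tail & (DEMO_SIZE - 1);
		/* Limited by available data and by the end of the data area. */
		uint32_t b = demo_min(head - tail, DEMO_SIZE - tail_masked);
		if (len < b)
			b = (uint32_t) len;

		memcpy(dst, rb->data + tail_masked, b);
		dst += b;
		len -= b;
		tail += b;
	}

	/* Release the consumed space to the writer. */
	__atomic_store_n(&rb->tail, tail, __ATOMIC_RELEASE);
	return tail - orig_tail;
}
```

Writing 12 bytes, reading 8, and then writing 12 more makes the second write wrap: 4 bytes land at the end of the data area and 8 at the start, and a following 16-byte read returns them in order.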


2024-06-03 04:17:17

by kernel test robot

Subject: Re: [PATCH 3/5] fs: sys_ringbuffer

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on linus/master v6.10-rc2]
[cannot apply to akpm-mm/mm-nonmm-unstable tip/x86/asm akpm-mm/mm-everything next-20240531]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/darray-lift-from-bcachefs/20240603-083536
base: tip/locking/core
patch link: https://lore.kernel.org/r/20240603003306.2030491-4-kent.overstreet%40linux.dev
patch subject: [PATCH 3/5] fs: sys_ringbuffer
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20240603/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240603/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> <stdin>:1603:2: warning: syscall ringbuffer not implemented [-W#warnings]
1603 | #warning syscall ringbuffer not implemented
| ^
>> <stdin>:1606:2: warning: syscall ringbuffer_wait not implemented [-W#warnings]
1606 | #warning syscall ringbuffer_wait not implemented
| ^
>> <stdin>:1609:2: warning: syscall ringbuffer_wakeup not implemented [-W#warnings]
1609 | #warning syscall ringbuffer_wakeup not implemented
| ^
3 warnings generated.
--
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:2253:
include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
1 warning generated.
>> <stdin>:1603:2: warning: syscall ringbuffer not implemented [-W#warnings]
1603 | #warning syscall ringbuffer not implemented
| ^
>> <stdin>:1606:2: warning: syscall ringbuffer_wait not implemented [-W#warnings]
1606 | #warning syscall ringbuffer_wait not implemented
| ^
>> <stdin>:1609:2: warning: syscall ringbuffer_wakeup not implemented [-W#warnings]
1609 | #warning syscall ringbuffer_wakeup not implemented
| ^
3 warnings generated.

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-06-03 04:40:21

by kernel test robot

Subject: Re: [PATCH 3/5] fs: sys_ringbuffer

Hi Kent,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/locking/core]
[also build test ERROR on linus/master v6.10-rc2]
[cannot apply to akpm-mm/mm-nonmm-unstable tip/x86/asm akpm-mm/mm-everything next-20240531]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/darray-lift-from-bcachefs/20240603-083536
base: tip/locking/core
patch link: https://lore.kernel.org/r/20240603003306.2030491-4-kent.overstreet%40linux.dev
patch subject: [PATCH 3/5] fs: sys_ringbuffer
config: i386-buildonly-randconfig-002-20240603 (https://download.01.org/0day-ci/archive/20240603/[email protected]/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240603/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

In file included from <built-in>:1:
>> ./usr/include/linux/ringbuffer_sys.h:5:10: fatal error: 'uapi/linux/types.h' file not found
5 | #include <uapi/linux/types.h>
| ^~~~~~~~~~~~~~~~~~~~
1 error generated.

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-06-07 01:50:17

by Stefan Hajnoczi

Subject: Re: [PATCH 0/5] sys_ringbuffer

On Sun, Jun 02, 2024 at 08:32:57PM -0400, Kent Overstreet wrote:
> New syscall for mapping generic ringbuffers for arbitary (supported)
> file descriptors.
>
> Ringbuffers can be created either when requested or at file open time,
> and can be mapped into multiple address spaces (naturally, since files
> can be shared as well).
>
> Initial motivation is for fuse, but I plan on adding support to pipes
> and possibly sockets as well - pipes are a particularly interesting use
> case, because if both the sender and receiver of a pipe opt in to the
> new ringbuffer interface, we can make them the _same_ ringbuffer for
> true zero copy IO, while being backwards compatible with existing pipes.

Hi Kent,
I recently came across a similar use case where the ability to "upgrade"
an fd into a more efficient interface would be useful, as in the pipe
scenario you are describing.

My use case is when you have a block device using the ublk driver. ublk
lets userspace servers implement block devices. ublk is great when
compatibility is required with applications that expect block device
fds, but when an application is willing to implement a shared memory
interface to communicate directly with the ublk server then going
through a block device is inefficient.

In my case the application is QEMU, where the virtual machine runs a
virtio-blk driver that could talk directly to the ublk server via
vhost-user-blk. vhost-user-blk is a protocol that allows the virtual
machine to talk directly to the ublk server via shared memory without
going through QEMU or the host kernel block layer.

QEMU would need a way to upgrade from a ublk block device file to a
vhost-user socket. Just like in your pipe example, this approach relies
on being able to go from a "compatibility" fd to a more efficient
interface gracefully when both sides support this feature.

The generic ringbuffer approach in this series would not work for
the vhost-user protocol because the client must be able to provide its
own memory, and file descriptor passing is needed in general. The
protocol spec is here:
https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/vhost-user.rst

A different way to approach the fd upgrading problem is to treat this as
an AF_UNIX connectivity feature rather than a new ring buffer API.
Imagine adding a new address type to AF_UNIX for looking up connections
in a struct file (e.g. the pipe fd) instead of on the file system (or
the other AF_UNIX address types).

The first program creates the pipe and also an AF_UNIX socket. It calls
bind(2) on the socket with the sockaddr_un path
"/dev/self/fd/<fd>/<discriminator>" where fd is a pipe fd and
discriminator is a string like "ring-buffer" that describes the
service/protocol. The AF_UNIX kernel code parses this special path and
stores an association with the pipe file for future connect(2) calls.
The program listens on the AF_UNIX socket and then continues doing its
stuff.

The second program runs and inherits the pipe fd on stdin. It creates an
AF_UNIX socket and attempts to connect(2) to
"/dev/self/fd/0/ring-buffer". The AF_UNIX kernel code parses this
special path and establishes a connection between the connecting and
listening sockets inside the pipe fd's struct file. If connect(2) fails
then the second program knows that this is an ordinary pipe that does
not support upgrading to ring buffer operation.

Now the AF_UNIX socket can be used to pass shared memory for the ring
buffer and futexes. This AF_UNIX approach also works for my ublk block
device to vhost-user-blk upgrade use case. It does not require a new
ring buffer API but instead involves extending AF_UNIX.
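
The connecting side of this scheme could look roughly like the sketch below. It uses only existing socket calls. The special path and its kernel-side handling are the proposal above and are not implemented by any kernel, so on a current system the connect(2) fails and the caller falls back to ordinary pipe I/O. demo_try_upgrade() is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Try to connect to an upgrade address such as "/dev/self/fd/0/ring-buffer".
 * Returns a connected AF_UNIX socket on success, or -1 if the peer (or the
 * kernel) does not support upgrading, in which case the caller keeps using
 * read()/write() on the original fd.
 */
int demo_try_upgrade(const char *path)
{
	struct sockaddr_un addr;

	/* sun_path is a fixed-size array; refuse paths that do not fit. */
	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	int sock = socket(AF_UNIX, SOCK_STREAM, 0);
	if (sock < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
		/* Ordinary pipe, or no listener bound for this discriminator. */
		close(sock);
		return -1;
	}

	/* Shared memory and futex fds would then be passed over this socket. */
	return sock;
}
```

A caller would try demo_try_upgrade("/dev/self/fd/0/ring-buffer") once at startup and treat -1 as "stay on the compatibility fd".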

You have more use cases than just the pipe scenario, and maybe my
half-baked idea won't cover all of them, but I wanted to see what you think.

Stefan

> the ringbuffer_wait and ringbuffer_wakeup syscalls are probably going
> away in a future iteration, in favor of just using futexes.
>
> In my testing, reading/writing from the ringbuffer 16 bytes at a time is
> ~7x faster than using read/write syscalls - and I was testing with
> mitigations off, real world benefit will be even higher.
>
> Kent Overstreet (5):
> darray: lift from bcachefs
> darray: Fix darray_for_each_reverse() when darray is empty
> fs: sys_ringbuffer
> ringbuffer: Test device
> ringbuffer: Userspace test helper
>
> MAINTAINERS | 7 +
> arch/x86/entry/syscalls/syscall_32.tbl | 3 +
> arch/x86/entry/syscalls/syscall_64.tbl | 3 +
> fs/Makefile | 2 +
> fs/bcachefs/Makefile | 1 -
> fs/bcachefs/btree_types.h | 2 +-
> fs/bcachefs/btree_update.c | 2 +
> fs/bcachefs/btree_write_buffer_types.h | 2 +-
> fs/bcachefs/fsck.c | 2 +-
> fs/bcachefs/journal_io.h | 2 +-
> fs/bcachefs/journal_sb.c | 2 +-
> fs/bcachefs/sb-downgrade.c | 3 +-
> fs/bcachefs/sb-errors_types.h | 2 +-
> fs/bcachefs/sb-members.h | 3 +-
> fs/bcachefs/subvolume.h | 1 -
> fs/bcachefs/subvolume_types.h | 2 +-
> fs/bcachefs/thread_with_file_types.h | 2 +-
> fs/bcachefs/util.h | 28 +-
> fs/ringbuffer.c | 474 ++++++++++++++++++++++++
> fs/ringbuffer_test.c | 209 +++++++++++
> {fs/bcachefs => include/linux}/darray.h | 61 +--
> include/linux/darray_types.h | 22 ++
> include/linux/fs.h | 2 +
> include/linux/mm_types.h | 4 +
> include/linux/ringbuffer_sys.h | 18 +
> include/uapi/linux/futex.h | 1 +
> include/uapi/linux/ringbuffer_sys.h | 40 ++
> init/Kconfig | 9 +
> kernel/fork.c | 2 +
> lib/Kconfig.debug | 5 +
> lib/Makefile | 2 +-
> {fs/bcachefs => lib}/darray.c | 12 +-
> tools/ringbuffer/Makefile | 3 +
> tools/ringbuffer/ringbuffer-test.c | 254 +++++++++++++
> 34 files changed, 1125 insertions(+), 62 deletions(-)
> create mode 100644 fs/ringbuffer.c
> create mode 100644 fs/ringbuffer_test.c
> rename {fs/bcachefs => include/linux}/darray.h (63%)
> create mode 100644 include/linux/darray_types.h
> create mode 100644 include/linux/ringbuffer_sys.h
> create mode 100644 include/uapi/linux/ringbuffer_sys.h
> rename {fs/bcachefs => lib}/darray.c (56%)
> create mode 100644 tools/ringbuffer/Makefile
> create mode 100644 tools/ringbuffer/ringbuffer-test.c
>
> --
> 2.45.1
>

