2020-12-03 22:36:52

by Palmer Dabbelt

[permalink] [raw]
Subject: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

This patch set contains dm-user, a device mapper target that proxies incoming
BIOs to userspace via a misc device. Essentially it's FUSE, but for block
devices. There's more information in the documentation patch and as a handful
of commends, so I'm just going to avoid duplicating that here. I don't really
think there's any fundamental functionality that dm-user enables, as one could
use something along the lines of nbd/iscsi, but dm-user does result in
extremely simple userspace daemons -- so simple that when I tried to write a
helper userspace library for dm-user I just ended up with nothing.

I talked about this a bit at Plumbers and was hoping to send patches a bit
earlier on in the process, but got tied up with a few things. As a result this
is actually quite far along: it's at the point where we're starting to run this
on real devices as part of an updated Android OTA update flow, where we're
using this to provide an Android-specific compressed backing store for
dm-snap-persistent. The bulk of that project is scattered throughout the
various Android trees, so there are kselftests and a (somewhat bare for now)
Documentation entry with the intent of making this a self-contained
contribution. There's a lot to the Android userspace daemon, but it doesn't
interact with dm-user in a very complex manner.

This is still in a somewhat early stage, but it's at the point where things
largely function. I'm certainly not ready to commit to the user ABI
implemented here and there are a bunch of FIXMEs scattered throughout the code,
but I do think that it's far along enough to begin a more concrete discussion
of where folks would like to go with something like this. While I'd intending
on sorting that stuff out, I'd like to at least get a feel for whether this is
a path worth pursuing before spending a bunch more time on it.

I haven't done much in the way of performance analysis for dm-user. Earlier on
I did some simple throughput tests and found that dm-user/ext4 was faster than
half the speed of tmpfs, which is way out of the realm of being an issue for
our use case (decompressing blocks out of a phone's storage). The design of
dm-user does preclude an extremely high performance implementation, where I
assume one would want an explicit ring buffer and zero copy, but I feel like
users who want that degree of performance are probably better served writing a
proper kernel driver. I wouldn't be opposed to pushing on performance (ideally
without a major design change), but for now I feel like time is better spent
fortifying the user ABI and fixing the various issues with the implementation.

The patches follow as usual, but in case it's easier I've published a tree as
well:

git://git.kernel.org/pub/scm/linux/kernel/git/palmer/dm-user.git -b dm-user-v1



2020-12-03 22:37:46

by Palmer Dabbelt

[permalink] [raw]
Subject: [PATCH v1 3/5] dm: dm-user: New target that proxies BIOs to userspace

From: Palmer Dabbelt <[email protected]>

dm-user is a device mapper target that allows a userpsace process to
handle each incoming BIO -- essentially it's Fuse, but for the block
layer.

Signed-off-by: Palmer Dabbelt <[email protected]>

---

This has numerous issues with this, which I've enumerated via FIXMEs
scattered throughout the code. While it's obviously in no shape to be
merged, this does at least function at a basic level (the next patch has
some tests). Many of the FIXMEs are simply missing functionality, but I
wanted to send this out earlier rather than later as I have some higher
level questions:

* Does it even make sense to have this within device mapper? There's no
fundamental reason for this to be a device mapper target (ie, it could
just be its own block device), but being this does allow us to
piggyback on existing mechanisms to handle the device lifecycle.
* Is dm-user (in cooperation with the userspace daemon) responsible for
ordering flush-related BIOs with any other BIOs, or is that handled
elsewhere within the kernel?
* Is my shared target mutex legal?
* Is there any benefit to returing DM_MAPIO_KILLED as opposed to later
terminating the BIO with an IO error after it has been submitted?

Each of the above is discussed in more detail in the code.
---
drivers/md/Kconfig | 13 +
drivers/md/Makefile | 1 +
drivers/md/dm-user.c | 1227 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1241 insertions(+)
create mode 100644 drivers/md/dm-user.c

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 30ba3573626c..bcafca0e571d 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -617,4 +617,17 @@ config DM_ZONED

If unsure, say N.

+config DM_USER
+ tristate "Block device in userspace"
+ depends on BLK_DEV_DM
+ help
+ This device-mapper target allows a userspace daemon to provide the
+ contents of a block device. See
+ <file:Documentation/block/dm-user.rst> for more information.
+
+ To compile this code as a module, choose M here: the module will be
+ called dm-user.
+
+ If unsure, say N.
+
endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 6d3e234dc46a..82ae3d496a00 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
+obj-$(CONFIG_DM_USER) += dm-user.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_DUST) += dm-dust.o
diff --git a/drivers/md/dm-user.c b/drivers/md/dm-user.c
new file mode 100644
index 000000000000..0aaa8f39f18a
--- /dev/null
+++ b/drivers/md/dm-user.c
@@ -0,0 +1,1227 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2020 Palmer Dabbelt <[email protected]>
+ */
+
+#include <linux/device-mapper.h>
+#include <uapi/linux/dm-user.h>
+
+#include <linux/bio.h>
+#include <linux/init.h>
+#include <linux/mempool.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/uio.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "user"
+
+#define MAX_OUTSTANDING_MESSAGES 128
+
+/*
+ * dm-user uses four structures:
+ *
+ * - "struct target", the outermost structure, corresponds to a single device
+ * mapper target. This contains the set of outstanding BIOs that have been
+ * provided by DM and are not actively being processed by the user, along
+ * with a misc device that userspace can open to communicate with the
+ * kernel. Each time userspaces opens the misc device a new channel is
+ * created.
+ * - "struct channel", which represents a single active communication channel
+ * with userspace. Userspace may choose arbitrary read/write sizes to use
+ * when processing messages, channels form these into logical accesses.
+ * When userspace responds to a full message the channel completes the BIO
+ * and obtains a new message to process from the target.
+ * - "struct message", which wraps a BIO with the additional information
+ * required by the kernel to sort out what to do with BIOs when they return
+ * from userspace.
+ * - "struct dm_user_message", which is the exact message format that
+ * userspace sees.
+ *
+ * The hot path contains three distinct operations:
+ *
+ * - user_map(), which is provided a BIO from device mapper that is queued
+ * into the target. This allocates and enqueues a new message.
+ * - dev_read(), which dequeues a message, copies it to userspace.
+ * - dev_write(), which looks up a message (keyed by sequence number) and
+ * completes the corresponding BIO.
+ *
+ * Lock ordering (outer to inner)
+ *
+ * 1) miscdevice's global lock. This is held around dev_open, so it has to be
+ * the outermost lock.
+ * 2) target->lock
+ * 3) channel->lock
+ */
+
+struct message {
+ /*
+ * Messages themselves do not need a lock, they're protected by either
+ * the target or channel's lock, depending on which can reference them
+ * directly.
+ */
+ struct dm_user_message msg;
+ struct bio *bio;
+ size_t posn_to_user;
+ size_t total_to_user;
+ size_t posn_from_user;
+ size_t total_from_user;
+
+ struct list_head from_user;
+ struct list_head to_user;
+
+ /*
+ * These are written back from the user. They live in the same spot in
+ * the message, but we need to either keep the old values around or
+ * call a bunch more BIO helpers. These are only valid after write has
+ * adopted the message.
+ */
+ u64 return_type;
+ u64 return_flags;
+};
+
+struct target {
+ /*
+ * A target has a single lock, which protects everything in the target
+ * (but does not protect the channels associated with a target).
+ */
+ struct mutex lock;
+
+ /*
+ * There is only one point at which anything blocks: userspace blocks
+ * reading a new message, which is woken up by device mapper providing
+ * a new BIO to process (or tearing down the target). The
+ * corresponding write side doesn't block, instead we treat userspace's
+ * response containing a message that has yet to be mapped as an
+ * invalid operation.
+ */
+ struct wait_queue_head wq;
+
+ /*
+ * Messages are delivered to userspace in order, but may be returned
+ * out of order. This allows userspace to schedule IO if it wants to.
+ */
+ mempool_t message_pool;
+ u64 next_seq_to_map;
+ u64 next_seq_to_user;
+ struct list_head to_user;
+
+ /*
+ * There is a misc device per target. The name is selected by
+ * userspace (via a DM create ioctl argument), and each ends up in
+ * /dev/dm-user/. It looks like a better way to do this may be to have
+ * a filesystem to manage these, but this was more expedient. The
+ * current mechanism is functional, but does result in an arbitrary
+ * number of dynamically created misc devices.
+ */
+ struct miscdevice miscdev;
+
+ /*
+ * Device mapper's target destructor triggers tearing this all down,
+ * but we can't actually free until every channel associated with this
+ * target has been destroyed. Channels each have a reference to their
+ * target, and there is an additional single reference that corresponds
+ * to both DM and the misc device (both of which are destroyed by DM).
+ *
+ * In the common case userspace will be asleep waiting for a new
+ * message when device mapper decides to destroy the target, which
+ * means no new messages will appear. The destroyed flag triggers a
+ * wakeup, which will end up removing the reference.
+ */
+ struct kref references;
+ int dm_destroyed;
+};
+
+struct channel {
+ struct target *target;
+
+ /*
+ * A channel has a single lock, which prevents multiple reads (or
+ * multiple writes) from conflicting with each other.
+ */
+ struct mutex lock;
+
+ struct message *cur_to_user;
+ struct message *cur_from_user;
+ ssize_t to_user_error;
+ ssize_t from_user_error;
+
+ /*
+ * Once a message has been forwarded to userspace on a channel it must
+ * be responded to on the same channel. This allows us to error out
+ * the messages that have not yet been responded to by a channel when
+ * that channel closes, which makes handling errors more reasonable for
+ * fault-tolerant userspace daemons. It also happens to make avoiding
+ * shared locks between user_map() and dev_read() a lot easier.
+ *
+ * This does preclude a multi-threaded work stealing userspace
+ * implementation (or at least, force a degree of head-of-line blocking
+ * on the response path).
+ */
+ struct list_head from_user;
+
+ /*
+ * Responses from userspace can arrive in arbitrarily small chunks.
+ * We need some place to buffer one up until we can find the
+ * corresponding kernel-side message to continue processing, so instead
+ * of allocating them we just keep one off to the side here. This can
+ * only ever be pointed to by from_user_cur, and will never have a BIO.
+ */
+ struct message scratch_message_from_user;
+};
+
+static inline struct target *target_from_target(struct dm_target *target)
+{
+ WARN_ON(target->private == NULL);
+ return target->private;
+}
+
+static inline struct target *target_from_miscdev(struct miscdevice *miscdev)
+{
+ return container_of(miscdev, struct target, miscdev);
+}
+
+static inline struct channel *channel_from_file(struct file *file)
+{
+ WARN_ON(file->private_data == NULL);
+ return file->private_data;
+}
+
+static inline struct target *target_from_channel(struct channel *c)
+{
+ WARN_ON(c->target == NULL);
+ return c->target;
+}
+
+static inline size_t bio_size(struct bio *bio)
+{
+ struct bio_vec bvec;
+ struct bvec_iter iter;
+ size_t out = 0;
+
+ bio_for_each_segment(bvec, bio, iter)
+ out += bio_iter_len(bio, iter);
+ return out;
+}
+
+static inline size_t bio_bytes_needed_to_user(struct bio *bio)
+{
+ switch (bio_op(bio)) {
+ case REQ_OP_WRITE:
+ return sizeof(struct dm_user_message) + bio_size(bio);
+ case REQ_OP_READ:
+ case REQ_OP_FLUSH:
+ case REQ_OP_DISCARD:
+ case REQ_OP_SECURE_ERASE:
+ case REQ_OP_WRITE_SAME:
+ case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ZONE_OPEN:
+ case REQ_OP_ZONE_CLOSE:
+ case REQ_OP_ZONE_FINISH:
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_ZONE_RESET:
+ return sizeof(struct dm_user_message);
+
+ /*
+ * These ops are not passed to userspace under the assumption that
+ * they're not going to be particularly useful in that context.
+ */
+ case REQ_OP_SCSI_IN:
+ case REQ_OP_SCSI_OUT:
+ case REQ_OP_DRV_IN:
+ case REQ_OP_DRV_OUT:
+ /* Anything new isn't supported,at least not yet. */
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+static inline size_t bio_bytes_needed_from_user(struct bio *bio)
+{
+ switch (bio_op(bio)) {
+ case REQ_OP_READ:
+ return sizeof(struct dm_user_message) + bio_size(bio);
+ case REQ_OP_WRITE:
+ case REQ_OP_FLUSH:
+ case REQ_OP_DISCARD:
+ case REQ_OP_SECURE_ERASE:
+ case REQ_OP_WRITE_SAME:
+ case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ZONE_OPEN:
+ case REQ_OP_ZONE_CLOSE:
+ case REQ_OP_ZONE_FINISH:
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_ZONE_RESET:
+ return sizeof(struct dm_user_message);
+
+ /*
+ * These ops are not passed to userspace under the assumption that
+ * they're not going to be particularly useful in that context.
+ */
+ case REQ_OP_SCSI_IN:
+ case REQ_OP_SCSI_OUT:
+ case REQ_OP_DRV_IN:
+ case REQ_OP_DRV_OUT:
+ /* Anything new isn't supported,at least not yet. */
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+static inline long bio_type_to_user_type(struct bio *bio)
+{
+ switch (bio_op(bio)) {
+ case REQ_OP_READ:
+ return DM_USER_REQ_MAP_READ;
+ case REQ_OP_WRITE:
+ return DM_USER_REQ_MAP_WRITE;
+ case REQ_OP_FLUSH:
+ return DM_USER_REQ_MAP_FLUSH;
+ case REQ_OP_DISCARD:
+ return DM_USER_REQ_MAP_DISCARD;
+ case REQ_OP_SECURE_ERASE:
+ return DM_USER_REQ_MAP_SECURE_ERASE;
+ case REQ_OP_WRITE_SAME:
+ return DM_USER_REQ_MAP_WRITE_SAME;
+ case REQ_OP_WRITE_ZEROES:
+ return DM_USER_REQ_MAP_WRITE_ZEROES;
+ case REQ_OP_ZONE_OPEN:
+ return DM_USER_REQ_MAP_ZONE_OPEN;
+ case REQ_OP_ZONE_CLOSE:
+ return DM_USER_REQ_MAP_ZONE_CLOSE;
+ case REQ_OP_ZONE_FINISH:
+ return DM_USER_REQ_MAP_ZONE_FINISH;
+ case REQ_OP_ZONE_APPEND:
+ return DM_USER_REQ_MAP_ZONE_APPEND;
+ case REQ_OP_ZONE_RESET:
+ return DM_USER_REQ_MAP_ZONE_RESET;
+
+ /*
+ * These ops are not passed to userspace under the assumption that
+ * they're not going to be particularly useful in that context.
+ */
+ case REQ_OP_SCSI_IN:
+ case REQ_OP_SCSI_OUT:
+ case REQ_OP_DRV_IN:
+ case REQ_OP_DRV_OUT:
+ /* Anything new isn't supported,at least not yet. */
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+static inline long bio_flags_to_user_flags(struct bio *bio)
+{
+ u64 out = 0;
+ typeof(bio->bi_opf) opf = bio->bi_opf & ~REQ_OP_MASK;
+
+ if (opf & REQ_FAILFAST_DEV) {
+ opf &= ~REQ_FAILFAST_DEV;
+ out |= DM_USER_REQ_MAP_FLAG_FAILFAST_DEV;
+ }
+
+ if (opf & REQ_FAILFAST_TRANSPORT) {
+ opf &= ~REQ_FAILFAST_TRANSPORT;
+ out |= DM_USER_REQ_MAP_FLAG_FAILFAST_TRANSPORT;
+ }
+
+ if (opf & REQ_FAILFAST_DRIVER) {
+ opf &= ~REQ_FAILFAST_DRIVER;
+ out |= DM_USER_REQ_MAP_FLAG_FAILFAST_DRIVER;
+ }
+
+ if (opf & REQ_SYNC) {
+ opf &= ~REQ_SYNC;
+ out |= DM_USER_REQ_MAP_FLAG_SYNC;
+ }
+
+ if (opf & REQ_META) {
+ opf &= ~REQ_META;
+ out |= DM_USER_REQ_MAP_FLAG_META;
+ }
+
+ if (opf & REQ_PRIO) {
+ opf &= ~REQ_PRIO;
+ out |= DM_USER_REQ_MAP_FLAG_PRIO;
+ }
+
+ if (opf & REQ_NOMERGE) {
+ opf &= ~REQ_NOMERGE;
+ out |= DM_USER_REQ_MAP_FLAG_NOMERGE;
+ }
+
+ if (opf & REQ_IDLE) {
+ opf &= ~REQ_IDLE;
+ out |= DM_USER_REQ_MAP_FLAG_IDLE;
+ }
+
+ if (opf & REQ_INTEGRITY) {
+ opf &= ~REQ_INTEGRITY;
+ out |= DM_USER_REQ_MAP_FLAG_INTEGRITY;
+ }
+
+ if (opf & REQ_FUA) {
+ opf &= ~REQ_FUA;
+ out |= DM_USER_REQ_MAP_FLAG_FUA;
+ }
+
+ if (opf & REQ_PREFLUSH) {
+ opf &= ~REQ_PREFLUSH;
+ out |= DM_USER_REQ_MAP_FLAG_PREFLUSH;
+ }
+
+ if (opf & REQ_PREFLUSH) {
+ opf &= ~REQ_PREFLUSH;
+ out |= DM_USER_REQ_MAP_FLAG_PREFLUSH;
+ }
+
+ if (opf & REQ_RAHEAD) {
+ opf &= ~REQ_RAHEAD;
+ out |= DM_USER_REQ_MAP_FLAG_RAHEAD;
+ }
+
+ if (opf & REQ_BACKGROUND) {
+ opf &= ~REQ_BACKGROUND;
+ out |= DM_USER_REQ_MAP_FLAG_BACKGROUND;
+ }
+
+ if (opf & REQ_BACKGROUND) {
+ opf &= ~REQ_BACKGROUND;
+ out |= DM_USER_REQ_MAP_FLAG_BACKGROUND;
+ }
+
+ if (opf & REQ_NOWAIT) {
+ opf &= ~REQ_NOWAIT;
+ out |= DM_USER_REQ_MAP_FLAG_NOWAIT;
+ }
+
+ if (opf & REQ_CGROUP_PUNT) {
+ opf &= ~REQ_CGROUP_PUNT;
+ out |= DM_USER_REQ_MAP_FLAG_CGROUP_PUNT;
+ }
+
+ if (opf & REQ_NOUNMAP) {
+ opf &= ~REQ_NOUNMAP;
+ out |= DM_USER_REQ_MAP_FLAG_NOUNMAP;
+ }
+
+ if (opf & REQ_HIPRI) {
+ opf &= ~REQ_HIPRI;
+ out |= DM_USER_REQ_MAP_FLAG_HIPRI;
+ }
+
+ if (unlikely(opf)) {
+ pr_warn("unsupported BIO type %x\n", opf);
+ return -EOPNOTSUPP;
+ }
+ WARN_ON(out < 0);
+ return out;
+}
+
+/*
+ * Not quite what's in blk-map.c, but instead what I thought the functions in
+ * blk-map did. This one seems more generally useful and I think we could
+ * write the blk-map version in terms of this one. The differences are that
+ * this has a return value that counts, and blk-map uses the BIO _all iters.
+ * Neither advance the BIO iter but don't advance the IOV iter, which is a bit
+ * odd here.
+ */
+static ssize_t bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
+{
+ struct bio_vec bvec;
+ struct bvec_iter biter;
+ ssize_t out = 0;
+
+ bio_for_each_segment(bvec, bio, biter) {
+ ssize_t ret;
+
+ ret = copy_page_from_iter(bvec.bv_page, bvec.bv_offset,
+ bvec.bv_len, iter);
+
+ /*
+ * FIXME: I thought that IOV copies had a mechanism for
+ * terminating early, if for example a signal came in while
+ * sleeping waiting for a page to be mapped, but I don't see
+ * where that would happen.
+ */
+ WARN_ON(ret < 0);
+ out += ret;
+
+ if (!iov_iter_count(iter))
+ break;
+
+ if (ret < bvec.bv_len)
+ return ret;
+ }
+
+ return out;
+}
+
+static ssize_t bio_copy_to_iter(struct bio *bio, struct iov_iter *iter)
+{
+ struct bio_vec bvec;
+ struct bvec_iter biter;
+ ssize_t out = 0;
+
+ bio_for_each_segment(bvec, bio, biter) {
+ ssize_t ret;
+
+ ret = copy_page_to_iter(bvec.bv_page, bvec.bv_offset,
+ bvec.bv_len, iter);
+
+ /* as above */
+ WARN_ON(ret < 0);
+ out += ret;
+
+ if (!iov_iter_count(iter))
+ break;
+
+ if (ret < bvec.bv_len)
+ return ret;
+ }
+
+ return out;
+}
+
+static ssize_t msg_copy_to_iov(struct message *msg, struct iov_iter *to)
+{
+ ssize_t copied = 0;
+
+ if (!iov_iter_count(to))
+ return 0;
+
+ if (msg->posn_to_user < sizeof(msg->msg)) {
+ copied = copy_to_iter((char *)(&msg->msg) + msg->posn_to_user,
+ sizeof(msg->msg) - msg->posn_to_user, to);
+ } else {
+ copied = bio_copy_to_iter(msg->bio, to);
+ if (copied > 0)
+ bio_advance(msg->bio, copied);
+ }
+
+ if (copied < 0)
+ return copied;
+
+ msg->posn_to_user += copied;
+ return copied;
+}
+
+static ssize_t msg_copy_from_iov(struct message *msg, struct iov_iter *from)
+{
+ ssize_t copied = 0;
+
+ if (!iov_iter_count(from))
+ return 0;
+
+ if (msg->posn_from_user < sizeof(msg->msg)) {
+ copied = copy_from_iter(
+ (char *)(&msg->msg) + msg->posn_from_user,
+ sizeof(msg->msg) - msg->posn_from_user, from);
+ } else {
+ copied = bio_copy_from_iter(msg->bio, from);
+ if (copied > 0)
+ bio_advance(msg->bio, copied);
+ }
+
+ if (copied < 0)
+ return copied;
+
+ msg->posn_from_user += copied;
+ return copied;
+}
+
+static struct message *msg_get_map(struct target *t)
+{
+ struct message *m;
+
+ lockdep_assert_held(&t->lock);
+
+ m = mempool_alloc(&t->message_pool, GFP_NOIO);
+ m->msg.seq = t->next_seq_to_map++;
+ INIT_LIST_HEAD(&m->to_user);
+ INIT_LIST_HEAD(&m->from_user);
+ return m;
+}
+
+static struct message *msg_get_to_user(struct target *t)
+{
+ struct message *m;
+
+ lockdep_assert_held(&t->lock);
+
+ if (list_empty(&t->to_user))
+ return NULL;
+
+ m = list_first_entry(&t->to_user, struct message, to_user);
+ list_del(&m->to_user);
+ return m;
+}
+
+static struct message *msg_get_from_user(struct channel *c, u64 seq)
+{
+ struct message *m;
+ struct list_head *cur;
+
+ lockdep_assert_held(&c->lock);
+
+ list_for_each(cur, &c->from_user) {
+ m = list_entry(cur, struct message, from_user);
+ if (m->msg.seq == seq) {
+ list_del(&m->from_user);
+ return m;
+ }
+ }
+
+ return NULL;
+}
+
+void message_kill(struct message *m, mempool_t *pool)
+{
+ m->bio->bi_status = BLK_STS_IOERR;
+ bio_endio(m->bio);
+ bio_put(m->bio);
+ mempool_free(m, pool);
+}
+
+/*
+ * Returns 0 when there is no work left to do. This must be callable without
+ * holding the target lock, as it is part of the waitqueue's check expression.
+ * When called without the lock it may spuriously indicate there is remaining
+ * work, but when called with the lock it must be accurate.
+ */
+int target_poll(struct target *t)
+{
+ return !list_empty(&t->to_user) || t->dm_destroyed;
+}
+
+void target_release(struct kref *ref)
+{
+ struct target *t = container_of(ref, struct target, references);
+ struct list_head *cur;
+
+ /*
+ * There may be outstanding BIOs that have not yet been given to
+ * userspace. At this point there's nothing we can do about them, as
+ * there are and will never be any channels.
+ */
+ list_for_each (cur, &t->to_user) {
+ message_kill(list_entry(cur, struct message, to_user),
+ &t->message_pool);
+ }
+
+ mempool_exit(&t->message_pool);
+ mutex_unlock(&t->lock);
+ mutex_destroy(&t->lock);
+ kfree(t);
+}
+
+void target_put(struct target *t)
+{
+ /*
+ * This both releases a reference to the target and the lock. We leave
+ * it up to the caller to hold the lock, as they probably needed it for
+ * something else.
+ */
+ lockdep_assert_held(&t->lock);
+
+ if (!kref_put(&t->references, target_release))
+ mutex_unlock(&t->lock);
+}
+
+struct channel *channel_alloc(struct target *t)
+{
+ struct channel *c;
+
+ lockdep_assert_held(&t->lock);
+
+ c = kzalloc(sizeof(*c), GFP_KERNEL);
+ if (c == NULL)
+ return NULL;
+
+ kref_get(&t->references);
+ c->target = t;
+ c->cur_from_user = &c->scratch_message_from_user;
+ mutex_init(&c->lock);
+ INIT_LIST_HEAD(&c->from_user);
+ return c;
+}
+
+void channel_free(struct channel *c)
+{
+ struct list_head *cur;
+
+ lockdep_assert_held(&c->lock);
+
+ /*
+ * There may be outstanding BIOs that have been given to userspace but
+ * have not yet been completed. The channel has been shut down so
+ * there's no way to process the rest of those messages, so we just go
+ * ahead and error out the BIOs. Hopefully whatever's on the other end
+ * can handle the errors. One could imagine splitting the BIOs and
+ * completing as much as we got, but that seems like overkill here.
+ *
+ * Our only other options would be to let the BIO hang around (which
+ * seems way worse) or to resubmit it to userspace in the hope there's
+ * another channel. I don't really like the idea of submitting a
+ * message twice.
+ */
+ if (c->cur_to_user != NULL)
+ message_kill(c->cur_to_user, &c->target->message_pool);
+ if (c->cur_from_user != &c->scratch_message_from_user)
+ message_kill(c->cur_from_user, &c->target->message_pool);
+ list_for_each(cur, &c->from_user)
+ message_kill(list_entry(cur, struct message, to_user),
+ &c->target->message_pool);
+
+ mutex_lock(&c->target->lock);
+ target_put(c->target);
+ mutex_unlock(&c->lock);
+ mutex_destroy(&c->lock);
+ kfree(c);
+}
+
+static int dev_open(struct inode *inode, struct file *file)
+{
+ struct channel *c;
+ struct target *t;
+
+ /*
+ * This is called by miscdev, which sets private_data to point to the
+ * struct miscdevice that was opened. The rest of our file operations
+ * want to refer to the channel that's been opened, so we swap that
+ * pointer out with a fresh channel.
+ *
+ * This is called with the miscdev lock held, which is also held while
+ * registering/unregistering the miscdev. The miscdev must be
+ * registered for this to get called, which means there must be an
+ * outstanding reference to the target, which means it cannot be freed
+ * out from under us despite us not holding a reference yet.
+ */
+ t = container_of(file->private_data, struct target, miscdev);
+ mutex_lock(&t->lock);
+ file->private_data = c = channel_alloc(t);
+
+ if (c == NULL) {
+ mutex_unlock(&t->lock);
+ return -ENOSPC;
+ }
+
+ mutex_unlock(&t->lock);
+ return 0;
+}
+
+static ssize_t dev_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct channel *c = channel_from_file(iocb->ki_filp);
+ ssize_t total_processed = 0;
+ ssize_t processed;
+
+ mutex_lock(&c->lock);
+
+ if (unlikely(c->to_user_error)) {
+ total_processed = c->to_user_error;
+ goto cleanup_unlock;
+ }
+
+ if (c->cur_to_user == NULL) {
+ struct target *t = target_from_channel(c);
+
+ mutex_lock(&t->lock);
+
+ while (!target_poll(t)) {
+ int e;
+
+ mutex_unlock(&t->lock);
+ mutex_unlock(&c->lock);
+ e = wait_event_interruptible(t->wq, target_poll(t));
+ mutex_lock(&c->lock);
+ mutex_lock(&t->lock);
+
+ if (unlikely(e != 0)) {
+ /*
+ * We haven't processed any bytes in either the
+ * BIO or the IOV, so we can just terminate
+ * right now. Elsewhere in the kernel handles
+ * restarting the syscall when appropriate.
+ */
+ total_processed = e;
+ mutex_unlock(&t->lock);
+ goto cleanup_unlock;
+ }
+ }
+
+ if (unlikely(t->dm_destroyed)) {
+ /*
+ * DM has destroyed this target, so just lock
+ * the user out. There's really nothing else
+ * we can do here. Note that we don't actually
+ * tear any thing down until userspace has
+ * closed the FD, as there may still be
+ * outstanding BIOs.
+ *
+ * This is kind of a wacky error code to
+ * return. My goal was really just to try and
+ * find something that wasn't likely to be
+ * returned by anything else in the miscdev
+ * path. The message "block device required"
+ * seems like a somewhat reasonable thing to
+ * say when the target has disappeared out from
+ * under us, but "not block" isn't sensible.
+ */
+ c->to_user_error = total_processed = -ENOTBLK;
+ mutex_unlock(&t->lock);
+ goto cleanup_unlock;
+ }
+
+ /*
+ * Ensures that accesses to the message data are not ordered
+ * before the remote accesses that produce that message data.
+ *
+ * This pairs with the barrier in user_map(), via the
+ * conditional within the while loop above. Also see the lack
+ * of barrier in user_dtr(), which is why this can be after the
+ * destroyed check.
+ */
+ smp_rmb();
+
+ c->cur_to_user = msg_get_to_user(t);
+ WARN_ON(c->cur_to_user == NULL);
+ mutex_unlock(&t->lock);
+ }
+
+ processed = msg_copy_to_iov(c->cur_to_user, to);
+ total_processed += processed;
+
+ WARN_ON(c->cur_to_user->posn_to_user > c->cur_to_user->total_to_user);
+ if (c->cur_to_user->posn_to_user == c->cur_to_user->total_to_user) {
+ struct message *m = c->cur_to_user;
+
+ c->cur_to_user = NULL;
+ list_add_tail(&m->from_user, &c->from_user);
+ }
+
+cleanup_unlock:
+ mutex_unlock(&c->lock);
+ return total_processed;
+}
+
+static ssize_t dev_splice_read(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
+{
+ return -EOPNOTSUPP;
+}
+
+static ssize_t dev_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct channel *c = channel_from_file(iocb->ki_filp);
+ ssize_t total_processed = 0;
+ ssize_t processed;
+
+ mutex_lock(&c->lock);
+
+ if (unlikely(c->from_user_error)) {
+ total_processed = c->from_user_error;
+ goto cleanup_unlock;
+ }
+
+ /*
+ * cur_from_user can never be NULL. If there's no real message it must
+ * point to the scratch space.
+ */
+ WARN_ON(c->cur_from_user == NULL);
+ if (c->cur_from_user->posn_from_user < sizeof(struct dm_user_message)) {
+ struct message *msg, *old;
+
+ processed = msg_copy_from_iov(c->cur_from_user, from);
+ if (processed <= 0) {
+ pr_warn("msg_copy_from_iov() returned %zu\n",
+ processed);
+ c->from_user_error = -EINVAL;
+ goto cleanup_unlock;
+ }
+ total_processed += processed;
+
+ /*
+ * In the unlikely event the user has provided us a very short
+ * write, not even big enough to fill a message, just succeed.
+ * We'll eventually build up enough bytes to do something.
+ */
+ if (unlikely(c->cur_from_user->posn_from_user <
+ sizeof(struct dm_user_message)))
+ goto cleanup_unlock;
+
+ old = c->cur_from_user;
+ mutex_lock(&c->target->lock);
+ msg = msg_get_from_user(c, c->cur_from_user->msg.seq);
+ if (msg == NULL) {
+ pr_info("user provided an invalid messag seq of %llx\n",
+ old->msg.seq);
+ mutex_unlock(&c->target->lock);
+ c->from_user_error = -EINVAL;
+ goto cleanup_unlock;
+ }
+ mutex_unlock(&c->target->lock);
+
+ WARN_ON(old->posn_from_user != sizeof(struct dm_user_message));
+ msg->posn_from_user = sizeof(struct dm_user_message);
+ msg->return_type = old->msg.type;
+ msg->return_flags = old->msg.flags;
+ WARN_ON(msg->posn_from_user > msg->total_from_user);
+ c->cur_from_user = msg;
+ WARN_ON(old != &c->scratch_message_from_user);
+ }
+
+ /*
+ * Userspace can signal an error for single requests by overwriting the
+ * seq field.
+ */
+ switch (c->cur_from_user->return_type) {
+ case DM_USER_RESP_SUCCESS:
+ c->cur_from_user->bio->bi_status = BLK_STS_OK;
+ break;
+ case DM_USER_RESP_ERROR:
+ case DM_USER_RESP_UNSUPPORTED:
+ default:
+ c->cur_from_user->bio->bi_status = BLK_STS_IOERR;
+ goto finish_bio;
+ }
+
+ /*
+ * The op was a success as far as userspace is concerned, so process
+ * whatever data may come along with it. The user may provide the BIO
+ * data in multiple chunks, in which case we don't need to finish the
+ * BIO.
+ */
+ processed = msg_copy_from_iov(c->cur_from_user, from);
+ total_processed += processed;
+
+ if (c->cur_from_user->posn_from_user <
+ c->cur_from_user->total_from_user)
+ goto cleanup_unlock;
+
+finish_bio:
+ /*
+ * When we set up this message the BIO's size matched the
+ * message size, if that's not still the case then something
+ * has gone off the rails.
+ */
+ WARN_ON(bio_size(c->cur_from_user->bio) != 0);
+ bio_endio(c->cur_from_user->bio);
+ bio_put(c->cur_from_user->bio);
+
+ /*
+ * We don't actually need to take the target lock here, as all
+ * we're doing is freeing the message and mempools have their
+ * own lock. Each channel has its ows scratch message.
+ */
+ WARN_ON(c->cur_from_user == &c->scratch_message_from_user);
+ mempool_free(c->cur_from_user, &c->target->message_pool);
+ c->scratch_message_from_user.posn_from_user = 0;
+ c->cur_from_user = &c->scratch_message_from_user;
+
+cleanup_unlock:
+ mutex_unlock(&c->lock);
+ return total_processed;
+}
+
+static ssize_t dev_splice_write(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags)
+{
+ return -EOPNOTSUPP;
+}
+
+static __poll_t dev_poll(struct file *file, poll_table *wait)
+{
+ return -EOPNOTSUPP;
+}
+
+static int dev_release(struct inode *inode, struct file *file)
+{
+ struct channel *c;
+
+ c = channel_from_file(file);
+ mutex_lock(&c->lock);
+ channel_free(c);
+
+ return 0;
+}
+
+static int dev_fasync(int fd, struct file *file, int on)
+{
+ return -EOPNOTSUPP;
+}
+
+static long dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ return -EOPNOTSUPP;
+}
+
+static const struct file_operations file_operations = {
+ .owner = THIS_MODULE,
+ .open = dev_open,
+ .llseek = no_llseek,
+ .read_iter = dev_read,
+ .splice_read = dev_splice_read,
+ .write_iter = dev_write,
+ .splice_write = dev_splice_write,
+ .poll = dev_poll,
+ .release = dev_release,
+ .fasync = dev_fasync,
+ .unlocked_ioctl = dev_ioctl,
+};
+
+static int user_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct target *t;
+ int r;
+
+ if (argc != 3) {
+ ti->error = "Invalid argument count";
+ r = -EINVAL;
+ goto cleanup_none;
+ }
+
+ t = kzalloc(sizeof(*t), GFP_KERNEL);
+ if (t == NULL) {
+ r = -ENOSPC;
+ goto cleanup_none;
+ }
+ ti->private = t;
+
+ /*
+ * We begin with a single reference to the target, which is miscdev's
+ * reference. This ensures that the target won't be freed
+ * until after the miscdev has been unregistered and all extant
+ * channels have been closed.
+ */
+ kref_init(&t->references);
+ kref_get(&t->references);
+
+ mutex_init(&t->lock);
+ init_waitqueue_head(&t->wq);
+ INIT_LIST_HEAD(&t->to_user);
+ mempool_init_kmalloc_pool(&t->message_pool, MAX_OUTSTANDING_MESSAGES,
+ sizeof(struct message));
+
+ t->miscdev.minor = MISC_DYNAMIC_MINOR;
+ t->miscdev.fops = &file_operations;
+ t->miscdev.name = kasprintf(GFP_KERNEL, "dm-user/%s", argv[2]);
+ if (t->miscdev.name == NULL) {
+ r = -ENOSPC;
+ goto cleanup_message_pool;
+ }
+
+ /*
+ * Once the miscdev is registered it can be opened and therefor
+ * concurrent references to the channel can happen. Holding the target
+ * lock during misc_register() could deadlock. If registration
+ * succeeds then we will not access the target again so we just stick a
+ * barrier here, which pairs with taking the target lock everywhere
+ * else the target is accessed.
+ *
+ * I forgot where we ended up on the RCpc/RCsc locks. IIU RCsc locks
+ * would mean that we could take the target lock earlier and release it
+ * here instead of the memory barrier. I'm not sure that's any better,
+ * though, and this isn't on a hot path so it probably doesn't matter
+ * either way.
+ */
+ smp_mb();
+
+ r = misc_register(&t->miscdev);
+ if (r) {
+ DMERR("Unable to register miscdev %s for dm-user",
+ t->miscdev.name);
+ r = -ENOSPC;
+ goto cleanup_misc_name;
+ }
+
+ return 0;
+
+cleanup_misc_name:
+ kfree(t->miscdev.name);
+cleanup_message_pool:
+ mempool_exit(&t->message_pool);
+ kfree(t);
+cleanup_none:
+ return r;
+}
+
+static void user_dtr(struct dm_target *ti)
+{
+ struct target *t = target_from_target(ti);
+
+ /*
+ * Removes the miscdev. This must be called without the target lock
+ * held to avoid a possible deadlock because our open implementation is
+ * called holding the miscdev lock and must later take the target lock.
+ *
+ * There is no race here because only DM can register/unregister the
+ * miscdev, and DM ensures that doesn't happen twice. The internal
+ * miscdev lock is sufficient to ensure there are no races between
+ * deregistering the miscdev and open.
+ */
+ misc_deregister(&t->miscdev);
+
+ /*
+ * We are now free to take the target's lock and drop our reference to
+ * the target. There are almost certainly tasks sleeping in read on at
+ * least one of the channels associated with this target, this
+ * explicitly wakes them up and terminates the read.
+ */
+ mutex_lock(&t->lock);
+ /*
+ * No barrier here, as wait/wake ensures that the flag visibility is
+ * correct WRT the wake/sleep state of the target tasks.
+ */
+ t->dm_destroyed = true;
+ wake_up_all(&t->wq);
+ target_put(t);
+}
+
+/*
+ * Consumes a BIO from device mapper, queueing it up for userspace.
+ */
+static int user_map(struct dm_target *ti, struct bio *bio)
+{
+ struct target *t;
+ struct message *entry;
+
+ t = target_from_target(ti);
+ /*
+ * FIXME
+ *
+ * This seems like a bad idea. Specifically, here we're
+ * directly on the IO path when we take the target lock, which may also
+ * be taken from a user context. The user context doesn't actively
+ * trigger anything that may sleep while holding the lock, but this
+ * still seems like a bad idea.
+ *
+ * The obvious way to fix this would be to use a proper queue, which
+ * would result in no shared locks between the direct IO path and user
+ * tasks. I had a version that did this, but the head-of-line blocking
+ * from the circular buffer resulted in us needing a fairly large
+ * allocation in order to avoid situations in which the queue fills up
+ * and everything goes off the rails.
+ *
+ * I could jump through a some hoops to avoid a shared lock while still
+ * allowing for a large queue, but I'm not actually sure that allowing
+ * for very large queues is the right thing to do here. Intuitively it
+ * seems better to keep the queues small in here (essentially sized to
+ * the user latency for performance reasons only) and signal up the
+ * stack to start throttling IOs. I don't see a way to do that
+ * (returning DM_MAPIO_REQUEUE seems like it'd work, but doesn't do
+ * that).
+ *
+ * The best way I could come up with to fix this would be to use a
+ * two-lock concurrent queue that's of infinite size (ie, linked list
+ * based), which would get rid of the explicit shared lock. The
+ * mempool spinlock would still be shared, but I could just defer the
+ * free from dev_write to user_map (and probably a worker).
+ */
+ mutex_lock(&t->lock);
+
+ /*
+ * FIXME
+ *
+ * The assumption here is that there's no benefit to returning
+ * DM_MAPIO_KILL as opposed to just erroring out the BIO, but I'm not
+ * sure that's actually true -- for example, I could imagine users
+ * expecting that submitted BIOs are unlikely to fail and therefor
+ * relying on submission failure to indicate an unsupported type.
+ *
+ * There's two ways I can think of to fix this:
+ * - Add DM arguments that are parsed during the constructor that
+ * allow various dm_target flags to be set that indicate the op
+ * types supported by this target. This may make sense for things
+ * like discard, where DM can already transform the BIOs to a form
+ * that's likely to be supported.
+ * - Some sort of pre-filter that allows userspace to hook in here
+ * and kill BIOs before marking them as submitted. My guess would
+ * be that a userspace round trip is a bad idea here, but a BPF
+ * call seems resonable.
+ *
+ * My guess is that we'd likely want to do both. The first one is easy
+ * and gives DM the proper info, so it seems better. The BPF call
+ * seems overly complex for just this, but one could imagine wanting to
+ * sometimes return _MAPPED and a BPF filter would be the way to do
+ * that.
+ *
+ * For example, in Android we have an in-kernel DM device called
+ * "dm-bow" that takes advange of some portion of the space that has
+ * been discarded on a device to provide opportunistic block-level
+ * backups. While one could imagine just implementing this entirely in
+ * userspace, that would come with an appreciable performance penalty.
+ * Instead one could keep a BPF program that forwards most accesses
+ * directly to the backing block device while informing a userspace
+ * daemon of any discarded space and on writes to blocks that are to be
+ * backed up.
+ */
+ if (unlikely((bio_type_to_user_type(bio) < 0)
+ || (bio_flags_to_user_flags(bio) < 0))) {
+ mutex_unlock(&t->lock);
+ pr_warn("dm-user: unsupported bio_op() %d\n", bio_op(bio));
+ return DM_MAPIO_KILL;
+ }
+
+ entry = msg_get_map(t);
+ if (unlikely(entry == NULL)) {
+ mutex_unlock(&t->lock);
+ pr_warn("dm-user: unable to allocate message\n");
+ return DM_MAPIO_KILL;
+ }
+
+ bio_get(bio);
+ entry->msg.type = bio_type_to_user_type(bio);
+ entry->msg.flags = bio_flags_to_user_flags(bio);
+ entry->msg.sector = bio->bi_iter.bi_sector;
+ entry->msg.len = bio_size(bio);
+ entry->bio = bio;
+ entry->posn_to_user = 0;
+ entry->total_to_user = bio_bytes_needed_to_user(bio);
+ entry->posn_from_user = 0;
+ entry->total_from_user = bio_bytes_needed_from_user(bio);
+ /* Pairs with the barrier in dev_read() */
+ smp_wmb();
+ list_add_tail(&entry->to_user, &t->to_user);
+ wake_up_interruptible(&t->wq);
+ mutex_unlock(&t->lock);
+ return DM_MAPIO_SUBMITTED;
+}
+
+static struct target_type user_target = {
+ .name = "user",
+ .version = { 1, 0, 0 },
+ .module = THIS_MODULE,
+ .ctr = user_ctr,
+ .dtr = user_dtr,
+ .map = user_map,
+};
+
+static int __init dm_user_init(void)
+{
+ int r;
+
+ r = dm_register_target(&user_target);
+ if (r) {
+ DMERR("register failed %d", r);
+ goto error;
+ }
+
+ return 0;
+
+error:
+ return r;
+}
+
+static void __exit dm_user_exit(void)
+{
+ dm_unregister_target(&user_target);
+}
+
+module_init(dm_user_init);
+module_exit(dm_user_exit);
+MODULE_AUTHOR("Palmer Dabbelt <[email protected]>");
+MODULE_DESCRIPTION(DM_NAME " target returning blocks from userspace");
+MODULE_LICENSE("GPL");
--
2.29.2.454.gaff20da3a2-goog

2020-12-03 22:37:53

by Palmer Dabbelt

[permalink] [raw]
Subject: [PATCH v1 4/5] selftests/dm-user: A handful of tests for dm-user

From: Palmer Dabbelt <[email protected]>

These tests ruly on fsstress and fio to generate accesses to a block
device backed by a dm-user daemon.

Signed-off-by: Palmer Dabbelt <[email protected]>

---

I've lumped these all together rather than splitting them up. The tests
themselves are independent, but the associated build/run infastructure
is pretty simple and I got tired of handling all the merge conflicts
that came from juggling each test as its own patch.

The tests themselves sholud be portable, but the harness will only run
in my environment (ie, QEMU). That's kind of ugly, but I'm not really
sure how to do this in a more reasonable way. I run the tests as
follows:

$ rm -f scratch
$ truncate scratch --size=10G
$ qemu-system-x86_64 \
-m 8G -smp 32 -cpu host -accel kvm \
-kernel "${TREE}"/arch/x86_64/boot/bzImage \
-initrd buildroot/output/images/rootfs.cpio \
-append "console=ttyS0" \
-drive
file="${TREE}"/tools/testing/selftests/kselftest_install/kselftest-packages/kselftest.squashfs,if=virtio
\
-drive
file=scratch,if=virtio \
-nographic |&
tee log
---
tools/testing/selftests/.gitignore | 3 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/dm-user/.gitignore | 3 +
tools/testing/selftests/dm-user/Makefile | 23 ++
tools/testing/selftests/dm-user/README | 20 ++
.../selftests/dm-user/daemon-example.c | 186 ++++++++++++++
.../selftests/dm-user/daemon-parallel.c | 240 ++++++++++++++++++
.../testing/selftests/dm-user/daemon-short.c | 196 ++++++++++++++
.../selftests/dm-user/fio-rand-read-1G.fio | 16 ++
.../selftests/dm-user/fio-verify-1G.fio | 10 +
.../testing/selftests/dm-user/harness-fio.sh | 45 ++++
.../selftests/dm-user/harness-fsstress.sh | 44 ++++
.../selftests/dm-user/include/logging.h | 148 +++++++++++
tools/testing/selftests/dm-user/run.sh | 74 ++++++
14 files changed, 1009 insertions(+)
create mode 100644 tools/testing/selftests/dm-user/.gitignore
create mode 100644 tools/testing/selftests/dm-user/Makefile
create mode 100644 tools/testing/selftests/dm-user/README
create mode 100644 tools/testing/selftests/dm-user/daemon-example.c
create mode 100644 tools/testing/selftests/dm-user/daemon-parallel.c
create mode 100644 tools/testing/selftests/dm-user/daemon-short.c
create mode 100644 tools/testing/selftests/dm-user/fio-rand-read-1G.fio
create mode 100644 tools/testing/selftests/dm-user/fio-verify-1G.fio
create mode 100755 tools/testing/selftests/dm-user/harness-fio.sh
create mode 100755 tools/testing/selftests/dm-user/harness-fsstress.sh
create mode 100644 tools/testing/selftests/dm-user/include/logging.h
create mode 100755 tools/testing/selftests/dm-user/run.sh

diff --git a/tools/testing/selftests/.gitignore b/tools/testing/selftests/.gitignore
index 055a5019b13c..88b1938ea5e6 100644
--- a/tools/testing/selftests/.gitignore
+++ b/tools/testing/selftests/.gitignore
@@ -8,3 +8,6 @@ tpm2/SpaceTest.log
# Python bytecode and cache
__pycache__/
*.py[cod]
+
+# selftest install dir
+/kselftest_install/
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d9c283503159..f5e0f61c4384 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -9,6 +9,7 @@ TARGETS += clone3
TARGETS += core
TARGETS += cpufreq
TARGETS += cpu-hotplug
+TARGETS += dm-user
TARGETS += drivers/dma-buf
TARGETS += efivarfs
TARGETS += exec
diff --git a/tools/testing/selftests/dm-user/.gitignore b/tools/testing/selftests/dm-user/.gitignore
new file mode 100644
index 000000000000..7b0aa3e4a738
--- /dev/null
+++ b/tools/testing/selftests/dm-user/.gitignore
@@ -0,0 +1,3 @@
+/daemon-example
+/daemon-parallel
+/daemon-short
diff --git a/tools/testing/selftests/dm-user/Makefile b/tools/testing/selftests/dm-user/Makefile
new file mode 100644
index 000000000000..98ff4f5d0fad
--- /dev/null
+++ b/tools/testing/selftests/dm-user/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: GPL-2.0
+.PHONY: all clean
+
+top_srcdir = ../../../..
+INCLUDES := -I../ -Iinclude/ -I$(top_srcdir)/usr/include
+CFLAGS := $(CFLAGS) -g -O2 -Wall -static -D_GNU_SOURCE -pthread $(INCLUDES)
+KSFT_KHDR_INSTALL := 1
+
+TEST_GEN_FILES := \
+ daemon-example \
+ daemon-parallel \
+ daemon-short
+
+TEST_PROGS := \
+ fio-rand-read-1G.fio \
+ fio-verify-1G.fio \
+ harness-fio.sh \
+ harness-fsstress.sh \
+ run.sh
+
+$(TEST_GEN_FILES): khdr
+
+include ../lib.mk
diff --git a/tools/testing/selftests/dm-user/README b/tools/testing/selftests/dm-user/README
new file mode 100644
index 000000000000..213de27db35d
--- /dev/null
+++ b/tools/testing/selftests/dm-user/README
@@ -0,0 +1,20 @@
+dm-user Tests
+=============
+Tests for dm-user.
+
+Quick Start
+-----------
+It's probably a bad idea to just run this blindly, but all you need to do is:
+
+# make
+# ./run.sh
+
+Slow Start
+----------
+These tests use `dmsetup` to manage device mapper nodes, which is part of lvm2.
+Some use `fio`, and some use the `fsstress` from xfstests. Some of the tests
+also expect "/dev/vdb" to exist and to be at least 10G.
+
+I use a simple buildroot-based initramfs to run the tests. I've added an
+xfstests package to get fsstress, but I haven't sent out the patches yet. I
+run everything in QEMU.
diff --git a/tools/testing/selftests/dm-user/daemon-example.c b/tools/testing/selftests/dm-user/daemon-example.c
new file mode 100644
index 000000000000..b245fad192bf
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-example.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+int write_all(int fd, void *buf, size_t len)
+{
+ char *buf_c = buf;
+ ssize_t total = 0;
+ ssize_t once;
+
+ while (total < len) {
+ once = write(fd, buf_c + total, len - total);
+ if (once <= 0)
+ return once;
+ total += once;
+ }
+
+ return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+ char *buf_c = buf;
+ ssize_t total = 0;
+ ssize_t once;
+
+ while (total < len) {
+ once = read(fd, buf_c + total, len - total);
+ if (once <= 0)
+ return once;
+ total += once;
+ }
+
+ return total;
+}
+
+int simple_daemon(char *control_dev,
+ size_t block_bytes,
+ char *store)
+
+{
+ int control_fd = open(control_dev, O_RDWR);
+
+ if (control_fd < 0) {
+ ksft_print_msg("Unable to open control device %s\n", control_dev);
+ return RET_FAIL;
+ }
+
+ while (1) {
+ struct dm_user_message msg;
+ __u64 type;
+ char *base;
+
+ if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to read msg");
+ return RET_FAIL;
+ }
+
+ base = store + msg.sector * SECTOR_SIZE;
+ if (base + msg.len > store + block_bytes) {
+ fprintf(stderr, "access out of bounds\n");
+ return RET_FAIL;
+ }
+
+ type = msg.type;
+ switch (type) {
+ case DM_USER_REQ_MAP_WRITE:
+ msg.type = DM_USER_RESP_SUCCESS;
+ if (read_all(control_fd, base, msg.len) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to read buf");
+ return RET_FAIL;
+ }
+ break;
+ case DM_USER_REQ_MAP_FLUSH:
+ /* Nothing extra to do on flush, we're in memory. */
+ case DM_USER_REQ_MAP_READ:
+ msg.type = DM_USER_RESP_SUCCESS;
+ break;
+ default:
+ msg.type = DM_USER_RESP_UNSUPPORTED;
+ break;
+ }
+
+ if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to write msg");
+ return RET_FAIL;
+ }
+
+ if (type == DM_USER_REQ_MAP_READ) {
+ if (write_all(control_fd, base, msg.len) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to write buf");
+ return RET_FAIL;
+ }
+ }
+ }
+
+ /* The daemon doesn't actully terminate for this test. */
+ perror("Unable to read from control device");
+ return RET_FAIL;
+}
+
+void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+ printf(" -c <control dev> Control device to use for the test\n");
+ printf(" -s <sectors> The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = RET_PASS;
+ int c;
+ char *control_dev = NULL;
+ long block_bytes = 1024;
+ char *store;
+
+ prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+ while ((c = getopt(argc, argv, "h:v:c:s:")) != -1) {
+ switch (c) {
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ case 'c':
+ control_dev = strdup(optarg);
+ break;
+ case 's':
+ block_bytes = atoi(optarg) * SECTOR_SIZE;
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+ ksft_print_header();
+ ksft_set_plan(1);
+ ksft_print_msg("%s: block_bytes=%zu\n",
+ basename(argv[0]),
+ block_bytes);
+
+ store = malloc(block_bytes);
+ for (size_t i = 0; i < block_bytes/sizeof(size_t); ++i)
+ ((size_t *)(store))[i] = i;
+
+ ret = simple_daemon(control_dev, block_bytes, store);
+
+ print_result(basename(argv[0]), ret);
+ exit(ret);
+ return ret;
+}
diff --git a/tools/testing/selftests/dm-user/daemon-parallel.c b/tools/testing/selftests/dm-user/daemon-parallel.c
new file mode 100644
index 000000000000..9e5303f02241
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-parallel.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include <sys/mman.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+#define MAX_WORKER_COUNT 256
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+struct test_context {
+ char *control_dev;
+ size_t block_bytes;
+ char *store;
+ long worker_count;
+ char *backing_path;
+};
+
+int write_all(int fd, void *buf, size_t len)
+{
+ char *buf_c = buf;
+ ssize_t total = 0;
+ ssize_t once;
+
+ while (total < len) {
+ once = write(fd, buf_c + total, len - total);
+ if (once <= 0)
+ return once;
+ total += once;
+ }
+
+ return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+ char *buf_c = buf;
+ ssize_t total = 0;
+ ssize_t once;
+
+ while (total < len) {
+ once = read(fd, buf_c + total, len - total);
+ if (once <= 0)
+ return once;
+ total += once;
+ }
+
+ return total;
+}
+
+void *simple_daemon(void *context_uc)
+{
+ struct test_context *context = context_uc;
+ char *store = context->store;
+ int control_fd = open(context->control_dev, O_RDWR);
+
+ if (control_fd < 0) {
+ ksft_print_msg("Unable to open control device %s\n", context->control_dev);
+ return (void *)(RET_FAIL);
+ }
+
+ while (1) {
+ struct dm_user_message msg;
+ __u64 type;
+ char *base;
+
+ if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+ if (errno == ENOTBLK)
+ return (void *)(RET_PASS);
+
+ perror("unable to read msg");
+ return (void *)(RET_FAIL);
+ }
+
+ base = store + msg.sector * SECTOR_SIZE;
+ if (base + msg.len > store + context->block_bytes) {
+ fprintf(stderr, "access out of bounds\n");
+ return (void *)(RET_FAIL);
+ }
+
+ type = msg.type;
+ switch (type) {
+ case DM_USER_REQ_MAP_READ:
+ msg.type = DM_USER_RESP_SUCCESS;
+ break;
+ case DM_USER_REQ_MAP_WRITE:
+ msg.type = DM_USER_RESP_SUCCESS;
+ if (read_all(control_fd, base, msg.len) < 0) {
+ if (errno == ENOTBLK)
+ return (void *)(RET_PASS);
+
+ perror("unable to read buf");
+ return (void *)(RET_FAIL);
+ }
+ break;
+ case DM_USER_REQ_MAP_FLUSH:
+ msg.type = DM_USER_RESP_SUCCESS;
+ sync();
+ break;
+ default:
+ msg.type = DM_USER_RESP_UNSUPPORTED;
+ break;
+ }
+
+ if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+ if (errno == ENOTBLK)
+ return (void *)(RET_PASS);
+
+ perror("unable to write msg");
+ return (void *)(RET_FAIL);
+ }
+
+ if (type == DM_USER_REQ_MAP_READ) {
+ if (write_all(control_fd, base, msg.len) < 0) {
+ if (errno == ENOTBLK)
+ return (void *)(RET_PASS);
+
+ perror("unable to write buf");
+ return (void *)(RET_FAIL);
+ }
+ }
+ }
+
+ /* The daemon doesn't actully terminate for this test. */
+ perror("Unable to read from control device");
+ return (void *)(RET_FAIL);
+}
+
+void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+ printf(" -c <control dev> Control device to use for the test\n");
+ printf(" -s <sectors> The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = RET_PASS;
+ int done = 0;
+ int c;
+ struct test_context context = {
+ .control_dev = NULL,
+ .block_bytes = 0,
+ .worker_count = 1,
+ .backing_path = NULL,
+ };
+ pthread_t daemon[MAX_WORKER_COUNT];
+ void *pthread_ret;
+
+ prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+ while ((c = getopt(argc, argv, "h:v:c:s:w:b:")) != -1) {
+ switch (c) {
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ case 'c':
+ context.control_dev = strdup(optarg);
+ break;
+ case 's':
+ context.block_bytes = atoi(optarg) * SECTOR_SIZE;
+ break;
+ case 'w':
+ context.worker_count = atoi(optarg);
+ break;
+ case 'b':
+ context.backing_path = strdup(optarg);
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+ ksft_print_header();
+ ksft_set_plan(1);
+ ksft_print_msg("%s: block_bytes=%zu\n",
+ basename(argv[0]),
+ context.block_bytes);
+
+ ret = RET_PASS;
+
+ if (context.backing_path == NULL) {
+ ksft_print_msg("Using an in-memory backing store\n");
+ context.store = malloc(context.block_bytes);
+ for (size_t i = 0; i < context.block_bytes/sizeof(size_t); ++i)
+ ((size_t *)(context.store))[i] = i;
+ } else {
+ int backing_fd = open(context.backing_path, O_RDWR);
+
+ ksft_print_msg("Using %s as a backing store\n", context.backing_path);
+ if (backing_fd < 0) {
+ perror("Unable to open backing store");
+ ksft_print_msg("Unable to open backing store %s\n", context.backing_path);
+ return RET_FAIL;
+ }
+
+ context.store = mmap(NULL, context.block_bytes,
+ PROT_READ | PROT_WRITE, MAP_SHARED,
+ backing_fd, 0);
+ }
+
+ for (size_t i = 0; i < context.worker_count; ++i)
+ if (pthread_create(&daemon[i], NULL, &simple_daemon, &context) < 0)
+ ret = RET_ERROR;
+
+ while (!done) {
+ for (size_t i = 0; i < context.worker_count; ++i) {
+ if (pthread_tryjoin_np(daemon[i], &pthread_ret) == 0) {
+ if (pthread_ret != RET_PASS)
+ ret = RET_ERROR;
+ done = 1;
+ }
+ }
+
+ sleep(1);
+ }
+
+ print_result(basename(argv[0]), ret);
+ exit(ret);
+}
diff --git a/tools/testing/selftests/dm-user/daemon-short.c b/tools/testing/selftests/dm-user/daemon-short.c
new file mode 100644
index 000000000000..40fd114cb390
--- /dev/null
+++ b/tools/testing/selftests/dm-user/daemon-short.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020 Google, Inc
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <poll.h>
+#include <pthread.h>
+#include <linux/dm-user.h>
+#include <sys/prctl.h>
+#include "logging.h"
+
+#define SECTOR_SIZE 512
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+int write_all(int fd, void *buf, size_t len)
+{
+ char *buf_c = buf;
+ ssize_t total = 0;
+ ssize_t once;
+
+ while (total < len) {
+ size_t max = len - total;
+
+ if (max > 3)
+ max = max / 3;
+
+ once = write(fd, buf_c + total, max);
+ if (once <= 0)
+ return once;
+ total += once;
+ }
+
+ return total;
+}
+
+int read_all(int fd, void *buf, size_t len)
+{
+ char *buf_c = buf;
+ ssize_t total = 0;
+ ssize_t once;
+
+ while (total < len) {
+ size_t max = len - total;
+
+ if (max > 3)
+ max = max / 3;
+
+ once = read(fd, buf_c + total, max);
+ if (once <= 0)
+ return once;
+ total += once;
+ }
+
+ return total;
+}
+
+int simple_daemon(char *control_dev,
+ size_t block_bytes,
+ char *store)
+
+{
+ int control_fd = open(control_dev, O_RDWR);
+
+ if (control_fd < 0) {
+ ksft_print_msg("Unable to open control device %s\n", control_dev);
+ return RET_FAIL;
+ }
+
+ while (1) {
+ struct dm_user_message msg;
+ __u64 type;
+ char *base;
+
+ if (read_all(control_fd, &msg, sizeof(msg)) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to read msg");
+ return RET_FAIL;
+ }
+
+ base = store + msg.sector * SECTOR_SIZE;
+ if (base + msg.len > store + block_bytes) {
+ fprintf(stderr, "access out of bounds\n");
+ return RET_FAIL;
+ }
+
+ type = msg.type;
+ switch (type) {
+ case DM_USER_REQ_MAP_WRITE:
+ msg.type = DM_USER_RESP_SUCCESS;
+ if (read_all(control_fd, base, msg.len) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to read buf");
+ return RET_FAIL;
+ }
+ break;
+ case DM_USER_REQ_MAP_FLUSH:
+ /* Nothing extra to do on flush, we're in memory. */
+ case DM_USER_REQ_MAP_READ:
+ msg.type = DM_USER_RESP_SUCCESS;
+ break;
+ default:
+ msg.type = DM_USER_RESP_UNSUPPORTED;
+ break;
+ }
+
+ if (write_all(control_fd, &msg, sizeof(msg)) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to write msg");
+ return RET_FAIL;
+ }
+
+ if (type == DM_USER_REQ_MAP_READ) {
+ if (write_all(control_fd, base, msg.len) < 0) {
+ if (errno == ENOTBLK)
+ return RET_PASS;
+
+ perror("unable to write buf");
+ return RET_FAIL;
+ }
+ }
+ }
+
+ /* The daemon doesn't actully terminate for this test. */
+ perror("Unable to read from control device");
+ return RET_FAIL;
+}
+
+void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+ printf(" -c <control dev> Control device to use for the test\n");
+ printf(" -s <sectors> The number of sectors in the device\n");
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = RET_PASS;
+ int c;
+ char *control_dev = NULL;
+ long block_bytes = 1024;
+ char *store;
+
+ prctl(PR_SET_IO_FLUSHER, 0, 0, 0, 0);
+
+ while ((c = getopt(argc, argv, "h:v:c:s:")) != -1) {
+ switch (c) {
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ case 'c':
+ control_dev = strdup(optarg);
+ break;
+ case 's':
+ block_bytes = atoi(optarg) * SECTOR_SIZE;
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+ ksft_print_header();
+ ksft_set_plan(1);
+ ksft_print_msg("%s: block_bytes=%zu\n",
+ basename(argv[0]),
+ block_bytes);
+
+ store = malloc(block_bytes);
+ for (size_t i = 0; i < block_bytes/sizeof(size_t); ++i)
+ ((size_t *)(store))[i] = i;
+
+ ret = simple_daemon(control_dev, block_bytes, store);
+
+ print_result(basename(argv[0]), ret);
+ exit(ret);
+ return ret;
+}
diff --git a/tools/testing/selftests/dm-user/fio-rand-read-1G.fio b/tools/testing/selftests/dm-user/fio-rand-read-1G.fio
new file mode 100644
index 000000000000..f971483e0e27
--- /dev/null
+++ b/tools/testing/selftests/dm-user/fio-rand-read-1G.fio
@@ -0,0 +1,16 @@
+; fio-rand-read.job for fiotest
+
+[global]
+name=fio-rand-read-1G
+filename=fio-rand-read-1G
+rw=randread
+bs=4K
+direct=0
+numjobs=1
+time_based=1
+runtime=30
+
+[file1]
+size=1G
+ioengine=io_uring
+iodepth=16
diff --git a/tools/testing/selftests/dm-user/fio-verify-1G.fio b/tools/testing/selftests/dm-user/fio-verify-1G.fio
new file mode 100644
index 000000000000..4b626271ce7c
--- /dev/null
+++ b/tools/testing/selftests/dm-user/fio-verify-1G.fio
@@ -0,0 +1,10 @@
+# The most basic form of data verification. Write the device randomly
+# in 4K chunks, then read it back and verify the contents.
+[write-and-verify]
+rw=randwrite
+bs=4k
+ioengine=libaio
+iodepth=16
+direct=1
+verify=crc32c
+size=1G
diff --git a/tools/testing/selftests/dm-user/harness-fio.sh b/tools/testing/selftests/dm-user/harness-fio.sh
new file mode 100755
index 000000000000..4b95c9f5efd8
--- /dev/null
+++ b/tools/testing/selftests/dm-user/harness-fio.sh
@@ -0,0 +1,45 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright 2020 Google, Inc
+
+# Just a fixed size for now, but it's passed to the tests and they're supposed
+# to respect it.
+SIZE=1024
+BLOCK=kselftest-dm-user-block
+CONTROL=kselftest-dm-user-control
+unset FIO
+
+while [ x"$1" != x"--" ]
+do
+ case "$1" in
+ "-s") SIZE="$2"; shift 2;;
+ "-f") FIO="$2"; shift 2;;
+ *) echo "$0: unknown argument $1" >&2; exit 1;;
+ esac
+done
+shift
+
+# Run the benchmark again via dm-user, to see what the overhead is.
+dmsetup create $BLOCK << EOF
+0 $SIZE user 0 $SIZE $CONTROL
+EOF
+
+dmsetup resume $BLOCK
+
+"$@" -s $SIZE -c /dev/dm-user/$CONTROL &
+
+yes | mkfs.ext2 /dev/mapper/$BLOCK
+mount /dev/mapper/$BLOCK /mnt
+cp "$FIO" /mnt/benchmark.fio
+(cd /mnt; fio benchmark.fio)
+umount /mnt
+
+# Mount again and read the whole thing, just to see if there's any corruption.
+mount /dev/mapper/$BLOCK /mnt
+find /mnt -type f | xargs cat > /dev/null
+umount /mnt
+
+dmsetup remove $BLOCK
+
+# Make sure the daemon actually responds to DM closing it.
+wait
diff --git a/tools/testing/selftests/dm-user/harness-fsstress.sh b/tools/testing/selftests/dm-user/harness-fsstress.sh
new file mode 100755
index 000000000000..265c0ff636ca
--- /dev/null
+++ b/tools/testing/selftests/dm-user/harness-fsstress.sh
@@ -0,0 +1,44 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright 2020 Google, Inc
+
+BLOCK=kselftest-dm-user-block
+CONTROL=kselftest-dm-user-control
+unset SIZE
+unset NPROC
+unset NOP
+
+while [ x"$1" != x"--" ]
+do
+ case "$1" in
+ "-s") SIZE="$2"; shift 2;;
+ "-n") NOP="$2"; shift 2;;
+ "-p") NPROC="$2"; shift 2;;
+ *) echo "$0: unknown argument $1" >&2; exit 1;;
+ esac
+done
+shift
+
+# Runs the fs stress tests
+dmsetup create $BLOCK << EOF
+0 $SIZE user 0 $SIZE $CONTROL
+EOF
+
+dmsetup resume $BLOCK
+
+"$@" -s $SIZE -c /dev/dm-user/$CONTROL &
+
+yes | mkfs.ext2 /dev/mapper/$BLOCK
+mount /dev/mapper/$BLOCK /mnt
+/usr/xfstests/ltp/fsstress -d /mnt/ -n "$NOP" -p "$NPROC"
+umount /mnt
+
+# Mount again and read the whole thing, just to see if there's any corruption.
+mount /dev/mapper/$BLOCK /mnt
+find /mnt -type f | xargs cat > /dev/null
+umount /mnt
+
+dmsetup remove $BLOCK
+
+# Make sure the daemon actually responds to DM closing it.
+wait
diff --git a/tools/testing/selftests/dm-user/include/logging.h b/tools/testing/selftests/dm-user/include/logging.h
new file mode 100644
index 000000000000..874c69ce5cce
--- /dev/null
+++ b/tools/testing/selftests/dm-user/include/logging.h
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/******************************************************************************
+ *
+ * Copyright © International Business Machines Corp., 2009
+ *
+ * DESCRIPTION
+ * Glibc independent futex library for testing kernel functionality.
+ *
+ * AUTHOR
+ * Darren Hart <[email protected]>
+ *
+ * HISTORY
+ * 2009-Nov-6: Initial version by Darren Hart <[email protected]>
+ *
+ *****************************************************************************/
+
+#ifndef _LOGGING_H
+#define _LOGGING_H
+
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/futex.h>
+#include "kselftest.h"
+
+/*
+ * Define PASS, ERROR, and FAIL strings with and without color escape
+ * sequences, default to no color.
+ */
+#define ESC 0x1B, '['
+#define BRIGHT '1'
+#define GREEN '3', '2'
+#define YELLOW '3', '3'
+#define RED '3', '1'
+#define ESCEND 'm'
+#define BRIGHT_GREEN ESC, BRIGHT, ';', GREEN, ESCEND
+#define BRIGHT_YELLOW ESC, BRIGHT, ';', YELLOW, ESCEND
+#define BRIGHT_RED ESC, BRIGHT, ';', RED, ESCEND
+#define RESET_COLOR ESC, '0', 'm'
+static const char PASS_COLOR[] = {BRIGHT_GREEN, ' ', 'P', 'A', 'S', 'S',
+ RESET_COLOR, 0};
+static const char ERROR_COLOR[] = {BRIGHT_YELLOW, 'E', 'R', 'R', 'O', 'R',
+ RESET_COLOR, 0};
+static const char FAIL_COLOR[] = {BRIGHT_RED, ' ', 'F', 'A', 'I', 'L',
+ RESET_COLOR, 0};
+static const char INFO_NORMAL[] = " INFO";
+static const char PASS_NORMAL[] = " PASS";
+static const char ERROR_NORMAL[] = "ERROR";
+static const char FAIL_NORMAL[] = " FAIL";
+const char *INFO = INFO_NORMAL;
+const char *PASS = PASS_NORMAL;
+const char *ERROR = ERROR_NORMAL;
+const char *FAIL = FAIL_NORMAL;
+
+/* Verbosity setting for INFO messages */
+#define VQUIET 0
+#define VCRITICAL 1
+#define VINFO 2
+#define VMAX VINFO
+int _verbose = VCRITICAL;
+
+/* Functional test return codes */
+#define RET_PASS 0
+#define RET_ERROR -1
+#define RET_FAIL -2
+
+/**
+ * log_color() - Use colored output for PASS, ERROR, and FAIL strings
+ * @use_color: use color (1) or not (0)
+ */
+void log_color(int use_color)
+{
+ if (use_color) {
+ PASS = PASS_COLOR;
+ ERROR = ERROR_COLOR;
+ FAIL = FAIL_COLOR;
+ } else {
+ PASS = PASS_NORMAL;
+ ERROR = ERROR_NORMAL;
+ FAIL = FAIL_NORMAL;
+ }
+}
+
+/**
+ * log_verbosity() - Set verbosity of test output
+ * @verbose: Enable (1) verbose output or not (0)
+ *
+ * Currently setting verbose=1 will enable INFO messages and 0 will disable
+ * them. FAIL and ERROR messages are always displayed.
+ */
+void log_verbosity(int level)
+{
+ if (level > VMAX)
+ level = VMAX;
+ else if (level < 0)
+ level = 0;
+ _verbose = level;
+}
+
+/**
+ * print_result() - Print standard PASS | ERROR | FAIL results
+ * @ret: the return value to be considered: 0 | RET_ERROR | RET_FAIL
+ *
+ * print_result() is primarily intended for functional tests.
+ */
+void print_result(const char *test_name, int ret)
+{
+ switch (ret) {
+ case RET_PASS:
+ ksft_test_result_pass("%s\n", test_name);
+ ksft_print_cnts();
+ return;
+ case RET_ERROR:
+ ksft_test_result_error("%s\n", test_name);
+ ksft_print_cnts();
+ return;
+ case RET_FAIL:
+ ksft_test_result_fail("%s\n", test_name);
+ ksft_print_cnts();
+ return;
+ }
+}
+
+/* log level macros */
+#define info(message, vargs...) \
+do { \
+ if (_verbose >= VINFO) \
+ fprintf(stderr, "\t%s: "message, INFO, ##vargs); \
+} while (0)
+
+#define error(message, err, args...) \
+do { \
+ if (_verbose >= VCRITICAL) {\
+ if (err) \
+ fprintf(stderr, "\t%s: %s: "message, \
+ ERROR, strerror(err), ##args); \
+ else \
+ fprintf(stderr, "\t%s: "message, ERROR, ##args); \
+ } \
+} while (0)
+
+#define fail(message, args...) \
+do { \
+ if (_verbose >= VCRITICAL) \
+ fprintf(stderr, "\t%s: "message, FAIL, ##args); \
+} while (0)
+
+#endif
diff --git a/tools/testing/selftests/dm-user/run.sh b/tools/testing/selftests/dm-user/run.sh
new file mode 100755
index 000000000000..2ed2581e4a57
--- /dev/null
+++ b/tools/testing/selftests/dm-user/run.sh
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright 2020 Palmer Dabbelt <[email protected]>
+
+# Top-level run script for dm-user kernel self tests. This just runs a bunch
+# of different tests back to back, relying on the kernel selftest infrastructure
+# to tease out the success/failure of each. The tests all use the same global
+# directories and such, so it's not like there's a whole lot
+#
+# The actual test code should be fairly portable, but the scripts that run it
+# aren't. See the README for more information.
+
+# Runs various FIO scripts against an ext2-based filesystem backed by dm-user.
+if test -e /usr/bin/fio
+then
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-example
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-short
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 1
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 4
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 16
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 64
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 256
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 1 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 4 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 16 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 64 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-rand-read-1G.fio -- ./daemon-parallel -w 256 -b /dev/vdb
+
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-example
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-short
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 1
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 4
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 16
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 64
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 256
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 1 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 4 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 16 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 64 -b /dev/vdb
+ ./harness-fio.sh -s 3000000 -f fio-verify-1G.fio -- ./daemon-parallel -w 256 -b /dev/vdb
+else
+ echo "Unable to find /usr/bin/fio"
+fi
+
+# Runs fsstress from xfstests against an ext2-based filesystem backed by
+# dm-user.
+if test -e /usr/xfstests/ltp/fsstress
+then
+ ./harness-fsstress.sh -s 3000000 -p 1 -n 10000 -- ./daemon-example
+ ./harness-fsstress.sh -s 3000000 -p 4 -n 10000 -- ./daemon-example
+ ./harness-fsstress.sh -s 3000000 -p 16 -n 10000 -- ./daemon-example
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-example
+ ./harness-fsstress.sh -s 3000000 -p 256 -n 10000 -- ./daemon-example
+
+ ./harness-fsstress.sh -s 3000000 -p 1 -n 10000 -- ./daemon-short
+ ./harness-fsstress.sh -s 3000000 -p 4 -n 10000 -- ./daemon-short
+ ./harness-fsstress.sh -s 3000000 -p 16 -n 10000 -- ./daemon-short
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-short
+ ./harness-fsstress.sh -s 3000000 -p 256 -n 10000 -- ./daemon-short
+
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 1
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 4
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 16
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 64
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 256
+
+ ./harness-fsstress.sh -s 3000000 -p 1 -n 10000 -- ./daemon-parallel -w 1 -b /dev/vdb
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 1 -b /dev/vdb
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 4 -b /dev/vdb
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 16 -b /dev/vdb
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 64 -b /dev/vdb
+ ./harness-fsstress.sh -s 3000000 -p 64 -n 10000 -- ./daemon-parallel -w 256 -b /dev/vdb
+else
+ echo "Unable to find /usr/xfstests/ltp/fsstress"
+fi
--
2.29.2.454.gaff20da3a2-goog

2020-12-03 22:37:57

by Palmer Dabbelt

[permalink] [raw]
Subject: [PATCH v1 1/5] Documentation: Describe dm-user

From: Palmer Dabbelt <[email protected]>

I started by patterning this after the Fuse documentation, which is
located at Documentation/fs/fuse.rst. There's not a whole lot of that
left, though.

Signed-off-by: Palmer Dabbelt <[email protected]>

---

This is a work in progress, but nothing in there should be incorrect.
---
Documentation/block/dm-user.rst | 99 +++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
create mode 100644 Documentation/block/dm-user.rst

diff --git a/Documentation/block/dm-user.rst b/Documentation/block/dm-user.rst
new file mode 100644
index 000000000000..5eb3120f3fd5
--- /dev/null
+++ b/Documentation/block/dm-user.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+dm-user
+=======
+
+What is dm-user?
+================
+
+dm-user is a device mapper target that allows block accesses to be satisfied by
+an otherwise unprivileged daemon running in userspace. Conceptually it is
+FUSE, but for block devices as opposed to file systems.
+
+Creating a dm-user Target
+=========================
+
+dm-user is implemented as a Device Mapper target, which allows for various
+device management tasks. In general dm-user targets function in the same
+fashion as other device-mapper targets, with the exception that dm-user targets
+handle requests via a userspace daemon as opposed to one of various in-kernel
+mechanisms. As such there is little difference between creating a dm-user
+target and any other device mapper target: the standard device mapper control
+device and ioctl() calls are used to create a table with at least one target of
+the "user" type. Like all other targets this table entry needs a start/size
+pair. The additional required argument is the name of the control device that
+will be associated with this target. Specifically:
+
+````
+user <start sector> <number of sectors> <path to control device>
+````
+
+As a concrete example, the following `dmsetup` invocation will create a new
+device mapper block device available at `/dev/mapper/blk`, consisting entirely
+of a single target which can be controlled via a stream of messages passed over
+`/dev/dm-user/ctl`.
+
+````
+dmsetup create blk <<EOF
+0 1024 user 0 1024 ctl
+EOF
+dmsetup resume blk
+````
+
+Userspace is expected to spin up a daemon to handle those block requests, which
+will block in the meantime. The various Device Mapper operations should all
+function as they do for other targets. If the userspace daemon terminates (or
+otherwise closes the control file descriptor) then the kernel will continue to
+queue up BIOs until userspace either starts a new daemon or destroys the
+target.
+
+Userspace may open each control device multiple times, in which case the kernel
+will distribute messages among each file instance.
+
+Writing a dm-user Daemon
+========================
+
+tools/testing/selftests/dm-user contains a handful of test daemons.
+functional/simple-read-all.c is also suitable as an example.
+
+Kernel - userspace interface
+****************************
+
+FIXME: A description of `struct dm_user_message`
+
+Kernel Implementation
+=====================
+
+BIO Lifecycle
+*************
+
+| "dd if=/dev/mapper/user ... " | dm-user block server
+| |
+| | >sys_read()
+| | >dev_read()
+| | [sleep on c->wq]
+| |
+| >sys_read() |
+| [... block and DM layer ... ] |
+| >user_map() |
+| [enqueue message] |
+| [wake up c->wq] |
+| <user_map() | [woken up]
+| [sleep on BIO completion] | [copy message to user]
+| | <dev_read()
+| | <sys_read()
+| |
+| | [obtain request data]
+| |
+| | >sys_write()
+| | >dev_write()
+| | [copy message from user]
+| | [complete BIO]
+| [woken up on BIO completion] | <dev_write()
+| <sys_read() | <sys_write()
+| |
+| [write and loop] | [loop for more messages]
+
+Locking Scheme
+**************
--
2.29.2.454.gaff20da3a2-goog

2020-12-03 22:38:56

by Palmer Dabbelt

[permalink] [raw]
Subject: [PATCH v1 5/5] MAINTAINERS: Add myself as a dm-user maintainer

From: Palmer Dabbelt <[email protected]>

I'm not sure this is big enough to warrant an entry in the MAINTAINERS file,
but I know it can be quite a bit of work to maintain something like this so I'm
happy to sign up if that helps.

Signed-off-by: Palmer Dabbelt <[email protected]>
---
MAINTAINERS | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2daa6ee673f7..ab9d7746cfb4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5198,6 +5198,13 @@ W: http://sources.redhat.com/cluster/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git
F: fs/dlm/

+DM USER (Block Device in Userspace)
+M: Palmer Dabbelt <[email protected]>
+S: Maintained
+F: include/linux/dm-user.h
+F: drivers/md/dm-user.c
+F: tools/testing/selftests/dm-user/
+
DMA BUFFER SHARING FRAMEWORK
M: Sumit Semwal <[email protected]>
M: Christian König <[email protected]>
--
2.29.2.454.gaff20da3a2-goog

2020-12-04 10:36:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

What is the advantage over simply using nbd?

2020-12-07 18:58:44

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?

There's a short bit about that in the cover letter (and in some talks), but
I'll expand on it here -- I suppose my most important question is "is this
interesting enough to take upstream?", so there should be at least a bit of a
description of what it actually enables:

I don't think there's any deep fundamental advantages to doing this as opposed
to nbd/iscsi over localhost/unix (or by just writing a kernel implementation,
for that matter), at least in terms of anything that was previously impossible
now becoming possible. There are a handful of things that are easier and/or
faster, though.

dm-user looks a lot like NBD without the networking. The major difference is
which side initiates messages: in NBD the kernel initiates messages, while in
dm-user userspace initiates messages (via a read that will block if there is no
message, but presumably we'd want to add support for a non-blocking userspace
implementations eventually). The NBD approach certainly makes sense for a
networked system, as one generally wants to have a single storage server
handling multiple clients, but inverting that makes some things simpler in
dm-user.

One specific advantage of this change is that a dm-user target can be
transitioned from one daemon to another without any IO errors: just spin up the
second daemon, signal the first to stop requesting new messages, and let it
exit. We're using that mechanism to replace the daemon launched by early init
(which runs before the security subsystem is up, as in our use case dm-user
provides the root filesystem) with one that's properly sandboxed (which can
only be launched after the root filesystem has come up). There are ways around
this (replacing the DM table, for example), but they don't fit it as cleanly.

Unless I'm missing something, NBD servers aren't capable of that style of
transition: soft disconnects can only be initiated by the client (the kernel,
in this case), which leaves no way for the server to transition while
guaranteeing that no IOs error out. It's usually possible to shoehorn this
sort of direction reversing concept into network protocols, but it's also
usually ugly (I'm thinking of IDLE, for example). I didn't try to actually do
it, but my guess would be that adding a way for the server to ask the client to
stop sending messages until a new server shows up would be at least as much
work as doing this.

There are also a handful of possible performance advantages, but I haven't gone
through the work to prove any of them out yet as performance isn't all that
important for our first use case. For example:

* Cutting out the network stack is unlikely to hurt performance. I'm not sure
if it will help performance, though. I think if we really had workload where
the extra copy was likely to be an issue we'd want an explicit ring buffer,
but I have a theory that it would be possible to get very good performance out
of a stream-style API by using multiple channels and relying on io_uring to
plumb through multiple ops per channel.
* There's a comment in the implementation about allowing userspace to insert
itself into user_map(), likely by uploading a BPF fragment. There's a whole
class of interesting block devices that could be written in this fashion:
essentially you keep a cache on a regular block device that handles the common
cases by remapping BIOs and passing them along, relegating the more complicated
logic to fetch cache misses and watching some subset of the access stream where
necessary.

We have a use case like this in Android, where we opportunistically store
backups in a portion of the TRIM'd space on devices. It's currently
implemented entirely in kernel by the dm-bow target, but IIUC that was deemed
too Android-specific to merge. Assuming we could get good enough performance
we could move that logic to userspace, which lets us shrink our diff with
upstream. It feels like some other interesting block devices could be
written in a similar fashion.

All in all, I've found it a bit hard to figure out what sort of interest people
have in dm-user: when I bring this up I seem to run into people who've done
similar things before and are vaguely interested, but certainly nobody is
chomping at the bit. I'm sending it out in this early state to try and figure
out if it's interesting enough to keep going.

2020-12-10 03:43:02

by Bart Van Assche

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> All in all, I've found it a bit hard to figure out what sort of interest
> people
> have in dm-user: when I bring this up I seem to run into people who've done
> similar things before and are vaguely interested, but certainly nobody is
> chomping at the bit.  I'm sending it out in this early state to try and
> figure
> out if it's interesting enough to keep going.

Cc-ing Josef and Mike since their nbd contributions make me wonder
whether this new driver could be useful to their use cases?

Thanks,

Bart.

2020-12-11 08:31:59

by Josef Bacik

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On 12/9/20 10:38 PM, Bart Van Assche wrote:
> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>> All in all, I've found it a bit hard to figure out what sort of interest
>> people
>> have in dm-user: when I bring this up I seem to run into people who've done
>> similar things before and are vaguely interested, but certainly nobody is
>> chomping at the bit.  I'm sending it out in this early state to try and
>> figure
>> out if it's interesting enough to keep going.
>
> Cc-ing Josef and Mike since their nbd contributions make me wonder
> whether this new driver could be useful to their use cases?
>

Sorry gmail+imap sucks and I can't get my email client to get at the original
thread. However here is my take.

1) The advantages of using dm-user of NBD that you listed aren't actually
problems for NBD. We have NBD working in production where you can hand off the
sockets for the server without ending in timeouts, it was actually the main
reason we wrote our own server so we could use the FD transfer stuff to restart
the server without impacting any clients that had the device in use.

2) The extra copy is a big deal, in fact we already have too many copies in our
existing NBD setup and are actively looking for ways to avoid those.

Don't take this as I don't think dm-user is a good idea, but I think at the very
least it should start with the very best we have to offer, starting with as few
copies as possible.

If you are using it currently in production then cool, there's clearly a usecase
for it. Personally as I get older and grouchier I want less things in the
kernel, so if this enables us to eventually do everything NBD related in
userspace with no performance drop then I'd be down. I don't think you need to
make that your primary goal, but at least polishing this up so it could
potentially be abused in the future would make it more compelling for merging.
Thanks,

Josef

2020-12-15 03:04:06

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On Thu, 10 Dec 2020 09:03:21 PST (-0800), [email protected] wrote:
> On 12/9/20 10:38 PM, Bart Van Assche wrote:
>> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>>> All in all, I've found it a bit hard to figure out what sort of interest
>>> people
>>> have in dm-user: when I bring this up I seem to run into people who've done
>>> similar things before and are vaguely interested, but certainly nobody is
>>> chomping at the bit.  I'm sending it out in this early state to try and
>>> figure
>>> out if it's interesting enough to keep going.
>>
>> Cc-ing Josef and Mike since their nbd contributions make me wonder
>> whether this new driver could be useful to their use cases?
>>
>
> Sorry gmail+imap sucks and I can't get my email client to get at the original
> thread. However here is my take.

and I guess I then have to apoligize for missing your email ;). Hopefully that
was the problem, but who knows.

> 1) The advantages of using dm-user of NBD that you listed aren't actually
> problems for NBD. We have NBD working in production where you can hand off the
> sockets for the server without ending in timeouts, it was actually the main
> reason we wrote our own server so we could use the FD transfer stuff to restart
> the server without impacting any clients that had the device in use.

OK. So you just send the FD around using one of the standard mechanisms to
orchestrate the handoff? I guess that might work for our use case, assuming
whatever the security side of things was doing was OK with the old FD. TBH I'm
not sure how all that works and while we thought about doing that sort of
transfer scheme we decided to just open it again -- not sure how far we were
down the dm-user rabbit hole at that point, though, as this sort of arose out
of some other ideas.

> 2) The extra copy is a big deal, in fact we already have too many copies in our
> existing NBD setup and are actively looking for ways to avoid those.
>
> Don't take this as I don't think dm-user is a good idea, but I think at the very
> least it should start with the very best we have to offer, starting with as few
> copies as possible.

I was really experting someone to say that. It does seem kind of silly to build
out the new interface, but not go all the way to a ring buffer. We just didn't
really have any way to justify the extra complexity as our use cases aren't
that high performance. I kind of like to have benchmarks for this sort of
thing, though, and I didn't have anyone who had bothered avoiding the last copy
to compare against.

> If you are using it currently in production then cool, there's clearly a usecase
> for it. Personally as I get older and grouchier I want less things in the
> kernel, so if this enables us to eventually do everything NBD related in
> userspace with no performance drop then I'd be down. I don't think you need to
> make that your primary goal, but at least polishing this up so it could
> potentially be abused in the future would make it more compelling for merging.
> Thanks,

Ya, it's in Android already and we'll be shipping it as part of the new OTA
flow for the next release. The rules on deprecation are a bit different over
there, though, so it's not like we're wed to it. The whole point of bringing
this up here was to try and get something usable by everyone, and while I'd
eventually like to get whatever's in Android into the kernel proper we'd really
planned on supporting an extra Android-only ABI for a cycle at least.

I'm kind of inclined to take a crack at the extra copy, to at least see if
building something that eliminates it is viable. I'm not really sure if it is
(or at least, if it'll net us a meaningful amount of performance), but it'd at
least be interesting to try.

It'd be nice to have some benchmark target, though, as otherwise this stuff
hangs on forever. My workloads are in selftests later on in the patch set, but
I'm essentially using tmpfs as a baseline to compare against ext4+dm-user with
some FIO examples as workloads. Our early benchmark numbers indicated this was
way faster than we needed, so I didn't even bother putting together a proper
system to run on so I don't really have any meaningful numbers there. Is there
an NBD server that's fast that I should be comparing against?

I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
reasonable performance numbers before bothering with a v2.

2020-12-22 13:34:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> reasonable performance numbers before bothering with a v2.

FYI, my other main worry beside duplicating nbd is that device mapper
really is a stacked interface that sits on top of other block device.
Turning this into something else that just pipes data to userspace
seems very strange.

2020-12-22 14:38:24

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On Tue, Dec 22 2020 at 8:32am -0500,
Christoph Hellwig <[email protected]> wrote:

> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> > reasonable performance numbers before bothering with a v2.
>
> FYI, my other main worry beside duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block device.
> Turning this into something else that just pipes data to userspace
> seems very strange.

I agree. Only way I'd be interested is if it somehow tackled enabling
much more efficient IO. Earlier discussion in this thread mentioned
that zero-copy and low overhead wasn't a priority (because it is hard,
etc). But the hard work has already been done with io_uring. If
dm-user had a prereq of leaning heavily on io_uring and also enabled IO
polling for bio-based then there may be a win to supporting it.

But unless lower latency (or some other more significant win) is made
possible I just don't care to prop up an unnatural DM bolt-on.

Mike

2020-12-22 20:33:32

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On Tue, 22 Dec 2020 05:32:46 PST (-0800), Christoph Hellwig wrote:
> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> reasonable performance numbers before bothering with a v2.
>
> FYI, my other main worry beside duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block device.
> Turning this into something else that just pipes data to userspace
> seems very strange.

Agreed. It certainly doesn't fit the DM model. We'd considered doing a non-DM
version of this (maybe "ubd"), but decided to stick with dm-user because we
didn't want to duplicate all the device creation stuff that DM provides. A
simple version of that wouldn't be that hard to do, but the DM version has a
lot of features and we get that all for free. We essentially decided to run
with DM until it gets in the way, and the only sticking point we ended up with
was that REQUEUE stuff (though not sure how that would show up with a bare
block device) and that scheduler question.

I'm going to stick with DM for now, unless it gets in the way, to avoid coming
up with a device creation scheme myself. In the long term it's probably best
to have this be a standalone thing, but I don't want to dump a bunch of time
into putting that stuff together only to find that this isn't interesting
enough from a performance perspective to stick around.

2020-12-22 20:40:27

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On Tue, 22 Dec 2020 06:36:16 PST (-0800), [email protected] wrote:
> On Tue, Dec 22 2020 at 8:32am -0500,
> Christoph Hellwig <[email protected]> wrote:
>
>> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> > reasonable performance numbers before bothering with a v2.
>>
>> FYI, my other main worry beside duplicating nbd is that device mapper
>> really is a stacked interface that sits on top of other block device.
>> Turning this into something else that just pipes data to userspace
>> seems very strange.
>
> I agree. Only way I'd be interested is if it somehow tackled enabling
> much more efficient IO. Earlier discussion in this thread mentioned
> that zero-copy and low overhead wasn't a priority (because it is hard,
> etc). But the hard work has already been done with io_uring. If
> dm-user had a prereq of leaning heavily on io_uring and also enabled IO
> polling for bio-based then there may be a win to supporting it.
>
> But unless lower latency (or some other more significant win) is made
> possible I just don't care to prop up an unnatural DM bolt-on.

I don't remember if I mentioned this in the thread, but it was definately in
the Plumbers talk, but I'd had the general idea bouncing around that it would
be possible to write a high-performance version of this using an interface
similar to the one provided here while relying on io_uring for the
high-performance userspace. That definately won't work with exactly the
current interface, but my hope was to avoid writing my own high-performance
ring buffer. My worry was that it'll be too tricky to map this all to
zero-copy, and I guess I forgot about it.

Now that you bring it up, it certainly seems worth taking a shot at. We'd
essentially have the best of both worlds: userspace implementations that want
to be simple could just use read()/write(), while those that want to be higher
performance could have their implicit ring buffer.

I'm currently trying to put together a benchmarking setup that is of sufficient
fidelity that I would believe the numbers, which is really why I don't have any
performance numbers yet (no sense posting numbers I would shoot down :)). I'll
try to remember to take a shot at an io_uring based userspace (probably with
some dm-user interface modifications) to see how it feels.

2020-12-23 07:52:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace


FYI, a few years ago I spent some time helping a customer to prepare
their block device in userspace using fuse code for upstreaming, but
at some point they abandoned the project. But if for some reason we
don't want to use nbd I think a driver using the fuse infrastructure
would be the next logical choice.

2020-12-23 17:02:13

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

On 12/22/20 11:48 PM, Christoph Hellwig wrote:
> FYI, a few years ago I spent some time helping a customer to prepare
> their block device in userspace using fuse code for upstreaming, but
> at some point they abandoned the project. But if for some reason we
> don't want to use nbd I think a driver using the fuse infrastructure
> would be the next logical choice.

Hi Christoph,

Thanks for having shared this information. Since I'm not familiar with the
FUSE code: does this mean translating block device accesses into FUSE_READ
and FUSE_WRITE messages? Does the FUSE kernel code only support exchanging
such messages between kernel and user space via the read() and write()
system calls? I'm asking this since there is already an interface in the
Linux kernel for implementing block devices in user space that uses another
approach, namely a ring buffer for messages and data that is shared between
kernel and user space (documented in Documentation/target/tcmu-design.rst).
Is one system call per read and per write operation fast enough for all
block-device-in-user-space implementations?

Thanks,

Bart.