2016-10-26 19:21:38

by David Herrmann

Subject: [RFC v1 00/14] Bus1 Kernel Message Bus

Hi

This proposal introduces bus1.ko, a kernel messaging bus. This is not a request
for inclusion, yet. It is rather an initial draft and a Request For Comments.

While bus1 emerged out of the kdbus project, bus1 was started from scratch and
the concepts have little in common. In a nutshell, bus1 provides a
capability-based IPC system, similar in nature to Android Binder, Cap'n Proto,
and seL4. The module is completely generic and neither requires nor mandates a
user-space counterpart.

o Description

Bus1 is a local IPC system, which provides a decentralized infrastructure to
share objects between local peers. The main building blocks are nodes and
handles. Nodes represent objects of a local peer, while handles represent
descriptors that point to a node. Nodes can be created and destroyed by any
peer, and they will always remain owned by their respective creator. Handles,
on the other hand, are used to refer to nodes and can be passed around with
messages as auxiliary data. Whenever a handle is transferred, the receiver
will get its own handle allocated, pointing to the same node as the original
handle.

Any peer can send messages directed at one of their handles. This will
transfer the message to the owner of the node the handle points to. If a
peer does not possess a handle to a given node, it will not be able to send a
message to that node. That is, handles provide exclusive access management.
Anyone who has somehow acquired a handle to a node may pass this handle on to
other peers. As such, access management is transitive. Once a peer has
acquired a handle, it cannot be revoked. However, a node
owner can, at any time, destroy a node. This will effectively unbind all
existing handles to that node on any peer, notifying each one of the
destruction.

Unlike nodes and handles, peers cannot be addressed directly. In fact, peers
are completely disconnected entities. A peer is merely an anchor for a set of
nodes and handles, plus an incoming message queue for them. Whether multiple
nodes are all part of the same peer, or part of different peers, does not
affect how they appear to remote peers. Peers exist solely as management
entities and command dispatchers for local processes.

The set of actors on a system is completely decentralized. There is no
global component involved that provides a central registry or discovery
mechanism. Furthermore, communication between peers only involves those
peers, and does not affect any other peer in any way. No global
communication lock is taken. However, any communication is still globally
ordered, including unicasts, multicasts, and notifications.
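
To make this more tangible, here is a rough user-space sketch of the intended
flow: two peers are created by opening /dev/bus1, the first peer creates a
node and hands a handle to it over to the second peer, which then sends a
message that ends up in the node owner's pool. This is an illustration only;
it assumes the BUS1_CMD_* ioctl codes and structures from <linux/bus1.h> as
described in the bus1(7) man-page (patch 1), uses an arbitrary pool mapping
size, and omits all error handling.

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/uio.h>
  #include <unistd.h>
  #include <linux/bus1.h>

  int example(void)
  {
          int a = open("/dev/bus1", O_RDWR | O_CLOEXEC);  /* peer A, node owner */
          int b = open("/dev/bus1", O_RDWR | O_CLOEXEC);  /* peer B, sender */

          /* Create a node on A by passing a fresh, unmanaged ID, and give B
           * a handle to it. */
          struct bus1_cmd_handle_transfer xfer = {
                  .src_handle = 0x100,            /* MANAGED bit cleared */
                  .dst_fd = b,
                  .dst_handle = BUS1_HANDLE_INVALID,
          };
          ioctl(a, BUS1_CMD_HANDLE_TRANSFER, &xfer);
          uint64_t dest = xfer.dst_handle;        /* B's handle to A's node */

          /* B sends a payload to the node; it is queued on A, the node owner. */
          struct iovec vec = { .iov_base = "hello", .iov_len = 6 };
          struct bus1_cmd_send send = {
                  .ptr_destinations = (uintptr_t)&dest,
                  .n_destinations = 1,
                  .ptr_vecs = (uintptr_t)&vec,
                  .n_vecs = 1,
          };
          ioctl(b, BUS1_CMD_SEND, &send);

          /* A maps its pool read-only and dequeues the message slice. */
          void *pool = mmap(NULL, 1 << 20, PROT_READ, MAP_SHARED, a, 0);
          struct bus1_cmd_recv recv = { .max_offset = 1 << 20 };
          ioctl(a, BUS1_CMD_RECV, &recv);
          write(STDOUT_FILENO, (char *)pool + recv.msg.offset, recv.msg.n_bytes);
          ioctl(a, BUS1_CMD_SLICE_RELEASE, &recv.msg.offset);

          close(b);
          close(a);
          return 0;
  }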

o Prior Art

The concepts behind bus1 are almost identical to capability systems like
Android Binder, Google Mojo, Cap'n Proto, seL4, and more. Bus1 differs from
them in that it provides global ordering, multicasts, and resource accounting,
while requiring no global locking and no global context.

While the bus1 UAPI does not expose all features (like soft-references as
supported by Binder), the in-kernel code includes support for them. Multiple
UAPIs can be supported on top of the in-kernel bus1 code, including support
for the Binder UAPI. Efforts on this are still ongoing.

o Documentation

The first patch in this series provides the bus1(7) man-page. It explains
all concepts in bus1 in more detail. Furthermore, it describes the API that
is available on bus1 file descriptors. The pre-compiled man-page is
available at:

http://www.bus1.org/bus1.html

There is also a good amount of in-source documentation available. All
cross-source-file APIs have kernel-doc annotations. Furthermore, each
subsystem has an introduction, to be found in its header file. Bus1 amounts
to roughly 4.5k lines of code; the remaining ~5k lines are comments and
documentation.

o Upstream

The upstream development repository is available on github:

http://github.com/bus1/bus1

It is an out-of-tree repository that allows easy and fast development of
new bus1 features. The in-tree integration repository is available at:

http://github.com/bus1/linux

o Conferences

Tom and I will be attending Linux Plumbers Conf next week. Please do not
hesitate to contact us there in person. There will also be a presentation
[1] of bus1 on the last day of the conference.

Thanks
Tom & David

[1] https://www.linuxplumbersconf.org/2016/ocw/proposals/3819

Tom Gundersen (14):
bus1: add bus1(7) man-page
bus1: provide stub cdev /dev/bus1
bus1: util - active reference utility library
bus1: util - fixed list utility library
bus1: util - pool utility library
bus1: util - queue utility library
bus1: tracking user contexts
bus1: implement peer management context
bus1: provide transaction context for multicasts
bus1: add handle management
bus1: implement message transmission
bus1: hook up file-operations
bus1: limit and protect resources
bus1: basic user-space kselftests

Documentation/bus1/.gitignore | 2 +
Documentation/bus1/Makefile | 41 +
Documentation/bus1/bus1.xml | 833 +++++++++++++++++++++
Documentation/bus1/stylesheet.xsl | 16 +
include/uapi/linux/bus1.h | 138 ++++
init/Kconfig | 17 +
ipc/Makefile | 1 +
ipc/bus1/Makefile | 16 +
ipc/bus1/handle.c | 823 ++++++++++++++++++++
ipc/bus1/handle.h | 312 ++++++++
ipc/bus1/main.c | 146 ++++
ipc/bus1/main.h | 88 +++
ipc/bus1/message.c | 656 ++++++++++++++++
ipc/bus1/message.h | 171 +++++
ipc/bus1/peer.c | 1163 +++++++++++++++++++++++++++++
ipc/bus1/peer.h | 163 ++++
ipc/bus1/security.h | 45 ++
ipc/bus1/tests.c | 19 +
ipc/bus1/tests.h | 32 +
ipc/bus1/tx.c | 360 +++++++++
ipc/bus1/tx.h | 102 +++
ipc/bus1/user.c | 628 ++++++++++++++++
ipc/bus1/user.h | 140 ++++
ipc/bus1/util.c | 214 ++++++
ipc/bus1/util.h | 141 ++++
ipc/bus1/util/active.c | 419 +++++++++++
ipc/bus1/util/active.h | 154 ++++
ipc/bus1/util/flist.c | 116 +++
ipc/bus1/util/flist.h | 202 +++++
ipc/bus1/util/pool.c | 572 ++++++++++++++
ipc/bus1/util/pool.h | 164 ++++
ipc/bus1/util/queue.c | 445 +++++++++++
ipc/bus1/util/queue.h | 351 +++++++++
tools/testing/selftests/bus1/.gitignore | 2 +
tools/testing/selftests/bus1/Makefile | 19 +
tools/testing/selftests/bus1/bus1-ioctl.h | 111 +++
tools/testing/selftests/bus1/test-api.c | 532 +++++++++++++
tools/testing/selftests/bus1/test-io.c | 198 +++++
tools/testing/selftests/bus1/test.h | 114 +++
39 files changed, 9666 insertions(+)
create mode 100644 Documentation/bus1/.gitignore
create mode 100644 Documentation/bus1/Makefile
create mode 100644 Documentation/bus1/bus1.xml
create mode 100644 Documentation/bus1/stylesheet.xsl
create mode 100644 include/uapi/linux/bus1.h
create mode 100644 ipc/bus1/Makefile
create mode 100644 ipc/bus1/handle.c
create mode 100644 ipc/bus1/handle.h
create mode 100644 ipc/bus1/main.c
create mode 100644 ipc/bus1/main.h
create mode 100644 ipc/bus1/message.c
create mode 100644 ipc/bus1/message.h
create mode 100644 ipc/bus1/peer.c
create mode 100644 ipc/bus1/peer.h
create mode 100644 ipc/bus1/security.h
create mode 100644 ipc/bus1/tests.c
create mode 100644 ipc/bus1/tests.h
create mode 100644 ipc/bus1/tx.c
create mode 100644 ipc/bus1/tx.h
create mode 100644 ipc/bus1/user.c
create mode 100644 ipc/bus1/user.h
create mode 100644 ipc/bus1/util.c
create mode 100644 ipc/bus1/util.h
create mode 100644 ipc/bus1/util/active.c
create mode 100644 ipc/bus1/util/active.h
create mode 100644 ipc/bus1/util/flist.c
create mode 100644 ipc/bus1/util/flist.h
create mode 100644 ipc/bus1/util/pool.c
create mode 100644 ipc/bus1/util/pool.h
create mode 100644 ipc/bus1/util/queue.c
create mode 100644 ipc/bus1/util/queue.h
create mode 100644 tools/testing/selftests/bus1/.gitignore
create mode 100644 tools/testing/selftests/bus1/Makefile
create mode 100644 tools/testing/selftests/bus1/bus1-ioctl.h
create mode 100644 tools/testing/selftests/bus1/test-api.c
create mode 100644 tools/testing/selftests/bus1/test-io.c
create mode 100644 tools/testing/selftests/bus1/test.h

--
2.10.1


2016-10-26 19:21:45

by David Herrmann

Subject: [RFC v1 01/14] bus1: add bus1(7) man-page

From: Tom Gundersen <[email protected]>

Add new directory ./Documentation/bus1/ and include DocBook scripts to
build the bus1(7) man-page. This documents the bus1 character-device,
including all the file-operations you can perform on it.

Furthermore, the man-page also introduces the core bus1 concepts and
explains how they work.

Build the bus1 documentation via:

$ make -C Documentation/bus1/ mandocs
$ make -C Documentation/bus1/ htmldocs

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
Documentation/bus1/.gitignore | 2 +
Documentation/bus1/Makefile | 41 ++
Documentation/bus1/bus1.xml | 833 ++++++++++++++++++++++++++++++++++++++
Documentation/bus1/stylesheet.xsl | 16 +
4 files changed, 892 insertions(+)
create mode 100644 Documentation/bus1/.gitignore
create mode 100644 Documentation/bus1/Makefile
create mode 100644 Documentation/bus1/bus1.xml
create mode 100644 Documentation/bus1/stylesheet.xsl

diff --git a/Documentation/bus1/.gitignore b/Documentation/bus1/.gitignore
new file mode 100644
index 0000000..b4a77cc
--- /dev/null
+++ b/Documentation/bus1/.gitignore
@@ -0,0 +1,2 @@
+*.7
+*.html
diff --git a/Documentation/bus1/Makefile b/Documentation/bus1/Makefile
new file mode 100644
index 0000000..d2b9e61
--- /dev/null
+++ b/Documentation/bus1/Makefile
@@ -0,0 +1,41 @@
+cmd = $(cmd_$(1))
+srctree = $(shell pwd)
+src = .
+obj = $(srctree)/$(src)
+
+DOCS := \
+ bus1.xml
+
+XMLFILES := $(addprefix $(obj)/,$(DOCS))
+MANFILES := $(patsubst %.xml, %.7, $(XMLFILES))
+HTMLFILES := $(patsubst %.xml, %.html, $(XMLFILES))
+
+XMLTO_ARGS := \
+ -m $(srctree)/$(src)/stylesheet.xsl \
+ --skip-validation \
+ --stringparam funcsynopsis.style=ansi \
+ --stringparam man.output.quietly=1 \
+ --stringparam man.authors.section.enabled=0 \
+ --stringparam man.copyright.section.enabled=0
+
+quiet_cmd_db2man = MAN $@
+ cmd_db2man = xmlto man $(XMLTO_ARGS) -o $(obj) $<
+%.7: %.xml
+ @(which xmlto > /dev/null 2>&1) || \
+ (echo "*** You need to install xmlto ***"; \
+ exit 1)
+ $(call cmd,db2man)
+
+quiet_cmd_db2html = HTML $@
+ cmd_db2html = xmlto html-nochunks $(XMLTO_ARGS) -o $(obj) $<
+%.html: %.xml
+ @(which xmlto > /dev/null 2>&1) || \
+ (echo "*** You need to install xmlto ***"; \
+ exit 1)
+ $(call cmd,db2html)
+
+mandocs: $(MANFILES)
+
+htmldocs: $(HTMLFILES)
+
+clean-files := $(MANFILES) $(HTMLFILES)
diff --git a/Documentation/bus1/bus1.xml b/Documentation/bus1/bus1.xml
new file mode 100644
index 0000000..40b55c0
--- /dev/null
+++ b/Documentation/bus1/bus1.xml
@@ -0,0 +1,833 @@
+<?xml version='1.0'?> <!--*-nxml-*-->
+<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
+ "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
+
+<refentry id="bus1">
+
+ <refentryinfo>
+ <title>bus1</title>
+ <productname>bus1</productname>
+
+ <authorgroup>
+ <author>
+ <contrib>Documentation</contrib>
+ <firstname>David</firstname>
+ <surname>Herrmann</surname>
+ </author>
+ <author>
+ <contrib>Documentation</contrib>
+ <firstname>Tom</firstname>
+ <surname>Gundersen</surname>
+ </author>
+ </authorgroup>
+ </refentryinfo>
+
+ <refmeta>
+ <refentrytitle>bus1</refentrytitle>
+ <manvolnum>7</manvolnum>
+ </refmeta>
+
+ <refnamediv>
+ <refname>bus1</refname>
+ <refpurpose>Kernel Message Bus</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <funcsynopsis>
+ <funcsynopsisinfo>#include &lt;linux/bus1.h&gt;</funcsynopsisinfo>
+ </funcsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1> <!-- DESCRIPTION -->
+ <title>Description</title>
+ <para>
+The bus1 Kernel Message Bus defines and implements a distributed object model.
+It allows local processes to send messages to objects owned by remote processes,
+as well as share their own objects with others. Object ownership is static and
+cannot be transferred. Access to remote objects is prohibited, unless it was
+explicitly granted. Processes can transmit messages to a remote object via the
+message bus, transferring a data payload, object access rights, file
+descriptors, or other auxiliary data.
+ </para>
+ <para>
+To participate on the message bus, a peer context must be created. Peer contexts
+are kernel objects, identified by a file descriptor. They are not bound to any
+process, but can be shared freely. The peer context provides a message queue to
+store all incoming messages, a registry for all locally owned objects, and
+tracks access rights to remote objects. A peer context never serves as
+routing entity, but merely as anchor for peer-owned resources. Any message on
+the bus is always destined for an object, and the bus takes care to transfer a
+message into the message queue of the peer context that owns this object.
+ </para>
+ <para>
+The message bus manages object access using capabilities. That is, by default
+only the owner of an object is granted access rights. No other peer can access
+the object, nor are they aware of the existence of the object. However, access
+rights can be transmitted as auxiliary data with any message, effectively
+granting them to the receiver of the message. This even works transitively, that
+is, any peer that was granted access to an object can pass on those rights, even
+if they do not own the object. Note, though, that access rights can never be
+revoked, other than by the owner destroying the object.
+ </para>
+
+ <refsect2>
+ <title>Nodes and Handles</title>
+ <para>
+Each peer context comes with a registry of owned objects, which in bus1
+parlance are called <emphasis>nodes</emphasis>. A peer is always the exclusive
+owner of all nodes it has created. Ownership cannot be transferred. The message
+bus manages access rights to nodes as a set of <emphasis>handles</emphasis> held
+by each peer. For each node a peer has access to, whether it is local or remote,
+the message bus keeps a handle on the peer. Initially when a node is created the
+node owner is the only peer with a handle to the newly created node. Handles are
+local to each peer, but can be transmitted as auxiliary data with any message,
+effectively allocating a new handle to the same node in the destination peer.
+This works transitively, and each peer that holds a handle can pass it on
+further, or deliberately drop it. As long as a peer has a handle to a node it
+can send messages to it. However, a node owner can, at any time, decide to
+destroy a node. This causes all further message transactions to this node to
+fail, although messages that have already been queued for the node are still
+delivered. When a node is destroyed, all peers that hold handles to the node are
+notified of the destruction. Moreover, if the owner of a node that has been
+destroyed releases all its handles to the node, no further messages or
+notifications destined for the node are delivered.
+ </para>
+ <para>
+Handles are the only way to refer to both local and remote nodes. For each
+handle allocated on a peer, a 64-bit ID is assigned to identify that particular
+handle on that particular peer. The ID is only valid locally on that peer, it
+cannot be used by remote peers to address the handle (in other words, the ID
+namespace is tied to each peer and does not define global entities). When
+creating a new node, userspace freely selects the ID except that the
+<constant>BUS1_HANDLE_FLAG_MANAGED</constant> bit must be cleared, and when
+receiving a handle from a remote peer the kernel assigns the ID, which always
+has the <constant>BUS1_HANDLE_FLAG_MANAGED</constant> set. Additionally, the
+<constant>BUS1_HANDLE_FLAG_REMOTE</constant> flag tells whether a specific ID
+refers to a remote handle (if set), or to an owner handle (if unset). An ID
+assigned by the
+kernel is never reused, even after a handle has been dropped. The kernel keeps a
+user-reference count for each handle. Every time a handle is exposed to a peer,
+the user-reference count of that handle is incremented by one. This is never
+done asynchronously, but only synchronously when an ioctl is called by the
+holding peer. Therefore, a peer can reliably deduce the current user-reference
+count of all its handles, regardless of any ongoing message transaction.
+References can be explicitly dropped by a peer. Once the counter of a handle
+hits zero, it is destroyed, its ID becomes invalid, and if it was assigned by
+the kernel, it will not be reused again. Note that a peer can never have
+multiple different handles to the same node, rather the kernel always coalesces
+them into a single handle, using the user-reference counter to track it.
+However, if a handle is fully released, but the peer later acquires a handle to
+the same remote node again, its ID will be different, as IDs are never reused.
+ </para>
+ <para>
+New nodes are allocated on-demand by passing the desired ID to the kernel in any
+ioctl that accepts a handle ID. When allocating a new node, the node owner
+implicitly also gets a handle to that node. As long as the node is valid, the
+kernel will pin a single user-reference to the owner's handle. This guarantees
+that a node owner always retains access to their node, until they explicitly
+destroy it (which will make it possible for userspace to release the handle like
+any other). Once all the handles to a local node have been released, no more
+messages destined for the node will be received. Otherwise, a handle to a local
+node behaves just like any other handle, that is, user-references are acquired
+and released according to its use. However, whenever the overall sum of all
+user-references on all handles to a node drops to one (which implies that only
+the pinned reference of the owner is left), a release-notification is queued on
+the node owner. If the counter is incremented again, any such notification is
+dropped, if not already dequeued.
+ </para>
+ </refsect2>
+
+ <refsect2>
+ <title>Message Transactions</title>
+ <para>
+A message transaction atomically transfers a message to any number of
+destinations. Unless requested otherwise, the message transaction fully succeeds
+or fully fails.
+ </para>
+ <para>
+To receive message payloads, each peer has an associated shmem-backed
+<emphasis>pool</emphasis> which may be mapped read-only by the receiving peer.
+The kernel copies the message payload directly from the sending peer to each of
+the receivers' pools without an intermediary kernel buffer. The pool is divided
+into <emphasis>slices</emphasis> to hold each message. When a message is
+received, its <emphasis>offset</emphasis> into the pool in bytes is returned to
+userspace, and userspace has to explicitly release the slice once it has
+finished with it.
+ </para>
+ <para>
+The kernel amends all data messages with the <varname>uid</varname>,
+<varname>gid</varname>, <varname>pid</varname>, <varname>tid</varname>, and
+optionally the security context of the sending peer. The information is
+collected from the sending peer when the message is sent and translated into the
+namespaces of the receiving peer's file-descriptor.
+ </para>
+ </refsect2>
+
+ <refsect2>
+ <title>Seed Message</title>
+ <para>
+Every peer may pin a special <emphasis>seed</emphasis> message. Only the peer
+itself may set and retrieve the seed, and at most one seed message may be pinned
+at any given time. The seed typically describes the peer itself and pins any
+nodes and handles necessary to bootstrap the peer.
+ </para>
+ </refsect2>
+
+ <refsect2>
+ <title>Resource quotas</title>
+ <para>
+Each user has a fixed amount of available resources. The limits are static, but
+may be overridden by module parameters. Limits are placed on the amount of
+memory a user's pools may consume, the number of handles a user may hold, the
+number of inflight messages that may be destined for a user, and the number of
+file descriptors that may be inflight to a user. All inflight resources are
+accounted on the receiving peer.
+ </para>
+ <para>
+As resources are accounted on the receiver, a quota mechanism is in place in
+order to avoid intentional or unintentional resource exhaustion by a malicious
+or broken sending user. At the time of a message transaction, the sending user
+may consume in total (including what is consumed by previous transactions) half
+of the total resources of the receiving user that have not been consumed by
+another user. When a message is dequeued, its resource consumption is deducted
+from the sending user's quota.
+ </para>
+ <para>
+If a receiving peer does not dequeue any of its incoming messages, it would be
+possible for a user's quota to be fully consumed by one peer, making it
+impossible to communicate with other functioning peers owned by the same user. A
+second quota is therefore enforced per peer, ensuring that at the time of a
+message transaction the receiving peer may consume in total (including what
+is consumed by previous transactions) half of the total resources available to
+the sending user that have not been consumed by another peer.
+ </para>
+ </refsect2>
+
+ <refsect2>
+ <title>Global Ordering</title>
+ <para>
+Despite there being no global synchronization, all events on the bus, such as
+sending or receiving of messages, release of handles or destruction of nodes,
+behave as if they were globally ordered. That is, for any two events it is
+always possible to consider one to have happened before the other in such a way
+that it is consistent with all the effects observed on the bus.
+ </para>
+ <para>
+For instance, if two events occur on one peer (say the sending of a message,
+and the destruction of a node), and they are observed on another peer (by
+receiving the message and receiving a destruction notification for the node), we
+are guaranteed that the order the events occurred in and the order they were
+observed in are the same.
+ </para>
+ <para>
+One could consider a further example involving three peers: if a message is sent
+from one peer to two others, and after receiving the message the first recipient
+sends a further message to the second recipient, it is guaranteed that the
+original message is received before the subsequent one.
+ </para>
+ <para>
+This principle of causality is also respected in the presence of side-channel
+communication. That is, if one event may have triggered another, even if on
+different, disconnected, peers, we are guaranteed that the events are ordered
+accordingly. To be precise, if one event (such as receiving a message) completed
+before another (such as sending a message) was started, then they are ordered
+accordingly.
+ </para>
+ <para>
+Also in the case where there can be no causal relationship, we are guaranteed a
+global order. In case two events happen concurrently, there can never be any
+inconsistency in which occurred before the other. By way of example, if
+two peers send one message each to two different peers, we are guaranteed
+that both the recipient peers receive the two messages in the same order, even
+though the order may be arbitrary.
+ </para>
+ </refsect2>
+
+ <refsect2>
+ <title>Operating on a bus1 file descriptor</title>
+ <para>
+The bus1 peer file descriptor supports the following operations:
+ </para>
+ <variablelist>
+ <varlistentry> <!-- FOPS OPEN -->
+ <term>
+ <citerefentry>
+ <refentrytitle>open</refentrytitle>
+ <manvolnum>2</manvolnum>
+ </citerefentry>
+ </term>
+ <listitem>
+ <para>
+A call to
+<citerefentry>
+ <refentrytitle>open</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>
+on the bus1 character device (usually <filename>/dev/bus1</filename>) creates a
+new peer context identified by the returned file descriptor.
+ </para>
+ </listitem>
+ </varlistentry> <!-- FOPS OPEN -->
+
+ <varlistentry> <!-- FOPS POLL -->
+ <term>
+ <citerefentry>
+ <refentrytitle>poll</refentrytitle>
+ <manvolnum>2</manvolnum>
+ </citerefentry>
+ </term>
+ <term>
+ <citerefentry>
+ <refentrytitle>select</refentrytitle>
+ <manvolnum>2</manvolnum>
+ </citerefentry>
+ </term>
+ <term>(and similar)</term>
+ <listitem>
+ <para>
+The file descriptor supports
+<citerefentry>
+ <refentrytitle>poll</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>
+(and analogously
+<citerefentry>
+ <refentrytitle>epoll</refentrytitle><manvolnum>7</manvolnum>
+</citerefentry>) and
+<citerefentry>
+ <refentrytitle>select</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>, as follows:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+The file descriptor is readable (the <varname>readfds</varname> argument of
+<citerefentry>
+ <refentrytitle>select</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>;
+the <constant>POLLIN</constant> flag of
+<citerefentry>
+ <refentrytitle>poll</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>)
+if one or more messages are ready to be dequeued.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+The file descriptor is writable (the <varname>writefds</varname> argument of
+<citerefentry>
+ <refentrytitle>select</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>;
+the <constant>POLLOUT</constant> flag of
+<citerefentry>
+ <refentrytitle>poll</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>)
+if the peer has not been shut down yet (i.e., the peer can be used to send
+messages).
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+The file descriptor signals a hang-up (overloaded on the
+<varname>readfds</varname> argument of
+<citerefentry>
+ <refentrytitle>select</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>;
+the <constant>POLLHUP</constant> flag of
+<citerefentry>
+ <refentrytitle>poll</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>)
+if the peer has been shut down.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+The bus1 peer file descriptor also supports the other file descriptor
+multiplexing APIs:
+<citerefentry>
+ <refentrytitle>pselect</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>, and
+<citerefentry>
+ <refentrytitle>ppoll</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>.
+ </para>
+ </listitem>
+ </varlistentry> <!-- FOPS POLL -->
+
+ <varlistentry> <!-- FOPS MMAP -->
+ <term>
+ <citerefentry>
+ <refentrytitle>mmap</refentrytitle>
+ <manvolnum>2</manvolnum>
+ </citerefentry>
+ </term>
+ <listitem>
+ <para>
+A call to
+<citerefentry>
+ <refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>
+installs a memory mapping to the message pool of the peer into the caller's
+address-space. No writable mappings are allowed. Furthermore, the pool has no
+fixed size, but grows dynamically with the demands of the peer.
+ </para>
+ </listitem>
+ </varlistentry> <!-- FOPS MMAP -->
+
+ <varlistentry> <!-- FOPS IOCTL -->
+ <term>
+ <citerefentry>
+ <refentrytitle>ioctl</refentrytitle>
+ <manvolnum>2</manvolnum>
+ </citerefentry>
+ </term>
+ <listitem>
+ <para>
+The following bus1-specific commands are supported:
+ </para>
+ <variablelist>
+ <varlistentry>
+ <term><constant>BUS1_CMD_PEER_DISCONNECT</constant></term>
+ <listitem>
+ <para>
+This command disconnects a peer; it does not take an argument. All slices,
+handles, nodes and queued messages are released and destroyed and all future
+operations on the peer will fail with <constant>-ESHUTDOWN</constant>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_PEER_QUERY</constant></term>
+ <listitem>
+ <para>
+This command queries the state of a peer context. It takes the following
+structure as argument:
+<programlisting>
+struct bus1_cmd_peer_reset {
+ __u64 flags;
+ __u64 peer_flags;
+ __u64 max_slices;
+ __u64 max_handles;
+ __u64 max_inflight_bytes;
+ __u64 max_inflight_fds;
+};
+</programlisting>
+<varname>flags</varname> must always be set to 0. The state as set via
+<constant>BUS1_CMD_PEER_RESET</constant>, or the default state if it was never
+reset, is returned.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_PEER_RESET</constant></term>
+ <listitem>
+ <para>
+This command resets a peer context. It takes the following structure as
+argument:
+<programlisting>
+struct bus1_cmd_peer_reset {
+ __u64 flags;
+ __u64 peer_flags;
+ __u64 max_slices;
+ __u64 max_handles;
+ __u64 max_inflight_bytes;
+ __u64 max_inflight_fds;
+};
+</programlisting>
+If <varname>peer_flags</varname> has
+<constant>BUS1_PEER_FLAG_WANT_SECCTX</constant> set, the security context of the
+sending task is attached to each message received by this peer.
+<varname>max_slices</varname>, <varname>max_handles</varname>,
+<varname>max_inflight_bytes</varname>, and <varname>max_inflight_fds</varname>
+are the resource limits for this peer. Note that these are simply maximum
+values; the resource usage is also limited per user.
+ </para>
+ <para>
+If <varname>flags</varname> has
+<constant>BUS1_CMD_PEER_RESET_FLAG_FLUSH_SEED</constant> set, the seed message
+is dropped, and if <constant>BUS1_CMD_PEER_RESET_FLAG_FLUSH</constant> is set,
+all slices and handles are released, all messages are dropped from the queue and
+all nodes that are not pinned by the seed message are destroyed.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_HANDLE_TRANSFER</constant></term>
+ <listitem>
+ <para>
+This command transfers a handle from one peer context to another. It takes the
+following structure as argument:
+<programlisting>
+struct bus1_cmd_handle_transfer {
+ __u64 flags;
+ __u64 src_handle;
+ __u64 dst_fd;
+ __u64 dst_handle;
+};
+</programlisting>
+<varname>flags</varname> must always be set to 0, <varname>src_handle</varname>
+is the handle ID of the handle being transferred in the source context,
+<varname>dst_fd</varname> is the file descriptor representing the destination
+peer context and <varname>dst_handle</varname> must be
+<constant>BUS1_HANDLE_INVALID</constant> and is set to the new handle ID in the
+destination context on return.
+ </para>
+ <para>
+If <varname>dst_fd</varname> is set to <constant>-1</constant> the source
+context is also used as the destination.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_HANDLE_RELEASE</constant></term>
+ <listitem>
+ <para>
+This command releases one user reference to a handle. It takes a handle ID as
+argument.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_NODE_DESTROY</constant></term>
+ <listitem>
+ <para>
+This command destroys a set of nodes. It takes the following structure as
+argument:
+<programlisting>
+struct bus1_cmd_node_destroy {
+ __u64 flags;
+ __u64 ptr_nodes;
+ __u64 n_nodes;
+};
+</programlisting>
+<varname>flags</varname> must always be set to 0, <varname>ptr_nodes</varname>
+must be a pointer to an array of handle IDs of owner handles of local nodes, and
+<varname>n_nodes</varname> must be the size of the array.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_SLICE_RELEASE</constant></term>
+ <listitem>
+ <para>
+This command releases one slice from the local pool. It takes a pool offset to
+the start of the slice to be released.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_SEND</constant></term>
+ <listitem>
+ <para>
+This command sends a message. It takes the following structure as argument:
+<programlisting>
+struct bus1_cmd_send {
+ __u64 flags;
+ __u64 ptr_destinations;
+ __u64 ptr_errors;
+ __u64 n_destinations;
+ __u64 ptr_vecs;
+ __u64 n_vecs;
+ __u64 ptr_handles;
+ __u64 n_handles;
+ __u64 ptr_fds;
+ __u64 n_fds;
+};
+</programlisting>
+ </para>
+ <para>
+<varname>flags</varname> may be set to at most one of
+<constant>BUS1_SEND_FLAG_CONTINUE</constant> and
+<constant>BUS1_SEND_FLAG_SEED</constant>. If
+<constant>BUS1_SEND_FLAG_CONTINUE</constant> is set, any messages that cannot
+be delivered due to errors on the remote peer do not make the whole transaction
+fail, but merely set the corresponding entry in the error code array.
+If <constant>BUS1_SEND_FLAG_SEED</constant> is set, the message
+replaces the seed message on the local peer. In this case,
+<varname>n_destinations</varname> must be 0.
+ </para>
+ <para>
+<varname>ptr_destinations</varname> is a pointer to an array of handle IDs,
+<varname>ptr_errors</varname> is a pointer to an array of corresponding
+errno codes, and <varname>n_destinations</varname> is the length of the arrays.
+The message being sent is delivered to the peer context owning the nodes pointed
+to by each of the handles in the array.
+ </para>
+ <para>
+<varname>ptr_vecs</varname> is a pointer to an array of iovecs and
+<varname>n_vecs</varname> is the length of the array. The iovecs represent the
+payload of the message which is delivered to each destination.
+ </para>
+ <para>
+<varname>ptr_handles</varname> is a pointer to an array of handle IDs and
+<varname>n_handles</varname> is the length of the array. Each of the handles in
+this array is installed in each destination peer context at receive time. If the
+underlying node has been destroyed at the time the message is delivered (the
+message would be ordered after the node's destruction notification) then
+<constant>BUS1_HANDLE_INVALID</constant> will be delivered instead.
+ </para>
+ <para>
+<varname>ptr_fds</varname> is a pointer to an integer array of file descriptors
+and <varname>n_fds</varname> is the length of the array. Each of the file
+descriptors in this array may be installed in the destination peer context at
+receive time (see below).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>BUS1_CMD_RECV</constant></term>
+ <listitem>
+ <para>
+This command receives a message. It takes the following structure as argument:
+<programlisting>
+struct bus1_cmd_recv {
+ __u64 flags;
+ __u64 max_offset;
+ struct {
+ __u64 type;
+ __u64 flags;
+ __u64 destination;
+ __u32 uid;
+ __u32 gid;
+ __u32 pid;
+ __u32 tid;
+ __u64 offset;
+ __u64 n_bytes;
+ __u64 n_handles;
+ __u64 n_fds;
+ __u64 n_secctx;
+ } msg;
+};
+</programlisting>
+If <constant>BUS1_RECV_FLAG_PEEK</constant> is set in <varname>flags</varname>,
+the received message is not dropped from the queue. If
+<constant>BUS1_RECV_FLAG_SEED</constant> is set, the peer's seed is received
+rather than a message from the queue. If
+<constant>BUS1_RECV_FLAG_INSTALL_FDS</constant> is set, the file descriptors attached to
+the received message are installed in the receiving process. Care must be taken
+when using this flag from more than one process on the same message as file
+descriptor numbers are per-process and not per-peer.
+ </para>
+ <para>
+<varname>max_offset</varname> indicates the maximum offset into the pool the
+receiving peer is able to read. If a message slice would exceed this offset,
+the call fails with <constant>-ERANGE</constant>.
+ </para>
+ <para>
+<varname>msg.type</varname> indicates the type of message.
+<constant>BUS1_MSG_NONE</constant> is never returned.
+<constant>BUS1_MSG_DATA</constant> indicates a regular message sent from another
+peer, possibly containing a payload, as well as attached handles and
+file descriptors. <constant>BUS1_MSG_NODE_DESTROY</constant> indicates that the
+node referenced by the handle in <varname>msg.destination</varname> was
+destroyed by its owner. <constant>BUS1_MSG_NODE_RELEASE</constant> indicates
+that all the references to handles referencing the node in
+<varname>msg.destination</varname> have been released.
+ </para>
+ <para>
+<varname>msg.flags</varname> indicates additional flags of the message.
+<constant>BUS1_MSG_FLAG_HAS_SECCTX</constant> indicates that a security context
+was attached to the message (to distinguish an empty <varname>n_secctx</varname>
+from an invalid one).
+<constant>BUS1_MSG_FLAG_CONTINUE</constant> indicates that there are more
+messages queued which belong to the same message transaction.
+ </para>
+ <para>
+<varname>msg.destination</varname> is the ID of the destination node or handle
+of the message.
+ </para>
+ <para>
+<varname>msg.uid</varname>, <varname>msg.gid</varname>,
+<varname>msg.pid</varname>, and <varname>msg.tid</varname> are the user, group,
+process and thread ID of the process that created the sending peer context.
+ </para>
+ <para>
+<varname>msg.offset</varname> is the offset, in bytes, into the pool of the
+payload and <varname>msg.n_bytes</varname> is its length.
+ </para>
+ <para>
+<varname>msg.n_handles</varname> is the number of handles attached to the
+message. The handle IDs are stored in the pool following the payload (and
+possibly padding to make the array 8-byte aligned).
+ </para>
+ <para>
+<varname>msg.n_fds</varname> is the number of file descriptors attached to the
+message, or 0 if <constant>BUS1_RECV_FLAG_INSTALL_FDS</constant> was not set.
+The file descriptor numbers are stored in the pool following the handle array
+(and possibly padding to make the array 8-byte aligned).
+ </para>
+ <para>
+<varname>msg.n_secctx</varname> is the number of bytes attached to the message
+that contain the security context of the sender. The security context is
+stored in the pool following the payload (and possibly padding to make it
+8-byte aligned).
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry> <!-- FOPS IOCTL -->
+
+ <varlistentry> <!-- FOPS CLOSE -->
+ <term>
+ <citerefentry>
+ <refentrytitle>close</refentrytitle>
+ <manvolnum>2</manvolnum>
+ </citerefentry>
+ </term>
+ <listitem>
+ <para>
+A call to
+<citerefentry>
+ <refentrytitle>close</refentrytitle><manvolnum>2</manvolnum>
+</citerefentry>
+releases the passed file descriptor. When all file descriptors associated with
+the same peer context have been closed, the peer is shut down. This destroys all
+nodes of that peer, releases all handles, flushes its queue and pool, and
+deallocates all related resources. Messages that have been sent by the peer and
+are still queued on destination queues are unaffected by this.
+ </para>
+ </listitem>
+ </varlistentry> <!-- FOPS CLOSE -->
+ </variablelist>
+ </refsect2>
+ </refsect1> <!-- DESCRIPTION -->
+
+ <refsect1> <!-- RETURN VALUE -->
+ <title>Return value</title>
+ <para>
+All bus1 operations return zero on success. On failure, a negative error code is
+returned.
+ </para>
+ </refsect1> <!-- RETURN VALUE -->
+
+ <refsect1> <!-- ERRORS -->
+ <title>Errors</title>
+ <para>
+These are all standard errors generated by the bus layer. See the description
+of each ioctl for details on their occurrence.
+ </para>
+ <variablelist>
+ <varlistentry>
+ <term><constant>EAGAIN</constant></term>
+ <listitem><para>
+No messages ready to be read.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EBADF</constant></term>
+ <listitem><para>
+Invalid file descriptor.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EDQUOT</constant></term>
+ <listitem><para>
+Resource quota exceeded.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EFAULT</constant></term>
+ <listitem><para>
+Cannot read, or write, ioctl parameters.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EHOSTUNREACH</constant></term>
+ <listitem><para>
+The destination object is no longer available.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EINVAL</constant></term>
+ <listitem><para>
+Invalid ioctl parameters.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EMSGSIZE</constant></term>
+ <listitem><para>
+The message to be sent exceeds its allowed resource limits.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>ENOMEM</constant></term>
+ <listitem><para>
+Out of kernel memory.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>ENOTTY</constant></term>
+ <listitem><para>
+Unknown ioctl.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>ENXIO</constant></term>
+ <listitem><para>
+Unknown object.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EOPNOTSUPP</constant></term>
+ <listitem><para>
+Operation not supported.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>EPERM</constant></term>
+ <listitem><para>
+Permission denied.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>ERANGE</constant></term>
+ <listitem><para>
+The message to be received would exceed the maximal offset.
+ </para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><constant>ESHUTDOWN</constant></term>
+ <listitem><para>
+Local peer was already disconnected.
+ </para></listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1> <!-- ERRORS -->
+
+ <refsect1> <!-- SEE ALSO -->
+ <title>See Also</title>
+ <simplelist type="inline">
+ <member>
+ <citerefentry>
+ <refentrytitle>bus1.pool</refentrytitle>
+ <manvolnum>7</manvolnum>
+ </citerefentry>
+ </member>
+ </simplelist>
+ </refsect1> <!-- SEE ALSO -->
+
+</refentry>
diff --git a/Documentation/bus1/stylesheet.xsl b/Documentation/bus1/stylesheet.xsl
new file mode 100644
index 0000000..52565ea
--- /dev/null
+++ b/Documentation/bus1/stylesheet.xsl
@@ -0,0 +1,16 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<stylesheet xmlns="http://www.w3.org/1999/XSL/Transform" version="1.0">
+ <param name="chunk.quietly">1</param>
+ <param name="funcsynopsis.style">ansi</param>
+ <param name="funcsynopsis.tabular.threshold">80</param>
+ <param name="callout.graphics">0</param>
+ <param name="paper.type">A4</param>
+ <param name="generate.section.toc.level">2</param>
+ <param name="use.id.as.filename">1</param>
+ <param name="citerefentry.link">1</param>
+ <strip-space elements="*"/>
+ <template name="generate.citerefentry.link">
+ <value-of select="refentrytitle"/>
+ <text>.html</text>
+ </template>
+</stylesheet>
--
2.10.1

2016-10-26 19:21:59

by David Herrmann

Subject: [RFC v1 07/14] bus1: tracking user contexts

From: Tom Gundersen <[email protected]>

Different users can communicate via bus1, and many resources are shared
between multiple users. The bus1_user object represents the UID of a
user, like "struct user_struct" does in the kernel core. It is used to
account global resources, apply limits, and calculate quotas if
different UIDs communicate with each other.

All dynamic resources have global per-user limits, which cannot be
exceeded by a user. They prevent a single user from exhausting local
resources. Each peer that is created is always owned by the user that
initialized it. All resources allocated on that peer are accounted on
that pinned user. In addition to the global limits, there are local
limits per peer, which can be controlled by each peer individually
(e.g., specifying a maximum pool size). Those local limits allow a user
to distribute the globally available resources across its peer
instances.

Since bus1 allows communication across UID boundaries, any such
transmission of resources must be properly accounted. Bus1 employs
dynamic quotas to fairly distribute available resources. Those quotas
make sure that available resources of a peer cannot be exhausted by
remote UIDs, but are fairly divided among all communicating peers.

This patch only implements the user tracking; the resource limits will
be added in follow-up patches.
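
As a rough illustration of the intended use by later patches in this series
(a sketch only, not part of this patch), a peer pins the user object of the
task that created it and drops the reference again on teardown:

  struct bus1_user *user;

  user = bus1_user_ref_by_uid(current_euid());   /* pin the creator's UID */
  if (IS_ERR(user))
          return PTR_ERR(user);

  /* ... account peer resources against 'user' ... */

  bus1_user_unref(user);                          /* drop on teardown */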

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 1 +
ipc/bus1/main.c | 3 ++
ipc/bus1/user.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/user.h | 67 ++++++++++++++++++++++++
4 files changed, 224 insertions(+)
create mode 100644 ipc/bus1/user.c
create mode 100644 ipc/bus1/user.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index 3c90657..94d79e0 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,5 +1,6 @@
bus1-y := \
main.o \
+ user.o \
util/active.o \
util/flist.o \
util/pool.o \
diff --git a/ipc/bus1/main.c b/ipc/bus1/main.c
index 02412a7..526347d 100644
--- a/ipc/bus1/main.c
+++ b/ipc/bus1/main.c
@@ -16,6 +16,7 @@
#include <linux/module.h>
#include "main.h"
#include "tests.h"
+#include "user.h"

static int bus1_fop_open(struct inode *inode, struct file *file)
{
@@ -64,6 +65,7 @@ static int __init bus1_modinit(void)

error:
debugfs_remove(bus1_debugdir);
+ bus1_user_modexit();
return r;
}

@@ -71,6 +73,7 @@ static void __exit bus1_modexit(void)
{
misc_deregister(&bus1_misc);
debugfs_remove(bus1_debugdir);
+ bus1_user_modexit();
pr_info("unloaded\n");
}

diff --git a/ipc/bus1/user.c b/ipc/bus1/user.c
new file mode 100644
index 0000000..0498ab4
--- /dev/null
+++ b/ipc/bus1/user.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/err.h>
+#include <linux/idr.h>
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include <linux/moduleparam.h>
+#include <linux/mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uidgid.h>
+#include "user.h"
+
+static DEFINE_MUTEX(bus1_user_lock);
+static DEFINE_IDR(bus1_user_idr);
+
+/**
+ * bus1_user_modexit() - clean up global resources of user accounting
+ *
+ * This function cleans up any remaining global resources that were allocated
+ * by the user accounting helpers. The caller must make sure that no user
+ * object is referenced anymore, before calling this. This function just clears
+ * caches and verifies nothing is leaked.
+ *
+ * This is meant to be called on module-exit.
+ */
+void bus1_user_modexit(void)
+{
+ WARN_ON(!idr_is_empty(&bus1_user_idr));
+ idr_destroy(&bus1_user_idr);
+ idr_init(&bus1_user_idr);
+}
+
+static struct bus1_user *bus1_user_new(void)
+{
+ struct bus1_user *user;
+
+ user = kmalloc(sizeof(*user), GFP_KERNEL);
+ if (!user)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&user->ref);
+ user->uid = INVALID_UID;
+ mutex_init(&user->lock);
+
+ return user;
+}
+
+static void bus1_user_free(struct kref *ref)
+{
+ struct bus1_user *user = container_of(ref, struct bus1_user, ref);
+
+ lockdep_assert_held(&bus1_user_lock);
+
+ if (likely(uid_valid(user->uid)))
+ idr_remove(&bus1_user_idr, __kuid_val(user->uid));
+ mutex_destroy(&user->lock);
+ kfree_rcu(user, rcu);
+}
+
+/**
+ * bus1_user_ref_by_uid() - get a user object for a uid
+ * @uid: uid of the user
+ *
+ * Find and return the user object for the uid if it exists, otherwise create
+ * it first.
+ *
+ * Return: A user object for the given uid, ERR_PTR on failure.
+ */
+struct bus1_user *bus1_user_ref_by_uid(kuid_t uid)
+{
+ struct bus1_user *user;
+ int r;
+
+ if (WARN_ON(!uid_valid(uid)))
+ return ERR_PTR(-ENOTRECOVERABLE);
+
+ /* fast-path: acquire reference via rcu */
+ rcu_read_lock();
+ user = idr_find(&bus1_user_idr, __kuid_val(uid));
+ if (user && !kref_get_unless_zero(&user->ref))
+ user = NULL;
+ rcu_read_unlock();
+ if (user)
+ return user;
+
+ /* slow-path: try again with IDR locked */
+ mutex_lock(&bus1_user_lock);
+ user = idr_find(&bus1_user_idr, __kuid_val(uid));
+ if (likely(!bus1_user_ref(user))) {
+ user = bus1_user_new();
+ if (!IS_ERR(user)) {
+ user->uid = uid;
+ r = idr_alloc(&bus1_user_idr, user, __kuid_val(uid),
+ __kuid_val(uid) + 1, GFP_KERNEL);
+ if (r < 0) {
+ user->uid = INVALID_UID; /* couldn't insert */
+ kref_put(&user->ref, bus1_user_free);
+ user = ERR_PTR(r);
+ }
+ }
+ }
+ mutex_unlock(&bus1_user_lock);
+
+ return user;
+}
+
+/**
+ * bus1_user_ref() - acquire reference
+ * @user: user to acquire, or NULL
+ *
+ * Acquire an additional reference to a user-object. The caller must already
+ * own a reference.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: @user is returned.
+ */
+struct bus1_user *bus1_user_ref(struct bus1_user *user)
+{
+ if (user)
+ kref_get(&user->ref);
+ return user;
+}
+
+/**
+ * bus1_user_unref() - release reference
+ * @user: user to release, or NULL
+ *
+ * Release a reference to a user-object.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+struct bus1_user *bus1_user_unref(struct bus1_user *user)
+{
+ if (user) {
+ if (kref_put_mutex(&user->ref, bus1_user_free, &bus1_user_lock))
+ mutex_unlock(&bus1_user_lock);
+ }
+
+ return NULL;
+}
diff --git a/ipc/bus1/user.h b/ipc/bus1/user.h
new file mode 100644
index 0000000..6cdc264
--- /dev/null
+++ b/ipc/bus1/user.h
@@ -0,0 +1,67 @@
+#ifndef __BUS1_USER_H
+#define __BUS1_USER_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Users
+ *
+ * Different users can communicate via bus1, and many resources are shared
+ * between multiple users. The bus1_user object represents the UID of a user,
+ * like "struct user_struct" does in the kernel core. It is used to account
+ * global resources, apply limits, and calculate quotas if different UIDs
+ * communicate with each other.
+ *
+ * All dynamic resources have global per-user limits, which cannot be exceeded
+ * by a user. They prevent a single user from exhausting local resources. Each
+ * peer that is created is always owned by the user that initialized it. All
+ * resources allocated on that peer are accounted on that pinned user.
+ * Additionally to global resources, there are local limits per peer, that can
+ * be controlled by each peer individually (e.g., specifying a maximum pool
+ * size). Those local limits allow a user to distribute the globally available
+ * resources across its peer instances.
+ *
+ * Since bus1 allows communication across UID boundaries, any such transmission
+ * of resources must be properly accounted. Bus1 employs dynamic quotas to
+ * fairly distribute available resources. Those quotas make sure that available
+ * resources of a peer cannot be exhausted by remote UIDs, but are fairly
+ * divided among all communicating peers.
+ */
+
+#include <linux/atomic.h>
+#include <linux/idr.h>
+#include <linux/kref.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+#include <linux/uidgid.h>
+
+/**
+ * struct bus1_user - resource accounting for users
+ * @ref: reference counter
+ * @uid: UID of the user
+ * @lock: object lock
+ * @rcu: rcu
+ */
+struct bus1_user {
+ struct kref ref;
+ kuid_t uid;
+ struct mutex lock;
+ struct rcu_head rcu;
+};
+
+/* module cleanup */
+void bus1_user_modexit(void);
+
+/* users */
+struct bus1_user *bus1_user_ref_by_uid(kuid_t uid);
+struct bus1_user *bus1_user_ref(struct bus1_user *user);
+struct bus1_user *bus1_user_unref(struct bus1_user *user);
+
+#endif /* __BUS1_USER_H */
--
2.10.1

2016-10-26 19:21:52

by David Herrmann

Subject: [RFC v1 04/14] bus1: util - fixed list utility library

From: Tom Gundersen <[email protected]>

This implements a fixed-size list called bus1_flist. The size of
the list must be constant over the lifetime of the list. The list
can hold one arbitrary pointer per node.

Fixed lists are a combination of a linked list and a static array.
That is, fixed lists behave like linked lists (no random access, but
arbitrary size), but compare in speed with arrays (consecutive
accesses are fast). Unlike fixed arrays, fixed lists can hold a huge
number of elements without requiring vmalloc, relying solely on
small-size kmalloc allocations.

Internally, fixed lists are a singly-linked list of static arrays.
This guarantees that iterations behave almost like on an array,
except when crossing a batch-border.

Fixed lists can replace fixed-size arrays whenever you need to support
a large number of elements, but don't need random access. Fixed lists
have ALMOST the same memory requirements as fixed-size arrays, except
one pointer of state per 'BUS1_FLIST_BATCH' elements. If only a small
size (i.e., it only requires one batch) is stored in a fixed list,
then its memory requirements and iteration time are equivalent to
fixed-size arrays.

Fixed lists will be required by the upcoming bus1 message-transactions.
They must support large auxiliary data transfers, in case users want to
send their entire handle state via the bus.
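
As an illustration of the layout described above (a sketch only, not part of
this patch; 'n' and 'obj' are placeholders), allocating an flist and walking
it batch by batch looks roughly like this:

  struct bus1_flist *list, *batch, *e;
  size_t pos, i;

  list = bus1_flist_new(n, GFP_KERNEL);
  if (!list)
          return -ENOMEM;

  for (batch = list, pos = 0; pos < n; pos += i) {
          i = min_t(size_t, n - pos, BUS1_FLIST_BATCH);
          for (e = batch; e < batch + i; ++e)
                  e->ptr = obj;             /* one arbitrary pointer per node */
          if (pos + i < n)                  /* full batches link at [BATCH] */
                  batch = batch[BUS1_FLIST_BATCH].next;
  }

  list = bus1_flist_free(list, n);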

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 3 +-
ipc/bus1/util/flist.c | 116 +++++++++++++++++++++++++++++
ipc/bus1/util/flist.h | 202 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 320 insertions(+), 1 deletion(-)
create mode 100644 ipc/bus1/util/flist.c
create mode 100644 ipc/bus1/util/flist.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index 9e491691..6db6d13 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,6 +1,7 @@
bus1-y := \
main.o \
- util/active.o
+ util/active.o \
+ util/flist.o

obj-$(CONFIG_BUS1) += bus1.o

diff --git a/ipc/bus1/util/flist.c b/ipc/bus1/util/flist.c
new file mode 100644
index 0000000..b8b0d4e
--- /dev/null
+++ b/ipc/bus1/util/flist.c
@@ -0,0 +1,116 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include "flist.h"
+
+/**
+ * bus1_flist_populate() - populate an flist
+ * @list: flist to operate on
+ * @n: number of elements
+ * @gfp: GFP to use for allocations
+ *
+ * Populate an flist. This pre-allocates the backing memory for an flist that
+ * was statically initialized via bus1_flist_init(). This is NOT needed if the
+ * list was allocated via bus1_flist_new().
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_flist_populate(struct bus1_flist *list, size_t n, gfp_t gfp)
+{
+ if (gfp & __GFP_ZERO)
+ memset(list, 0, bus1_flist_inline_size(n));
+
+ if (unlikely(n > BUS1_FLIST_BATCH)) {
+ /* Never populate twice! */
+ WARN_ON(list[BUS1_FLIST_BATCH].next);
+
+ n -= BUS1_FLIST_BATCH;
+ list[BUS1_FLIST_BATCH].next = bus1_flist_new(n, gfp);
+ if (!list[BUS1_FLIST_BATCH].next)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+/**
+ * bus1_flist_new() - allocate new flist
+ * @n: number of elements
+ * @gfp: GFP to use for allocations
+ *
+ * This allocates a new flist ready to store @n elements.
+ *
+ * Return: Pointer to flist, NULL if out-of-memory.
+ */
+struct bus1_flist *bus1_flist_new(size_t n, gfp_t gfp)
+{
+ struct bus1_flist list, *e, *slot;
+ size_t remaining;
+
+ list.next = NULL;
+ slot = &list;
+ remaining = n;
+
+ while (remaining >= BUS1_FLIST_BATCH) {
+ e = kmalloc_array(BUS1_FLIST_BATCH + 1, sizeof(*e), gfp);
+ if (!e)
+ return bus1_flist_free(list.next, n);
+
+ slot->next = e;
+ slot = &e[BUS1_FLIST_BATCH];
+ slot->next = NULL;
+
+ remaining -= BUS1_FLIST_BATCH;
+ }
+
+ if (remaining > 0) {
+ slot->next = kmalloc_array(remaining, sizeof(*e), gfp);
+ if (!slot->next)
+ return bus1_flist_free(list.next, n);
+ }
+
+ return list.next;
+}
+
+/**
+ * bus1_flist_free() - free flist
+ * @list: flist to operate on, or NULL
+ * @n: number of elements
+ *
+ * This deallocates an flist previously created via bus1_flist_new().
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+struct bus1_flist *bus1_flist_free(struct bus1_flist *list, size_t n)
+{
+ struct bus1_flist *e;
+
+ if (list) {
+ /*
+ * If @list was only partially allocated, then "next" pointers
+ * might be NULL. So check @list on each iteration.
+ */
+ while (list && n >= BUS1_FLIST_BATCH) {
+ e = list;
+ list = list[BUS1_FLIST_BATCH].next;
+ kfree(e);
+ n -= BUS1_FLIST_BATCH;
+ }
+
+ kfree(list);
+ }
+
+ return NULL;
+}
diff --git a/ipc/bus1/util/flist.h b/ipc/bus1/util/flist.h
new file mode 100644
index 0000000..e265d5c
--- /dev/null
+++ b/ipc/bus1/util/flist.h
@@ -0,0 +1,202 @@
+#ifndef __BUS1_FLIST_H
+#define __BUS1_FLIST_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Fixed Lists
+ *
+ * This implements a fixed-size list called bus1_flist. The size of the list
+ * must be constant over the lifetime of the list. The list can hold one
+ * arbitrary pointer per node.
+ *
+ * Fixed lists are a combination of a linked list and a static array. That is,
+ * fixed lists behave like linked lists (no random access, but arbitrary size),
+ * but are comparable in speed to arrays (consecutive accesses are fast). Unlike
+ * fixed arrays, fixed lists can hold a huge number of elements without
+ * requiring vmalloc(), relying solely on small-size kmalloc() allocations.
+ *
+ * Internally, fixed lists are a singly-linked list of static arrays. This
+ * guarantees that iterations behave almost like on an array, except when
+ * crossing a batch-border.
+ *
+ * Fixed lists can replace fixed-size arrays whenever you need to support a
+ * large number of elements, but don't need random access. Fixed lists have
+ * ALMOST the same memory requirements as fixed-size arrays, except for one
+ * pointer of state per 'BUS1_FLIST_BATCH' elements. If only a small number of
+ * elements is stored (i.e., only a single batch is required), the memory
+ * requirements and iteration time are equivalent to those of a fixed-size
+ * array.
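+ *
+ * As an illustrative sketch (the surrounding 'struct foo' and the error
+ * handling are placeholders, not part of this API), an flist embedded in a
+ * parent object is typically set up and torn down like this:
+ *
+ *     struct foo {
+ *         size_t n;
+ *         struct bus1_flist entries[];
+ *     };
+ *
+ *     f = kmalloc(sizeof(*f) + bus1_flist_inline_size(n), GFP_KERNEL);
+ *     if (!f)
+ *         return -ENOMEM;
+ *     f->n = n;
+ *     bus1_flist_init(f->entries, f->n);
+ *     r = bus1_flist_populate(f->entries, f->n, GFP_KERNEL);
+ *     if (r < 0) {
+ *         kfree(f);
+ *         return r;
+ *     }
+ *
+ *     ... iterate via bus1_flist_next() or bus1_flist_walk() ...
+ *
+ *     bus1_flist_deinit(f->entries, f->n);
+ *     kfree(f);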
+ */
+
+#include <linux/kernel.h>
+
+#define BUS1_FLIST_BATCH (1024)
+
+/**
+ * struct bus1_flist - fixed list
+ * @next: pointer to next batch
+ * @ptr: stored entry
+ */
+struct bus1_flist {
+ union {
+ struct bus1_flist *next;
+ void *ptr;
+ };
+};
+
+int bus1_flist_populate(struct bus1_flist *list, size_t n, gfp_t gfp);
+struct bus1_flist *bus1_flist_new(size_t n, gfp_t gfp);
+struct bus1_flist *bus1_flist_free(struct bus1_flist *list, size_t n);
+
+/**
+ * bus1_flist_inline_size() - calculate required inline size
+ * @n: number of entries
+ *
+ * When allocating storage for an flist, this calculates the size of the
+ * initial array in bytes. Use bus1_flist_new() directly if you want to
+ * allocate an flist on the heap. This helper is only needed if you embed an
+ * flist into another struct like this:
+ *
+ * struct foo {
+ * ...
+ * struct bus1_flist list[];
+ * };
+ *
+ * In that case the flist must be the last element, and the size in bytes
+ * required by it is returned by this function.
+ *
+ * The inline-size of an flist is always bound to a fixed maximum. That is,
+ * regardless of @n, this will always return a reasonable number that can be
+ * allocated via kmalloc().
+ *
+ * Return: Size in bytes required for the initial batch of an flist.
+ */
+static inline size_t bus1_flist_inline_size(size_t n)
+{
+ return sizeof(struct bus1_flist) *
+ ((likely(n < BUS1_FLIST_BATCH)) ? n : (BUS1_FLIST_BATCH + 1));
+}
+
+/**
+ * bus1_flist_init() - initialize an flist
+ * @list: flist to initialize
+ * @n: number of entries
+ *
+ * This initializes an flist of size @n. It does NOT preallocate the memory,
+ * but only initializes @list in a way that bus1_flist_deinit() can be called
+ * on it. Use bus1_flist_populate() to populate the flist.
+ *
+ * This is only needed if your backing memory of @list is shared with another
+ * object. If possible, use bus1_flist_new() to allocate an flist on the heap
+ * and avoid this dance.
+ */
+static inline void bus1_flist_init(struct bus1_flist *list, size_t n)
+{
+ BUILD_BUG_ON(sizeof(struct bus1_flist) != sizeof(void *));
+
+ if (unlikely(n >= BUS1_FLIST_BATCH))
+ list[BUS1_FLIST_BATCH].next = NULL;
+}
+
+/**
+ * bus1_flist_deinit() - deinitialize an flist
+ * @list: flist to deinitialize
+ * @n: number of entries
+ *
+ * This deallocates an flist and releases all resources. If already
+ * deinitialized, this is a no-op. This is only needed if you called
+ * bus1_flist_populate().
+ */
+static inline void bus1_flist_deinit(struct bus1_flist *list, size_t n)
+{
+ if (unlikely(n >= BUS1_FLIST_BATCH)) {
+ bus1_flist_free(list[BUS1_FLIST_BATCH].next,
+ n - BUS1_FLIST_BATCH);
+ list[BUS1_FLIST_BATCH].next = NULL;
+ }
+}
+
+/**
+ * bus1_flist_next() - flist iterator
+ * @iter: iterator
+ * @pos: current position
+ *
+ * This advances an flist iterator by one position. @iter must point to the
+ * current position, and the new position is returned by this function. @pos
+ * must point to a variable that contains the current index position. That is,
+ * @pos must be initialized to 0 and @iter to the flist head.
+ *
+ * Neither @pos nor @iter may be modified by anyone but this helper. In the
+ * loop body you can use @iter->ptr to access the current element.
+ *
+ * This iterator is normally used like this:
+ *
+ * size_t pos, n = 128;
+ * struct bus1_flist *e, *list = bus1_flist_new(n, GFP_KERNEL);
+ *
+ * ...
+ *
+ * for (pos = 0, e = list; pos < n; e = bus1_flist_next(e, &pos)) {
+ * ... access e->ptr ...
+ * }
+ *
+ * Return: Next iterator position.
+ */
+static inline struct bus1_flist *bus1_flist_next(struct bus1_flist *iter,
+ size_t *pos)
+{
+ return (++*pos % BUS1_FLIST_BATCH) ? (iter + 1) : (iter + 1)->next;
+}
+
+/**
+ * bus1_flist_walk() - walk flist in batches
+ * @list: list to walk
+ * @n: number of entries
+ * @iter: iterator
+ * @pos: current position
+ *
+ * This walks an flist in batches of size up to BUS1_FLIST_BATCH. It is
+ * normally used like this:
+ *
+ * size_t pos, z, n = 65536;
+ * struct bus1_flist *e, *list = bus1_flist_new(n, GFP_KERNEL);
+ *
+ * ...
+ *
+ * pos = 0;
+ * while ((z = bus1_flist_walk(list, n, &e, &pos)) > 0) {
+ * ... access e[0...z]->ptr
+ * ... invariant: z <= BUS1_FLIST_BATCH
+ * ... invariant: e[i]->ptr == (&e->ptr)[i]
+ * }
+ *
+ * Return: Size of batch at @iter.
+ */
+static inline size_t bus1_flist_walk(struct bus1_flist *list,
+ size_t n,
+ struct bus1_flist **iter,
+ size_t *pos)
+{
+ if (*pos < n) {
+ n = n - *pos;
+ if (unlikely(n > BUS1_FLIST_BATCH))
+ n = BUS1_FLIST_BATCH;
+ if (likely(*pos == 0))
+ *iter = list;
+ else
+ *iter = (*iter)[BUS1_FLIST_BATCH].next;
+ *pos += n;
+ } else {
+ n = 0;
+ }
+ return n;
+}
+
+#endif /* __BUS1_FLIST_H */
--
2.10.1

2016-10-26 19:22:04

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 06/14] bus1: util - queue utility library

From: Tom Gundersen <[email protected]>

(Please refer to 'Lamport Timestamps', the concept of
'happened-before', and 'causal ordering'. The queue implementation
has its roots in Lamport Timestamps, treating a set of local CPUs
as a distributed system to avoid any global synchronization.)

A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
the time they were sent. Moreover, atomic delivery of messages to
multiple queues is supported, without any global synchronization, i.e.,
the order of message delivery is consistent across queues.

Messages can be destined for multiple queues; hence, we need to be
careful that all queues get a consistent order of incoming messages. We
define the concept of `global order' to provide a basic set of
guarantees. This global order is a partial order on the set of all
messages. The order is defined as:

1) If a message B was queued *after* a message A, then: A < B

2) If a message B was queued *after* a message A was dequeued,
then: A < B

3) If a message B was dequeued *after* a message A on the same queue,
then: A < B

(Note: Causality is honored. `after' and `before' do not refer to
the same task, nor the same queue, but rather any kind of
synchronization between the two operations.)

The queue object implements this global order in a lockless fashion. It
solely relies on a distributed clock on each queue. Each message to be
sent causes a clock tick on the local clock and on all destination
clocks. Furthermore, all clocks are synchronized, meaning they're
fast-forwarded in case they're behind the highest of all participating
peers. No global state tracking is involved.
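
To make the clock rules concrete, here is a minimal sketch using the
helpers introduced below (bus1_queue_tick() and bus1_queue_sync());
'local' and 'remote_ts' are illustrative placeholders:

    u64 ts;

    /* allocate a fresh, even commit timestamp on the local clock (+2) */
    ts = bus1_queue_tick(&local->queue);

    /* fast-forward the local clock if an (even) remote timestamp is ahead */
    ts = bus1_queue_sync(&local->queue, remote_ts);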

During a message transaction, we first queue a message as 'staging'
entry in each destination with a preliminary timestamp. This timestamp
is explicitly odd numbered. Any odd numbered timestamp is considered
'staging' and causes *any* message ordered after it to be blocked until
it is no longer staging. This allows us to queue the message in parallel
with any racing multicast, and be guaranteed that all possible conflicts
are blocked until we eventually commit a transaction. To commit a
transaction (after all staging entries are queued), we choose the
highest timestamp we have seen across all destinations and re-queue all
our entries on each peer using that timestamp. Here we use a commit
timestamp (even numbered).
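
As an illustrative sketch of that two-phase protocol (the transaction
engine itself comes in a later patch; 'for_each_destination()', 'sender'
and the per-destination 'dst->entry' are placeholders):

    u64 ts = 0;

    /* phase 1: stage an entry on every destination with an odd timestamp */
    for_each_destination(dst)
        ts = bus1_queue_stage(&dst->queue, &dst->entry, ts);

    /* pick an even commit timestamp above every staging timestamp seen */
    bus1_queue_sync(&sender->queue, ts);
    ts = bus1_queue_tick(&sender->queue);

    /* phase 2: commit each staged entry with that timestamp */
    for_each_destination(dst) {
        bus1_queue_sync(&dst->queue, ts);
        bus1_queue_commit_staged(&dst->queue, &dst->waitq, &dst->entry, ts);
    }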

With this in mind, we define that a client can only dequeue messages
from its queue that have an even timestamp. Furthermore, if there is a
message queued with an odd timestamp that is lower than the even
timestamp of another message, then neither message can be dequeued.
They're considered to be in-flight conflicts. This guarantees that two
concurrent multicast messages can be queued without any *global* locks,
but either can only be dequeued by a peer if their ordering has been
established (via commit timestamps).
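
On the receiving side this boils down to a loop like the following
sketch (locking and the actual message delivery are omitted; 'peer' is
a placeholder):

    struct bus1_queue_node *node;
    bool more;

    /* only fully committed, unblocked entries are ever returned here */
    while ((node = bus1_queue_peek(&peer->queue, &more))) {
        bus1_queue_remove(&peer->queue, &peer->waitq, node);
        /* ... deliver @node; @more is true if further entries of the
         *     same transaction are queued behind it ... */
    }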

NOTE: A fully committed message is not guaranteed to be ready to be
dequeued as it may be blocked by a staging entry. This means
that there is an arbitrary (though bounded) time after a
message transaction completes during which the queue may still
appear to be empty. In other words, message transmission is not
instantaneous. It would be possible to change this at the
cost of briefly blocking each message transaction on all other
conflicting tasks.

The queue implementation uses an rb-tree (ordered by timestamps and
sender), with a cached pointer to the front of the queue. It will be
embedded in every peer participating on the bus1 kernel message bus.

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 3 +-
ipc/bus1/util/queue.c | 445 ++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/util/queue.h | 351 +++++++++++++++++++++++++++++++++++++++
3 files changed, 798 insertions(+), 1 deletion(-)
create mode 100644 ipc/bus1/util/queue.c
create mode 100644 ipc/bus1/util/queue.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index ca8e19d..3c90657 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -2,7 +2,8 @@ bus1-y := \
main.o \
util/active.o \
util/flist.o \
- util/pool.o
+ util/pool.o \
+ util/queue.o

obj-$(CONFIG_BUS1) += bus1.o

diff --git a/ipc/bus1/util/queue.c b/ipc/bus1/util/queue.c
new file mode 100644
index 0000000..38d7b98
--- /dev/null
+++ b/ipc/bus1/util/queue.c
@@ -0,0 +1,445 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include "queue.h"
+
+static void bus1_queue_node_set_timestamp(struct bus1_queue_node *node, u64 ts)
+{
+ WARN_ON(ts & BUS1_QUEUE_TYPE_MASK);
+ node->timestamp_and_type &= BUS1_QUEUE_TYPE_MASK;
+ node->timestamp_and_type |= ts;
+}
+
+static int bus1_queue_node_order(struct bus1_queue_node *a,
+ struct bus1_queue_node *b)
+{
+ int r;
+
+ r = bus1_queue_compare(bus1_queue_node_get_timestamp(a), a->group,
+ bus1_queue_node_get_timestamp(b), b->group);
+ if (r)
+ return r;
+ if (a < b)
+ return -1;
+ if (a > b)
+ return 1;
+
+ WARN(1, "Duplicate queue entry");
+ return 0;
+}
+
+/**
+ * bus1_queue_init() - initialize queue
+ * @queue: queue to initialize
+ *
+ * This initializes a new queue. The queue memory is considered uninitialized;
+ * any previous content is unrecoverable.
+ */
+void bus1_queue_init(struct bus1_queue *queue)
+{
+ queue->clock = 0;
+ queue->flush = 0;
+ queue->leftmost = NULL;
+ rcu_assign_pointer(queue->front, NULL);
+ queue->messages = RB_ROOT;
+}
+
+/**
+ * bus1_queue_deinit() - destroy queue
+ * @queue: queue to destroy
+ *
+ * This destroys a queue that was previously initialized via bus1_queue_init().
+ * The caller must make sure the queue is empty before calling this.
+ *
+ * This function is a no-op, and only does safety checks on the queue. It is
+ * safe to call this function multiple times on the same queue.
+ *
+ * The caller must guarantee that the backing memory of @queue is freed in an
+ * rcu-delayed manner.
+ */
+void bus1_queue_deinit(struct bus1_queue *queue)
+{
+ WARN_ON(!RB_EMPTY_ROOT(&queue->messages));
+ WARN_ON(queue->leftmost);
+ WARN_ON(rcu_access_pointer(queue->front));
+}
+
+/**
+ * bus1_queue_flush() - flush message queue
+ * @queue: queue to flush
+ * @ts: flush timestamp
+ *
+ * This flushes all committed entries from @queue and returns them as a
+ * singly-linked list for the caller to clean up. Staged entries are left in
+ * the queue.
+ *
+ * You must acquire a timestamp before flushing the queue (e.g., tick the
+ * clock). This timestamp must be given as @ts. Only entries lower than, or
+ * equal to, this timestamp are flushed. The timestamp is remembered as
+ * queue->flush.
+ *
+ * Return: Singly-linked list of flushed entries.
+ */
+struct bus1_queue_node *bus1_queue_flush(struct bus1_queue *queue, u64 ts)
+{
+ struct bus1_queue_node *node, *list = NULL;
+ struct rb_node *n;
+
+ /*
+ * A queue contains staging and committed nodes. A committed node is
+ * fully owned by the queue, but a staging entry is always still owned
+ * by a transaction.
+ *
+ * On flush, we push all committed (i.e., queue-owned) nodes into a
+ * list and transfer them to the caller, as if they dequeued them
+ * manually. But any staging node is left linked. Depending on the
+ * timestamp that will be assigned by their transaction, they will be
+ * either lazily discarded or not.
+ */
+
+ WARN_ON(ts & 1);
+ WARN_ON(ts > queue->clock + 1);
+ WARN_ON(ts < queue->flush);
+
+ rcu_assign_pointer(queue->front, NULL);
+ queue->leftmost = NULL;
+ queue->flush = ts;
+
+ n = rb_first(&queue->messages);
+ while (n) {
+ node = container_of(n, struct bus1_queue_node, rb);
+ n = rb_next(n);
+ ts = bus1_queue_node_get_timestamp(node);
+
+ if (!(ts & 1) && ts <= queue->flush) {
+ rb_erase(&node->rb, &queue->messages);
+ RB_CLEAR_NODE(&node->rb);
+ node->next = list;
+ list = node;
+ } else if (!queue->leftmost) {
+ queue->leftmost = &node->rb;
+ }
+ }
+
+ return list;
+}
+
+static void bus1_queue_add(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node,
+ u64 timestamp)
+{
+ struct rb_node *front, *n, **slot;
+ struct bus1_queue_node *iter;
+ bool is_leftmost, readable;
+ u64 ts;
+ int r;
+
+ ts = bus1_queue_node_get_timestamp(node);
+ readable = rcu_access_pointer(queue->front);
+
+ /* provided timestamp must be valid */
+ if (WARN_ON(timestamp == 0 || timestamp > queue->clock + 1))
+ return;
+ /* if unstamped, it must be unlinked, and vice versa */
+ if (WARN_ON(!ts == !RB_EMPTY_NODE(&node->rb)))
+ return;
+ /* if stamped, it must be a valid staging timestamp from earlier */
+ if (WARN_ON(ts != 0 && (!(ts & 1) || timestamp < ts)))
+ return;
+ /* nothing to do? */
+ if (ts == timestamp)
+ return;
+
+ /*
+ * We update the timestamp of @node *before* erasing it. This
+ * guarantees that the comparisons to NEXT/PREV are done based on the
+ * new values.
+ * The rb-tree does not care for async key-updates, since all accesses
+ * are done locked, and tree maintenance is always stable (never looks
+ * at the keys).
+ */
+ bus1_queue_node_set_timestamp(node, timestamp);
+
+ /*
+ * On updates, we remove our entry and re-insert it with a higher
+ * timestamp. Hence, _iff_ we were the first entry, we might uncover
+ * some new front entry. Make sure we mark it as front entry then. Note
+ * that we know that our entry must be marked staging, so it cannot be
+ * set as front, yet. If there is a front, it is some other node.
+ */
+ if (&node->rb == queue->leftmost) {
+ /*
+ * We are linked into the queue as staging entry *and* we are
+ * the first entry. Now look at the following entry. If it is
+ * already committed *and* has a lower timestamp than we do, it
+ * will become the new front, so mark it as such!
+ */
+ WARN_ON(readable);
+ queue->leftmost = rb_next(&node->rb);
+ if (queue->leftmost) {
+ iter = container_of(queue->leftmost,
+ struct bus1_queue_node, rb);
+ if (!bus1_queue_node_is_staging(iter) &&
+ bus1_queue_node_order(iter, node) <= 0)
+ rcu_assign_pointer(queue->front,
+ queue->leftmost);
+ }
+ } else if ((front = rcu_dereference_raw(queue->front))) {
+ /*
+ * If there already is a front entry, just verify that we will
+ * not order *before* it. We *must not* replace it as front.
+ */
+ iter = container_of(front, struct bus1_queue_node, rb);
+ WARN_ON(bus1_queue_node_order(node, iter) <= 0);
+ }
+
+ /* must be staging, so it cannot be pointed to by queue->front */
+ if (!RB_EMPTY_NODE(&node->rb))
+ rb_erase(&node->rb, &queue->messages);
+
+ /* re-insert into sorted rb-tree with new timestamp */
+ slot = &queue->messages.rb_node;
+ n = NULL;
+ is_leftmost = true;
+ while (*slot) {
+ n = *slot;
+ iter = container_of(n, struct bus1_queue_node, rb);
+ r = bus1_queue_node_order(node, iter);
+ if (r < 0) {
+ slot = &n->rb_left;
+ } else /* if (r >= 0) */ {
+ slot = &n->rb_right;
+ is_leftmost = false;
+ }
+ }
+
+ rb_link_node(&node->rb, n, slot);
+ rb_insert_color(&node->rb, &queue->messages);
+
+ if (is_leftmost) {
+ queue->leftmost = &node->rb;
+ if (!(timestamp & 1))
+ rcu_assign_pointer(queue->front, &node->rb);
+ else
+ WARN_ON(readable);
+ }
+
+ if (waitq && !readable && rcu_access_pointer(queue->front))
+ wake_up_interruptible(waitq);
+}
+
+/**
+ * bus1_queue_stage() - stage queue entry with fresh timestamp
+ * @queue: queue to operate on
+ * @node: queue entry to stage
+ * @timestamp: minimum timestamp for @node
+ *
+ * Link a queue entry with a new timestamp. The staging entry blocks all
+ * messages with timestamps synced on this queue in the future, as well as any
+ * messages with a timestamp greater than @timestamp. However, it does not block
+ * any messages already committed to this queue.
+ *
+ * The caller must provide an even timestamp and the entry may not already have
+ * been committed.
+ *
+ * Return: The timestamp used.
+ */
+u64 bus1_queue_stage(struct bus1_queue *queue,
+ struct bus1_queue_node *node,
+ u64 timestamp)
+{
+ WARN_ON(timestamp & 1);
+ WARN_ON(bus1_queue_node_is_queued(node));
+
+ timestamp = bus1_queue_sync(queue, timestamp);
+ bus1_queue_add(queue, NULL, node, timestamp + 1);
+ WARN_ON(rcu_access_pointer(queue->front) == &node->rb);
+
+ return timestamp;
+}
+
+/**
+ * bus1_queue_commit_staged() - commit staged queue entry with new timestamp
+ * @queue: queue to operate on
+ * @waitq: wait-queue to wake up on change, or NULL
+ * @node: queue entry to commit
+ * @timestamp: new timestamp for @node
+ *
+ * Update a staging queue entry according to @timestamp. The timestamp must be
+ * even and the entry may not already have been committed.
+ *
+ * Furthermore, the queue clock must be synced with the new timestamp *before*
+ * committing an entry. Similarly, the timestamp of an entry can only be
+ * increased, never decreased.
+ */
+void bus1_queue_commit_staged(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node,
+ u64 timestamp)
+{
+ WARN_ON(timestamp & 1);
+ WARN_ON(!bus1_queue_node_is_queued(node));
+
+ bus1_queue_add(queue, waitq, node, timestamp);
+}
+
+/**
+ * bus1_queue_commit_unstaged() - commit unstaged queue entry with new timestamp
+ * @queue: queue to operate on
+ * @waitq: wait-queue to wake up on change, or NULL
+ * @node: queue entry to commit
+ *
+ * Directly commit an unstaged queue entry to the destination queue. The entry
+ * must not be queued, yet.
+ *
+ * The destination queue is ticked and the resulting timestamp is used to commit
+ * the queue entry.
+ */
+void bus1_queue_commit_unstaged(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node)
+{
+ WARN_ON(bus1_queue_node_is_queued(node));
+
+ bus1_queue_add(queue, waitq, node, bus1_queue_tick(queue));
+}
+
+/**
+ * bus1_queue_commit_synthetic() - commit synthetic entry
+ * @queue: queue to operate on
+ * @node: entry to commit
+ * @timestamp: timestamp to use
+ *
+ * This inserts the unqueued entry @node into the queue with the commit
+ * timestamp @timestamp (just like bus1_queue_commit_unstaged()). However, it
+ * only does so if the new entry would NOT become the new front. It thus allows
+ * inserting fake synthetic entries somewhere in the middle of a queue, but
+ * accepts the possibility of failure.
+ *
+ * Return: True if committed, false if not.
+ */
+bool bus1_queue_commit_synthetic(struct bus1_queue *queue,
+ struct bus1_queue_node *node,
+ u64 timestamp)
+{
+ struct bus1_queue_node *t;
+ bool queued = false;
+ int r;
+
+ WARN_ON(timestamp & 1);
+ WARN_ON(timestamp > queue->clock + 1);
+ WARN_ON(bus1_queue_node_is_queued(node));
+
+ if (queue->leftmost) {
+ t = container_of(queue->leftmost, struct bus1_queue_node, rb);
+ r = bus1_queue_compare(bus1_queue_node_get_timestamp(t),
+ t->group, timestamp, node->group);
+ if (r < 0 || (r == 0 && node < t)) {
+ bus1_queue_add(queue, NULL, node, timestamp);
+ WARN_ON(rcu_access_pointer(queue->front) == &node->rb);
+ queued = true;
+ }
+ }
+
+ return queued;
+}
+
+/**
+ * bus1_queue_remove() - remove entry from queue
+ * @queue: queue to operate on
+ * @waitq: wait-queue to wake up on change, or NULL
+ * @node: queue entry to remove
+ *
+ * This unlinks @node and fully removes it from the queue @queue. If you want
+ * to re-insert the node into a queue, you must re-initialize it first.
+ *
+ * It is an error to call this on an unlinked entry.
+ */
+void bus1_queue_remove(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node)
+{
+ bool readable;
+
+ if (WARN_ON(RB_EMPTY_NODE(&node->rb)))
+ return;
+
+ readable = rcu_access_pointer(queue->front);
+
+ if (queue->leftmost == &node->rb) {
+ /*
+ * We are the first entry in the queue. Regardless whether we
+ * are marked as front or not, our removal might uncover a new
+ * front. Hence, always look at the next following entry and
+ * see whether it is fully committed. If it is, mark it as
+ * front, but otherwise reset the front to NULL.
+ */
+ queue->leftmost = rb_next(queue->leftmost);
+ if (queue->leftmost &&
+ !bus1_queue_node_is_staging(container_of(queue->leftmost,
+ struct bus1_queue_node,
+ rb)))
+ rcu_assign_pointer(queue->front, queue->leftmost);
+ else
+ rcu_assign_pointer(queue->front, NULL);
+ }
+
+ rb_erase(&node->rb, &queue->messages);
+ RB_CLEAR_NODE(&node->rb);
+
+ if (waitq && !readable && rcu_access_pointer(queue->front))
+ wake_up_interruptible(waitq);
+}
+
+/**
+ * bus1_queue_peek() - peek first available entry
+ * @queue: queue to operate on
+ * @morep: where to store group-state
+ *
+ * This returns a pointer to the first available entry in the given queue, or
+ * NULL if there is none. The queue stays unmodified and the returned entry
+ * remains on the queue.
+ *
+ * This only returns entries that are ready to be dequeued. Entries that are
+ * still in staging mode will not be considered.
+ *
+ * If a node is returned, its group-state is stored in @morep. That means,
+ * if there are more messages queued as part of the same transaction, true is
+ * stored in @morep. But if the returned node is the last part of the
+ * transaction, false is stored in @morep.
+ *
+ * Return: Pointer to first available entry, NULL if none available.
+ */
+struct bus1_queue_node *bus1_queue_peek(struct bus1_queue *queue, bool *morep)
+{
+ struct bus1_queue_node *node, *t;
+ struct rb_node *n;
+
+ n = rcu_dereference_raw(queue->front);
+ if (!n)
+ return NULL;
+
+ node = container_of(n, struct bus1_queue_node, rb);
+ n = rb_next(n);
+ if (n)
+ t = container_of(n, struct bus1_queue_node, rb);
+
+ *morep = n && !bus1_queue_compare(bus1_queue_node_get_timestamp(node),
+ node->group,
+ bus1_queue_node_get_timestamp(t),
+ t->group);
+ return node;
+}
diff --git a/ipc/bus1/util/queue.h b/ipc/bus1/util/queue.h
new file mode 100644
index 0000000..1a59a60
--- /dev/null
+++ b/ipc/bus1/util/queue.h
@@ -0,0 +1,351 @@
+#ifndef __BUS1_QUEUE_H
+#define __BUS1_QUEUE_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Message Queue
+ *
+ * (You are highly encouraged to read up on 'Lamport Timestamps', the
+ * concept of 'happened-before', and 'causal ordering'. The queue
+ * implementation has its roots in Lamport Timestamps, treating a set of local
+ * CPUs as a distributed system to avoid any global synchronization.)
+ *
+ * A message queue is a FIFO, i.e., messages are linearly ordered by the time
+ * they were sent. Moreover, atomic delivery of messages to multiple queues is
+ * supported, without any global synchronization, i.e., the order of message
+ * delivery is consistent across queues.
+ *
+ * Messages can be destined for multiple queues; hence, we need to be careful
+ * that all queues get a consistent order of incoming messages. We define the
+ * concept of `global order' to provide a basic set of guarantees. This global
+ * order is a partial order on the set of all messages. The order is defined as:
+ *
+ * 1) If a message B was queued *after* a message A, then: A < B
+ *
+ * 2) If a message B was queued *after* a message A was dequeued, then: A < B
+ *
+ * 3) If a message B was dequeued *after* a message A on the same queue,
+ * then: A < B
+ *
+ * (Note: Causality is honored. `after' and `before' do not refer to the
+ * same task, nor the same queue, but rather any kind of
+ * synchronization between the two operations.)
+ *
+ * The queue object implements this global order in a lockless fashion. It
+ * solely relies on a distributed clock on each queue. Each message to be sent
+ * causes a clock tick on the local clock and on all destination clocks.
+ * Furthermore, all clocks are synchronized, meaning they're fast-forwarded in
+ * case they're behind the highest of all participating peers. No global state
+ * tracking is involved.
+ *
+ * During a message transaction, we first queue a message as 'staging' entry in
+ * each destination with a preliminary timestamp. This timestamp is explicitly
+ * odd numbered. Any odd numbered timestamp is considered 'staging' and causes
+ * *any* message ordered after it to be blocked until it is no longer staging.
+ * This allows us to queue the message in parallel with any racing multicast,
+ * and be guaranteed that all possible conflicts are blocked until we eventually
+ * commit a transaction. To commit a transaction (after all staging entries are
+ * queued), we choose the highest timestamp we have seen across all destinations
+ * and re-queue all our entries on each peer using that timestamp. Here we use a
+ * commit timestamp (even numbered).
+ *
+ * With this in mind, we define that a client can only dequeue messages from
+ * its queue that have an even timestamp. Furthermore, if there is a message
+ * queued with an odd timestamp that is lower than the even timestamp of
+ * another message, then neither message can be dequeued. They're considered to
+ * be in-flight conflicts. This guarantees that two concurrent multicast
+ * messages can be queued without any *global* locks, but either can only be
+ * dequeued by a peer if their ordering has been established (via commit
+ * timestamps).
+ *
+ * NOTE: A fully committed message is not guaranteed to be ready to be dequeued
+ * as it may be blocked by a staging entry. This means that there is an
+ * arbitrary (though bounded) time after a message transaction completes
+ * during which the queue may still appear to be empty. In other words, message
+ * transmission is not instantaneous. It would be possible to change this
+ * at the cost of briefly blocking each message transaction on all other
+ * conflicting tasks.
+ *
+ * The queue implementation uses an rb-tree (ordered by timestamps and sender),
+ * with a cached pointer to the front of the queue.
+ */
+
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/wait.h>
+
+/* shift/mask for @timestamp_and_type field of queue nodes */
+#define BUS1_QUEUE_TYPE_SHIFT (62)
+#define BUS1_QUEUE_TYPE_MASK (((u64)3ULL) << BUS1_QUEUE_TYPE_SHIFT)
+#define BUS1_QUEUE_TYPE_N (4)
+
+/**
+ * struct bus1_queue_node - node into message queue
+ * @rcu: rcu-delayed destruction
+ * @rb: link into sorted message queue
+ * @timestamp_and_type: message timestamp and type of parent object
+ * @next: single-linked utility list
+ * @group: group association
+ * @owner: node owner
+ */
+struct bus1_queue_node {
+ union {
+ struct rcu_head rcu;
+ struct rb_node rb;
+ };
+ u64 timestamp_and_type;
+ struct bus1_queue_node *next;
+ void *group;
+ void *owner;
+};
+
+/**
+ * struct bus1_queue - message queue
+ * @clock: local clock (used for Lamport Timestamps)
+ * @flush: last flush timestamp
+ * @leftmost: cached left-most entry
+ * @front: cached front entry
+ * @messages: queued messages
+ */
+struct bus1_queue {
+ u64 clock;
+ u64 flush;
+ struct rb_node *leftmost;
+ struct rb_node __rcu *front;
+ struct rb_root messages;
+};
+
+void bus1_queue_init(struct bus1_queue *queue);
+void bus1_queue_deinit(struct bus1_queue *queue);
+struct bus1_queue_node *bus1_queue_flush(struct bus1_queue *queue, u64 ts);
+u64 bus1_queue_stage(struct bus1_queue *queue,
+ struct bus1_queue_node *node,
+ u64 timestamp);
+void bus1_queue_commit_staged(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node,
+ u64 timestamp);
+void bus1_queue_commit_unstaged(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node);
+bool bus1_queue_commit_synthetic(struct bus1_queue *queue,
+ struct bus1_queue_node *node,
+ u64 timestamp);
+void bus1_queue_remove(struct bus1_queue *queue,
+ wait_queue_head_t *waitq,
+ struct bus1_queue_node *node);
+struct bus1_queue_node *bus1_queue_peek(struct bus1_queue *queue, bool *morep);
+
+/**
+ * bus1_queue_node_init() - initialize queue node
+ * @node: node to initialize
+ * @type: message type
+ *
+ * This initializes a previously unused node, and prepares it for use with a
+ * message queue.
+ */
+static inline void bus1_queue_node_init(struct bus1_queue_node *node,
+ unsigned int type)
+{
+ BUILD_BUG_ON((BUS1_QUEUE_TYPE_N - 1) > (BUS1_QUEUE_TYPE_MASK >>
+ BUS1_QUEUE_TYPE_SHIFT));
+ WARN_ON(type & ~(BUS1_QUEUE_TYPE_MASK >> BUS1_QUEUE_TYPE_SHIFT));
+
+ RB_CLEAR_NODE(&node->rb);
+ node->timestamp_and_type = (u64)type << BUS1_QUEUE_TYPE_SHIFT;
+ node->next = NULL;
+ node->group = NULL;
+ node->owner = NULL;
+}
+
+/**
+ * bus1_queue_node_deinit() - destroy queue node
+ * @node: node to destroy
+ *
+ * This destroys a previously initialized queue node. This is a no-op and only
+ * serves as a debugging aid, verifying that the node was properly unqueued.
+ */
+static inline void bus1_queue_node_deinit(struct bus1_queue_node *node)
+{
+ WARN_ON(!RB_EMPTY_NODE(&node->rb));
+ WARN_ON(node->next);
+}
+
+/**
+ * bus1_queue_node_get_type() - query node type
+ * @node: node to query
+ *
+ * This queries the node type that was provided via the node constructor. A
+ * node never changes its type during its entire lifetime.
+ *
+ * Return: Type of @node is returned.
+ */
+static inline unsigned int
+bus1_queue_node_get_type(struct bus1_queue_node *node)
+{
+ return (node->timestamp_and_type & BUS1_QUEUE_TYPE_MASK) >>
+ BUS1_QUEUE_TYPE_SHIFT;
+}
+
+/**
+ * bus1_queue_node_get_timestamp() - query node timestamp
+ * @node: node to query
+ *
+ * This queries the node timestamp that is currently set on this node.
+ *
+ * Return: Timestamp of @node is returned.
+ */
+static inline u64 bus1_queue_node_get_timestamp(struct bus1_queue_node *node)
+{
+ return node->timestamp_and_type & ~BUS1_QUEUE_TYPE_MASK;
+}
+
+/**
+ * bus1_queue_node_is_queued() - check whether a node is queued
+ * @node: node to query
+ *
+ * This checks whether a node is currently queued in a message queue. That is,
+ * the node was linked and has not been dequeued, yet.
+ *
+ * Return: True if @node is currently queued.
+ */
+static inline bool bus1_queue_node_is_queued(struct bus1_queue_node *node)
+{
+ return !RB_EMPTY_NODE(&node->rb);
+}
+
+/**
+ * bus1_queue_node_is_staging() - check whether a node is marked staging
+ * @node: node to query
+ *
+ * This checks whether a given node is queued, but still marked staging. That
+ * means, the node has been put on the queue but there is still a transaction
+ * that pins it to commit it later.
+ *
+ * Return: True if @node is queued as staging entry.
+ */
+static inline bool bus1_queue_node_is_staging(struct bus1_queue_node *node)
+{
+ return bus1_queue_node_get_timestamp(node) & 1;
+}
+
+/**
+ * bus1_queue_tick() - increment queue clock
+ * @queue: queue to operate on
+ *
+ * This performs a clock-tick on @queue. The clock is incremented by a full
+ * interval (+2). The caller is free to use both the new value (even numbered)
+ * and its successor (odd numbered). Both are uniquely allocated to the
+ * caller.
+ *
+ * Return: New clock value is returned.
+ */
+static inline u64 bus1_queue_tick(struct bus1_queue *queue)
+{
+ queue->clock += 2;
+ return queue->clock;
+}
+
+/**
+ * bus1_queue_sync() - sync queue clock
+ * @queue: queue to operate on
+ * @timestamp: timestamp to sync on
+ *
+ * This synchronizes the clock of @queue with the externally provided timestamp
+ * @timestamp. That is, the queue clock is fast-forwarded to @timestamp, in
+ * case it is newer than the queue clock. Otherwise, nothing is done.
+ *
+ * The passed in timestamp must be even.
+ *
+ * Return: New clock value is returned.
+ */
+static inline u64 bus1_queue_sync(struct bus1_queue *queue, u64 timestamp)
+{
+ WARN_ON(timestamp & 1);
+ queue->clock = max(queue->clock, timestamp);
+ return queue->clock;
+}
+
+/**
+ * bus1_queue_is_readable_rcu() - check whether a queue is readable
+ * @queue: queue to operate on
+ *
+ * This checks whether the given queue is readable.
+ *
+ * This does not require any locking, except for an rcu-read-side critical
+ * section.
+ *
+ * Return: True if the queue is readable, false if not.
+ */
+static inline bool bus1_queue_is_readable_rcu(struct bus1_queue *queue)
+{
+ return rcu_access_pointer(queue->front);
+}
+
+/**
+ * bus1_queue_compare() - comparator for queue ordering
+ * @a_ts: timestamp of first node to compare
+ * @a_g: group of first node to compare
+ * @b_ts: timestamp of second node to compare against
+ * @b_g: group of second node to compare against
+ *
+ * Messages on a message queue are ordered. This function implements the
+ * comparator used for all message ordering in queues. Two tags are used for
+ * ordering, the timestamp and the group-tag of a node. Both must be passed to
+ * this function.
+ *
+ * This compares the tuples (@a_ts, @a_g) and (@b_ts, @b_g).
+ *
+ * Return: <0 if (@a_ts, @a_g) is ordered before, >0 if after, 0 if same.
+ */
+static inline int bus1_queue_compare(u64 a_ts, void *a_g, u64 b_ts, void *b_g)
+{
+ /*
+ * This orders two possible queue nodes. As first-level ordering we
+ * use the timestamps, as second-level ordering we use the group-tag.
+ *
+ * Timestamp-based ordering should be obvious. We simply make sure that
+ * any message with a lower timestamp is always considered to be first.
+ * However, due to the distributed nature of the queue-clocks, multiple
+ * messages might end up with the same timestamp. A multicast picks the
+ * highest of its destination clocks and bumps everyone else. As such,
+ * the picked timestamp for a multicast might not be unique, if another
+ * multicast with only partial destination overlap races it and happens
+ * to get the same timestamp via a distinct destination clock. If that
+ * happens, we guarantee a stable order by comparing the group-tag of
+ * the nodes. The group-tag is only ever equal if both messages belong
+ * to the same transaction.
+ *
+ * Note that we strictly rely on any multicast to be staged before its
+ * final commit. This guarantees that if a node is queued with a commit
+ * timestamp, it can never be lower than the commit timestamp of any
+ * other committed node, except if it was already staged with a lower
+ * staging timestamp (as such it blocks the conflicting entry). This
+ * also implies that if two nodes share a timestamp, both will
+ * necessarily block each other until both are committed (since shared
+ * timestamps imply that an entry is guaranteed to be staged before a
+ * conflicting entry is committed).
+ */
+
+ if (a_ts < b_ts)
+ return -1;
+ else if (a_ts > b_ts)
+ return 1;
+ else if (a_g < b_g)
+ return -1;
+ else if (a_g > b_g)
+ return 1;
+
+ return 0;
+}
+
+#endif /* __BUS1_QUEUE_H */
--
2.10.1

2016-10-26 19:22:13

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 05/14] bus1: util - pool utility library

From: Tom Gundersen <[email protected]>

A bus1-pool is a shmem-backed memory pool shared between userspace and
the kernel. The pool is used to transfer memory from the kernel to
userspace without requiring userspace to pre-allocate space.

The pool is managed in slices, which are published to userspace when
they are ready to be read and must be released by userspace when
userspace is done with them.

Userspace has read-only access to its pools and the kernel has
read-write access, but published slices are not altered.

This pool implementation will be used by bus1 message transactions to
support single-copy data transfers, directly from the sender's address
space into the pool of the destination peer. The allocation algorithm
is based on the Android Binder code and has served their needs well for
many years now.
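
As an illustrative sketch of the intended kernel-side flow (the message
transaction code using it comes in later patches; 'peer', 'vecs' and the
error handling are placeholders):

    struct bus1_pool_slice *slice;

    slice = bus1_pool_alloc(&peer->pool, size);
    if (IS_ERR(slice))
        return PTR_ERR(slice);

    /* copy the payload while only the kernel references the slice */
    r = bus1_pool_write_kvec(&peer->pool, slice, 0, vecs, n_vecs, size);

    /* make the slice visible to userspace, then drop the kernel ref */
    bus1_pool_publish(&peer->pool, slice);
    bus1_pool_release_kernel(&peer->pool, slice);

Userspace maps the pool read-only (see bus1_pool_mmap()), reads the data
in place at the offset it was handed, and finally releases its reference,
which ends up in bus1_pool_release_user().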

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 3 +-
ipc/bus1/util/pool.c | 572 +++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/util/pool.h | 164 +++++++++++++++
3 files changed, 738 insertions(+), 1 deletion(-)
create mode 100644 ipc/bus1/util/pool.c
create mode 100644 ipc/bus1/util/pool.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index 6db6d13..ca8e19d 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,7 +1,8 @@
bus1-y := \
main.o \
util/active.o \
- util/flist.o
+ util/flist.o \
+ util/pool.o

obj-$(CONFIG_BUS1) += bus1.o

diff --git a/ipc/bus1/util/pool.c b/ipc/bus1/util/pool.c
new file mode 100644
index 0000000..2ddbffb
--- /dev/null
+++ b/ipc/bus1/util/pool.c
@@ -0,0 +1,572 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/aio.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/highmem.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/shmem_fs.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uio.h>
+#include "pool.h"
+
+static struct bus1_pool_slice *bus1_pool_slice_new(size_t offset, size_t size)
+{
+ struct bus1_pool_slice *slice;
+
+ if (offset > U32_MAX || size == 0 || size > BUS1_POOL_SLICE_SIZE_MAX)
+ return ERR_PTR(-EMSGSIZE);
+
+ slice = kmalloc(sizeof(*slice), GFP_KERNEL);
+ if (!slice)
+ return ERR_PTR(-ENOMEM);
+
+ slice->offset = offset;
+ slice->size = size;
+
+ return slice;
+}
+
+static struct bus1_pool_slice *
+bus1_pool_slice_free(struct bus1_pool_slice *slice)
+{
+ if (!slice)
+ return NULL;
+
+ kfree(slice);
+
+ return NULL;
+}
+
+/* insert slice into the free tree */
+static void bus1_pool_slice_link_free(struct bus1_pool_slice *slice,
+ struct bus1_pool *pool)
+{
+ struct rb_node **n, *prev = NULL;
+ struct bus1_pool_slice *ps;
+
+ n = &pool->slices_free.rb_node;
+ while (*n) {
+ prev = *n;
+ ps = container_of(prev, struct bus1_pool_slice, rb);
+ if (slice->size < ps->size)
+ n = &prev->rb_left;
+ else
+ n = &prev->rb_right;
+ }
+
+ rb_link_node(&slice->rb, prev, n);
+ rb_insert_color(&slice->rb, &pool->slices_free);
+}
+
+/* insert slice into the busy tree */
+static void bus1_pool_slice_link_busy(struct bus1_pool_slice *slice,
+ struct bus1_pool *pool)
+{
+ struct rb_node **n, *prev = NULL;
+ struct bus1_pool_slice *ps;
+
+ n = &pool->slices_busy.rb_node;
+ while (*n) {
+ prev = *n;
+ ps = container_of(prev, struct bus1_pool_slice, rb);
+ if (WARN_ON(slice->offset == ps->offset))
+ n = &prev->rb_right; /* add anyway */
+ else if (slice->offset < ps->offset)
+ n = &prev->rb_left;
+ else /* if (slice->offset > ps->offset) */
+ n = &prev->rb_right;
+ }
+
+ rb_link_node(&slice->rb, prev, n);
+ rb_insert_color(&slice->rb, &pool->slices_busy);
+
+ pool->allocated_size += slice->size;
+}
+
+/* find free slice big enough to hold @size bytes */
+static struct bus1_pool_slice *
+bus1_pool_slice_find_by_size(struct bus1_pool *pool, size_t size)
+{
+ struct bus1_pool_slice *ps, *closest = NULL;
+ struct rb_node *n;
+
+ n = pool->slices_free.rb_node;
+ while (n) {
+ ps = container_of(n, struct bus1_pool_slice, rb);
+ if (size < ps->size) {
+ closest = ps;
+ n = n->rb_left;
+ } else if (size > ps->size) {
+ n = n->rb_right;
+ } else /* if (size == ps->size) */ {
+ return ps;
+ }
+ }
+
+ return closest;
+}
+
+/* find used slice with given offset */
+static struct bus1_pool_slice *
+bus1_pool_slice_find_by_offset(struct bus1_pool *pool, size_t offset)
+{
+ struct bus1_pool_slice *ps;
+ struct rb_node *n;
+
+ n = pool->slices_busy.rb_node;
+ while (n) {
+ ps = container_of(n, struct bus1_pool_slice, rb);
+ if (offset < ps->offset)
+ n = n->rb_left;
+ else if (offset > ps->offset)
+ n = n->rb_right;
+ else /* if (offset == ps->offset) */
+ return ps;
+ }
+
+ return NULL;
+}
+
+/**
+ * bus1_pool_init() - create memory pool
+ * @pool: pool to operate on
+ * @filename: name to use for the shmem-file (only visible via /proc)
+ *
+ * Initialize a new pool object.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_pool_init(struct bus1_pool *pool, const char *filename)
+{
+ struct bus1_pool_slice *slice;
+ struct page *p;
+ struct file *f;
+ int r;
+
+ /* cannot calculate width of bitfields, so hardcode '3' as flag-size */
+ BUILD_BUG_ON(BUS1_POOL_SLICE_SIZE_BITS + 3 > 32);
+ BUILD_BUG_ON(BUS1_POOL_SLICE_SIZE_MAX > U32_MAX);
+
+ f = shmem_file_setup(filename, ALIGN(BUS1_POOL_SLICE_SIZE_MAX, 8),
+ VM_NORESERVE);
+ if (IS_ERR(f))
+ return PTR_ERR(f);
+
+ r = get_write_access(file_inode(f));
+ if (r < 0) {
+ fput(f);
+ return r;
+ }
+
+ pool->f = f;
+ pool->allocated_size = 0;
+ INIT_LIST_HEAD(&pool->slices);
+ pool->slices_free = RB_ROOT;
+ pool->slices_busy = RB_ROOT;
+
+ slice = bus1_pool_slice_new(0, BUS1_POOL_SLICE_SIZE_MAX);
+ if (IS_ERR(slice)) {
+ bus1_pool_deinit(pool);
+ return PTR_ERR(slice);
+ }
+
+ slice->free = true;
+ slice->ref_kernel = false;
+ slice->ref_user = false;
+
+ list_add(&slice->entry, &pool->slices);
+ bus1_pool_slice_link_free(slice, pool);
+
+ /*
+ * Touch first page of client pool so the initial allocation overhead
+ * is done during peer setup rather than a message transaction. This is
+ * really just an optimization to avoid some random peaks in common
+ * paths. It is not meant as ultimate protection.
+ */
+ p = shmem_read_mapping_page(file_inode(f)->i_mapping, 0);
+ if (!IS_ERR(p))
+ put_page(p);
+
+ return 0;
+}
+
+/**
+ * bus1_pool_deinit() - destroy pool
+ * @pool: pool to destroy, or NULL
+ *
+ * This destroys a pool that was previously created via bus1_pool_init(). If
+ * NULL is passed, or if @pool->f is NULL (i.e., the pool was initialized to 0
+ * but not created via bus1_pool_init(), yet), then this is a no-op.
+ *
+ * The caller must make sure that no kernel reference to any slice exists. Any
+ * pending user-space reference to any slice is dropped by this function.
+ */
+void bus1_pool_deinit(struct bus1_pool *pool)
+{
+ struct bus1_pool_slice *slice;
+
+ if (!pool || !pool->f)
+ return;
+
+ while ((slice = list_first_entry_or_null(&pool->slices,
+ struct bus1_pool_slice,
+ entry))) {
+ WARN_ON(slice->ref_kernel);
+ list_del(&slice->entry);
+ bus1_pool_slice_free(slice);
+ }
+
+ put_write_access(file_inode(pool->f));
+ fput(pool->f);
+ pool->f = NULL;
+}
+
+/**
+ * bus1_pool_alloc() - allocate memory
+ * @pool: pool to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * This allocates a new slice of @size bytes from the memory pool at @pool. The
+ * slice must be released via bus1_pool_release_kernel() by the caller. All
+ * slices are aligned to 8 bytes (both offset and size).
+ *
+ * If no suitable slice can be allocated, an error is returned.
+ *
+ * Each pool slice can have two different references, a kernel reference and a
+ * user-space reference. Initially, it only has a kernel-reference, which must
+ * be dropped via bus1_pool_release_kernel(). However, if the slice is
+ * published via bus1_pool_publish(), it will also have a user-space
+ * reference, which user-space must (indirectly) release via a call to
+ * bus1_pool_release_user().
+ * A slice is only actually freed once neither reference exists anymore. Hence,
+ * a pool slice can be held by both the kernel and user-space, and both can rely
+ * on it staying around as long as they wish.
+ *
+ * Return: Pointer to new slice, or ERR_PTR on failure.
+ */
+struct bus1_pool_slice *bus1_pool_alloc(struct bus1_pool *pool, size_t size)
+{
+ struct bus1_pool_slice *slice, *ps;
+ size_t slice_size;
+
+ slice_size = ALIGN(size, 8);
+ if (slice_size == 0 || slice_size > BUS1_POOL_SLICE_SIZE_MAX)
+ return ERR_PTR(-EMSGSIZE);
+
+ /* find smallest suitable, free slice */
+ slice = bus1_pool_slice_find_by_size(pool, slice_size);
+ if (!slice)
+ return ERR_PTR(-EXFULL);
+
+ /* split slice if it doesn't match exactly */
+ if (slice_size < slice->size) {
+ ps = bus1_pool_slice_new(slice->offset + slice_size,
+ slice->size - slice_size);
+ if (IS_ERR(ps))
+ return ERR_CAST(ps);
+
+ ps->free = true;
+ ps->ref_kernel = false;
+ ps->ref_user = false;
+
+ list_add(&ps->entry, &slice->entry); /* add after @slice */
+ bus1_pool_slice_link_free(ps, pool);
+
+ slice->size = slice_size;
+ }
+
+ /* move from free-tree to busy-tree */
+ rb_erase(&slice->rb, &pool->slices_free);
+ bus1_pool_slice_link_busy(slice, pool);
+
+ slice->ref_kernel = true;
+ slice->ref_user = false;
+ slice->free = false;
+
+ return slice;
+}
+
+static void bus1_pool_free(struct bus1_pool *pool,
+ struct bus1_pool_slice *slice)
+{
+ struct bus1_pool_slice *ps;
+
+ /* don't free the slice if either has a reference */
+ if (slice->ref_kernel || slice->ref_user || WARN_ON(slice->free))
+ return;
+
+ /*
+ * To release a pool-slice, we first drop it from the busy-tree, then
+ * merge it with possible previous/following free slices and re-add it
+ * to the free-tree.
+ */
+
+ rb_erase(&slice->rb, &pool->slices_busy);
+
+ if (!WARN_ON(slice->size > pool->allocated_size))
+ pool->allocated_size -= slice->size;
+
+ if (pool->slices.next != &slice->entry) {
+ ps = container_of(slice->entry.prev, struct bus1_pool_slice,
+ entry);
+ if (ps->free) {
+ rb_erase(&ps->rb, &pool->slices_free);
+ list_del(&slice->entry);
+ ps->size += slice->size;
+ bus1_pool_slice_free(slice);
+ slice = ps; /* switch to previous slice */
+ }
+ }
+
+ if (pool->slices.prev != &slice->entry) {
+ ps = container_of(slice->entry.next, struct bus1_pool_slice,
+ entry);
+ if (ps->free) {
+ rb_erase(&ps->rb, &pool->slices_free);
+ list_del(&ps->entry);
+ slice->size += ps->size;
+ bus1_pool_slice_free(ps);
+ }
+ }
+
+ slice->free = true;
+ bus1_pool_slice_link_free(slice, pool);
+}
+
+/**
+ * bus1_pool_release_kernel() - release kernel-owned slice reference
+ * @pool: pool to free memory on
+ * @slice: slice to release
+ *
+ * This releases the kernel-reference to a slice that was previously allocated
+ * via bus1_pool_alloc(). This only releases the kernel reference to the slice.
+ * If the slice was already published to user-space, then their reference is
+ * left untouched. Once both references are gone, the memory is actually freed.
+ *
+ * Return: NULL is returned.
+ */
+struct bus1_pool_slice *
+bus1_pool_release_kernel(struct bus1_pool *pool, struct bus1_pool_slice *slice)
+{
+ if (!slice || WARN_ON(!slice->ref_kernel))
+ return NULL;
+
+ /* kernel must own a ref to @slice */
+ slice->ref_kernel = false;
+
+ bus1_pool_free(pool, slice);
+
+ return NULL;
+}
+
+/**
+ * bus1_pool_publish() - publish a slice
+ * @pool: pool to operate on
+ * @slice: slice to publish
+ *
+ * Publish a pool slice to user-space, so user-space can get access to it via
+ * the mapped pool memory. If the slice was already published, this is a no-op.
+ * Otherwise, the slice is marked as public and will only get freed once both
+ * the user-space reference *and* kernel-space reference are released.
+ */
+void bus1_pool_publish(struct bus1_pool *pool, struct bus1_pool_slice *slice)
+{
+ /* kernel must own a ref to @slice to publish it */
+ WARN_ON(!slice->ref_kernel);
+ slice->ref_user = true;
+}
+
+/**
+ * bus1_pool_release_user() - release a public slice
+ * @pool: pool to operate on
+ * @offset: offset of slice to release
+ * @n_slicesp: output variable to store number of released slices, or NULL
+ *
+ * Release the user-space reference to a pool-slice, specified via the offset
+ * of the slice. If both the user-space reference *and* the kernel-space
+ * reference to the slice are gone, the slice is actually freed.
+ *
+ * If no slice exists with the given offset, or if there is no user-space
+ * reference to the specified slice, an error is returned.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_pool_release_user(struct bus1_pool *pool,
+ size_t offset,
+ size_t *n_slicesp)
+{
+ struct bus1_pool_slice *slice;
+
+ slice = bus1_pool_slice_find_by_offset(pool, offset);
+ if (!slice || !slice->ref_user)
+ return -ENXIO;
+
+ if (n_slicesp)
+ *n_slicesp = !slice->ref_kernel;
+
+ slice->ref_user = false;
+ bus1_pool_free(pool, slice);
+
+ return 0;
+}
+
+/**
+ * bus1_pool_flush() - flush all user references
+ * @pool: pool to flush
+ * @n_slicesp: output variable to store number of released slices, or NULL
+ *
+ * This flushes all user-references to any slice in @pool. Kernel references
+ * are left untouched.
+ */
+void bus1_pool_flush(struct bus1_pool *pool, size_t *n_slicesp)
+{
+ struct bus1_pool_slice *slice;
+ struct rb_node *node, *t;
+ size_t n_slices = 0;
+
+ for (node = rb_first(&pool->slices_busy);
+ node && ((t = rb_next(node)), true);
+ node = t) {
+ slice = container_of(node, struct bus1_pool_slice, rb);
+ if (!slice->ref_user)
+ continue;
+
+ if (!slice->ref_kernel)
+ ++n_slices;
+
+ /*
+ * @slice (or the logically previous/next slice) might be freed
+ * by bus1_pool_free(). However, this only ever affects 'free'
+ * slices, never busy slices. Hence, @t is protected from
+ * removal.
+ */
+ slice->ref_user = false;
+ bus1_pool_free(pool, slice);
+ }
+
+ if (n_slicesp)
+ *n_slicesp = n_slices;
+}
+
+/**
+ * bus1_pool_mmap() - mmap the pool
+ * @pool: pool to operate on
+ * @vma: VMA to map to
+ *
+ * This maps the pool's shmem file to the provided VMA. Only read-only mappings
+ * are allowed.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_pool_mmap(struct bus1_pool *pool, struct vm_area_struct *vma)
+{
+ if (unlikely(vma->vm_flags & VM_WRITE))
+ return -EPERM; /* deny write-access to the pool */
+
+ /* replace the connection file with our shmem file */
+ if (vma->vm_file)
+ fput(vma->vm_file);
+ vma->vm_file = get_file(pool->f);
+ vma->vm_flags &= ~VM_MAYWRITE;
+
+ /* calls into shmem_mmap(), which simply sets vm_ops */
+ return pool->f->f_op->mmap(pool->f, vma);
+}
+
+/**
+ * bus1_pool_write_iovec() - copy user memory to a slice
+ * @pool: pool to operate on
+ * @slice: slice to write to
+ * @offset: relative offset into slice memory
+ * @iov: iovec array, pointing to data to copy
+ * @n_iov: number of elements in @iov
+ * @total_len: total number of bytes to copy
+ *
+ * This copies the memory pointed to by @iov into the memory slice @slice at
+ * relative offset @offset (relative to begin of slice).
+ *
+ * Return: Number of bytes copied, negative error code on failure.
+ */
+ssize_t bus1_pool_write_iovec(struct bus1_pool *pool,
+ struct bus1_pool_slice *slice,
+ loff_t offset,
+ struct iovec *iov,
+ size_t n_iov,
+ size_t total_len)
+{
+ struct iov_iter iter;
+ ssize_t len;
+
+ if (WARN_ON(offset + total_len < offset) ||
+ WARN_ON(offset + total_len > slice->size) ||
+ WARN_ON(slice->ref_user))
+ return -EFAULT;
+ if (total_len < 1)
+ return 0;
+
+ offset += slice->offset;
+ iov_iter_init(&iter, WRITE, iov, n_iov, total_len);
+
+ len = vfs_iter_write(pool->f, &iter, &offset);
+
+ return (len >= 0 && len != total_len) ? -EFAULT : len;
+}
+
+/**
+ * bus1_pool_write_kvec() - copy kernel memory to a slice
+ * @pool: pool to operate on
+ * @slice: slice to write to
+ * @offset: relative offset into slice memory
+ * @iov: kvec array, pointing to data to copy
+ * @n_iov: number of elements in @iov
+ * @total_len: total number of bytes to copy
+ *
+ * This copies the memory pointed to by @iov into the memory slice @slice at
+ * relative offset @offset (relative to begin of slice).
+ *
+ * Return: Number of bytes copied, negative error code on failure.
+ */
+ssize_t bus1_pool_write_kvec(struct bus1_pool *pool,
+ struct bus1_pool_slice *slice,
+ loff_t offset,
+ struct kvec *iov,
+ size_t n_iov,
+ size_t total_len)
+{
+ struct iov_iter iter;
+ mm_segment_t old_fs;
+ ssize_t len;
+
+ if (WARN_ON(offset + total_len < offset) ||
+ WARN_ON(offset + total_len > slice->size) ||
+ WARN_ON(slice->ref_user))
+ return -EFAULT;
+ if (total_len < 1)
+ return 0;
+
+ offset += slice->offset;
+ iov_iter_kvec(&iter, WRITE | ITER_KVEC, iov, n_iov, total_len);
+
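+ /*
+ * The write path verifies buffer addresses against the current user
+ * address limit. Since @iov points at kernel memory here, temporarily
+ * lift the limit to KERNEL_DS for the duration of the write.
+ */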
+ old_fs = get_fs();
+ set_fs(get_ds());
+ len = vfs_iter_write(pool->f, &iter, &offset);
+ set_fs(old_fs);
+
+ return (len >= 0 && len != total_len) ? -EFAULT : len;
+}
diff --git a/ipc/bus1/util/pool.h b/ipc/bus1/util/pool.h
new file mode 100644
index 0000000..f0e369b
--- /dev/null
+++ b/ipc/bus1/util/pool.h
@@ -0,0 +1,164 @@
+#ifndef __BUS1_POOL_H
+#define __BUS1_POOL_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Pools
+ *
+ * A pool is a shmem-backed memory pool shared between userspace and the kernel.
+ * The pool is used to transfer memory from the kernel to userspace without
+ * requiring userspace to allocate the memory.
+ *
+ * The pool is managed in slices, which are published to userspace when they are
+ * ready to be read and must be released by userspace when userspace is done
+ * with them.
+ *
+ * Userspace has read-only access to its pools and the kernel has read-write
+ * access, but published slices are not altered.
+ */
+
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/types.h>
+
+struct file;
+struct iovec;
+struct kvec;
+
+/* internal: number of bits available to slice size */
+#define BUS1_POOL_SLICE_SIZE_BITS (29)
+#define BUS1_POOL_SLICE_SIZE_MAX ((1 << BUS1_POOL_SLICE_SIZE_BITS) - 1)
+
+/**
+ * struct bus1_pool_slice - pool slice
+ * @offset: relative offset in parent pool
+ * @size: slice size
+ * @free: whether this slice is in-use or not
+ * @ref_kernel: whether a kernel reference exists
+ * @ref_user: whether a user reference exists
+ * @entry: link into linear list of slices
+ * @rb: link to busy/free rb-tree
+ *
+ * Each chunk of memory in the pool is managed as a slice. A slice can be
+ * accessible by both the kernel and user-space, and their access rights are
+ * managed independently. As long as the kernel has a reference to a slice, its
+ * offset and size can be accessed freely and will not change. Once the kernel
+ * drops its reference, it must not access the slice, anymore.
+ *
+ * To allow user-space access, the slice must be published. Note that all
+ * slices are always readable by user-space, since the entire pool can be
+ * mapped. Publishing a slice merely marks it as referenced by user-space, so
+ * it will not be modified or removed while that reference is held. Once
+ * user-space releases its reference, it should no longer access the slice as
+ * it might be modified and/or overwritten by other data.
+ *
+ * A slice is only released once neither the kernel nor user-space holds a
+ * reference to it. The kernel reference can only be acquired/released once, but
+ * user-space references can be published/released several times. In particular,
+ * if the kernel retains a reference when a slice is published and later
+ * released by userspace, the same slice can be published again in the future.
+ *
+ * Note that slice references are not ref-counted; they are simple booleans.
+ * For the kernel side this is obvious, as no ref/unref functions are provided.
+ * User-space, however, must be aware that publishing the same slice several
+ * times does not increase a reference count.
+ */
+struct bus1_pool_slice {
+ u32 offset;
+
+ /* merge @size with flags to save 8 bytes per existing slice */
+ u32 size : BUS1_POOL_SLICE_SIZE_BITS;
+ u32 free : 1;
+ u32 ref_kernel : 1;
+ u32 ref_user : 1;
+
+ struct list_head entry;
+ struct rb_node rb;
+};
+
+/**
+ * struct bus1_pool - client pool
+ * @f: backing shmem file
+ * @allocated_size: currently allocated memory in bytes
+ * @slices: all slices sorted by address
+ * @slices_busy: tree of allocated slices
+ * @slices_free: tree of free slices
+ *
+ * A pool is used to allocate memory slices that can be shared between
+ * kernel-space and user-space. A pool is always backed by a shmem-file and puts
+ * a simple slice-allocator on top. User-space gets read-only access to the
+ * entire pool, kernel-space gets read/write access via accessor-functions.
+ *
+ * Pools are used to transfer large sets of data to user-space, without
+ * requiring a round-trip to ask user-space for a suitable memory chunk.
+ * Instead, the kernel simply allocates slices in the pool and tells user-space
+ * where it put the data.
+ *
+ * All pool operations must be serialized by the caller. No internal lock is
+ * provided. Slices can be queried/modified unlocked. But any pool operation
+ * (allocation, release, flush, ...) must be serialized.
+ */
+struct bus1_pool {
+ struct file *f;
+ size_t allocated_size;
+ struct list_head slices;
+ struct rb_root slices_busy;
+ struct rb_root slices_free;
+};
+
+#define BUS1_POOL_NULL ((struct bus1_pool){})
+
+int bus1_pool_init(struct bus1_pool *pool, const char *filename);
+void bus1_pool_deinit(struct bus1_pool *pool);
+
+struct bus1_pool_slice *bus1_pool_alloc(struct bus1_pool *pool, size_t size);
+struct bus1_pool_slice *bus1_pool_release_kernel(struct bus1_pool *pool,
+ struct bus1_pool_slice *slice);
+void bus1_pool_publish(struct bus1_pool *pool, struct bus1_pool_slice *slice);
+int bus1_pool_release_user(struct bus1_pool *pool,
+ size_t offset,
+ size_t *n_slicesp);
+void bus1_pool_flush(struct bus1_pool *pool, size_t *n_slicesp);
+int bus1_pool_mmap(struct bus1_pool *pool, struct vm_area_struct *vma);
+
+ssize_t bus1_pool_write_iovec(struct bus1_pool *pool,
+ struct bus1_pool_slice *slice,
+ loff_t offset,
+ struct iovec *iov,
+ size_t n_iov,
+ size_t total_len);
+ssize_t bus1_pool_write_kvec(struct bus1_pool *pool,
+ struct bus1_pool_slice *slice,
+ loff_t offset,
+ struct kvec *iov,
+ size_t n_iov,
+ size_t total_len);
+
+/**
+ * bus1_pool_slice_is_public() - check whether a slice is public
+ * @slice: slice to check
+ *
+ * This checks whether @slice is public. That is, bus1_pool_publish() has been
+ * called and the user has not released their reference, yet.
+ *
+ * Note that callers who need reliable results must make sure this cannot
+ * race calls to bus1_pool_publish() or bus1_pool_release_user().
+ *
+ * Return: True if public, false if not.
+ */
+static inline bool bus1_pool_slice_is_public(struct bus1_pool_slice *slice)
+{
+ WARN_ON(!slice->ref_kernel);
+ return slice->ref_user;
+}
+
+#endif /* __BUS1_POOL_H */
--
2.10.1

2016-10-26 19:22:27

by David Herrmann

Subject: [RFC v1 12/14] bus1: hook up file-operations

From: Tom Gundersen <[email protected]>

This hooks up all the file operations on a bus1 file descriptor. It
implements the ioctls as defined in the UAPI, as well as mmap() and
poll() support.
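
For illustration, here is a minimal user-space sketch (not part of the
patch) of the resulting file-descriptor lifecycle: open a peer, map its
pool read-only, wait for a message with poll(), dequeue it with
BUS1_CMD_RECV, and return the slice with BUS1_CMD_SLICE_RELEASE. It
assumes the default /dev/bus1 character device and the UAPI structures
exercised by the selftests later in this series; error handling is
omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/bus1.h>

static void recv_one_message(void)
{
        const size_t n_pool = 16UL * 1024UL * 1024UL;
        struct bus1_cmd_recv cmd_recv = { .max_offset = n_pool };
        struct pollfd pfd;
        const uint8_t *pool;
        int fd;

        /* each open() creates a new peer */
        fd = open("/dev/bus1", O_RDWR | O_CLOEXEC | O_NONBLOCK | O_NOCTTY);

        /* the peer's pool is mapped read-only into the caller */
        pool = mmap(NULL, n_pool, PROT_READ, MAP_SHARED, fd, 0);

        /* wait until a message is queued on the peer */
        pfd = (struct pollfd){ .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);

        /* dequeue the message; the payload lives in the pool */
        ioctl(fd, BUS1_CMD_RECV, &cmd_recv);
        if (cmd_recv.msg.type == BUS1_MSG_DATA) {
                /* consume pool + cmd_recv.msg.offset, cmd_recv.msg.n_bytes */
                ioctl(fd, BUS1_CMD_SLICE_RELEASE, &cmd_recv.msg.offset);
        }

        munmap((void *)pool, n_pool);
        close(fd);
}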

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/main.c | 46 +++
ipc/bus1/peer.c | 934 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
ipc/bus1/peer.h | 8 +
3 files changed, 987 insertions(+), 1 deletion(-)

diff --git a/ipc/bus1/main.c b/ipc/bus1/main.c
index 51034f3..d5a726a 100644
--- a/ipc/bus1/main.c
+++ b/ipc/bus1/main.c
@@ -14,10 +14,17 @@
#include <linux/init.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/seq_file.h>
+#include <linux/uio.h>
+#include <uapi/linux/bus1.h>
#include "main.h"
#include "peer.h"
#include "tests.h"
#include "user.h"
+#include "util/active.h"
+#include "util/pool.h"
+#include "util/queue.h"

static int bus1_fop_open(struct inode *inode, struct file *file)
{
@@ -37,6 +44,41 @@ static int bus1_fop_release(struct inode *inode, struct file *file)
return 0;
}

+static unsigned int bus1_fop_poll(struct file *file,
+ struct poll_table_struct *wait)
+{
+ struct bus1_peer *peer = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &peer->waitq, wait);
+
+ /* access queue->front unlocked */
+ rcu_read_lock();
+ if (bus1_active_is_deactivated(&peer->active)) {
+ mask = POLLHUP;
+ } else {
+ mask = POLLOUT | POLLWRNORM;
+ if (bus1_queue_is_readable_rcu(&peer->data.queue))
+ mask |= POLLIN | POLLRDNORM;
+ }
+ rcu_read_unlock();
+
+ return mask;
+}
+
+static int bus1_fop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct bus1_peer *peer = file->private_data;
+ int r;
+
+ if (!bus1_peer_acquire(peer))
+ return -ESHUTDOWN;
+
+ r = bus1_pool_mmap(&peer->data.pool, vma);
+ bus1_peer_release(peer);
+ return r;
+}
+
static void bus1_fop_show_fdinfo(struct seq_file *m, struct file *file)
{
struct bus1_peer *peer = file->private_data;
@@ -48,7 +90,11 @@ const struct file_operations bus1_fops = {
.owner = THIS_MODULE,
.open = bus1_fop_open,
.release = bus1_fop_release,
+ .poll = bus1_fop_poll,
.llseek = noop_llseek,
+ .mmap = bus1_fop_mmap,
+ .unlocked_ioctl = bus1_peer_ioctl,
+ .compat_ioctl = bus1_peer_ioctl,
.show_fdinfo = bus1_fop_show_fdinfo,
};

diff --git a/ipc/bus1/peer.c b/ipc/bus1/peer.c
index 0ff7a98..f0da4a7 100644
--- a/ipc/bus1/peer.c
+++ b/ipc/bus1/peer.c
@@ -23,11 +23,52 @@
#include <linux/uaccess.h>
#include <linux/uio.h>
#include <linux/wait.h>
+#include <uapi/linux/bus1.h>
+#include "handle.h"
#include "main.h"
+#include "message.h"
#include "peer.h"
+#include "tx.h"
#include "user.h"
#include "util.h"
#include "util/active.h"
+#include "util/pool.h"
+#include "util/queue.h"
+
+static struct bus1_queue_node *
+bus1_peer_free_qnode(struct bus1_queue_node *qnode)
+{
+ struct bus1_message *m;
+ struct bus1_handle *h;
+
+ /*
+ * Queue-nodes are generic entities that can only be destroyed by whoever
+ * created them. That is, they have no embedded release callback.
+ * Instead, we must detect them by type. Since the queue logic is kept
+ * generic, it cannot provide this helper. Instead, we have this small
+ * destructor here, which simply dispatches to the correct handler.
+ */
+
+ if (qnode) {
+ switch (bus1_queue_node_get_type(qnode)) {
+ case BUS1_MSG_DATA:
+ m = container_of(qnode, struct bus1_message, qnode);
+ bus1_message_unref(m);
+ break;
+ case BUS1_MSG_NODE_DESTROY:
+ case BUS1_MSG_NODE_RELEASE:
+ h = container_of(qnode, struct bus1_handle, qnode);
+ bus1_handle_unref(h);
+ break;
+ case BUS1_MSG_NONE:
+ default:
+ WARN(1, "Unknown message type\n");
+ break;
+ }
+ }
+
+ return NULL;
+}

/**
* bus1_peer_new() - allocate new peer
@@ -47,6 +88,7 @@ struct bus1_peer *bus1_peer_new(void)
const struct cred *cred = current_cred();
struct bus1_peer *peer;
struct bus1_user *user;
+ int r;

user = bus1_user_ref_by_uid(cred->uid);
if (IS_ERR(user))
@@ -75,9 +117,14 @@ struct bus1_peer *bus1_peer_new(void)

/* initialize peer-private section */
mutex_init(&peer->local.lock);
+ peer->local.seed = NULL;
peer->local.map_handles = RB_ROOT;
peer->local.handle_ids = 0;

+ r = bus1_pool_init(&peer->data.pool, KBUILD_MODNAME "-peer");
+ if (r < 0)
+ goto error;
+
if (!IS_ERR_OR_NULL(bus1_debugdir)) {
char idstr[22];

@@ -96,6 +143,103 @@ struct bus1_peer *bus1_peer_new(void)

bus1_active_activate(&peer->active);
return peer;
+
+error:
+ bus1_peer_free(peer);
+ return ERR_PTR(r);
+}
+
+static void bus1_peer_flush(struct bus1_peer *peer, u64 flags)
+{
+ struct bus1_queue_node *qlist, *qnode;
+ struct bus1_handle *h, *safe;
+ struct bus1_tx tx;
+ size_t n_slices;
+ u64 ts;
+ int n;
+
+ lockdep_assert_held(&peer->local.lock);
+
+ bus1_tx_init(&tx, peer);
+
+ if (flags & BUS1_PEER_RESET_FLAG_FLUSH) {
+ /* protect handles on the seed */
+ if (!(flags & BUS1_PEER_RESET_FLAG_FLUSH_SEED) &&
+ peer->local.seed) {
+ /*
+ * XXX: When the flush operation does not ask for a
+ * RESET of the seed, we want to protect the nodes
+ * that were instantiated with this seed.
+ * Right now, we do not support this, but rather
+ * treat all nodes as local nodes. If node
+ * injection will be supported one day, we should
+ * make sure to drop n_user of all seed-handles to
+ * 0 here, to make sure they're skipped in the
+ * mass-destruction below.
+ */
+ }
+
+ /* first destroy all live anchors */
+ mutex_lock(&peer->data.lock);
+ rbtree_postorder_for_each_entry_safe(h, safe,
+ &peer->local.map_handles,
+ rb_to_peer) {
+ if (!bus1_handle_is_anchor(h) ||
+ !bus1_handle_is_live(h))
+ continue;
+
+ bus1_handle_destroy_locked(h, &tx);
+ }
+ mutex_unlock(&peer->data.lock);
+
+ /* atomically commit the destruction transaction */
+ ts = bus1_tx_commit(&tx);
+
+ /* now release all user handles */
+ rbtree_postorder_for_each_entry_safe(h, safe,
+ &peer->local.map_handles,
+ rb_to_peer) {
+ n = atomic_xchg(&h->n_user, 0);
+ bus1_handle_forget_keep(h);
+
+ if (bus1_handle_is_anchor(h)) {
+ if (n > 1)
+ bus1_handle_release_n(h, n - 1, true);
+ bus1_handle_release(h, false);
+ } else {
+ bus1_handle_release_n(h, n, true);
+ }
+ }
+ peer->local.map_handles = RB_ROOT;
+
+ /* finally flush the queue and pool */
+ mutex_lock(&peer->data.lock);
+ qlist = bus1_queue_flush(&peer->data.queue, ts);
+ bus1_pool_flush(&peer->data.pool, &n_slices);
+ mutex_unlock(&peer->data.lock);
+
+ while ((qnode = qlist)) {
+ qlist = qnode->next;
+ qnode->next = NULL;
+ bus1_peer_free_qnode(qnode);
+ }
+ }
+
+ /* drop seed if requested */
+ if (flags & BUS1_PEER_RESET_FLAG_FLUSH_SEED)
+ peer->local.seed = bus1_message_unref(peer->local.seed);
+
+ bus1_tx_deinit(&tx);
+}
+
+static void bus1_peer_cleanup(struct bus1_active *a, void *userdata)
+{
+ struct bus1_peer *peer = container_of(a, struct bus1_peer, active);
+
+ mutex_lock(&peer->local.lock);
+ bus1_peer_flush(peer, BUS1_PEER_RESET_FLAG_FLUSH |
+ BUS1_PEER_RESET_FLAG_FLUSH_SEED);
+ mutex_unlock(&peer->local.lock);
}

static int bus1_peer_disconnect(struct bus1_peer *peer)
@@ -104,7 +248,7 @@ static int bus1_peer_disconnect(struct bus1_peer *peer)
bus1_active_drain(&peer->active, &peer->waitq);

if (!bus1_active_cleanup(&peer->active, &peer->waitq,
- NULL, NULL))
+ bus1_peer_cleanup, NULL))
return -ESHUTDOWN;

return 0;
@@ -133,6 +277,7 @@ struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)

/* deinitialize peer-private section */
WARN_ON(!RB_EMPTY_ROOT(&peer->local.map_handles));
+ WARN_ON(peer->local.seed);
mutex_destroy(&peer->local.lock);

/* deinitialize data section */
@@ -150,3 +295,790 @@ struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)

return NULL;
}
+
+static int bus1_peer_ioctl_peer_query(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ struct bus1_cmd_peer_reset __user *uparam = (void __user *)arg;
+ struct bus1_cmd_peer_reset param;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_PEER_QUERY) != sizeof(param));
+
+ if (copy_from_user(&param, uparam, sizeof(param)))
+ return -EFAULT;
+ if (unlikely(param.flags))
+ return -EINVAL;
+
+ mutex_lock(&peer->local.lock);
+ param.peer_flags = peer->flags & BUS1_PEER_FLAG_WANT_SECCTX;
+ param.max_slices = -1;
+ param.max_handles = -1;
+ param.max_inflight_bytes = -1;
+ param.max_inflight_fds = -1;
+ mutex_unlock(&peer->local.lock);
+
+ return copy_to_user(uparam, &param, sizeof(param)) ? -EFAULT : 0;
+}
+
+static int bus1_peer_ioctl_peer_reset(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ struct bus1_cmd_peer_reset __user *uparam = (void __user *)arg;
+ struct bus1_cmd_peer_reset param;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_PEER_RESET) != sizeof(param));
+
+ if (copy_from_user(&param, uparam, sizeof(param)))
+ return -EFAULT;
+ if (unlikely(param.flags & ~(BUS1_PEER_RESET_FLAG_FLUSH |
+ BUS1_PEER_RESET_FLAG_FLUSH_SEED)))
+ return -EINVAL;
+ if (unlikely(param.peer_flags != -1 &&
+ (param.peer_flags & ~BUS1_PEER_FLAG_WANT_SECCTX)))
+ return -EINVAL;
+ if (unlikely(param.max_slices != -1 ||
+ param.max_handles != -1 ||
+ param.max_inflight_bytes != -1 ||
+ param.max_inflight_fds != -1))
+ return -EINVAL;
+
+ mutex_lock(&peer->local.lock);
+
+ if (param.peer_flags != -1)
+ peer->flags = param.peer_flags;
+
+ bus1_peer_flush(peer, param.flags);
+
+ mutex_unlock(&peer->local.lock);
+
+ return 0;
+}
+
+static int bus1_peer_ioctl_handle_release(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ struct bus1_handle *h = NULL;
+ bool is_new, strong = true;
+ u64 id;
+ int r;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_HANDLE_RELEASE) != sizeof(id));
+
+ if (get_user(id, (const u64 __user *)arg))
+ return -EFAULT;
+
+ mutex_lock(&peer->local.lock);
+
+ h = bus1_handle_import(peer, id, &is_new);
+ if (IS_ERR(h)) {
+ r = PTR_ERR(h);
+ goto exit;
+ }
+
+ if (is_new) {
+ /*
+ * A handle is non-public only if the import lazily created the
+ * node. In that case the node is live and the last reference
+ * cannot be dropped until the node is destroyed. Hence, we
+ * return EBUSY.
+ *
+ * Since we did not modify the node, and the node was lazily
+ * created, there is no point in keeping the node allocated. We
+ * simply pretend we didn't allocate it so the next operation
+ * will just do the lazy allocation again.
+ */
+ bus1_handle_forget(h);
+ r = -EBUSY;
+ goto exit;
+ }
+
+ if (atomic_read(&h->n_user) == 1 && bus1_handle_is_anchor(h)) {
+ if (bus1_handle_is_live(h)) {
+ r = -EBUSY;
+ goto exit;
+ }
+
+ strong = false;
+ }
+
+ WARN_ON(atomic_dec_return(&h->n_user) < 0);
+ bus1_handle_forget(h);
+ bus1_handle_release(h, strong);
+
+ r = 0;
+
+exit:
+ mutex_unlock(&peer->local.lock);
+ bus1_handle_unref(h);
+ return r;
+}
+
+static int bus1_peer_transfer(struct bus1_peer *src,
+ struct bus1_peer *dst,
+ struct bus1_cmd_handle_transfer *param)
+{
+ struct bus1_handle *src_h = NULL, *dst_h = NULL;
+ bool is_new;
+ int r;
+
+ bus1_mutex_lock2(&src->local.lock, &dst->local.lock);
+
+ src_h = bus1_handle_import(src, param->src_handle, &is_new);
+ if (IS_ERR(src_h)) {
+ r = PTR_ERR(src_h);
+ src_h = NULL;
+ goto exit;
+ }
+
+ if (!bus1_handle_is_live(src_h)) {
+ /*
+ * If @src_h has a destruction queued, we cannot guarantee that
+ * we can join the transaction. Hence, we bail out and tell the
+ * caller that the node is already destroyed.
+ *
+ * In case @src_h->anchor is on one of the peers involved, this
+ * is properly synchronized. However, if it is a 3rd party node
+ * then it might not be committed, yet.
+ *
+ * XXX: We really ought to settle on the destruction. This
+ * requires some waitq to settle on, though.
+ */
+ param->dst_handle = BUS1_HANDLE_INVALID;
+ r = 0;
+ goto exit;
+ }
+
+ dst_h = bus1_handle_ref_by_other(dst, src_h);
+ if (!dst_h) {
+ dst_h = bus1_handle_new_remote(dst, src_h);
+ if (IS_ERR(dst_h)) {
+ r = PTR_ERR(dst_h);
+ dst_h = NULL;
+ goto exit;
+ }
+ }
+
+ if (is_new) {
+ WARN_ON(src_h != bus1_handle_acquire(src_h, false));
+ WARN_ON(atomic_inc_return(&src_h->n_user) != 1);
+ }
+
+ dst_h = bus1_handle_acquire(dst_h, true);
+ param->dst_handle = bus1_handle_identify(dst_h);
+ bus1_handle_export(dst_h);
+ WARN_ON(atomic_inc_return(&dst_h->n_user) < 1);
+
+ r = 0;
+
+exit:
+ bus1_handle_forget(src_h);
+ bus1_mutex_unlock2(&src->local.lock, &dst->local.lock);
+ bus1_handle_unref(dst_h);
+ bus1_handle_unref(src_h);
+ return r;
+}
+
+static int bus1_peer_ioctl_handle_transfer(struct bus1_peer *src,
+ unsigned long arg)
+{
+ struct bus1_cmd_handle_transfer __user *uparam = (void __user *)arg;
+ struct bus1_cmd_handle_transfer param;
+ struct bus1_peer *dst = NULL;
+ struct fd dst_f;
+ int r;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_HANDLE_TRANSFER) != sizeof(param));
+
+ if (copy_from_user(&param, (void __user *)arg, sizeof(param)))
+ return -EFAULT;
+ if (unlikely(param.flags))
+ return -EINVAL;
+
+ if (param.dst_fd != -1) {
+ dst_f = fdget(param.dst_fd);
+ if (!dst_f.file)
+ return -EBADF;
+ if (dst_f.file->f_op != &bus1_fops) {
+ fdput(dst_f);
+ return -EOPNOTSUPP;
+ }
+
+ dst = bus1_peer_acquire(dst_f.file->private_data);
+ fdput(dst_f);
+ if (!dst)
+ return -ESHUTDOWN;
+ }
+
+ r = bus1_peer_transfer(src, dst ?: src, &param);
+ bus1_peer_release(dst);
+ if (r < 0)
+ return r;
+
+ return copy_to_user(uparam, &param, sizeof(param)) ? -EFAULT : 0;
+}
+
+static int bus1_peer_ioctl_nodes_destroy(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ struct bus1_cmd_nodes_destroy param;
+ size_t n_charge = 0, n_discharge = 0;
+ struct bus1_handle *h, *list = BUS1_TAIL;
+ const u64 __user *ptr_nodes;
+ struct bus1_tx tx;
+ bool is_new;
+ u64 i, id;
+ int r;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_NODES_DESTROY) != sizeof(param));
+
+ if (copy_from_user(&param, (void __user *)arg, sizeof(param)))
+ return -EFAULT;
+ if (unlikely(param.flags & ~BUS1_NODES_DESTROY_FLAG_RELEASE_HANDLES))
+ return -EINVAL;
+ if (unlikely(param.ptr_nodes != (u64)(unsigned long)param.ptr_nodes))
+ return -EFAULT;
+
+ mutex_lock(&peer->local.lock);
+
+ bus1_tx_init(&tx, peer);
+ ptr_nodes = (const u64 __user *)(unsigned long)param.ptr_nodes;
+
+ for (i = 0; i < param.n_nodes; ++i) {
+ if (get_user(id, ptr_nodes + i)) {
+ r = -EFAULT;
+ goto exit;
+ }
+
+ h = bus1_handle_import(peer, id, &is_new);
+ if (IS_ERR(h)) {
+ r = PTR_ERR(h);
+ goto exit;
+ }
+
+ if (h->tlink) {
+ bus1_handle_unref(h);
+ r = -ENOTUNIQ;
+ goto exit;
+ }
+
+ h->tlink = list;
+ list = h;
+
+ if (!bus1_handle_is_anchor(h)) {
+ r = -EREMOTE;
+ goto exit;
+ }
+
+ if (!bus1_handle_is_live(h)) {
+ r = -ESTALE;
+ goto exit;
+ }
+
+ if (is_new)
+ ++n_charge;
+ }
+
+ /* nothing below this point can fail, anymore */
+
+ mutex_lock(&peer->data.lock);
+ for (h = list; h != BUS1_TAIL; h = h->tlink) {
+ if (!bus1_handle_is_public(h)) {
+ WARN_ON(h != bus1_handle_acquire_locked(h, false));
+ WARN_ON(atomic_inc_return(&h->n_user) != 1);
+ }
+
+ bus1_handle_destroy_locked(h, &tx);
+ }
+ mutex_unlock(&peer->data.lock);
+
+ bus1_tx_commit(&tx);
+
+ while (list != BUS1_TAIL) {
+ h = list;
+ list = h->tlink;
+ h->tlink = NULL;
+
+ if (param.flags & BUS1_NODES_DESTROY_FLAG_RELEASE_HANDLES) {
+ ++n_discharge;
+ if (atomic_dec_return(&h->n_user) == 0) {
+ bus1_handle_forget(h);
+ bus1_handle_release(h, false);
+ } else {
+ bus1_handle_release(h, true);
+ }
+ }
+
+ bus1_handle_unref(h);
+ }
+
+ r = 0;
+
+exit:
+ while (list != BUS1_TAIL) {
+ h = list;
+ list = h->tlink;
+ h->tlink = NULL;
+
+ bus1_handle_forget(h);
+ bus1_handle_unref(h);
+ }
+ bus1_tx_deinit(&tx);
+ mutex_unlock(&peer->local.lock);
+ return r;
+}
+
+static int bus1_peer_ioctl_slice_release(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ size_t n_slices = 0;
+ u64 offset;
+ int r;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_SLICE_RELEASE) != sizeof(offset));
+
+ if (get_user(offset, (const u64 __user *)arg))
+ return -EFAULT;
+
+ mutex_lock(&peer->data.lock);
+ r = bus1_pool_release_user(&peer->data.pool, offset, &n_slices);
+ mutex_unlock(&peer->data.lock);
+ return r;
+}
+
+static struct bus1_message *bus1_peer_new_message(struct bus1_peer *peer,
+ struct bus1_factory *f,
+ u64 id)
+{
+ struct bus1_message *m = NULL;
+ struct bus1_handle *h = NULL;
+ struct bus1_peer *p = NULL;
+ bool is_new;
+ int r;
+
+ h = bus1_handle_import(peer, id, &is_new);
+ if (IS_ERR(h))
+ return ERR_CAST(h);
+
+ if (h->tlink) {
+ r = -ENOTUNIQ;
+ goto error;
+ }
+
+ if (bus1_handle_is_anchor(h))
+ p = bus1_peer_acquire(peer);
+ else
+ p = bus1_handle_acquire_owner(h);
+ if (!p) {
+ r = -ESHUTDOWN;
+ goto error;
+ }
+
+ m = bus1_factory_instantiate(f, h, p);
+ if (IS_ERR(m)) {
+ r = PTR_ERR(m);
+ goto error;
+ }
+
+ /* marker to detect duplicates */
+ h->tlink = BUS1_TAIL;
+
+ /* m->dst pins the handle for us */
+ bus1_handle_unref(h);
+
+ /* merge charge into factory (which shares the lookup with us) */
+ if (is_new)
+ ++f->n_handles_charge;
+
+ return m;
+
+error:
+ bus1_peer_release(p);
+ if (is_new)
+ bus1_handle_forget(h);
+ bus1_handle_unref(h);
+ return ERR_PTR(r);
+}
+
+static int bus1_peer_ioctl_send(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ struct bus1_queue_node *mlist = NULL;
+ struct bus1_factory *factory = NULL;
+ const u64 __user *ptr_destinations;
+ struct bus1_cmd_send param;
+ struct bus1_message *m;
+ struct bus1_peer *p;
+ size_t i, n_charge = 0;
+ struct bus1_tx tx;
+ u8 stack[512];
+ u64 id;
+ int r;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_SEND) != sizeof(param));
+
+ if (copy_from_user(&param, (void __user *)arg, sizeof(param)))
+ return -EFAULT;
+ if (unlikely(param.flags & ~(BUS1_SEND_FLAG_CONTINUE |
+ BUS1_SEND_FLAG_SEED)))
+ return -EINVAL;
+
+ /* check basic limits; avoids integer-overflows later on */
+ if (unlikely(param.n_destinations > INT_MAX) ||
+ unlikely(param.n_vecs > UIO_MAXIOV) ||
+ unlikely(param.n_fds > BUS1_FD_MAX))
+ return -EMSGSIZE;
+
+ /* 32bit pointer validity checks */
+ if (unlikely(param.ptr_destinations !=
+ (u64)(unsigned long)param.ptr_destinations) ||
+ unlikely(param.ptr_errors !=
+ (u64)(unsigned long)param.ptr_errors) ||
+ unlikely(param.ptr_vecs !=
+ (u64)(unsigned long)param.ptr_vecs) ||
+ unlikely(param.ptr_handles !=
+ (u64)(unsigned long)param.ptr_handles) ||
+ unlikely(param.ptr_fds !=
+ (u64)(unsigned long)param.ptr_fds))
+ return -EFAULT;
+
+ mutex_lock(&peer->local.lock);
+
+ bus1_tx_init(&tx, peer);
+ ptr_destinations =
+ (const u64 __user *)(unsigned long)param.ptr_destinations;
+
+ factory = bus1_factory_new(peer, &param, stack, sizeof(stack));
+ if (IS_ERR(factory)) {
+ r = PTR_ERR(factory);
+ factory = NULL;
+ goto exit;
+ }
+
+ if (param.flags & BUS1_SEND_FLAG_SEED) {
+ if (unlikely((param.flags & BUS1_SEND_FLAG_CONTINUE) ||
+ param.n_destinations)) {
+ r = -EINVAL;
+ goto exit;
+ }
+
+ /* XXX: set seed */
+ r = -ENOTSUPP;
+ goto exit;
+ } else {
+ for (i = 0; i < param.n_destinations; ++i) {
+ if (get_user(id, ptr_destinations + i)) {
+ r = -EFAULT;
+ goto exit;
+ }
+
+ m = bus1_peer_new_message(peer, factory, id);
+ if (IS_ERR(m)) {
+ r = PTR_ERR(m);
+ goto exit;
+ }
+
+ if (!bus1_handle_is_public(m->dst))
+ ++n_charge;
+
+ m->qnode.next = mlist;
+ mlist = &m->qnode;
+ }
+
+ r = bus1_factory_seal(factory);
+ if (r < 0)
+ goto exit;
+
+ /*
+ * Now everything is prepared, charged, and pinned. Iterate
+ * each message, acquire references, and stage the message.
+ * From here on, we must not error out, anymore.
+ */
+
+ while (mlist) {
+ m = container_of(mlist, struct bus1_message, qnode);
+ mlist = m->qnode.next;
+ m->qnode.next = NULL;
+
+ if (!bus1_handle_is_public(m->dst)) {
+ --factory->n_handles_charge;
+ WARN_ON(m->dst != bus1_handle_acquire(m->dst,
+ false));
+ WARN_ON(atomic_inc_return(&m->dst->n_user)
+ != 1);
+ }
+
+ m->dst->tlink = NULL;
+
+ /* this consumes @m and @m->qnode.owner */
+ bus1_message_stage(m, &tx);
+ }
+
+ WARN_ON(factory->n_handles_charge != 0);
+ bus1_tx_commit(&tx);
+ }
+
+ r = 0;
+
+exit:
+ while (mlist) {
+ m = container_of(mlist, struct bus1_message, qnode);
+ mlist = m->qnode.next;
+ m->qnode.next = NULL;
+
+ p = m->qnode.owner;
+ m->dst->tlink = NULL;
+
+ bus1_handle_forget(m->dst);
+ bus1_message_unref(m);
+ bus1_peer_release(p);
+ }
+ bus1_factory_free(factory);
+ bus1_tx_deinit(&tx);
+ mutex_unlock(&peer->local.lock);
+ return r;
+}
+
+static struct bus1_queue_node *bus1_peer_peek(struct bus1_peer *peer,
+ struct bus1_cmd_recv *param,
+ bool *morep)
+{
+ struct bus1_queue_node *qnode;
+ struct bus1_message *m;
+ struct bus1_handle *h;
+ u64 ts;
+
+ lockdep_assert_held(&peer->local.lock);
+
+ if (unlikely(param->flags & BUS1_RECV_FLAG_SEED)) {
+ if (!peer->local.seed)
+ return ERR_PTR(-EAGAIN);
+
+ *morep = false;
+ return &peer->local.seed->qnode;
+ }
+
+ mutex_lock(&peer->data.lock);
+ while ((qnode = bus1_queue_peek(&peer->data.queue, morep))) {
+ switch (bus1_queue_node_get_type(qnode)) {
+ case BUS1_MSG_DATA:
+ m = container_of(qnode, struct bus1_message, qnode);
+ h = m->dst;
+ break;
+ case BUS1_MSG_NODE_DESTROY:
+ case BUS1_MSG_NODE_RELEASE:
+ m = NULL;
+ h = container_of(qnode, struct bus1_handle, qnode);
+ break;
+ case BUS1_MSG_NONE:
+ default:
+ mutex_unlock(&peer->data.lock);
+ WARN(1, "Unknown message type\n");
+ return ERR_PTR(-ENOTRECOVERABLE);
+ }
+
+ ts = bus1_queue_node_get_timestamp(qnode);
+ if (ts <= peer->data.queue.flush ||
+ !bus1_handle_is_public(h) ||
+ !bus1_handle_is_live_at(h, ts)) {
+ bus1_queue_remove(&peer->data.queue, &peer->waitq,
+ qnode);
+ if (m) {
+ mutex_unlock(&peer->data.lock);
+ bus1_message_unref(m);
+ mutex_lock(&peer->data.lock);
+ } else {
+ bus1_handle_unref(h);
+ }
+
+ continue;
+ }
+
+ if (!m && !(param->flags & BUS1_RECV_FLAG_PEEK))
+ bus1_queue_remove(&peer->data.queue, &peer->waitq,
+ qnode);
+
+ break;
+ }
+ mutex_unlock(&peer->data.lock);
+
+ return qnode ?: ERR_PTR(-EAGAIN);
+}
+
+static int bus1_peer_ioctl_recv(struct bus1_peer *peer,
+ unsigned long arg)
+{
+ struct bus1_queue_node *qnode = NULL;
+ struct bus1_cmd_recv param;
+ struct bus1_message *m;
+ struct bus1_handle *h;
+ unsigned int type;
+ bool more = false;
+ int r;
+
+ BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_RECV) != sizeof(param));
+
+ if (copy_from_user(&param, (void __user *)arg, sizeof(param)))
+ return -EFAULT;
+ if (unlikely(param.flags & ~(BUS1_RECV_FLAG_PEEK |
+ BUS1_RECV_FLAG_SEED |
+ BUS1_RECV_FLAG_INSTALL_FDS)))
+ return -EINVAL;
+
+ mutex_lock(&peer->local.lock);
+
+ qnode = bus1_peer_peek(peer, &param, &more);
+ if (IS_ERR(qnode)) {
+ r = PTR_ERR(qnode);
+ goto exit;
+ }
+
+ type = bus1_queue_node_get_type(qnode);
+ switch (type) {
+ case BUS1_MSG_DATA:
+ m = container_of(qnode, struct bus1_message, qnode);
+ WARN_ON(m->dst->id == BUS1_HANDLE_INVALID);
+
+ if (param.max_offset < m->slice->offset + m->slice->size) {
+ r = -ERANGE;
+ goto exit;
+ }
+
+ r = bus1_message_install(m, &param);
+ if (r < 0)
+ goto exit;
+
+ param.msg.type = BUS1_MSG_DATA;
+ param.msg.flags = m->flags;
+ param.msg.destination = m->dst->id;
+ param.msg.uid = m->uid;
+ param.msg.gid = m->gid;
+ param.msg.pid = m->pid;
+ param.msg.tid = m->tid;
+ param.msg.offset = m->slice->offset;
+ param.msg.n_bytes = m->n_bytes;
+ param.msg.n_handles = m->n_handles;
+ param.msg.n_fds = m->n_files;
+ param.msg.n_secctx = m->n_secctx;
+
+ if (likely(!(param.flags & BUS1_RECV_FLAG_PEEK))) {
+ if (unlikely(param.flags & BUS1_RECV_FLAG_SEED)) {
+ peer->local.seed = NULL;
+ } else {
+ mutex_lock(&peer->data.lock);
+ bus1_queue_remove(&peer->data.queue,
+ &peer->waitq, qnode);
+ mutex_unlock(&peer->data.lock);
+ }
+ bus1_message_unref(m);
+ }
+ break;
+ case BUS1_MSG_NODE_DESTROY:
+ case BUS1_MSG_NODE_RELEASE:
+ h = container_of(qnode, struct bus1_handle, qnode);
+ WARN_ON(h->id == BUS1_HANDLE_INVALID);
+
+ param.msg.type = type;
+ param.msg.flags = 0;
+ param.msg.destination = h->id;
+ param.msg.uid = -1;
+ param.msg.gid = -1;
+ param.msg.pid = 0;
+ param.msg.tid = 0;
+ param.msg.offset = BUS1_OFFSET_INVALID;
+ param.msg.n_bytes = 0;
+ param.msg.n_handles = 0;
+ param.msg.n_fds = 0;
+ param.msg.n_secctx = 0;
+
+ if (likely(!(param.flags & BUS1_RECV_FLAG_PEEK)))
+ bus1_handle_unref(h);
+ break;
+ case BUS1_MSG_NONE:
+ default:
+ WARN(1, "Unknown message type\n");
+ r = -ENOTRECOVERABLE;
+ goto exit;
+ }
+
+ if (more)
+ param.msg.flags |= BUS1_MSG_FLAG_CONTINUE;
+
+ if (copy_to_user((void __user *)arg, &param, sizeof(param)))
+ r = -EFAULT;
+ else
+ r = 0;
+
+exit:
+ mutex_unlock(&peer->local.lock);
+ return r;
+}
+
+/**
+ * bus1_peer_ioctl() - handle peer ioctls
+ * @file: file the ioctl is called on
+ * @cmd: ioctl command
+ * @arg: ioctl argument
+ *
+ * This handles the given ioctl (cmd+arg) on a peer. This expects the peer to
+ * be stored in the private_data field of @file.
+ *
+ * Multiple ioctls can be issued in parallel; no locking by the caller is needed.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+long bus1_peer_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct bus1_peer *peer = file->private_data;
+ int r;
+
+ /*
+ * First handle ioctls that do not require an active-reference, then
+ * all the remaining ones wrapped in an active reference.
+ */
+ switch (cmd) {
+ case BUS1_CMD_PEER_DISCONNECT:
+ if (unlikely(arg))
+ return -EINVAL;
+
+ r = bus1_peer_disconnect(peer);
+ break;
+ default:
+ if (!bus1_peer_acquire(peer))
+ return -ESHUTDOWN;
+
+ switch (cmd) {
+ case BUS1_CMD_PEER_QUERY:
+ r = bus1_peer_ioctl_peer_query(peer, arg);
+ break;
+ case BUS1_CMD_PEER_RESET:
+ r = bus1_peer_ioctl_peer_reset(peer, arg);
+ break;
+ case BUS1_CMD_HANDLE_RELEASE:
+ r = bus1_peer_ioctl_handle_release(peer, arg);
+ break;
+ case BUS1_CMD_HANDLE_TRANSFER:
+ r = bus1_peer_ioctl_handle_transfer(peer, arg);
+ break;
+ case BUS1_CMD_NODES_DESTROY:
+ r = bus1_peer_ioctl_nodes_destroy(peer, arg);
+ break;
+ case BUS1_CMD_SLICE_RELEASE:
+ r = bus1_peer_ioctl_slice_release(peer, arg);
+ break;
+ case BUS1_CMD_SEND:
+ r = bus1_peer_ioctl_send(peer, arg);
+ break;
+ case BUS1_CMD_RECV:
+ r = bus1_peer_ioctl_recv(peer, arg);
+ break;
+ default:
+ r = -ENOTTY;
+ break;
+ }
+
+ bus1_peer_release(peer);
+ break;
+ }
+
+ return r;
+}
diff --git a/ipc/bus1/peer.h b/ipc/bus1/peer.h
index 5eb558f..26c051f 100644
--- a/ipc/bus1/peer.h
+++ b/ipc/bus1/peer.h
@@ -52,11 +52,13 @@
#include <linux/rcupdate.h>
#include <linux/rbtree.h>
#include <linux/wait.h>
+#include <uapi/linux/bus1.h>
#include "user.h"
#include "util/active.h"
#include "util/pool.h"
#include "util/queue.h"

+struct bus1_message;
struct cred;
struct dentry;
struct pid_namespace;
@@ -73,8 +75,12 @@ struct pid_namespace;
* @active: active references
* @debugdir: debugfs root of this peer, or NULL/ERR_PTR
* @data.lock: data lock
+ * @data.pool: data pool
* @data.queue: message queue
* @local.lock: local peer runtime lock
+ * @local.seed: pinned seed message
+ * @local.map_handles: map of owned handles (by handle ID)
+ * @local.handle_ids: handle ID allocator
*/
struct bus1_peer {
u64 id;
@@ -95,6 +101,7 @@ struct bus1_peer {

struct {
struct mutex lock;
+ struct bus1_message *seed;
struct rb_root map_handles;
u64 handle_ids;
} local;
@@ -102,6 +109,7 @@ struct bus1_peer {

struct bus1_peer *bus1_peer_new(void);
struct bus1_peer *bus1_peer_free(struct bus1_peer *peer);
+long bus1_peer_ioctl(struct file *file, unsigned int cmd, unsigned long arg);

/**
* bus1_peer_acquire() - acquire active reference to peer
--
2.10.1

2016-10-26 19:22:34

by David Herrmann

Subject: [RFC v1 14/14] bus1: basic user-space kselftests

From: Tom Gundersen <[email protected]>

This adds kselftests integration and provides some basic API tests for
bus1.
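
As a rough illustration of how further tests would plug into these
helpers (a sketch only, not part of the patch): test.h provides
test_open()/test_close() for setting up a peer with a mapped pool, and
bus1-ioctl.h wraps the individual ioctls, so a new test case typically
looks like this:

/* hypothetical extra test: query the peer and expect success */
static void test_api_query(void)
{
        struct bus1_cmd_peer_reset cmd = { .flags = 0 };
        const uint8_t *map;
        size_t n_map;
        int r, fd;

        fd = test_open(&map, &n_map);

        /* the kernel fills in peer_flags and the max_* fields */
        r = bus1_ioctl_peer_query(fd, &cmd);
        assert(r >= 0);

        test_close(fd, map, n_map);
}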

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
tools/testing/selftests/bus1/.gitignore | 2 +
tools/testing/selftests/bus1/Makefile | 19 ++
tools/testing/selftests/bus1/bus1-ioctl.h | 111 +++++++
tools/testing/selftests/bus1/test-api.c | 532 ++++++++++++++++++++++++++++++
tools/testing/selftests/bus1/test-io.c | 198 +++++++++++
tools/testing/selftests/bus1/test.h | 114 +++++++
6 files changed, 976 insertions(+)
create mode 100644 tools/testing/selftests/bus1/.gitignore
create mode 100644 tools/testing/selftests/bus1/Makefile
create mode 100644 tools/testing/selftests/bus1/bus1-ioctl.h
create mode 100644 tools/testing/selftests/bus1/test-api.c
create mode 100644 tools/testing/selftests/bus1/test-io.c
create mode 100644 tools/testing/selftests/bus1/test.h

diff --git a/tools/testing/selftests/bus1/.gitignore b/tools/testing/selftests/bus1/.gitignore
new file mode 100644
index 0000000..76ecb9c
--- /dev/null
+++ b/tools/testing/selftests/bus1/.gitignore
@@ -0,0 +1,2 @@
+test-api
+test-io
diff --git a/tools/testing/selftests/bus1/Makefile b/tools/testing/selftests/bus1/Makefile
new file mode 100644
index 0000000..cbcf689
--- /dev/null
+++ b/tools/testing/selftests/bus1/Makefile
@@ -0,0 +1,19 @@
+# Makefile for bus1 selftests
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS += -D_FILE_OFFSET_BITS=64 -Wall -g -O2
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+
+TEST_PROGS := test-api test-io
+
+all: $(TEST_PROGS)
+
+%: %.c bus1-ioctl.h test.h ../../../../usr/include/linux/bus1.h
+ $(CC) $(CFLAGS) $< -o $@
+
+include ../lib.mk
+
+clean:
+ $(RM) $(TEST_PROGS)
diff --git a/tools/testing/selftests/bus1/bus1-ioctl.h b/tools/testing/selftests/bus1/bus1-ioctl.h
new file mode 100644
index 0000000..552bd5d
--- /dev/null
+++ b/tools/testing/selftests/bus1/bus1-ioctl.h
@@ -0,0 +1,111 @@
+#pragma once
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#include <assert.h>
+#include <inttypes.h>
+#include <linux/bus1.h>
+#include <stdlib.h>
+#include <sys/uio.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+static inline int
+bus1_ioctl(int fd, unsigned int cmd, void *arg)
+{
+ return (ioctl(fd, cmd, arg) >= 0) ? 0 : -errno;
+}
+
+static inline int
+bus1_ioctl_peer_disconnect(int fd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_PEER_DISCONNECT) == sizeof(uint64_t),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_PEER_DISCONNECT, NULL);
+}
+
+static inline int
+bus1_ioctl_peer_query(int fd, struct bus1_cmd_peer_reset *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_PEER_QUERY) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_PEER_QUERY, cmd);
+}
+
+static inline int
+bus1_ioctl_peer_reset(int fd, struct bus1_cmd_peer_reset *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_PEER_RESET) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_PEER_RESET, cmd);
+}
+
+static inline int
+bus1_ioctl_handle_release(int fd, uint64_t *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_HANDLE_RELEASE) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_HANDLE_RELEASE, cmd);
+}
+
+static inline int
+bus1_ioctl_handle_transfer(int fd, struct bus1_cmd_handle_transfer *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_HANDLE_TRANSFER) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_HANDLE_TRANSFER, cmd);
+}
+
+static inline int
+bus1_ioctl_nodes_destroy(int fd, struct bus1_cmd_nodes_destroy *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_NODES_DESTROY) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_NODES_DESTROY, cmd);
+}
+
+static inline int
+bus1_ioctl_slice_release(int fd, uint64_t *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_SLICE_RELEASE) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_SLICE_RELEASE, cmd);
+}
+
+static inline int
+bus1_ioctl_send(int fd, struct bus1_cmd_send *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_SEND) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_SEND, cmd);
+}
+
+static inline int
+bus1_ioctl_recv(int fd, struct bus1_cmd_recv *cmd)
+{
+ static_assert(_IOC_SIZE(BUS1_CMD_RECV) == sizeof(*cmd),
+ "ioctl is called with invalid argument size");
+
+ return bus1_ioctl(fd, BUS1_CMD_RECV, cmd);
+}
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/tools/testing/selftests/bus1/test-api.c b/tools/testing/selftests/bus1/test-api.c
new file mode 100644
index 0000000..a289197
--- /dev/null
+++ b/tools/testing/selftests/bus1/test-api.c
@@ -0,0 +1,532 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include "test.h"
+
+/* make sure /dev/busX exists, is a cdev and accessible */
+static void test_api_cdev(void)
+{
+ const uint8_t *map;
+ struct stat st;
+ size_t n_map;
+ int r, fd;
+
+ r = access(test_path, F_OK);
+ assert(r >= 0);
+
+ r = stat(test_path, &st);
+ assert(r >= 0);
+ assert((st.st_mode & S_IFMT) == S_IFCHR);
+
+ r = open(test_path, O_RDWR | O_CLOEXEC | O_NONBLOCK | O_NOCTTY);
+ assert(r >= 0);
+ close(r);
+
+ fd = test_open(&map, &n_map);
+ test_close(fd, map, n_map);
+}
+
+/* make sure basic connect works */
+static void test_api_connect(void)
+{
+ struct bus1_cmd_peer_reset cmd_reset = {
+ .flags = 0,
+ .peer_flags = -1,
+ .max_slices = -1,
+ .max_handles = -1,
+ .max_inflight_bytes = -1,
+ .max_inflight_fds = -1,
+ };
+ const uint8_t *map1;
+ size_t n_map1;
+ int r, fd1;
+
+ /* create @fd1 */
+
+ fd1 = test_open(&map1, &n_map1);
+
+ /* test empty RESET */
+
+ r = bus1_ioctl_peer_reset(fd1, &cmd_reset);
+ assert(r >= 0);
+
+ /* test DISCONNECT and verify ESHUTDOWN afterwards */
+
+ r = bus1_ioctl_peer_disconnect(fd1);
+ assert(r >= 0);
+
+ r = bus1_ioctl_peer_disconnect(fd1);
+ assert(r < 0);
+ assert(r == -ESHUTDOWN);
+
+ r = bus1_ioctl_peer_reset(fd1, &cmd_reset);
+ assert(r < 0);
+ assert(r == -ESHUTDOWN);
+
+ /* cleanup */
+
+ test_close(fd1, map1, n_map1);
+}
+
+/* make sure basic transfer works */
+static void test_api_transfer(void)
+{
+ struct bus1_cmd_handle_transfer cmd_transfer;
+ const uint8_t *map1, *map2;
+ size_t n_map1, n_map2;
+ int r, fd1, fd2;
+
+ /* setup */
+
+ fd1 = test_open(&map1, &n_map1);
+ fd2 = test_open(&map2, &n_map2);
+
+ /* import a handle from @fd1 into @fd2 */
+
+ cmd_transfer = (struct bus1_cmd_handle_transfer){
+ .flags = 0,
+ .src_handle = 0x100,
+ .dst_fd = fd2,
+ .dst_handle = BUS1_HANDLE_INVALID,
+ };
+ r = bus1_ioctl_handle_transfer(fd1, &cmd_transfer);
+ assert(r >= 0);
+ assert(cmd_transfer.dst_handle != BUS1_HANDLE_INVALID);
+ assert(cmd_transfer.dst_handle & BUS1_HANDLE_FLAG_MANAGED);
+ assert(cmd_transfer.dst_handle & BUS1_HANDLE_FLAG_REMOTE);
+
+ /* cleanup */
+
+ test_close(fd2, map2, n_map2);
+ test_close(fd1, map1, n_map1);
+}
+
+/* test release notification */
+static void test_api_notify_release(void)
+{
+ struct bus1_cmd_handle_transfer cmd_transfer;
+ struct bus1_cmd_recv cmd_recv;
+ const uint8_t *map1;
+ uint64_t id = 0x100;
+ size_t n_map1;
+ int r, fd1;
+
+ /* setup */
+
+ fd1 = test_open(&map1, &n_map1);
+
+ /* create a new node on @fd1 and transfer a handle back to itself */
+
+ cmd_transfer = (struct bus1_cmd_handle_transfer){
+ .flags = 0,
+ .src_handle = id,
+ .dst_fd = -1,
+ .dst_handle = BUS1_HANDLE_INVALID,
+ };
+ r = bus1_ioctl_handle_transfer(fd1, &cmd_transfer);
+ assert(r >= 0);
+ assert(cmd_transfer.dst_handle == id);
+
+ /* no message is queued, yet */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* release handle to trigger release notification */
+
+ r = bus1_ioctl_handle_release(fd1, &id);
+ assert(r == 0);
+
+ /* dequeue release notification */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_NODE_RELEASE);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == id);
+
+ /* no more messages */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /*
+ * Trigger the same thing again.
+ */
+
+ cmd_transfer = (struct bus1_cmd_handle_transfer){
+ .flags = 0,
+ .src_handle = id,
+ .dst_fd = -1,
+ .dst_handle = BUS1_HANDLE_INVALID,
+ };
+ r = bus1_ioctl_handle_transfer(fd1, &cmd_transfer);
+ assert(r >= 0);
+ assert(cmd_transfer.dst_handle == id);
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ r = bus1_ioctl_handle_release(fd1, &id);
+ assert(r == 0);
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_NODE_RELEASE);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == id);
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* cleanup */
+
+ test_close(fd1, map1, n_map1);
+}
+
+/* test destroy notification */
+static void test_api_notify_destroy(void)
+{
+ struct bus1_cmd_handle_transfer cmd_transfer;
+ struct bus1_cmd_nodes_destroy cmd_destroy;
+ struct bus1_cmd_recv cmd_recv;
+ uint64_t node = 0x100, handle;
+ const uint8_t *map1, *map2;
+ size_t n_map1, n_map2;
+ int r, fd1, fd2;
+
+ /* setup */
+
+ fd1 = test_open(&map1, &n_map1);
+ fd2 = test_open(&map2, &n_map2);
+
+ /* import a handle from @fd1 into @fd2 */
+
+ cmd_transfer = (struct bus1_cmd_handle_transfer){
+ .flags = 0,
+ .src_handle = node,
+ .dst_fd = fd2,
+ .dst_handle = BUS1_HANDLE_INVALID,
+ };
+ r = bus1_ioctl_handle_transfer(fd1, &cmd_transfer);
+ assert(r >= 0);
+ handle = cmd_transfer.dst_handle;
+
+ /* both queues must be empty */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map2,
+ };
+ r = bus1_ioctl_recv(fd2, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* destroy node and trigger destruction notification */
+
+ cmd_destroy = (struct bus1_cmd_nodes_destroy){
+ .flags = 0,
+ .ptr_nodes = (unsigned long)&node,
+ .n_nodes = 1,
+ };
+ r = bus1_ioctl_nodes_destroy(fd1, &cmd_destroy);
+ assert(r >= 0);
+
+ /* dequeue destruction notification */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_NODE_DESTROY);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == node);
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd2, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_NODE_DESTROY);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == handle);
+
+ /* cleanup */
+
+ test_close(fd2, map2, n_map2);
+ test_close(fd1, map1, n_map1);
+}
+
+/* make sure basic unicast works */
+static void test_api_unicast(void)
+{
+ struct bus1_cmd_send cmd_send;
+ struct bus1_cmd_recv cmd_recv;
+ const uint8_t *map1;
+ uint64_t id = 0x100;
+ size_t n_map1;
+ int r, fd1;
+
+ /* setup */
+
+ fd1 = test_open(&map1, &n_map1);
+
+ /* send empty message */
+
+ cmd_send = (struct bus1_cmd_send){
+ .flags = 0,
+ .ptr_destinations = (unsigned long)&id,
+ .ptr_errors = 0,
+ .n_destinations = 1,
+ .ptr_vecs = 0,
+ .n_vecs = 0,
+ .ptr_handles = 0,
+ .n_handles = 0,
+ .ptr_fds = 0,
+ .n_fds = 0,
+ };
+ r = bus1_ioctl_send(fd1, &cmd_send);
+ assert(r >= 0);
+
+ /* retrieve empty message */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_DATA);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == id);
+
+ /* queue must be empty now */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* cleanup */
+
+ test_close(fd1, map1, n_map1);
+}
+
+/* make sure basic multicasts work */
+static void test_api_multicast(void)
+{
+ struct bus1_cmd_send cmd_send;
+ struct bus1_cmd_recv cmd_recv;
+ uint64_t ids[] = { 0x100, 0x200 };
+ uint64_t data[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
+ struct iovec vec = { data, sizeof(data) };
+ const uint8_t *map1;
+ size_t n_map1;
+ int r, fd1;
+
+ /* setup */
+
+ fd1 = test_open(&map1, &n_map1);
+
+ /* send multicast */
+
+ cmd_send = (struct bus1_cmd_send){
+ .flags = 0,
+ .ptr_destinations = (unsigned long)ids,
+ .ptr_errors = 0,
+ .n_destinations = sizeof(ids) / sizeof(*ids),
+ .ptr_vecs = (unsigned long)&vec,
+ .n_vecs = 1,
+ .ptr_handles = 0,
+ .n_handles = 0,
+ .ptr_fds = 0,
+ .n_fds = 0,
+ };
+ r = bus1_ioctl_send(fd1, &cmd_send);
+ assert(r >= 0);
+
+ /* retrieve messages */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_DATA);
+ assert(cmd_recv.msg.flags == BUS1_MSG_FLAG_CONTINUE);
+ assert(cmd_recv.msg.destination == ids[0] ||
+ cmd_recv.msg.destination == ids[1]);
+ assert(cmd_recv.msg.n_bytes == sizeof(data));
+ assert(!memcmp(map1 + cmd_recv.msg.offset, data, sizeof(data)));
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_DATA);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == ids[0] ||
+ cmd_recv.msg.destination == ids[1]);
+ assert(cmd_recv.msg.n_bytes == sizeof(data));
+ assert(!memcmp(map1 + cmd_recv.msg.offset, data, sizeof(data)));
+
+ /* queue must be empty now */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* cleanup */
+
+ test_close(fd1, map1, n_map1);
+}
+
+/* make sure basic payload-handles work */
+static void test_api_handle(void)
+{
+ struct bus1_cmd_send cmd_send;
+ struct bus1_cmd_recv cmd_recv;
+ uint64_t id = 0x100;
+ const uint8_t *map1;
+ size_t n_map1;
+ int r, fd1;
+
+ /* setup */
+
+ fd1 = test_open(&map1, &n_map1);
+
+ /* send message */
+
+ cmd_send = (struct bus1_cmd_send){
+ .flags = 0,
+ .ptr_destinations = (unsigned long)&id,
+ .ptr_errors = 0,
+ .n_destinations = 1,
+ .ptr_vecs = 0,
+ .n_vecs = 0,
+ .ptr_handles = (unsigned long)&id,
+ .n_handles = 1,
+ .ptr_fds = 0,
+ .n_fds = 0,
+ };
+ r = bus1_ioctl_send(fd1, &cmd_send);
+ assert(r >= 0);
+
+ /* retrieve messages */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_DATA);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == id);
+ assert(cmd_recv.msg.n_handles == 1);
+
+ /* queue must be empty now */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* releasing one reference must trigger a release notification */
+
+ r = bus1_ioctl_handle_release(fd1, &id);
+ assert(r >= 0);
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_NODE_RELEASE);
+ assert(cmd_recv.msg.flags == 0);
+ assert(cmd_recv.msg.destination == id);
+
+ /* queue must be empty again */
+
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = n_map1,
+ };
+ r = bus1_ioctl_recv(fd1, &cmd_recv);
+ assert(r == -EAGAIN);
+
+ /* cleanup */
+
+ test_close(fd1, map1, n_map1);
+}
+
+int main(int argc, char **argv)
+{
+ int r;
+
+ r = test_parse_argv(argc, argv);
+ if (r > 0) {
+ test_api_cdev();
+ test_api_connect();
+ test_api_transfer();
+ test_api_notify_release();
+ test_api_notify_destroy();
+ test_api_unicast();
+ test_api_multicast();
+ test_api_handle();
+ }
+
+ return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
+}
diff --git a/tools/testing/selftests/bus1/test-io.c b/tools/testing/selftests/bus1/test-io.c
new file mode 100644
index 0000000..6cb48e7
--- /dev/null
+++ b/tools/testing/selftests/bus1/test-io.c
@@ -0,0 +1,198 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <time.h>
+#include "test.h"
+
+#define MAX_DESTINATIONS (256)
+
+static inline uint64_t nsec_from_clock(clockid_t clock)
+{
+ struct timespec ts;
+ int r;
+
+ r = clock_gettime(clock, &ts);
+ assert(r >= 0);
+ return ts.tv_sec * UINT64_C(1000000000) + ts.tv_nsec;
+}
+
+static void test_one_uds(int uds[2], void *payload, size_t n_bytes)
+{
+ int r;
+
+ /* send */
+ r = write(uds[0], payload, n_bytes);
+ assert(r == n_bytes);
+
+ /* receive */
+ r = recv(uds[1], payload, n_bytes, 0);
+ assert(r == n_bytes);
+}
+
+static uint64_t test_iterate_uds(unsigned int iterations, size_t n_bytes)
+{
+ int uds[2];
+ char payload[n_bytes];
+ unsigned int i;
+ uint64_t time_start, time_end;
+ int r;
+
+ /* create socket pair */
+ r = socketpair(AF_UNIX, SOCK_SEQPACKET, 0, uds);
+ assert(r >= 0);
+
+ /* caches */
+ test_one_uds(uds, payload, n_bytes);
+
+ time_start = nsec_from_clock(CLOCK_THREAD_CPUTIME_ID);
+ for (i = 0; i < iterations; i++)
+ test_one_uds(uds, payload, n_bytes);
+ time_end = nsec_from_clock(CLOCK_THREAD_CPUTIME_ID);
+
+ /* cleanup */
+ close(uds[0]);
+ close(uds[1]);
+
+ return (time_end - time_start) / iterations;
+}
+
+static void test_one(int fd1,
+ int *fds,
+ uint64_t *handles,
+ unsigned int n_destinations,
+ char *payload,
+ size_t n_bytes)
+{
+ struct bus1_cmd_send cmd_send;
+ struct bus1_cmd_recv cmd_recv;
+ struct iovec vec = { payload, n_bytes };
+ size_t i;
+ int r;
+
+ cmd_send = (struct bus1_cmd_send){
+ .flags = 0,
+ .ptr_destinations = (unsigned long)handles,
+ .ptr_errors = 0,
+ .n_destinations = n_destinations,
+ .ptr_vecs = (unsigned long)&vec,
+ .n_vecs = 1,
+ .ptr_handles = 0,
+ .n_handles = 0,
+ .ptr_fds = 0,
+ .n_fds = 0,
+ };
+ r = bus1_ioctl_send(fd1, &cmd_send);
+ assert(r >= 0);
+
+ for (i = 0; i < n_destinations; ++i) {
+ cmd_recv = (struct bus1_cmd_recv){
+ .flags = 0,
+ .max_offset = -1,
+ };
+ r = bus1_ioctl_recv(fds[i], &cmd_recv);
+ assert(r >= 0);
+ assert(cmd_recv.msg.type == BUS1_MSG_DATA);
+ assert(cmd_recv.msg.n_bytes == n_bytes);
+
+ r = bus1_ioctl_slice_release(fds[i],
+ (uint64_t *)&cmd_recv.msg.offset);
+ assert(r >= 0);
+ }
+}
+
+static uint64_t test_iterate(unsigned int iterations,
+ unsigned int n_destinations,
+ size_t n_bytes)
+{
+ struct bus1_cmd_handle_transfer cmd_transfer;
+ const uint8_t *maps[MAX_DESTINATIONS + 1];
+ size_t n_maps[MAX_DESTINATIONS + 1];
+ uint64_t handles[MAX_DESTINATIONS + 1];
+ int r, fds[MAX_DESTINATIONS + 1];
+ uint64_t time_start, time_end;
+ char payload[n_bytes];
+ size_t i;
+
+ assert(n_destinations <= MAX_DESTINATIONS);
+
+ /* setup */
+ fds[0] = test_open(&maps[0], &n_maps[0]);
+
+ for (i = 1; i < sizeof(fds) / sizeof(*fds); ++i) {
+ fds[i] = test_open(&maps[i], &n_maps[i]);
+
+ cmd_transfer = (struct bus1_cmd_handle_transfer){
+ .flags = 0,
+ .src_handle = 0x100,
+ .dst_fd = fds[0],
+ .dst_handle = BUS1_HANDLE_INVALID,
+ };
+ r = bus1_ioctl_handle_transfer(fds[i], &cmd_transfer);
+ assert(r >= 0);
+ handles[i] = cmd_transfer.dst_handle;
+ }
+
+ /* caches */
+ test_one(fds[0], fds + 1, handles + 1, n_destinations, payload,
+ n_bytes);
+
+ time_start = nsec_from_clock(CLOCK_THREAD_CPUTIME_ID);
+ for (i = 0; i < iterations; i++)
+ test_one(fds[0], fds + 1, handles + 1, n_destinations, payload,
+ n_bytes);
+ time_end = nsec_from_clock(CLOCK_THREAD_CPUTIME_ID);
+
+ for (i = 0; i < sizeof(fds) / sizeof(*fds); ++i)
+ test_close(fds[i], maps[i], n_maps[i]);
+
+ return (time_end - time_start) / iterations;
+}
+
+static void test_io(void)
+{
+ unsigned long base;
+ unsigned int i;
+
+ fprintf(stderr, "UDS took %lu ns without payload\n",
+ test_iterate_uds(100000, 0));
+ fprintf(stderr, "UDS took %lu ns\n",
+ test_iterate_uds(100000, 1024));
+
+ base = test_iterate(1000000, 0, 1024);
+
+ fprintf(stderr, "it took %lu ns for no destinations\n", base);
+ fprintf(stderr,
+ "it took %lu ns + %lu ns for one destination without payload\n",
+ base, test_iterate(100000, 1, 0) - base);
+ fprintf(stderr, "it took %lu ns + %lu ns for one destination\n", base,
+ test_iterate(100000, 1, 1024) - base);
+
+ for (i = 1; i < 9; ++i) {
+ unsigned int dests = 1UL << i;
+
+ fprintf(stderr, "it took %lu ns + %lu ns per destination for %u destinations\n",
+ base, (test_iterate(100000 >> i, dests, 1024) - base) / dests, dests);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int r;
+
+ r = test_parse_argv(argc, argv);
+ if (r > 0) {
+ test_io();
+ }
+
+ return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
+}
diff --git a/tools/testing/selftests/bus1/test.h b/tools/testing/selftests/bus1/test.h
new file mode 100644
index 0000000..fee815e
--- /dev/null
+++ b/tools/testing/selftests/bus1/test.h
@@ -0,0 +1,114 @@
+#ifndef __TEST_H
+#define __TEST_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/* include standard environment for all tests */
+#include <assert.h>
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <linux/bus1.h>
+#include <linux/sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "bus1-ioctl.h"
+
+static char *test_path;
+static char *test_arg_module = "bus1";
+
+#define c_align_to(_val, _to) (((_val) + (_to) - 1) & ~((_to) - 1))
+
+static inline int test_parse_argv(int argc, char **argv)
+{
+ enum {
+ ARG_MODULE = 0x100,
+ };
+ static const struct option options[] = {
+ { "help", no_argument, NULL, 'h' },
+ { "module", required_argument, NULL, ARG_MODULE },
+ {}
+ };
+ char *t;
+ int c;
+
+ t = getenv("BUS1EXT");
+ if (t) {
+ test_arg_module = malloc(strlen(t) + 4);
+ assert(test_arg_module);
+ strcpy(test_arg_module, "bus");
+ strcpy(test_arg_module + 3, t);
+ }
+
+ while ((c = getopt_long(argc, argv, "h", options, NULL)) >= 0) {
+ switch (c) {
+ case 'h':
+ fprintf(stderr,
+ "Usage: %s [OPTIONS...] ...\n\n"
+ "Run bus1 test.\n\n"
+ "\t-h, --help Print this help\n"
+ "\t --module=bus1 Module to use\n"
+ , program_invocation_short_name);
+
+ return 0;
+
+ case ARG_MODULE:
+ test_arg_module = optarg;
+ break;
+
+ case '?':
+ /* fallthrough */
+ default:
+ return -EINVAL;
+ }
+ }
+
+ /* store cdev-path for tests to access ("/dev/<module>") */
+ free(test_path);
+ test_path = malloc(strlen(test_arg_module) + 6);
+ assert(test_path);
+ strcpy(test_path, "/dev/");
+ strcpy(test_path + 5, test_arg_module);
+
+ return 1;
+}
+
+static inline int test_open(const uint8_t **mapp, size_t *n_mapp)
+{
+ const size_t size = 16UL * 1024UL * 1024UL;
+ int fd;
+
+ fd = open(test_path, O_RDWR | O_CLOEXEC | O_NONBLOCK | O_NOCTTY);
+ assert(fd >= 0);
+
+ *mapp = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
+ assert(*mapp != MAP_FAILED);
+
+ *n_mapp = size;
+ return fd;
+}
+
+static inline void test_close(int fd, const uint8_t *map, size_t n_map)
+{
+ munmap((void *)map, n_map);
+ close(fd);
+}
+
+#endif /* __TEST_H */
--
2.10.1

2016-10-26 19:23:06

by David Herrmann

Subject: [RFC v1 13/14] bus1: limit and protect resources

From: Tom Gundersen <[email protected]>

This adds resource counters to peers and users. They limit the number
of objects that a peer can operate on. The per-user root limits can be
configured as a module option; all other limits are derived from them.

This also adds LSM hooks. They are not integrated with ./security/ yet,
but provide the hook points as discussed with the SELinux maintainers.
Since the operations on bus1 are very similar in nature to those of
Binder, the hooks mirror the Binder hooks (which the LSM people were
already happy with).

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/message.c | 47 +++++-
ipc/bus1/peer.c | 95 ++++++++++-
ipc/bus1/peer.h | 2 +
ipc/bus1/security.h | 45 +++++
ipc/bus1/user.c | 475 ++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/user.h | 75 ++++++++-
6 files changed, 728 insertions(+), 11 deletions(-)
create mode 100644 ipc/bus1/security.h

diff --git a/ipc/bus1/message.c b/ipc/bus1/message.c
index 4c5c905..6145d5f 100644
--- a/ipc/bus1/message.c
+++ b/ipc/bus1/message.c
@@ -26,6 +26,7 @@
#include "handle.h"
#include "message.h"
#include "peer.h"
+#include "security.h"
#include "tx.h"
#include "user.h"
#include "util.h"
@@ -242,9 +243,16 @@ int bus1_factory_seal(struct bus1_factory *f)
struct bus1_handle *h;
struct bus1_flist *e;
size_t i;
+ int r;

lockdep_assert_held(&f->peer->local.lock);

+ r = bus1_user_charge(&f->peer->user->limits.n_handles,
+ &f->peer->data.limits.n_handles,
+ f->n_handles_charge);
+ if (r < 0)
+ return r;
+
for (i = 0, e = f->handles;
i < f->n_handles;
e = bus1_flist_next(e, &i)) {
@@ -291,11 +299,29 @@ struct bus1_message *bus1_factory_instantiate(struct bus1_factory *f,
transmit_secctx = f->has_secctx &&
(READ_ONCE(peer->flags) & BUS1_PEER_FLAG_WANT_SECCTX);

+ r = bus1_user_charge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, 1);
+ if (r < 0)
+ return ERR_PTR(r);
+
+ r = bus1_user_charge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, f->n_handles);
+ if (r < 0) {
+ bus1_user_discharge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, 1);
+ return ERR_PTR(r);
+ }
+
size = sizeof(*m) + bus1_flist_inline_size(f->n_handles) +
f->n_files * sizeof(struct file *);
m = kmalloc(size, GFP_KERNEL);
- if (!m)
+ if (!m) {
+ bus1_user_discharge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, f->n_handles);
+ bus1_user_discharge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, 1);
return ERR_PTR(-ENOMEM);
+ }

/* set to default first, so the destructor can be called anytime */
kref_init(&m->ref);
@@ -329,6 +355,8 @@ struct bus1_message *bus1_factory_instantiate(struct bus1_factory *f,
m->slice = bus1_pool_alloc(&peer->data.pool, size);
mutex_unlock(&peer->data.lock);
if (IS_ERR(m->slice)) {
+ bus1_user_discharge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, 1);
r = PTR_ERR(m->slice);
m->slice = NULL;
goto error;
@@ -376,6 +404,11 @@ struct bus1_message *bus1_factory_instantiate(struct bus1_factory *f,

/* import files */
while (m->n_files < f->n_files) {
+ r = security_bus1_transfer_file(f->peer, peer,
+ f->files[m->n_files]);
+ if (r < 0)
+ goto error;
+
m->files[m->n_files] = get_file(f->files[m->n_files]);
++m->n_files;
}
@@ -436,10 +469,15 @@ void bus1_message_free(struct kref *k)
bus1_handle_unref(e->ptr);
}
}
+ bus1_user_discharge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, m->n_handles_charge);
bus1_flist_deinit(m->handles, m->n_handles);

if (m->slice) {
mutex_lock(&peer->data.lock);
+ if (!bus1_pool_slice_is_public(m->slice))
+ bus1_user_discharge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, 1);
bus1_pool_release_kernel(&peer->data.pool, m->slice);
mutex_unlock(&peer->data.lock);
}
@@ -575,7 +613,12 @@ int bus1_message_install(struct bus1_message *m, struct bus1_cmd_recv *param)
}

/* charge resources */
- if (!peek) {
+ if (peek) {
+ r = bus1_user_charge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, n_handles);
+ if (r < 0)
+ goto exit;
+ } else {
WARN_ON(n_handles < m->n_handles_charge);
m->n_handles_charge -= n_handles;
}
diff --git a/ipc/bus1/peer.c b/ipc/bus1/peer.c
index f0da4a7..db29a69 100644
--- a/ipc/bus1/peer.c
+++ b/ipc/bus1/peer.c
@@ -114,6 +114,7 @@ struct bus1_peer *bus1_peer_new(void)
mutex_init(&peer->data.lock);
peer->data.pool = BUS1_POOL_NULL;
bus1_queue_init(&peer->data.queue);
+ bus1_user_limits_init(&peer->data.limits, peer->user);

/* initialize peer-private section */
mutex_init(&peer->local.lock);
@@ -201,6 +202,8 @@ static void bus1_peer_flush(struct bus1_peer *peer, u64 flags)
rb_to_peer) {
n = atomic_xchg(&h->n_user, 0);
bus1_handle_forget_keep(h);
+ bus1_user_discharge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, n);

if (bus1_handle_is_anchor(h)) {
if (n > 1)
@@ -218,6 +221,9 @@ static void bus1_peer_flush(struct bus1_peer *peer, u64 flags)
bus1_pool_flush(&peer->data.pool, &n_slices);
mutex_unlock(&peer->data.lock);

+ bus1_user_discharge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, n_slices);
+
while ((qnode = qlist)) {
qlist = qnode->next;
qnode->next = NULL;
@@ -281,6 +287,7 @@ struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)
mutex_destroy(&peer->local.lock);

/* deinitialize data section */
+ bus1_user_limits_deinit(&peer->data.limits);
bus1_queue_deinit(&peer->data.queue);
bus1_pool_deinit(&peer->data.pool);
mutex_destroy(&peer->data.lock);
@@ -311,10 +318,10 @@ static int bus1_peer_ioctl_peer_query(struct bus1_peer *peer,

mutex_lock(&peer->local.lock);
param.peer_flags = peer->flags & BUS1_PEER_FLAG_WANT_SECCTX;
- param.max_slices = -1;
- param.max_handles = -1;
- param.max_inflight_bytes = -1;
- param.max_inflight_fds = -1;
+ param.max_slices = peer->data.limits.max_slices;
+ param.max_handles = peer->data.limits.max_handles;
+ param.max_inflight_bytes = peer->data.limits.max_inflight_bytes;
+ param.max_inflight_fds = peer->data.limits.max_inflight_fds;
mutex_unlock(&peer->local.lock);

return copy_to_user(uparam, &param, sizeof(param)) ? -EFAULT : 0;
@@ -336,10 +343,14 @@ static int bus1_peer_ioctl_peer_reset(struct bus1_peer *peer,
if (unlikely(param.peer_flags != -1 &&
(param.peer_flags & ~BUS1_PEER_FLAG_WANT_SECCTX)))
return -EINVAL;
- if (unlikely(param.max_slices != -1 ||
- param.max_handles != -1 ||
- param.max_inflight_bytes != -1 ||
- param.max_inflight_fds != -1))
+ if (unlikely((param.max_slices != -1 &&
+ param.max_slices > INT_MAX) ||
+ (param.max_handles != -1 &&
+ param.max_handles > INT_MAX) ||
+ (param.max_inflight_bytes != -1 &&
+ param.max_inflight_bytes > INT_MAX) ||
+ (param.max_inflight_fds != -1 &&
+ param.max_inflight_fds > INT_MAX)))
return -EINVAL;

mutex_lock(&peer->local.lock);
@@ -347,6 +358,34 @@ static int bus1_peer_ioctl_peer_reset(struct bus1_peer *peer,
if (param.peer_flags != -1)
peer->flags = param.peer_flags;

+ if (param.max_slices != -1) {
+ atomic_add((int)param.max_slices -
+ (int)peer->data.limits.max_slices,
+ &peer->data.limits.n_slices);
+ peer->data.limits.max_slices = param.max_slices;
+ }
+
+ if (param.max_handles != -1) {
+ atomic_add((int)param.max_handles -
+ (int)peer->data.limits.max_handles,
+ &peer->data.limits.n_handles);
+ peer->data.limits.max_handles = param.max_handles;
+ }
+
+ if (param.max_inflight_bytes != -1) {
+ atomic_add((int)param.max_inflight_bytes -
+ (int)peer->data.limits.max_inflight_bytes,
+ &peer->data.limits.n_inflight_bytes);
+ peer->data.limits.max_inflight_bytes = param.max_inflight_bytes;
+ }
+
+ if (param.max_inflight_fds != -1) {
+ atomic_add((int)param.max_inflight_fds -
+ (int)peer->data.limits.max_inflight_fds,
+ &peer->data.limits.n_inflight_fds);
+ peer->data.limits.max_inflight_fds = param.max_inflight_fds;
+ }
+
bus1_peer_flush(peer, param.flags);

mutex_unlock(&peer->local.lock);
@@ -403,6 +442,8 @@ static int bus1_peer_ioctl_handle_release(struct bus1_peer *peer,

WARN_ON(atomic_dec_return(&h->n_user) < 0);
bus1_handle_forget(h);
+ bus1_user_discharge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, 1);
bus1_handle_release(h, strong);

r = 0;
@@ -458,7 +499,20 @@ static int bus1_peer_transfer(struct bus1_peer *src,
}
}

+ r = bus1_user_charge(&dst->user->limits.n_handles,
+ &dst->data.limits.n_handles, 1);
+ if (r < 0)
+ goto exit;
+
if (is_new) {
+ r = bus1_user_charge(&src->user->limits.n_handles,
+ &src->data.limits.n_handles, 1);
+ if (r < 0) {
+ bus1_user_discharge(&dst->user->limits.n_handles,
+ &dst->data.limits.n_handles, 1);
+ goto exit;
+ }
+
WARN_ON(src_h != bus1_handle_acquire(src_h, false));
WARN_ON(atomic_inc_return(&src_h->n_user) != 1);
}
@@ -543,6 +597,16 @@ static int bus1_peer_ioctl_nodes_destroy(struct bus1_peer *peer,
bus1_tx_init(&tx, peer);
ptr_nodes = (const u64 __user *)(unsigned long)param.ptr_nodes;

+ /*
+ * We must limit the work that user-space can dispatch in one go. We
+ * use the maximum number of handles as a natural limit. It cannot be
+ * hit anyway, except by calls that would fail even without this check.
+ */
+ if (unlikely(param.n_nodes > peer->user->limits.max_handles)) {
+ r = -EINVAL;
+ goto exit;
+ }
+
for (i = 0; i < param.n_nodes; ++i) {
if (get_user(id, ptr_nodes + i)) {
r = -EFAULT;
@@ -578,6 +642,11 @@ static int bus1_peer_ioctl_nodes_destroy(struct bus1_peer *peer,
++n_charge;
}

+ r = bus1_user_charge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, n_charge);
+ if (r < 0)
+ goto exit;
+
/* nothing below this point can fail, anymore */

mutex_lock(&peer->data.lock);
@@ -611,6 +680,9 @@ static int bus1_peer_ioctl_nodes_destroy(struct bus1_peer *peer,
bus1_handle_unref(h);
}

+ bus1_user_discharge(&peer->user->limits.n_handles,
+ &peer->data.limits.n_handles, n_discharge);
+
r = 0;

exit:
@@ -642,6 +714,8 @@ static int bus1_peer_ioctl_slice_release(struct bus1_peer *peer,
mutex_lock(&peer->data.lock);
r = bus1_pool_release_user(&peer->data.pool, offset, &n_slices);
mutex_unlock(&peer->data.lock);
+ bus1_user_discharge(&peer->user->limits.n_slices,
+ &peer->data.limits.n_slices, n_slices);
return r;
}

@@ -747,6 +821,11 @@ static int bus1_peer_ioctl_send(struct bus1_peer *peer,
ptr_destinations =
(const u64 __user *)(unsigned long)param.ptr_destinations;

+ if (unlikely(param.n_destinations > peer->user->limits.max_handles)) {
+ r = -EINVAL;
+ goto exit;
+ }
+
factory = bus1_factory_new(peer, &param, stack, sizeof(stack));
if (IS_ERR(factory)) {
r = PTR_ERR(factory);
diff --git a/ipc/bus1/peer.h b/ipc/bus1/peer.h
index 26c051f..f601b8e 100644
--- a/ipc/bus1/peer.h
+++ b/ipc/bus1/peer.h
@@ -77,6 +77,7 @@ struct pid_namespace;
* @data.lock: data lock
* @data.pool: data pool
* @data.queue: message queue
+ * @data.limits: resource limit counter
* @local.lock: local peer runtime lock
* @local.seed: pinned seed message
* @local.map_handles: map of owned handles (by handle ID)
@@ -97,6 +98,7 @@ struct bus1_peer {
struct mutex lock;
struct bus1_pool pool;
struct bus1_queue queue;
+ struct bus1_user_limits limits;
} data;

struct {
diff --git a/ipc/bus1/security.h b/ipc/bus1/security.h
new file mode 100644
index 0000000..5addf09
--- /dev/null
+++ b/ipc/bus1/security.h
@@ -0,0 +1,45 @@
+#ifndef __BUS1_SECURITY_H
+#define __BUS1_SECURITY_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Security
+ *
+ * This implements LSM hooks for bus1. Out-of-tree modules cannot provide their
+ * own hooks, so we just provide stubs that are to be converted into real LSM
+ * hooks once this is no longer out-of-tree.
+ */
+
+struct bus1_handle;
+struct bus1_peer;
+struct file;
+
+static inline int security_bus1_transfer_message(struct bus1_peer *from,
+ struct bus1_peer *to)
+{
+ return 0;
+}
+
+static inline int security_bus1_transfer_handle(struct bus1_peer *from,
+ struct bus1_peer *to,
+ struct bus1_handle *node)
+{
+ return 0;
+}
+
+static inline int security_bus1_transfer_file(struct bus1_peer *from,
+ struct bus1_peer *to,
+ struct file *what)
+{
+ return 0;
+}
+
+#endif /* __BUS1_SECURITY_H */
diff --git a/ipc/bus1/user.c b/ipc/bus1/user.c
index 0498ab4..9db5ffd 100644
--- a/ipc/bus1/user.c
+++ b/ipc/bus1/user.c
@@ -23,6 +23,28 @@
static DEFINE_MUTEX(bus1_user_lock);
static DEFINE_IDR(bus1_user_idr);

+static unsigned int bus1_user_max_slices = 16384;
+static unsigned int bus1_user_max_handles = 65536;
+static unsigned int bus1_user_max_inflight_bytes = 16 * 1024 * 1024;
+static unsigned int bus1_user_max_inflight_fds = 4096;
+
+module_param_named(user_slices_max, bus1_user_max_slices,
+ uint, 0644);
+module_param_named(user_handles_max, bus1_user_max_handles,
+ uint, 0644);
+module_param_named(user_inflight_bytes_max, bus1_user_max_inflight_bytes,
+ uint, 0644);
+module_param_named(user_inflight_fds_max, bus1_user_max_inflight_fds,
+ uint, 0644);
+MODULE_PARM_DESC(user_slices_max,
+ "Max number of slices for each user.");
+MODULE_PARM_DESC(user_handles_max,
+ "Max number of handles for each user.");
+MODULE_PARM_DESC(user_inflight_bytes_max,
+ "Max number of inflight bytes for each user.");
+MODULE_PARM_DESC(user_inflight_fds_max,
+ "Max number of inflight fds for each user.");
+
/**
* bus1_user_modexit() - clean up global resources of user accounting
*
@@ -40,6 +62,113 @@ void bus1_user_modexit(void)
idr_init(&bus1_user_idr);
}

+static struct bus1_user_usage *bus1_user_usage_new(void)
+{
+ struct bus1_user_usage *usage;
+
+ usage = kzalloc(sizeof(*usage), GFP_KERNEL);
+ if (!usage)
+ return ERR_PTR(-ENOMEM);
+
+ return usage;
+}
+
+static struct bus1_user_usage *
+bus1_user_usage_free(struct bus1_user_usage *usage)
+{
+ if (usage) {
+ WARN_ON(atomic_read(&usage->n_slices));
+ WARN_ON(atomic_read(&usage->n_handles));
+ WARN_ON(atomic_read(&usage->n_bytes));
+ WARN_ON(atomic_read(&usage->n_fds));
+ kfree(usage);
+ }
+
+ return NULL;
+}
+
+/**
+ * bus1_user_limits_init() - initialize resource limit counter
+ * @limits: object to initialize
+ * @source: source to initialize from, or NULL
+ *
+ * This initializes the resource-limit counter @limits. The initial limits are
+ * taken from @source, if given. If NULL, the global default limits are taken.
+ */
+void bus1_user_limits_init(struct bus1_user_limits *limits,
+ struct bus1_user *source)
+{
+ if (source) {
+ limits->max_slices = source->limits.max_slices;
+ limits->max_handles = source->limits.max_handles;
+ limits->max_inflight_bytes = source->limits.max_inflight_bytes;
+ limits->max_inflight_fds = source->limits.max_inflight_fds;
+ } else {
+ limits->max_slices = bus1_user_max_slices;
+ limits->max_handles = bus1_user_max_handles;
+ limits->max_inflight_bytes = bus1_user_max_inflight_bytes;
+ limits->max_inflight_fds = bus1_user_max_inflight_fds;
+ }
+
+ atomic_set(&limits->n_slices, limits->max_slices);
+ atomic_set(&limits->n_handles, limits->max_handles);
+ atomic_set(&limits->n_inflight_bytes, limits->max_inflight_bytes);
+ atomic_set(&limits->n_inflight_fds, limits->max_inflight_fds);
+
+ idr_init(&limits->usages);
+}
+
+/**
+ * bus1_user_limits_deinit() - deinitialize resource limit counter
+ * @limits: object to deinitialize
+ *
+ * This should be called on destruction of @limits. It verifies the correctness
+ * of the limits and emits warnings if something went wrong.
+ */
+void bus1_user_limits_deinit(struct bus1_user_limits *limits)
+{
+ struct bus1_user_usage *usage;
+ int i;
+
+ idr_for_each_entry(&limits->usages, usage, i)
+ bus1_user_usage_free(usage);
+
+ idr_destroy(&limits->usages);
+
+ WARN_ON(atomic_read(&limits->n_slices) !=
+ limits->max_slices);
+ WARN_ON(atomic_read(&limits->n_handles) !=
+ limits->max_handles);
+ WARN_ON(atomic_read(&limits->n_inflight_bytes) !=
+ limits->max_inflight_bytes);
+ WARN_ON(atomic_read(&limits->n_inflight_fds) !=
+ limits->max_inflight_fds);
+}
+
+static struct bus1_user_usage *
+bus1_user_limits_map(struct bus1_user_limits *limits, struct bus1_user *actor)
+{
+ struct bus1_user_usage *usage;
+ int r;
+
+ usage = idr_find(&limits->usages, __kuid_val(actor->uid));
+ if (usage)
+ return usage;
+
+ usage = bus1_user_usage_new();
+ if (IS_ERR(usage))
+ return ERR_CAST(usage);
+
+ r = idr_alloc(&limits->usages, usage, __kuid_val(actor->uid),
+ __kuid_val(actor->uid) + 1, GFP_KERNEL);
+ if (r < 0) {
+ bus1_user_usage_free(usage);
+ return ERR_PTR(r);
+ }
+
+ return usage;
+}
+
static struct bus1_user *bus1_user_new(void)
{
struct bus1_user *user;
@@ -51,6 +180,7 @@ static struct bus1_user *bus1_user_new(void)
kref_init(&user->ref);
user->uid = INVALID_UID;
mutex_init(&user->lock);
+ bus1_user_limits_init(&user->limits, NULL);

return user;
}
@@ -63,6 +193,7 @@ static void bus1_user_free(struct kref *ref)

if (likely(uid_valid(user->uid)))
idr_remove(&bus1_user_idr, __kuid_val(user->uid));
+ bus1_user_limits_deinit(&user->limits);
mutex_destroy(&user->lock);
kfree_rcu(user, rcu);
}
@@ -151,3 +282,347 @@ struct bus1_user *bus1_user_unref(struct bus1_user *user)

return NULL;
}
+
+/**
+ * bus1_user_charge() - charge a user resource
+ * @global: global resource to charge on
+ * @local: local resource to charge on
+ * @charge: charge to apply
+ *
+ * This charges @charge on two resource counters. Only if both charges apply,
+ * this returns success. It is an error to call this with negative charges.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_user_charge(atomic_t *global, atomic_t *local, int charge)
+{
+ int v;
+
+ WARN_ON(charge < 0);
+
+ if (!charge)
+ return 0;
+
+ v = bus1_atomic_add_if_ge(global, -charge, charge);
+ if (v < charge)
+ return -EDQUOT;
+
+ v = bus1_atomic_add_if_ge(local, -charge, charge);
+ if (v < charge) {
+ atomic_add(charge, global);
+ return -EDQUOT;
+ }
+
+ return 0;
+}
+
+/**
+ * bus1_user_discharge() - discharge a user resource
+ * @global: global resource to charge on
+ * @local: local resource to charge on
+ * @charge: charge to apply
+ *
+ * This discharges @charge on two resource counters. This always succeeds. It
+ * is an error to call this with a negative charge.
+ */
+void bus1_user_discharge(atomic_t *global, atomic_t *local, int charge)
+{
+ WARN_ON(charge < 0);
+ atomic_add(charge, local);
+ atomic_add(charge, global);
+}
+
+static int bus1_user_charge_one(atomic_t *global_remaining,
+ atomic_t *local_remaining,
+ int global_share,
+ int local_share,
+ int charge)
+{
+ int v, global_reserved, local_reserved;
+
+ WARN_ON(charge < 0);
+
+ /*
+ * Try charging a single resource type. If limits are exceeded, return
+ * an error-code, otherwise apply charges.
+ *
+ * @remaining: per-user atomic that counts all instances of this
+ * resource for this single user. It is initially set to the
+ * limit for this user. For each accounted resource, we
+ * decrement it. Thus, it must not drop below 0, or you
+ * exceeded your quota.
+ * @share: current amount of resources that the acting task has in
+ * the local peer.
+ * @charge: number of resources to charge with this operation
+ *
+ * We try charging @charge on @remaining. The applied logic is: the
+ * caller is not allowed to account for more than half of the remaining
+ * space (including its current share). That is, if 'n' free resources
+ * remain, then after charging @charge, at least @share+@charge of them
+ * must still be left. In other words, the resources remaining after the
+ * charge are still at least as big as what the caller has charged in
+ * total.
+ */
+
+ if (charge > charge * 2)
+ return -EDQUOT;
+
+ global_reserved = global_share + charge * 2;
+
+ if (global_share > global_reserved || charge * 2 > global_reserved)
+ return -EDQUOT;
+
+ v = bus1_atomic_add_if_ge(global_remaining, -charge, global_reserved);
+ if (v < charge)
+ return -EDQUOT;
+
+ local_reserved = local_share + charge * 2;
+
+ if (local_share > local_reserved || charge * 2 > local_reserved)
+ return -EDQUOT;
+
+ v = bus1_atomic_add_if_ge(local_remaining, -charge, local_reserved);
+ if (v < charge) {
+ atomic_add(charge, global_remaining);
+ return -EDQUOT;
+ }
+
+ return 0;
+}
+
+static int bus1_user_charge_quota_locked(struct bus1_user_usage *q_global,
+ struct bus1_user_usage *q_local,
+ struct bus1_user_limits *l_global,
+ struct bus1_user_limits *l_local,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds)
+{
+ int r;
+
+ r = bus1_user_charge_one(&l_global->n_slices, &l_local->n_slices,
+ atomic_read(&q_global->n_slices),
+ atomic_read(&q_local->n_slices),
+ n_slices);
+ if (r < 0)
+ return r;
+
+ r = bus1_user_charge_one(&l_global->n_handles, &l_local->n_handles,
+ atomic_read(&q_global->n_handles),
+ atomic_read(&q_local->n_handles),
+ n_handles);
+ if (r < 0)
+ goto revert_slices;
+
+ r = bus1_user_charge_one(&l_global->n_inflight_bytes,
+ &l_local->n_inflight_bytes,
+ atomic_read(&q_global->n_bytes),
+ atomic_read(&q_local->n_bytes),
+ n_bytes);
+ if (r < 0)
+ goto revert_handles;
+
+ r = bus1_user_charge_one(&l_global->n_inflight_fds,
+ &l_local->n_inflight_fds,
+ atomic_read(&q_global->n_fds),
+ atomic_read(&q_local->n_fds),
+ n_fds);
+ if (r < 0)
+ goto revert_bytes;
+
+ atomic_add(n_slices, &q_global->n_slices);
+ atomic_add(n_handles, &q_global->n_handles);
+ atomic_add(n_bytes, &q_global->n_bytes);
+ atomic_add(n_fds, &q_global->n_fds);
+
+ atomic_add(n_slices, &q_local->n_slices);
+ atomic_add(n_handles, &q_local->n_handles);
+ atomic_add(n_bytes, &q_local->n_bytes);
+ atomic_add(n_fds, &q_local->n_fds);
+
+ return 0;
+
+revert_bytes:
+ atomic_add(n_bytes, &l_local->n_inflight_bytes);
+ atomic_add(n_bytes, &l_global->n_inflight_bytes);
+revert_handles:
+ atomic_add(n_handles, &l_local->n_handles);
+ atomic_add(n_handles, &l_global->n_handles);
+revert_slices:
+ atomic_add(n_slices, &l_local->n_slices);
+ atomic_add(n_slices, &l_global->n_slices);
+ return r;
+}
+
+/**
+ * bus1_user_charge_quota() - charge quota resources
+ * @user: user to charge on
+ * @actor: user to charge as
+ * @limits: local limits to charge on
+ * @n_slices: number of slices to charge
+ * @n_handles: number of handles to charge
+ * @n_bytes: number of bytes to charge
+ * @n_fds: number of FDs to charge
+ *
+ * This charges the given resources on @user and @limits, applying both local
+ * and remote charges. Everything is charged for user @actor.
+ *
+ * Negative charges always succeed. Positive charges might fail if quota is
+ * denied. Note that a single call is always atomic, so either all succeed or
+ * all fail. Hence, it makes little sense to mix negative and positive charges
+ * in a single call.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_user_charge_quota(struct bus1_user *user,
+ struct bus1_user *actor,
+ struct bus1_user_limits *limits,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds)
+{
+ struct bus1_user_usage *u_usage, *usage;
+ int r;
+
+ WARN_ON(n_slices < 0 || n_handles < 0 || n_bytes < 0 || n_fds < 0);
+
+ mutex_lock(&user->lock);
+
+ usage = bus1_user_limits_map(limits, actor);
+ if (IS_ERR(usage)) {
+ r = PTR_ERR(usage);
+ goto exit;
+ }
+
+ u_usage = bus1_user_limits_map(&user->limits, actor);
+ if (IS_ERR(u_usage)) {
+ r = PTR_ERR(u_usage);
+ goto exit;
+ }
+
+ r = bus1_user_charge_quota_locked(u_usage, usage, &user->limits,
+ limits, n_slices, n_handles,
+ n_bytes, n_fds);
+
+exit:
+ mutex_unlock(&user->lock);
+ return r;
+}
+
+/**
+ * bus1_user_discharge_quota() - discharge quota resources
+ * @user: user to charge on
+ * @actor: user to charge as
+ * @l_local: local limits to charge on
+ * @n_slices: number of slices to charge
+ * @n_handles: number of handles to charge
+ * @n_bytes: number of bytes to charge
+ * @n_fds: number of FDs to charge
+ *
+ * This discharges the given resources on @user and @l_local. It does both local
+ * and remote charges. It is all discharged for user @actor.
+ */
+void bus1_user_discharge_quota(struct bus1_user *user,
+ struct bus1_user *actor,
+ struct bus1_user_limits *l_local,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds)
+{
+ struct bus1_user_usage *q_global, *q_local;
+ struct bus1_user_limits *l_global = &user->limits;
+
+ WARN_ON(n_slices < 0 || n_handles < 0 || n_bytes < 0 || n_fds < 0);
+
+ mutex_lock(&user->lock);
+
+ q_local = bus1_user_limits_map(l_local, actor);
+ if (WARN_ON(IS_ERR(q_local)))
+ goto exit;
+
+ q_global = bus1_user_limits_map(&user->limits, actor);
+ if (WARN_ON(IS_ERR(q_global)))
+ goto exit;
+
+ atomic_sub(n_slices, &q_global->n_slices);
+ atomic_sub(n_handles, &q_global->n_handles);
+ atomic_sub(n_bytes, &q_global->n_bytes);
+ atomic_sub(n_fds, &q_global->n_fds);
+
+ atomic_sub(n_slices, &q_local->n_slices);
+ atomic_sub(n_handles, &q_local->n_handles);
+ atomic_sub(n_bytes, &q_local->n_bytes);
+ atomic_sub(n_fds, &q_local->n_fds);
+
+ atomic_add(n_slices, &l_global->n_slices);
+ atomic_add(n_handles, &l_global->n_handles);
+ atomic_add(n_bytes, &l_global->n_inflight_bytes);
+ atomic_add(n_fds, &l_global->n_inflight_fds);
+
+ atomic_add(n_slices, &l_local->n_slices);
+ atomic_add(n_handles, &l_local->n_handles);
+ atomic_add(n_bytes, &l_local->n_inflight_bytes);
+ atomic_add(n_fds, &l_local->n_inflight_fds);
+
+exit:
+ mutex_unlock(&user->lock);
+}
+
+/**
+ * bus1_user_commit_quota() - commit quota resources
+ * @user: user to charge on
+ * @actor: user to charge as
+ * @l_local: local limits to charge on
+ * @n_slices: number of slices to charge
+ * @n_handles: number of handles to charge
+ * @n_bytes: number of bytes to charge
+ * @n_fds: number of FDs to charge
+ *
+ * This commits the given resources on @user and @l_local. Committing a quota
+ * means discharging the usage objects but leaving the limits untouched.
+ */
+void bus1_user_commit_quota(struct bus1_user *user,
+ struct bus1_user *actor,
+ struct bus1_user_limits *l_local,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds)
+{
+ struct bus1_user_usage *q_global, *q_local;
+ struct bus1_user_limits *l_global = &user->limits;
+
+ WARN_ON(n_slices < 0 || n_handles < 0 || n_bytes < 0 || n_fds < 0);
+
+ mutex_lock(&user->lock);
+
+ q_local = bus1_user_limits_map(l_local, actor);
+ if (WARN_ON(IS_ERR(q_local)))
+ goto exit;
+
+ q_global = bus1_user_limits_map(&user->limits, actor);
+ if (WARN_ON(IS_ERR(q_global)))
+ goto exit;
+
+ atomic_sub(n_slices, &q_global->n_slices);
+ atomic_sub(n_handles, &q_global->n_handles);
+ atomic_sub(n_bytes, &q_global->n_bytes);
+ atomic_sub(n_fds, &q_global->n_fds);
+
+ atomic_sub(n_slices, &q_local->n_slices);
+ atomic_sub(n_handles, &q_local->n_handles);
+ atomic_sub(n_bytes, &q_local->n_bytes);
+ atomic_sub(n_fds, &q_local->n_fds);
+
+ atomic_add(n_bytes, &l_global->n_inflight_bytes);
+ atomic_add(n_fds, &l_global->n_inflight_fds);
+
+ atomic_add(n_bytes, &l_local->n_inflight_bytes);
+ atomic_add(n_fds, &l_local->n_inflight_fds);
+
+exit:
+ mutex_unlock(&user->lock);
+}
diff --git a/ipc/bus1/user.h b/ipc/bus1/user.h
index 6cdc264..48f987c8 100644
--- a/ipc/bus1/user.h
+++ b/ipc/bus1/user.h
@@ -41,6 +41,45 @@
#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/uidgid.h>
+#include "util.h"
+
+/**
+ * struct bus1_user_usage - usage counters
+ * @n_slices: number of used slices
+ * @n_handles: number of used handles
+ * @n_bytes: number of used bytes
+ * @n_fds: number of used fds
+ */
+struct bus1_user_usage {
+ atomic_t n_slices;
+ atomic_t n_handles;
+ atomic_t n_bytes;
+ atomic_t n_fds;
+};
+
+/**
+ * struct bus1_user_limits - resource limit counters
+ * @n_slices: number of remaining quota for owned slices
+ * @n_handles: number of remaining quota for owned handles
+ * @n_inflight_bytes: number of remaining quota for inflight bytes
+ * @n_inflight_fds: number of remaining quota for inflight FDs
+ * @max_slices: maximum number of owned slices
+ * @max_handles: maximum number of owned handles
+ * @max_inflight_bytes: maximum number of inflight bytes
+ * @max_inflight_fds: maximum number of inflight FDs
+ * @usages: idr of usage entries per uid
+ */
+struct bus1_user_limits {
+ atomic_t n_slices;
+ atomic_t n_handles;
+ atomic_t n_inflight_bytes;
+ atomic_t n_inflight_fds;
+ unsigned int max_slices;
+ unsigned int max_handles;
+ unsigned int max_inflight_bytes;
+ unsigned int max_inflight_fds;
+ struct idr usages;
+};

/**
* struct bus1_user - resource accounting for users
@@ -48,20 +87,54 @@
* @uid: UID of the user
* @lock: object lock
* @rcu: rcu
+ * @limits: resource limit counters
*/
struct bus1_user {
struct kref ref;
kuid_t uid;
struct mutex lock;
- struct rcu_head rcu;
+ union {
+ struct rcu_head rcu;
+ struct bus1_user_limits limits;
+ };
};

/* module cleanup */
void bus1_user_modexit(void);

+/* limits */
+void bus1_user_limits_init(struct bus1_user_limits *limits,
+ struct bus1_user *source);
+void bus1_user_limits_deinit(struct bus1_user_limits *limits);
+
/* users */
struct bus1_user *bus1_user_ref_by_uid(kuid_t uid);
struct bus1_user *bus1_user_ref(struct bus1_user *user);
struct bus1_user *bus1_user_unref(struct bus1_user *user);

+/* charges */
+int bus1_user_charge(atomic_t *global, atomic_t *local, int charge);
+void bus1_user_discharge(atomic_t *global, atomic_t *local, int charge);
+int bus1_user_charge_quota(struct bus1_user *user,
+ struct bus1_user *actor,
+ struct bus1_user_limits *limits,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds);
+void bus1_user_discharge_quota(struct bus1_user *user,
+ struct bus1_user *actor,
+ struct bus1_user_limits *l_local,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds);
+void bus1_user_commit_quota(struct bus1_user *user,
+ struct bus1_user *actor,
+ struct bus1_user_limits *l_local,
+ int n_slices,
+ int n_handles,
+ int n_bytes,
+ int n_fds);
+
#endif /* __BUS1_USER_H */
--
2.10.1

2016-10-26 19:23:22

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 11/14] bus1: implement message transmission

From: Tom Gundersen <[email protected]>

While notifications already work and simply require linking bus1_handle
objects into the destination queue, real messages require proper
payloads. This implements two core objects: Message objects and
factories.

The message factory is similar to a transaction context and usually lives
entirely on the caller's stack (it falls back to the heap only if the
parameters are too large). It is used to import the parameters given by
user-space in a SEND ioctl, parsing and validating them. With this message
factory we can then instantiate many messages, one for each
destination of a multicast.
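
As a rough sketch (illustration only; error handling is trimmed, the
tx/staging step is omitted, and dst_handle/dst_peer are placeholders), the
SEND path drives the factory roughly like this, with peer->local.lock held:

  u8 stack[512];
  struct bus1_factory *factory;
  struct bus1_message *m;

  /* import and validate all SEND parameters once */
  factory = bus1_factory_new(peer, &param, stack, sizeof(stack));

  /* instantiate one message per destination (a single one shown) */
  m = bus1_factory_instantiate(factory, dst_handle, dst_peer);

  /* commit local charges once all messages are instantiated */
  bus1_factory_seal(factory);

  /* ...stage and commit the messages via a transaction... */

  bus1_factory_free(factory);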

Messages need to carry a bunch of data, mainly:
- metadata: This just matches what Unix-sockets do (uid, gid, pid,
tid, and secctx)
- payload: Arbitrary memory passed in as an iovec array by user-space
- files: Set of file-descriptors, very similar to SCM_RIGHTS
- handles: Set of local handles to transfer to the destination
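
A hypothetical user-space counterpart, sending one small payload to a
single destination (the BUS1_CMD_SEND ioctl name and the struct
bus1_cmd_send field layout are assumed from the UAPI header, not quoted
from this series):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/uio.h>
  #include <linux/bus1.h>

  int example_send(int bus1_fd, uint64_t destination)
  {
          struct iovec vec = { .iov_base = "hello", .iov_len = 5 };
          struct bus1_cmd_send cmd = {
                  .ptr_destinations = (uintptr_t)&destination,
                  .n_destinations   = 1,
                  .ptr_vecs         = (uintptr_t)&vec,
                  .n_vecs           = 1,
                  /* no handles, files, or flags in this example */
          };

          return ioctl(bus1_fd, BUS1_CMD_SEND, &cmd);
  }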

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 1 +
ipc/bus1/message.c | 613 +++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/message.h | 171 +++++++++++++++
ipc/bus1/peer.c | 2 +
ipc/bus1/peer.h | 2 +
ipc/bus1/util.c | 162 ++++++++++++++
ipc/bus1/util.h | 7 +
7 files changed, 958 insertions(+)
create mode 100644 ipc/bus1/message.c
create mode 100644 ipc/bus1/message.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index b87cddb..05434bda 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,6 +1,7 @@
bus1-y := \
handle.o \
main.o \
+ message.o \
peer.o \
tx.o \
user.o \
diff --git a/ipc/bus1/message.c b/ipc/bus1/message.c
new file mode 100644
index 0000000..4c5c905
--- /dev/null
+++ b/ipc/bus1/message.c
@@ -0,0 +1,613 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/cred.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/sched.h>
+#include <linux/security.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uidgid.h>
+#include <linux/uio.h>
+#include <uapi/linux/bus1.h>
+#include "handle.h"
+#include "message.h"
+#include "peer.h"
+#include "tx.h"
+#include "user.h"
+#include "util.h"
+#include "util/flist.h"
+#include "util/pool.h"
+#include "util/queue.h"
+
+static size_t bus1_factory_size(struct bus1_cmd_send *param)
+{
+ /* make sure @size cannot overflow */
+ BUILD_BUG_ON(UIO_MAXIOV > U16_MAX);
+ BUILD_BUG_ON(BUS1_FD_MAX > U16_MAX);
+
+ /* make sure we do not violate alignment rules */
+ BUILD_BUG_ON(__alignof(struct bus1_flist) < __alignof(struct iovec));
+ BUILD_BUG_ON(__alignof(struct iovec) < __alignof(struct file *));
+
+ return sizeof(struct bus1_factory) +
+ bus1_flist_inline_size(param->n_handles) +
+ param->n_vecs * sizeof(struct iovec) +
+ param->n_fds * sizeof(struct file *);
+}
+
+/**
+ * bus1_factory_new() - create new message factory
+ * @peer: peer to operate as
+ * @param: factory parameters
+ * @stack: optional stack for factory, or NULL
+ * @n_stack: size of space at @stack
+ *
+ * This allocates a new message factory. It imports data from @param and
+ * prepares the factory for a transaction. From this factory, messages can be
+ * instantiated. This is used both for unicasts and multicasts.
+ *
+ * If @stack is given, this tries to place the factory on the specified stack
+ * space. The caller must guarantee that the factory does not outlive the stack
+ * frame. If this is not wanted, pass 0 as @n_stack.
+ * In either case, if the stack frame is too small, this will allocate the
+ * factory on the heap.
+ *
+ * Return: Pointer to factory, or ERR_PTR on failure.
+ */
+struct bus1_factory *bus1_factory_new(struct bus1_peer *peer,
+ struct bus1_cmd_send *param,
+ void *stack,
+ size_t n_stack)
+{
+ const struct iovec __user *ptr_vecs;
+ const u64 __user *ptr_handles;
+ const int __user *ptr_fds;
+ struct bus1_factory *f;
+ struct bus1_flist *e;
+ struct file *file;
+ size_t i, size;
+ bool is_new;
+ int r, fd;
+ u32 sid;
+ u64 id;
+
+ lockdep_assert_held(&peer->local.lock);
+
+ size = bus1_factory_size(param);
+ if (unlikely(size > n_stack)) {
+ f = kmalloc(size, GFP_TEMPORARY);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+
+ f->on_stack = false;
+ } else {
+ f = stack;
+ f->on_stack = true;
+ }
+
+ /* set to default first, so the destructor can be called anytime */
+ f->peer = peer;
+ f->param = param;
+ f->cred = current_cred();
+ f->pid = task_tgid(current);
+ f->tid = task_pid(current);
+
+ f->has_secctx = false;
+
+ f->length_vecs = 0;
+ f->n_vecs = param->n_vecs;
+ f->n_handles = 0;
+ f->n_handles_charge = 0;
+ f->n_files = 0;
+ f->n_secctx = 0;
+ f->vecs = (void *)(f + 1) + bus1_flist_inline_size(param->n_handles);
+ f->files = (void *)(f->vecs + param->n_vecs);
+ f->secctx = NULL;
+ bus1_flist_init(f->handles, f->param->n_handles);
+
+ /* import vecs */
+ ptr_vecs = (const struct iovec __user *)(unsigned long)param->ptr_vecs;
+ r = bus1_import_vecs(f->vecs, &f->length_vecs, ptr_vecs, f->n_vecs);
+ if (r < 0)
+ goto error;
+
+ /* import handles */
+ r = bus1_flist_populate(f->handles, f->param->n_handles, GFP_TEMPORARY);
+ if (r < 0)
+ goto error;
+
+ ptr_handles = (const u64 __user *)(unsigned long)param->ptr_handles;
+ for (i = 0, e = f->handles;
+ i < f->param->n_handles;
+ e = bus1_flist_next(e, &i)) {
+ if (get_user(id, ptr_handles + f->n_handles)) {
+ r = -EFAULT;
+ goto error;
+ }
+
+ e->ptr = bus1_handle_import(peer, id, &is_new);
+ if (IS_ERR(e->ptr)) {
+ r = PTR_ERR(e->ptr);
+ goto error;
+ }
+
+ ++f->n_handles;
+ if (is_new)
+ ++f->n_handles_charge;
+ }
+
+ /* import files */
+ ptr_fds = (const int __user *)(unsigned long)param->ptr_fds;
+ while (f->n_files < param->n_fds) {
+ if (get_user(fd, ptr_fds + f->n_files)) {
+ r = -EFAULT;
+ goto error;
+ }
+
+ file = bus1_import_fd(fd);
+ if (IS_ERR(file)) {
+ r = PTR_ERR(file);
+ goto error;
+ }
+
+ f->files[f->n_files++] = file;
+ }
+
+ /* import secctx */
+ security_task_getsecid(current, &sid);
+ r = security_secid_to_secctx(sid, &f->secctx, &f->n_secctx);
+ if (r != -EOPNOTSUPP) {
+ if (r < 0)
+ goto error;
+
+ f->has_secctx = true;
+ }
+
+ return f;
+
+error:
+ bus1_factory_free(f);
+ return ERR_PTR(r);
+}
+
+/**
+ * bus1_factory_free() - destroy message factory
+ * @f: factory to operate on, or NULL
+ *
+ * This destroys the message factory @f, previously created via
+ * bus1_factory_new(). All pinned resources are freed. Messages created via the
+ * factory are unaffected.
+ *
+ * If @f is NULL, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+struct bus1_factory *bus1_factory_free(struct bus1_factory *f)
+{
+ struct bus1_flist *e;
+ size_t i;
+
+ if (f) {
+ lockdep_assert_held(&f->peer->local.lock);
+
+ if (f->has_secctx)
+ security_release_secctx(f->secctx, f->n_secctx);
+
+ for (i = 0; i < f->n_files; ++i)
+ fput(f->files[i]);
+
+ /* Iterate and forget imported handles (f->n_handles)... */
+ for (i = 0, e = f->handles;
+ i < f->n_handles;
+ e = bus1_flist_next(e, &i)) {
+ bus1_handle_forget(e->ptr);
+ bus1_handle_unref(e->ptr);
+ }
+ /* ...but free total space (f->param->n_handles). */
+ bus1_flist_deinit(f->handles, f->param->n_handles);
+
+ if (!f->on_stack)
+ kfree(f);
+ }
+
+ return NULL;
+}
+
+/**
+ * bus1_factory_seal() - charge and commit local resources
+ * @f: factory to use
+ *
+ * The factory needs to pin and possibly create local peer resources. This
+ * commits those resources. You should call this after you instantiated all
+ * messages, since you cannot undo it easily.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_factory_seal(struct bus1_factory *f)
+{
+ struct bus1_handle *h;
+ struct bus1_flist *e;
+ size_t i;
+
+ lockdep_assert_held(&f->peer->local.lock);
+
+ for (i = 0, e = f->handles;
+ i < f->n_handles;
+ e = bus1_flist_next(e, &i)) {
+ h = e->ptr;
+ if (bus1_handle_is_public(h))
+ continue;
+
+ --f->n_handles_charge;
+ WARN_ON(h != bus1_handle_acquire(h, false));
+ WARN_ON(atomic_inc_return(&h->n_user) != 1);
+ }
+
+ return 0;
+}
+
+/**
+ * bus1_factory_instantiate() - instantiate a message from a factory
+ * @f: factory to use
+ * @handle: destination handle
+ * @peer: destination peer
+ *
+ * This instantiates a new message targeted at @handle, based on the plans in
+ * the message factory @f.
+ *
+ * The newly created message is not linked into any contexts, but is available
+ * for free use to the caller.
+ *
+ * Return: Pointer to new message, or ERR_PTR on failure.
+ */
+struct bus1_message *bus1_factory_instantiate(struct bus1_factory *f,
+ struct bus1_handle *handle,
+ struct bus1_peer *peer)
+{
+ struct bus1_flist *src_e, *dst_e;
+ struct bus1_message *m;
+ bool transmit_secctx;
+ struct kvec vec;
+ size_t size, i, j;
+ u64 offset;
+ int r;
+
+ lockdep_assert_held(&f->peer->local.lock);
+
+ transmit_secctx = f->has_secctx &&
+ (READ_ONCE(peer->flags) & BUS1_PEER_FLAG_WANT_SECCTX);
+
+ size = sizeof(*m) + bus1_flist_inline_size(f->n_handles) +
+ f->n_files * sizeof(struct file *);
+ m = kmalloc(size, GFP_KERNEL);
+ if (!m)
+ return ERR_PTR(-ENOMEM);
+
+ /* set to default first, so the destructor can be called anytime */
+ kref_init(&m->ref);
+ bus1_queue_node_init(&m->qnode, BUS1_MSG_DATA);
+ m->qnode.owner = peer;
+ m->dst = bus1_handle_ref(handle);
+ m->user = bus1_user_ref(f->peer->user);
+
+ m->flags = 0;
+ m->uid = from_kuid_munged(peer->cred->user_ns, f->cred->uid);
+ m->gid = from_kgid_munged(peer->cred->user_ns, f->cred->gid);
+ m->pid = pid_nr_ns(f->pid, peer->pid_ns);
+ m->tid = pid_nr_ns(f->tid, peer->pid_ns);
+
+ m->n_bytes = f->length_vecs;
+ m->n_handles = 0;
+ m->n_handles_charge = f->n_handles;
+ m->n_files = 0;
+ m->n_secctx = 0;
+ m->slice = NULL;
+ m->files = (void *)(m + 1) + bus1_flist_inline_size(f->n_handles);
+ bus1_flist_init(m->handles, f->n_handles);
+
+ /* allocate pool slice */
+ size = max_t(size_t, 8,
+ ALIGN(m->n_bytes, 8) +
+ ALIGN(f->n_handles * sizeof(u64), 8) +
+ ALIGN(f->n_files * sizeof(int), 8) +
+ ALIGN(f->n_secctx, 8));
+ mutex_lock(&peer->data.lock);
+ m->slice = bus1_pool_alloc(&peer->data.pool, size);
+ mutex_unlock(&peer->data.lock);
+ if (IS_ERR(m->slice)) {
+ r = PTR_ERR(m->slice);
+ m->slice = NULL;
+ goto error;
+ }
+
+ /* import blob */
+ r = bus1_pool_write_iovec(&peer->data.pool, m->slice, 0, f->vecs,
+ f->n_vecs, f->length_vecs);
+ if (r < 0)
+ goto error;
+
+ /* import handles */
+ r = bus1_flist_populate(m->handles, f->n_handles, GFP_KERNEL);
+ if (r < 0)
+ goto error;
+
+ r = 0;
+ m->n_handles = f->n_handles;
+ i = 0;
+ j = 0;
+ src_e = f->handles;
+ dst_e = m->handles;
+ while (i < f->n_handles) {
+ WARN_ON(i != j);
+
+ dst_e->ptr = bus1_handle_ref_by_other(peer, src_e->ptr);
+ if (!dst_e->ptr) {
+ dst_e->ptr = bus1_handle_new_remote(peer, src_e->ptr);
+ if (IS_ERR(dst_e->ptr) && r >= 0) {
+ /*
+ * Continue on error until we imported all
+ * handles. Otherwise, trailing entries in the
+ * array will be stale, and the destructor
+ * cannot tell which.
+ */
+ r = PTR_ERR(dst_e->ptr);
+ }
+ }
+
+ src_e = bus1_flist_next(src_e, &i);
+ dst_e = bus1_flist_next(dst_e, &j);
+ }
+ if (r < 0)
+ goto error;
+
+ /* import files */
+ while (m->n_files < f->n_files) {
+ m->files[m->n_files] = get_file(f->files[m->n_files]);
+ ++m->n_files;
+ }
+
+ /* import secctx */
+ if (transmit_secctx) {
+ offset = ALIGN(m->n_bytes, 8) +
+ ALIGN(m->n_handles * sizeof(u64), 8) +
+ ALIGN(m->n_files * sizeof(int), 8);
+ vec = (struct kvec){
+ .iov_base = f->secctx,
+ .iov_len = f->n_secctx,
+ };
+
+ r = bus1_pool_write_kvec(&peer->data.pool, m->slice, offset,
+ &vec, 1, vec.iov_len);
+ if (r < 0)
+ goto error;
+
+ m->n_secctx = f->n_secctx;
+ m->flags |= BUS1_MSG_FLAG_HAS_SECCTX;
+ }
+
+ return m;
+
+error:
+ bus1_message_unref(m);
+ return ERR_PTR(r);
+}
+
+/**
+ * bus1_message_free() - destroy message
+ * @k: kref belonging to a message
+ *
+ * This frees the message belonging to the reference counter @k. It is supposed
+ * to be used with kref_put(). See bus1_message_unref(). Like all queue nodes,
+ * the memory deallocation is rcu-delayed.
+ */
+void bus1_message_free(struct kref *k)
+{
+ struct bus1_message *m = container_of(k, struct bus1_message, ref);
+ struct bus1_peer *peer = m->qnode.owner;
+ struct bus1_flist *e;
+ size_t i;
+
+ WARN_ON(!peer);
+ lockdep_assert_held(&peer->active);
+
+ for (i = 0; i < m->n_files; ++i)
+ fput(m->files[i]);
+
+ for (i = 0, e = m->handles;
+ i < m->n_handles;
+ e = bus1_flist_next(e, &i)) {
+ if (!IS_ERR_OR_NULL(e->ptr)) {
+ if (m->qnode.group)
+ bus1_handle_release(e->ptr, true);
+ bus1_handle_unref(e->ptr);
+ }
+ }
+ bus1_flist_deinit(m->handles, m->n_handles);
+
+ if (m->slice) {
+ mutex_lock(&peer->data.lock);
+ bus1_pool_release_kernel(&peer->data.pool, m->slice);
+ mutex_unlock(&peer->data.lock);
+ }
+
+ bus1_user_unref(m->user);
+ bus1_handle_unref(m->dst);
+ bus1_queue_node_deinit(&m->qnode);
+ kfree_rcu(m, qnode.rcu);
+}
+
+/**
+ * bus1_message_stage() - stage message
+ * @m: message to operate on
+ * @tx: transaction to stage on
+ *
+ * This acquires all resources of the message @m and then stages the message on
+ * @tx. Like all stage operations, this cannot be undone. Hence, you must make
+ * sure you can continue to commit the transaction without erroring-out in
+ * between.
+ *
+ * This consumes the caller's reference on @m, plus the active reference on the
+ * destination peer.
+ */
+void bus1_message_stage(struct bus1_message *m, struct bus1_tx *tx)
+{
+ struct bus1_peer *peer = m->qnode.owner;
+ struct bus1_flist *e;
+ size_t i;
+
+ WARN_ON(!peer);
+ lockdep_assert_held(&peer->active);
+
+ for (i = 0, e = m->handles;
+ i < m->n_handles;
+ e = bus1_flist_next(e, &i))
+ e->ptr = bus1_handle_acquire(e->ptr, true);
+
+ /* this consumes an active reference on m->qnode.owner */
+ bus1_tx_stage_sync(tx, &m->qnode);
+}
+
+/**
+ * bus1_message_install() - install message payload into target process
+ * @m: message to operate on
+ * @inst_fds: whether to install FDs
+ *
+ * This installs the payload FDs and handles of @message into the receiving
+ * peer and the calling process. Handles are always installed, FDs are only
+ * installed if explicitly requested via @param.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_message_install(struct bus1_message *m, struct bus1_cmd_recv *param)
+{
+ size_t i, j, n, size, offset, n_handles = 0, n_fds = 0;
+ const bool inst_fds = param->flags & BUS1_RECV_FLAG_INSTALL_FDS;
+ const bool peek = param->flags & BUS1_RECV_FLAG_PEEK;
+ struct bus1_peer *peer = m->qnode.owner;
+ struct bus1_handle *h;
+ struct bus1_flist *e;
+ struct kvec vec;
+ u64 ts, *handles;
+ u8 stack[512];
+ void *buffer = stack;
+ int r, *fds;
+
+ WARN_ON(!peer);
+ lockdep_assert_held(&peer->local.lock);
+
+ size = max(m->n_files, min_t(size_t, m->n_handles, BUS1_FLIST_BATCH));
+ size *= max(sizeof(*fds), sizeof(*handles));
+ if (unlikely(size > sizeof(stack))) {
+ buffer = kmalloc(size, GFP_TEMPORARY);
+ if (!buffer)
+ return -ENOMEM;
+ }
+
+ if (m->n_handles > 0) {
+ handles = buffer;
+ ts = bus1_queue_node_get_timestamp(&m->qnode);
+ offset = ALIGN(m->n_bytes, 8);
+
+ i = 0;
+ while ((n = bus1_flist_walk(m->handles, m->n_handles,
+ &e, &i)) > 0) {
+ WARN_ON(i > m->n_handles);
+ WARN_ON(i > BUS1_FLIST_BATCH);
+
+ for (j = 0; j < n; ++j) {
+ h = e[j].ptr;
+ if (h && bus1_handle_is_live_at(h, ts)) {
+ handles[j] = bus1_handle_identify(h);
+ ++n_handles;
+ } else {
+ bus1_handle_release(h, true);
+ e[j].ptr = bus1_handle_unref(h);
+ handles[j] = BUS1_HANDLE_INVALID;
+ }
+ }
+
+ vec.iov_base = buffer;
+ vec.iov_len = n * sizeof(u64);
+
+ r = bus1_pool_write_kvec(&peer->data.pool, m->slice,
+ offset, &vec, 1, vec.iov_len);
+ if (r < 0)
+ goto exit;
+
+ offset += n * sizeof(u64);
+ }
+ }
+
+ if (inst_fds && m->n_files > 0) {
+ fds = buffer;
+
+ for ( ; n_fds < m->n_files; ++n_fds) {
+ r = get_unused_fd_flags(O_CLOEXEC);
+ if (r < 0)
+ goto exit;
+
+ fds[n_fds] = r;
+ }
+
+ vec.iov_base = fds;
+ vec.iov_len = n_fds * sizeof(int);
+ offset = ALIGN(m->n_bytes, 8) +
+ ALIGN(m->n_handles * sizeof(u64), 8);
+
+ r = bus1_pool_write_kvec(&peer->data.pool, m->slice, offset,
+ &vec, 1, vec.iov_len);
+ if (r < 0)
+ goto exit;
+ }
+
+ /* charge resources */
+ if (!peek) {
+ WARN_ON(n_handles < m->n_handles_charge);
+ m->n_handles_charge -= n_handles;
+ }
+
+ /* publish pool slice */
+ mutex_lock(&peer->data.lock);
+ bus1_pool_publish(&peer->data.pool, m->slice);
+ mutex_unlock(&peer->data.lock);
+
+ /* commit handles */
+ for (i = 0, e = m->handles;
+ i < m->n_handles;
+ e = bus1_flist_next(e, &i)) {
+ h = e->ptr;
+ if (!IS_ERR_OR_NULL(h)) {
+ WARN_ON(h != bus1_handle_acquire(h, true));
+ WARN_ON(atomic_inc_return(&h->n_user) < 1);
+ }
+ }
+
+ /* commit FDs */
+ while (n_fds > 0) {
+ --n_fds;
+ fd_install(fds[n_fds], get_file(m->files[n_fds]));
+ }
+
+ r = 0;
+
+exit:
+ while (n_fds-- > 0)
+ put_unused_fd(fds[n_fds]);
+ if (buffer != stack)
+ kfree(buffer);
+ return r;
+}
diff --git a/ipc/bus1/message.h b/ipc/bus1/message.h
new file mode 100644
index 0000000..e8c982f
--- /dev/null
+++ b/ipc/bus1/message.h
@@ -0,0 +1,171 @@
+#ifndef __BUS1_MESSAGE_H
+#define __BUS1_MESSAGE_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Messages
+ *
+ * XXX
+ */
+
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include "util/flist.h"
+#include "util/queue.h"
+
+struct bus1_cmd_send;
+struct bus1_handle;
+struct bus1_peer;
+struct bus1_pool_slice;
+struct bus1_tx;
+struct bus1_user;
+struct cred;
+struct file;
+struct iovec;
+struct pid;
+
+/**
+ * struct bus1_factory - message factory
+ * @peer: sending peer
+ * @param: factory parameters
+ * @cred: sender credentials
+ * @pid: sender PID
+ * @tid: sender TID
+ * @on_stack: whether object lives on stack
+ * @has_secctx: whether secctx has been set
+ * @length_vecs: total length of data in vectors
+ * @n_vecs: number of vectors
+ * @n_handles: number of handles
+ * @n_handles_charge: number of handles to charge on commit
+ * @n_files: number of files
+ * @n_secctx: length of secctx
+ * @vecs: vector array
+ * @files: file array
+ * @secctx: allocated secctx
+ * @handles: handle array
+ */
+struct bus1_factory {
+ struct bus1_peer *peer;
+ struct bus1_cmd_send *param;
+ const struct cred *cred;
+ struct pid *pid;
+ struct pid *tid;
+
+ bool on_stack : 1;
+ bool has_secctx : 1;
+
+ size_t length_vecs;
+ size_t n_vecs;
+ size_t n_handles;
+ size_t n_handles_charge;
+ size_t n_files;
+ u32 n_secctx;
+ struct iovec *vecs;
+ struct file **files;
+ char *secctx;
+
+ struct bus1_flist handles[];
+};
+
+/**
+ * struct bus1_message - data messages
+ * @ref: reference counter
+ * @qnode: embedded queue node
+ * @dst: destination handle
+ * @user: sending user
+ * @flags: message flags
+ * @uid: sender UID
+ * @gid: sender GID
+ * @pid: sender PID
+ * @tid: sender TID
+ * @n_bytes: number of user-bytes transmitted
+ * @n_handles: number of handles transmitted
+ * @n_handles_charge: number of handle charges
+ * @n_files: number of files transmitted
+ * @n_secctx: number of bytes of security context transmitted
+ * @slice: actual message data
+ * @files: passed file descriptors
+ * @handles: passed handles
+ */
+struct bus1_message {
+ struct kref ref;
+ struct bus1_queue_node qnode;
+ struct bus1_handle *dst;
+ struct bus1_user *user;
+
+ u64 flags;
+ uid_t uid;
+ gid_t gid;
+ pid_t pid;
+ pid_t tid;
+
+ size_t n_bytes;
+ size_t n_handles;
+ size_t n_handles_charge;
+ size_t n_files;
+ size_t n_secctx;
+ struct bus1_pool_slice *slice;
+ struct file **files;
+
+ struct bus1_flist handles[];
+};
+
+struct bus1_factory *bus1_factory_new(struct bus1_peer *peer,
+ struct bus1_cmd_send *param,
+ void *stack,
+ size_t n_stack);
+struct bus1_factory *bus1_factory_free(struct bus1_factory *f);
+int bus1_factory_seal(struct bus1_factory *f);
+struct bus1_message *bus1_factory_instantiate(struct bus1_factory *f,
+ struct bus1_handle *handle,
+ struct bus1_peer *peer);
+
+void bus1_message_free(struct kref *k);
+void bus1_message_stage(struct bus1_message *m, struct bus1_tx *tx);
+int bus1_message_install(struct bus1_message *m, struct bus1_cmd_recv *param);
+
+/**
+ * bus1_message_ref() - acquire object reference
+ * @m: message to operate on, or NULL
+ *
+ * This acquires a single reference to @m. The caller must already hold a
+ * reference when calling this.
+ *
+ * If @m is NULL, this is a no-op.
+ *
+ * Return: @m is returned.
+ */
+static inline struct bus1_message *bus1_message_ref(struct bus1_message *m)
+{
+ if (m)
+ kref_get(&m->ref);
+ return m;
+}
+
+/**
+ * bus1_message_unref() - release object reference
+ * @m: message to operate on, or NULL
+ *
+ * This releases a single object reference to @m. If the reference counter
+ * drops to 0, the message is destroyed.
+ *
+ * If @m is NULL, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+static inline struct bus1_message *bus1_message_unref(struct bus1_message *m)
+{
+ if (m)
+ kref_put(&m->ref, bus1_message_free);
+ return NULL;
+}
+
+#endif /* __BUS1_MESSAGE_H */
diff --git a/ipc/bus1/peer.c b/ipc/bus1/peer.c
index a1525cb..0ff7a98 100644
--- a/ipc/bus1/peer.c
+++ b/ipc/bus1/peer.c
@@ -70,6 +70,7 @@ struct bus1_peer *bus1_peer_new(void)

/* initialize data section */
mutex_init(&peer->data.lock);
+ peer->data.pool = BUS1_POOL_NULL;
bus1_queue_init(&peer->data.queue);

/* initialize peer-private section */
@@ -136,6 +137,7 @@ struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)

/* deinitialize data section */
bus1_queue_deinit(&peer->data.queue);
+ bus1_pool_deinit(&peer->data.pool);
mutex_destroy(&peer->data.lock);

/* deinitialize constant fields */
diff --git a/ipc/bus1/peer.h b/ipc/bus1/peer.h
index 655d3ac..5eb558f 100644
--- a/ipc/bus1/peer.h
+++ b/ipc/bus1/peer.h
@@ -54,6 +54,7 @@
#include <linux/wait.h>
#include "user.h"
#include "util/active.h"
+#include "util/pool.h"
#include "util/queue.h"

struct cred;
@@ -88,6 +89,7 @@ struct bus1_peer {

struct {
struct mutex lock;
+ struct bus1_pool pool;
struct bus1_queue queue;
} data;

diff --git a/ipc/bus1/util.c b/ipc/bus1/util.c
index 8acf798..687f40d 100644
--- a/ipc/bus1/util.c
+++ b/ipc/bus1/util.c
@@ -9,12 +9,174 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/atomic.h>
+#include <linux/compat.h>
#include <linux/debugfs.h>
#include <linux/err.h>
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/kernel.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uio.h>
+#include <net/sock.h>
+#include "main.h"
#include "util.h"

+/**
+ * bus1_import_vecs() - import vectors from user
+ * @out_vecs: kernel memory to store vecs, preallocated
+ * @out_length: output storage for sum of all vectors lengths
+ * @vecs: user pointer for vectors
+ * @n_vecs: number of vectors to import
+ *
+ * This copies the given vectors from user memory into the preallocated kernel
+ * buffer. Sanity checks are performed on the memory of the vector-array, the
+ * memory pointed to by the vectors and on the overall size calculation.
+ *
+ * If the vectors were copied successfully, @out_length will contain the sum of
+ * all vector-lengths.
+ *
+ * Unlike most other functions, this function might modify its output buffer
+ * even if it fails. That is, @out_vecs might contain garbage if this function
+ * fails. This is done for performance reasons.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int bus1_import_vecs(struct iovec *out_vecs,
+ size_t *out_length,
+ const void __user *vecs,
+ size_t n_vecs)
+{
+ size_t i, length = 0;
+
+ if (n_vecs > UIO_MAXIOV)
+ return -EMSGSIZE;
+ if (n_vecs == 0) {
+ *out_length = 0;
+ return 0;
+ }
+
+ if (IS_ENABLED(CONFIG_COMPAT) && in_compat_syscall()) {
+ /*
+ * Compat types and macros are protected by CONFIG_COMPAT,
+ * rather than providing a fallback. We want compile-time
+ * coverage, so provide fallback types. The IS_ENABLED(COMPAT)
+ * condition guarantees this is collected by the dead-code
+ * elimination, anyway.
+ */
+#if IS_ENABLED(CONFIG_COMPAT)
+ const struct compat_iovec __user *uvecs = vecs;
+ compat_uptr_t v_base;
+ compat_size_t v_len;
+ compat_ssize_t v_slen;
+#else
+ const struct iovec __user *uvecs = vecs;
+ void __user *v_base;
+ size_t v_len;
+ ssize_t v_slen;
+#endif
+ void __user *v_ptr;
+
+ if (unlikely(!access_ok(VERIFY_READ, vecs,
+ sizeof(*uvecs) * n_vecs)))
+ return -EFAULT;
+
+ for (i = 0; i < n_vecs; ++i) {
+ if (unlikely(__get_user(v_base, &uvecs[i].iov_base) ||
+ __get_user(v_len, &uvecs[i].iov_len)))
+ return -EFAULT;
+
+#if IS_ENABLED(CONFIG_COMPAT)
+ v_ptr = compat_ptr(v_base);
+#else
+ v_ptr = v_base;
+#endif
+ v_slen = v_len;
+
+ if (unlikely(v_slen < 0 ||
+ (typeof(v_len))v_slen != v_len))
+ return -EMSGSIZE;
+ if (unlikely(!access_ok(VERIFY_READ, v_ptr, v_len)))
+ return -EFAULT;
+ if (unlikely((size_t)v_len > MAX_RW_COUNT - length))
+ return -EMSGSIZE;
+
+ out_vecs[i].iov_base = v_ptr;
+ out_vecs[i].iov_len = v_len;
+ length += v_len;
+ }
+ } else {
+ void __user *v_base;
+ size_t v_len;
+
+ if (copy_from_user(out_vecs, vecs, sizeof(*out_vecs) * n_vecs))
+ return -EFAULT;
+
+ for (i = 0; i < n_vecs; ++i) {
+ v_base = out_vecs[i].iov_base;
+ v_len = out_vecs[i].iov_len;
+
+ if (unlikely((ssize_t)v_len < 0))
+ return -EMSGSIZE;
+ if (unlikely(!access_ok(VERIFY_READ, v_base, v_len)))
+ return -EFAULT;
+ if (unlikely(v_len > MAX_RW_COUNT - length))
+ return -EMSGSIZE;
+
+ length += v_len;
+ }
+ }
+
+ *out_length = length;
+ return 0;
+}
+
+/**
+ * bus1_import_fd() - import file descriptor from user
+ * @fd: user-supplied file descriptor number
+ *
+ * This imports a file-descriptor from the current user-context. The FD number
+ * is resolved to its file, which is then pinned and returned to the
+ * caller. If something goes wrong, an error is returned.
+ *
+ * Neither bus1, nor UDS files are allowed. If those are supplied, EOPNOTSUPP
+ * is returned. Those would require expensive garbage-collection if they're
+ * sent recursively by user-space.
+ *
+ * Return: Pointer to pinned file, ERR_PTR on failure.
+ */
+struct file *bus1_import_fd(int fd)
+{
+ struct file *f, *ret;
+ struct socket *sock;
+ struct inode *inode;
+
+ if (unlikely(fd < 0))
+ return ERR_PTR(-EBADF);
+
+ f = fget_raw(fd);
+ if (unlikely(!f))
+ return ERR_PTR(-EBADF);
+
+ inode = file_inode(f);
+ sock = S_ISSOCK(inode->i_mode) ? SOCKET_I(inode) : NULL;
+
+ if (f->f_mode & FMODE_PATH)
+ ret = f; /* O_PATH is always allowed */
+ else if (f->f_op == &bus1_fops)
+ ret = ERR_PTR(-EOPNOTSUPP); /* disallow bus1 recursion */
+ else if (sock && sock->sk && sock->ops && sock->ops->family == PF_UNIX)
+ ret = ERR_PTR(-EOPNOTSUPP); /* disallow UDS recursion */
+ else
+ ret = f; /* all others are allowed */
+
+ if (f != ret)
+ fput(f);
+
+ return ret;
+}
+
#if defined(CONFIG_DEBUG_FS)

static int bus1_debugfs_atomic_t_get(void *data, u64 *val)
diff --git a/ipc/bus1/util.h b/ipc/bus1/util.h
index c22ecd5..ab41d5e 100644
--- a/ipc/bus1/util.h
+++ b/ipc/bus1/util.h
@@ -26,6 +26,7 @@
#include <linux/types.h>

struct dentry;
+struct iovec;

/**
* BUS1_TAIL - tail pointer in singly-linked lists
@@ -37,6 +38,12 @@ struct dentry;
*/
#define BUS1_TAIL ERR_PTR(-1)

+int bus1_import_vecs(struct iovec *out_vecs,
+ size_t *out_length,
+ const void __user *vecs,
+ size_t n_vecs);
+struct file *bus1_import_fd(int fd);
+
#if defined(CONFIG_DEBUG_FS)

struct dentry *
--
2.10.1
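
Usage note on the helpers added in this patch: bus1_import_vecs() expects
@out_vecs to provide room for @n_vecs entries (at most UIO_MAXIOV) and leaves
it in an undefined state on failure, while bus1_import_fd() returns the pinned
file. A minimal, purely illustrative caller might look as follows; the function
name, the allocation strategy, and the elided send logic are assumptions, not
part of the patch:

    /* illustrative only, not part of the series */
    static int bus1_dummy_send(const void __user *uvecs, size_t n_vecs,
                               int user_fd)
    {
            struct iovec *vecs;
            struct file *file;
            size_t length;
            int r;

            if (n_vecs > UIO_MAXIOV)
                    return -EMSGSIZE;

            vecs = kmalloc_array(n_vecs, sizeof(*vecs), GFP_KERNEL);
            if (!vecs)
                    return -ENOMEM;

            /* validate and import the user-supplied iovec array in one go */
            r = bus1_import_vecs(vecs, &length, uvecs, n_vecs);
            if (r < 0)
                    goto exit;

            /* pin the passed file; bus1 and UDS files are rejected */
            file = bus1_import_fd(user_fd);
            if (IS_ERR(file)) {
                    r = PTR_ERR(file);
                    goto exit;
            }

            /* ... copy @length bytes described by @vecs, queue @file ... */

            fput(file);
            r = 0;
    exit:
            kfree(vecs);
            return r;
    }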

2016-10-26 19:23:27

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 10/14] bus1: add handle management

From: Tom Gundersen <[email protected]>

The object system on a bus is based on 'nodes' and 'handles'. Any peer
can allocate new, local objects at any time. The creator automatically
becomes the sole owner of the object. References to objects can be
passed as payload of messages. The recipient will then gain their own
reference to the object as well. Additionally, an object can be the
destination of a message, in which case the message is always sent to
the original creator (and thus the owner) of the object.

Internally, objects are called 'nodes'. A reference to an object is a
'handle'. Whenever a new node is created, the owner implicitly gains a
handle as well. In fact, handles are the only way to refer to a node.
The node itself is entirely hidden in the implementation, and visible
in the API as an "anchor handle".

Whenever a handle is passed as payload of a message, the target peer
will gain a handle linked to the same underlying node. This works
regardless of whether the sender is the owner of the underlying node,
or not.

Each peer can identify all its handles (both owned and un-owned) by a
64-bit integer. The namespace is local to each peer, and the numbers
cannot be compared with the numbers of other peers (in fact, they are
very likely to clash, but might still have *different* underlying
nodes). However, if a peer receives a reference to the same node
multiple times, the resulting handle will be the same. The kernel keeps
count of how often each peer owns a handle.
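
To make the ID namespace concrete, the small sketch below mirrors the
allocation done by bus1_handle_identify() in this patch; the flag values shown
are placeholders (assumptions), the real BUS1_HANDLE_FLAG_* constants are
defined in uapi/linux/bus1.h:

    #define SKETCH_FLAG_MANAGED 0x1ULL  /* assumption, not the real value */
    #define SKETCH_FLAG_REMOTE  0x2ULL  /* assumption, not the real value */

    static u64 sketch_identify(u64 *handle_ids, bool is_anchor)
    {
            /* per-peer counter in the upper bits, flag bits in the low 3 bits */
            u64 id = ++*handle_ids << 3;

            id |= SKETCH_FLAG_MANAGED;
            if (!is_anchor)
                    id |= SKETCH_FLAG_REMOTE;

            return id;
    }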

If a peer no longer requires a specific handle, it can release it. If
the peer releases its last reference to a handle, the handle will be
destroyed.

The owner of a node (and *only* the owner) can trigger the destruction
of a node (even if other peers still own handles to it). In this case,
all peers that own a handle are notified of this fact. Once all handles
to a specific node have been released (except for the handle internally
pinned in the node itself), the owner of the node is notified of this,
so it can potentially destroy both any linked state and the node itself.

Node destruction is fully synchronized with any transaction. That is, a
node and all its handles are valid in every message that is transmitted
*before* the notification of its destruction. Furthermore, no message
after this notification will carry the ID of such a destroyed node. Note
that message transactions are asynchronous. That is, there is no unique
point in time that a message is synchronized with another message.
Hence, whether a specific handle passed with a message is still valid or
not, cannot be predicted by the sender, but only by one of the
receivers.
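
As an illustration of these guarantees, the following condensed sketch shows
how a node owner can drive a destruction with the transaction context (bus1_tx)
introduced earlier in this series; reference handling and error paths are
omitted, so treat it as a sketch rather than a verbatim excerpt:

    struct bus1_tx tx;

    /* @owner and @anchor are assumed to be pinned by the caller */
    mutex_lock(&owner->local.lock);
    bus1_tx_init(&tx, owner);

    mutex_lock(&owner->data.lock);
    /* marks the node destroyed and stages notifications on @tx */
    bus1_handle_destroy_locked(anchor, &tx);
    mutex_unlock(&owner->data.lock);

    /* syncs all affected clocks, then commits the notifications */
    bus1_tx_commit(&tx);

    mutex_unlock(&owner->local.lock);
    bus1_tx_deinit(&tx);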

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 1 +
ipc/bus1/handle.c | 823 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/handle.h | 312 +++++++++++++++++++++
ipc/bus1/peer.c | 3 +
ipc/bus1/peer.h | 2 +
ipc/bus1/util.h | 83 ++++++
6 files changed, 1224 insertions(+)
create mode 100644 ipc/bus1/handle.c
create mode 100644 ipc/bus1/handle.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index e3c7dd7..b87cddb 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,4 +1,5 @@
bus1-y := \
+ handle.o \
main.o \
peer.o \
tx.o \
diff --git a/ipc/bus1/handle.c b/ipc/bus1/handle.c
new file mode 100644
index 0000000..10f224e
--- /dev/null
+++ b/ipc/bus1/handle.c
@@ -0,0 +1,823 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#include <linux/atomic.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+#include <uapi/linux/bus1.h>
+#include "handle.h"
+#include "peer.h"
+#include "tx.h"
+#include "util.h"
+#include "util/queue.h"
+
+static void bus1_handle_init(struct bus1_handle *h, struct bus1_peer *holder)
+{
+ kref_init(&h->ref);
+ atomic_set(&h->n_weak, 0);
+ atomic_set(&h->n_user, 0);
+ h->holder = holder;
+ h->anchor = NULL;
+ h->tlink = NULL;
+ RB_CLEAR_NODE(&h->rb_to_peer);
+ h->id = BUS1_HANDLE_INVALID;
+}
+
+static void bus1_handle_deinit(struct bus1_handle *h)
+{
+ if (h == h->anchor) {
+ WARN_ON(atomic_read(&h->node.n_strong) != 0);
+ WARN_ON(!RB_EMPTY_ROOT(&h->node.map_handles));
+ } else if (h->anchor) {
+ WARN_ON(!RB_EMPTY_NODE(&h->remote.rb_to_anchor));
+ bus1_handle_unref(h->anchor);
+ }
+
+ bus1_queue_node_deinit(&h->qnode);
+ WARN_ON(!RB_EMPTY_NODE(&h->rb_to_peer));
+ WARN_ON(h->tlink);
+ WARN_ON(atomic_read(&h->n_user) != 0);
+ WARN_ON(atomic_read(&h->n_weak) != 0);
+}
+
+/**
+ * bus1_handle_new_anchor() - allocate new anchor handle
+ * @holder: peer to set as holder
+ *
+ * This allocates a new, fresh anchor handle for free use by the caller.
+ *
+ * Return: Pointer to handle, or ERR_PTR on failure.
+ */
+struct bus1_handle *bus1_handle_new_anchor(struct bus1_peer *holder)
+{
+ struct bus1_handle *anchor;
+
+ anchor = kmalloc(sizeof(*anchor), GFP_KERNEL);
+ if (!anchor)
+ return ERR_PTR(-ENOMEM);
+
+ bus1_handle_init(anchor, holder);
+ anchor->anchor = anchor;
+ bus1_queue_node_init(&anchor->qnode, BUS1_MSG_NODE_RELEASE);
+ anchor->node.map_handles = RB_ROOT;
+ anchor->node.flags = 0;
+ atomic_set(&anchor->node.n_strong, 0);
+
+ return anchor;
+}
+
+/**
+ * bus1_handle_new_remote() - allocate new remote handle
+ * @holder: peer to set as holder
+ * @other: other handle to link to
+ *
+ * This allocates a new, fresh remote handle for free use by the caller. The
+ * handle will use the same anchor as @other (or @other in case it is an
+ * anchor).
+ *
+ * Return: Pointer to handle, or ERR_PTR on failure.
+ */
+struct bus1_handle *bus1_handle_new_remote(struct bus1_peer *holder,
+ struct bus1_handle *other)
+{
+ struct bus1_handle *remote;
+
+ if (WARN_ON(!other))
+ return ERR_PTR(-ENOTRECOVERABLE);
+
+ remote = kmalloc(sizeof(*remote), GFP_KERNEL);
+ if (!remote)
+ return ERR_PTR(-ENOMEM);
+
+ bus1_handle_init(remote, holder);
+ remote->anchor = bus1_handle_ref(other->anchor);
+ bus1_queue_node_init(&remote->qnode, BUS1_MSG_NODE_DESTROY);
+ RB_CLEAR_NODE(&remote->remote.rb_to_anchor);
+
+ return remote;
+}
+
+/**
+ * bus1_handle_free() - free handle
+ * @k: kref of handle to free
+ *
+ * This frees the handle belonging to the kref @k. It is meant to be used as
+ * callback for kref_put(). The actual memory release is rcu-delayed so the
+ * handle stays around at least until the next grace period.
+ */
+void bus1_handle_free(struct kref *k)
+{
+ struct bus1_handle *h = container_of(k, struct bus1_handle, ref);
+
+ bus1_handle_deinit(h);
+ kfree_rcu(h, qnode.rcu);
+}
+
+static struct bus1_peer *bus1_handle_acquire_holder(struct bus1_handle *handle)
+{
+ struct bus1_peer *peer = NULL;
+
+ /*
+ * The holder of a handle is set during ATTACH and remains set until
+ * the handle is destroyed. This ACQUIRE pairs with the RELEASE during
+ * ATTACH, and guarantees handle->holder is non-NULL, if n_weak is set.
+ *
+ * We still need to do this under rcu-lock. During DETACH, n_weak drops
+ * to 0, and then may be followed by a kfree_rcu() on the peer. Hence,
+ * we guarantee that if we read n_weak > 0 and the holder in the same
+ * critical section, it must be accessible.
+ */
+ rcu_read_lock();
+ if (atomic_read_acquire(&handle->n_weak) > 0)
+ peer = bus1_peer_acquire(lockless_dereference(handle->holder));
+ rcu_read_unlock();
+
+ return peer;
+}
+
+/**
+ * bus1_handle_acquire_owner() - acquire owner of a handle
+ * @handle: handle to operate on
+ *
+ * This tries to acquire the owner of a handle. If the owner is already
+ * detached, this will return NULL.
+ *
+ * Return: Pointer to owner on success, NULL on failure.
+ */
+struct bus1_peer *bus1_handle_acquire_owner(struct bus1_handle *handle)
+{
+ return bus1_handle_acquire_holder(handle->anchor);
+}
+
+static void bus1_handle_queue_release(struct bus1_handle *handle)
+{
+ struct bus1_handle *anchor = handle->anchor;
+ struct bus1_peer *owner;
+
+ if (test_bit(BUS1_HANDLE_BIT_RELEASED, &anchor->node.flags) ||
+ test_bit(BUS1_HANDLE_BIT_DESTROYED, &anchor->node.flags))
+ return;
+
+ owner = anchor->holder;
+ lockdep_assert_held(&owner->data.lock);
+
+ if (!bus1_queue_node_is_queued(&anchor->qnode)) {
+ /*
+ * A release notification is a unicast message. Hence, we can
+ * simply queue it right away without any pre-staging.
+ * Furthermore, no transaction context is needed. But we still
+ * need a group tag. NULL would serve well, but disallows
+ * re-use detection. Hence, we use the sending peer as group
+ * tag (there cannot be any conflicts since we have a unique
+ * commit timestamp for this message, thus any group tag would
+ * work fine).
+ * If the group tag is already set, we know the release
+ * notification was already used before. Hence, we must
+ * re-initialize the object.
+ */
+ if (anchor->qnode.group) {
+ WARN_ON(anchor->qnode.group != owner);
+ bus1_queue_node_deinit(&anchor->qnode);
+ bus1_queue_node_init(&anchor->qnode,
+ BUS1_MSG_NODE_RELEASE);
+ }
+
+ anchor->qnode.group = owner;
+ bus1_handle_ref(anchor);
+ bus1_queue_commit_unstaged(&owner->data.queue, &owner->waitq,
+ &anchor->qnode);
+ }
+}
+
+static void bus1_handle_flush_release(struct bus1_handle *handle)
+{
+ struct bus1_handle *anchor = handle->anchor;
+ struct bus1_peer *owner;
+
+ if (test_bit(BUS1_HANDLE_BIT_RELEASED, &anchor->node.flags) ||
+ test_bit(BUS1_HANDLE_BIT_DESTROYED, &anchor->node.flags))
+ return;
+
+ owner = anchor->holder;
+ lockdep_assert_held(&owner->data.lock);
+
+ if (bus1_queue_node_is_queued(&anchor->qnode)) {
+ bus1_queue_remove(&owner->data.queue, &owner->waitq,
+ &anchor->qnode);
+ bus1_handle_unref(anchor);
+ }
+}
+
+/**
+ * bus1_handle_ref_by_other() - lookup handle on a peer
+ * @peer: peer to lookup handle for
+ * @handle: other handle to match for
+ *
+ * This looks for a handle held by @peer, which points to the same node as
+ * @handle (i.e., it is linked to @handle->anchor). If @peer does not hold such
+ * a handle, this returns NULL. Otherwise, an object reference is acquired and
+ * returned as pointer.
+ *
+ * The caller must hold an active reference to @peer.
+ *
+ * Return: Pointer to handle if found, NULL if not found.
+ */
+struct bus1_handle *bus1_handle_ref_by_other(struct bus1_peer *peer,
+ struct bus1_handle *handle)
+{
+ struct bus1_handle *h, *res = NULL;
+ struct bus1_peer *owner = NULL;
+ struct rb_node *n;
+
+ if (peer == handle->anchor->holder)
+ return bus1_handle_ref(handle->anchor);
+
+ owner = bus1_handle_acquire_owner(handle);
+ if (!owner)
+ return NULL;
+
+ mutex_lock(&owner->data.lock);
+ n = handle->anchor->node.map_handles.rb_node;
+ while (n) {
+ h = container_of(n, struct bus1_handle, remote.rb_to_anchor);
+ if (peer < h->holder) {
+ n = n->rb_left;
+ } else if (peer > h->holder) {
+ n = n->rb_right;
+ } else /* if (peer == h->holder) */ {
+ res = bus1_handle_ref(h);
+ break;
+ }
+ }
+ mutex_unlock(&owner->data.lock);
+
+ bus1_peer_release(owner);
+ return res;
+}
+
+static struct bus1_handle *bus1_handle_splice(struct bus1_handle *handle)
+{
+ struct bus1_queue_node *qnode = &handle->qnode;
+ struct bus1_handle *h, *anchor = handle->anchor;
+ struct rb_node *n, **slot;
+
+ n = NULL;
+ slot = &anchor->node.map_handles.rb_node;
+ while (*slot) {
+ n = *slot;
+ h = container_of(n, struct bus1_handle, remote.rb_to_anchor);
+ if (unlikely(handle->holder == h->holder)) {
+ /* conflict detected; return ref to caller */
+ return bus1_handle_ref(h);
+ } else if (handle->holder < h->holder) {
+ slot = &n->rb_left;
+ } else /* if (handle->holder > h->holder) */ {
+ slot = &n->rb_right;
+ }
+ }
+
+ rb_link_node(&handle->remote.rb_to_anchor, n, slot);
+ rb_insert_color(&handle->remote.rb_to_anchor,
+ &anchor->node.map_handles);
+ /* map_handles pins one ref of each entry */
+ bus1_handle_ref(handle);
+
+ /*
+ * If a destruction is ongoing on @anchor, we must try joining it. If
+ * @qnode->group is set, we already tried joining it and can skip it.
+ * If it is not set, we acquire the owner and try joining once. See
+ * bus1_tx_join() for details.
+ *
+ * Note that we must not react to a possible failure! Any such reaction
+ * would be out-of-order, hence just ignore it silently. We simply end
+ * up with a stale handle, which is completely fine.
+ */
+ if (test_bit(BUS1_HANDLE_BIT_DESTROYED, &anchor->node.flags) &&
+ !qnode->group) {
+ qnode->owner = bus1_peer_acquire(handle->holder);
+ if (qnode->owner && bus1_tx_join(&anchor->qnode, qnode))
+ bus1_handle_ref(handle);
+ else
+ qnode->owner = bus1_peer_release(qnode->owner);
+ }
+
+ return NULL;
+}
+
+/**
+ * bus1_handle_acquire_locked() - acquire weak/strong reference
+ * @handle: handle to operate on
+ * @strong: whether to acquire a strong reference
+ *
+ * This is the same as bus1_handle_acquire_slow(), but requires the caller to
+ * hold the data locks of the holder of @handle and of the node owner.
+ *
+ * Return: Acquired handle (possibly a conflict).
+ */
+struct bus1_handle *bus1_handle_acquire_locked(struct bus1_handle *handle,
+ bool strong)
+{
+ struct bus1_handle *h, *anchor = handle->anchor;
+ struct bus1_peer *owner = NULL;
+
+ if (!test_bit(BUS1_HANDLE_BIT_RELEASED, &anchor->node.flags))
+ owner = anchor->holder;
+
+ /*
+ * Verify the correct locks are held: the holder of @handle must be
+ * set, and its data lock must be held by the caller.
+ * Additionally, the owner must be locked as well. However, the owner
+ * might be released already. The caller must guarantee that if the
+ * owner is not released, yet, it is locked.
+ */
+ WARN_ON(!handle->holder);
+ lockdep_assert_held(&handle->holder->data.lock);
+ if (owner)
+ lockdep_assert_held(&owner->data.lock);
+
+ if (atomic_read(&handle->n_weak) == 0) {
+ if (test_bit(BUS1_HANDLE_BIT_RELEASED, &anchor->node.flags)) {
+ /*
+ * When the node is already released, any attach ends
+ * up as stale handle. So nothing special to do here.
+ */
+ } else if (handle == anchor) {
+ /*
+ * Attach of an anchor: There is nothing to do, we
+ * simply verify the map is empty and continue.
+ */
+ WARN_ON(!RB_EMPTY_ROOT(&handle->node.map_handles));
+ } else if (owner) {
+ /*
+ * Attach of a remote: If the node is not released,
+ * yet, we insert it into the lookup tree. Otherwise,
+ * we leave it around as stale handle. Note that
+ * tree-insertion might race. If a conflict is detected
+ * we drop this handle and restart with the conflict.
+ */
+ h = bus1_handle_splice(handle);
+ if (unlikely(h)) {
+ bus1_handle_unref(handle);
+ WARN_ON(atomic_read(&h->n_weak) != 1);
+ return bus1_handle_acquire_locked(h, strong);
+ }
+ }
+
+ bus1_handle_ref(handle);
+
+ /*
+ * This RELEASE pairs with the ACQUIRE in
+ * bus1_handle_acquire_holder(). It simply guarantees that
+ * handle->holder is set before n_weak>0 is visible. It does
+ * not give any guarantees on the validity of the holder. All
+ * it does is guarantee it is non-NULL and will stay constant.
+ */
+ atomic_set_release(&handle->n_weak, 1);
+ } else {
+ WARN_ON(atomic_inc_return(&handle->n_weak) < 1);
+ }
+
+ if (strong && atomic_inc_return(&anchor->node.n_strong) == 1) {
+ if (owner)
+ bus1_handle_flush_release(anchor);
+ }
+
+ return handle;
+}
+
+/**
+ * bus1_handle_acquire_slow() - slow-path of handle acquisition
+ * @handle: handle to acquire
+ * @strong: whether to acquire a strong reference
+ *
+ * This is the slow-path of bus1_handle_acquire(). See there for details.
+ *
+ * Return: Acquired handle (possibly a conflict).
+ */
+struct bus1_handle *bus1_handle_acquire_slow(struct bus1_handle *handle,
+ bool strong)
+{
+ const bool is_anchor = (handle == handle->anchor);
+ struct bus1_peer *owner;
+
+ if (is_anchor)
+ owner = handle->holder;
+ else
+ owner = bus1_handle_acquire_owner(handle);
+
+ bus1_mutex_lock2(&handle->holder->data.lock,
+ owner ? &owner->data.lock : NULL);
+ handle = bus1_handle_acquire_locked(handle, strong);
+ bus1_mutex_unlock2(&handle->holder->data.lock,
+ owner ? &owner->data.lock : NULL);
+
+ if (!is_anchor)
+ bus1_peer_release(owner);
+
+ return handle;
+}
+
+static void bus1_handle_release_locked(struct bus1_handle *h,
+ struct bus1_peer *owner,
+ bool strong)
+{
+ struct bus1_handle *t, *safe, *anchor = h->anchor;
+
+ if (atomic_dec_return(&h->n_weak) == 0) {
+ if (test_bit(BUS1_HANDLE_BIT_RELEASED, &anchor->node.flags)) {
+ /*
+ * In case a node is already released, all its handles
+ * are already stale (and new handles are instantiated
+ * as stale). Nothing to do.
+ */
+ } else if (h == anchor) {
+ /*
+ * Releasing an anchor requires us to drop all remotes
+ * from the map. We do not detach them, though, we just
+ * clear the map and drop the pinned reference.
+ */
+ WARN_ON(!owner);
+ rbtree_postorder_for_each_entry_safe(t, safe,
+ &h->node.map_handles,
+ remote.rb_to_anchor) {
+ RB_CLEAR_NODE(&t->remote.rb_to_anchor);
+ /* drop reference held by link into map */
+ bus1_handle_unref(t);
+ }
+ h->node.map_handles = RB_ROOT;
+ bus1_handle_flush_release(h);
+ set_bit(BUS1_HANDLE_BIT_RELEASED, &h->node.flags);
+ } else if (!owner) {
+ /*
+ * If an owner is disconnected, its nodes remain until
+ * the owner is drained. In that period, it is
+ * impossible for any handle-release to acquire, and
+ * thus lock, the owner. Therefore, if that happens we
+ * leave the handle linked and rely on the owner
+ * cleanup to flush them all.
+ *
+ * A side-effect of this is that the holder field must
+ * remain set, even though it must not be dereferenced
+ * as it is a stale pointer. This is required to keep
+ * the rbtree lookup working. Anyone dereferencing the
+ * holder of a remote must therefore either hold a weak
+ * reference or check for n_weak with the owner locked.
+ */
+ } else if (!WARN_ON(RB_EMPTY_NODE(&h->remote.rb_to_anchor))) {
+ rb_erase(&h->remote.rb_to_anchor,
+ &anchor->node.map_handles);
+ RB_CLEAR_NODE(&h->remote.rb_to_anchor);
+ /* drop reference held by link into map */
+ bus1_handle_unref(h);
+ }
+
+ /* queue release after detach but before unref */
+ if (strong && atomic_dec_return(&anchor->node.n_strong) == 0) {
+ if (owner)
+ bus1_handle_queue_release(anchor);
+ }
+
+ /*
+ * This is the reference held by n_weak>0 (or 'holder valid').
+ * Note that the holder-field will remain set and stale.
+ */
+ bus1_handle_unref(h);
+ } else if (strong && atomic_dec_return(&anchor->node.n_strong) == 0) {
+ /* still weak refs left, only queue release notification */
+ if (owner)
+ bus1_handle_queue_release(anchor);
+ }
+}
+
+/**
+ * bus1_handle_release_slow() - slow-path of handle release
+ * @handle: handle to release
+ * @strong: whether to release a strong reference
+ *
+ * This is the slow-path of bus1_handle_release(). See there for details.
+ */
+void bus1_handle_release_slow(struct bus1_handle *handle, bool strong)
+{
+ const bool is_anchor = (handle == handle->anchor);
+ struct bus1_peer *owner, *holder;
+
+ /*
+ * Caller must own an active reference to the holder of @handle.
+ * Furthermore, since the caller also owns a weak reference to @handle
+ * we know that its holder cannot be NULL nor modified in parallel.
+ */
+ holder = handle->holder;
+ WARN_ON(!holder);
+ lockdep_assert_held(&holder->active);
+
+ if (is_anchor)
+ owner = holder;
+ else
+ owner = bus1_handle_acquire_owner(handle);
+
+ bus1_mutex_lock2(&holder->data.lock,
+ owner ? &owner->data.lock : NULL);
+ bus1_handle_release_locked(handle, owner, strong);
+ bus1_mutex_unlock2(&holder->data.lock,
+ owner ? &owner->data.lock : NULL);
+
+ if (!is_anchor)
+ bus1_peer_release(owner);
+}
+
+/**
+ * bus1_handle_destroy_locked() - stage node destruction
+ * @handle: handle to destroy
+ * @tx: transaction to use
+ *
+ * This stages a destruction on @handle. That is, it marks @handle as destroyed
+ * and stages a destruction notification for all live handles via @tx. It is the
+ * responsibility of the caller to commit @tx.
+ *
+ * The given handle must be an anchor and not destroyed, yet. Furthermore, the
+ * caller must hold the local-lock and data-lock of the owner.
+ */
+void bus1_handle_destroy_locked(struct bus1_handle *handle, struct bus1_tx *tx)
+{
+ struct bus1_peer *owner = handle->holder;
+ struct bus1_handle *t, *safe;
+
+ if (WARN_ON(handle != handle->anchor || !owner))
+ return;
+
+ lockdep_assert_held(&owner->local.lock);
+ lockdep_assert_held(&owner->data.lock);
+
+ if (WARN_ON(test_and_set_bit(BUS1_HANDLE_BIT_DESTROYED,
+ &handle->node.flags)))
+ return;
+
+ /* flush release and reuse qnode for destruction */
+ if (bus1_queue_node_is_queued(&handle->qnode)) {
+ bus1_queue_remove(&owner->data.queue, &owner->waitq,
+ &handle->qnode);
+ bus1_handle_unref(handle);
+ }
+ bus1_queue_node_deinit(&handle->qnode);
+ bus1_queue_node_init(&handle->qnode, BUS1_MSG_NODE_DESTROY);
+
+ bus1_tx_stage_sync(tx, &handle->qnode);
+ bus1_handle_ref(handle);
+
+ /* collect all handles in the transaction */
+ rbtree_postorder_for_each_entry_safe(t, safe,
+ &handle->node.map_handles,
+ remote.rb_to_anchor) {
+ /*
+ * Bail if the qnode of the remote-handle was already used for
+ * a destruction notification.
+ */
+ if (WARN_ON(t->qnode.group))
+ continue;
+
+ /*
+ * We hold the owner-lock, so we cannot lock any other peer.
+ * Therefore, just acquire the peer and remember it on @tx. It
+ * will be staged just before @tx is committed.
+ * Note that this modifies the qnode of the remote only
+ * partially. Neither timestamps nor rb-links are modified.
+ */
+ t->qnode.owner = bus1_handle_acquire_holder(t);
+ if (t->qnode.owner) {
+ bus1_tx_stage_later(tx, &t->qnode);
+ bus1_handle_ref(t);
+ }
+ }
+}
+
+/**
+ * bus1_handle_is_live_at() - check whether handle is live at a given time
+ * @h: handle to check
+ * @timestamp: timestamp to check
+ *
+ * This checks whether the handle @h is live at the time of @timestamp. The
+ * caller must make sure that @timestamp was acquired on the clock of the
+ * holder of @h.
+ *
+ * Note that this does not synchronize on the node owner. That is, usually you
+ * want to call this at the time of RECV, so it is guaranteed that there is no
+ * staging message in front of @timestamp. Otherwise, a node owner might
+ * acquire a commit-timestamp for the destruction of @h lower than @timestamp.
+ *
+ * The caller must hold the data-lock of the holder of @h.
+ *
+ * Return: True if live at the given timestamp, false if destroyed.
+ */
+bool bus1_handle_is_live_at(struct bus1_handle *h, u64 timestamp)
+{
+ u64 ts;
+
+ WARN_ON(timestamp & 1);
+ lockdep_assert_held(&h->holder->data.lock);
+
+ if (!test_bit(BUS1_HANDLE_BIT_DESTROYED, &h->anchor->node.flags))
+ return true;
+
+ /*
+ * If BIT_DESTROYED is set, we know that the qnode can only be used for
+ * a destruction notification. Furthermore, we know that its timestamp
+ * is protected by the data-lock of the holder, so we can read it
+ * safely here.
+ * If the timestamp is not set, or staging, or higher than, or equal
+ * to, @timestamp, then the destruction cannot have been ordered before
+ * @timestamp, so the handle must be live.
+ */
+ ts = bus1_queue_node_get_timestamp(&h->qnode);
+ return (ts == 0) || (ts & 1) || (timestamp <= ts);
+}
+
+/**
+ * bus1_handle_import() - import handle
+ * @peer: peer to operate on
+ * @id: ID of handle
+ * @is_newp: store whether handle is new
+ *
+ * This searches the ID-namespace of @peer for a handle with the given ID. If
+ * found, it is referenced, returned to the caller, and @is_newp is set to
+ * false.
+ *
+ * If not found and @id is a remote ID, then an error is returned. But if it
+ * is a local ID, a new handle is created and placed in the lookup tree. In
+ * this case @is_newp is set to true.
+ *
+ * Return: Pointer to referenced handle on success, ERR_PTR on failure.
+ */
+struct bus1_handle *bus1_handle_import(struct bus1_peer *peer,
+ u64 id,
+ bool *is_newp)
+{
+ struct bus1_handle *h;
+ struct rb_node *n, **slot;
+
+ lockdep_assert_held(&peer->local.lock);
+
+ n = NULL;
+ slot = &peer->local.map_handles.rb_node;
+ while (*slot) {
+ n = *slot;
+ h = container_of(n, struct bus1_handle, rb_to_peer);
+ if (id < h->id) {
+ slot = &n->rb_left;
+ } else if (id > h->id) {
+ slot = &n->rb_right;
+ } else /* if (id == h->id) */ {
+ *is_newp = false;
+ return bus1_handle_ref(h);
+ }
+ }
+
+ if (id & (BUS1_HANDLE_FLAG_MANAGED | BUS1_HANDLE_FLAG_REMOTE))
+ return ERR_PTR(-ENXIO);
+
+ h = bus1_handle_new_anchor(peer);
+ if (IS_ERR(h))
+ return ERR_CAST(h);
+
+ h->id = id;
+ bus1_handle_ref(h);
+ rb_link_node(&h->rb_to_peer, n, slot);
+ rb_insert_color(&h->rb_to_peer, &peer->local.map_handles);
+
+ *is_newp = true;
+ return h;
+}
+
+/**
+ * bus1_handle_identify() - identify handle
+ * @h: handle to operate on
+ *
+ * This returns the ID of @h. If no ID was assigned, yet, a new one is picked.
+ *
+ * Return: The ID of @h is returned.
+ */
+u64 bus1_handle_identify(struct bus1_handle *h)
+{
+ WARN_ON(!h->holder);
+ lockdep_assert_held(&h->holder->local.lock);
+
+ if (h->id == BUS1_HANDLE_INVALID) {
+ h->id = ++h->holder->local.handle_ids << 3;
+ h->id |= BUS1_HANDLE_FLAG_MANAGED;
+ if (h != h->anchor)
+ h->id |= BUS1_HANDLE_FLAG_REMOTE;
+ }
+
+ return h->id;
+}
+
+/**
+ * bus1_handle_export() - export handle
+ * @handle: handle to operate on
+ *
+ * This exports @handle into the ID namespace of its holder. That is, if
+ * @handle is not linked into the ID namespace yet, it is linked into it.
+ *
+ * If @handle is already linked, nothing is done.
+ */
+void bus1_handle_export(struct bus1_handle *handle)
+{
+ struct bus1_handle *h;
+ struct rb_node *n, **slot;
+
+ /*
+ * The caller must own a weak reference to @handle when calling this.
+ * Hence, we know that its holder is valid. Also verify that the caller
+ * holds the required active reference and local lock.
+ */
+ WARN_ON(!handle->holder);
+ lockdep_assert_held(&handle->holder->local.lock);
+
+ if (RB_EMPTY_NODE(&handle->rb_to_peer)) {
+ bus1_handle_identify(handle);
+
+ n = NULL;
+ slot = &handle->holder->local.map_handles.rb_node;
+ while (*slot) {
+ n = *slot;
+ h = container_of(n, struct bus1_handle, rb_to_peer);
+ if (WARN_ON(handle->id == h->id))
+ return;
+ else if (handle->id < h->id)
+ slot = &n->rb_left;
+ else /* if (handle->id > h->id) */
+ slot = &n->rb_right;
+ }
+
+ bus1_handle_ref(handle);
+ rb_link_node(&handle->rb_to_peer, n, slot);
+ rb_insert_color(&handle->rb_to_peer,
+ &handle->holder->local.map_handles);
+ }
+}
+
+static void bus1_handle_forget_internal(struct bus1_handle *h, bool erase_rb)
+{
+ /*
+ * The passed handle might not have any weak references left. However,
+ * as long as it is linked into the ID-lookup tree, its holder field is
+ * valid, and the caller must hold the local lock of that holder.
+ */
+ lockdep_assert_held(&h->holder->local.lock);
+
+ if (bus1_handle_is_public(h) || RB_EMPTY_NODE(&h->rb_to_peer))
+ return;
+
+ if (erase_rb)
+ rb_erase(&h->rb_to_peer, &h->holder->local.map_handles);
+ RB_CLEAR_NODE(&h->rb_to_peer);
+ h->id = BUS1_HANDLE_INVALID;
+ bus1_handle_unref(h);
+}
+
+/**
+ * bus1_handle_forget() - forget handle
+ * @h: handle to operate on, or NULL
+ *
+ * If @h is not public, but linked into the ID-lookup tree, this will remove it
+ * from the tree and clear the ID of @h. It basically undoes what
+ * bus1_handle_import() and bus1_handle_export() do.
+ *
+ * Note that there is no counter in bus1_handle_import() or
+ * bus1_handle_export(). That is, if you call bus1_handle_import() multiple
+ * times, a single bus1_handle_forget() undoes it. It is the caller's
+ * responsibility not to release the local-lock at arbitrary points, and to properly
+ * detect cases where the same handle is used multiple times.
+ */
+void bus1_handle_forget(struct bus1_handle *h)
+{
+ if (h)
+ bus1_handle_forget_internal(h, true);
+}
+
+/**
+ * bus1_handle_forget_keep() - forget handle but keep rb-tree order
+ * @h: handle to operate on, or NULL
+ *
+ * This is like bus1_handle_forget(), but does not modify the ID-namespace
+ * rb-tree. That is, the backlink in @h is cleared (h->rb_to_peer), but the
+ * rb-tree is not rebalanced. As such, you can use it with
+ * rbtree_postorder_for_each_entry_safe() to drop all entries.
+ */
+void bus1_handle_forget_keep(struct bus1_handle *h)
+{
+ if (h)
+ bus1_handle_forget_internal(h, false);
+}
diff --git a/ipc/bus1/handle.h b/ipc/bus1/handle.h
new file mode 100644
index 0000000..9f01569
--- /dev/null
+++ b/ipc/bus1/handle.h
@@ -0,0 +1,312 @@
+#ifndef __BUS1_HANDLE_H
+#define __BUS1_HANDLE_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Handles
+ *
+ * The object system on a bus is based on 'nodes' and 'handles'. Any peer can
+ * allocate new, local objects at any time. The creator automatically becomes
+ * the sole owner of the object. References to objects can be passed as payload
+ * of messages. The recipient will then gain their own reference to the object
+ * as well. Additionally, an object can be the destination of a message, in
+ * which case the message is always sent to the original creator (and thus the
+ * owner) of the object.
+ *
+ * Internally, objects are called 'nodes'. A reference to an object is a
+ * 'handle'. Whenever a new node is created, the owner implicitly gains a
+ * handle as well. In fact, handles are the only way to refer to a node. The
+ * node itself is entirely hidden in the implementation, and visible in the API
+ * as an "anchor handle".
+ *
+ * Whenever a handle is passed as payload of a message, the target peer will
+ * gain a handle linked to the same underlying node. This works regardless
+ * of whether the sender is the owner of the underlying node, or not.
+ *
+ * Each peer can identify all its handles (both owned and un-owned) by a 64-bit
+ * integer. The namespace is local to each peer, and the numbers cannot be
+ * compared with the numbers of other peers (in fact, they are very likely
+ * to clash, but might still have *different* underlying nodes). However, if a
+ * peer receives a reference to the same node multiple times, the resulting
+ * handle will be the same. The kernel keeps count of how often each peer owns
+ * a handle.
+ *
+ * If a peer no longer requires a specific handle, it can release it. If the
+ * peer releases its last reference to a handle, the handle will be destroyed.
+ *
+ * The owner of a node (and *only* the owner) can trigger the destruction of a
+ * node (even if other peers still own handles to it). In this case, all peers
+ * that own a handle are notified of this fact.
+ * Once all handles to a specific node have been released (except for the handle
+ * internally pinned in the node itself), the owner of the node is notified of
+ * this, so it can potentially destroy both any linked state and the node
+ * itself.
+ *
+ * Node destruction is fully synchronized with any transaction. That is, a node
+ * and all its handles are valid in every message that is transmitted *before*
+ * the notification of its destruction. Furthermore, no message after this
+ * notification will carry the ID of such a destroyed node.
+ * Note that message transactions are asynchronous. That is, there is no unique
+ * point in time that a message is synchronized with another message. Hence,
+ * whether a specific handle passed with a message is still valid or not,
+ * cannot be predicted by the sender, but only by one of the receivers.
+ */
+
+#include <linux/atomic.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include <linux/rbtree.h>
+#include "util.h"
+#include "util/queue.h"
+
+struct bus1_peer;
+struct bus1_tx;
+
+/**
+ * enum bus1_handle_bits - node flags
+ * @BUS1_HANDLE_BIT_RELEASED: The anchor handle has been released.
+ * Any further attach operation will still
+ * work, but result in a stale attach,
+ * even in case of re-attach of the anchor
+ * itself.
+ * @BUS1_HANDLE_BIT_DESTROYED: A destruction has already been
+ * scheduled for this node.
+ */
+enum bus1_handle_bits {
+ BUS1_HANDLE_BIT_RELEASED,
+ BUS1_HANDLE_BIT_DESTROYED,
+};
+
+/**
+ * struct bus1_handle - object handle
+ * @ref: object reference counter
+ * @n_weak: number of weak references
+ * @n_user: number of user references
+ * @holder: holder of this handle
+ * @anchor: anchor handle
+ * @tlink: singly-linked list for free use
+ * @rb_to_peer: rb-link into peer by ID
+ * @id: current ID
+ * @qnode: queue node for notifications
+ * @node.map_handles: map of attached handles by peer
+ * @node.flags: node flags
+ * @node.n_strong: number of strong references
+ * @remote.rb_to_anchor: rb-link into node by peer
+ */
+struct bus1_handle {
+ struct kref ref;
+ atomic_t n_weak;
+ atomic_t n_user;
+ struct bus1_peer *holder;
+ struct bus1_handle *anchor;
+ struct bus1_handle *tlink;
+ struct rb_node rb_to_peer;
+ u64 id;
+ struct bus1_queue_node qnode;
+ union {
+ struct {
+ struct rb_root map_handles;
+ unsigned long flags;
+ atomic_t n_strong;
+ } node;
+ struct {
+ struct rb_node rb_to_anchor;
+ } remote;
+ };
+};
+
+struct bus1_handle *bus1_handle_new_anchor(struct bus1_peer *holder);
+struct bus1_handle *bus1_handle_new_remote(struct bus1_peer *holder,
+ struct bus1_handle *other);
+void bus1_handle_free(struct kref *ref);
+struct bus1_peer *bus1_handle_acquire_owner(struct bus1_handle *handle);
+
+struct bus1_handle *bus1_handle_ref_by_other(struct bus1_peer *peer,
+ struct bus1_handle *handle);
+
+struct bus1_handle *bus1_handle_acquire_slow(struct bus1_handle *handle,
+ bool strong);
+struct bus1_handle *bus1_handle_acquire_locked(struct bus1_handle *handle,
+ bool strong);
+void bus1_handle_release_slow(struct bus1_handle *h, bool strong);
+
+void bus1_handle_destroy_locked(struct bus1_handle *h, struct bus1_tx *tx);
+bool bus1_handle_is_live_at(struct bus1_handle *h, u64 timestamp);
+
+struct bus1_handle *bus1_handle_import(struct bus1_peer *peer,
+ u64 id,
+ bool *is_newp);
+u64 bus1_handle_identify(struct bus1_handle *h);
+void bus1_handle_export(struct bus1_handle *h);
+void bus1_handle_forget(struct bus1_handle *h);
+void bus1_handle_forget_keep(struct bus1_handle *h);
+
+/**
+ * bus1_handle_is_anchor() - check whether handle is an anchor
+ * @h: handle to check
+ *
+ * This checks whether @h is an anchor. That is, @h was created via
+ * bus1_handle_new_anchor(), rather than via bus1_handle_new_remote().
+ *
+ * Return: True if it is an anchor, false if not.
+ */
+static inline bool bus1_handle_is_anchor(struct bus1_handle *h)
+{
+ return h == h->anchor;
+}
+
+/**
+ * bus1_handle_is_live() - check whether handle is live
+ * @h: handle to check
+ *
+ * This checks whether the given handle is still live. That is, its anchor was
+ * not destroyed, yet.
+ *
+ * Return: True if it is live, false if already destroyed.
+ */
+static inline bool bus1_handle_is_live(struct bus1_handle *h)
+{
+ return !test_bit(BUS1_HANDLE_BIT_DESTROYED, &h->anchor->node.flags);
+}
+
+/**
+ * bus1_handle_is_public() - check whether handle is public
+ * @h: handle to check
+ *
+ * This checks whether the given handle is public. That is, it was exported to
+ * user-space and at least one public reference is left.
+ *
+ * Return: True if it is public, false if not.
+ */
+static inline bool bus1_handle_is_public(struct bus1_handle *h)
+{
+ return atomic_read(&h->n_user) > 0;
+}
+
+/**
+ * bus1_handle_ref() - acquire object reference
+ * @h: handle to operate on, or NULL
+ *
+ * This acquires an object reference to @h. The caller must already hold a
+ * reference. Otherwise, the behavior is undefined.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: @h is returned.
+ */
+static inline struct bus1_handle *bus1_handle_ref(struct bus1_handle *h)
+{
+ if (h)
+ kref_get(&h->ref);
+ return h;
+}
+
+/**
+ * bus1_handle_unref() - release object reference
+ * @h: handle to operate on, or NULL
+ *
+ * This releases an object reference. If the reference count drops to 0, the
+ * object is released (rcu-delayed).
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+static inline struct bus1_handle *bus1_handle_unref(struct bus1_handle *h)
+{
+ if (h)
+ kref_put(&h->ref, bus1_handle_free);
+ return NULL;
+}
+
+/**
+ * bus1_handle_acquire() - acquire weak/strong reference
+ * @h: handle to operate on, or NULL
+ * @strong: whether to acquire a strong reference
+ *
+ * This acquires a weak/strong reference to the node @h is attached to.
+ * This always succeeds. However, if a conflict is detected, @h is
+ * unreferenced and the conflicting handle is returned (with an object
+ * reference taken and strong reference acquired).
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: Pointer to the acquired handle is returned.
+ */
+static inline struct bus1_handle *
+bus1_handle_acquire(struct bus1_handle *h,
+ bool strong)
+{
+ if (h) {
+ if (bus1_atomic_add_if_ge(&h->n_weak, 1, 1) < 1) {
+ h = bus1_handle_acquire_slow(h, strong);
+ } else if (bus1_atomic_add_if_ge(&h->anchor->node.n_strong,
+ 1, 1) < 1) {
+ WARN_ON(h != bus1_handle_acquire_slow(h, strong));
+ WARN_ON(atomic_dec_return(&h->n_weak) < 1);
+ }
+ }
+ return h;
+}
+
+/**
+ * bus1_handle_release() - release weak/strong reference
+ * @h: handle to operate on, or NULL
+ * @strong: whether to release a strong reference
+ *
+ * This releases a weak or strong reference to the node @h is attached to.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+static inline struct bus1_handle *
+bus1_handle_release(struct bus1_handle *h, bool strong)
+{
+ if (h) {
+ if (strong &&
+ bus1_atomic_add_if_ge(&h->anchor->node.n_strong, -1, 2) < 2)
+ bus1_handle_release_slow(h, true);
+ else if (bus1_atomic_add_if_ge(&h->n_weak, -1, 2) < 2)
+ bus1_handle_release_slow(h, false);
+ }
+ return NULL;
+}
+
+/**
+ * bus1_handle_release_n() - release multiple references
+ * @h: handle to operate on, or NULL
+ * @n: number of references to release
+ * @strong: whether to release strong references
+ *
+ * This releases @n weak or strong references to the node @h is attached to.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+static inline struct bus1_handle *
+bus1_handle_release_n(struct bus1_handle *h, unsigned int n, bool strong)
+{
+ if (h && n > 0) {
+ if (n > 1) {
+ if (strong)
+ WARN_ON(atomic_sub_return(n - 1,
+ &h->anchor->node.n_strong) < 1);
+ WARN_ON(atomic_sub_return(n - 1, &h->n_weak) < 1);
+ }
+ bus1_handle_release(h, strong);
+ }
+ return NULL;
+}
+
+#endif /* __BUS1_HANDLE_H */
diff --git a/ipc/bus1/peer.c b/ipc/bus1/peer.c
index 3421f8c..a1525cb 100644
--- a/ipc/bus1/peer.c
+++ b/ipc/bus1/peer.c
@@ -74,6 +74,8 @@ struct bus1_peer *bus1_peer_new(void)

/* initialize peer-private section */
mutex_init(&peer->local.lock);
+ peer->local.map_handles = RB_ROOT;
+ peer->local.handle_ids = 0;

if (!IS_ERR_OR_NULL(bus1_debugdir)) {
char idstr[22];
@@ -129,6 +131,7 @@ struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)
bus1_peer_disconnect(peer);

/* deinitialize peer-private section */
+ WARN_ON(!RB_EMPTY_ROOT(&peer->local.map_handles));
mutex_destroy(&peer->local.lock);

/* deinitialize data section */
diff --git a/ipc/bus1/peer.h b/ipc/bus1/peer.h
index 149ddf6..655d3ac 100644
--- a/ipc/bus1/peer.h
+++ b/ipc/bus1/peer.h
@@ -93,6 +93,8 @@ struct bus1_peer {

struct {
struct mutex lock;
+ struct rb_root map_handles;
+ u64 handle_ids;
} local;
};

diff --git a/ipc/bus1/util.h b/ipc/bus1/util.h
index b9f9e8d..c22ecd5 100644
--- a/ipc/bus1/util.h
+++ b/ipc/bus1/util.h
@@ -27,6 +27,16 @@

struct dentry;

+/**
+ * BUS1_TAIL - tail pointer in singly-linked lists
+ *
+ * Several places of bus1 use singly-linked lists. Usually, the tail pointer is
+ * simply set to NULL. However, sometimes we need to be able to detect whether
+ * a node is linked in O(1). For that we set the tail pointer to BUS1_TAIL
+ * rather than NULL.
+ */
+#define BUS1_TAIL ERR_PTR(-1)
+
#if defined(CONFIG_DEBUG_FS)

struct dentry *
@@ -48,4 +58,77 @@ bus1_debugfs_create_atomic_x(const char *name,

#endif

+/**
+ * bus1_atomic_add_if_ge() - add, if above threshold
+ * @a: atomic_t to operate on
+ * @add: value to add
+ * @t: threshold
+ *
+ * Atomically add @add to @a, if @a is greater than, or equal to, @t.
+ *
+ * If [a + add] triggers an overflow, the operation is undefined. The caller
+ * must verify that this cannot happen.
+ *
+ * Return: The old value of @a is returned.
+ */
+static inline int bus1_atomic_add_if_ge(atomic_t *a, int add, int t)
+{
+ int v, v1;
+
+ for (v = atomic_read(a); v >= t; v = v1) {
+ v1 = atomic_cmpxchg(a, v, v + add);
+ if (likely(v1 == v))
+ return v;
+ }
+
+ return v;
+}
+
+/**
+ * bus1_mutex_lock2() - lock two mutexes of the same class
+ * @a: first mutex, or NULL
+ * @b: second mutex, or NULL
+ *
+ * This locks both mutexes @a and @b. They are taken in the order of their
+ * memory addresses, which allows taking two mutexes of the same lock class
+ * at the same time.
+ *
+ * It is valid to pass the same mutex as @a and @b, in which case it is only
+ * locked once.
+ *
+ * Use bus1_mutex_unlock2() to exit the critical section.
+ */
+static inline void bus1_mutex_lock2(struct mutex *a, struct mutex *b)
+{
+ if (a < b) {
+ if (a)
+ mutex_lock(a);
+ if (b && b != a)
+ mutex_lock_nested(b, !!a);
+ } else {
+ if (b)
+ mutex_lock(b);
+ if (a && a != b)
+ mutex_lock_nested(a, !!b);
+ }
+}
+
+/**
+ * bus1_mutex_unlock2() - unlock two mutexes of the same class
+ * @a: first mutex, or NULL
+ * @b: second mutex, or NULL
+ *
+ * Unlock both mutexes @a and @b. If they point to the same mutex, it is only
+ * unlocked once.
+ *
+ * Usually used in combination with bus1_mutex_lock2().
+ */
+static inline void bus1_mutex_unlock2(struct mutex *a, struct mutex *b)
+{
+ if (a)
+ mutex_unlock(a);
+ if (b && b != a)
+ mutex_unlock(b);
+}
+
#endif /* __BUS1_UTIL_H */
--
2.10.1

2016-10-26 19:24:06

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 09/14] bus1: provide transaction context for multicasts

From: Tom Gundersen <[email protected]>

The transaction engine is an object that lives on the stack and is used
to stage and commit multicasts properly. Unlike unicasts, a multicast
cannot just be queued on each destination, but must be properly
synchronized. This requires us to first stage each message on its
respective destination, then sync and tick the clocks, and eventually
commit all messages.

The transaction context implements this logic for both unicasts and
multicasts. It hides the timestamp handling and takes care to properly
synchronize accesses to the peer queues.
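
A rough usage sketch of the engine follows, assuming the caller has pinned the
origin and every destination (acquired active references) and prepared one
queue node per destination; error handling is omitted and the variable names
are placeholders, not part of the patch:

    struct bus1_tx tx;
    size_t i;

    bus1_tx_init(&tx, origin);

    for (i = 0; i < n_dst; ++i) {
            /* dst[i] must be acquired; the commit below releases it */
            qnode[i]->owner = dst[i];

            mutex_lock(&dst[i]->data.lock);
            bus1_tx_stage_sync(&tx, qnode[i]);
            mutex_unlock(&dst[i]->data.lock);
    }

    /* syncs all destination clocks, then commits every staged entry */
    bus1_tx_commit(&tx);
    bus1_tx_deinit(&tx);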

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 1 +
ipc/bus1/peer.c | 2 +
ipc/bus1/peer.h | 3 +
ipc/bus1/tx.c | 360 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/tx.h | 102 ++++++++++++++++
5 files changed, 468 insertions(+)
create mode 100644 ipc/bus1/tx.c
create mode 100644 ipc/bus1/tx.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index c689917..e3c7dd7 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,6 +1,7 @@
bus1-y := \
main.o \
peer.o \
+ tx.o \
user.o \
util.o \
util/active.o \
diff --git a/ipc/bus1/peer.c b/ipc/bus1/peer.c
index a6fbca01..3421f8c 100644
--- a/ipc/bus1/peer.c
+++ b/ipc/bus1/peer.c
@@ -70,6 +70,7 @@ struct bus1_peer *bus1_peer_new(void)

/* initialize data section */
mutex_init(&peer->data.lock);
+ bus1_queue_init(&peer->data.queue);

/* initialize peer-private section */
mutex_init(&peer->local.lock);
@@ -131,6 +132,7 @@ struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)
mutex_destroy(&peer->local.lock);

/* deinitialize data section */
+ bus1_queue_deinit(&peer->data.queue);
mutex_destroy(&peer->data.lock);

/* deinitialize constant fields */
diff --git a/ipc/bus1/peer.h b/ipc/bus1/peer.h
index 277fcf8..149ddf6 100644
--- a/ipc/bus1/peer.h
+++ b/ipc/bus1/peer.h
@@ -54,6 +54,7 @@
#include <linux/wait.h>
#include "user.h"
#include "util/active.h"
+#include "util/queue.h"

struct cred;
struct dentry;
@@ -71,6 +72,7 @@ struct pid_namespace;
* @active: active references
* @debugdir: debugfs root of this peer, or NULL/ERR_PTR
* @data.lock: data lock
+ * @data.queue: message queue
* @local.lock: local peer runtime lock
*/
struct bus1_peer {
@@ -86,6 +88,7 @@ struct bus1_peer {

struct {
struct mutex lock;
+ struct bus1_queue queue;
} data;

struct {
diff --git a/ipc/bus1/tx.c b/ipc/bus1/tx.c
new file mode 100644
index 0000000..6ff8949
--- /dev/null
+++ b/ipc/bus1/tx.c
@@ -0,0 +1,360 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#include <linux/bitops.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include "peer.h"
+#include "tx.h"
+#include "util/active.h"
+#include "util/queue.h"
+
+static void bus1_tx_push(struct bus1_tx *tx,
+ struct bus1_queue_node **list,
+ struct bus1_queue_node *qnode)
+{
+ struct bus1_peer *peer = qnode->owner;
+
+ /*
+ * Push @qnode onto one of the lists in @tx (specified as @list). Note
+ * that each list has different locking/ordering requirements, which
+ * the caller has to verify. This helper does not check them.
+ *
+ * Whenever something is pushed on a list, we make sure it has the tx
+ * set as group. Furthermore, we tell lockdep that its peer was
+ * released. This is required to allow holding hundreds of peers in a
+ * multicast without exceeding the lockdep limits of allowed locks held
+ * in parallel.
+ * Note that pushing a qnode on a list consumes the qnode together with
+ * its set owner. The caller must not access it, except by popping it
+ * from the list or using one of the internal list-iterators. In other
+ * words, we say that a caller must be aware of lockdep limitations
+ * whenever they hold an unlimited number of peers. However, if they
+ * make sure they only ever hold a fixed number, but use transaction
+ * lists to stash them, the transaction lists make sure to properly
+ * avoid lockdep limitations.
+ */
+
+ WARN_ON(qnode->group && tx != qnode->group);
+ WARN_ON(qnode->next || qnode == *list);
+
+ qnode->group = tx;
+ qnode->next = *list;
+ *list = qnode;
+
+ if (peer)
+ bus1_active_lockdep_released(&peer->active);
+}
+
+static struct bus1_queue_node *
+bus1_tx_pop(struct bus1_tx *tx, struct bus1_queue_node **list)
+{
+ struct bus1_queue_node *qnode = *list;
+ struct bus1_peer *peer;
+
+ /*
+ * This pops the first entry off a list on a transaction. Different
+ * lists have different locking requirements. This helper does not
+ * validate the context.
+ *
+ * Note that we need to tell lockdep about the acquired peer when
+ * returning the qnode. See bus1_tx_push() for details.
+ */
+
+ if (qnode) {
+ *list = qnode->next;
+ qnode->next = NULL;
+ peer = qnode->owner;
+ if (peer)
+ bus1_active_lockdep_acquired(&peer->active);
+ }
+
+ return qnode;
+}
+
+/*
+ * This starts an iterator for a singly-linked list with head-elements given as
+ * @list. @iter is filled with the first element, and its *acquired* peer is
+ * returned. You *must* call bus1_tx_next() on @iter, otherwise you will run
+ * into lockdep-ref-leaks. IOW: don't bail out of your loop with 'break'.
+ *
+ * It is supposed to be used like this:
+ *
+ * for (peer = bus1_tx_first(tx, &tx->foo, &qnode);
+ * qnode;
+ * peer = bus1_tx_next(tx, &qnode))
+ * bar();
+ */
+static struct bus1_peer *bus1_tx_first(struct bus1_tx *tx,
+ struct bus1_queue_node *list,
+ struct bus1_queue_node **iter)
+{
+ struct bus1_peer *peer;
+
+ if ((*iter = list)) {
+ peer = list->owner;
+ if (!peer)
+ return tx->origin;
+
+ bus1_active_lockdep_acquired(&peer->active);
+ return peer;
+ }
+
+ return NULL;
+}
+
+/*
+ * This continues an iteration of a singly-linked list started via
+ * bus1_tx_first(). It returns the same information (see it for details).
+ */
+static struct bus1_peer *bus1_tx_next(struct bus1_tx *tx,
+ struct bus1_queue_node **iter)
+{
+ struct bus1_queue_node *qnode = *iter;
+ struct bus1_peer *peer = qnode->owner;
+
+ if (peer)
+ bus1_active_lockdep_released(&peer->active);
+
+ return bus1_tx_first(tx, qnode->next, iter);
+}
+
+static void bus1_tx_stage(struct bus1_tx *tx,
+ struct bus1_queue_node *qnode,
+ struct bus1_queue_node **list,
+ u64 *timestamp)
+{
+ struct bus1_peer *peer = qnode->owner ?: tx->origin;
+
+ WARN_ON(test_bit(BUS1_TX_BIT_SEALED, &tx->flags));
+ WARN_ON(bus1_queue_node_is_queued(qnode));
+ lockdep_assert_held(&peer->data.lock);
+
+ bus1_tx_push(tx, list, qnode);
+ *timestamp = bus1_queue_stage(&peer->data.queue, qnode, *timestamp);
+}
+
+/**
+ * bus1_tx_stage_sync() - stage message
+ * @tx: transaction to operate on
+ * @qnode: message to stage
+ *
+ * This stages @qnode on the transaction @tx. It is an error to call this on a
+ * qnode that is already staged. The caller must set qnode->owner to the
+ * destination peer and acquire it. If it is NULL, it is assumed to be the same
+ * as the origin of the transaction.
+ *
+ * The caller must hold the data-lock of the destination peer.
+ *
+ * This consumes @qnode. The caller must increment the required reference
+ * counts to make sure @qnode does not vanish.
+ */
+void bus1_tx_stage_sync(struct bus1_tx *tx, struct bus1_queue_node *qnode)
+{
+ bus1_tx_stage(tx, qnode, &tx->sync, &tx->timestamp);
+}
+
+/**
+ * bus1_tx_stage_later() - postpone message
+ * @tx: transaction to operate on
+ * @qnode: message to postpone
+ *
+ * This queues @qnode on @tx, but does not stage it. It will be staged just
+ * before the transaction is committed. This can be used over
+ * bus1_tx_stage_sync() if no immediate staging is necessary, or if required
+ * locks cannot be taken.
+ *
+ * It is a caller-error if @qnode is already part of a transaction.
+ */
+void bus1_tx_stage_later(struct bus1_tx *tx, struct bus1_queue_node *qnode)
+{
+ bus1_tx_push(tx, &tx->postponed, qnode);
+}
+
+/**
+ * bus1_tx_join() - HIC SUNT DRACONES!
+ * @whom: whom to join
+ * @qnode: who joins
+ *
+ * This makes @qnode join the on-going transaction of @whom. That is, it is
+ * semantically equivalent of calling:
+ *
+ * bus1_tx_stage_sync(whom->group, qnode);
+ *
+ * However, you can only dereference whom->group while it is still ongoing.
+ * Once committed, it might be a stale pointer. This function safely checks for
+ * the required conditions and bails out if too late.
+ *
+ * The caller must hold the data locks of both peers (target of @whom and
+ * @qnode). @qnode->owner must not be NULL! Furthermore, @qnode must not be
+ * staged into any transaction, yet.
+ *
+ * In general, this function is not what you want. There is no guarantee that
+ * you can join the transaction, hence a failed join (false return) must be expected
+ * by the caller and handled gracefully. In that case, this function guarantees
+ * that the clock of the holder of @qnode is synced with the transaction of
+ * @whom, and as such is correctly ordered against the transaction.
+ *
+ * If this function returns "false", you must settle on the transaction before
+ * visibly reacting to it. That is, user-space must not see that you failed to
+ * join the transaction before the transaction is settled!
+ *
+ * Return: True if successful, false if too late.
+ */
+bool bus1_tx_join(struct bus1_queue_node *whom, struct bus1_queue_node *qnode)
+{
+ struct bus1_peer *peer = qnode->owner;
+ struct bus1_tx *tx;
+ u64 timestamp;
+
+ WARN_ON(!peer);
+ WARN_ON(qnode->group);
+ lockdep_assert_held(&peer->data.lock);
+
+ if (bus1_queue_node_is_staging(whom)) {
+ /*
+ * The anchor we want to join is marked as staging. We know its
+ * holder is locked by the caller, hence we know that its
+ * transaction must still be ongoing and at some point commit
+ * @whom (blocking on the lock we currently hold). This means,
+ * we are allowed to dereference @whom->group safely.
+ * Now, if the transaction has not yet acquired a commit
+ * timestamp, we simply stage @qnode and asynchronously join
+ * the transaction. But if the transaction is already sealed,
+ * we cannot join anymore. Hence, we instead copy the timestamp
+ * for our fallback.
+ */
+ WARN_ON(!(tx = whom->group));
+ lockdep_assert_held(&tx->origin->data.lock);
+
+ if (!test_bit(BUS1_TX_BIT_SEALED, &tx->flags)) {
+ bus1_tx_stage(tx, qnode, &tx->async, &tx->async_ts);
+ return true;
+ }
+
+ timestamp = tx->timestamp;
+ } else {
+ /*
+ * The anchor to join is not marked as staging, hence we cannot
+ * dereference its transaction (the stack-frame might be gone
+ * already). Instead, we just copy the timestamp and try our
+ * fallback below.
+ */
+ timestamp = bus1_queue_node_get_timestamp(whom);
+ }
+
+ /*
+ * The transaction of @whom has already acquired a commit timestamp.
+ * Hence, we cannot join the transaction. However, we can try to inject
+ * a synthetic entry into the queue of @peer. All we must make sure is
+ * that there is at least one entry ordered in front of it. Hence, we
+ * use bus1_queue_commit_synthetic(). If this synthetic entry would be
+ * the new front, the commit fails. This is, because we cannot know
+ * whether this peer already dequeued something to-be-ordered after
+ * this fake entry.
+ * In the case that the insertion fails, we make sure to have synced
+ * its clock before. This guarantees that any further actions of this
+ * peer are guaranteed to be ordered after the transaction to join.
+ */
+ qnode->group = whom->group;
+ bus1_queue_sync(&peer->data.queue, timestamp);
+ return bus1_queue_commit_synthetic(&peer->data.queue, qnode, timestamp);
+}
+
+/**
+ * bus1_tx_commit() - commit transaction
+ * @tx: transaction to operate on
+ *
+ * Commit a transaction. First all postponed entries are staged, then we commit
+ * all messages that belong to this transaction. This works with any number of
+ * messages.
+ *
+ * Return: This returns the commit timestamp used.
+ */
+u64 bus1_tx_commit(struct bus1_tx *tx)
+{
+ struct bus1_queue_node *qnode, **tail;
+ struct bus1_peer *peer, *origin = tx->origin;
+
+ if (WARN_ON(test_bit(BUS1_TX_BIT_SEALED, &tx->flags)))
+ return tx->timestamp;
+
+ /*
+ * Stage Round
+ * Callers can stage messages manually via bus1_tx_stage_*(). However,
+ * if they cannot lock the destination queue for whatever reason, we
+ * support postponing it. In that case, it is linked into tx->postponed
+ * and we stage it here for them.
+ */
+ while ((qnode = bus1_tx_pop(tx, &tx->postponed))) {
+ peer = qnode->owner ?: tx->origin;
+
+ mutex_lock(&peer->data.lock);
+ bus1_tx_stage_sync(tx, qnode);
+ mutex_unlock(&peer->data.lock);
+ }
+
+ /*
+ * Acquire Commit TS
+ * Now that everything is staged, we atomically acquire a commit
+ * timestamp from the transaction origin. We store it on the
+ * transaction, so async joins are still possible. We also seal the
+ * transaction at the same time, to prevent async stages.
+ */
+ mutex_lock(&origin->data.lock);
+ bus1_queue_sync(&origin->data.queue, max(tx->timestamp, tx->async_ts));
+ tx->timestamp = bus1_queue_tick(&origin->data.queue);
+ WARN_ON(test_and_set_bit(BUS1_TX_BIT_SEALED, &tx->flags));
+ mutex_unlock(&origin->data.lock);
+
+ /*
+ * Sync Round
+ * Before any effect of this transaction is visible, we must make sure
+ * to sync all clocks. This guarantees that the first receiver of the
+ * message cannot (via side-channels) induce messages into the queue of
+ * the other receivers, before they get the message as well.
+ */
+ tail = &tx->sync;
+ do {
+ for (peer = bus1_tx_first(tx, *tail, &qnode);
+ qnode;
+ peer = bus1_tx_next(tx, &qnode)) {
+ tail = &qnode->next;
+
+ mutex_lock(&peer->data.lock);
+ bus1_queue_sync(&peer->data.queue, tx->timestamp);
+ mutex_unlock(&peer->data.lock);
+ }
+
+ /* append async-list to the tail of the previous list */
+ *tail = tx->async;
+ tx->async = NULL;
+ } while (*tail);
+
+ /*
+ * Commit Round
+ * Now that everything is staged and the clocks synced, we can finally
+ * commit all the messages on their respective queues. Iterate over
+ * each message again, commit it, and release the pinned destination.
+ */
+ while ((qnode = bus1_tx_pop(tx, &tx->sync))) {
+ peer = qnode->owner ?: tx->origin;
+
+ mutex_lock(&peer->data.lock);
+ bus1_queue_commit_staged(&peer->data.queue, &peer->waitq,
+ qnode, tx->timestamp);
+ mutex_unlock(&peer->data.lock);
+
+ bus1_peer_release(qnode->owner);
+ }
+
+ return tx->timestamp;
+}
diff --git a/ipc/bus1/tx.h b/ipc/bus1/tx.h
new file mode 100644
index 0000000..a057df4
--- /dev/null
+++ b/ipc/bus1/tx.h
@@ -0,0 +1,102 @@
+#ifndef __BUS1_TX_H
+#define __BUS1_TX_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Transactions
+ *
+ * The transaction engine is an object that lives on the stack and is used to
+ * stage and commit multicasts properly. Unlike unicasts, a multicast cannot
+ * just be queued on each destination, but must be properly synchronized. This
+ * requires us to first stage each message on its respective destination,
+ * then sync and tick the clocks, and eventually commit all messages.
+ */
+
+#include <linux/err.h>
+#include <linux/kernel.h>
+
+struct bus1_peer;
+struct bus1_queue_node;
+
+/**
+ * enum bus1_tx_bits - transaction flags
+ * @BUS1_TX_BIT_SEALED: The transaction is sealed, no new messages can
+ * be added to the transaction. The commit of all
+ * staged messages is ongoing.
+ */
+enum bus1_tx_bits {
+ BUS1_TX_BIT_SEALED,
+};
+
+/**
+ * struct bus1_tx - transaction context
+ * @origin: origin of this transaction
+ * @sync: unlocked list of staged messages
+ * @async: locked list of staged messages
+ * @postponed: unlocked list of unstaged messages
+ * @flags: transaction flags
+ * @timestamp: unlocked timestamp of this transaction
+ * @async_ts: locked timestamp cache of async list
+ */
+struct bus1_tx {
+ struct bus1_peer *origin;
+ struct bus1_queue_node *sync;
+ struct bus1_queue_node *async;
+ struct bus1_queue_node *postponed;
+ unsigned long flags;
+ u64 timestamp;
+ u64 async_ts;
+};
+
+void bus1_tx_stage_sync(struct bus1_tx *tx, struct bus1_queue_node *qnode);
+void bus1_tx_stage_later(struct bus1_tx *tx, struct bus1_queue_node *qnode);
+
+bool bus1_tx_join(struct bus1_queue_node *whom, struct bus1_queue_node *qnode);
+
+u64 bus1_tx_commit(struct bus1_tx *tx);
+
+/**
+ * bus1_tx_init() - initialize transaction context
+ * @tx: transaction context to operate on
+ * @origin: origin of this transaction
+ *
+ * This initializes a transaction context. The initiating peer must be pinned
+ * by the caller for the entire lifetime of @tx (until bus1_tx_deinit() is
+ * called) and given as @origin.
+ */
+static inline void bus1_tx_init(struct bus1_tx *tx, struct bus1_peer *origin)
+{
+ tx->origin = origin;
+ tx->sync = NULL;
+ tx->async = NULL;
+ tx->postponed = NULL;
+ tx->flags = 0;
+ tx->timestamp = 0;
+ tx->async_ts = 0;
+}
+
+/**
+ * bus1_tx_deinit() - deinitialize transaction context
+ * @tx: transaction context to operate on
+ *
+ * This deinitializes a transaction context previously created via
+ * bus1_tx_init(). This is merely for debugging, as no resources are pinned on
+ * the transaction. However, if any message was staged on the transaction, it
+ * must be committed via bus1_tx_commit() before it is deinitialized.
+ */
+static inline void bus1_tx_deinit(struct bus1_tx *tx)
+{
+ WARN_ON(tx->sync);
+ WARN_ON(tx->async);
+ WARN_ON(tx->postponed);
+}
+
+#endif /* __BUS1_TX_H */
--
2.10.1

2016-10-26 19:24:26

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 08/14] bus1: implement peer management context

From: Tom Gundersen <[email protected]>

A peer context provides access to the bus1 system. A peer itself is not
a routable entity, but rather only a local anchor to serve as gateway to
the bus. To participate on the bus, you need to allocate a peer. This
peer manages all your state on the bus, including all allocated nodes,
owned handles, incoming messages, and more.

A peer is split into 3 sections:
- A static section that is initialized at peer creation and never
changes
- A peer-local section that is only ever accessed by ioctls done by
the peer itself.
- A data section that might be accessed by remote peers when
interacting with this peer.

All peers on the system operate on the same level. There is no context
a peer is linked into. Hence, you can never lock multiple peers at the
same time. Instead, peers provide active-references. Before performing
an operation on a peer, an active reference must be acquired and held
for as long as the operation is ongoing. When done, the reference is
released again. When a peer is disconnected, no more active references
can be acquired, and any outstanding operation is waited for before the
peer is destroyed.

In addition to active-references, there are 2 locks: a peer-local lock
and a data lock. The peer-local lock is used to synchronize operations
done by the peer itself. It is never acquired by a remote peer. The
data lock protects the data of the peer, which might be modified by
remote peers. The data lock nests underneath the local lock.
Furthermore, the data-lock critical sections must be kept small and
must never block indefinitely. Remote peers might wait for data-locks,
hence they must rely on not being DoSed. The local peer lock, however,
is private to the peer itself. No such restrictions apply. It is mostly
used to give the impression of atomic operations (i.e., making the API
appear consistent and coherent).

This only adds the peer context; the ioctls will be implemented in
follow-up patches.
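
As a rough sketch of how a follow-up ioctl is expected to use this
context (the operation itself is hypothetical; only the acquire/release
and lock-nesting pattern is defined by this patch):

    static long bus1_peer_op_example(struct bus1_peer *peer)
    {
            long r;

            /* pin the peer; fails once the peer was disconnected */
            if (!bus1_peer_acquire(peer))
                    return -ESHUTDOWN;

            /* peer-local work nests the local lock above the data lock */
            mutex_lock(&peer->local.lock);
            mutex_lock(&peer->data.lock);
            r = 0; /* the actual operation would go here */
            mutex_unlock(&peer->data.lock);
            mutex_unlock(&peer->local.lock);

            bus1_peer_release(peer);
            return r;
    }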

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 2 +
ipc/bus1/main.c | 17 +++++++
ipc/bus1/main.h | 14 ++++++
ipc/bus1/peer.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/peer.h | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/util.c | 52 +++++++++++++++++++
ipc/bus1/util.h | 51 +++++++++++++++++++
7 files changed, 427 insertions(+)
create mode 100644 ipc/bus1/peer.c
create mode 100644 ipc/bus1/peer.h
create mode 100644 ipc/bus1/util.c
create mode 100644 ipc/bus1/util.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index 94d79e0..c689917 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,6 +1,8 @@
bus1-y := \
main.o \
+ peer.o \
user.o \
+ util.o \
util/active.o \
util/flist.o \
util/pool.o \
diff --git a/ipc/bus1/main.c b/ipc/bus1/main.c
index 526347d..51034f3 100644
--- a/ipc/bus1/main.c
+++ b/ipc/bus1/main.c
@@ -15,24 +15,41 @@
#include <linux/miscdevice.h>
#include <linux/module.h>
#include "main.h"
+#include "peer.h"
#include "tests.h"
#include "user.h"

static int bus1_fop_open(struct inode *inode, struct file *file)
{
+ struct bus1_peer *peer;
+
+ peer = bus1_peer_new();
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ file->private_data = peer;
return 0;
}

static int bus1_fop_release(struct inode *inode, struct file *file)
{
+ bus1_peer_free(file->private_data);
return 0;
}

+static void bus1_fop_show_fdinfo(struct seq_file *m, struct file *file)
+{
+ struct bus1_peer *peer = file->private_data;
+
+ seq_printf(m, KBUILD_MODNAME "-peer:\t%16llx\n", peer->id);
+}
+
const struct file_operations bus1_fops = {
.owner = THIS_MODULE,
.open = bus1_fop_open,
.release = bus1_fop_release,
.llseek = noop_llseek,
+ .show_fdinfo = bus1_fop_show_fdinfo,
};

static struct miscdevice bus1_misc = {
diff --git a/ipc/bus1/main.h b/ipc/bus1/main.h
index 76fce66..dd319d9 100644
--- a/ipc/bus1/main.h
+++ b/ipc/bus1/main.h
@@ -49,6 +49,20 @@
* ordered, including unicasts, multicasts, and notifications.
*/

+/**
+ * Locking
+ *
+ * Most of the bus1 objects form a hierarchy; as such, their locks must be
+ * ordered. Not all orderings are explicitly defined (e.g., they might form
+ * orthogonal hierarchies), but this list gives a rough overview:
+ *
+ * bus1_peer.active
+ * bus1_peer.local.lock
+ * bus1_peer.data.lock
+ * bus1_user.lock
+ * bus1_user_lock
+ */
+
struct dentry;
struct file_operations;

diff --git a/ipc/bus1/peer.c b/ipc/bus1/peer.c
new file mode 100644
index 0000000..a6fbca01
--- /dev/null
+++ b/ipc/bus1/peer.c
@@ -0,0 +1,145 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/atomic.h>
+#include <linux/cred.h>
+#include <linux/debugfs.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/pid_namespace.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/uio.h>
+#include <linux/wait.h>
+#include "main.h"
+#include "peer.h"
+#include "user.h"
+#include "util.h"
+#include "util/active.h"
+
+/**
+ * bus1_peer_new() - allocate new peer
+ *
+ * Allocate a new peer. It is immediately activated and ready for use. It is
+ * not linked into any context. The caller will get exclusive access to the
+ * peer object on success.
+ *
+ * Note that the peer is opened on behalf of 'current'. That is, it pins its
+ * credentials and namespaces.
+ *
+ * Return: Pointer to peer, ERR_PTR on failure.
+ */
+struct bus1_peer *bus1_peer_new(void)
+{
+ static atomic64_t peer_ids = ATOMIC64_INIT(0);
+ const struct cred *cred = current_cred();
+ struct bus1_peer *peer;
+ struct bus1_user *user;
+
+ user = bus1_user_ref_by_uid(cred->uid);
+ if (IS_ERR(user))
+ return ERR_CAST(user);
+
+ peer = kmalloc(sizeof(*peer), GFP_KERNEL);
+ if (!peer) {
+ bus1_user_unref(user);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* initialize constant fields */
+ peer->id = atomic64_inc_return(&peer_ids);
+ peer->flags = 0;
+ peer->cred = get_cred(current_cred());
+ peer->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ peer->user = user;
+ peer->debugdir = NULL;
+ init_waitqueue_head(&peer->waitq);
+ bus1_active_init(&peer->active);
+
+ /* initialize data section */
+ mutex_init(&peer->data.lock);
+
+ /* initialize peer-private section */
+ mutex_init(&peer->local.lock);
+
+ if (!IS_ERR_OR_NULL(bus1_debugdir)) {
+ char idstr[22];
+
+ snprintf(idstr, sizeof(idstr), "peer-%llx", peer->id);
+
+ peer->debugdir = debugfs_create_dir(idstr, bus1_debugdir);
+ if (!peer->debugdir) {
+ pr_err("cannot create debugfs dir for peer %llx\n",
+ peer->id);
+ } else if (!IS_ERR_OR_NULL(peer->debugdir)) {
+ bus1_debugfs_create_atomic_x("active", S_IRUGO,
+ peer->debugdir,
+ &peer->active.count);
+ }
+ }
+
+ bus1_active_activate(&peer->active);
+ return peer;
+}
+
+static int bus1_peer_disconnect(struct bus1_peer *peer)
+{
+ bus1_active_deactivate(&peer->active);
+ bus1_active_drain(&peer->active, &peer->waitq);
+
+ if (!bus1_active_cleanup(&peer->active, &peer->waitq,
+ NULL, NULL))
+ return -ESHUTDOWN;
+
+ return 0;
+}
+
+/**
+ * bus1_peer_free() - destroy peer
+ * @peer: peer to destroy, or NULL
+ *
+ * Destroy a peer object that was previously allocated via bus1_peer_new().
+ * This synchronously waits for any outstanding operations on this peer to
+ * finish, then releases all linked resources and deallocates the peer in an
+ * rcu-delayed manner.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)
+{
+ if (!peer)
+ return NULL;
+
+ /* disconnect from environment */
+ bus1_peer_disconnect(peer);
+
+ /* deinitialize peer-private section */
+ mutex_destroy(&peer->local.lock);
+
+ /* deinitialize data section */
+ mutex_destroy(&peer->data.lock);
+
+ /* deinitialize constant fields */
+ debugfs_remove_recursive(peer->debugdir);
+ bus1_active_deinit(&peer->active);
+ peer->user = bus1_user_unref(peer->user);
+ put_pid_ns(peer->pid_ns);
+ put_cred(peer->cred);
+ kfree_rcu(peer, rcu);
+
+ return NULL;
+}
diff --git a/ipc/bus1/peer.h b/ipc/bus1/peer.h
new file mode 100644
index 0000000..277fcf8
--- /dev/null
+++ b/ipc/bus1/peer.h
@@ -0,0 +1,146 @@
+#ifndef __BUS1_PEER_H
+#define __BUS1_PEER_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Peers
+ *
+ * A peer context provides access to the bus1 system. A peer itself is not a
+ * routable entity, but rather only a local anchor to serve as gateway to the
+ * bus. To participate on the bus, you need to allocate a peer. This peer
+ * manages all your state on the bus, including all allocated nodes, owned
+ * handles, incoming messages, and more.
+ *
+ * A peer is split into 3 sections:
+ * - A static section that is initialized at peer creation and never changes
+ * - A peer-local section that is only ever accessed by ioctls done by the
+ * peer itself.
+ * - A data section that might be accessed by remote peers when interacting
+ * with this peer.
+ *
+ * All peers on the system operate on the same level. There is no context a
+ * peer is linked into. Hence, you can never lock multiple peers at the same
+ * time. Instead, peers provide active-references. Before performing an
+ * operation on a peer, an active reference must be acquired and held for as
+ * long as the operation is ongoing. When done, the reference is released again.
+ * When a peer is disconnected, no more active references can be acquired, and
+ * any outstanding operation is waited for before the peer is destroyed.
+ *
+ * In addition to active-references, there are 2 locks: a peer-local lock and
+ * a data lock. The peer-local lock is used to synchronize operations done by
+ * the peer itself. It is never acquired by a remote peer. The data lock
+ * protects the data of the peer, which might be modified by remote peers. The
+ * data lock nests underneath the local lock. Furthermore, the data-lock
+ * critical sections must be kept small and must never block indefinitely.
+ * Remote peers might wait for data-locks, hence they must rely on not being
+ * DoSed. The local peer lock, however, is private to the peer itself. No such
+ * restrictions apply. It is mostly used to give the impression of atomic
+ * operations (i.e., making the API appear consistent and coherent).
+ */
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/rbtree.h>
+#include <linux/wait.h>
+#include "user.h"
+#include "util/active.h"
+
+struct cred;
+struct dentry;
+struct pid_namespace;
+
+/**
+ * struct bus1_peer - peer context
+ * @id: peer ID
+ * @flags: peer flags
+ * @cred: pinned credentials
+ * @pid_ns: pinned pid-namespace
+ * @user: pinned user
+ * @rcu: rcu-delayed kfree of peer
+ * @waitq: peer wide wait queue
+ * @active: active references
+ * @debugdir: debugfs root of this peer, or NULL/ERR_PTR
+ * @data.lock: data lock
+ * @local.lock: local peer runtime lock
+ */
+struct bus1_peer {
+ u64 id;
+ u64 flags;
+ const struct cred *cred;
+ struct pid_namespace *pid_ns;
+ struct bus1_user *user;
+ struct rcu_head rcu;
+ wait_queue_head_t waitq;
+ struct bus1_active active;
+ struct dentry *debugdir;
+
+ struct {
+ struct mutex lock;
+ } data;
+
+ struct {
+ struct mutex lock;
+ } local;
+};
+
+struct bus1_peer *bus1_peer_new(void);
+struct bus1_peer *bus1_peer_free(struct bus1_peer *peer);
+
+/**
+ * bus1_peer_acquire() - acquire active reference to peer
+ * @peer: peer to operate on, or NULL
+ *
+ * Acquire a new active reference to the given peer. If the peer was not
+ * activated yet, or if it was already deactivated, this will fail.
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: Pointer to peer, NULL on failure.
+ */
+static inline struct bus1_peer *bus1_peer_acquire(struct bus1_peer *peer)
+{
+ if (peer && bus1_active_acquire(&peer->active))
+ return peer;
+ return NULL;
+}
+
+/**
+ * bus1_peer_release() - release an active reference
+ * @peer: handle to release, or NULL
+ *
+ * This releases an active reference to a peer, acquired previously via
+ * bus1_peer_acquire().
+ *
+ * If NULL is passed, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+static inline struct bus1_peer *bus1_peer_release(struct bus1_peer *peer)
+{
+ if (peer) {
+ /*
+ * An active reference is sufficient to keep a peer alive. As
+ * such, releasing the active-reference might wake up a pending
+ * peer destruction. But bus1_active_release() has to first
+ * drop the ref, then wake up the wake-queue. Taking an rcu
+ * read lock guarantees the wake-queue (i.e., its underlying
+ * peer) is still around for the wake-up operation.
+ */
+ rcu_read_lock();
+ bus1_active_release(&peer->active, &peer->waitq);
+ rcu_read_unlock();
+ }
+ return NULL;
+}
+
+#endif /* __BUS1_PEER_H */
diff --git a/ipc/bus1/util.c b/ipc/bus1/util.c
new file mode 100644
index 0000000..8acf798
--- /dev/null
+++ b/ipc/bus1/util.c
@@ -0,0 +1,52 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/atomic.h>
+#include <linux/debugfs.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include "util.h"
+
+#if defined(CONFIG_DEBUG_FS)
+
+static int bus1_debugfs_atomic_t_get(void *data, u64 *val)
+{
+ *val = atomic_read((atomic_t *)data);
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(bus1_debugfs_atomic_x_ro,
+ bus1_debugfs_atomic_t_get,
+ NULL,
+ "%llx\n");
+
+/**
+ * bus1_debugfs_create_atomic_x() - create debugfs file for hex atomic_t
+ * @name: file name to use
+ * @mode: permissions for the file
+ * @parent: parent directory
+ * @value: variable to read from, or write to
+ *
+ * This is almost equivalent to debugfs_create_atomic_t() but prints/reads the
+ * data as a hexadecimal value. So far, only read-only attributes are supported.
+ *
+ * Return: Pointer to new dentry, NULL/ERR_PTR if disabled or on failure.
+ */
+struct dentry *bus1_debugfs_create_atomic_x(const char *name,
+ umode_t mode,
+ struct dentry *parent,
+ atomic_t *value)
+{
+ return debugfs_create_file_unsafe(name, mode, parent, value,
+ &bus1_debugfs_atomic_x_ro);
+}
+
+#endif /* defined(CONFIG_DEBUG_FS) */
diff --git a/ipc/bus1/util.h b/ipc/bus1/util.h
new file mode 100644
index 0000000..b9f9e8d
--- /dev/null
+++ b/ipc/bus1/util.h
@@ -0,0 +1,51 @@
+#ifndef __BUS1_UTIL_H
+#define __BUS1_UTIL_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * Utilities
+ *
+ * Random utility functions that don't belong to a specific object. Some of
+ * them are copies from internal kernel functions (which lack an export
+ * annotation), some of them are variants of internal kernel functions, and
+ * some of them are our own.
+ */
+
+#include <linux/atomic.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+
+struct dentry;
+
+#if defined(CONFIG_DEBUG_FS)
+
+struct dentry *
+bus1_debugfs_create_atomic_x(const char *name,
+ umode_t mode,
+ struct dentry *parent,
+ atomic_t *value);
+
+#else
+
+static inline struct dentry *
+bus1_debugfs_create_atomic_x(const char *name,
+ umode_t mode,
+ struct dentry *parent,
+ atomic_t *value)
+{
+ return ERR_PTR(-ENODEV);
+}
+
+#endif
+
+#endif /* __BUS1_UTIL_H */
--
2.10.1

2016-10-26 19:24:54

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

From: Tom Gundersen <[email protected]>

Add the CONFIG_BUS1 option to enable the bus1 kernel messaging bus. If
enabled, provide the bus1.ko module with a stub cdev /dev/bus1. So far
it does not expose any API, but the full intended uapi is provided in
include/uapi/linux/bus1.h already.
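
As a minimal user-space sketch (hypothetical at this stage, since the
stub cdev accepts no ioctls yet), a peer is created by opening the
character device and destroyed by closing it:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/dev/bus1", O_RDWR | O_CLOEXEC);

            if (fd < 0) {
                    perror("open /dev/bus1");
                    return 1;
            }
            /* The BUS1_CMD_* ioctls declared in include/uapi/linux/bus1.h
             * are wired up by later patches; for now the fd only
             * represents a peer. */
            close(fd);
            return 0;
    }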

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
include/uapi/linux/bus1.h | 138 ++++++++++++++++++++++++++++++++++++++++++++++
init/Kconfig | 17 ++++++
ipc/Makefile | 1 +
ipc/bus1/Makefile | 6 ++
ipc/bus1/main.c | 80 +++++++++++++++++++++++++++
ipc/bus1/main.h | 74 +++++++++++++++++++++++++
ipc/bus1/tests.c | 19 +++++++
ipc/bus1/tests.h | 32 +++++++++++
8 files changed, 367 insertions(+)
create mode 100644 include/uapi/linux/bus1.h
create mode 100644 ipc/bus1/Makefile
create mode 100644 ipc/bus1/main.c
create mode 100644 ipc/bus1/main.h
create mode 100644 ipc/bus1/tests.c
create mode 100644 ipc/bus1/tests.h

diff --git a/include/uapi/linux/bus1.h b/include/uapi/linux/bus1.h
new file mode 100644
index 0000000..8ec3357
--- /dev/null
+++ b/include/uapi/linux/bus1.h
@@ -0,0 +1,138 @@
+#ifndef _UAPI_LINUX_BUS1_H
+#define _UAPI_LINUX_BUS1_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+#define BUS1_FD_MAX (256)
+
+#define BUS1_IOCTL_MAGIC 0x96
+#define BUS1_HANDLE_INVALID ((__u64)-1)
+#define BUS1_OFFSET_INVALID ((__u64)-1)
+
+enum {
+ BUS1_HANDLE_FLAG_MANAGED = 1ULL << 0,
+ BUS1_HANDLE_FLAG_REMOTE = 1ULL << 1,
+};
+
+enum {
+ BUS1_PEER_FLAG_WANT_SECCTX = 1ULL << 0,
+};
+
+enum {
+ BUS1_PEER_RESET_FLAG_FLUSH = 1ULL << 0,
+ BUS1_PEER_RESET_FLAG_FLUSH_SEED = 1ULL << 1,
+};
+
+struct bus1_cmd_peer_reset {
+ __u64 flags;
+ __u64 peer_flags;
+ __u32 max_slices;
+ __u32 max_handles;
+ __u32 max_inflight_bytes;
+ __u32 max_inflight_fds;
+} __attribute__((__aligned__(8)));
+
+struct bus1_cmd_handle_transfer {
+ __u64 flags;
+ __u64 src_handle;
+ __u64 dst_fd;
+ __u64 dst_handle;
+} __attribute__((__aligned__(8)));
+
+enum {
+ BUS1_NODES_DESTROY_FLAG_RELEASE_HANDLES = 1ULL << 0,
+};
+
+struct bus1_cmd_nodes_destroy {
+ __u64 flags;
+ __u64 ptr_nodes;
+ __u64 n_nodes;
+} __attribute__((__aligned__(8)));
+
+enum {
+ BUS1_SEND_FLAG_CONTINUE = 1ULL << 0,
+ BUS1_SEND_FLAG_SEED = 1ULL << 1,
+};
+
+struct bus1_cmd_send {
+ __u64 flags;
+ __u64 ptr_destinations;
+ __u64 ptr_errors;
+ __u64 n_destinations;
+ __u64 ptr_vecs;
+ __u64 n_vecs;
+ __u64 ptr_handles;
+ __u64 n_handles;
+ __u64 ptr_fds;
+ __u64 n_fds;
+} __attribute__((__aligned__(8)));
+
+enum {
+ BUS1_RECV_FLAG_PEEK = 1ULL << 0,
+ BUS1_RECV_FLAG_SEED = 1ULL << 1,
+ BUS1_RECV_FLAG_INSTALL_FDS = 1ULL << 2,
+};
+
+enum {
+ BUS1_MSG_NONE,
+ BUS1_MSG_DATA,
+ BUS1_MSG_NODE_DESTROY,
+ BUS1_MSG_NODE_RELEASE,
+};
+
+enum {
+ BUS1_MSG_FLAG_HAS_SECCTX = 1ULL << 0,
+ BUS1_MSG_FLAG_CONTINUE = 1ULL << 1,
+};
+
+struct bus1_cmd_recv {
+ __u64 flags;
+ __u64 max_offset;
+ struct {
+ __u64 type;
+ __u64 flags;
+ __u64 destination;
+ __u32 uid;
+ __u32 gid;
+ __u32 pid;
+ __u32 tid;
+ __u64 offset;
+ __u64 n_bytes;
+ __u64 n_handles;
+ __u64 n_fds;
+ __u64 n_secctx;
+ } __attribute__((__aligned__(8))) msg;
+} __attribute__((__aligned__(8)));
+
+enum {
+ BUS1_CMD_PEER_DISCONNECT = _IOWR(BUS1_IOCTL_MAGIC, 0x00,
+ __u64),
+ BUS1_CMD_PEER_QUERY = _IOWR(BUS1_IOCTL_MAGIC, 0x01,
+ struct bus1_cmd_peer_reset),
+ BUS1_CMD_PEER_RESET = _IOWR(BUS1_IOCTL_MAGIC, 0x02,
+ struct bus1_cmd_peer_reset),
+ BUS1_CMD_HANDLE_RELEASE = _IOWR(BUS1_IOCTL_MAGIC, 0x10,
+ __u64),
+ BUS1_CMD_HANDLE_TRANSFER = _IOWR(BUS1_IOCTL_MAGIC, 0x11,
+ struct bus1_cmd_handle_transfer),
+ BUS1_CMD_NODES_DESTROY = _IOWR(BUS1_IOCTL_MAGIC, 0x20,
+ struct bus1_cmd_nodes_destroy),
+ BUS1_CMD_SLICE_RELEASE = _IOWR(BUS1_IOCTL_MAGIC, 0x30,
+ __u64),
+ BUS1_CMD_SEND = _IOWR(BUS1_IOCTL_MAGIC, 0x40,
+ struct bus1_cmd_send),
+ BUS1_CMD_RECV = _IOWR(BUS1_IOCTL_MAGIC, 0x50,
+ struct bus1_cmd_recv),
+};
+
+#endif /* _UAPI_LINUX_BUS1_H */
diff --git a/init/Kconfig b/init/Kconfig
index 34407f1..04c7daf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -273,6 +273,23 @@ config POSIX_MQUEUE_SYSCTL
depends on SYSCTL
default y

+config BUS1
+ tristate "Bus1 Kernel Message Bus"
+ help
+ The Bus1 Kernel Message Bus defines and implements a distributed
+ object model. It provides a capability-based IPC system for machine
+ local communication.
+
+ The Bus1 IPC system is exposed via /dev/bus1. If debugfs is enabled,
+ bus1 exposes additional debug information there.
+
+config BUS1_TESTS
+ bool "Bus1 Self-Tests"
+ depends on BUS1
+ help
+ Enable the bus1 self-tests, which are run each time the module is loaded. The
+ overhead is minimal, so there is generally no harm in enabling it.
+
config CROSS_MEMORY_ATTACH
bool "Enable process_vm_readv/writev syscalls"
depends on MMU
diff --git a/ipc/Makefile b/ipc/Makefile
index 86c7300..eee12d1 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
obj-$(CONFIG_IPC_NS) += namespace.o
obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
+obj-$(CONFIG_BUS1) += bus1/

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
new file mode 100644
index 0000000..d3a4491
--- /dev/null
+++ b/ipc/bus1/Makefile
@@ -0,0 +1,6 @@
+bus1-y := \
+ main.o
+
+obj-$(CONFIG_BUS1) += bus1.o
+
+bus1-$(CONFIG_BUS1_TESTS) += tests.o
diff --git a/ipc/bus1/main.c b/ipc/bus1/main.c
new file mode 100644
index 0000000..02412a7
--- /dev/null
+++ b/ipc/bus1/main.c
@@ -0,0 +1,80 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/debugfs.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include "main.h"
+#include "tests.h"
+
+static int bus1_fop_open(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static int bus1_fop_release(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+const struct file_operations bus1_fops = {
+ .owner = THIS_MODULE,
+ .open = bus1_fop_open,
+ .release = bus1_fop_release,
+ .llseek = noop_llseek,
+};
+
+static struct miscdevice bus1_misc = {
+ .fops = &bus1_fops,
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = KBUILD_MODNAME,
+ .mode = S_IRUGO | S_IWUGO,
+};
+
+struct dentry *bus1_debugdir;
+
+static int __init bus1_modinit(void)
+{
+ int r;
+
+ r = bus1_tests_run();
+ if (r < 0)
+ return r;
+
+ bus1_debugdir = debugfs_create_dir(KBUILD_MODNAME, NULL);
+ if (!bus1_debugdir)
+ pr_err("cannot create debugfs root\n");
+
+ r = misc_register(&bus1_misc);
+ if (r < 0)
+ goto error;
+
+ pr_info("loaded\n");
+ return 0;
+
+error:
+ debugfs_remove(bus1_debugdir);
+ return r;
+}
+
+static void __exit bus1_modexit(void)
+{
+ misc_deregister(&bus1_misc);
+ debugfs_remove(bus1_debugdir);
+ pr_info("unloaded\n");
+}
+
+module_init(bus1_modinit);
+module_exit(bus1_modexit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Bus based interprocess communication");
diff --git a/ipc/bus1/main.h b/ipc/bus1/main.h
new file mode 100644
index 0000000..76fce66
--- /dev/null
+++ b/ipc/bus1/main.h
@@ -0,0 +1,74 @@
+#ifndef __BUS1_MAIN_H
+#define __BUS1_MAIN_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Bus1 Overview
+ *
+ * bus1 is a local IPC system, which provides a decentralized infrastructure to
+ * share objects between local peers. The main building blocks are nodes and
+ * handles. Nodes represent objects of a local peer, while handles represent
+ * descriptors that point to a node. Nodes can be created and destroyed by any
+ * peer, and they will always remain owned by their respective creator. Handles,
+ * on the other hand, are used to refer to nodes and can be passed around with
+ * messages as auxiliary data. Whenever a handle is transferred, the receiver
+ * will get its own handle allocated, pointing to the same node as the original
+ * handle.
+ *
+ * Any peer can send messages directed at one of their handles. This will
+ * transfer the message to the owner of the node the handle points to. If a
+ * peer does not possess a handle to a given node, it will not be able to send a
+ * message to that node. That is, handles provide exclusive access management.
+ * Anyone that somehow acquired a handle to a node is privileged to further
+ * send this handle to other peers. As such, access management is transitive.
+ * Once a peer acquired a handle, it cannot be revoked again. However, a node
+ * owner can, at anytime, destroy a node. This will effectively unbind all
+ * existing handles to that node on any peer, notifying each one of the
+ * destruction.
+ *
+ * Unlike nodes and handles, peers cannot be addressed directly. In fact, peers
+ * are completely disconnected entities. A peer is merely an anchor of a set of
+ * nodes and handles, including an incoming message queue for any of those.
+ * Whether multiple nodes are all part of the same peer, or part of different
+ * peers does not affect the remote view of those. Peers solely exist as
+ * management entity and command dispatcher to local processes.
+ *
+ * The set of actors on a system is completely decentralized. There is no
+ * global component involved that provides a central registry or discovery
+ * mechanism. Furthermore, communication between peers only involves those
+ * peers, and does not affect any other peer in any way. No global
+ * communication lock is taken. However, any communication is still globally
+ * ordered, including unicasts, multicasts, and notifications.
+ */
+
+struct dentry;
+struct file_operations;
+
+/**
+ * bus1_fops - file-operations of bus1 character devices
+ *
+ * All bus1 peers are backed by a character device with @bus1_fops used as
+ * file-operations. That is, a file is a bus1 peer if, and only if, its f_op
+ * pointer contains @bus1_fops.
+ */
+extern const struct file_operations bus1_fops;
+
+/**
+ * bus1_debugdir - debugfs root directory
+ *
+ * If debugfs is enabled, this is set to point to the debugfs root directory
+ * for this module. If debugfs is disabled, or if the root directory could not
+ * be created, this is set to NULL or ERR_PTR (which debugfs functions can deal
+ * with seamlessly).
+ */
+extern struct dentry *bus1_debugdir;
+
+#endif /* __BUS1_MAIN_H */
diff --git a/ipc/bus1/tests.c b/ipc/bus1/tests.c
new file mode 100644
index 0000000..6fd2946
--- /dev/null
+++ b/ipc/bus1/tests.c
@@ -0,0 +1,19 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include "tests.h"
+
+int bus1_tests_run(void)
+{
+ pr_info("run selftests..\n");
+ return 0;
+}
diff --git a/ipc/bus1/tests.h b/ipc/bus1/tests.h
new file mode 100644
index 0000000..fb554e2
--- /dev/null
+++ b/ipc/bus1/tests.h
@@ -0,0 +1,32 @@
+#ifndef __BUS1_TESTS_H
+#define __BUS1_TESTS_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * Kernel Selftests
+ *
+ * These tests are built into the kernel module itself if, and only if, the
+ * required configuration is selected. On every module load, the selftests will
+ * be run. On production builds, this option should not be selected.
+ */
+
+#include <linux/kernel.h>
+
+#if IS_ENABLED(CONFIG_BUS1_TESTS)
+int bus1_tests_run(void);
+#else
+static inline int bus1_tests_run(void)
+{
+ return 0;
+}
+#endif
+
+#endif /* __BUS1_TESTS_H */
--
2.10.1

2016-10-26 19:24:52

by David Herrmann

[permalink] [raw]
Subject: [RFC v1 03/14] bus1: util - active reference utility library

From: Tom Gundersen <[email protected]>

The bus1_active object implements active references. They work
similarly to plain object reference counters, but allow disabling
any new references from being taken.

Each bus1_active object goes through a set of states:
NEW: Initial state, no active references can be acquired
ACTIVE: Live state, active references can be acquired
DRAINING: Deactivated but lingering, no active references
can be acquired
DRAINED: Deactivated and all active references were dropped
RELEASED: Fully drained and synchronously released

Initially, all bus1_active objects are in state NEW. As soon as they're
activated, they enter ACTIVE and active references can be acquired.
This is the normal, live state. Once the object is deactivated, it
enters state DRAINING. No new active references can be acquired, but
some threads might still own active references. Once all those are
dropped, the object enters state DRAINED. Now the object can be
released a *single* time, before it enters state RELEASED and is
finished. It cannot be re-used anymore.

Active-references are very useful to track threads that invoke callbacks
on an object. As long as a callback is running, an active reference is
held, and as such the object is usually protected from being destroyed.
The destructor of the object needs to deactivate *and* drain the object,
before releasing resources.

Active references will be used heavily by the upcoming bus1_peer object.
Whenever a peer operates on a remote peer, it must acquire and hold an
active reference on that remote peer. This guarantees that the remote
peer will wait for this operation to finish before possibly
disconnecting from the bus.
In concept, active-references can be seen as rw-locks. However, they
have much more strict state-transitions. Prior art can be seen in
super-blocks ('atomic_t s_active'), and kernfs ('atomic_t active').
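
A minimal sketch of the intended lifecycle, using the helpers added by
this patch (the embedding object 'foo' and its wait-queue are made up
for illustration):

    struct foo {
            struct bus1_active active;
            wait_queue_head_t waitq;
    };

    static void foo_setup(struct foo *f)
    {
            init_waitqueue_head(&f->waitq);
            bus1_active_init(&f->active);           /* NEW */
            bus1_active_activate(&f->active);       /* ACTIVE */
    }

    static void foo_op(struct foo *f)
    {
            if (!bus1_active_acquire(&f->active))
                    return; /* already deactivated */
            /* ... safely operate on the object ... */
            bus1_active_release(&f->active, &f->waitq);
    }

    static void foo_teardown(struct foo *f)
    {
            bus1_active_deactivate(&f->active);             /* DRAINING */
            bus1_active_drain(&f->active, &f->waitq);       /* DRAINED */
            bus1_active_cleanup(&f->active, &f->waitq, NULL, NULL);
            bus1_active_deinit(&f->active);                 /* RELEASED */
    }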

Signed-off-by: Tom Gundersen <[email protected]>
Signed-off-by: David Herrmann <[email protected]>
---
ipc/bus1/Makefile | 3 +-
ipc/bus1/util/active.c | 419 +++++++++++++++++++++++++++++++++++++++++++++++++
ipc/bus1/util/active.h | 154 ++++++++++++++++++
3 files changed, 575 insertions(+), 1 deletion(-)
create mode 100644 ipc/bus1/util/active.c
create mode 100644 ipc/bus1/util/active.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index d3a4491..9e491691 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -1,5 +1,6 @@
bus1-y := \
- main.o
+ main.o \
+ util/active.o

obj-$(CONFIG_BUS1) += bus1.o

diff --git a/ipc/bus1/util/active.c b/ipc/bus1/util/active.c
new file mode 100644
index 0000000..5f5fdaa
--- /dev/null
+++ b/ipc/bus1/util/active.c
@@ -0,0 +1,419 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include "active.h"
+
+/*
+ * Bias values track states of "active references". They're all negative. If an
+ * object is active, its active-ref-counter is >=0 and tracks all active
+ * references. Once an object is deactivated, we subtract ACTIVE_BIAS. This
+ * means, the counter is now negative but still counts the active references.
+ * Once it drops to exactly ACTIVE_BIAS, we know all active references were
+ * dropped. Exactly one thread will change it to ACTIVE_RELEASE now, perform
+ * cleanup and then put it into ACTIVE_DONE. Once released, all other threads
+ * that tried deactivating the node will now be woken up (thus, they wait until
+ * the object is fully done).
+ * The initial state during object setup is ACTIVE_NEW. If an object is
+ * directly deactivated without having ever been active, it is put into
+ * ACTIVE_RELEASE_DIRECT instead of ACTIVE_BIAS. This tracks this one-bit state
+ * across deactivation. The task putting it into ACTIVE_RELEASE now knows
+ * whether the object was active before or not.
+ *
+ * We support lockdep annotations for 'active references'. We treat active
+ * references as a read-trylock, and deactivation as a write-lock.
+ *
+ * Some archs implement atomic_sub(v) with atomic_add(-v), so reserve INT_MIN
+ * to avoid overflows if multiplied by -1.
+ */
+#define BUS1_ACTIVE_RELEASE_DIRECT (BUS1_ACTIVE_BIAS - 1)
+#define BUS1_ACTIVE_RELEASE (BUS1_ACTIVE_BIAS - 2)
+#define BUS1_ACTIVE_DONE (BUS1_ACTIVE_BIAS - 3)
+#define BUS1_ACTIVE_NEW (BUS1_ACTIVE_BIAS - 4)
+#define _BUS1_ACTIVE_RESERVED (BUS1_ACTIVE_BIAS - 5)
+
+/**
+ * bus1_active_init_private() - initialize object
+ * @active: object to initialize
+ *
+ * This initializes an active-object. The initial state is NEW, and as such no
+ * active reference can be acquired. The object must be activated first.
+ *
+ * This is an internal helper. Always use the public bus1_active_init() macro
+ * which does proper lockdep initialization for private key classes.
+ */
+void bus1_active_init_private(struct bus1_active *active)
+{
+ atomic_set(&active->count, BUS1_ACTIVE_NEW);
+}
+
+/**
+ * bus1_active_deinit() - destroy object
+ * @active: object to destroy
+ *
+ * Destroy an active-object. The object must have been initialized via
+ * bus1_active_init(), deactivated via bus1_active_deactivate(), drained via
+ * bus1_active_drain() and cleaned via bus1_active_cleanup(), before you can
+ * destroy it. Alternatively, it can also be destroyed if still in state NEW.
+ *
+ * This function only does sanity checks, it does not modify the object itself.
+ * There is no allocated memory, so there is nothing to do.
+ */
+void bus1_active_deinit(struct bus1_active *active)
+{
+ int v;
+
+ v = atomic_read(&active->count);
+ WARN_ON(v != BUS1_ACTIVE_NEW && v != BUS1_ACTIVE_DONE);
+}
+
+/**
+ * bus1_active_is_new() - check whether object is new
+ * @active: object to check
+ *
+ * This checks whether the object is new, that is, it was never activated nor
+ * deactivated.
+ *
+ * Return: True if new, false if not.
+ */
+bool bus1_active_is_new(struct bus1_active *active)
+{
+ return atomic_read(&active->count) == BUS1_ACTIVE_NEW;
+}
+
+/**
+ * bus1_active_is_active() - check whether object is active
+ * @active: object to check
+ *
+ * This checks whether the given active-object is active. That is, the object
+ * was already activated, but not deactivated, yet.
+ *
+ * Note that this function does not give any guarantee that the object is still
+ * active/inactive at the time this call returns. It only serves as a barrier.
+ *
+ * Return: True if active, false if not.
+ */
+bool bus1_active_is_active(struct bus1_active *active)
+{
+ return atomic_read(&active->count) >= 0;
+}
+
+/**
+ * bus1_active_is_deactivated() - check whether object was deactivated
+ * @active: object to check
+ *
+ * This checks whether the given active-object was already deactivated. That
+ * is, the object was actively deactivated (state NEW does *not* count as
+ * deactivated) via bus1_active_deactivate().
+ *
+ * Once this function returns true, it cannot change again on this object.
+ *
+ * Return: True if already deactivated, false if not.
+ */
+bool bus1_active_is_deactivated(struct bus1_active *active)
+{
+ int v = atomic_read(&active->count);
+
+ return v > BUS1_ACTIVE_NEW && v < 0;
+}
+
+/**
+ * bus1_active_is_drained() - check whether object is drained
+ * @active: object to check
+ *
+ * This checks whether the given object was already deactivated and is fully
+ * drained. That is, no active references to the object exist, nor can they be
+ * acquired, anymore.
+ *
+ * Return: True if drained, false if not.
+ */
+bool bus1_active_is_drained(struct bus1_active *active)
+{
+ int v = atomic_read(&active->count);
+
+ return v > BUS1_ACTIVE_NEW && v <= BUS1_ACTIVE_BIAS;
+}
+
+/**
+ * bus1_active_activate() - activate object
+ * @active: object to activate
+ *
+ * This activates the given object, if it is still in state NEW. Otherwise, it
+ * is a no-op (and the object might already be deactivated).
+ *
+ * Once this returns successfully, active references can be acquired.
+ *
+ * Return: True if this call activated it, false if it was already activated,
+ * or deactivated.
+ */
+bool bus1_active_activate(struct bus1_active *active)
+{
+ return atomic_cmpxchg(&active->count,
+ BUS1_ACTIVE_NEW, 0) == BUS1_ACTIVE_NEW;
+}
+
+/**
+ * bus1_active_deactivate() - deactivate object
+ * @active: object to deactivate
+ *
+ * This deactivates the given object, if not already done by someone else. Once
+ * this returns, no new active references can be acquired.
+ *
+ * Return: True if this call deactivated the object, false if it was already
+ * deactivated by someone else.
+ */
+bool bus1_active_deactivate(struct bus1_active *active)
+{
+ int v, v1;
+
+ v = atomic_cmpxchg(&active->count,
+ BUS1_ACTIVE_NEW, BUS1_ACTIVE_RELEASE_DIRECT);
+ if (unlikely(v == BUS1_ACTIVE_NEW))
+ return true;
+
+ /*
+ * This adds BUS1_ACTIVE_BIAS to the counter, unless it is negative:
+ * atomic_add_unless_negative(&active->count, BUS1_ACTIVE_BIAS)
+ * No such global helper exists, so it is inline here.
+ */
+ for (v = atomic_read(&active->count); v >= 0; v = v1) {
+ v1 = atomic_cmpxchg(&active->count, v, v + BUS1_ACTIVE_BIAS);
+ if (likely(v1 == v))
+ return true;
+ }
+
+ return false;
+}
+
+/**
+ * bus1_active_drain() - drain active references
+ * @active: object to drain
+ * @waitq: wait-queue linked to @active
+ *
+ * This waits for all active-references on @active to be dropped. It uses the
+ * passed wait-queue to sleep. It must be the same wait-queue that is used when
+ * calling bus1_active_release().
+ *
+ * The caller must guarantee that bus1_active_deactivate() was called before.
+ *
+ * This function can be safely called in parallel on multiple CPUs.
+ *
+ * Semantically (and also enforced by lockdep), this call behaves like a
+ * down_write(), followed by an up_write(), on this active object.
+ */
+void bus1_active_drain(struct bus1_active *active, wait_queue_head_t *waitq)
+{
+ if (WARN_ON(!bus1_active_is_deactivated(active)))
+ return;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /*
+ * We pretend this is a down_write_interruptible() and all but
+ * the release-context get interrupted. This is required, as we
+ * cannot call lock_acquired() on multiple threads without
+ * synchronization. Hence, only the release-context will do
+ * this, all others just release the lock.
+ */
+ lock_acquire_exclusive(&active->dep_map, /* lock */
+ 0, /* subclass */
+ 0, /* try-lock */
+ NULL, /* nest underneath */
+ _RET_IP_); /* IP */
+ if (atomic_read(&active->count) > BUS1_ACTIVE_BIAS)
+ lock_contended(&active->dep_map, _RET_IP_);
+#endif
+
+ /* wait until all active references were dropped */
+ wait_event(*waitq, atomic_read(&active->count) <= BUS1_ACTIVE_BIAS);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /*
+ * Pretend that no-one got the lock, but everyone got interrupted
+ * instead. That is, they released the lock without ever actually
+ * getting it locked.
+ */
+ lock_release(&active->dep_map, /* lock */
+ 1, /* nested (no-op) */
+ _RET_IP_); /* instruction pointer */
+#endif
+}
+
+/**
+ * bus1_active_cleanup() - cleanup drained object
+ * @active: object to release
+ * @waitq: wait-queue linked to @active, or NULL
+ * @cleanup: cleanup callback, or NULL
+ * @userdata: userdata for callback
+ *
+ * This performs the final object cleanup. The caller must guarantee that the
+ * object is drained, by calling bus1_active_drain().
+ *
+ * This function invokes the passed cleanup callback on the object. However, it
+ * guarantees that this is done exactly once. If there are multiple parallel
+ * callers, this will pick one randomly and make all others wait until it is
+ * done. If you call this after it was already cleaned up, this is a no-op
+ * and only serves as a barrier.
+ *
+ * If @waitq is NULL, the wait is skipped and the call returns immediately. In
+ * this case, another thread has entered before, but there is no guarantee that
+ * it has finished executing the cleanup callback, yet.
+ *
+ * If @waitq is non-NULL, this call behaves like a down_write(), followed by an
+ * up_write(), just like bus1_active_drain(). If @waitq is NULL, this rather
+ * behaves like a down_write_trylock(), optionally followed by an up_write().
+ *
+ * Return: True if this is the thread that released it, false otherwise.
+ */
+bool bus1_active_cleanup(struct bus1_active *active,
+ wait_queue_head_t *waitq,
+ void (*cleanup)(struct bus1_active *, void *),
+ void *userdata)
+{
+ int v;
+
+ if (WARN_ON(!bus1_active_is_drained(active)))
+ return false;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /*
+ * We pretend this is a down_write_interruptible() and all but
+ * the release-context get interrupted. This is required, as we
+ * cannot call lock_acquired() on multiple threads without
+ * synchronization. Hence, only the release-context will do
+ * this, all others just release the lock.
+ */
+ lock_acquire_exclusive(&active->dep_map,/* lock */
+ 0, /* subclass */
+ !waitq, /* try-lock */
+ NULL, /* nest underneath */
+ _RET_IP_); /* IP */
+#endif
+
+ /* mark object as RELEASE */
+ v = atomic_cmpxchg(&active->count,
+ BUS1_ACTIVE_RELEASE_DIRECT, BUS1_ACTIVE_RELEASE);
+ if (v != BUS1_ACTIVE_RELEASE_DIRECT)
+ v = atomic_cmpxchg(&active->count,
+ BUS1_ACTIVE_BIAS, BUS1_ACTIVE_RELEASE);
+
+ /*
+ * If this is the thread that marked the object as RELEASE, we
+ * perform the actual release. Otherwise, we wait until the
+ * release is done and the node is marked as DRAINED.
+ */
+ if (v == BUS1_ACTIVE_BIAS || v == BUS1_ACTIVE_RELEASE_DIRECT) {
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /* we're the release-context and acquired the lock */
+ lock_acquired(&active->dep_map, _RET_IP_);
+#endif
+
+ if (cleanup)
+ cleanup(active, userdata);
+
+ /* mark as DONE */
+ atomic_set(&active->count, BUS1_ACTIVE_DONE);
+ if (waitq)
+ wake_up_all(waitq);
+ } else if (waitq) {
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /* we're contended against the release context */
+ lock_contended(&active->dep_map, _RET_IP_);
+#endif
+
+ /* wait until object is DONE */
+ wait_event(*waitq,
+ atomic_read(&active->count) == BUS1_ACTIVE_DONE);
+ }
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /*
+ * No-one but the release-context acquired the lock. However,
+ * that does not matter as we simply treat this as
+ * 'interrupted'. Everyone releases the lock, but only one
+ * caller really got it.
+ */
+ lock_release(&active->dep_map, /* lock */
+ 1, /* nested (no-op) */
+ _RET_IP_); /* instruction pointer */
+#endif
+
+ /* true if we released it */
+ return v == BUS1_ACTIVE_BIAS || v == BUS1_ACTIVE_RELEASE_DIRECT;
+}
+
+/**
+ * bus1_active_lockdep_acquired() - acquire lockdep reader
+ * @active: object to acquire lockdep reader of, or NULL
+ *
+ * Whenever you acquire an active reference via bus1_active_acquire(), this
+ * function is implicitly called afterwards. It enables lockdep annotations and
+ * tells lockdep that you acquired the active reference.
+ *
+ * However, lockdep cannot support arbitrary depths, hence, we allow
+ * temporarily dropping the lockdep-annotation via
+ * bus1_active_lockdep_released(), and acquiring it later again via
+ * bus1_active_lockdep_acquired().
+ *
+ * Example: If you need to pin a large number of objects, you would acquire each
+ * of them individually via bus1_active_acquire(). Then you would
+ * perform state tracking, etc. on that object. Before you continue
+ * with the next, you call bus1_active_lockdep_released(), to pretend
+ * you released the lock (but you still retain your active reference).
+ * Now you continue with pinning the next object, etc. until you
+ * pinned all objects you need.
+ *
+ * If you now need to access one of your pinned objects (or want to
+ * release them eventually), you call bus1_active_lockdep_acquired()
+ * before accessing the object. This enables the lockdep annotations
+ * again. This cannot fail, ever. You still own the active reference
+ * at all times.
+ * Once you're done with the single object, you either release your
+ * entire active reference via bus1_active_release(), or you
+ * temporarily disable lockdep via bus1_active_lockdep_released()
+ * again, in case you need the pinned object again later.
+ *
+ * Note that you can acquire multiple active references just fine. The only
+ * reason these lockdep helpers are provided is if you need to acquire a
+ * *large* number at the same time. Lockdep is usually limited to a depth of 64,
+ * so you cannot hold more locks at the same time.
+ */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void bus1_active_lockdep_acquired(struct bus1_active *active)
+{
+ if (active)
+ lock_acquire_shared(&active->dep_map, /* lock */
+ 0, /* subclass */
+ 1, /* try-lock */
+ NULL, /* nest underneath */
+ _RET_IP_); /* IP */
+}
+#endif
+
+/**
+ * bus1_active_lockdep_released() - release lockdep reader
+ * @active: object to release lockdep reader of, or NULL
+ *
+ * This is the counterpart of bus1_active_lockdep_acquired(). See its
+ * documentation for details.
+ */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void bus1_active_lockdep_released(struct bus1_active *active)
+{
+ if (active)
+ lock_release(&active->dep_map, /* lock */
+ 1, /* nested (no-op) */
+ _RET_IP_); /* instruction pointer */
+}
+#endif
diff --git a/ipc/bus1/util/active.h b/ipc/bus1/util/active.h
new file mode 100644
index 0000000..462e7cf
--- /dev/null
+++ b/ipc/bus1/util/active.h
@@ -0,0 +1,154 @@
+#ifndef __BUS1_ACTIVE_H
+#define __BUS1_ACTIVE_H
+
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+/**
+ * DOC: Active References
+ *
+ * The bus1_active object implements active references. They work similarly to
+ * plain object reference counters, but allow disabling any new references from
+ * being taken.
+ *
+ * Each bus1_active object goes through a set of states:
+ * NEW: Initial state, no active references can be acquired
+ * ACTIVE: Live state, active references can be acquired
+ * DRAINING: Deactivated but lingering, no active references can be acquired
+ * DRAINED: Deactivated and all active references were dropped
+ * RELEASED: Fully drained and synchronously released
+ *
+ * Initially, all bus1_active objects are in state NEW. As soon as they're
+ * activated, they enter ACTIVE and active references can be acquired. This is
+ * the normal, live state. Once the object is deactivated, it enters state
+ * DRAINING. No new active references can be acquired, but some threads might
+ * still own active references. Once all those are dropped, the object enters
+ * state DRAINED. Now the object can be released a *single* time, before it
+ * enters state RELEASED and is finished. It cannot be re-used anymore.
+ *
+ * Active-references are very useful to track threads that call methods on an
+ * object. As long as a method is running, an active reference is held, and as
+ * such the object is usually protected from being destroyed. The destructor of
+ * the object needs to deactivate *and* drain the object, before releasing
+ * resources.
+ *
+ * Note that active-references cannot be used to manage their own backing
+ * memory. That is, they do not replace normal reference counts.
+ */
+
+#include <linux/atomic.h>
+#include <linux/lockdep.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+
+/* base value for counter-bias, see BUS1_ACTIVE_* constants for details */
+#define BUS1_ACTIVE_BIAS (INT_MIN + 5)
+
+/**
+ * struct bus1_active - active references
+ * @count: active reference counter
+ * @dep_map: lockdep annotations
+ *
+ * This object should be treated like a simple atomic_t. It will only contain
+ * more fields in the case of lockdep-enabled compilations.
+ *
+ * Users must embed this object into their parent structures and create/destroy
+ * it via bus1_active_init() and bus1_active_deinit().
+ */
+struct bus1_active {
+ atomic_t count;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+void bus1_active_init_private(struct bus1_active *active);
+void bus1_active_deinit(struct bus1_active *active);
+bool bus1_active_is_new(struct bus1_active *active);
+bool bus1_active_is_active(struct bus1_active *active);
+bool bus1_active_is_deactivated(struct bus1_active *active);
+bool bus1_active_is_drained(struct bus1_active *active);
+bool bus1_active_activate(struct bus1_active *active);
+bool bus1_active_deactivate(struct bus1_active *active);
+void bus1_active_drain(struct bus1_active *active, wait_queue_head_t *waitq);
+bool bus1_active_cleanup(struct bus1_active *active,
+ wait_queue_head_t *waitq,
+ void (*cleanup) (struct bus1_active *, void *),
+ void *userdata);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+# define bus1_active_init(_active) \
+ ({ \
+ static struct lock_class_key bus1_active_lock_key; \
+ lockdep_init_map(&(_active)->dep_map, "bus1.active", \
+ &bus1_active_lock_key, 0); \
+ bus1_active_init_private(_active); \
+ })
+void bus1_active_lockdep_acquired(struct bus1_active *active);
+void bus1_active_lockdep_released(struct bus1_active *active);
+#else
+# define bus1_active_init(_active) bus1_active_init_private(_active)
+static inline void bus1_active_lockdep_acquired(struct bus1_active *active) {}
+static inline void bus1_active_lockdep_released(struct bus1_active *active) {}
+#endif
+
+/**
+ * bus1_active_acquire() - acquire active reference
+ * @active: object to acquire active reference to, or NULL
+ *
+ * This acquires an active reference to the passed object. If the object was
+ * not activated, yet, or if it was already deactivated, this will fail and
+ * return NULL. If a reference was successfully acquired, this will return
+ * @active.
+ *
+ * If NULL is passed, this is a no-op and always returns NULL.
+ *
+ * This behaves as a down_read_trylock(). Use bus1_active_release() to release
+ * the reference again and get the matching up_read().
+ *
+ * Return: @active if reference was acquired, NULL if not.
+ */
+static inline struct bus1_active *
+bus1_active_acquire(struct bus1_active *active)
+{
+ if (active && atomic_inc_unless_negative(&active->count))
+ bus1_active_lockdep_acquired(active);
+ else
+ active = NULL;
+ return active;
+}
+
+/**
+ * bus1_active_release() - release active reference
+ * @active: object to release active reference of, or NULL
+ * @waitq: wait-queue linked to @active, or NULL
+ *
+ * This releases an active reference that was previously acquired via
+ * bus1_active_acquire().
+ *
+ * This is a no-op if NULL is passed.
+ *
+ * This behaves like an up_read().
+ *
+ * Return: NULL is returned.
+ */
+static inline struct bus1_active *
+bus1_active_release(struct bus1_active *active, wait_queue_head_t *waitq)
+{
+	if (active) {
+		bus1_active_lockdep_released(active);
+		if (atomic_dec_return(&active->count) == BUS1_ACTIVE_BIAS)
+			if (waitq)
+				wake_up(waitq);
+	}
+	return NULL;
+}
+
+#endif /* __BUS1_ACTIVE_H */
--
2.10.1
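
For illustration, a minimal usage sketch of the API declared in this header
(not part of the patch; "struct foo" and its fields are invented here, only
the bus1_* calls come from active.h):

	struct foo {
		struct bus1_active active;
		wait_queue_head_t waitq;
	};

	static void foo_init(struct foo *f)
	{
		init_waitqueue_head(&f->waitq);
		bus1_active_init(&f->active);		/* counter starts as NEW */
		bus1_active_activate(&f->active);	/* NEW -> active */
	}

	static void foo_do_work(struct foo *f)
	{
		/* behaves like down_read_trylock(): fails once deactivated */
		if (!bus1_active_acquire(&f->active))
			return;
		/* ... f cannot be torn down while the reference is held ... */
		bus1_active_release(&f->active, &f->waitq);	/* matching up_read() */
	}

	static void foo_destroy(struct foo *f)
	{
		bus1_active_deactivate(&f->active);	  /* refuse new acquisitions */
		bus1_active_drain(&f->active, &f->waitq); /* wait for pending releases */
		bus1_active_cleanup(&f->active, &f->waitq, NULL, NULL);
		bus1_active_deinit(&f->active);
	}

The same pattern shows up in bus1_peer_new()/bus1_peer_free(), discussed
further down in the thread.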

2016-10-26 19:40:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

So the thing that tends to worry me about these is resource management.

If I understood the documentation correctly, this has per-user
resource management, which guarantees that at least the system won't
run out of memory. Good. The act of sending a message transfers the
resource to the receiver end. Fine.

However, the usual problem ends up being that a bad user can basically
DoS a system agent, especially since for obvious performance reasons
the send/receive has to be asynchronous.

So the usual DoS model is that some user just sends a lot of messages
to a system agent, filling up the system agent resource quota, and
basically killing the system. No, it didn't run out of memory, but the
system agent may not be able to do anything more, since it is now out
of resources.

Keeping the resource management with the sender doesn't solve the
problem, it just reverses it: now the attack will be to send a lot of
queries to the system agent, but then just refuse to listen to the
replies - again causing the system agent to run out of resources.

Usually the way this is resolved is by forcing a
"request-and-reply" resource management model, where the person who
sends out a request is not only the one who is accounted for the
request, but also accounted for the reply buffer. That way the system
agent never runs out of resources, because it's always the requesting
party that has its resources accounted, never the system agent.

You may well have solved this, but can you describe what the solution
is without forcing people to read the code and try to analyze it?

Linus

2016-10-26 20:34:36

by David Herrmann

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

Hi

On Wed, Oct 26, 2016 at 9:39 PM, Linus Torvalds
<[email protected]> wrote:
> So the thing that tends to worry me about these is resource management.
>
> If I understood the documentation correctly, this has per-user
> resource management, which guarantees that at least the system won't
> run out of memory. Good. The act of sending a message transfers the
> resource to the receiver end. Fine.
>
> However, the usual problem ends up being that a bad user can basically
> DoS a system agent, especially since for obvious performance reasons
> the send/receive has to be asynchronous.
>
> So the usual DoS model is that some user just sends a lot of messages
> to a system agent, filling up the system agent resource quota, and
> basically killing the system. No, it didn't run out of memory, but the
> system agent may not be able to do anything more, since it is now out
> of resources.
>
> Keeping the resource management with the sender doesn't solve the
> problem, it just reverses it: now the attack will be to send a lot of
> queries to the system agent, but then just refuse to listen to the
> replies - again causing the system agent to run out of resources.
>
> Usually the way this is resolved is by forcing a
> "request-and-reply" resource management model, where the person who
> sends out a request is not only the one who is accounted for the
> request, but also accounted for the reply buffer. That way the system
> agent never runs out of resources, because it's always the requesting
> party that has its resources accounted, never the system agent.
>
> You may well have solved this, but can you describe what the solution
> is without forcing people to read the code and try to analyze it?

All accounting on bus1 is done on a UID-basis. This is the initial
model that tries to match POSIX semantics. More advanced accounting is
left as a future extension (like cgroup-based, etc.). So whenever we
talk about "user accounting", we talk about the user-abstraction in
bus1 that right now is based on UIDs, but could be extended for other
schemes.

All bus1 resources are owned by a peer, and each peer has a user
assigned (which right now corresponds to file->f_cred->uid). Whenever
a peer allocates resources, they are accounted on its user. There are
global limits per user which cannot be exceeded. Additionally, each
peer can set its own per-peer limits, to further partition the
per-user limits. Of course, per-user limits override per-peer limits,
if necessary.
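
A minimal sketch of that two-level check (names invented, locking and the
actual accounting structures omitted; the error code is arbitrary):

	struct user_limits { unsigned int n_handles, max_handles; };
	struct peer_limits { unsigned int n_handles, max_handles; };

	/* caller is assumed to hold whatever lock protects both counters */
	static int charge_handles(struct peer_limits *peer,
				  struct user_limits *user, unsigned int n)
	{
		if (peer->n_handles + n > peer->max_handles)
			return -EDQUOT;	/* per-peer partition exhausted */
		if (user->n_handles + n > user->max_handles)
			return -EDQUOT;	/* global per-user limit exceeded */
		peer->n_handles += n;
		user->n_handles += n;
		return 0;
	}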

Now this is all trivial and obvious. It works like any resource
accounting in the kernel. It becomes tricky when we try to transfer
resources. Before SEND, a resource is always accounted on the sender.
After RECV, a resource is accounted on the receiver. That is, resource
ownership is transferred. In most cases this is obvious: memory is
copied from one address-space into another, or file-descriptors are
added into the file-table of another process, etc.

Lastly, when a resource is queued, we decided to go with
receiver-accounting. This means, at the time of SEND resource
ownership is transferred (unlike sender-accounting, which would
transfer it at time of RECV). The reasons are manifold, but mainly we
want RECV to not fail due to accounting, resource exhaustion, etc. We
wanted SEND to do the heavy-lifting, and RECV to just dequeue. By
avoiding sender-based accounting, we avoid attacks where a receiver
does not dequeue messages and thus exhausts the sender's limits. The
issue left is senders DoS'ing a target user. To mitigate this, we
implemented a quota system. Whenever a sender wants to transfer
resources to a receiver, it only gets access to a subset of the
receiver's resource limits. The inflight resources are accounted on a
uid<->uid basis, and the current algorithm allows a sender access to
at most half of the destination's limit that is not currently used by
anyone else.

Example:
Imagine a receiver with a limit of 1024 handles. A sender transmits a
message to that receiver. The sender gets access to half the limit not
used by anyone else, hence 512 handles. It does not matter how many
senders there are, nor how many messages are sent: as long as they all
belong to the same user, they share the quota and can queue at most
512 handles. If a second sending user comes into play, it gets half of
the remainder not used by anyone else, which ends up being 256. And so
on... If the peer dequeues messages in between, the numbers get higher
again. But if you do the math, the most you can get is 50% of the
target's resources, if you're the only sender. In all other cases you
get less (like intertwined transfers,
etc).
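
A sketch of that halving rule, reduced to bare handle counts (names
invented; the real algorithm, described in the wiki linked below, also
tracks the inflight uid<->uid state):

	/* limit:   the receiving user's handle limit
	 * others:  handles currently queued by all other sending users
	 * charged: handles this sending user already has in flight     */
	static unsigned int quota_left(unsigned int limit, unsigned int others,
				       unsigned int charged)
	{
		unsigned int share = (limit - others) / 2;

		return charged < share ? share - charged : 0;
	}

	/* limit 1024, queue empty:             quota_left(1024,   0,   0) == 512
	 * same user already queued 512:        quota_left(1024,   0, 512) == 0
	 * second user, 512 held by the first:  quota_left(1024, 512,   0) == 256 */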

We did look into sender-based inflight accounting, but the same set of
issues arises. Sure, a Request+Reply model would make this easier to
handle, but we want to explicitly support a Subscribe+Event{n} model.
In this case there is more than one Reply to a message.

Long story short: We have uid<->uid quotas so far, which prevent DoS
attacks, unless you get access to a ridiculous number of local UIDs.
Details on which resources are accounted can be found in the wiki [1].

Thanks
David

[1] https://github.com/bus1/documentation/wiki/Quota

2016-10-26 23:20:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Oct 26, 2016 12:21 PM, "David Herrmann" <[email protected]> wrote:
>
> From: Tom Gundersen <[email protected]>
>
> Add the CONFIG_BUS1 option to enable the bus1 kernel messaging bus. If
> enabled, provide the bus1.ko module with a stub cdev /dev/bus1. So far
> it does not expose any API, but the full intended uapi is provided in
> include/uapi/linux/bus1.h already.
>

This may have been covered elsewhere, but could this use syscalls instead?

2016-10-26 23:54:28

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Thu, Oct 27, 2016 at 1:19 AM, Andy Lutomirski <[email protected]> wrote:
> This may have been covered elsewhere, but could this use syscalls instead?

Yes, syscalls would work essentially the same. For now, we are using a
cdev as it makes it a lot more convenient to develop and test as an
out-of-tree module, but that could be changed easily before the final
submission, if that's what we want.

Cheers,

Tom

2016-10-27 09:17:02

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Thursday, October 27, 2016 1:54:05 AM CEST Tom Gundersen wrote:
> On Thu, Oct 27, 2016 at 1:19 AM, Andy Lutomirski <[email protected]> wrote:
> > This may have been covered elsewhere, but could this use syscalls instead?
>
> Yes, syscalls would work essentially the same. For now, we are using a
> cdev as it makes it a lot more convenient to develop and test as an
> out-of-tree module, but that could be changed easily before the final
> submission, if that's what we want.


Generally speaking, I think syscalls would be appropriate here, and put
bus1 into a similar category as the other ipc interfaces (shm, msg, sem,
mqueue, ...).

However, syscall API design is nontrivial, and will require a bit of
work to come to a set of syscalls that is fairly compact but also
extensible enough. I think it makes sense to go through the exercise
of working out what the syscall interface would end up looking like,
and then make a decision.

There is currently a set of file operations:

@@ -48,7 +90,11 @@ const struct file_operations bus1_fops = {
.owner = THIS_MODULE,
.open = bus1_fop_open,
.release = bus1_fop_release,
+ .poll = bus1_fop_poll,
.llseek = noop_llseek,
+ .mmap = bus1_fop_mmap,
+ .unlocked_ioctl = bus1_peer_ioctl,
+ .compat_ioctl = bus1_peer_ioctl,
.show_fdinfo = bus1_fop_show_fdinfo,
};

and then another set of ioctls:

+enum {
+ BUS1_CMD_PEER_DISCONNECT = _IOWR(BUS1_IOCTL_MAGIC, 0x00,
+ __u64),
+ BUS1_CMD_PEER_QUERY = _IOWR(BUS1_IOCTL_MAGIC, 0x01,
+ struct bus1_cmd_peer_reset),
+ BUS1_CMD_PEER_RESET = _IOWR(BUS1_IOCTL_MAGIC, 0x02,
+ struct bus1_cmd_peer_reset),
+ BUS1_CMD_HANDLE_RELEASE = _IOWR(BUS1_IOCTL_MAGIC, 0x10,
+ __u64),
+ BUS1_CMD_HANDLE_TRANSFER = _IOWR(BUS1_IOCTL_MAGIC, 0x11,
+ struct bus1_cmd_handle_transfer),
+ BUS1_CMD_NODES_DESTROY = _IOWR(BUS1_IOCTL_MAGIC, 0x20,
+ struct bus1_cmd_nodes_destroy),
+ BUS1_CMD_SLICE_RELEASE = _IOWR(BUS1_IOCTL_MAGIC, 0x30,
+ __u64),
+ BUS1_CMD_SEND = _IOWR(BUS1_IOCTL_MAGIC, 0x40,
+ struct bus1_cmd_send),
+ BUS1_CMD_RECV = _IOWR(BUS1_IOCTL_MAGIC, 0x50,
+ struct bus1_cmd_recv),
+};

I think there is no alternative to having some sort of file descriptor
with the basic operations you have above, but there is a question of
how to get that file descriptor if the ioctls get changed to a syscall,
the basic options being:

- Keep using a chardev. This works, but feels a little odd to me,
and I can't think of any other interfaces combining syscalls with
a chardev.

- Have one syscall that returns an open file descriptor, replacing
the fops->open() function. One advantage is that you can pass
additional arguments in that you can't have with open.
An example for this would be mqueue_open().

- Have a mountable file system, and use open() on that to create
connections. Advantages are that it's fairly easy to have one
instance per fs-namespace, and you can have user-defined naming
of objects in the file system.

For the other operations, the obvious translation would be to
turn each ioctl command into one syscall, but that may not always
be the best representation. One limitation is that you cannot
generally have more than six 'long' arguments on a lot of
architectures, and passing 'u64' arguments to syscalls is awkward.

For some of the commands, the transformation would be straightforward
if we assume that the 'u64' arguments can actually be 'long',
I guess like this:

+struct bus1_cmd_handle_transfer {
+ __u64 flags;
+ __u64 src_handle;
+ __u64 dst_fd;
+ __u64 dst_handle;
+} __attribute__((__aligned__(8)));

long bus1_handle_transfer(int fd, unsigned long handle,
int dst_fd, unsigned long *dst_handle, unsigned int flags);

+struct bus1_cmd_nodes_destroy {
+ __u64 flags;
+ __u64 ptr_nodes;
+ __u64 n_nodes;
+} __attribute__((__aligned__(8)));

long bus1_nodes_destroy(int fd, u64 *ptr_nodes,
long n_nodes, unsigned int flags);

However, the peer_reset would exceed the 6-argument limit when you count
the initial file descriptor even if you assume that 'flags' can be
made 32-bit:

+struct bus1_cmd_peer_reset {
+ __u64 flags;
+ __u64 peer_flags;
+ __u32 max_slices;
+ __u32 max_handles;
+ __u32 max_inflight_bytes;
+ __u32 max_inflight_fds;
+} __attribute__((__aligned__(8)));

maybe something slightly ugly like

long bus1_peer_reset(int fd, const struct bus1_peer_limits *param,
unsigned int flags);

a library might provide a wrapper that passes all the limits
as separate arguments.

The receive function would be fairly straightforward again, as
we just pass a pointer to the returned message, and all inputs
can be arguments, but the send command with this structure

+struct bus1_cmd_send {
+ __u64 flags;
+ __u64 ptr_destinations;
+ __u64 ptr_errors;
+ __u64 n_destinations;
+ __u64 ptr_vecs;
+ __u64 n_vecs;
+ __u64 ptr_handles;
+ __u64 n_handles;
+ __u64 ptr_fds;
+ __u64 n_fds;
+} __attribute__((__aligned__(8)));

is really tricky, as it's such a central interface but it's
also really complex, with its five indirect pointers to
variable-length arrays, making a total of 11 arguments
(including the first fd). Turning this into a syscall would
probably make a more efficient interface, so maybe some
of the arrays can be turned into a single argument and
require the user to call it multiple times instead of the
kernel looping around it.

The minimal version would be something like

long bus1_send(int fd, long dst, struct iovec *vecs, int n_vecs,
long handle, int dst_fd);

so you already get to six arguments with one destination, one
handle and one fd but no flags. Replacing vecs/n_vecs with pointer
and length doesn't help either, so I guess whatever we do here
we have to use some indirect structure.

Arnd

2016-10-27 13:49:10

by David Herrmann

[permalink] [raw]
Subject: Re: [RFC v1 04/14] bus1: util - fixed list utility library

Hi

On Thu, Oct 27, 2016 at 2:56 PM, Arnd Bergmann <[email protected]> wrote:
> On Thursday, October 27, 2016 2:48:46 PM CEST David Herrmann wrote:
>> On Thu, Oct 27, 2016 at 2:37 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Oct 26, 2016 at 09:18:00PM +0200, David Herrmann wrote:
>> >> + e = kmalloc_array(sizeof(*e), BUS1_FLIST_BATCH + 1, gfp);
>> >
>> >> +#define BUS1_FLIST_BATCH (1024)
>> >
>> >> +struct bus1_flist {
>> >> + union {
>> >> + struct bus1_flist *next;
>> >> + void *ptr;
>> >> + };
>> >> +};
>> >
>> > So that's an allocation of 8*(1024+1), or slightly more than 2 pages.
>> >
>> > kmalloc will round up to the next power of two, so you'll end up with an
>> > allocation of 16*1024, wasting a whopping 8184 bytes per such allocation
>> > in slack space.
>> >
>> > Please consider using 1023 or something for your batch size, 511 would
>> > get you to exactly 1 page which would be even better.
>>
>> Thanks for the hint! 511 looks like the obvious choice. Maybe even
>> (PAGE_SIZE / sizeof(long) - 1). I will put a comment next to the
>> definition.
>>
>>
>
> PAGE_SIZE can be up to 64KB though, so that might lead to wasting a lot
> of memory.

The bus1-flist implementation never over-allocates. It is a fixed size
list, so it only allocates as much memory as needed. The issue PeterZ
pointed out is passing suitable sizes to kmalloc(), which internally
over-allocates to power-of-2 bounds (or some similar bounds). So we
only ever waste space here if kmalloc() internally rounds up. The code
in bus1-flist allocates exactly the needed space.

Thanks
David

2016-10-27 13:50:12

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC v1 04/14] bus1: util - fixed list utility library

On Thursday, October 27, 2016 2:48:46 PM CEST David Herrmann wrote:
> On Thu, Oct 27, 2016 at 2:37 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Oct 26, 2016 at 09:18:00PM +0200, David Herrmann wrote:
> >> + e = kmalloc_array(sizeof(*e), BUS1_FLIST_BATCH + 1, gfp);
> >
> >> +#define BUS1_FLIST_BATCH (1024)
> >
> >> +struct bus1_flist {
> >> + union {
> >> + struct bus1_flist *next;
> >> + void *ptr;
> >> + };
> >> +};
> >
> > So that's an allocation of 8*(1024+1), or slightly more than 2 pages.
> >
> > kmalloc will round up to the next power of two, so you'll end up with an
> > allocation of 16*1024, wasting a whopping 8184 bytes per such allocation
> > in slack space.
> >
> > Please consider using 1023 or something for your batch size, 511 would
> > get you to exactly 1 page which would be even better.
>
> Thanks for the hint! 511 looks like the obvious choice. Maybe even
> (PAGE_SIZE / sizeof(long) - 1). I will put a comment next to the
> definition.
>
>

PAGE_SIZE can be up to 64KB though, so that might lead to wasting a lot
of memory.

Arnd

2016-10-27 13:51:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 05/14] bus1: util - pool utility library

On Wed, Oct 26, 2016 at 09:18:01PM +0200, David Herrmann wrote:
> +static struct bus1_pool_slice *
> +bus1_pool_slice_free(struct bus1_pool_slice *slice)
> +{
> + if (!slice)
> + return NULL;
> +
> + kfree(slice);
> +
> + return NULL;
> +}

The return value is never used. Which reduces the entire thing to:

kfree(slice);

since kfree() already accepts a NULL.

> + bus1_pool_slice_free(slice);
> + bus1_pool_slice_free(slice);
> + bus1_pool_slice_free(ps);

2016-10-27 13:51:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 05/14] bus1: util - pool utility library

On Wed, Oct 26, 2016 at 09:18:01PM +0200, David Herrmann wrote:
> +/* insert slice into the free tree */
> +static void bus1_pool_slice_link_free(struct bus1_pool_slice *slice,
> + struct bus1_pool *pool)
> +{
> + struct rb_node **n, *prev = NULL;
> + struct bus1_pool_slice *ps;
> +
> + n = &pool->slices_free.rb_node;
> + while (*n) {
> + prev = *n;
> + ps = container_of(prev, struct bus1_pool_slice, rb);
> + if (slice->size < ps->size)
> + n = &prev->rb_left;
> + else
> + n = &prev->rb_right;
> + }
> +
> + rb_link_node(&slice->rb, prev, n);
> + rb_insert_color(&slice->rb, &pool->slices_free);
> +}

If you only sort free slices by size, how do you merge contiguous free
slices?

> +/* find free slice big enough to hold @size bytes */
> +static struct bus1_pool_slice *
> +bus1_pool_slice_find_by_size(struct bus1_pool *pool, size_t size)
> +{
> + struct bus1_pool_slice *ps, *closest = NULL;
> + struct rb_node *n;
> +
> + n = pool->slices_free.rb_node;
> + while (n) {
> + ps = container_of(n, struct bus1_pool_slice, rb);
> + if (size < ps->size) {
> + closest = ps;
> + n = n->rb_left;
> + } else if (size > ps->size) {
> + n = n->rb_right;
> + } else /* if (size == ps->size) */ {
> + return ps;
> + }
> + }
> +
> + return closest;
> +}
> +
> +/* find used slice with given offset */
> +static struct bus1_pool_slice *
> +bus1_pool_slice_find_by_offset(struct bus1_pool *pool, size_t offset)
> +{
> + struct bus1_pool_slice *ps;
> + struct rb_node *n;
> +
> + n = pool->slices_busy.rb_node;
> + while (n) {
> + ps = container_of(n, struct bus1_pool_slice, rb);
> + if (offset < ps->offset)
> + n = n->rb_left;
> + else if (offset > ps->offset)
> + n = n->rb_right;
> + else /* if (offset == ps->offset) */
> + return ps;
> + }
> +
> + return NULL;
> +}

I find these two function names misleading. They don't find_by_size or
find_by_offset. They find_free_by_size and find_busy_by_offset. You
could reduce that to find_free and find_busy and have the 'size' and
'offset' in the argument name.

2016-10-27 14:20:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 04/14] bus1: util - fixed list utility library

On Wed, Oct 26, 2016 at 09:18:00PM +0200, David Herrmann wrote:
> + e = kmalloc_array(sizeof(*e), BUS1_FLIST_BATCH + 1, gfp);

> +#define BUS1_FLIST_BATCH (1024)

> +struct bus1_flist {
> + union {
> + struct bus1_flist *next;
> + void *ptr;
> + };
> +};

So that's an allocation of 8*(1024+1), or slightly more than 2 pages.

kmalloc will round up to the next power of two, so you'll end up with an
allocation of 16*1024, wasting a whopping 8184 bytes per such allocation
in slack space.

Please consider using 1023 or something for your batch size, 511 would
get you to exactly 1 page which would be even better.
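
Spelled out, assuming 8-byte pointers and 4 KiB pages:

	sizeof(struct bus1_flist) = 8 bytes (one pointer)

	8 * (1024 + 1) = 8200 bytes -> kmalloc bucket 16384 -> 8184 bytes slack
	8 * (1023 + 1) = 8192 bytes -> exactly two pages, no slack
	8 * ( 511 + 1) = 4096 bytes -> exactly one page, no slack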

Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

[CC += linux-api@vger.kernel.org]

Hi David,

Could you please CC linux-api@ on all future iterations of this patch!

Cheers,

Michael



On Wed, Oct 26, 2016 at 9:17 PM, David Herrmann <[email protected]> wrote:
> [...]



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

2016-10-27 13:48:11

by David Herrmann

[permalink] [raw]
Subject: Re: [RFC v1 04/14] bus1: util - fixed list utility library

Hi

On Thu, Oct 27, 2016 at 2:37 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 26, 2016 at 09:18:00PM +0200, David Herrmann wrote:
>> + e = kmalloc_array(sizeof(*e), BUS1_FLIST_BATCH + 1, gfp);
>
>> +#define BUS1_FLIST_BATCH (1024)
>
>> +struct bus1_flist {
>> + union {
>> + struct bus1_flist *next;
>> + void *ptr;
>> + };
>> +};
>
> So that's an allocation of 8*(1024+1), or slightly more than 2 pages.
>
> kmalloc will round up to the next power of two, so you'll end up with an
> allocation of 16*1024, wasting a whopping 8184 bytes per such allocation
> in slack space.
>
> Please consider using 1023 or something for your batch size, 511 would
> get you to exactly 1 page which would be even better.

Thanks for the hint! 511 looks like the obvious choice. Maybe even
(PAGE_SIZE / sizeof(long) - 1). I will put a comment next to the
definition.

Thanks
David

2016-10-27 15:00:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 05/14] bus1: util - pool utility library

On Thu, Oct 27, 2016 at 02:59:07PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 26, 2016 at 09:18:01PM +0200, David Herrmann wrote:
> > +/* insert slice into the free tree */
> > +static void bus1_pool_slice_link_free(struct bus1_pool_slice *slice,
> > + struct bus1_pool *pool)
> > +{
> > + struct rb_node **n, *prev = NULL;
> > + struct bus1_pool_slice *ps;
> > +
> > + n = &pool->slices_free.rb_node;
> > + while (*n) {
> > + prev = *n;
> > + ps = container_of(prev, struct bus1_pool_slice, rb);
> > + if (slice->size < ps->size)
> > + n = &prev->rb_left;
> > + else
> > + n = &prev->rb_right;
> > + }
> > +
> > + rb_link_node(&slice->rb, prev, n);
> > + rb_insert_color(&slice->rb, &pool->slices_free);
> > +}
>
> If you only sort free slices by size, how do you merge contiguous free
> slices?

Ah, I see, you also keep an ordered list of slices and use that one
function up from here.

2016-10-27 15:14:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 05/14] bus1: util - pool utility library

On Wed, Oct 26, 2016 at 09:18:01PM +0200, David Herrmann wrote:

All small nits..

> +void bus1_pool_deinit(struct bus1_pool *pool)
> +{
> + struct bus1_pool_slice *slice;
> +
> + if (!pool || !pool->f)
> + return;
> +
> + while ((slice = list_first_entry_or_null(&pool->slices,
> + struct bus1_pool_slice,
> + entry))) {
> + WARN_ON(slice->ref_kernel);
> + list_del(&slice->entry);
> + bus1_pool_slice_free(slice);
> + }

I prefer to write that loop like:

while (!list_empty(&pool->slices)) {
slice = list_first_entry(&pool->slices, struct bus1_pool_slice, entry);
list_del(&slice->entry);

// ...
}



> +static void bus1_pool_free(struct bus1_pool *pool,
> + struct bus1_pool_slice *slice)
> +{
> + struct bus1_pool_slice *ps;
> +
> + /* don't free the slice if either has a reference */
> + if (slice->ref_kernel || slice->ref_user || WARN_ON(slice->free))
> + return;
> +
> + /*
> + * To release a pool-slice, we first drop it from the busy-tree, then
> + * merge it with possible previous/following free slices and re-add it
> + * to the free-tree.
> + */
> +
> + rb_erase(&slice->rb, &pool->slices_busy);
> +
> + if (!WARN_ON(slice->size > pool->allocated_size))
> + pool->allocated_size -= slice->size;
> +
> + if (pool->slices.next != &slice->entry) {
> + ps = container_of(slice->entry.prev, struct bus1_pool_slice,
> + entry);

ps = list_prev_entry(slice, entry);

> + if (ps->free) {
> + rb_erase(&ps->rb, &pool->slices_free);
> + list_del(&slice->entry);
> + ps->size += slice->size;
> + bus1_pool_slice_free(slice);
> + slice = ps; /* switch to previous slice */
> + }
> + }
> +
> + if (pool->slices.prev != &slice->entry) {
> + ps = container_of(slice->entry.next, struct bus1_pool_slice,
> + entry);

ps = list_next_entry(slice, entry);

> + if (ps->free) {
> + rb_erase(&ps->rb, &pool->slices_free);
> + list_del(&ps->entry);
> + slice->size += ps->size;
> + bus1_pool_slice_free(ps);
> + }
> + }
> +
> + slice->free = true;
> + bus1_pool_slice_link_free(slice, pool);
> +}

2016-10-27 15:25:27

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Thu, Oct 27, 2016 at 11:11 AM, Arnd Bergmann <[email protected]> wrote:
> On Thursday, October 27, 2016 1:54:05 AM CEST Tom Gundersen wrote:
>> On Thu, Oct 27, 2016 at 1:19 AM, Andy Lutomirski <[email protected]> wrote:
>> > This may have been covered elsewhere, but could this use syscalls instead?
>>
>> Yes, syscalls would work essentially the same. For now, we are using a
>> cdev as it makes it a lot more convenient to develop and test as an
>> out-of-tree module, but that could be changed easily before the final
>> submission, if that's what we want.
>
>
> Generally speaking, I think syscalls would be appropriate here, and put
> bus1 into a similar category as the other ipc interfaces (shm, msg, sem,
> mqueue, ...).

Could you elaborate on why you think syscalls would be more
appropriate than ioctls?

> However, syscall API design is nontrivial, and will require a bit of
> work to come to a set of syscalls that is fairly compact but also
> extensible enough. I think it makes sense to go through the exercise
> of working out what the syscall interface would end up looking like,
> and then make a decision.
>
> There is currently a set of file operations:
>
> @@ -48,7 +90,11 @@ const struct file_operations bus1_fops = {
> .owner = THIS_MODULE,
> .open = bus1_fop_open,
> .release = bus1_fop_release,
> + .poll = bus1_fop_poll,
> .llseek = noop_llseek,
> + .mmap = bus1_fop_mmap,
> + .unlocked_ioctl = bus1_peer_ioctl,
> + .compat_ioctl = bus1_peer_ioctl,
> .show_fdinfo = bus1_fop_show_fdinfo,
> };
>
> and then another set of ioctls:
>
> +enum {
> + BUS1_CMD_PEER_DISCONNECT = _IOWR(BUS1_IOCTL_MAGIC, 0x00,
> + __u64),
> + BUS1_CMD_PEER_QUERY = _IOWR(BUS1_IOCTL_MAGIC, 0x01,
> + struct bus1_cmd_peer_reset),
> + BUS1_CMD_PEER_RESET = _IOWR(BUS1_IOCTL_MAGIC, 0x02,
> + struct bus1_cmd_peer_reset),
> + BUS1_CMD_HANDLE_RELEASE = _IOWR(BUS1_IOCTL_MAGIC, 0x10,
> + __u64),
> + BUS1_CMD_HANDLE_TRANSFER = _IOWR(BUS1_IOCTL_MAGIC, 0x11,
> + struct bus1_cmd_handle_transfer),
> + BUS1_CMD_NODES_DESTROY = _IOWR(BUS1_IOCTL_MAGIC, 0x20,
> + struct bus1_cmd_nodes_destroy),
> + BUS1_CMD_SLICE_RELEASE = _IOWR(BUS1_IOCTL_MAGIC, 0x30,
> + __u64),
> + BUS1_CMD_SEND = _IOWR(BUS1_IOCTL_MAGIC, 0x40,
> + struct bus1_cmd_send),
> + BUS1_CMD_RECV = _IOWR(BUS1_IOCTL_MAGIC, 0x50,
> + struct bus1_cmd_recv),
> +};
>
> I think there is no alternative to having some sort of file descriptor
> with the basic operations you have above, but there is a question of
> how to get that file descriptor if the ioctls get changed to a syscall,
> the basic options being:

I could see the point of wanting a syscall to get the fd (your second
option below), but as I said, not sure I see why we would want to use
syscalls instead of ioctls.

> - Keep using a chardev. This works, but feels a little odd to me,
> and I can't think of any other interfaces combining syscalls with
> a chardev.
>
> - Have one syscall that returns an open file descriptor, replacing
> the fops->open() function. One advantage is that you can pass
> additional arguments in that you can't have with open.
> An example for this would be mqueue_open().

If we are going to change it, this might make sense to me. It would
allow you to get the fd without having to have access to some
character device.

> - Have a mountable file system, and use open() on that to create
> connections. Advantages are that it's fairly easy to have one
> instance per fs-namespace, and you can have user-defined naming
> of objects in the file system.

Note that currently we only have one object (/dev/bus1) and each fd is
disconnected from anything else on creation, so I'm not sure what benefits
a filesystem (or several instances of it) would give?

> For the other operations, the obvious translation would be to
> turn each ioctl command into one syscall, but that may not always
> be the best representation. One limitation is that you cannot
> generally have more than six 'long' arguments on a lot of
> architectures, and passing 'u64' arguments to syscalls is awkward.
>
> For some of the commands, the transformation would be straightforward
> if we assume that the 'u64' arguments can actually be 'long',
> I guess like this:
>
> +struct bus1_cmd_handle_transfer {
> + __u64 flags;
> + __u64 src_handle;
> + __u64 dst_fd;
> + __u64 dst_handle;
> +} __attribute__((__aligned__(8)));
>
> long bus1_handle_transfer(int fd, unsigned long handle,
> int dst_fd, unsigned long *dst_handle, unsigned int flags);
>
> +struct bus1_cmd_nodes_destroy {
> + __u64 flags;
> + __u64 ptr_nodes;
> + __u64 n_nodes;
> +} __attribute__((__aligned__(8)));
>
> long bus1_nodes_destroy(int fd, u64 *ptr_nodes,
> long n_nodes, unsigned int flags);
>
> However, the peer_reset would exceed the 6-argument limit when you count
> the initial file descriptor even if you assume that 'flags' can be
> made 32-bit:
>
> +struct bus1_cmd_peer_reset {
> + __u64 flags;
> + __u64 peer_flags;
> + __u32 max_slices;
> + __u32 max_handles;
> + __u32 max_inflight_bytes;
> + __u32 max_inflight_fds;
> +} __attribute__((__aligned__(8)));
>
> maybe something slightly ugly like
>
> long bus1_peer_reset(int fd, const struct bus1_peer_limits *param,
> unsigned int flags);
>
> a library might provide a wrapper that passes all the limits
> as separate arguments.
>
> The receive function would be fairly straightforward again, as
> we just pass a pointer to the returned message, and all inputs
> can be arguments, but the send command with this structure
>
> +struct bus1_cmd_send {
> + __u64 flags;
> + __u64 ptr_destinations;
> + __u64 ptr_errors;
> + __u64 n_destinations;
> + __u64 ptr_vecs;
> + __u64 n_vecs;
> + __u64 ptr_handles;
> + __u64 n_handles;
> + __u64 ptr_fds;
> + __u64 n_fds;
> +} __attribute__((__aligned__(8)));
>
> is really tricky, as it's such a central interface but it's
> also really complex, with its five indirect pointers to
> variable-length arrays, making a total of 11 arguments
> (including the first fd). Turning this into a syscall would
> probably make a more efficient interface, so maybe some
> of the arrays can be turned into a single argument and
> require the user to call it multiple times instead of the
> kernel looping around it.
>
> The minimal version would be something like
>
> long bus1_send(int fd, long dst, struct iovec *vecs, int n_vecs,
> long handle, int dst_fd);
>
> so you already get to six arguments with one destination, one
> handle and one fd but no flags. Replacing vecs/n_vecs with pointer
> and length doesn't help either, so I guess whatever we do here
> we have to use some indirect structure.
>
> Arnd

2016-10-27 15:28:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
> Messages can be destined for multiple queues, hence, we need to be
> careful that all queues get a consistent order of incoming messages. We
> define the concept of `global order' to provide a basic set of
> guarantees. This global order is a partial order on the set of all
> messages. The order is defined as:

Ah, ok. So it _is_ a partial order only. I got confused by earlier
reports, and the term 'global order' in general, to think you did a
total order. And then I wondered wth you needed total ordering for.

Maybe best to scrub 'global order' from the entire text. It's not
helpful.

2016-10-27 16:37:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Thu, Oct 27, 2016 at 8:25 AM, Tom Gundersen <[email protected]> wrote:
>
> Could you elaborate on why you think syscalls would be more
> appropriate than ioctls?

ioctls tend to be a horrid mess both for things like compat, but also
for things like system call tracing and filtering (i.e. BPF).

The compat mess is fixable by making sure you always use 64-bit fields
rather than pointers everywhere and everything is aligned. The
tracing and filtering one not so much.

Linus

2016-10-27 16:40:15

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Thu, Oct 27, 2016 at 6:37 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Oct 27, 2016 at 8:25 AM, Tom Gundersen <[email protected]> wrote:
>>
>> Could you elaborate on why you think syscalls would be more
>> appropriate than ioctls?
>
> ioctls tend to be a horrid mess both for things like compat, but also
> for things like system call tracing and filtering (i.e. BPF).
>
> The compat mess is fixable by making sure you always use 64-bit fields
> rather than pointers everywhere and everything is aligned.

This we do.

> The
> tracing and filtering one not so much.

Got it. Thanks.

Tom

2016-10-27 16:43:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:

> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
> the time they were sent. Moreover, atomic delivery of messages to
> multiple queues are supported, without any global synchronization, i.e.,
> the order of message delivery is consistent across queues.
>
> Messages can be destined for multiple queues, hence, we need to be
> careful that all queues get a consistent order of incoming messages.

So I read that to mean that if A and B both send a multi-cast message to
C and D, the messages will appear in the same order for both C and D.

Why is this important? It seems that this multi-cast ordering generates
much of the complexity of this patch while this Changelog fails to
explain why this is a desired property.


> We
> define the concept of `global order' to provide a basic set of
> guarantees. This global order is a partial order on the set of all
> messages. The order is defined as:
>
> 1) If a message B was queued *after* a message A, then: A < B
>
> 2) If a message B was queued *after* a message A was dequeued,
> then: A < B
>
> 3) If a message B was dequeued *after* a message A on the same queue,
> then: A < B
>
> (Note: Causality is honored. `after' and `before' do not refer to
> the same task, nor the same queue, but rather any kind of
> synchronization between the two operations.)
>
> The queue object implements this global order in a lockless fashion. It
> solely relies on a distributed clock on each queue. Each message to be
> sent causes a clock tick on the local clock and on all destination
> clocks. Furthermore, all clocks are synchronized, meaning they're
> fast-forwarded in case they're behind the highest of all participating
> peers. No global state tracking is involved.

Yet the code does compares on more than just timestamps. Why are these
secondary (and even tertiary) ordering required?

2016-10-28 11:33:49

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Thu, Oct 27, 2016 at 6:43 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
>
>> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
>> the time they were sent. Moreover, atomic delivery of messages to
>> multiple queues are supported, without any global synchronization, i.e.,
>> the order of message delivery is consistent across queues.
>>
>> Messages can be destined for multiple queues, hence, we need to be
>> careful that all queues get a consistent order of incoming messages.
>
> So I read that to mean that if A and B both send a multi-cast message to
> C and D, the messages will appear in the same order for both C and D.

That is one of the ordering guarantees, yes.

> Why is this important? It seems that this multi-cast ordering generates
> much of the complexity of this patch while this Changelog fails to
> explain why this is a desired property.

I don't think this is the case. The most important guarantee we give
is causal ordering. To make this work with multicast, we must stage
messages first, then commit on a second round. That is, we must find
some way to iterate over all clocks before committing, while at the
same time preventing any races. The multicast stability you just
described we get for free by introducing the second-level ordering via the
sender-address.
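
A rough sketch of that two-round scheme (illustrative only, not the actual
bus1 queue code; locking and the staged-entry bookkeeping are omitted).
Each destination carries a Lamport clock; the sender first stages an entry
on every destination while ticking its clock, then commits all entries with
the maximum timestamp it saw, fast-forwarding any clock that lags behind:

	struct clockq {
		u64 clock;		/* per-queue Lamport clock */
		/* staged and committed entries omitted */
	};

	static u64 stage(struct clockq *q)
	{
		q->clock += 1;		/* tick on behalf of the sender */
		return q->clock;	/* provisional timestamp of the staged entry */
	}

	static void commit(struct clockq *q, u64 ts)
	{
		if (q->clock < ts)
			q->clock = ts;	/* synchronize: fast-forward lagging clocks */
		/* mark the staged entry as committed with timestamp ts */
	}

	static void multicast(struct clockq **dst, unsigned int n)
	{
		u64 ts = 0;
		unsigned int i;

		for (i = 0; i < n; i++)		/* round 1: stage everywhere */
			ts = max(ts, stage(dst[i]));
		for (i = 0; i < n; i++)		/* round 2: commit the final timestamp */
			commit(dst[i], ts);
	}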

Stability in multicasts without causal order is not necessarily a crucial
feature. However, note that if this ordering is given, it allows reducing
the number of round-trips in dependent systems. Imagine a daemon
reacting to a set of events from different sources. If the actions of that
daemon are solely defined by incoming events, someone else can
deduce the actions the daemon took without requiring the daemon to
send out events by itself. That is, you can just watch the events on the
system, and validly deduce the state of such a daemon.

Example: There is a configuration daemon that sends events when
configuration is changed. And there is a hotplug daemon that sends
events when devices are hotplugged. You get an event that the "default
mute-state" for audio devices was changed, and after it an event that an
audio device was hotplugged. You can now rely on the audio daemon to get
the events in the same order, and hence apply the new "default
mute-state" to the new device. No need to query the audio daemon
whether the new device is muted.

But as I said, the causal ordering is what we really want.
Multicast-stability is just a nice side-effect.

It might also be worth mentioning: both Android Binder and Chromium
Mojo make sure they provide causal ordering, since they run into real
issues. Binder allows placing multiple messages under the same
binder-lock, and Mojo provides Associated Interfaces [1]. DBus makes
sure to provide those ordering guarantees as well.

>> We
>> define the concept of `global order' to provide a basic set of
>> guarantees. This global order is a partial order on the set of all
>> messages. The order is defined as:
>>
>> 1) If a message B was queued *after* a message A, then: A < B
>>
>> 2) If a message B was queued *after* a message A was dequeued,
>> then: A < B
>>
>> 3) If a message B was dequeued *after* a message A on the same queue,
>> then: A < B
>>
>> (Note: Causality is honored. `after' and `before' do not refer to
>> the same task, nor the same queue, but rather any kind of
>> synchronization between the two operations.)
>>
>> The queue object implements this global order in a lockless fashion. It
>> solely relies on a distributed clock on each queue. Each message to be
>> sent causes a clock tick on the local clock and on all destination
>> clocks. Furthermore, all clocks are synchronized, meaning they're
>> fast-forwarded in case they're behind the highest of all participating
>> peers. No global state tracking is involved.
>
> Yet the code does compares on more than just timestamps. Why are these
> secondary (and even tertiary) ordering required?

Lamport Timestamps are guaranteed to be unique per-sender, but a receiving
queue can still contain messages with the same timestamp (from different
senders). That is, if two multicasts overlap, they might end up with the same
timestamp, if, and only if, they can have no causal relationship
(i.e., the ioctls
are called concurrently). We want to extend this partial order, though. We
want to provide a stable order in those cases (as described above), so we
need a secondary order (we simply pick the memory address of the sender).
This guarantees that all receivers get the same order of all messages (even
if they have equal timestamps).

Note that equal timestamps only happen if entries block each other.
Hence, we can use the memory address as secondary order, since we know
it is unique in those cases (and cannot be re-used).
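
In code, the comparison implied here looks roughly like this (illustrative,
not the actual bus1 source). Both keys are identical on every receiving
queue, so all receivers sort equal-timestamp entries the same way:

	struct entry {
		u64 timestamp;	/* Lamport timestamp, unique per sender */
		void *sender;	/* address of the sending peer, used as tie-breaker */
	};

	static int entry_order(const struct entry *a, const struct entry *b)
	{
		if (a->timestamp != b->timestamp)
			return a->timestamp < b->timestamp ? -1 : 1;
		if (a->sender != b->sender)
			return a->sender < b->sender ? -1 : 1;
		return 0;	/* same sender and same timestamp: same message */
	}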

Cheers,

Tom

[1] https://docs.google.com/document/d/1ENDDzACX4hplfQ8cCHGo_rXd3IHTu5H4hEZ44Cu8KVs

2016-10-28 12:07:00

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC v1 08/14] bus1: implement peer management context

David, Tom,

On Wed, Oct 26, 2016 at 9:18 PM, David Herrmann <[email protected]> wrote:
> +struct bus1_peer *bus1_peer_new(void)
> +{
> + static atomic64_t peer_ids = ATOMIC64_INIT(0);
> + const struct cred *cred = current_cred();
> + struct bus1_peer *peer;
> + struct bus1_user *user;
> +
> + user = bus1_user_ref_by_uid(cred->uid);
> + if (IS_ERR(user))
> + return ERR_CAST(user);
> +
> + peer = kmalloc(sizeof(*peer), GFP_KERNEL);
> + if (!peer) {
> + bus1_user_unref(user);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* initialize constant fields */
> + peer->id = atomic64_inc_return(&peer_ids);

What is the purpose of this id? Do other components depend on it
and are they aware of possible overflows?
Since it is a 64-bit integer, overflowing it is hard but not impossible.

--
Thanks,
//richard

2016-10-28 13:05:11

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC v1 08/14] bus1: implement peer management context

On Wed, Oct 26, 2016 at 9:18 PM, David Herrmann <[email protected]> wrote:
> + /* initialize constant fields */
> + peer->id = atomic64_inc_return(&peer_ids);
> + peer->flags = 0;
> + peer->cred = get_cred(current_cred());
> + peer->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + peer->user = user;
> + peer->debugdir = NULL;
> + init_waitqueue_head(&peer->waitq);
> + bus1_active_init(&peer->active);
> +
> + /* initialize data section */
> + mutex_init(&peer->data.lock);
> +
> + /* initialize peer-private section */
> + mutex_init(&peer->local.lock);
> +
> + if (!IS_ERR_OR_NULL(bus1_debugdir)) {

How can bus1_debugdir contain an error code? AFAICT it is either a
valid dentry or NULL.

> + char idstr[22];
> +
> + snprintf(idstr, sizeof(idstr), "peer-%llx", peer->id);
> +
> + peer->debugdir = debugfs_create_dir(idstr, bus1_debugdir);
> + if (!peer->debugdir) {
> + pr_err("cannot create debugfs dir for peer %llx\n",
> + peer->id);
> + } else if (!IS_ERR_OR_NULL(peer->debugdir)) {
> + bus1_debugfs_create_atomic_x("active", S_IRUGO,
> + peer->debugdir,
> + &peer->active.count);
> + }
> + }
> +
> + bus1_active_activate(&peer->active);

This is a no-op since bus1_active_init() sets ->count to BUS1_ACTIVE_NEW.

> + return peer;
> +}
> +
> +static int bus1_peer_disconnect(struct bus1_peer *peer)
> +{
> + bus1_active_deactivate(&peer->active);
> + bus1_active_drain(&peer->active, &peer->waitq);
> +
> + if (!bus1_active_cleanup(&peer->active, &peer->waitq,
> + NULL, NULL))
> + return -ESHUTDOWN;
> +
> + return 0;
> +}
> +
> +/**
> + * bus1_peer_free() - destroy peer
> + * @peer: peer to destroy, or NULL
> + *
> + * Destroy a peer object that was previously allocated via bus1_peer_new().
> + * This synchronously waits for any outstanding operations on this peer to
> + * finish, then releases all linked resources and deallocates the peer in an
> + * rcu-delayed manner.
> + *
> + * If NULL is passed, this is a no-op.
> + *
> + * Return: NULL is returned.

What about making the function of type void?

> +struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)
> +{
> + if (!peer)
> + return NULL;
> +
> + /* disconnect from environment */
> + bus1_peer_disconnect(peer);
> +
> + /* deinitialize peer-private section */
> + mutex_destroy(&peer->local.lock);
> +
> + /* deinitialize data section */
> + mutex_destroy(&peer->data.lock);
> +
> + /* deinitialize constant fields */
> + debugfs_remove_recursive(peer->debugdir);
> + bus1_active_deinit(&peer->active);
> + peer->user = bus1_user_unref(peer->user);
> + put_pid_ns(peer->pid_ns);
> + put_cred(peer->cred);
> + kfree_rcu(peer, rcu);
> +
> + return NULL;
> +}

--
Thanks,
//richard

2016-10-28 13:11:04

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

On Wed, Oct 26, 2016 at 9:17 PM, David Herrmann <[email protected]> wrote:
> Hi
>
> This proposal introduces bus1.ko, a kernel messaging bus. This is not a request
> for inclusion, yet. It is rather an initial draft and a Request For Comments.
>
> While bus1 emerged out of the kdbus project, bus1 was started from scratch and
> the concepts have little in common. In a nutshell, bus1 provides a
> capability-based IPC system, similar in nature to Android Binder, Cap'n Proto,
> and seL4. The module is completely generic and does neither require nor mandate
> a user-space counter-part.

One thing which is not so clear to me is the role of bus1 wrt. containers.
Can a container A exchange messages with a container B?
If not, where is the boundary? I guess it is the pid namespace.

--
Thanks,
//richard

2016-10-28 13:19:23

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 08/14] bus1: implement peer management context

On Fri, Oct 28, 2016 at 2:06 PM, Richard Weinberger
<[email protected]> wrote:
> David, Tom,
>
> On Wed, Oct 26, 2016 at 9:18 PM, David Herrmann <[email protected]> wrote:
>> +struct bus1_peer *bus1_peer_new(void)
>> +{
>> + static atomic64_t peer_ids = ATOMIC64_INIT(0);
>> + const struct cred *cred = current_cred();
>> + struct bus1_peer *peer;
>> + struct bus1_user *user;
>> +
>> + user = bus1_user_ref_by_uid(cred->uid);
>> + if (IS_ERR(user))
>> + return ERR_CAST(user);
>> +
>> + peer = kmalloc(sizeof(*peer), GFP_KERNEL);
>> + if (!peer) {
>> + bus1_user_unref(user);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + /* initialize constant fields */
>> + peer->id = atomic64_inc_return(&peer_ids);
>
> What is the purpose of this id? Do other components depend on it
> and are they aware of possible overflows?

The id is used purely to give a name to the peer in debugfs.

> Since it is a 64-bit integer, overflowing it is hard but not impossible.

Hm, what scenario do you have in mind? I cannot see how this could
happen (short of creating peers in a loop for hundreds of years).

Cheers,

Tom

2016-10-28 13:21:53

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC v1 08/14] bus1: implement peer management context

On 28.10.2016 15:18, Tom Gundersen wrote:
> On Fri, Oct 28, 2016 at 2:06 PM, Richard Weinberger
> <[email protected]> wrote:
>> David, Tom,
>>
>> On Wed, Oct 26, 2016 at 9:18 PM, David Herrmann <[email protected]> wrote:
>>> +struct bus1_peer *bus1_peer_new(void)
>>> +{
>>> + static atomic64_t peer_ids = ATOMIC64_INIT(0);
>>> + const struct cred *cred = current_cred();
>>> + struct bus1_peer *peer;
>>> + struct bus1_user *user;
>>> +
>>> + user = bus1_user_ref_by_uid(cred->uid);
>>> + if (IS_ERR(user))
>>> + return ERR_CAST(user);
>>> +
>>> + peer = kmalloc(sizeof(*peer), GFP_KERNEL);
>>> + if (!peer) {
>>> + bus1_user_unref(user);
>>> + return ERR_PTR(-ENOMEM);
>>> + }
>>> +
>>> + /* initialize constant fields */
>>> + peer->id = atomic64_inc_return(&peer_ids);
>>
>> What is the purpose of this id? Do other components depend on it
>> and are they aware of possible overflows?
>
> The id is used purely to give a name to the peer in debugfs.

Okay.

>> Since it is a 64-bit integer, overflowing it is hard but not impossible.
>
> Hm, what scenario do you have in mind? I cannot see how this could
> happen (short of creating peers in a loop for hundreds of years).

When it is purely for naming, creating peers is slow enough that it is no
problem at all. That's why I was asking.

Thanks,
//richard

2016-10-28 13:23:25

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 08/14] bus1: implement peer management context

On Fri, Oct 28, 2016 at 3:05 PM, Richard Weinberger
<[email protected]> wrote:
> On Wed, Oct 26, 2016 at 9:18 PM, David Herrmann <[email protected]> wrote:
>> + /* initialize constant fields */
>> + peer->id = atomic64_inc_return(&peer_ids);
>> + peer->flags = 0;
>> + peer->cred = get_cred(current_cred());
>> + peer->pid_ns = get_pid_ns(task_active_pid_ns(current));
>> + peer->user = user;
>> + peer->debugdir = NULL;
>> + init_waitqueue_head(&peer->waitq);
>> + bus1_active_init(&peer->active);
>> +
>> + /* initialize data section */
>> + mutex_init(&peer->data.lock);
>> +
>> + /* initialize peer-private section */
>> + mutex_init(&peer->local.lock);
>> +
>> + if (!IS_ERR_OR_NULL(bus1_debugdir)) {
>
> How can bus1_debugdir contain an error code? AFAICT it is either a
> valid dentry or NULL.

If debugfs is not enabled, it will be ERR_PTR(-ENODEV).
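
For reference, this is roughly where that value comes from (a sketch; the
directory name and init function are just assumptions, not the actual bus1
code):

  #include <linux/debugfs.h>

  static struct dentry *bus1_debugdir;

  static int __init bus1_debugfs_init(void)
  {
          /* with CONFIG_DEBUG_FS=n the debugfs stubs return
           * ERR_PTR(-ENODEV), hence the IS_ERR_OR_NULL() check on
           * bus1_debugdir instead of a plain NULL check */
          bus1_debugdir = debugfs_create_dir("bus1", NULL);
          return 0;
  }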

>> + char idstr[22];
>> +
>> + snprintf(idstr, sizeof(idstr), "peer-%llx", peer->id);
>> +
>> + peer->debugdir = debugfs_create_dir(idstr, bus1_debugdir);
>> + if (!peer->debugdir) {
>> + pr_err("cannot create debugfs dir for peer %llx\n",
>> + peer->id);
>> + } else if (!IS_ERR_OR_NULL(peer->debugdir)) {
>> + bus1_debugfs_create_atomic_x("active", S_IRUGO,
>> + peer->debugdir,
>> + &peer->active.count);
>> + }
>> + }
>> +
>> + bus1_active_activate(&peer->active);
>
> This is a no-op since bus1_active_init() sets ->count to BUS1_ACTIVE_NEW.

bus1_active_activate() changes count from BUS1_ACTIVE_NEW to 0.
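
Roughly (a sketch, not the exact code, assuming ->count is an atomic_t and
BUS1_ACTIVE_NEW is a negative sentinel):

  static void bus1_active_activate(struct bus1_active *active)
  {
          /* move from the NEW sentinel to the live state (count == 0);
           * a no-op if the object was already activated or deactivated */
          atomic_cmpxchg(&active->count, BUS1_ACTIVE_NEW, 0);
  }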

>> + return peer;
>> +}
>> +
>> +static int bus1_peer_disconnect(struct bus1_peer *peer)
>> +{
>> + bus1_active_deactivate(&peer->active);
>> + bus1_active_drain(&peer->active, &peer->waitq);
>> +
>> + if (!bus1_active_cleanup(&peer->active, &peer->waitq,
>> + NULL, NULL))
>> + return -ESHUTDOWN;
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * bus1_peer_free() - destroy peer
>> + * @peer: peer to destroy, or NULL
>> + *
>> + * Destroy a peer object that was previously allocated via bus1_peer_new().
>> + * This synchronously waits for any outstanding operations on this peer to
>> + * finish, then releases all linked resources and deallocates the peer in an
>> + * rcu-delayed manner.
>> + *
>> + * If NULL is passed, this is a no-op.
>> + *
>> + * Return: NULL is returned.
>
> What about making the function of type void?

We are consistently returning the type being freed so we can do

foo->bar = bar_free(bar);

Just a matter of style though.
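
A minimal illustration of the pattern (hypothetical bar type, not bus1
code):

  struct bar *bar_free(struct bar *bar)
  {
          if (!bar)
                  return NULL;    /* NULL is a no-op, like kfree() */
          kfree(bar);
          return NULL;
  }

  /* release and reset the pointer in a single statement */
  foo->bar = bar_free(foo->bar);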

>> +struct bus1_peer *bus1_peer_free(struct bus1_peer *peer)
>> +{
>> + if (!peer)
>> + return NULL;
>> +
>> + /* disconnect from environment */
>> + bus1_peer_disconnect(peer);
>> +
>> + /* deinitialize peer-private section */
>> + mutex_destroy(&peer->local.lock);
>> +
>> + /* deinitialize data section */
>> + mutex_destroy(&peer->data.lock);
>> +
>> + /* deinitialize constant fields */
>> + debugfs_remove_recursive(peer->debugdir);
>> + bus1_active_deinit(&peer->active);
>> + peer->user = bus1_user_unref(peer->user);
>> + put_pid_ns(peer->pid_ns);
>> + put_cred(peer->cred);
>> + kfree_rcu(peer, rcu);
>> +
>> + return NULL;
>> +}
>
> --
> Thanks,
> //richard

2016-10-28 13:33:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:
> On Thu, Oct 27, 2016 at 6:43 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
> >
> >> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
> >> the time they were sent. Moreover, atomic delivery of messages to
> >> multiple queues is supported, without any global synchronization, i.e.,
> >> the order of message delivery is consistent across queues.
> >>
> >> Messages can be destined for multiple queues, hence, we need to be
> >> careful that all queues get a consistent order of incoming messages.
> >
> > So I read that to mean that if A and B both send a multi-cast message to
> > C and D, the messages will appear in the same order for both C and D.
>
> That is one of the ordering guarantees, yes.
>
> > Why is this important? It seem that this multi-cast ordering generates
> > much of the complexity of this patch while this Changelog fails to
> > explain why this is a desired property.
>
> I don't think this is the case. The most important guarantee we give
> is causal ordering.

C and D not observing the message in the same order is consistent with
causality (and actual physics). The cause is A sending something; the
effect is C receiving something. These two events must be ordered (which
yields the partial order). But there is no guarantee that different
observers would observe the same order, especially since A and B do not
share a clock and these events are not in fact ordered themselves.

When we go back to the example of special relativity, as per the paper,
this is trivially observable if we put A and C together in a frame of
reference and B and D in a different frame and have the two frames move
(at a significant fraction of the speed of light) relative to one
another. The signal, being an emission of light, would not arrive at
both observers in the same order (if the signal was given sufficiently
'simultaneous')

> To make this work with multicast, we must stage messages first, then
> commit on a second round. That is, we must find some way to iterate
> over all clocks before committing, but at the same time preventing any
> races. The multicast-stability as you just described we get for free
> by introducing the second-level ordering via sender-address.

And this, precisely, is what generates all the complexity found in this
patch. You want to provide strictly more than causality; as per the
argument above, causality alone does not provide this at all.

You're providing a semi-global ordering of things that are themselves
not actually ordered.

> Stability in multicasts without causal order is not necessarily a crucial
> feature. However, note that if this ordering is given, it allows reducing
> the number of round-trips in dependent systems. Imagine a daemon
> reacting to a set of events from different sources. If the actions of that
> daemon are solely defined by incoming events, someone else can
> deduce the actions the daemon took without requiring the daemon to
> send out events by itself. That is, you can just watch the events on the
> system, and validly deduce the state of such daemon.
>
> Example: There is a configuration daemon that sends events when
> configuration is changed. And there is a hotplug daemon that sends
> events when devices are hotplugged. You get an event that the "default
> mute-state" for audio devices was changed, after it you get a
> hotplugged audio device. You can now rely on the audio daemon to get
> the events in the same order, and hence apply the new "default
> mute-state" to the new device. No need to query the audio daemon
> whether the new device is muted.

Which is cute; but is it worth the pain?

> But as I said, the causal ordering is what we really want.
> Multicast-stability is just a nice side-effect.

I'm saying they're not the same thing and multi-cast stability isn't at
all implied.

2016-10-28 13:37:57

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

On Fri, Oct 28, 2016 at 3:11 PM, Richard Weinberger
<[email protected]> wrote:
> On Wed, Oct 26, 2016 at 9:17 PM, David Herrmann <[email protected]> wrote:
>> Hi
>>
>> This proposal introduces bus1.ko, a kernel messaging bus. This is not a request
>> for inclusion, yet. It is rather an initial draft and a Request For Comments.
>>
>> While bus1 emerged out of the kdbus project, bus1 was started from scratch and
>> the concepts have little in common. In a nutshell, bus1 provides a
>> capability-based IPC system, similar in nature to Android Binder, Cap'n Proto,
>> and seL4. The module is completely generic and does neither require nor mandate
>> a user-space counter-part.
>
> One thing which is not so clear to me is the role of bus1 wrt. containers.
> Can a container A exchange messages with a container B?
> If not, where is the boundary? I guess it is the pid namespace.

There is no restriction with respect to containers. The metadata is
translated between namespaces, obviously, but you can send messages to
anyone you have a handle to.
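
To illustrate what "translated" means here (a sketch with a hypothetical
bus1_meta struct and helper name, not the actual bus1 code):

  #include <linux/cred.h>
  #include <linux/uidgid.h>

  struct bus1_meta {
          uid_t uid;
          gid_t gid;
  };

  /* present the sender's credentials as seen from the receiver's user
   * namespace; ids without a mapping become the overflow uid/gid */
  static void bus1_meta_fill(const struct cred *sender,
                             const struct cred *receiver,
                             struct bus1_meta *meta)
  {
          meta->uid = from_kuid_munged(receiver->user_ns, sender->uid);
          meta->gid = from_kgid_munged(receiver->user_ns, sender->gid);
  }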

Cheers,

Tom

2016-10-28 13:48:21

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Fri, Oct 28, 2016 at 3:33 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:
>> On Thu, Oct 27, 2016 at 6:43 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
>> >
>> >> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
>> >> the time they were sent. Moreover, atomic delivery of messages to
>> >> multiple queues is supported, without any global synchronization, i.e.,
>> >> the order of message delivery is consistent across queues.
>> >>
>> >> Messages can be destined for multiple queues, hence, we need to be
>> >> careful that all queues get a consistent order of incoming messages.
>> >
>> > So I read that to mean that if A and B both send a multi-cast message to
>> > C and D, the messages will appear in the same order for both C and D.
>>
>> That is one of the ordering guarantees, yes.
>>
>> > Why is this important? It seem that this multi-cast ordering generates
>> > much of the complexity of this patch while this Changelog fails to
>> > explain why this is a desired property.
>>
>> I don't think this is the case. The most important guarantee we give
>> is causal ordering.
>
> C and D not observing the message in the same order is consistent with
> causality (and actual physics). The cause is A sending something the
> effect is C receiving something. These two events must be ordered (which
> yields the partial order). But there is no guarantee that different
> observers would observe the same order. Esp. since A and B do not share
> a clock and these events are not in fact ordered themselves.
>
> When we go back to the example of special relativity, as per the paper,
> this is trivially observable if we put A and C together in a frame of
> reference and B and D in a different frame and have the two frames move
> (at a significant fraction of the speed of light) relative to one
> another. The signal, being an emission of light, would not arrive at
> both observers in the same order (if the signal was given sufficiently
> 'simultaneous')
>
>> To make this work with multicast, we must stage messages first, then
>> commit on a second round. That is, we must find some way to iterate
>> over all clocks before committing, but at the same time preventing any
>> races. The multicast-stability as you just described we get for free
>> by introducing the second-level ordering via sender-address.
>
> And this, precisely, is what generates all the complexity found in this
> patch. You want to strictly provide more than causality, which does
> not, as per the argument above, provide this at all.
>
> You're providing a semi-global ordering of things that are themselves
> not actually ordered.

We are providing two things: causality (as in your physics example
above), and consistency (which, I agree, is cute, but not necessarily
crucial). However, the complexity comes from causality. Consistency is
trivial. The only thing needed for consistency is to tag each message
by its sender and use this to resolve conflicts in the ordering. The
alternative would be to just let these entries order arbitrarily
instead, but conceptually it would not be simpler and it would only
save us a few lines of code.
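
A minimal sketch of that tie-break (illustrative names, not the bus1
code):

  #include <linux/types.h>

  struct msg {
          u64 timestamp;  /* commit timestamp (Lamport-style) */
          u64 sender;     /* unique sender id, second-level ordering */
  };

  /* entries with equal timestamps sort by sender id, so every queue
   * ends up with the same total order */
  static int msg_compare(const struct msg *a, const struct msg *b)
  {
          if (a->timestamp != b->timestamp)
                  return a->timestamp < b->timestamp ? -1 : 1;
          if (a->sender != b->sender)
                  return a->sender < b->sender ? -1 : 1;
          return 0;
  }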

>> Stability in multicasts without causal order is not necessarily a crucial
>> feature. However, note that if this ordering is given, it allows reducing
>> the number of round-trips in dependent systems. Imagine a daemon
>> reacting to a set of events from different sources. If the actions of that
>> daemon are solely defined by incoming events, someone else can
>> deduce the actions the daemon took without requiring the daemon to
>> send out events by itself. That is, you can just watch the events on the
>> system, and validly deduce the state of such daemon.
>>
>> Example: There is a configuration daemon that sends events when
>> configuration is changed. And there is a hotplug daemon that sends
>> events when devices are hotplugged. You get an event that the "default
>> mute-state" for audio devices was changed, after it you get a
>> hotplugged audio device. You can now rely on the audio daemon to get
>> the events in the same order, and hence apply the new "default
>> mute-state" to the new device. No need to query the audio daemon
>> whether the new device is muted.
>
> Which is cute; but is it worth the pain?
>
>> But as I said, the causal ordering is what we really want.
>> Multicast-stability is just a nice side-effect.
>
> I'm saying they're not the same thing and multi-cast stability isn't at
> all implied.

Yeah, we agree. These are orthogonal concepts. What I meant is that
once we have causality, getting consistency as a side-effect is
virtually free.

2016-10-28 13:55:01

by Richard Weinberger

[permalink] [raw]
Subject: Re: [RFC v1 08/14] bus1: implement peer management context

On 28.10.2016 15:23, Tom Gundersen wrote:
> On Fri, Oct 28, 2016 at 3:05 PM, Richard Weinberger
> <[email protected]> wrote:
>> On Wed, Oct 26, 2016 at 9:18 PM, David Herrmann <[email protected]> wrote:
>>> + /* initialize constant fields */
>>> + peer->id = atomic64_inc_return(&peer_ids);
>>> + peer->flags = 0;
>>> + peer->cred = get_cred(current_cred());
>>> + peer->pid_ns = get_pid_ns(task_active_pid_ns(current));
>>> + peer->user = user;
>>> + peer->debugdir = NULL;
>>> + init_waitqueue_head(&peer->waitq);
>>> + bus1_active_init(&peer->active);
>>> +
>>> + /* initialize data section */
>>> + mutex_init(&peer->data.lock);
>>> +
>>> + /* initialize peer-private section */
>>> + mutex_init(&peer->local.lock);
>>> +
>>> + if (!IS_ERR_OR_NULL(bus1_debugdir)) {
>>
>> How can bus1_debugdir contain an error code? AFAICT it is either a
>> valid dentry or NULL.
>
> If debugfs is not enabled it will be ERR_PTR(-ENODEV).

I thought you handled that earlier, but just noticed that you check only
for NULL after doing debugfs_create_dir(). This confused me.

>>> + char idstr[22];
>>> +
>>> + snprintf(idstr, sizeof(idstr), "peer-%llx", peer->id);
>>> +
>>> + peer->debugdir = debugfs_create_dir(idstr, bus1_debugdir);
>>> + if (!peer->debugdir) {
>>> + pr_err("cannot create debugfs dir for peer %llx\n",
>>> + peer->id);
>>> + } else if (!IS_ERR_OR_NULL(peer->debugdir)) {
>>> + bus1_debugfs_create_atomic_x("active", S_IRUGO,
>>> + peer->debugdir,
>>> + &peer->active.count);
>>> + }
>>> + }
>>> +
>>> + bus1_active_activate(&peer->active);
>>
>> This is a no-op since bus1_active_init() sets ->count to BUS1_ACTIVE_NEW.
>
> bus1_active_activate() changes count from BUS1_ACTIVE_NEW to 0.

Too many "active" words. ;)
Now it makes sense. BUS1_ACTIVE_NEW is state "NEW"
and the unnamed state "ready to use" is a counter >= 0.

Thanks,
//richard

2016-10-28 13:58:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Fri, Oct 28, 2016 at 03:47:58PM +0200, Tom Gundersen wrote:
> On Fri, Oct 28, 2016 at 3:33 PM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:

> > And this, precisely, is what generates all the complexity found in this
> > patch. You want to strictly provide more than causality, which does
> > not, as per the argument above, provide this at all.
> >
> > You're providing a semi-global ordering of things that are themselves
> > not actually ordered.
>
> We are providing two things: causality (as in your physics example
> above), and consistency (which, I agree, is cute, but not necessarily
> crucial). However, the complexity comes from causality. Consistency is
> trivial. The only thing needed for consistency is to tag each message
> by its sender and use this to resolve conflicts in the ordering. The
> alternative would be to just let these entries order arbitrarily
> instead, but conceptually it would not be simpler and it would only
> save us a few lines of code.

Earlier you wrote:

> >> To make this work with multicast, we must stage messages first, then
> >> commit on a second round. That is, we must find some way to iterate
> >> over all clocks before committing, but at the same time preventing any
> >> races. The multicast-stability as you just described we get for free
> >> by introducing the second-level ordering via sender-address.

But you don't need the two-pass thing at all for causality. The entire
two-pass thing, and the serialization, is part of the consistency thing.

This is not virtually free.

For causality, all you need is a single iteration, delivering the
messages one after the other, only ever doing local clock movements. You
do not need to find the max clock in the multicast set and avoid races
etc.
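
For comparison, plain Lamport-style delivery needs only something like
this (a rough sketch):

  #include <linux/types.h>

  struct queue {
          u64 clock;                      /* local Lamport clock */
  };

  static u64 queue_send(struct queue *q)
  {
          return ++q->clock;              /* tick locally, stamp message */
  }

  static void queue_deliver(struct queue *q, u64 msg_ts)
  {
          if (q->clock < msg_ts)          /* local clock movement only */
                  q->clock = msg_ts;
          q->clock++;
  }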

2016-10-28 14:34:13

by Tom Gundersen

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Fri, Oct 28, 2016 at 3:58 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Oct 28, 2016 at 03:47:58PM +0200, Tom Gundersen wrote:
>> On Fri, Oct 28, 2016 at 3:33 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:
>
>> > And this, precisely, is what generates all the complexity found in this
>> > patch. You want to strictly provide more than causality, which does
>> > not, as per the argument above, provide this at all.
>> >
>> > You're providing a semi-global ordering of things that are themselves
>> > not actually ordered.
>>
>> We are providing two things: causality (as in your physics example
>> above), and consistency (which, I agree, is cute, but not necessarily
>> crucial). However, the complexity comes from causality. Consistency is
>> trivial. The only thing needed for consistency is to tag each message
>> by its sender and use this to resolve conflicts in the ordering. The
>> alternative would be to just let these entries order arbitrarily
>> instead, but conceptually it would not be simpler and it would only
>> save us a few lines of code.
>
> Earlier you wrote:
>
>> >> To make this work with multicast, we must stage messages first, then
>> >> commit on a second round. That is, we must find some way to iterate
>> >> over all clocks before committing, but at the same time preventing any
>> >> races. The multicast-stability as you just described we get for free
>> >> by introducing the second-level ordering via sender-address.
>
> But you don't need the two-pass thing at all for causality. The entire
> two-pass thing, and the serialization, is part of the consistency thing.
>
> This is not virtually free.
>
> For causality, all you need is a single iteration, delivering the
> message one after the other, only ever doing local clock movements. You
> do not need to find the max clock in the multicast set and avoid races
> etc..

Ah, I see, we are talking past each other. The property we do want
(which is not trivial) is that we do not want to observe the effect
before the cause. If an event at A causes an event at B, then the two
events should be guaranteed to be observed at C in that order. That is,
if you send a multi-cast message from A to B and C and as a result of
receiving the message, B sends a message to C, we want to be
guaranteed that C receives the latter after the former.
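
Sketched very roughly (hypothetical helpers, not the actual queue code),
the staging round that buys this looks like:

  #include <linux/types.h>

  struct entry;
  struct queue { u64 clock; };

  /* stand-ins for the real staging/commit primitives */
  extern void stage(struct queue *q, struct entry *e);
  extern void commit(struct queue *q, struct entry *e, u64 ts);

  /* 1) stage the entry on every destination with a provisional slot,
   * 2) sync: take the max of all destination clocks and tick once,
   * 3) commit everywhere with that final timestamp.
   * Anything sent in reaction to the commit gets a larger timestamp,
   * so no queue can observe the effect before the cause. */
  static void multicast(struct queue **dst, size_t n, struct entry *e)
  {
          u64 ts = 0;
          size_t i;

          for (i = 0; i < n; i++) {
                  stage(dst[i], e);
                  if (dst[i]->clock > ts)
                          ts = dst[i]->clock;
          }
          ts++;
          for (i = 0; i < n; i++)
                  commit(dst[i], e, ts);
  }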

If this property is not wanted, then (repeated) unicast can in most
cases be used instead of multi-cast (and a natural optimization, which
we left out for now, would be to skip the staging round for unicast
messages).

Cheers,

Tom

2016-10-28 14:37:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 09/14] bus1: provide transaction context for multicasts

On Wed, Oct 26, 2016 at 09:18:05PM +0200, David Herrmann wrote:
> From: Tom Gundersen <[email protected]>
>
> The transaction engine is an object that lives on the stack and is used
> to stage and commit multicasts properly. Unlike unicasts, a multicast
> cannot just be queued on each destination, but must be properly
> synchronized. This requires us to first stage each message on their
> respective destination, then sync and tick the clocks, and eventual
> commit all messages.
>
> The transaction context implements this logic for both, unicasts and
> multicasts. It hides the timestamp handling and takes care to properly
> synchronize accesses to the peer queues.
>
> Signed-off-by: Tom Gundersen <[email protected]>
> Signed-off-by: David Herrmann <[email protected]>
> ---
> ipc/bus1/Makefile | 1 +
> ipc/bus1/peer.c | 2 +
> ipc/bus1/peer.h | 3 +
> ipc/bus1/tx.c | 360 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ipc/bus1/tx.h | 102 ++++++++++++++++

See, this is way more than 4 lines.

You don't need any of this for causality.

2016-10-28 16:49:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1 06/14] bus1: util - queue utility library

On Fri, Oct 28, 2016 at 04:33:50PM +0200, Tom Gundersen wrote:
> Ah, I see, we are talking past each other.


Ah, I see where my reasoning went wobbly; I'm not sure how to fully
express that yet. I think your solution is stronger than strictly
required, but I'm not sure there's a better one. I'll think on it.

2016-10-29 20:26:17

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

On Wed, Oct 26, 2016 at 10:34:30PM +0200, David Herrmann wrote:
> Long story short: We have uid<->uid quotas so far, which prevent DoS
> attacks, unless you get access to a ridiculous amount of local UIDs.
> Details on which resources are accounted can be found in the wiki [1].

Do only root user_ns uids count as separate, or per-ns uids too?

In the first case we will have virtually unbounded access to UIDs.

The second case can cap the number of user namespaces a user can create
while using bus1 inside.

Or am I missing something?

--
Kirill A. Shutemov

2016-10-29 20:26:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC v1 01/14] bus1: add bus1(7) man-page

On Wed, Oct 26, 2016 at 09:17:57PM +0200, David Herrmann wrote:
> +To receive messag payloads, each peer has an associated shmem-backed

s/messag/message/

--
Kirill A. Shutemov

2016-10-29 21:06:12

by Josh Triplett

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

On Thu, Oct 27, 2016 at 03:45:24AM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 26, 2016 at 10:34:30PM +0200, David Herrmann wrote:
> > Long story short: We have uid<->uid quotas so far, which prevent DoS
> > attacks, unless you get access to a ridiculous amount of local UIDs.
> > Details on which resources are accounted can be found in the wiki [1].
>
> Does only root user_ns uid count as separate or per-ns too?
>
> In the first case we will have virtually unbounded access to UIDs.
>
> The second case can cap number of user namespaces a user can create while
> using bus1 inside.

That seems easy enough to solve. Make the uid<->uid quota use uids in
the namespace of the side whose resources the operation uses. That way,
if both sender and recipient live in a user namespace then you get quota
per user in the namespace, but you can't use a user namespace to cheat
and manufacture more users to get more quota when talking to something
*outside* that namespace.
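
Something along these lines, I imagine (a sketch with a hypothetical
helper, not bus1 code):

  #include <linux/uidgid.h>
  #include <linux/user_namespace.h>

  /* charge the quota against the uid as seen in the user namespace of
   * the side whose resources are consumed; senders without a mapping
   * there collapse into the overflow uid instead of minting new quota */
  static uid_t bus1_quota_uid(struct user_namespace *owner_ns, kuid_t who)
  {
          return from_kuid_munged(owner_ns, who);
  }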

- Josh Triplett

2016-10-29 22:13:42

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC v1 02/14] bus1: provide stub cdev /dev/bus1

On Thursday 27 October 2016, Tom Gundersen wrote:
> On Thu, Oct 27, 2016 at 11:11 AM, Arnd Bergmann <[email protected]> wrote:
> > On Thursday, October 27, 2016 1:54:05 AM CEST Tom Gundersen wrote:
> >> On Thu, Oct 27, 2016 at 1:19 AM, Andy Lutomirski <[email protected]> wrote:
> >> > This may have been covered elsewhere, but could this use syscalls instead?
> >>
> >> Yes, syscalls would work essentially the same. For now, we are using a
> >> cdev as it makes it a lot more convenient to develop and test as an
> >> out-of-tree module, but that could be changed easily before the final
> >> submission, if that's what we want.
> >
> >
> > Generally speaking, I think syscalls would be appropriate here, and put
> > bus1 into a similar category as the other ipc interfaces (shm, msg, sem,
> > mqueue, ...).
>
> Could you elaborate on why you think syscalls would be more
> appropriate than ioctls?

Linus already answered this, but I'd also add that core kernel
features just make sense as syscalls, rather than being stuffed
into a random device driver.

> > - Have a mountable file system, and use open() on that to create
> > connections. Advantages are that it's fairly easy to have one
> > instance per fs-namespace, and you can have user-defined naming
> > of objects in the file system.
>
> Note that currently we only have one object (/dev/bus1) and each fd is
> disconnected from anything else on creation, so not sure what benefits
> a filesystem (or several instances of it) would give?

I have not tried to understand some of the main concepts of bus1,
so I simply assumed that there was some way of looking up handles
of other instances. Using a file system gives you a natural way
to look up resources by name the way we do e.g. for mq_open(),
and it lets you easily decide whether containers should share
a view of the same namespace by mounting the same instance of
the file system into them or having separate instances.

If you don't ever need to look up a handle by name in bus1, using
a mountable file system would not help you.

Arnd

2016-11-02 14:45:14

by David Herrmann

[permalink] [raw]
Subject: Re: [RFC v1 00/14] Bus1 Kernel Message Bus

Hi

On Thu, Oct 27, 2016 at 2:45 AM, Kirill A. Shutemov
<[email protected]> wrote:
> On Wed, Oct 26, 2016 at 10:34:30PM +0200, David Herrmann wrote:
>> Long story short: We have uid<->uid quotas so far, which prevent DoS
>> attacks, unless you get access to a ridiculous amount of local UIDs.
>> Details on which resources are accounted can be found in the wiki [1].
>
> Does only root user_ns uid count as separate or per-ns too?
>
> In the first case we will have virtually unbounded access to UIDs.
>
> The second case can cap number of user namespaces a user can create while
> using bus1 inside.
>
> Or am I missing something?

We use the exact same mechanism as "struct user_struct" (as defined in
linux/sched.h). One instance corresponds to each kuid_t currently in
use. This is analogous to task, epoll, inotify, fanotify, mqueue,
pipes, keys, ... resource accounting.
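
In bus1 terms the shape is roughly (a sketch; the accounted fields are
just examples, not the exact struct layout):

  /* one accounting object per kuid_t in use, reference-counted like
   * user_struct and presumably looked up or created by
   * bus1_user_ref_by_uid() */
  struct bus1_user {
          struct kref ref;
          kuid_t uid;
          atomic_t n_slices;      /* example of an accounted resource */
          atomic_t n_handles;     /* ditto */
  };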

Could you elaborate on what problem you see?

Thanks
David