This patch-set adds multicast support to the Unix domain socket family for
datagram and seqpacket sockets. This work was done by Alban Crequy as a
result of research we have been doing to improve the performance of the
D-Bus IPC system.
The first approach was to create a new AF_DBUS socket address family and
move the routing logic of the D-Bus daemon into the kernel. The motivations
behind that approach and the thread of the posted patches can be found in
[1] and [2].
The feedback was that having D-Bus specific code in the kernel is a bad
idea, so the second approach was to implement multicast Unix domain sockets
so that clients can send messages directly to their peers, bypassing the
D-Bus daemon.
A previous version of the patches was already posted by Alban [3], who also
gives a good explanation of the implementation on his blog [4].
[1] http://alban-apinc.blogspot.com/2011/12/d-bus-in-kernel-faster.html
[2] http://thread.gmane.org/gmane.linux.kernel/1040481
[3] http://thread.gmane.org/gmane.linux.network/178772
[4] http://alban-apinc.blogspot.com/2011/12/introducing-multicast-unix-sockets.html
The patch-set is composed of the following patches:
[PATCH 01/10] af_unix: Documentation on multicast unix sockets
[PATCH 02/10] af_unix: Add constant for unix socket options level
[PATCH 03/10] af_unix: add setsockopt on unix sockets
[PATCH 04/10] af_unix: create, join and leave multicast groups with setsockopt
[PATCH 05/10] af_unix: find the recipients of a multicast group
[PATCH 06/10] af_unix: Deliver message to several recipients in case of multicast
[PATCH 07/10] af_unix: implement poll(POLLOUT) for multicast sockets
[PATCH 08/10] af_unix: Unsubscribe sockets from their multicast groups on RCV_SHUTDOWN
[PATCH 09/10] Allow server side of SOCK_SEQPACKET sockets to accept a new member
[PATCH 10/10] af_unix: Add a peer BPF for multicast Unix sockets
From: Alban Crequy <[email protected]>
Signed-off-by: Alban Crequy <[email protected]>
Reviewed-by: Ian Molton <[email protected]>
---
.../networking/multicast-unix-sockets.txt | 180 ++++++++++++++++++++
1 files changed, 180 insertions(+), 0 deletions(-)
create mode 100644 Documentation/networking/multicast-unix-sockets.txt
diff --git a/Documentation/networking/multicast-unix-sockets.txt b/Documentation/networking/multicast-unix-sockets.txt
new file mode 100644
index 0000000..ec9a19c
--- /dev/null
+++ b/Documentation/networking/multicast-unix-sockets.txt
@@ -0,0 +1,180 @@
+Multicast Unix sockets
+======================
+
+Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET Unix sockets.
+
+A userspace application can create a multicast group with:
+
+ struct unix_mreq mreq = {0};
+ mreq.address.sun_family = AF_UNIX;
+ /* a leading NUL selects the abstract socket namespace */
+ mreq.address.sun_path[0] = '\0';
+ strcpy(mreq.address.sun_path + 1, "socket-address");
+
+ sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
+ ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast_group, which is reference counted and
+exists as long as the socket that created it exists or the group has at
+least one member.
+
+SOCK_DGRAM sockets can join a multicast group with:
+
+ ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast, which holds the settings of the membership,
+mainly whether loopback is enabled. A socket can be a member of several
+multicast groups.
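+
+As an illustration, a join with loopback enabled might look as follows (a
+sketch; the exact name of the flags field in struct unix_mreq is an
+assumption, since the full struct layout is not reproduced here):
+
+ struct unix_mreq mreq = {0};
+ mreq.address.sun_family = AF_UNIX;
+ mreq.address.sun_path[0] = '\0';
+ strcpy(mreq.address.sun_path + 1, "socket-address");
+ mreq.flags = UNIX_MREQ_LOOPBACK; /* also deliver our own messages to us */
+
+ ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));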
+
+Since SOCK_SEQPACKET sockets are connection-oriented, the semantics are
+different. A client cannot join a group directly; it can only connect, and
+the server uses the accepted socket to let the peer join the group with:
+
+ ret = setsockopt(groupfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));
+ ret = listen(groupfd, 10);
+ connfd = accept(groupfd, NULL, NULL);
+ ret = setsockopt(connfd, SOL_UNIX, UNIX_ACCEPT_GROUP, &mreq, sizeof(mreq));
+
+The socket is part of the multicast group until it is released, shut down
+with RCV_SHUTDOWN, or explicitly leaves the group:
+
+ ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));
+
+Struct unix_mcast nodes are linked in two RCU lists:
+- (struct unix_sock)->mcast_subscriptions
+- (struct unix_mcast_group)->mcast_members
+
+ unix_mcast_group unix_mcast_group
+ | |
+ v v
+unix_sock ----> unix_mcast ----> unix_mcast
+ |
+ v
+unix_sock ----> unix_mcast
+ |
+ v
+unix_sock ----> unix_mcast
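+
+A sketch of the node structure implied by this diagram (only the two RCU
+list heads are named in this document; the other field names here are
+assumptions for illustration):
+
+ struct unix_mcast {
+         struct unix_sock        *member;   /* the subscribed socket */
+         struct unix_mcast_group *group;    /* the group it joined */
+         unsigned int            flags;     /* e.g. UNIX_MREQ_LOOPBACK */
+         struct list_head        subscription_node; /* on mcast_subscriptions */
+         struct list_head        member_node;       /* on mcast_members */
+ };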
+
+
+SOCK_DGRAM semantics
+====================
+
+ G The socket which created the group
+ / | \
+ P1 P2 P3 The member sockets
+
+Messages sent to the group are received by all members; the sender itself
+is excluded unless it joined with UNIX_MREQ_LOOPBACK set.
+
+Non-members can also send to the group socket G, and the message will be
+broadcast to the group members. However, socket G itself does not receive
+the messages sent to the group through it.
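+
+For example, a non-member can reach the group with a plain sendto() on the
+group's abstract address, reusing the "socket-address" name from the
+example above (buf/len stand for the message to send):
+
+ struct sockaddr_un addr = {0};
+ socklen_t addrlen;
+
+ addr.sun_family = AF_UNIX;
+ addr.sun_path[0] = '\0';
+ strcpy(addr.sun_path + 1, "socket-address");
+ /* abstract names are not NUL-terminated: count the bytes explicitly */
+ addrlen = offsetof(struct sockaddr_un, sun_path) + 1 + strlen("socket-address");
+
+ fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+ ret = sendto(fd, buf, len, 0, (struct sockaddr *)&addr, addrlen);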
+
+
+SOCK_SEQPACKET semantics
+========================
+
+When a connection is performed on a SOCK_SEQPACKET multicast socket, a new
+socket is created and its file descriptor is received by accept().
+
+ L The listening socket
+ / | \
+ A1 A2 A3 The accepted sockets
+ | | |
+ C1 C2 C3 The connected sockets
+
+Messages sent on the C1 socket are received by:
+- C1 itself if UNIX_MREQ_LOOPBACK is set.
+- The peer socket A1 if UNIX_MREQ_SEND_TO_PEER is set.
+- The other members of the multicast group C2 and C3.
+
+Only members can send to the group in this case.
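+
+The client side needs no multicast-specific call; it simply connects to
+the listening socket, reusing the addr/addrlen setup shown for the
+SOCK_DGRAM case above:
+
+ sockfd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
+ ret = connect(sockfd, (struct sockaddr *)&addr, addrlen);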
+
+
+Atomic delivery and ordering
+============================
+
+Each message is delivered atomically either to none of the recipients or to
+all of them, even in the presence of interruptions and errors.
+
+Locking is used to keep the ordering consistent across all recipients. We
+want to avoid the following scenario, with two senders A and B and two
+recipients C and D:
+
+ C D
+A -------->| | Step 1: A's message is delivered to C
+B -------->| | Step 2: B's message is delivered to C
+B ---------|--->| Step 3: B's message is delivered to D
+A ---------|--->| Step 4: A's message is delivered to D
+
+Result: - C received (A, B)
+ - D received (B, A)
+
+Although A and B had a list of recipients (C, D) in the same order, C and D
+received the messages in a different order. To avoid this scenario, we need a
+locking mechanism while the messages are being delivered with skb_queue_tail().
+
+Solution 1:
+The easiest implementation would be to use a global spinlock on the group, but
+it creates an avoidable contention, especially when there are two independent
+streams set up with socket filters; e.g. if A sends messages received only by
+C, and B sends messages received only by D.
+
+Solution 2:
+Fine-grained locking could be implemented with a spinlock on each recipient.
+Before delivering the message to the recipients, the sender takes a spinlock on
+each recipient at the same time.
+
+Taking several spinlocks of the same lock class can be dangerous and can
+lead to deadlocks. This is prevented by sorting the list of sockets by
+memory address and taking the spinlocks in that order. The ordered list of
+recipients is
+computed on demand when a message is sent and the list is cached for
+performance. When the group membership changes, the generation of the
+membership is incremented and the ordered recipient list is invalidated.
+
+With this solution, the number of spinlocks taken simultaneously can be
+arbitrarily large. While it works, it breaks the lockdep mechanism.
+
+Solution 3:
+The current implementation is similar to solution 2 but with a limit on the
+number of spinlocks taken simultaneously (8), so lockdep works fine. A hash
+function and a bit array of n=8 bits determine which spinlocks to take.
+Contention on independent streams can still happen, but it is less likely.
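+
+A minimal sketch of the idea (not the actual patch code): each recipient is
+hashed onto one of the 8 spinlocks, the set of needed locks is recorded in
+a bit array, and the locks are always taken in index order, so at most 8
+locks are held and the order is globally consistent:
+
+ #define MCAST_LOCK_BITS 3 /* 2^3 = 8 lock classes */
+ static spinlock_t mcast_locks[1 << MCAST_LOCK_BITS];
+
+ static void mcast_lock_recipients(struct sock **socks, int n,
+                                   unsigned long *bits)
+ {
+         int i;
+
+         *bits = 0;
+         for (i = 0; i < n; i++)
+                 __set_bit(hash_ptr(socks[i], MCAST_LOCK_BITS), bits);
+
+         /* taking the locks in index order prevents ABBA deadlocks */
+         for_each_set_bit(i, bits, 1 << MCAST_LOCK_BITS)
+                 spin_lock(&mcast_locks[i]);
+ }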
+
+
+Flow control
+============
+
+When a socket's receiving queue is full, the default behavior is to block
+senders (or to return -EAGAIN on non-blocking sockets). The socket can also
+join a multicast group with the flag UNIX_MREQ_DROP_WHEN_FULL. In this case,
+messages sent to the group will not be delivered to that socket when its
+receiving queue is full.
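+
+For example, a monitoring client that prefers losing messages over
+stalling the senders could join with (reusing the mreq setup and the
+assumed flags field from the join example above):
+
+ mreq.flags = UNIX_MREQ_DROP_WHEN_FULL;
+ ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));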
+
+Messages are still delivered atomically to all members that don't have the
+flag UNIX_MREQ_DROP_WHEN_FULL. If send() returns -EAGAIN, nobody received the
+message. If send() blocks because of one member, the other members don't
+receive the message until all sockets (except those with
+UNIX_MREQ_DROP_WHEN_FULL set) can receive at the same time.
+
+poll/epoll/select on POLLOUT events have a consistent behavior; they block if
+at least one member of the multicast group without UNIX_MREQ_DROP_WHEN_FULL has
+a full receiving queue.
+
+
+Multicast socket reference counting
+===================================
+
+A poller for POLLOUT events can block for any member of the group. The poller
+can use the wait queue "peer_wait" of any member. So it is important that Unix
+sockets are not released before all pollers exit. This is achieved by:
+
+- Incrementing the reference counter of a socket when it joins a multicast
+ group.
+- Decrementing it when the group is destroyed, that is, when all sockets
+  holding a reference on the group have released it.
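+
+In kernel terms this maps to something like the following (a sketch, not
+the exact patch code):
+
+ /* on UNIX_JOIN_GROUP / UNIX_ACCEPT_GROUP */
+ sock_hold(sk);
+
+ /* when the last reference on the group is dropped */
+ sock_put(sk);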
+
+struct unix_mcast_group keeps track of both current members and previous
+members. When a socket leaves a group, it is removed from the members list and
+put in the dead members list. This is done in order to take advantage of RCU
+lists, which reduces lock contention.
--
1.7.7.6
From: Alban Crequy <[email protected]>
Assign the next free socket option level to be used by the Unix protocol
and address family.
Signed-off-by: Alban Crequy <[email protected]>
Reviewed-by: Ian Molton <[email protected]>
---
include/linux/socket.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index d0e77f6..a6b8f35 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -312,6 +312,7 @@ struct ucred {
#define SOL_IUCV 277
#define SOL_CAIF 278
#define SOL_ALG 279
+#define SOL_UNIX 280
/* IPX options */
#define IPX_TYPE 1
--
1.7.7.6
From: Alban Crequy <[email protected]>
unix_setsockopt() is called only on SOCK_DGRAM and SOCK_SEQPACKET Unix
sockets.
Signed-off-by: Alban Crequy <[email protected]>
Reviewed-by: Ian Molton <[email protected]>
---
net/unix/af_unix.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 85d3bb7..3537f20 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -515,6 +515,8 @@ static unsigned int unix_dgram_poll(struct file *, struct socket *,
poll_table *);
static int unix_ioctl(struct socket *, unsigned int, unsigned long);
static int unix_shutdown(struct socket *, int);
+static int unix_setsockopt(struct socket *, int, int,
+ char __user *, unsigned int);
static int unix_stream_sendmsg(struct kiocb *, struct socket *,
struct msghdr *, size_t);
static int unix_stream_recvmsg(struct kiocb *, struct socket *,
@@ -564,7 +566,7 @@ static const struct proto_ops unix_dgram_ops = {
.ioctl = unix_ioctl,
.listen = sock_no_listen,
.shutdown = unix_shutdown,
- .setsockopt = sock_no_setsockopt,
+ .setsockopt = unix_setsockopt,
.getsockopt = sock_no_getsockopt,
.sendmsg = unix_dgram_sendmsg,
.recvmsg = unix_dgram_recvmsg,
@@ -585,7 +587,7 @@ static const struct proto_ops unix_seqpacket_ops = {
.ioctl = unix_ioctl,
.listen = unix_listen,
.shutdown = unix_shutdown,
- .setsockopt = sock_no_setsockopt,
+ .setsockopt = unix_setsockopt,
.getsockopt = sock_no_getsockopt,
.sendmsg = unix_seqpacket_sendmsg,
.recvmsg = unix_seqpacket_recvmsg,
@@ -1583,6 +1585,13 @@ out:
}
+static int unix_setsockopt(struct socket *sock, int level, int optname,
+			   char __user *optval, unsigned int optlen)
+{
+	/* No option is implemented yet; later patches in the series hook
+	 * the multicast group options in here. */
+	return -EOPNOTSUPP;
+}
+
static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
struct msghdr *msg, size_t len)
{
--
1.7.7.6
On Mon, 2012-02-20 at 16:57 +0100, Javier Martinez Canillas wrote:
> This patch-set adds multicast support to the Unix domain socket family for
> datagram and seqpacket sockets. This work was done by Alban Crequy as a
> result of research we have been doing to improve the performance of the
> D-Bus IPC system.
Do you have links to any modifications to userspace dbus to take
advantage of this?
On Mon, 2012-02-20 at 14:13 -0500, Colin Walters wrote:
> On Mon, 2012-02-20 at 16:57 +0100, Javier Martinez Canillas wrote:
> > This patch-set adds multicast support to the Unix domain socket family for
> > datagram and seqpacket sockets. This work was done by Alban Crequy as a
> > result of research we have been doing to improve the performance of the
> > D-Bus IPC system.
>
> Do you have links to any modifications to userspace dbus to take
> advantage of this?
We have a work in progress at
http://cgit.collabora.com/git/user/rodrigo/dbus.git/ in the
unix-sockets-multicast branch.
My first impression is that I'm amazed at how much complicated new
code you have to add to support groups of receivers of AF_UNIX
messages.
I can't see how this is better than doing multicast over ipv4 using
UDP or something like that, code which we have already and has been
tested for decades.
I really don't want to apply this stuff, it looks bloated,
complicated, and there is another avenue for doing what you want to
do.
Applications have to change to support the new multicast facilities,
so they can equally be changed to use a real transport that already
supports multicasting.
On 02/24/2012 09:36 PM, David Miller wrote:
>
> My first impression is that I'm amazed at how much complicated new
> code you have to add to support groups of receivers of AF_UNIX
> messages.
>
> I can't see how this is better than doing multicast over ipv4 using
> UDP or something like that, code which we have already and has been
> tested for decades.
>
Primarily for performance reasons. D-Bus is an IPC system for processes on
the same machine, so traversing the whole TCP/IP stack seems a little
overkill to me. We will try it though, to get numbers on the actual
overhead of using UDP multicast over IP instead of multicast Unix domain
sockets.
We also thought of using Netlink sockets, since Netlink already supports
multicast and should be more lightweight than IP multicast. But even
Netlink doesn't meet our needs, since our multicast Unix sockets
implementation has different semantics needed by D-Bus:
- total order is guaranteed: if sender A sends a message before B, then
receivers C and D should both get message A first and then B.
- slow readers: dropping packets vs blocking the sender. Although
datagrams are not reliable on IP, datagrams on Unix sockets are never
lost. So if one receiver has its buffer full, the sender is blocked
instead of dropping packets. That way we guarantee a reliable
communication channel.
- multicast group access control: controlling who can join the multicast
group.
- multicast on loopback is not supported: which means we would have to
use a real NIC (e.g. eth0).
> I really don't want to apply this stuff, it looks bloated,
> complicated, and there is another avenue for doing what you want to
> do.
>
We can work to reduce the implementation complexity and make it less
bloated.
Or is it that you don't like the idea in general?
> Applications have to change to support the new multicast facilities,
> so they can equally be changed to use a real transport that already
> supports multicasting.
Yes, this is not about minimizing user-space application changes but about
improving D-Bus performance, or that of any other framework that relies on
multicast communication on a single machine.
Best regards,
Javier
From: Javier Martinez Canillas <[email protected]>
Date: Mon, 27 Feb 2012 15:00:06 +0100
> Primarily for performance reasons. D-Bus is an IPC system for processes on
> the same machine, so traversing the whole TCP/IP stack seems a little
> overkill to me.
You haven't actually tested what the cost of this is, so what you're
saying is mere speculation. In many cases TCP/UDP over loopback is
actually faster than AF_UNIX.
Since this is the premise of your whole rebuttal, I'll simply stop
reading here.
Hi David
On Mon, 2012-02-27 at 14:05 -0500, David Miller wrote:
> From: Javier Martinez Canillas <[email protected]>
> Date: Mon, 27 Feb 2012 15:00:06 +0100
>
> > Primarily for performance reasons. D-Bus is an IPC system for processes on
> > the same machine, so traversing the whole TCP/IP stack seems a little
> > overkill to me.
>
> You haven't actually tested what the cost of this is, so what you're
> saying is mere speculation. In many cases TCP/UDP over loopback is
> actually faster than AF_UNIX.
>
You're right, we haven't tested this. But we ruled it out because of the
other points in Javier's mail, which are the special semantics we need for
this to fit the D-Bus usage:
> - total order is guaranteed: if sender A sends a message before B, then
> receivers C and D should both get message A first and then B.
>
> - slow readers: dropping packets vs blocking the sender. Although
> datagrams are not reliable on IP, datagrams on Unix sockets are never
> lost. So if one receiver has its buffer full, the sender is blocked
> instead of dropping packets. That way we guarantee a reliable
> communication channel.
>
> - multicast group access control: controlling who can join the multicast
> group.
>
> - multicast on loopback is not supported: which means we would have to
> use a real NIC (e.g. eth0).
Because of all of this, UDP/IP multicast wasn't even considered an option.
We might be wrong on some or all of those points, so could you please
comment on them to check whether that's so?
thanks
On Tue, Feb 28, 2012 at 11:47:39AM +0100, Rodrigo Moya wrote:
> > - slow readers: dropping packets vs blocking the sender. Although
> > datagrams are not reliable on IP, datagrams on Unix sockets are
> never
> > lost. So if one receiver has its buffer full the sender is blocked
> > instead of dropping packets. That way we guarantee a reliable
> > communication channel.
This sounds like a terribly nice way to f*ck the entire D-Bus system by
having one broken (or malicious) desktop application. What's the
intended way of coping with users that block the socket by not reading?
-David L.
On 02/28/2012 03:28 PM, David Lamparter wrote:
> On Tue, Feb 28, 2012 at 11:47:39AM +0100, Rodrigo Moya wrote:
>> > - slow readers: dropping packets vs blocking the sender. Although
>> > datagrams are not reliable on IP, datagrams on Unix sockets are never
>> > lost. So if one receiver has its buffer full, the sender is blocked
>> > instead of dropping packets. That way we guarantee a reliable
>> > communication channel.
>
> This sounds like a terribly nice way to f*ck the entire D-Bus system by
> having one broken (or malicious) desktop application. What's the
> intended way of coping with users that block the socket by not reading?
>
>
> -David L.
The problem is that D-Bus expects a reliable transport (TCP or SOCK_STREAM
Unix sockets), but this is not the case with multicast Unix sockets, since
our implementation is for the SOCK_SEQPACKET and SOCK_DGRAM socket types.
So, you have to either add another layer to the D-Bus protocol to make it
reliable (acks, retransmissions, flow control, etc.) or avoid losing D-Bus
messages (by blocking the sender if one of the receivers has its buffer
full).
Regards,
Javier
On 02/28/2012 04:24 PM, Javier Martinez Canillas wrote:
> On 02/28/2012 03:28 PM, David Lamparter wrote:
>> On Tue, Feb 28, 2012 at 11:47:39AM +0100, Rodrigo Moya wrote:
>>> > - slow readers: dropping packets vs blocking the sender. Although
>>> > datagrams are not reliable on IP, datagrams on Unix sockets are never
>>> > lost. So if one receiver has its buffer full, the sender is blocked
>>> > instead of dropping packets. That way we guarantee a reliable
>>> > communication channel.
>>
>> This sounds like a terribly nice way to f*ck the entire D-Bus system by
>> having one broken (or malicious) desktop application. What's the
>> intended way of coping with users that block the socket by not reading?
>>
>>
>> -David L.
>
> The problem is that D-Bus expects a reliable transport (TCP or SOCK_STREAM
> Unix sockets), but this is not the case with multicast Unix sockets, since
> our implementation is for the SOCK_SEQPACKET and SOCK_DGRAM socket types.
>
> So, you have to either add another layer to the D-Bus protocol to make it
> reliable (acks, retransmissions, flow control, etc.) or avoid losing D-Bus
> messages (by blocking the sender if one of the receivers has its buffer
> full).
>
Also, this problem exists with the current D-Bus implementation. If a
malicious desktop application doesn't read its socket, then the messages
sent to it will be buffered in the daemon:
https://bugs.freedesktop.org/show_bug.cgi?id=33606
dbus-daemon memory usage will balloon until the
max_incoming_bytes/max_outgoing_bytes limit is reached (1GB for the session
bus in the default configuration):
<limit name="max_incoming_bytes">1000000000</limit>
<limit name="max_outgoing_bytes">1000000000</limit>
It only works because not many applications are broken and user-space
memory is virtualized. But if you bypass the daemon and use a multicast
transport layer (as in our multicast Unix socket implementation) you
don't have that much memory to buffer the packets.
So you have to either block the senders or:
- drop the slow reader
- kill the spammer
- have an infinite amount of memory
Regards,
Javier
From: Rodrigo Moya <[email protected]>
Date: Tue, 28 Feb 2012 11:47:39 +0100
> Because of all of this, UDP/IP multicast wasn't even considered as an
> option. We might be wrong in some/all of those, so could you please
> comment on them to check if that's so?
You guys seem to want something that isn't AF_UNIX, with ordering guarantees
and whatnot; it really has no place in these protocols.
You've designed a userlevel subsystem with requirements that no existing
socket layer can give, and you just figured you'd work that out later.
I think you rather should have reconsidered these premises and designed
something that could handle reality, which is that AF_UNIX can't do
multicast and nobody guarantees those strange ordering requirements you
seem to have.