2010-11-22 18:36:44

by Alban Crequy

Subject: [PATCH 0/9] RFC v2: Multicast and filtering features on AF_UNIX

Hi,

This is a new series of patches, following my first request for comments
here: http://marc.info/?l=linux-netdev&m=128534977610124

It implements new multicast features on AF_UNIX datagram and
seqpacket sockets.

My motivation is to use it for D-Bus. The kernel code here does not
contain anything specific to D-Bus, so it could be used for other IPC
mechanisms too.

The patches apply on top of linux-next-20101122 and can be pulled from:

git://git.collabora.co.uk/git/user/alban/linux-2.6.35.y/.git unix-multicast5

Comments & questions welcome! I would appreciate a review of the design
and to know whether it is going in the right direction.

Regards,
Alban Crequy


2010-11-22 18:39:25

by Alban Crequy

Subject: [PATCH 1/9] AF_UNIX: Add constant for Unix socket options level

Assign the next free socket options level to be used by the Unix
protocol and address family.

Signed-off-by: Alban Crequy <[email protected]>
---
include/linux/socket.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 86b652f..7c5a4da 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -307,6 +307,7 @@ struct ucred {
#define SOL_RDS 276
#define SOL_IUCV 277
#define SOL_CAIF 278
+#define SOL_UNIX 279

/* IPX options */
#define IPX_TYPE 1
--
1.7.1

2010-11-22 18:39:37

by Alban Crequy

Subject: [PATCH 2/9] AF_UNIX: add setsockopt on Unix sockets

Signed-off-by: Alban Crequy <[email protected]>
---
net/unix/af_unix.c | 15 ++++++++++++---
1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 7ff31c6..6eca106 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -512,6 +512,8 @@ static unsigned int unix_dgram_poll(struct file *, struct socket *,
poll_table *);
static int unix_ioctl(struct socket *, unsigned int, unsigned long);
static int unix_shutdown(struct socket *, int);
+static int unix_setsockopt(struct socket *, int, int,
+ char __user *, unsigned int);
static int unix_stream_sendmsg(struct kiocb *, struct socket *,
struct msghdr *, size_t);
static int unix_stream_recvmsg(struct kiocb *, struct socket *,
@@ -538,7 +540,7 @@ static const struct proto_ops unix_stream_ops = {
.ioctl = unix_ioctl,
.listen = unix_listen,
.shutdown = unix_shutdown,
- .setsockopt = sock_no_setsockopt,
+ .setsockopt = unix_setsockopt,
.getsockopt = sock_no_getsockopt,
.sendmsg = unix_stream_sendmsg,
.recvmsg = unix_stream_recvmsg,
@@ -559,7 +561,7 @@ static const struct proto_ops unix_dgram_ops = {
.ioctl = unix_ioctl,
.listen = sock_no_listen,
.shutdown = unix_shutdown,
- .setsockopt = sock_no_setsockopt,
+ .setsockopt = unix_setsockopt,
.getsockopt = sock_no_getsockopt,
.sendmsg = unix_dgram_sendmsg,
.recvmsg = unix_dgram_recvmsg,
@@ -580,7 +582,7 @@ static const struct proto_ops unix_seqpacket_ops = {
.ioctl = unix_ioctl,
.listen = unix_listen,
.shutdown = unix_shutdown,
- .setsockopt = sock_no_setsockopt,
+ .setsockopt = unix_setsockopt,
.getsockopt = sock_no_getsockopt,
.sendmsg = unix_seqpacket_sendmsg,
.recvmsg = unix_dgram_recvmsg,
@@ -1533,6 +1535,13 @@ out:
}


+static int unix_setsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, unsigned int optlen)
+{
+ return -EOPNOTSUPP;
+}
+
+
static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
struct msghdr *msg, size_t len)
{
--
1.7.1

2010-11-22 18:39:42

by Alban Crequy

Subject: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

unix_find_multicast_recipients() builds an array of recipients. It can either
find the peers of a specific multicast address, or find all the peers of all
multicast groups the sender is part of.

Signed-off-by: Alban Crequy <[email protected]>
---
net/unix/af_unix.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 144 insertions(+), 0 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 2278829..3cc9695 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,15 +114,48 @@
#include <linux/mount.h>
#include <net/checksum.h>
#include <linux/security.h>
+#include <linux/sort.h>

static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
static DEFINE_SPINLOCK(unix_table_lock);
+static DEFINE_SPINLOCK(unix_multicast_lock);
static atomic_long_t unix_nr_socks;

#define unix_sockets_unbound (&unix_socket_table[UNIX_HASH_SIZE])

#define UNIX_ABSTRACT(sk) (unix_sk(sk)->addr->hash != UNIX_HASH_SIZE)

+struct sock_item {
+ struct sock *s;
+ struct sk_buff *skb;
+ int to_deliver;
+};
+
+struct sock_set {
+ int cnt;
+ struct sock_item items[0];
+};
+
+static void kfree_sock_set(struct sock_set *set)
+{
+ int i;
+ for (i = 0 ; i < set->cnt ; i++)
+ sock_put(set->items[i].s);
+ kfree(set);
+}
+
+static int sock_item_compare(const void *_a, const void *_b)
+{
+ const struct sock_item *a = _a;
+ const struct sock_item *b = _b;
+ if (a->s > b->s)
+ return 1;
+ else if (a->s < b->s)
+ return -1;
+ else
+ return 0;
+}
+
#ifdef CONFIG_SECURITY_NETWORK
static void unix_get_secdata(struct scm_cookie *scm, struct sk_buff *skb)
{
@@ -824,6 +857,117 @@ fail:
return NULL;
}

+static int unix_find_multicast_members(struct sock_set *set,
+ int recipient_cnt,
+ struct sock *sender,
+ struct hlist_head *list)
+{
+ struct unix_mcast *node;
+ struct hlist_node *pos;
+ hlist_for_each_entry(node, pos, list,
+ member_node) {
+ if (set->cnt + 1 > recipient_cnt)
+ return -ENOMEM;
+ if (node->member == unix_sk(sender) &&
+ !(node->flags & UNIX_MREQ_LOOPBACK))
+ continue;
+
+ sock_hold(&node->member->sk);
+ set->items[set->cnt].s = &node->member->sk;
+ set->items[set->cnt].skb = NULL;
+ set->items[set->cnt].to_deliver = 1;
+ set->cnt++;
+ }
+ return 0;
+}
+
+/* Find the recipients for a message sent by 'sender' to 'addr'. If 'dest' is
+ * NULL, the recipients are peers of all subscribed groups.
+ */
+static struct sock_set *unix_find_multicast_recipients(struct sock *sender,
+ struct sock *dest,
+ int *err)
+{
+ struct unix_sock *u = unix_sk(sender);
+ struct unix_mcast *node;
+ struct hlist_node *pos;
+ struct sock_set *set;
+ int recipient_cnt;
+
+ /* We cannot allocate in the spin lock. First, count the recipients */
+try_again:
+ spin_lock(&unix_multicast_lock);
+ if (dest != NULL) {
+ if (unix_sk(dest)->is_mcast_addr) {
+ recipient_cnt = unix_sk(dest)->mcast_members_cnt;
+ } else {
+ recipient_cnt = 1;
+ }
+ } else {
+ recipient_cnt = 0;
+ hlist_for_each_entry(node, pos, &u->mcast_subscriptions,
+ subscription_node) {
+ recipient_cnt += node->addr->mcast_members_cnt;
+ }
+ }
+ spin_unlock(&unix_multicast_lock);
+
+ /* Allocate for the set and hope the number of recipients does not
+ * change while the lock is released. If it changes, we have to try
+ * again... We allocate a bit more than needed, so if a _few_ members
+ * are added in a multicast group meanwhile, we don't always need to
+ * try again. */
+ recipient_cnt += 5;
+
+ set = kmalloc(sizeof(struct sock_set)
+ + sizeof(struct sock_item) * recipient_cnt,
+ GFP_KERNEL);
+ if (!set) {
+ *err = -ENOMEM;
+ return NULL;
+ }
+ set->cnt = 0;
+
+ spin_lock(&unix_multicast_lock);
+ if (dest && unix_sk(dest)->is_mcast_addr) {
+ /* Message sent to a multicast address */
+ if (unix_find_multicast_members(set, recipient_cnt,
+ sender,
+ &unix_sk(dest)->mcast_members)) {
+ spin_unlock(&unix_multicast_lock);
+ kfree_sock_set(set);
+ goto try_again;
+ }
+ } else if (!dest) {
+ /* Destination not specified, sending to all peers of
+ * subscribed groups */
+ hlist_for_each_entry(node, pos, &u->mcast_subscriptions,
+ subscription_node) {
+ if (unix_find_multicast_members(set, recipient_cnt,
+ sender,
+ &node->addr->mcast_members)) {
+ spin_unlock(&unix_multicast_lock);
+ kfree_sock_set(set);
+ goto try_again;
+ }
+ }
+ } else {
+ /* Message sent to a non-multicast address */
+ BUG_ON(recipient_cnt < 1);
+ set->cnt = 1;
+ sock_hold(dest);
+ set->items[0].s = dest;
+ set->items[0].skb = NULL;
+ set->items[0].to_deliver = 1;
+ }
+ spin_unlock(&unix_multicast_lock);
+
+ /* Keep the array ordered to prevent deadlocks on circular waits */
+ sort(set->items, set->cnt, sizeof(struct sock_item),
+ sock_item_compare, NULL);
+ return set;
+}
+

static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
--
1.7.1

2010-11-22 18:39:54

by Alban Crequy

Subject: [PATCH 5/9] AF_UNIX: Deliver message to several recipients in case of multicast

unix_dgram_sendmsg() implements delivery for both SOCK_DGRAM and
SOCK_SEQPACKET Unix sockets.

The delivery is done atomically: the message is delivered either to all
recipients or to none, even in case of interruptions or errors.

Signed-off-by: Alban Crequy <[email protected]>
---
net/unix/af_unix.c | 247 +++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 188 insertions(+), 59 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3cc9695..9207393 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1553,16 +1553,17 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
{
struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
struct sock *sk = sock->sk;
- struct net *net = sock_net(sk);
struct unix_sock *u = unix_sk(sk);
struct sockaddr_un *sunaddr = msg->msg_name;
- struct sock *other = NULL;
+ struct sock_set *others_set = NULL;
int namelen = 0; /* fake GCC */
int err;
unsigned hash;
struct sk_buff *skb;
+ int i;
long timeo;
struct scm_cookie tmp_scm;
+ int multicast_delivery = !!u->mcast_subscriptions_cnt;

if (NULL == siocb->scm)
siocb->scm = &tmp_scm;
@@ -1580,12 +1581,30 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
if (err < 0)
goto out;
namelen = err;
- } else {
+ } else if (!multicast_delivery) {
+ struct sock *other;
sunaddr = NULL;
err = -ENOTCONN;
other = unix_peer_get(sk);
if (!other)
goto out;
+ err = -ENOMEM;
+ others_set = kmalloc(sizeof(struct sock_set)
+ + sizeof(struct sock_item),
+ GFP_KERNEL);
+ if (!others_set)
+ goto out;
+ others_set->cnt = 1;
+ sock_hold(other);
+ others_set->items[0].s = other;
+ others_set->items[0].skb = NULL;
+ others_set->items[0].to_deliver = 1;
+ } else {
+ sunaddr = NULL;
+ err = -ENOTCONN;
+ others_set = unix_find_multicast_recipients(sk, NULL, &err);
+ if (!others_set)
+ goto out;
}

if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
@@ -1613,90 +1632,200 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

restart:
- if (!other) {
+ if (!others_set) {
+ struct sock *other;
+ struct unix_sock *otheru;
err = -ECONNRESET;
if (sunaddr == NULL)
goto out_free;

- other = unix_find_other(net, sunaddr, namelen, sk->sk_type,
- hash, &err);
- if (other == NULL)
+ other = unix_find_other(sock_net(sk), sunaddr, namelen,
+ sk->sk_type, hash, &err);
+ if (!other)
goto out_free;
+ otheru = unix_sk(other);
+
+ if (otheru->is_mcast_addr) {
+ /* FIXME: we should send to the requested recipient
+ * specified in sendto(...dest_addr) instead of the
+ * recipient specified by setsockopt... */
+ sock_put(other);
+ others_set = unix_find_multicast_recipients(sk, other,
+ &err);
+ if (!others_set)
+ goto out_free;
+ } else {
+ others_set = kmalloc(sizeof(struct sock_set)
+ + sizeof(struct sock_item),
+ GFP_KERNEL);
+ if (!others_set)
+ goto out_free;
+ others_set->cnt = 1;
+ others_set->items[0].s = other;
+ others_set->items[0].skb = NULL;
+ others_set->items[0].to_deliver = 1;
+ }
}

- unix_state_lock(other);
- err = -EPERM;
- if (!unix_may_send(sk, other))
- goto out_unlock;
+ for (i = 0 ; i < others_set->cnt ; i++) {
+ struct sock *cur = others_set->items[i].s;

- if (sock_flag(other, SOCK_DEAD)) {
- /*
- * Check with 1003.1g - what should
- * datagram error
- */
- unix_state_unlock(other);
- sock_put(other);
+ others_set->items[i].skb = skb_clone(skb, GFP_KERNEL);
+ if (!others_set->items[i].skb) {
+ err = -ENOMEM;
+ goto out_free;
+ }
+ skb_set_owner_w(others_set->items[i].skb, sk);
+ }

- err = 0;
- unix_state_lock(sk);
- if (unix_peer(sk) == other) {
- unix_peer(sk) = NULL;
- unix_state_unlock(sk);
+ for (i = 0 ; i < others_set->cnt ; i++) {
+ struct sock *cur = others_set->items[i].s;

- unix_dgram_disconnected(sk, other);
- sock_put(other);
- err = -ECONNREFUSED;
- } else {
- unix_state_unlock(sk);
+ if (!others_set->items[i].to_deliver)
+ continue;
+
+ unix_state_lock(cur);
+ err = -EPERM;
+ if (!multicast_delivery && !unix_may_send(sk, cur)) {
+ others_set->items[i].to_deliver = 0;
+ unix_state_unlock(cur);
+ kfree_skb(others_set->items[i].skb);
+ if (multicast_delivery)
+ continue;
+ else
+ goto out_free;
}

- other = NULL;
- if (err)
- goto out_free;
- goto restart;
+ if (sock_flag(cur, SOCK_DEAD)) {
+ /*
+ * Check with 1003.1g - what should
+ * datagram error
+ */
+ unix_state_unlock(cur);
+
+ err = 0;
+ unix_state_lock(sk);
+ if (unix_peer(sk) == cur) {
+ unix_peer(sk) = NULL;
+ unix_state_unlock(sk);
+
+ unix_dgram_disconnected(sk, cur);
+ sock_put(cur);
+ err = -ECONNREFUSED;
+ } else {
+ unix_state_unlock(sk);
+ }
+
+ kfree_skb(others_set->items[i].skb);
+ if (err)
+ goto out_free;
+
+ if (multicast_delivery) {
+ others_set->items[i].to_deliver = 0;
+ continue;
+ } else {
+ kfree_sock_set(others_set);
+ others_set = NULL;
+ goto restart;
+ }
+ }
+
+ err = -EPIPE;
+ if (cur->sk_shutdown & RCV_SHUTDOWN) {
+ unix_state_unlock(cur);
+ kfree_skb(others_set->items[i].skb);
+ if (multicast_delivery) {
+ others_set->items[i].to_deliver = 0;
+ continue;
+ } else {
+ goto out_free;
+ }
+ }
+
+ if (sk->sk_type != SOCK_SEQPACKET) {
+ err = security_unix_may_send(sk->sk_socket,
+ cur->sk_socket);
+ if (err) {
+ unix_state_unlock(cur);
+ kfree_skb(others_set->items[i].skb);
+ if (multicast_delivery) {
+ others_set->items[i].to_deliver = 0;
+ continue;
+ } else {
+ goto out_free;
+ }
+ }
+ }
+
+ if (unix_peer(cur) != sk && unix_recvq_full(cur)) {
+ kfree_skb(others_set->items[i].skb);
+
+ if (multicast_delivery) {
+ unix_state_unlock(cur);
+ others_set->items[i].to_deliver = 0;
+ continue;
+ } else {
+ if (!timeo) {
+ unix_state_unlock(cur);
+ err = -EAGAIN;
+ goto out_free;
+ }
+
+ timeo = unix_wait_for_peer(cur, timeo);
+
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
+
+ kfree_sock_set(others_set);
+ others_set = NULL;
+ goto restart;
+ }
+ }
}

- err = -EPIPE;
- if (other->sk_shutdown & RCV_SHUTDOWN)
- goto out_unlock;
+ for (i = 0 ; i < others_set->cnt ; i++) {
+ struct sock *cur = others_set->items[i].s;

- if (sk->sk_type != SOCK_SEQPACKET) {
- err = security_unix_may_send(sk->sk_socket, other->sk_socket);
- if (err)
- goto out_unlock;
+ if (!others_set->items[i].to_deliver)
+ continue;
+
+ if (sock_flag(cur, SOCK_RCVTSTAMP))
+ __net_timestamp(others_set->items[i].skb);
+
+ skb_queue_tail(&cur->sk_receive_queue,
+ others_set->items[i].skb);
}

- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ for (i = 0 ; i < others_set->cnt ; i++) {
+ struct sock *cur = others_set->items[i].s;

- timeo = unix_wait_for_peer(other, timeo);
+ if (!others_set->items[i].to_deliver)
+ continue;

- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ unix_state_unlock(cur);
+ }

- goto restart;
+ for (i = 0 ; i < others_set->cnt ; i++) {
+ struct sock *cur = others_set->items[i].s;
+
+ if (!others_set->items[i].to_deliver)
+ continue;
+
+ cur->sk_data_ready(cur, len);
}

- if (sock_flag(other, SOCK_RCVTSTAMP))
- __net_timestamp(skb);
- skb_queue_tail(&other->sk_receive_queue, skb);
- unix_state_unlock(other);
- other->sk_data_ready(other, len);
- sock_put(other);
+ kfree_skb(skb);
scm_destroy(siocb->scm);
+ if (others_set)
+ kfree_sock_set(others_set);
return len;

-out_unlock:
- unix_state_unlock(other);
out_free:
kfree_skb(skb);
out:
- if (other)
- sock_put(other);
+ if (others_set)
+ kfree_sock_set(others_set);
scm_destroy(siocb->scm);
return err;
}
--
1.7.1

2010-11-22 18:39:57

by Alban Crequy

Subject: [PATCH 6/9] AF_UNIX: Apply Linux Socket Filtering to Unix sockets

Linux Socket Filters can already be attached to Unix sockets with
setsockopt(sockfd, SOL_SOCKET, SO_{ATTACH,DETACH}_FILTER, ...), but the
filter was never consulted on Unix sockets, so it had no effect. This patch
uses sk_filter() to filter buffers before delivery.

Signed-off-by: Alban Crequy <[email protected]>
---
net/unix/af_unix.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9207393..52e2aa2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1669,6 +1669,7 @@ restart:

for (i = 0 ; i < others_set->cnt ; i++) {
struct sock *cur = others_set->items[i].s;
+ unsigned int pkt_len;

others_set->items[i].skb = skb_clone(skb, GFP_KERNEL);
if (!others_set->items[i].skb) {
@@ -1676,6 +1677,13 @@ restart:
goto out_free;
}
skb_set_owner_w(others_set->items[i].skb, sk);
+
+ pkt_len = sk_filter(cur, others_set->items[i].skb);
+ if (pkt_len != 0) {
+ others_set->items[i].to_deliver = 0;
+ kfree_skb(others_set->items[i].skb);
+ continue;
+ }
}

for (i = 0 ; i < others_set->cnt ; i++) {
--
1.7.1

2010-11-22 18:40:00

by Alban Crequy

Subject: [PATCH 7/9] AF_UNIX: Documentation on multicast Unix Sockets

Signed-off-by: Alban Crequy <[email protected]>
---
.../networking/multicast-unix-sockets.txt | 76 ++++++++++++++++++++
1 files changed, 76 insertions(+), 0 deletions(-)
create mode 100644 Documentation/networking/multicast-unix-sockets.txt

diff --git a/Documentation/networking/multicast-unix-sockets.txt b/Documentation/networking/multicast-unix-sockets.txt
new file mode 100644
index 0000000..b9882a1
--- /dev/null
+++ b/Documentation/networking/multicast-unix-sockets.txt
@@ -0,0 +1,76 @@
+Multicast Unix sockets
+======================
+
+Multicast group memberships are stored in struct unix_mcast nodes. A Unix
+socket can join several multicast groups. Struct unix_mcast nodes are doubly
+linked:
+- In (struct unix_sock)->mcast_subscriptions
+- In (struct unix_sock)->mcast_members
+
+Example
+=======
+
+ Addr1 Addr2
+ | |
+ v v
+Socket1 ----> mcast node ----> mcast node
+ |
+ v
+Socket2 ----> mcast node
+ |
+ v
+Socket3 ----> mcast node
+
+
+Addr1 and Addr2 are struct unix_sock with is_mcast_addr set to 1. They are
+bound to a multicast address with:
+ setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, ...).
+
+Socket1, Socket2 and Socket3 are also struct unix_sock. They are associated to
+a multicast address with:
+ setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, ...).
+
+Socket1 joined two multicast groups. Socket2 and Socket3 joined one multicast
+group. The multicast group Addr1 has 3 members. Addr2 has one member.
+
+Atomic delivery and ordering
+============================
+
+Each message sent is delivered atomically to either none of the recipients or
+all the recipients, even with interruptions and errors.
+
+The locking is done to keep the ordering consistent on all recipients. We want
+to avoid the following scenario. Two emitters A and B, and 2 recipients C and
+D:
+
+ C D
+A -------->| | Step 1: A's message is delivered to C
+B -------->| | Step 2: B's message is delivered to C
+B ---------|--->| Step 3: B's message is delivered to D
+A ---------|--->| Step 4: A's message is delivered to D
+
+Although A and B had a list of recipients (C, D) in the same order, C and D
+received the messages in a different order.
+
+
+SOCK_SEQPACKET semantics
+========================
+
+When a connection is performed on a SOCK_SEQPACKET multicast socket, a new
+socket is created and its file descriptor is received with accept(). The new
+socket could join the multicast group from userspace with setsockopt() but
+there would be a race: it could lose the first messages sent by an application
+after connect() returns but before setsockopt() is executed.
+
+To avoid that race, the application should use the flag UNIX_MREQ_AUTOJOIN when
+creating the multicast group.
+
+When several connections are established to a SOCK_SEQPACKET multicast socket,
+the creator of the multicast group using UNIX_MREQ_AUTOJOIN would receive the
+messages several times: one time on each accepted socket. To avoid that, the
+creator of the group may prefer to use UNIX_MREQ_SEND_TO_PEER. Then, the
+accepted socket will not be part of the group but will still receive messages
+from its peer.
+
+
+
--
1.7.1

2010-11-22 18:40:07

by Alban Crequy

Subject: [PATCH 9/9] AF_UNIX: implement poll(POLLOUT) for multicast sockets

When a socket subscribed to a multicast group has its incoming queue full, it
can either block further sends to the multicast group or let the messages be
dropped. The latter is useful for monitoring all messages without slowing down
the traffic.

It is specified with the flag UNIX_MREQ_DROP_WHEN_FULL when the multicast group
is joined.

poll(POLLOUT) is implemented by checking all the receive queues of the
subscribed sockets. If even one of them has its receive queue full and does not
have UNIX_MREQ_DROP_WHEN_FULL, the multicast socket is not writable.
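
For illustration, a minimal userspace sketch (not part of this patch; it
assumes struct unix_mreq and the UNIX_* and SOL_UNIX constants from this
series are visible to userspace, and the abstract group name "example-group"
is made up):

#include <string.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/un.h>

/* A slow monitor joins with UNIX_MREQ_DROP_WHEN_FULL so that its full
 * receive queue never makes the group unwritable for the senders. */
static int join_as_monitor(int sockfd)
{
	struct unix_mreq mreq;

	memset(&mreq, 0, sizeof(mreq));
	mreq.address.sun_family = AF_UNIX;
	strcpy(mreq.address.sun_path + 1, "example-group");
	mreq.flags = UNIX_MREQ_DROP_WHEN_FULL;
	return setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP,
			  &mreq, sizeof(mreq));
}

/* A sender waits for POLLOUT; poll() blocks while any member without
 * UNIX_MREQ_DROP_WHEN_FULL has a full receive queue. */
static int wait_writable(int sockfd)
{
	struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };

	return poll(&pfd, 1, -1);
}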

Signed-off-by: Alban Crequy <[email protected]>
---
include/net/af_unix.h | 5 +++++
net/unix/af_unix.c | 38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index c82b5f8..d18499a 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -59,6 +59,10 @@ struct unix_skb_parms {
/* ON UNIX_JOIN_GROUP: the messages will also be received by the peer */
#define UNIX_MREQ_SEND_TO_PEER 0x04

+/* ON UNIX_JOIN_GROUP: just drop the message instead of blocking if the
+ * receiving queue is full */
+#define UNIX_MREQ_DROP_WHEN_FULL 0x08
+
struct unix_mreq
{
struct sockaddr_un address;
@@ -84,6 +88,7 @@ struct unix_sock {
unsigned int is_mcast_addr : 1;
unsigned int mcast_auto_join : 1;
unsigned int mcast_send_to_peer : 1;
+ unsigned int mcast_drop_when_peer_full : 1;

/* These multicast fields are protected by the global spinlock
* unix_multicast_lock */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index d3d6270..36ee1fe 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -128,7 +128,8 @@ static atomic_long_t unix_nr_socks;
struct sock_item {
struct sock *s;
struct sk_buff *skb;
- int to_deliver;
+ unsigned int to_deliver : 1;
+ unsigned int drop_when_full : 1;
};

struct sock_set {
@@ -876,6 +877,8 @@ static int unix_find_multicast_members(struct sock_set *set,
set->items[set->cnt].s = &node->member->sk;
set->items[set->cnt].skb = NULL;
set->items[set->cnt].to_deliver = 1;
+ set->items[set->cnt].drop_when_full =
+ !!(node->flags & UNIX_MREQ_DROP_WHEN_FULL);
set->cnt++;
}

@@ -886,6 +889,8 @@ static int unix_find_multicast_members(struct sock_set *set,
set->items[set->cnt].s = unix_peer(sender);
set->items[set->cnt].skb = NULL;
set->items[set->cnt].to_deliver = 1;
+ set->items[set->cnt].drop_when_full =
+ unix_sk(sender)->mcast_drop_when_peer_full;
set->cnt++;
}

@@ -970,6 +975,7 @@ try_again:
set->items[0].s = dest;
set->items[0].skb = NULL;
set->items[0].to_deliver = 1;
+ set->items[0].drop_when_full = 0;
}
spin_unlock(&unix_multicast_lock);

@@ -1805,6 +1811,7 @@ restart:
kfree_skb(others_set->items[i].skb);

if (multicast_delivery) {
+ /* FIXME: check drop_when_full */
unix_state_unlock(cur);
others_set->items[i].to_deliver = 0;
continue;
@@ -1957,7 +1964,10 @@ static int unix_mc_join(struct socket *sock, struct unix_mreq *mreq)
node->flags = mreq->flags;

unix_state_lock(sock->sk);
- unix_sk(sock->sk)->mcast_send_to_peer = !!(mreq->flags & UNIX_MREQ_SEND_TO_PEER);
+ unix_sk(sock->sk)->mcast_send_to_peer =
+ !!(mreq->flags & UNIX_MREQ_SEND_TO_PEER);
+ unix_sk(sock->sk)->mcast_drop_when_peer_full =
+ !!(mreq->flags & UNIX_MREQ_DROP_WHEN_FULL);
unix_state_unlock(sock->sk);

spin_lock(&unix_multicast_lock);
@@ -2258,6 +2268,7 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
goto out_unlock;
}

+ /* FIXME: wake up peers on the multicast group too */
wake_up_interruptible_sync_poll(&u->peer_wait,
POLLOUT | POLLWRNORM | POLLWRBAND);

@@ -2613,6 +2624,9 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
{
struct sock *sk = sock->sk, *other;
unsigned int mask, writable;
+ struct sock_set *others;
+ int err = 0;
+ int i;

sock_poll_wait(file, sk_sleep(sk), wait);
mask = 0;
@@ -2652,6 +2666,26 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
}
sock_put(other);
}
+ /*
+ * On multicast sockets, we need to check if the receiving queue is
+ * full on all peers who don't have UNIX_MREQ_DROP_WHEN_FULL.
+ */
+ others = unix_find_multicast_recipients(sk, NULL, &err);
+ if (!others)
+ goto skip_multicast;
+ for (i = 0 ; i < others->cnt ; i++) {
+ if (others->items[i].drop_when_full)
+ continue;
+ if (unix_peer(others->items[i].s) != sk) {
+ sock_poll_wait(file,
+ &unix_sk(others->items[i].s)->peer_wait, wait);
+ if (unix_recvq_full(others->items[i].s))
+ writable = 0;
+ }
+ }
+ kfree_sock_set(others);
+
+skip_multicast:

if (writable)
mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
--
1.7.1

2010-11-22 18:40:47

by Alban Crequy

Subject: [PATCH 8/9] AF_UNIX: add options on multicast connected socket

This adds the autojoin and send-to-peer options for multicast connected
sockets.
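
For illustration, a minimal daemon-side sketch (not part of this patch; it
assumes struct unix_mreq and the UNIX_* and SOL_UNIX constants from this
series are visible to userspace, and the abstract group name "bus-address"
is made up):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

static int run_group_creator(void)
{
	struct unix_mreq mreq;
	int listener, accepted;

	listener = socket(AF_UNIX, SOCK_SEQPACKET, 0);
	if (listener < 0)
		return -1;

	memset(&mreq, 0, sizeof(mreq));
	mreq.address.sun_family = AF_UNIX;
	strcpy(mreq.address.sun_path + 1, "bus-address");
	/* UNIX_MREQ_AUTOJOIN: each accepted socket is a group member as
	 * soon as the peer's connect() succeeds, which closes the
	 * connect()/setsockopt() race described in the documentation. */
	mreq.flags = UNIX_MREQ_AUTOJOIN;
	if (setsockopt(listener, SOL_UNIX, UNIX_CREATE_GROUP,
		       &mreq, sizeof(mreq)) < 0)
		return -1;

	if (listen(listener, 10) < 0)
		return -1;

	/* Peers that join with UNIX_MREQ_SEND_TO_PEER also deliver their
	 * messages to the accepted socket held by this daemon. */
	accepted = accept(listener, NULL, NULL);
	return accepted;
}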

Signed-off-by: Alban Crequy <[email protected]>
---
include/net/af_unix.h | 27 +++++++++++++++++++++------
net/unix/af_unix.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index bf114d5..c82b5f8 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -40,18 +40,31 @@ struct unix_skb_parms {
spin_lock_nested(&unix_sk(s)->lock, \
SINGLE_DEPTH_NESTING)

-#define UNIX_MREQ_LOOPBACK 0x01
+/* UNIX socket options */
+#define UNIX_CREATE_GROUP 1
+#define UNIX_JOIN_GROUP 2
+#define UNIX_LEAVE_GROUP 3
+
+/* Flags on unix_mreq */
+
+/* On UNIX_JOIN_GROUP: the socket will receive its own messages
+ * On UNIX_CREATE_GROUP: the accepted sockets will receive their own messages
+ */
+#define UNIX_MREQ_LOOPBACK 0x01
+
+/* On UNIX_CREATE_GROUP: the accepted socket will be member of the multicast
+ * group */
+#define UNIX_MREQ_AUTOJOIN 0x02
+
+/* ON UNIX_JOIN_GROUP: the messages will also be received by the peer */
+#define UNIX_MREQ_SEND_TO_PEER 0x04
+
struct unix_mreq
{
struct sockaddr_un address;
unsigned int flags;
};

-/* UNIX socket options */
-#define UNIX_CREATE_GROUP 1
-#define UNIX_JOIN_GROUP 2
-#define UNIX_LEAVE_GROUP 3
-
#ifdef __KERNEL__
/* The AF_UNIX socket */
struct unix_sock {
@@ -69,6 +82,8 @@ struct unix_sock {
unsigned int gc_candidate : 1;
unsigned int gc_maybe_cycle : 1;
unsigned int is_mcast_addr : 1;
+ unsigned int mcast_auto_join : 1;
+ unsigned int mcast_send_to_peer : 1;

/* These multicast fields are protected by the global spinlock
* unix_multicast_lock */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 52e2aa2..d3d6270 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -878,6 +878,17 @@ static int unix_find_multicast_members(struct sock_set *set,
set->items[set->cnt].to_deliver = 1;
set->cnt++;
}
+
+ if (unix_peer(sender) && unix_sk(sender)->mcast_send_to_peer) {
+ if (set->cnt + 1 > recipient_cnt)
+ return -ENOMEM;
+ sock_hold(unix_peer(sender));
+ set->items[set->cnt].s = unix_peer(sender);
+ set->items[set->cnt].skb = NULL;
+ set->items[set->cnt].to_deliver = 1;
+ set->cnt++;
+ }
+
return 0;
}

@@ -1226,6 +1237,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
int st;
int err;
long timeo;
+ struct unix_mcast *node = NULL;

err = unix_mkname(sunaddr, addr_len, &hash);
if (err < 0)
@@ -1245,6 +1257,12 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,

err = -ENOMEM;

+ node = kmalloc(sizeof(struct unix_mcast), GFP_KERNEL);
+ if (!node) {
+ err = -ENOMEM;
+ goto out;
+ }
+
/* create new sock for complete connection */
newsk = unix_create1(sock_net(sk), NULL);
if (newsk == NULL)
@@ -1261,6 +1279,8 @@ restart:
if (!other)
goto out;

+ otheru = unix_sk(other);
+
/* Latch state of peer */
unix_state_lock(other);

@@ -1332,6 +1352,21 @@ restart:
goto out_unlock;
}

+ /* Multicast sockets */
+ spin_lock(&unix_multicast_lock);
+ if (otheru->is_mcast_addr && otheru->mcast_auto_join) {
+ node->member = unix_sk(newsk);
+ node->addr = otheru;
+ node->flags = 0;
+
+ hlist_add_head(&node->member_node, &otheru->mcast_members);
+ hlist_add_head(&node->subscription_node,
+ &unix_sk(newsk)->mcast_subscriptions);
+ otheru->mcast_members_cnt++;
+ u->mcast_subscriptions_cnt++;
+ }
+ spin_unlock(&unix_multicast_lock);
+
/* The way is open! Fastly set all the necessary fields... */

sock_hold(sk);
@@ -1341,7 +1376,6 @@ restart:
init_peercred(newsk);
newu = unix_sk(newsk);
newsk->sk_wq = &newu->peer_wq;
- otheru = unix_sk(other);

/* copy address information from listening to new sock*/
if (otheru->addr) {
@@ -1380,6 +1414,8 @@ out_unlock:

out:
kfree_skb(skb);
+ if (node)
+ kfree(node);
if (newsk)
unix_release_sock(newsk, 0);
if (other)
@@ -1868,6 +1904,8 @@ static int unix_mc_create(struct socket *sock, struct unix_mreq *mreq)

unix_state_lock(sock->sk);
unix_sk(sock->sk)->is_mcast_addr = 1;
+ if (mreq->flags & UNIX_MREQ_AUTOJOIN)
+ unix_sk(sock->sk)->mcast_auto_join = 1;
unix_state_unlock(sock->sk);

return 0;
@@ -1918,6 +1956,10 @@ static int unix_mc_join(struct socket *sock, struct unix_mreq *mreq)
node->addr = otheru;
node->flags = mreq->flags;

+ unix_state_lock(sock->sk);
+ unix_sk(sock->sk)->mcast_send_to_peer = !!(mreq->flags & UNIX_MREQ_SEND_TO_PEER);
+ unix_state_unlock(sock->sk);
+
spin_lock(&unix_multicast_lock);
hlist_add_head(&node->member_node, &otheru->mcast_members);
hlist_add_head(&node->subscription_node, &u->mcast_subscriptions);
--
1.7.1

2010-11-22 18:39:40

by Alban Crequy

Subject: [PATCH 3/9] AF_UNIX: create, join and leave multicast groups with setsockopt

Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET Unix sockets.

A userspace application can create a multicast group with:
struct unix_mreq mreq;
mreq.address.sun_family = AF_UNIX;
mreq.address.sun_path[0] = '\0';
strcpy(mreq.address.sun_path + 1, "socket-address");
mreq.flags = 0;

sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));

Then a multicast group can be joined and left with:
ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));
ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));

A socket can be a member of several multicast groups.
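
For illustration, a slightly fuller sketch of these calls (assuming struct
unix_mreq and the UNIX_* and SOL_UNIX constants are visible to userspace;
error handling kept minimal):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	struct unix_mreq mreq;
	int creator, member;

	memset(&mreq, 0, sizeof(mreq));
	mreq.address.sun_family = AF_UNIX;
	mreq.address.sun_path[0] = '\0';	/* abstract namespace */
	strcpy(mreq.address.sun_path + 1, "socket-address");
	mreq.flags = 0;

	creator = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (setsockopt(creator, SOL_UNIX, UNIX_CREATE_GROUP,
		       &mreq, sizeof(mreq)) < 0)
		perror("UNIX_CREATE_GROUP");

	member = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (setsockopt(member, SOL_UNIX, UNIX_JOIN_GROUP,
		       &mreq, sizeof(mreq)) < 0)
		perror("UNIX_JOIN_GROUP");

	/* ... exchange datagrams addressed to the group ... */

	if (setsockopt(member, SOL_UNIX, UNIX_LEAVE_GROUP,
		       &mreq, sizeof(mreq)) < 0)
		perror("UNIX_LEAVE_GROUP");
	return 0;
}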

Signed-off-by: Alban Crequy <[email protected]>
---
include/net/af_unix.h | 31 +++++++
net/unix/af_unix.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 247 insertions(+), 1 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 90c9e28..bf114d5 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -40,6 +40,18 @@ struct unix_skb_parms {
spin_lock_nested(&unix_sk(s)->lock, \
SINGLE_DEPTH_NESTING)

+#define UNIX_MREQ_LOOPBACK 0x01
+struct unix_mreq
+{
+ struct sockaddr_un address;
+ unsigned int flags;
+};
+
+/* UNIX socket options */
+#define UNIX_CREATE_GROUP 1
+#define UNIX_JOIN_GROUP 2
+#define UNIX_LEAVE_GROUP 3
+
#ifdef __KERNEL__
/* The AF_UNIX socket */
struct unix_sock {
@@ -56,8 +68,27 @@ struct unix_sock {
spinlock_t lock;
unsigned int gc_candidate : 1;
unsigned int gc_maybe_cycle : 1;
+ unsigned int is_mcast_addr : 1;
+
+ /* These multicast fields are protected by the global spinlock
+ * unix_multicast_lock */
+ struct hlist_head mcast_subscriptions;
+ struct hlist_head mcast_members;
+ int mcast_subscriptions_cnt;
+ int mcast_members_cnt;
+
struct socket_wq peer_wq;
};
+
+struct unix_mcast
+{
+ struct unix_sock *member;
+ struct unix_sock *addr;
+ unsigned int flags;
+ struct hlist_node subscription_node;
+ struct hlist_node member_node;
+};
+
#define unix_sk(__sk) ((struct unix_sock *)__sk)

#define peer_wait peer_wq.wait
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 6eca106..2278829 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -379,6 +379,9 @@ static int unix_release_sock(struct sock *sk, int embrion)
struct sock *skpair;
struct sk_buff *skb;
int state;
+ struct unix_mcast *node;
+ struct hlist_node *pos;
+ struct hlist_node *pos_tmp;

unix_remove_socket(sk);

@@ -392,6 +395,24 @@ static int unix_release_sock(struct sock *sk, int embrion)
u->mnt = NULL;
state = sk->sk_state;
sk->sk_state = TCP_CLOSE;
+ spin_lock(&unix_multicast_lock);
+ hlist_for_each_entry_safe(node, pos, pos_tmp, &u->mcast_subscriptions,
+ subscription_node) {
+ hlist_del(&node->member_node);
+ hlist_del(&node->subscription_node);
+ node->addr->mcast_members_cnt--;
+ node->member->mcast_subscriptions_cnt--;
+ kfree(node);
+ }
+ hlist_for_each_entry_safe(node, pos, pos_tmp, &u->mcast_members,
+ member_node) {
+ hlist_del(&node->member_node);
+ hlist_del(&node->subscription_node);
+ node->addr->mcast_members_cnt--;
+ node->member->mcast_subscriptions_cnt--;
+ kfree(node);
+ }
+ spin_unlock(&unix_multicast_lock);
unix_state_unlock(sk);

wake_up_interruptible_all(&u->peer_wait);
@@ -631,6 +652,8 @@ static struct sock *unix_create1(struct net *net, struct socket *sock)
atomic_long_set(&u->inflight, 0);
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
+ INIT_HLIST_HEAD(&u->mcast_subscriptions);
+ INIT_HLIST_HEAD(&u->mcast_members);
init_waitqueue_head(&u->peer_wait);
unix_insert_socket(unix_sockets_unbound, sk);
out:
@@ -1535,10 +1558,202 @@ out:
}


+static int unix_mc_create(struct socket *sock, struct unix_mreq *mreq)
+{
+ struct sock *other;
+ int err;
+ unsigned hash;
+ int namelen;
+
+ if (mreq->address.sun_family != AF_UNIX ||
+ mreq->address.sun_path[0] != '\0')
+ return -EINVAL;
+
+ err = unix_mkname(&mreq->address, sizeof(struct sockaddr_un), &hash);
+ if (err < 0)
+ return err;
+
+ namelen = err;
+ other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
+ sock->type, hash, &err);
+ if (other)
+ return -EADDRINUSE;
+
+ err = sock->ops->bind(sock,
+ (struct sockaddr*)&mreq->address,
+ sizeof(struct sockaddr_un));
+ if (err < 0)
+ return err;
+
+ unix_state_lock(sock->sk);
+ unix_sk(sock->sk)->is_mcast_addr = 1;
+ unix_state_unlock(sock->sk);
+
+ return 0;
+}
+
+
+static int unix_mc_join(struct socket *sock, struct unix_mreq *mreq)
+{
+ struct unix_sock *u = unix_sk(sock->sk);
+ struct sock *other;
+ struct unix_sock *otheru;
+ struct unix_mcast *node;
+ int err;
+ unsigned hash;
+ int namelen;
+
+ if (mreq->address.sun_family != AF_UNIX ||
+ mreq->address.sun_path[0] != '\0')
+ return -EINVAL;
+
+ err = unix_autobind(sock);
+ if (err < 0)
+ return err;
+
+ err = unix_mkname(&mreq->address, sizeof(struct sockaddr_un), &hash);
+ if (err < 0)
+ return err;
+
+ namelen = err;
+ other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
+ sock->type, hash, &err);
+ if (!other)
+ return -EINVAL;
+
+ if (other && !unix_sk(other)->is_mcast_addr) {
+ err = -EADDRINUSE;
+ goto sock_put_out;
+ }
+
+ otheru = unix_sk(other);
+
+ node = kmalloc(sizeof(struct unix_mcast), GFP_KERNEL);
+ if (!node) {
+ err = -ENOMEM;
+ goto sock_put_out;
+ }
+ node->member = u;
+ node->addr = otheru;
+ node->flags = mreq->flags;
+
+ spin_lock(&unix_multicast_lock);
+ hlist_add_head(&node->member_node, &otheru->mcast_members);
+ hlist_add_head(&node->subscription_node, &u->mcast_subscriptions);
+ otheru->mcast_members_cnt++;
+ u->mcast_subscriptions_cnt++;
+ spin_unlock(&unix_multicast_lock);
+
+ return 0;
+
+sock_put_out:
+ sock_put(other);
+ return err;
+}
+
+
+static int unix_mc_leave(struct socket *sock, struct unix_mreq *mreq)
+{
+ struct unix_sock *u = unix_sk(sock->sk);
+ struct sock *other;
+ struct unix_sock *otheru;
+ struct unix_mcast *node;
+ struct hlist_node *pos;
+ int err;
+ unsigned hash;
+ int namelen;
+
+ if (mreq->address.sun_family != AF_UNIX ||
+ mreq->address.sun_path[0] != '\0')
+ return -EINVAL;
+
+ err = unix_mkname(&mreq->address, sizeof(struct sockaddr_un), &hash);
+ if (err < 0)
+ return err;
+
+ namelen = err;
+ other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
+ sock->type, hash, &err);
+ if (!other)
+ return -EINVAL;
+
+ otheru = unix_sk(other);
+
+ if (!otheru->is_mcast_addr) {
+ err = -EINVAL;
+ goto sock_put_out;
+ }
+
+ spin_lock(&unix_multicast_lock);
+
+ hlist_for_each_entry(node, pos, &u->mcast_subscriptions,
+ subscription_node) {
+ if (node->addr == otheru)
+ break;
+ }
+
+ if (!pos) {
+ spin_unlock(&unix_multicast_lock);
+ err = -EINVAL;
+ goto sock_put_out;
+ }
+
+ hlist_del(&node->member_node);
+ hlist_del(&node->subscription_node);
+ otheru->mcast_members_cnt--;
+ u->mcast_subscriptions_cnt--;
+ spin_unlock(&unix_multicast_lock);
+ kfree(node);
+ err = 0;
+
+sock_put_out:
+ sock_put(other);
+ return err;
+}
+
+
static int unix_setsockopt(struct socket *sock, int level, int optname,
char __user *optval, unsigned int optlen)
{
- return -EOPNOTSUPP;
+ struct unix_mreq mreq;
+ int err = 0;
+
+ if (level != SOL_UNIX)
+ return -ENOPROTOOPT;
+
+ switch (optname) {
+ case UNIX_CREATE_GROUP:
+ case UNIX_JOIN_GROUP:
+ case UNIX_LEAVE_GROUP:
+ if (optlen < sizeof(struct unix_mreq))
+ return -EINVAL;
+ if (copy_from_user(&mreq, optval, sizeof(struct unix_mreq)))
+ return -EFAULT;
+ break;
+
+ default:
+ break;
+ }
+
+ switch (optname) {
+ case UNIX_CREATE_GROUP:
+ err = unix_mc_create(sock, &mreq);
+ break;
+
+ case UNIX_JOIN_GROUP:
+ err = unix_mc_join(sock, &mreq);
+ break;
+
+ case UNIX_LEAVE_GROUP:
+ err = unix_mc_leave(sock, &mreq);
+ break;
+
+ default:
+ err = -ENOPROTOOPT;
+ break;
+ }
+
+ return err;
}


--
1.7.1

2010-11-22 18:59:51

by David Miller

Subject: Re: [PATCH 3/9] AF_UNIX: create, join and leave multicast groups with setsockopt

From: Alban Crequy <[email protected]>
Date: Mon, 22 Nov 2010 18:36:16 +0000

> + other = unix_find_other(sock_net(sock->sk), &mreq->address, namelen,
> + sock->type, hash, &err);
> + if (other)
> + return -EADDRINUSE;

Leaks 'other'.

2010-11-22 19:04:55

by David Miller

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

From: Alban Crequy <[email protected]>
Date: Mon, 22 Nov 2010 18:36:17 +0000

> unix_find_multicast_recipients() builds an array of recipients. It can either
> find the peers of a specific multicast address, or find all the peers of all
> multicast group the sender is part of.
>
> Signed-off-by: Alban Crequy <[email protected]>

You really should use RCU to lock this stuff; this way sends run
lockless and have fewer worries wrt. the memory allocation. You'll
also only take a spinlock in the write paths which change the
multicast groups, which ought to be rare.

Although, to be honest, you should optimize the case of small numbers of
recipients, in the same way we optimize small numbers of iovecs on
sends. Have an on-stack array that holds a small number of entries
and use that if the set fits, otherwise fall back to dynamic allocation.
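
Something along these lines (a rough sketch with hypothetical names, not from
the patch series; note it uses a separate 'items' pointer instead of the
patch's flexible array, so a small on-stack buffer can back the common case,
similar to what fs/select.c does for poll entries):

#define UNIX_MCAST_ON_STACK 8

struct sock_set {
	int cnt;
	struct sock_item *items;
	struct sock_item inline_items[UNIX_MCAST_ON_STACK];
};

static int sock_set_init(struct sock_set *set, int recipient_cnt)
{
	set->cnt = 0;
	if (recipient_cnt <= UNIX_MCAST_ON_STACK) {
		set->items = set->inline_items;	/* common case: no allocation */
		return 0;
	}
	set->items = kmalloc(sizeof(struct sock_item) * recipient_cnt,
			     GFP_KERNEL);
	return set->items ? 0 : -ENOMEM;
}

static void sock_set_release(struct sock_set *set)
{
	int i;

	for (i = 0; i < set->cnt; i++)
		sock_put(set->items[i].s);
	if (set->items != set->inline_items)
		kfree(set->items);
}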

2010-11-22 19:14:09

by Rémi Denis-Courmont

Subject: Re: [PATCH 7/9] AF_UNIX: Documentation on multicast Unix Sockets

On Monday 22 November 2010 20:36:20, Alban Crequy wrote:
> +Multicast Unix sockets
> +======================
> +
> +Multicast group memberships are stored in struct unix_mcast nodes. An Unix
> +socket can join several multicast groups. Struct unix_mcast nodes are
> doubly +linked:
> +- In (struct unix_sock)->mcast_subscriptions
> +- In (struct unix_sock)->mcast_members

I may be stupid, but I found this whole documentation very confusing, and
likewise the API it tries to describe. Traditionally:
- Senders may or may not be part of the group and are not kept track of.
- Receivers join the group and then receive messages sent to it.
- Loopback defines whether a sender receives its own echo if it sends to a
group that it has joined.
- If connected to a multicast group, messages from the socket are routed to
the group (in the absence of a contradictory socket address). This has no
effect on membership of the multicast group under any circumstance.

You cannot 'listen' or 'accept' on a multicast group.

So I am not entirely clear on what semantics your patchset is following. But
it does not seem like "multicast" to me and therefore does not seem very well
documented :-(

--
Rémi Denis-Courmont
http://www.remlab.net/
http://fi.linkedin.com/in/remidenis

2010-11-22 20:11:28

by Alban Crequy

Subject: Re: [PATCH 7/9] AF_UNIX: Documentation on multicast Unix Sockets

On Mon, 22 Nov 2010 21:07:40 +0200,
Rémi Denis-Courmont <[email protected]> wrote:

> On Monday 22 November 2010 20:36:20, Alban Crequy wrote:
> > +Multicast Unix sockets
> > +======================
> > +
> > +Multicast group memberships are stored in struct unix_mcast nodes.
> > An Unix +socket can join several multicast groups. Struct
> > unix_mcast nodes are doubly +linked:
> > +- In (struct unix_sock)->mcast_subscriptions
> > +- In (struct unix_sock)->mcast_members
>
> I may be stupid, but I found this whole documentation very confusing,
> and so the API it tries to describe. Traditionally:
> - Senders may or not may be part of the group and are not kept track
> of.
> - Receivers join to the group then receive message sent to it.
> - Loopback defines whether a sender receives its own echo if it sends
> to a group that it has joined.
> - If connected to a multicast group, messages from the socket are
> routed to the group (in absence of a contradictoy socket address).
> This has no effect on membership to the multicast group under any
> circumstance.

I keep these traditional properties for multicast on Unix sockets.

> You cannot 'listen' or 'accept' on a multicast group.

Datagram sockets cannot listen() or accept(), but seqpacket sockets can.
I would like multicast to work on seqpacket sockets too. In this case,
there is a central daemon that calls listen(), and accept() returns a new
socket. The central daemon controls the lifetime of the multicast
group and can receive the messages from the peers on the socket
returned by accept() if UNIX_MREQ_SEND_TO_PEER is set.

The accepted socket could join the multicast group (and then receive
messages addressed to the group) with the setsockopt() call, but then
there would be a race: it might not receive the first messages if a
peer calls connect() and sends a message immediately afterwards. connect()
can return in the peer process before the daemon calls accept() and runs
setsockopt(). I added the flag UNIX_MREQ_AUTOJOIN (to be set when
creating the multicast group) to prevent that race.
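
For illustration, the peer side can then be as simple as this sketch (the
abstract address "bus-address" is made up; no setsockopt() is needed on the
peer because the group creator used UNIX_MREQ_AUTOJOIN, so the accepted
socket is already a group member when connect() returns):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

static int connect_and_send(const void *msg, size_t len)
{
	struct sockaddr_un addr;
	int fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path + 1, "bus-address");

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return -1;

	/* Safe to send immediately after connect(). */
	return send(fd, msg, len, 0);
}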

Using connected sockets (seqpacket) is useful for D-Bus because a
central daemon can know when members are connecting and disconnecting
and then emit the D-Bus signal 'NameOwnerChanged'.

> So I am not entirely clear what semantics your patchset is following.
> But it does not seem like "multicast" to me and therefore seems not
> very well documented :-(

I am willing to improve it.

--
Alban

2010-11-22 20:14:56

by Andi Kleen

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

Alban Crequy <[email protected]> writes:

>+static DEFINE_SPINLOCK(unix_multicast_lock);

For D-Bus it's probably OK, but I suspect that for other usages
the global lock in the multicast fast path is going to hurt
sooner or later.

> +
> + /* Allocate for the set and hope the number of recipients does not
> + * change while the lock is released. If it changes, we have to try
> + * again... We allocate a bit more than needed, so if a _few_ members
> + * are added in a multicast group meanwhile, we don't always need to
> + * try again. */
> + recipient_cnt += 5;
> +
> + set = kmalloc(sizeof(struct sock_set)
> + + sizeof(struct sock_item) * recipient_cnt,
> + GFP_KERNEL);

FWIW, for a large number of sockets this will likely run into
memory fragmentation issues. There are various workarounds, like
falling back to vmalloc or using something like flex_arrays.


-Andi
--
[email protected] -- Speaking for myself only.

2010-11-23 15:05:17

by Alban Crequy

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

On Mon, 22 Nov 2010 11:05:19 -0800 (PST),
David Miller <[email protected]> wrote:

> From: Alban Crequy <[email protected]>
> Date: Mon, 22 Nov 2010 18:36:17 +0000
>
> > unix_find_multicast_recipients() builds an array of recipients. It
> > can either find the peers of a specific multicast address, or find
> > all the peers of all multicast group the sender is part of.
> >
> > Signed-off-by: Alban Crequy <[email protected]>
>
> You really should use RCU to lock this stuff, this way sends run
> lockless and have less worries wrt. the memory allocation. You'll
> also only take a spinlock in the write paths which change the
> multicast groups, which ought to be rare.

I understand the benefit of using RCU in order to have lockless sends.

But with RCU I will still have worries about the memory allocation:

- I cannot allocate inside a rcu_read_lock()-rcu_read_unlock() block.

- If I iterate locklessly over the multicast group members with
hlist_for_each_entry_rcu(), new members can be added, so the
array can be allocated with the wrong size and I have to try again
("goto try_again") when this rare case occurs.

- Another idea would be to avoid the allocation completely by inlining
unix_find_multicast_recipients() inside unix_dgram_sendmsg() and
delivering the messages to the recipients while the list is
being iterated locklessly. But I want to provide atomicity of
delivery: the message must be delivered with skb_queue_tail() either
to all the recipients or to none of them in case of interruption or
memory pressure. I don't see how I can achieve that without
iterating several times over the list of recipients, hence the
allocation and the copy into the array. I also want to guarantee the
order of delivery as described in multicast-unix-sockets.txt, and for
this I am taking lots of spinlocks anyway. I don't see how to avoid
that, but I would be happy to be wrong and have a better solution.

> Although to be honest you should optimize the case of small numbers of
> recipients, in the same way we optimize small numbers of iovecs on
> sends. Have an on-stack array that holds a small number of entries
> and use that if the set fits, otherwise dynamic allocation.

To give an idea of the number of members in a multicast group for the
D-Bus use case, I have 90 D-Bus connections on my session bus:

$ dbus-send --print-reply --dest=org.freedesktop.DBus \
/org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep '":'|wc -l
90

In common cases, there should be only a few real recipients (1 or 2?)
after the socket filters eliminate most of them, but
unix_find_multicast_recipients() will still allocate an array of
about that size.

2010-11-23 16:08:46

by Eric Dumazet

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

On Tuesday 23 November 2010 at 15:03 +0000, Alban Crequy wrote:
> On Mon, 22 Nov 2010 11:05:19 -0800 (PST),
> David Miller <[email protected]> wrote:
>
> > From: Alban Crequy <[email protected]>
> > Date: Mon, 22 Nov 2010 18:36:17 +0000
> >
> > > unix_find_multicast_recipients() builds an array of recipients. It
> > > can either find the peers of a specific multicast address, or find
> > > all the peers of all multicast group the sender is part of.
> > >
> > > Signed-off-by: Alban Crequy <[email protected]>
> >
> > You really should use RCU to lock this stuff, this way sends run
> > lockless and have less worries wrt. the memory allocation. You'll
> > also only take a spinlock in the write paths which change the
> > multicast groups, which ought to be rare.
>
> I understand the benefit to use RCU in order to have lockless sends.
>
> But with RCU I will still have worries about the memory allocation:
>
> - I cannot allocate inside a rcu_read_lock()-rcu_read_unlock() block.
>

That's not true.

The same rules as inside a spin_lock() or write_lock() apply.

We already allocate memory inside rcu_read_lock() in the network stack.

> - If I iterate locklessly over the multicast group members with
> hlist_for_each_entry_rcu(), new members can be added, so the
> array can be allocated with the wrong size and I have to try again
> ("goto try_again") when this rare case occurs.

You are allowed to allocate memory to add items while doing your loop
iteration.

Nothing prevents you from using a chain of items, each item holding up to
128 sockets, for example. If full, allocate a new item.

We have such a scheme in poll()/select(), for example:

fs/select.c, function poll_get_entry()

Use a small embedded struct on the stack, and allocate extra items if the
number of fds is too big.

(If you can't allocate memory to hold pointers, chances are you won't be
able to clone skbs anyway. One skb is about 400 bytes.)

If new members are added to the group while you are iterating the list,
they won't receive a copy of the message.

Or just chain the skbs while you clone them, storing the destination socket
in skb->sk... no need for extra memory allocations.

>
> - Another idea would be to avoid completely the allocation by inlining
> unix_find_multicast_recipients() inside unix_dgram_sendmsg() and
> delivering the messages to the recipients as long as the list is
> being iterated locklessly. But I want to provide atomicity of
> delivery: the message must be delivered with skb_queue_tail() either
> to all the recipients or to none of them in case of interruption or
> memory pressure. I don't see how I can achieve that without
> iterating several times on the list of recipients, hence the
> allocation and the copy in the array. I also want to guarantee the
> order of delivery as described in multicast-unix-sockets.txt and for
> this, I am taking lots of spinlocks anyway. I don't see how to avoid
> that, but I would be happy to be wrong and have a better solution.
>


So if one destination has a full receive queue, you want nobody to receive
the message? That seems a bit risky to me, if someone sends SIGSTOP to
one of your processes...



>
> To give an idea of the number of members in a multicast group for the
> D-Bus use case, I have 90 D-Bus connections on my session bus:
>
> $ dbus-send --print-reply --dest=org.freedesktop.DBus \
> /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep '":'|wc -l
> 90
>
> In common cases, there should be only a few real recipients (1 or 2?)
> after the socket filters eliminate most of them, but
> unix_find_multicast_recipients() will still allocate an array of
> about that size.
>

I am not sure that doing 90 skb clones and filtering them one by one is
going to be fast :-(




2010-11-23 16:56:49

by Eric Dumazet

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages


Hmm, I just thought about lockless again, and we had the same multicast
problem on UDP as well.

We are forced to hold a lock (to forbid concurrent deletes), or we might
go through one 'about to be removed' socket and abort the iteration
in the middle. Some sockets would not receive a copy of the message.

(UDP sockets use SLAB_DESTROY_BY_RCU, so we could even have multiple
copies sent to some sockets, if the removed socket is re-inserted at the
front of the chain because of instant reuse.)

To have a truly lockless path, you would need to restart the full scan if
you notice a delete was done during the iteration, possibly using a
sequence number per chain. That would be expensive, because you have to
undo all the socket accumulation (refcounts) done during the lookup
before restarting the whole thing.

To avoid starvation, taking a lock on the second iteration would be
good ;)


2010-11-23 17:49:00

by Alban Crequy

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

On Tue, 23 Nov 2010 17:08:37 +0100,
Eric Dumazet <[email protected]> wrote:

> (...)

Thanks for the explanations

> On Tuesday 23 November 2010 at 15:03 +0000, Alban Crequy wrote:
> >
> > - Another idea would be to avoid completely the allocation by
> > inlining unix_find_multicast_recipients() inside
> > unix_dgram_sendmsg() and delivering the messages to the recipients
> > as long as the list is being iterated locklessly. But I want to
> > provide atomicity of delivery: the message must be delivered with
> > skb_queue_tail() either to all the recipients or to none of them in
> > case of interruption or memory pressure. I don't see how I can
> > achieve that without iterating several times on the list of
> > recipients, hence the allocation and the copy in the array. I also
> > want to guarantee the order of delivery as described in
> > multicast-unix-sockets.txt and for this, I am taking lots of
> > spinlocks anyway. I don't see how to avoid that, but I would be
> > happy to be wrong and have a better solution.
> >
>
>
> So if one destination has a full receive queue, you want nobody
> receive the message ? That seems a bit risky to me, if someone sends
> SIGSTOP to one of your process...

Yes. For the D-Bus usage, I want to have this guarantee. If random
remote procedure calls are lost, it will break applications built on
top of D-Bus with multicast Unix sockets. The current implementation of
D-Bus avoids this problem by having an almost infinite receive queue in
the dbus-daemon process: 1GB. But in the kernel,
/proc/sys/net/unix/max_dgram_qlen is 10 messages by default. Increasing
it a bit will not fix the problem, and increasing it to 1GB is not
reasonable in the kernel.

There are different actions the kernel can take when the queue is full:

1. Block the sender. It is useful for RPC; we don't want random RPCs to
disappear unnoticed.
2. Drop the message for recipients with a full queue. It could be
acceptable for some slow monitoring tools that don't want to disturb
the applications.
3. Close the receiving socket as a punishment. At least the problem does
not go unnoticed and the user gets some error feedback.

I was thinking of making it configurable when a socket joins a multicast
group, so different multicast group members would behave differently.
The flag UNIX_MREQ_DROP_WHEN_FULL is there for that (but it is not fully
implemented in the patchset).

It makes things more complex for poll(POLLOUT). Before the buffer
reaches the kernel, the socket filters cannot run, so it is not
possible to know the exact recipients. So poll(POLLOUT) has to block
as soon as even one receive queue is full (unless that multicast
member has the flag UNIX_MREQ_DROP_WHEN_FULL).

When the peers install socket filters and there are two flows of messages,
from A to B and from C to D, a full receive queue on D will
also block the communication from A to B: poll(A, POLLOUT) will
block. This is annoying, but I don't see how to fix it.

> > To give an idea of the number of members in a multicast group for
> > the D-Bus use case, I have 90 D-Bus connections on my session bus:
> >
> > $ dbus-send --print-reply --dest=org.freedesktop.DBus \
> > /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep '":'|wc
> > -l 90
> >
> > In common cases, there should be only a few real recipients (1 or
> > 2?) after the socket filters eliminate most of them, but
> > unix_find_multicast_recipients() will still allocate an array of
> > about that size.
> >
>
> I am not sure if doing 90 clones of skb and filtering them one by one
> is going to be fast :-(

Yes... I think it can be optimized. Run the socket filter first by
calling sk_run_filter() directly, and then call skb_clone() + pskb_trim()
only for the few remaining sockets.
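
A rough sketch of that optimization (hypothetical helper, not part of the
patchset; it follows the current filter API as I understand it, where
sk_filter() is basically sk_run_filter() plus pskb_trim()):

static struct sk_buff *unix_mcast_filter_then_clone(struct sock *sk,
						    struct sk_buff *skb)
{
	struct sk_filter *filter;
	unsigned int pkt_len = skb->len;
	struct sk_buff *clone;

	/* Run the filter on the shared skb without modifying it. */
	rcu_read_lock_bh();
	filter = rcu_dereference_bh(sk->sk_filter);
	if (filter)
		pkt_len = sk_run_filter(skb, filter->insns, filter->len);
	rcu_read_unlock_bh();

	if (pkt_len == 0)
		return NULL;			/* filter rejected the message */

	/* Clone and trim only for the sockets that accepted it. */
	clone = skb_clone(skb, GFP_KERNEL);
	if (clone && pkt_len < clone->len)
		pskb_trim(clone, pkt_len);
	return clone;
}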

2010-11-23 18:39:25

by David Miller

Subject: Re: [PATCH 4/9] AF_UNIX: find the recipients for multicast messages

From: Alban Crequy <[email protected]>
Date: Tue, 23 Nov 2010 17:47:01 +0000

> On Tue, 23 Nov 2010 17:08:37 +0100,
> Eric Dumazet <[email protected]> wrote:
>> I am not sure if doing 90 clones of skb and filtering them one by one
>> is going to be fast :-(
>
> Yes... I think it can be optimized. Run the socket filter first by
> calling sk_run_filter() directly and then call skb_clone() + pskb_trim()
> only on the few remaining sockets.

BTW, we have, and have talked about, the exact same problem with
AF_PACKET socket users such as DHCP.

We clone and push the packet down into the AF_PACKET protocol
code from the packet_type callback when 99% of the time the socket
filter doesn't match, and thus the clone is completely wasted
work.

If we know the socket, or more specifically the filter, early enough,
we could have a special interface like:

struct sk_buff *skb_filter_or_clone(struct sk_buff *skb, ...)

which returns a non-NULL cloned SKB only if the filter accepts the
packet.