Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753893Ab2BTQHM (ORCPT ); Mon, 20 Feb 2012 11:07:12 -0500
Received: from bhuna.collabora.co.uk ([93.93.135.160]:43140 "EHLO
	bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753728Ab2BTQHH (ORCPT ); Mon, 20 Feb 2012 11:07:07 -0500
From: Javier Martinez Canillas
To: "David S. Miller"
Cc: Eric Dumazet, Lennart Poettering, Kay Sievers, Alban Crequy,
	Bart Cerneels, Rodrigo Moya, Sjoerd Simons, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 01/10] af_unix: Documentation on multicast unix sockets
Date: Mon, 20 Feb 2012 16:57:26 +0100
Message-Id: <1329753455-1106-2-git-send-email-javier@collabora.co.uk>
X-Mailer: git-send-email 1.7.7.6
In-Reply-To: <1329753455-1106-1-git-send-email-javier@collabora.co.uk>
References: <1329753455-1106-1-git-send-email-javier@collabora.co.uk>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 8153
Lines: 203

From: Alban Crequy

Signed-off-by: Alban Crequy
Reviewed-by: Ian Molton
---
 .../networking/multicast-unix-sockets.txt |  180 ++++++++++++++++++++
 1 files changed, 180 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/multicast-unix-sockets.txt

diff --git a/Documentation/networking/multicast-unix-sockets.txt b/Documentation/networking/multicast-unix-sockets.txt
new file mode 100644
index 0000000..ec9a19c
--- /dev/null
+++ b/Documentation/networking/multicast-unix-sockets.txt
@@ -0,0 +1,180 @@
+Multicast Unix sockets
+======================
+
+Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET Unix sockets.
+
+A userspace application can create a multicast group with:
+
+	struct unix_mreq mreq = {0,};
+	mreq.address.sun_family = AF_UNIX;
+	mreq.address.sun_path[0] = '\0';
+	strcpy(mreq.address.sun_path + 1, "socket-address");
+
+	sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
+	ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast_group, which is reference counted and exists
+as long as the socket that created it exists or the group has at least one
+member.
+
+SOCK_DGRAM sockets can join a multicast group with:
+
+	ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast, which holds the settings of the membership,
+mainly whether loopback is enabled. A socket can be a member of several
+multicast groups.
+
+Since SOCK_SEQPACKET sockets are connection-oriented the semantics are
+different. A client cannot join a group directly; it can only connect, and the
+accepted socket is then used to allow the peer to join the group with:
+
+	ret = setsockopt(groupfd, SOL_UNIX, UNIX_CREATE_GROUP, &val, vallen);
+	ret = listen(groupfd, 10);
+	connfd = accept(groupfd, NULL, NULL);
+	ret = setsockopt(connfd, SOL_UNIX, UNIX_ACCEPT_GROUP, &mreq, sizeof(mreq));
+
+The socket is part of the multicast group until it is released, shut down with
+RCV_SHUTDOWN, or explicitly leaves the group:
+
+	ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));
+
+Struct unix_mcast nodes are linked in two RCU lists:
+- (struct unix_sock)->mcast_subscriptions
+- (struct unix_mcast_group)->mcast_members
+
+             unix_mcast_group unix_mcast_group
+                    |                |
+                    v                v
+unix_sock ----> unix_mcast ----> unix_mcast
+                    |
+                    v
+unix_sock ----> unix_mcast
+                    |
+                    v
+unix_sock ----> unix_mcast
+
+
+SOCK_DGRAM semantics
+====================
+
+          G          The socket which created the group
+       /  |  \
+     P1  P2  P3      The member sockets
+
+Messages sent to the group are received by all members except the sender itself
+unless
the sending socket has UNIX_MREQ_LOOPBACK set.
+
+Non-members can also send to the group socket G and the message will be
+broadcast to the group members; however, socket G itself does not receive
+messages sent to the group through it.
+
+
+SOCK_SEQPACKET semantics
+========================
+
+When a connection is performed on a SOCK_SEQPACKET multicast socket, a new
+socket is created and its file descriptor is returned by accept().
+
+          L          The listening socket
+       /  |  \
+     A1  A2  A3      The accepted sockets
+      |   |   |
+     C1  C2  C3      The connected sockets
+
+Messages sent on the C1 socket are received by:
+- C1 itself if UNIX_MREQ_LOOPBACK is set.
+- The peer socket A1 if UNIX_MREQ_SEND_TO_PEER is set.
+- The other members of the multicast group C2 and C3.
+
+Only members can send to the group in this case.
+
+
+Atomic delivery and ordering
+============================
+
+Each message sent is delivered atomically to either none of the recipients or
+all the recipients, even with interruptions and errors.
+
+Locking is used in order to keep the ordering consistent on all recipients. We
+want to avoid the following scenario, with two emitters, A and B, and two
+recipients, C and D:
+
+           C    D
+A -------->|    |    Step 1: A's message is delivered to C
+B -------->|    |    Step 2: B's message is delivered to C
+B ---------|--->|    Step 3: B's message is delivered to D
+A ---------|--->|    Step 4: A's message is delivered to D
+
+Result: - C received (A, B)
+        - D received (B, A)
+
+Although A and B had a list of recipients (C, D) in the same order, C and D
+received the messages in a different order. To avoid this scenario, we need a
+locking mechanism while the messages are being delivered with skb_queue_tail().
+
+Solution 1:
+The easiest implementation would be to use a global spinlock on the group, but
+it creates avoidable contention, especially when there are two independent
+streams set up with socket filters; e.g. if A sends messages received only by
+C, and B sends messages received only by D.
+
+Solution 2:
+Fine-grained locking could be implemented with a spinlock on each recipient.
+Before delivering the message to the recipients, the sender takes a spinlock on
+each recipient at the same time.
+
+Taking several spinlocks of the same type at once is dangerous and can lead to
+deadlocks. This is prevented by sorting the list of sockets by memory address
+and taking the spinlocks in that order. The ordered list of recipients is
+computed on demand when a message is sent and the list is cached for
+performance. When the group membership changes, the generation of the
+membership is incremented and the ordered recipient list is invalidated.
+
+With this solution, the number of spinlocks taken simultaneously can be
+arbitrarily large. Although this works, it breaks the lockdep mechanism.
+
+Solution 3:
+The current implementation is similar to solution 2 but with a limit on the
+number of spinlocks taken simultaneously (8), so lockdep works fine. A hash
+function and bit array with n=8 specify which spinlocks to take. Contention
+on independent streams can still happen but it is less likely.
+
+
+Flow control
+============
+
+When a socket's receiving queue is full, the default behavior is to block
+senders (or to return -EAGAIN on non-blocking sockets). The socket can also
+join a multicast group with the flag UNIX_MREQ_DROP_WHEN_FULL. In this case,
+messages sent to the group will not be delivered to that socket when its
+receiving queue is full.
+
+Messages are still delivered atomically to all members who don't have the flag
+UNIX_MREQ_DROP_WHEN_FULL. If send() returns -EAGAIN, nobody received the
+message. If send() blocks because of one member, the other members don't
+receive the message until all sockets (except those with
+UNIX_MREQ_DROP_WHEN_FULL set) can receive at the same time.
+
+poll/epoll/select behave consistently for POLLOUT events: they block if
+at least one member of the multicast group without UNIX_MREQ_DROP_WHEN_FULL has
+a full receiving queue.
+
+
+Multicast socket reference counting
+===================================
+
+A poller for POLLOUT events can block for any member of the group. The poller
+can use the wait queue "peer_wait" of any member. So it is important that Unix
+sockets are not released before all pollers exit. This is achieved by:
+
+- Incrementing the reference counter of a socket when it joins a multicast
+  group.
+- Decrementing it when the group is destroyed, that is, when all
+  sockets holding a reference on the group have released their reference on
+  the group.
+
+struct unix_mcast_group keeps track of both current members and previous
+members. When a socket leaves a group, it is removed from the members list and
+put in the dead members list. This is done in order to take advantage of RCU
+lists, which reduces lock contention.
--
1.7.7.6
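
For readers following the "Atomic delivery and ordering" section, the
address-ordered locking described under Solution 2 can be sketched in
userspace C. This is only an illustration under stated assumptions, not code
from this patch series: struct recipient, recipient_init(), sort_recipients()
and deliver_to_all() are hypothetical names, a C11 atomic_flag stands in for
the kernel spinlock, a fixed-size int array stands in for the receive queue,
and the recipient list is assumed to contain no duplicates. The cached
recipient list and the membership generation counter are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define QUEUE_CAP 16

/* Hypothetical userspace stand-in for a recipient socket (not kernel code). */
struct recipient {
	atomic_flag lock;		/* minimal spinlock */
	int queue[QUEUE_CAP];		/* toy receive queue */
	size_t queue_len;
};

static void recipient_init(struct recipient *r)
{
	atomic_flag_clear(&r->lock);
	r->queue_len = 0;
}

static void spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the flag was previously clear */
}

static void spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* qsort() comparator: order recipient pointers by memory address. */
static int cmp_by_address(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(struct recipient *const *)a;
	uintptr_t y = (uintptr_t)*(struct recipient *const *)b;

	return (x > y) - (x < y);
}

/* Sort in place so that every sender takes the locks in the same order. */
static void sort_recipients(struct recipient **rcpts, size_t n)
{
	qsort(rcpts, n, sizeof(*rcpts), cmp_by_address);
}

/*
 * Deliver msg to all recipients or to none. The list must already be sorted
 * by address. Returns 0 on success, -1 if any receive queue is full.
 */
static int deliver_to_all(struct recipient **sorted, size_t n, int msg)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		/* Precondition: strictly increasing addresses. */
		assert(i == 0 ||
		       (uintptr_t)sorted[i - 1] < (uintptr_t)sorted[i]);
		spin_lock(&sorted[i]->lock);
	}

	/* With every lock held, check for room before queueing anything. */
	for (i = 0; i < n; i++) {
		if (sorted[i]->queue_len == QUEUE_CAP) {
			ret = -1;
			goto unlock;
		}
	}
	for (i = 0; i < n; i++)
		sorted[i]->queue[sorted[i]->queue_len++] = msg;

unlock:
	for (i = n; i > 0; i--)
		spin_unlock(&sorted[i - 1]->lock);
	return ret;
}
```

Because two senders sharing recipients always acquire the locks in the same
address order, neither can hold a lock the other needs while waiting for one
the other already holds. Holding all locks across the queueing step is also
what keeps the message order identical on every recipient.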