On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov <[email protected]> wrote:
> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet <[email protected]> wrote:
>> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:
>>
>>> Probably the issue discussed in
>>>
>>> http://thread.gmane.org/gmane.linux.kernel/2057497/
>>>
>>> and previous related threads.
>>>
>>
>> Same issue, but Dmitry apparently did not trust me.
>
> Just wanted to make sure. Let's consider this as fixed.
Is it meant to be fixed now?
I am still seeing use-after-free reports in ep_unregister_pollwait on
d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
On 11/06/2015 08:06 AM, Dmitry Vyukov wrote:
> On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov <[email protected]> wrote:
>> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet <[email protected]> wrote:
>>> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:
>>>
>>>> Probably the issue discussed in
>>>>
>>>> http://thread.gmane.org/gmane.linux.kernel/2057497/
>>>>
>>>> and previous related threads.
>>>>
>>>
>>> Same issue, but Dmitry apparently did not trust me.
>>
>> Just wanted to make sure. Let's consider this as fixed.
>
>
> Is it meant to be fixed now?
>
> I am still seeing use-after-free reports in ep_unregister_pollwait on
> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
>
Hi,
Indeed - no formal patch has been posted for this yet - we've agreed
that Rainer is going to post the patch for this issue.
Thanks,
-Jason
Jason Baron <[email protected]> writes:
> On 11/06/2015 08:06 AM, Dmitry Vyukov wrote:
>> On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov <[email protected]> wrote:
>>> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet <[email protected]> wrote:
>>>> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:
>>>>
>>>>> Probably the issue discussed in
>>>>>
>>>>> http://thread.gmane.org/gmane.linux.kernel/2057497/
>>>>>
>>>>> and previous related threads.
>>>>>
>>>>
>>>> Same issue, but Dmitry apparently did not trust me.
>>>
>>> Just wanted to make sure. Let's consider this as fixed.
>>
>>
>> Is it meant to be fixed now?
>>
>> I am still seeing use-after-free reports in ep_unregister_pollwait on
>> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
>>
>
> Indeed - no formal patch has been posted for this yet - we've agreed
> that Rainer is going to post the patch for this issue.
I should be able to put some more work into this on Sunday or possibly
(but unlikely) later today (supposed to mean that I'll also actually do
that).
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, if the client
socket reconnects, if a dead peer is detected, or if the client socket
is itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
---
"Why do things always end up messy and complicated"? Patch is against
4.3.0.
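
For illustration, a minimal userspace sketch of the kind of sequence the
changelog describes; the socket paths and the exact trigger are
assumptions for illustration only, not a verified reproducer, and error
handling is omitted:

	/* Sketch: epoll leaves a wait queue entry on the old peer's
	 * peer_wait queue; reconnecting drops the last reference to
	 * that peer, which may then be freed while epoll still points
	 * at its queue. */
	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <sys/epoll.h>
	#include <sys/un.h>

	int main(void)
	{
		struct sockaddr_un a = { .sun_family = AF_UNIX }, b = a;
		int srv_a = socket(AF_UNIX, SOCK_DGRAM, 0);
		int srv_b = socket(AF_UNIX, SOCK_DGRAM, 0);
		int cli = socket(AF_UNIX, SOCK_DGRAM, 0);
		int ep = epoll_create1(0);
		struct epoll_event ev = { .events = EPOLLOUT };

		strcpy(a.sun_path, "/tmp/uaf-a");	/* assumed paths */
		strcpy(b.sun_path, "/tmp/uaf-b");
		bind(srv_a, (struct sockaddr *)&a, sizeof(a));
		bind(srv_b, (struct sockaddr *)&b, sizeof(b));

		connect(cli, (struct sockaddr *)&a, sizeof(a));
		/* the old unix_dgram_poll registers on srv_a's peer_wait */
		epoll_ctl(ep, EPOLL_CTL_ADD, cli, &ev);

		close(srv_a);	/* cli's peer reference keeps the sock alive */
		connect(cli, (struct sockaddr *)&b, sizeof(b)); /* drops old peer */

		close(ep);	/* ep_unregister_pollwait touches the freed queue */
		return 0;
	}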
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4f263e3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hits the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* Needs sk unix state lock. Otherwise lockless check if data can (likely)
+ * be sent.
+ */
+static inline int unix_dgram_peer_recv_ready(struct sock *sk,
+ struct sock *other)
+{
+ return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+/* Needs sk unix state lock. After recv_ready indicated not ready,
+ * establish peer_wait connection if still needed.
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1150,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1565,6 +1691,13 @@ restart:
unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1590,10 +1723,14 @@ restart:
goto out_unlock;
}
- if (unix_peer(other) != sk && unix_recvq_full(other)) {
+ if (!unix_dgram_peer_recv_ready(sk, other)) {
if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
+ if (unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
+
+ goto restart;
}
timeo = unix_wait_for_peer(other, timeo);
@@ -2453,14 +2590,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other &&
+ !unix_dgram_peer_recv_ready(sk, other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
From: Rainer Weikusat <[email protected]>
Date: Mon, 09 Nov 2015 14:40:48 +0000
> + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
> + &u->peer_wake);
This is more simply:
__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);
> +static inline int unix_dgram_peer_recv_ready(struct sock *sk,
> + struct sock *other)
Please do not use the inline keyword in foo.c files, let the compiler
decide.
Thanks.
On 11/09/2015 09:40 AM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep even though none of the messages presently enqueued on the
> server receive queue were sent by them. In order to ensure that these
> will be woken up once space becomes available again, the present
> unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed even though "the
> polling mechanism" (in particular, epoll) still holds a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via the wake function. The connection to the peer wait queue is
> dissolved again if a wake up is about to be relayed, if the client
> socket reconnects, if a dead peer is detected, or if the client socket
> is itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
>
> Signed-off-by: Rainer Weikusat <[email protected]>
[...]
> @@ -1590,10 +1723,14 @@ restart:
> goto out_unlock;
> }
>
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> + if (!unix_dgram_peer_recv_ready(sk, other)) {
> if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> + if (unix_dgram_peer_wake_me(sk, other)) {
> + err = -EAGAIN;
> + goto out_unlock;
> + }
> +
> + goto restart;
> }
So this will cause 'unix_state_lock(other)' to be called twice in a
row if we 'goto restart' (and hence will softlock the box). It just
needs a 'unix_state_unlock(other);' before the 'goto restart'.
I also tested this patch with a single unix server and 200 client
threads doing roughly epoll() followed by write() until -EAGAIN in a
loop. The throughput for the test was roughly the same as current
upstream, but the cpu usage was a lot higher. I think it's because this
patch takes the server wait queue lock in the _poll() routine, which
causes a lot of contention. The previous patch you posted for this,
where you did not clear the wait queue on every wakeup and thus didn't
need the queue lock in poll() (unless we were adding to it), performed
much better.
However, the previous patch which tested better didn't add to the remote
queue when it was full on sendmsg() - so it wouldn't be correct for
epoll ET. Adding to the remote queue for every sendmsg() that fails does
seem undesirable if we aren't even doing poll(). So I'm not sure if just
going back to the previous patch is a great option either. I'm also not
sure how realistic the test case I have is. It would be great if we had
some other workloads to test against.
Thanks,
-Jason
David Miller <[email protected]> writes:
> From: Rainer Weikusat <[email protected]>
> Date: Mon, 09 Nov 2015 14:40:48 +0000
>
>> + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
>> + &u->peer_wake);
>
> This is more simply:
>
> __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);
Thank you for pointing this out.
Jason Baron <[email protected]> writes:
> On 11/09/2015 09:40 AM, Rainer Weikusat wrote:
[...]
>> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> + if (!unix_dgram_peer_recv_ready(sk, other)) {
>> if (!timeo) {
>> - err = -EAGAIN;
>> - goto out_unlock;
>> + if (unix_dgram_peer_wake_me(sk, other)) {
>> + err = -EAGAIN;
>> + goto out_unlock;
>> + }
>> +
>> + goto restart;
>> }
>
>
> So this will cause 'unix_state_lock(other)' to be called twice in a
> row if we 'goto restart' (and hence will softlock the box). It just
> needs a 'unix_state_unlock(other);' before the 'goto restart'.
The goto restart was nonsense to begin with in this code path:
restarting is necessary after sleeping for some time, but in the case
above, execution just continues. I've changed that (updated patch
should follow 'soon') to
	if (!unix_dgram_peer_recv_ready(sk, other)) {
		if (timeo) {
			timeo = unix_wait_for_peer(other, timeo);

			err = sock_intr_errno(timeo);
			if (signal_pending(current))
				goto out_free;

			goto restart;
		}

		if (unix_dgram_peer_wake_me(sk, other)) {
			err = -EAGAIN;
			goto out_unlock;
		}
	}
> I also tested this patch with a single unix server and 200 client
> threads doing roughly epoll() followed by write() until -EAGAIN in a
> loop. The throughput for the test was roughly the same as current
> upstream, but the cpu usage was a lot higher. I think its b/c this patch
> takes the server wait queue lock in the _poll() routine. This causes a
> lot of contention. The previous patch you posted for this where you did
> not clear the wait queue on every wakeup and thus didn't need the queue
> lock in poll() (unless we were adding to it), performed much better.
I'm somewhat unsure what to make of that: the previous patch would also
take the wait queue lock whenever poll was about to return 'not
writable' because of the length of the server receive queue, unless
another thread using the same client socket had already noticed this
and enqueued this same socket. And "hundreds of clients using a
single client socket in order to send data to a single server socket"
doesn't seem very realistic to me.
Also, this code shouldn't normally be executed, as the server should
usually be capable of keeping up with the data sent by clients. If it's
permanently incapable of that, you're effectively performing a
(successful) DDoS against it. Which should result in "high CPU
utilization" in either case. It may be possible to improve this by
tuning/ changing the flow control mechanism. Off the top of my head, I'd
suggest making the queue longer (the default value is 10) and delaying
wake ups until the server has actually caught up, IOW, the receive
queue is empty or almost empty. But this ought to be done with a
different patch.
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, if the client
socket reconnects, if a dead peer is detected, or if the client socket
is itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
---
- use wait_queue_t passed as argument to _relay
- fix possible deadlock and logic error in _dgram_sendmsg by
straightening the control flow ("spaghetti code considered
confusing")
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4297d8e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hits the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ q);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* Needs sk unix state lock. Otherwise lockless check if data can (likely)
+ * be sent.
+ */
+static int unix_dgram_peer_recv_ready(struct sock *sk,
+ struct sock *other)
+{
+ return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+/* Needs sk unix state lock. After recv_ready indicated not ready,
+ * establish peer_wait connection if still needed.
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1150,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1565,6 +1691,13 @@ restart:
unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1590,19 +1723,21 @@ restart:
goto out_unlock;
}
- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (!unix_dgram_peer_recv_ready(sk, other)) {
+ if (timeo) {
+ timeo = unix_wait_for_peer(other, timeo);
- timeo = unix_wait_for_peer(other, timeo);
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ goto restart;
+ }
- goto restart;
+ if (unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}
if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -2453,14 +2588,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other &&
+ !unix_dgram_peer_recv_ready(sk, other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
Hello,
On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams. This implies that prospective writers might be forced to go
> to sleep even though none of the messages presently enqueued on the
> server receive queue were sent by them. In order to ensure that these
> will be woken up once space becomes available again, the present
> unix_dgram_poll
> routine does a second sock_poll_wait call with the peer_wait wait queue
> of the server socket as queue argument (unix_dgram_recvmsg does a wake
> up on this queue after a datagram was received). This is inherently
> problematic because the server socket is only guaranteed to remain alive
> for as long as the client still holds a reference to it. In case the
> connection is dissolved via connect or by the dead peer detection logic
> in unix_dgram_sendmsg, the server socket may be freed even though "the
> polling mechanism" (in particular, epoll) still holds a pointer to the
> corresponding peer_wait queue. There's no way to forcibly deregister a
> wait queue with epoll.
>
> Based on an idea by Jason Baron, the patch below changes the code such
> that a wait_queue_t belonging to the client socket is enqueued on the
> peer_wait queue of the server whenever the peer receive queue full
> condition is detected by either a sendmsg or a poll. A wake up on the
> peer queue is then relayed to the ordinary wait queue of the client
> socket via the wake function. The connection to the peer wait queue is
> dissolved again if a wake up is about to be relayed, if the client
> socket reconnects, if a dead peer is detected, or if the client socket
> is itself closed. This enables removing the second sock_poll_wait from
> unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
> that no blocked writer sleeps forever.
This whole patch seems pretty complicated to me.
Can't we just remove the unix_recvq_full checks altogether and unify
unix_dgram_poll with unix_poll?
If we want to be cautious we could simply make unix_max_dgram_qlen limit
the number of skbs which are in flight from a sending socket. The skb
destructor can then decrement this. This seems much simpler.
Would this work?
Thanks,
Hannes
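
For concreteness, a rough sketch of the in-flight accounting suggested
above; the inflight_dgrams field, the UNIX_MAX_INFLIGHT constant and the
destructor hook are hypothetical, not existing kernel code:

	/* Hypothetical: charge each queued datagram to the sender and
	 * uncharge it in the skb destructor when the receiver frees it.
	 * inflight_dgrams would be a new atomic_t in struct unix_sock. */
	static void unix_dgram_skb_destructor(struct sk_buff *skb)
	{
		struct unix_sock *u = unix_sk(skb->sk);	/* sender */

		if (atomic_dec_return(&u->inflight_dgrams) < UNIX_MAX_INFLIGHT)
			wake_up_interruptible_poll(sk_sleep(&u->sk),
						   POLLOUT | POLLWRNORM |
						   POLLWRBAND);
	}

	static int unix_dgram_charge(struct sock *sk, struct sk_buff *skb)
	{
		struct unix_sock *u = unix_sk(sk);

		if (atomic_inc_return(&u->inflight_dgrams) > UNIX_MAX_INFLIGHT) {
			atomic_dec(&u->inflight_dgrams);
			return -EAGAIN;
		}
		skb->sk = sk;
		skb->destructor = unix_dgram_skb_destructor;
		return 0;
	}

Note that this limits the sender's own in-flight count rather than the
length of the server's receive queue, which is the distinction Rainer's
reply below turns on.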
Hannes Frederic Sowa <[email protected]> writes:
> On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.
[...]
> This whole patch seems pretty complicated to me.
>
> Can't we just remove the unix_recvq_full checks altogether and unify
> unix_dgram_poll with unix_poll?
>
> If we want to be cautious we could simply make unix_max_dgram_qlen limit
> the number of skbs which are in flight from a sending socket. The skb
> destructor can then decrement this. This seems much simpler.
>
> Would this work?
In the way this is intended to work, cf
http://marc.info/?t=115627606000002&r=1&w=2
only if the limit would also apply to sockets which didn't send anything
so far. Which means it'll end up in the exact same situation as before:
Sending something using a certain socket may not be possible because of
data sent by other sockets, so either code trying to send using this
socket ends up busy-waiting for "space again available" even though it's
trying to use select/ poll/ epoll/ $whatnot to get notified of this
condition and sleep until then, or this notification needs to be
propagated to sleeping threads which didn't get to send anything yet.
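
To make the failure mode concrete, the writer pattern at issue looks
roughly like this (illustrative userspace snippet; names are
placeholders):

	/* If poll() reports POLLOUT without regard to the peer's receive
	 * queue, this loop degenerates into spinning on -EAGAIN instead
	 * of sleeping until space is available. */
	#include <errno.h>
	#include <poll.h>
	#include <sys/socket.h>

	static int send_one(int cli, const void *buf, size_t len)
	{
		struct pollfd pfd = { .fd = cli, .events = POLLOUT };

		for (;;) {
			poll(&pfd, 1, -1);	/* may return immediately */
			if (send(cli, buf, len, MSG_DONTWAIT) >= 0)
				return 0;
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return -1;
			/* peer queue still full: busy-wait without a
			 * relayed wakeup */
		}
	}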
Hi Rainer,
> +
> +/* Needs sk unix state lock. After recv_ready indicated not ready,
> + * establish peer_wait connection if still needed.
> + */
> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
> +{
> + int connected;
> +
> + connected = unix_dgram_peer_wake_connect(sk, other);
> +
> + if (unix_recvq_full(other))
> + return 1;
> +
> + if (connected)
> + unix_dgram_peer_wake_disconnect(sk, other);
> +
> + return 0;
> +}
> +
So the comment above this function says 'needs unix state lock', however
the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
in unix_dgram_poll() has the 'sk' lock. So this looks racy.
Also, another tweak on this scheme: instead of calling
'__remove_wait_queue()' in unix_dgram_peer_wake_relay(), we could
simply mark each item in the queue as 'WQ_FLAG_EXCLUSIVE'. Then,
since 'unix_dgram_recvmsg()' does an exclusive wakeup, the queue has
effectively been disabled (minus the first exclusive item in the list,
which can just return if it's marked exclusive). This means that in
dgram_poll(), we add to the list if we have not yet been added, and if
we are already on the list, we do a remove and then an add, clearing the
exclusive flag. Thus, all the waiters that need a wakeup are at the
beginning of the queue, and the disabled ones are at the end with the
'WQ_FLAG_EXCLUSIVE' flag set.
This does make the list potentially long, but if we only walk it to the
point where we do the wakeup, it has no impact. I like the fact that in
this scheme the wakeup doesn't have to call remove against a long list
of waiters - it's just setting the exclusive flag.
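
A rough sketch of that 'disable by marking exclusive' idea (the helper
name is made up; it relies on an exclusive wakeup stopping at the first
WQ_FLAG_EXCLUSIVE entry):

	/* Hypothetical helper: instead of removing the entry on wakeup,
	 * move it to the tail and mark it exclusive, so an exclusive
	 * wakeup (nr_exclusive == 1) never walks past it. */
	static void unix_peer_wake_disable(struct unix_sock *u,
					   struct unix_sock *u_other)
	{
		spin_lock(&u_other->peer_wait.lock);

		u->peer_wake.flags |= WQ_FLAG_EXCLUSIVE;
		__remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
		__add_wait_queue_tail(&u_other->peer_wait, &u->peer_wake);

		spin_unlock(&u_other->peer_wait.lock);
	}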
Thanks,
-Jason
> static inline int unix_writable(struct sock *sk)
> {
> return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
> @@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
> skpair->sk_state_change(skpair);
> sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
> }
> +
> + unix_dgram_peer_wake_disconnect(sk, skpair);
> sock_put(skpair); /* It may now die */
> unix_peer(sk) = NULL;
> }
> @@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
> INIT_LIST_HEAD(&u->link);
> mutex_init(&u->readlock); /* single task reading lock */
> init_waitqueue_head(&u->peer_wait);
> + init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
> unix_insert_socket(unix_sockets_unbound(sk), sk);
> out:
> if (sk == NULL)
> @@ -1031,6 +1150,13 @@ restart:
> if (unix_peer(sk)) {
> struct sock *old_peer = unix_peer(sk);
> unix_peer(sk) = other;
> +
> + if (unix_dgram_peer_wake_disconnect(sk, other))
> + wake_up_interruptible_poll(sk_sleep(sk),
> + POLLOUT |
> + POLLWRNORM |
> + POLLWRBAND);
> +
> unix_state_double_unlock(sk, other);
>
> if (other != old_peer)
> @@ -1565,6 +1691,13 @@ restart:
> unix_state_lock(sk);
> if (unix_peer(sk) == other) {
> unix_peer(sk) = NULL;
> +
> + if (unix_dgram_peer_wake_disconnect(sk, other))
> + wake_up_interruptible_poll(sk_sleep(sk),
> + POLLOUT |
> + POLLWRNORM |
> + POLLWRBAND);
> +
> unix_state_unlock(sk);
>
> unix_dgram_disconnected(sk, other);
> @@ -1590,19 +1723,21 @@ restart:
> goto out_unlock;
> }
>
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> - }
> + if (!unix_dgram_peer_recv_ready(sk, other)) {
> + if (timeo) {
> + timeo = unix_wait_for_peer(other, timeo);
>
> - timeo = unix_wait_for_peer(other, timeo);
> + err = sock_intr_errno(timeo);
> + if (signal_pending(current))
> + goto out_free;
>
> - err = sock_intr_errno(timeo);
> - if (signal_pending(current))
> - goto out_free;
> + goto restart;
> + }
>
> - goto restart;
> + if (unix_dgram_peer_wake_me(sk, other)) {
> + err = -EAGAIN;
> + goto out_unlock;
> + }
> }
>
> if (sock_flag(other, SOCK_RCVTSTAMP))
> @@ -2453,14 +2588,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
> return mask;
>
> writable = unix_writable(sk);
> - other = unix_peer_get(sk);
> - if (other) {
> - if (unix_peer(other) != sk) {
> - sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
> - if (unix_recvq_full(other))
> - writable = 0;
> - }
> - sock_put(other);
> + if (writable) {
> + unix_state_lock(sk);
> +
> + other = unix_peer(sk);
> + if (other &&
> + !unix_dgram_peer_recv_ready(sk, other) &&
> + unix_dgram_peer_wake_me(sk, other))
> + writable = 0;
> +
> + unix_state_unlock(sk);
> }
>
> if (writable)
>
Hi,
On Wed, Nov 11, 2015, at 17:12, Rainer Weikusat wrote:
> Hannes Frederic Sowa <[email protected]> writes:
> > On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
> >> An AF_UNIX datagram socket being the client in an n:1 association with
> >> some server socket is only allowed to send messages to the server if the
> >> receive queue of this socket contains at most sk_max_ack_backlog
> >> datagrams.
>
> [...]
>
> > This whole patch seems pretty complicated to me.
> >
> > Can't we just remove the unix_recvq_full checks altogether and unify
> > unix_dgram_poll with unix_poll?
> >
> > If we want to be cautious we could simply make unix_max_dgram_qlen limit
> > the number of skbs which are in flight from a sending socket. The skb
> > destructor can then decrement this. This seems much simpler.
> >
> > Would this work?
>
> In the way this is intended to work, cf
>
> http://marc.info/?t=115627606000002&r=1&w=2
Oh, I see, we don't limit closed but still referenced sockets. This
actually makes sense given how fd handling is implemented, just as a
range check.
Have you checked if we can somehow deregister the socket in the poll
event framework? You wrote that it does not provide such a function but
maybe it would be easy to add?
Thanks,
Hannes
Jason Baron <[email protected]> writes:
>> +
>> +/* Needs sk unix state lock. After recv_ready indicated not ready,
>> + * establish peer_wait connection if still needed.
>> + */
>> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
>> +{
>> + int connected;
>> +
>> + connected = unix_dgram_peer_wake_connect(sk, other);
>> +
>> + if (unix_recvq_full(other))
>> + return 1;
>> +
>> + if (connected)
>> + unix_dgram_peer_wake_disconnect(sk, other);
>> +
>> + return 0;
>> +}
>> +
>
> So the comment above this function says 'needs unix state lock', however
> the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
> in unix_dgram_poll() has the 'sk' lock. So this looks racy.
That's one thing which is broken with this patch. Judging from a 'quick'
look at the _dgram_sendmsg code, the unix_state_lock(other) will need to
be turned into a unix_state_double_lock(sk, other) and the remaining
code changed accordingly (since all of the checks must be done without
unlocking other).
There's also something else seriously wrong with the present patch: Some
code in unix_dgram_connect presently (with this change) looks like this:
	/*
	 * If it was connected, reconnect.
	 */
	if (unix_peer(sk)) {
		struct sock *old_peer = unix_peer(sk);
		unix_peer(sk) = other;

		if (unix_dgram_peer_wake_disconnect(sk, other))
			wake_up_interruptible_poll(sk_sleep(sk),
						   POLLOUT |
						   POLLWRNORM |
						   POLLWRBAND);

		unix_state_double_unlock(sk, other);

		if (other != old_peer)
			unix_dgram_disconnected(sk, old_peer);
		sock_put(old_peer);
and trying to disconnect from a peer the socket is just being
connected to is - of course - "flowering tomfoolery" (literal
translation of the German "bluehender Bloedsinn") - it needs to
disconnect from old_peer instead.
I'll address the suggestion and send an updated patch "later today" (may
become "early tomorrow"). I have some code addressing both issues, but
that's part of a release of 'our' kernel fork, ie, 3.2.54-based, which
I'll need to do 'soon'.
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, if the client
socket reconnects, if a dead peer is detected, or if the client socket
is itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
---
"Believed to be least buggy version"
- disconnect from former peer in _dgram_connect
- use unix_state_double_lock in _dgram_sendmsg to ensure
recv_ready/ wake_me preconditions are met (noted by Jason
Baron)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..30e7c56 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hits the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ q);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* Preconditions for the following two functions:
+ *
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+
+static int unix_dgram_peer_recv_ready(struct sock *sk,
+ struct sock *other)
+{
+ return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1150,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1548,7 +1674,7 @@ restart:
goto out_free;
}
- unix_state_lock(other);
+ unix_state_double_lock(sk, other);
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
@@ -1562,9 +1688,15 @@ restart:
sock_put(other);
err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1590,21 +1722,27 @@ restart:
goto out_unlock;
}
- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
+ if (timeo) {
+ unix_state_unlock(sk);
- timeo = unix_wait_for_peer(other, timeo);
+ timeo = unix_wait_for_peer(other, timeo);
- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
- goto restart;
+ goto restart;
+ }
+
+ if (unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}
+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,7 +1756,7 @@ restart:
return len;
out_unlock:
- unix_state_unlock(other);
+ unix_state_double_unlock(sk, other);
out_free:
kfree_skb(skb);
out:
@@ -2453,14 +2591,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other &&
+ !unix_dgram_peer_recv_ready(sk, other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
Hannes Frederic Sowa <[email protected]> writes:
> On Wed, Nov 11, 2015, at 17:12, Rainer Weikusat wrote:
>> Hannes Frederic Sowa <[email protected]> writes:
>> > On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> >> An AF_UNIX datagram socket being the client in an n:1 association with
>> >> some server socket is only allowed to send messages to the server if the
>> >> receive queue of this socket contains at most sk_max_ack_backlog
>> >> datagrams.
>>
>> [...]
>>
>> > This whole patch seems pretty complicated to me.
>> >
>> > Can't we just remove the unix_recvq_full checks altogether and unify
>> > unix_dgram_poll with unix_poll?
>> >
>> > If we want to be cautious we could simply make unix_max_dgram_qlen limit
>> > the number of skbs which are in flight from a sending socket. The skb
>> > destructor can then decrement this. This seems much simpler.
>> >
>> > Would this work?
>>
>> In the way this is intended to work, cf
>>
>> http://marc.info/?t=115627606000002&r=1&w=2
>
> Oh, I see, we don't limit closed but still referenced sockets. This
> actually makes sense on how fd handling is implemented, just as a range
> check.
>
> Have you checked if we can somehow deregister the socket in the poll
> event framework? You wrote that it does not provide such a function but
> maybe it would be easy to add?
I thought about this but this would amount to adding a general interface
for the sole purpose of enabling the af_unix code to talk to the
eventpoll code and I don't really like this idea: IMHO, there should be
at least two users (preferably three) before creating any kind of
'abstract interface'. An even more ideal "castle in the air"
(hypothetical) solution would be "change the eventpoll.c code such that
it won't be affected if a wait queue just goes away". That's at least
theoretically possible (although it might not be in practice).
I wouldn't mind doing that (assuming it was possible) if it was just
for the kernels my employer uses because I'm aware of the uses these
will be put to and in control of the corresponding userland code. But
for "general Linux code", changing epoll in order to help the af_unix
code is more potential trouble than it's worth: Exchanging a relatively
unimportant bug in some module for a much more visibly damaging bug in a
central facility would be a bad tradeoff.
On 11/13/2015 01:51 PM, Rainer Weikusat wrote:
[...]
>
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> - }
> + if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
Remind me why the 'unix_peer(sk) == other' is added here? If the remote
is not connected we still want to make sure that we don't overflow the
remote rcv queue, right?
In terms of this added 'double' lock for both sk and other, where
previously we just held the 'other' lock: I think we could continue to
just hold the 'other' lock unless the remote queue is full, so something
like:
	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		bool need_wakeup = false;

		....skipping the blocking case...

		err = -EAGAIN;
		if (!other_connected)
			goto out_unlock;
		unix_state_unlock(other);
		unix_state_lock(sk);

		/* If the remote peer has changed under us, the connect()
		 * will wake up any pending waiter; just return -EAGAIN.
		 */
		if (unix_peer(sk) == other) {
			/* In case we see there is space available,
			 * queue the wakeup and we will try again. This
			 * should be an unlikely condition.
			 */
			if (!unix_dgram_peer_wake_me(sk, other))
				need_wakeup = true;
		}
		unix_state_unlock(sk);

		if (need_wakeup)
			wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
						   POLLWRNORM | POLLWRBAND);
		goto out_free;
	}
So I'm not sure if the 'double' lock really affects any workload, but
the above might be a way to avoid it.
Also - it might be helpful to add a 'Fixes:' tag referencing where this
issue started, in the changelog.
Worth mentioning too is that this patch should improve the polling case
here dramatically, as we currently wake the entire queue on every remote
read even when we have room in the rcv buffer. So this patch will cut
down on the context switching rate considerably compared to what we
currently have.
Thanks,
-Jason
Jason Baron <[email protected]> writes:
> On 11/13/2015 01:51 PM, Rainer Weikusat wrote:
>
> [...]
>
>>
>> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> - if (!timeo) {
>> - err = -EAGAIN;
>> - goto out_unlock;
>> - }
>> + if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
>
> Remind me why the 'unix_peer(sk) == other' is added here? If the remote
> is not connected we still want to make sure that we don't overflow the
> remote rcv queue, right?
Good point. The check is actually wrong there, as the original code would
also check the limit in case of an unconnected send to a socket found
via address lookup. It belongs in the 2nd if (where I originally put
it).
>
> In terms of this added 'double' lock for both sk and other, where
> previously we just held the 'other' lock. I think we could continue to
> just hold the 'other' lock unless the remote queue is full, so something
> like:
>
> if (unix_peer(other) != sk && unix_recvq_full(other)) {
> bool need_wakeup = false;
>
> ....skipping the blocking case...
>
> err = -EAGAIN;
> if (!other_connected)
> goto out_unlock;
> unix_state_unlock(other);
> unix_state_lock(sk);
That was my original idea. The problem with this is that the code
starting after the _lock and running until the unlock in the main code
path has to be executed in one go with the other lock held, as the
results of the tests above this one may become invalid as soon as the
other lock is released. This means that instead of continuing execution
with the send code proper after the block in case other became
receive-ready between the first and the second test (possible because
_dgram_recvmsg does not take the unix state lock), the whole procedure
starting with acquiring the other lock would need to be restarted. Given
sufficiently unfavorable circumstances, this could even turn into an
endless loop which couldn't be interrupted (unless code for this was
added).
[...]
> we currently wake the entire queue on every remote read even when we
> have room in the rcv buffer. So this patch will cut down on ctxt
> switching rate dramatically from what we currently have.
In my opinion, this could be improved by making the throttling mechanism
work like a flip-flop: if the queue length hits the limit after a
_sendmsg, set a "no more applicants" flag blocking future sends until
cleared (checking the flag would replace the present check). After the
receive queue has run empty (or almost empty), _dgram_recvmsg would
clear the flag and do a wakeup. But this should be an independent patch
(if implemented).
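
A rough sketch of that flip-flop idea (the UNIX_NOSPACE bit and both
helpers are hypothetical, not part of any posted patch):

	/* Hypothetical: sendmsg sets the flag when the queue hits the
	 * limit; recvmsg clears it and wakes writers only once the queue
	 * has drained, instead of on every received datagram. */
	#define UNIX_NOSPACE	0	/* bit in a hypothetical u->flags */

	static int unix_dgram_send_allowed(struct sock *other)
	{
		struct unix_sock *u_other = unix_sk(other);

		if (test_bit(UNIX_NOSPACE, &u_other->flags))
			return 0;
		if (unix_recvq_full(other)) {
			set_bit(UNIX_NOSPACE, &u_other->flags);
			return 0;
		}
		return 1;
	}

	static void unix_dgram_recv_drained(struct sock *sk)
	{
		struct unix_sock *u = unix_sk(sk);

		if (skb_queue_len(&sk->sk_receive_queue) <= 1 &&
		    test_and_clear_bit(UNIX_NOSPACE, &u->flags))
			wake_up_interruptible_sync_poll(&u->peer_wait,
							POLLOUT | POLLWRNORM |
							POLLWRBAND);
	}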
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, if the client
socket reconnects, if a dead peer is detected, or if the client socket
is itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
---
- fix logic in _dgram_sendmsg: queue limit also needs to be
enforced for unconnected sends
- drop _recv_ready helper function: I'm usually a big fan of
functional decomposition but in this case, the abstraction
seemed to obscure things rather than making them easier to
understand
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ q);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* preconditions:
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +536,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +772,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1140,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1548,7 +1664,7 @@ restart:
goto out_free;
}
- unix_state_lock(other);
+ unix_state_double_lock(sk, other);
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
@@ -1562,9 +1678,15 @@ restart:
sock_put(other);
err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1591,20 +1713,27 @@ restart:
}
if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (timeo) {
+ unix_state_unlock(sk);
- timeo = unix_wait_for_peer(other, timeo);
+ timeo = unix_wait_for_peer(other, timeo);
- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
- goto restart;
+ goto restart;
+ }
+
+ if (unix_peer(sk) != other ||
+ unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}
+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,7 +1747,7 @@ restart:
return len;
out_unlock:
- unix_state_unlock(other);
+ unix_state_double_unlock(sk, other);
out_free:
kfree_skb(skb);
out:
@@ -2453,14 +2582,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other && unix_peer(other) != sk &&
+ unix_recvq_full(other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll routine does a second sock_poll_wait call with the
peer_wait wait queue of the server socket as queue argument
(unix_dgram_recvmsg does a wake up on this queue after a datagram was
received). This is inherently problematic because the server socket is
only guaranteed to remain alive for as long as the client still holds a
reference to it. In case the connection is dissolved via connect or by
the dead peer detection logic in unix_dgram_sendmsg, the server socket
may be freed even though "the polling mechanism" (in particular, epoll)
still holds a pointer to the corresponding peer_wait queue. There's no
way to forcibly deregister a wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, the client socket
reconnects, a dead peer is detected, or the client socket is itself
closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---
Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
behavior for EPOLLOUT event": This shouldn't be an issue anymore with
this change, even though it restores the "only when writable" behaviour,
as the wake up relay will also be set up once _dgram_sendmsg returned
EAGAIN for a send attempt on an n:1 connected socket.
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ q);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* preconditions:
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +536,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +772,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1140,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1548,7 +1664,7 @@ restart:
goto out_free;
}
- unix_state_lock(other);
+ unix_state_double_lock(sk, other);
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
@@ -1562,9 +1678,15 @@ restart:
sock_put(other);
err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1591,20 +1713,27 @@ restart:
}
if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (timeo) {
+ unix_state_unlock(sk);
- timeo = unix_wait_for_peer(other, timeo);
+ timeo = unix_wait_for_peer(other, timeo);
- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
- goto restart;
+ goto restart;
+ }
+
+ if (unix_peer(sk) != other ||
+ unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}
+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,7 +1747,7 @@ restart:
return len;
out_unlock:
- unix_state_unlock(other);
+ unix_state_double_unlock(sk, other);
out_free:
kfree_skb(skb);
out:
@@ -2453,14 +2582,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other && unix_peer(other) != sk &&
+ unix_recvq_full(other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
On 11/15/2015 01:32 PM, Rainer Weikusat wrote:
>
> That was my original idea. The problem with this is that the code
> starting after the _lock and running until the main code path unlock has
> to be executed in one go with the other lock held as the results of the
> tests above this one may become invalid as soon as the other lock is
> released. This means instead of continuing execution with the send code
> proper after the block in case other became receive-ready between the
> first and the second test (possible because _dgram_recvmsg does not
> take the unix state lock), the whole procedure starting with acquiring
> the other lock would need to be restarted. Given sufficiently unfavorable
> circumstances, this could even turn into an endless loop which couldn't
> be interrupted. (unless code for this was added).
>
hmmm - I think we can avoid it by doing the wakeup from the write path
in the rare case that the queue has emptied - and avoid the double lock. IE:
unix_state_unlock(other);

unix_state_lock(sk);
err = -EAGAIN;
if (unix_peer(sk) == other) {
        unix_dgram_peer_wake_connect(sk, other);
        if (skb_queue_len(&other->sk_receive_queue) == 0)
                need_wakeup = true;
}
unix_state_unlock(sk);
if (need_wakeup)
        wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
                                   POLLWRNORM | POLLWRBAND);
goto out_free;
Thanks,
-Jason
On 11/16/2015 05:28 PM, Rainer Weikusat wrote:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams.
[...]
> Signed-off-by: Rainer Weikusat <[email protected]>
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Hi,
My only comment was about potentially avoiding the double lock in the
write path, otherwise this looks ok to me.
Thanks,
-Jason
Jason Baron <[email protected]> writes:
> On 11/15/2015 01:32 PM, Rainer Weikusat wrote:
>
>>
>> That was my original idea. The problem with this is that the code
>> starting after the _lock and running until the main code path unlock has
>> to be executed in one go with the other lock held as the results of the
>> tests above this one may become invalid as soon as the other lock is
>> released. This means instead of continuing execution with the send code
>> proper after the block in case other became receive-ready between the
>> first and the second test (possible because _dgram_recvmsg does not
>> take the unix state lock), the whole procedure starting with acquiring
>> the other lock would need to be restarted. Given sufficiently unfavorable
>> circumstances, this could even turn into an endless loop which couldn't
>> be interrupted. (unless code for this was added).
>>
>
> hmmm - I think we can avoid it by doing the wakeup from the write path
> in the rare case that the queue has emptied - and avoid the double lock. IE:
>
> unix_state_unlock(other);
>
> unix_state_lock(sk);
> err = -EAGAIN;
> if (unix_peer(sk) == other) {
>         unix_dgram_peer_wake_connect(sk, other);
>         if (skb_queue_len(&other->sk_receive_queue) == 0)
>                 need_wakeup = true;
> }
> unix_state_unlock(sk);
> if (need_wakeup)
>         wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
>                                    POLLWRNORM | POLLWRBAND);
> goto out_free;
This should probably rather be

        if (unix_dgram_peer_wake_connect(sk, other) &&
            skb_queue_len(&other->sk_receive_queue) == 0)
                need_wakeup = 1;

as there's no need to do the wake up if someone else already connected.
And then, the double lock could be avoided at the expense of returning a
gratuitous EAGAIN to the caller and throwing away all of the work
_dgram_sendmsg did so far, eg, allocating an skb, copying the data into
the kernel, and doing all the other checks.
This would enable another thread to do one of the following things in
parallel with the 'locked' part of _dgram_sendmsg:
1) connect sk to a socket != other
2) use sk to send to a socket != other
3) do a shutdown on sk
4) determine write-readyness of sk via poll callback
IMHO, the only thing which could possibly matter is 2) and my suggestion
for this would rather be "use a send socket per sending thread if this
matters to you" than "cause something to fail which could as well
have succeeded".
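For illustration, here's a minimal userspace sketch (hypothetical, not
from any posted patch) of the send loop a robust nonblocking client
needs anyway; under this pattern, a gratuitous EAGAIN merely costs one
extra poll/send round trip:

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Send one datagram on a nonblocking AF_UNIX socket, waiting for
 * writability whenever the kernel reports EAGAIN (spurious or not).
 */
static ssize_t send_dgram(int fd, const void *buf, size_t len)
{
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        ssize_t rc;

        for (;;) {
                rc = send(fd, buf, len, MSG_DONTWAIT);
                if (rc >= 0 || errno != EAGAIN)
                        return rc;

                /* blocks until poll reports the socket writable again
                 * (ie, the peer's receive queue has drained)
                 */
                if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                        return -1;
        }
}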
From: Rainer Weikusat <[email protected]>
Date: Mon, 16 Nov 2015 22:28:40 +0000
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams.
[...]
> Signed-off-by: Rainer Weikusat <[email protected]>
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
So because of a corner case of epoll handling and sender socket release,
every single datagram sendmsg has to do a double lock now?
I do not dispute the correctness of your fix at this point, but that
added cost in the fast path is really too high.
David Miller <[email protected]> writes:
> From: Rainer Weikusat <[email protected]>
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.
[...]
>> Signed-off-by: Rainer Weikusat <[email protected]>
>> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
>
> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.
This leaves only the option of a somewhat incorrect solution, and what
is or isn't acceptable in this respect is difficult to decide. The
basic options would be
- return EAGAIN even if sending became possible (Jason's most
recent suggestions)
- retry sending a limited number of times, eg, once, before
returning EAGAIN, on the grounds that this is nicer to the
application and that redoing all the stuff up to the _lock in
dgram_sendmsg can possibly/ likely be avoided
Which one do you prefer?
Rainer Weikusat <[email protected]> writes:
[...]
> The basic options would be
>
> - return EAGAIN even if sending became possible (Jason's most
> recent suggestions)
>
> - retry sending a limited number of times, eg, once, before
> returning EAGAIN, on the grounds that this is nicer to the
> application and that redoing all the stuff up to the _lock in
> dgram_sendmsg can possibly/ likely be avoided
A third option: Use trylock to acquire the sk lock. If this succeeds,
there's no risk of deadlocking anyone even when acquiring the locks in
the wrong order. This could look as follows (NB: I didn't even compile
this, I just wrote the code to get an idea how complicated it would be):

int need_wakeup;

[...]

need_wakeup = 0;
err = 0;
if (spin_trylock(&unix_sk(sk)->lock)) {
        if (unix_peer(sk) != other ||
            unix_dgram_peer_wake_me(sk, other))
                err = -EAGAIN;
} else {
        err = -EAGAIN;

        unix_state_unlock(other);
        unix_state_lock(sk);

        need_wakeup = unix_peer(sk) != other &&
                      unix_dgram_peer_wake_connect(sk, other) &&
                      skb_queue_len(&other->sk_receive_queue) == 0;
}

unix_state_unlock(sk);

if (err) {
        if (need_wakeup)
                wake_up_interruptible_poll(sk_sleep(sk),
                                           POLLOUT |
                                           POLLWRNORM |
                                           POLLWRBAND);

        goto out_free;
}
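For context on the deadlock the trylock sidesteps: unix_state_double_lock
(which unix_dgram_connect already uses) avoids ABBA deadlock by always
taking the state lock of the lower-addressed socket first. A sketch of
its logic, assuming the 4.3-era implementation:

static void unix_state_double_lock(struct sock *sk1, struct sock *sk2)
{
        if (unlikely(sk1 == sk2) || !sk2) {
                unix_state_lock(sk1);
                return;
        }

        /* lock the lower-addressed socket first so that any two
         * tasks locking the same pair agree on the order
         */
        if (sk1 < sk2) {
                unix_state_lock(sk1);
                unix_state_lock_nested(sk2);
        } else {
                unix_state_lock(sk2);
                unix_state_lock_nested(sk1);
        }
}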
Rainer Weikusat <[email protected]> writes:
[...]
> This leaves only the option of a somewhat incorrect solution and what is
> or isn't acceptable in this respect is somewhat difficult to decide. The
> basic options would be
[...]
> - retry sending a limited number of times, eg, once, before
> returning EAGAIN, on the grounds that this is nicer to the
> application and that redoing all the stuff up to the _lock in
> dgram_sendmsg can possibly/ likely be avoided
Since it's better to have a specific example of something, here's
another 'code sketch' of this option (hopefully with fewer errors this
time; assume an int restart = 0 declared above):

if (unix_peer(other) != sk && unix_recvq_full(other)) {
        int need_wakeup;

        [...]

        need_wakeup = 0;
        err = 0;
        unix_state_unlock(other);
        unix_state_lock(sk);

        if (unix_peer(sk) == other) {
                if (++restart == 2) {
                        need_wakeup = unix_dgram_peer_wake_connect(sk, other) &&
                                      skb_queue_len(&other->sk_receive_queue) == 0;
                        err = -EAGAIN;
                } else if (unix_dgram_peer_wake_me(sk, other))
                        err = -EAGAIN;
        } else
                err = -EAGAIN;

        unix_state_unlock(sk);

        if (err || !restart) {
                if (need_wakeup)
                        wake_up_interruptible_poll(sk_sleep(sk),
                                                   POLLOUT |
                                                   POLLWRNORM |
                                                   POLLWRBAND);

                goto out_free;
        }

        goto restart;
}
I don't particularly like that one either, and to me, the best option
seems to be to return the spurious EAGAIN if taking both locks
unconditionally is not an option, as that's the simplest choice.
David Miller <[email protected]> writes:
> From: Rainer Weikusat <[email protected]>
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1
[...]
> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.
Some more information on this: Running the test program included below
20 times on my 'work' system (otherwise idle, after logging in via VT
with no GUI running; quadcore AMD A10-5700 @ 3393.984 MHz) with a
patched 4.3 resulted in the following throughput statistics[*]:
avg 13.617 M/s
median 13.393 M/s
max 17.14 M/s
min 13.047 M/s
deviation 0.85
I'll try to post the results for 'unpatched' later as I'm also working
on a couple of other things.
[*] I do not use my fingers for counting, hence, these are binary and
not decimal units (eg, 13.617 M/s is 13.617 * 2^20 ≈ 14.28 million
bytes/s).
------------
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
        MSG_SZ = 16,
        MSGS = 1000000
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
        uint64_t u;

        u = tv->tv_sec;
        u *= 1000000;
        return u + tv->tv_usec;
}

int main(void)
{
        struct timeval start, stop;
        uint64_t t_diff;
        double rate;
        int sks[2];
        unsigned remain;
        char buf[MSG_SZ];

        socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

        if (fork() == 0) {
                close(*sks);

                gettimeofday(&start, 0);
                while (read(sks[1], buf, sizeof(buf)) > 0);
                gettimeofday(&stop, 0);

                t_diff = tv2u(&stop);
                t_diff -= tv2u(&start);

                rate = MSG_SZ * MSGS;
                rate /= t_diff;
                rate *= 1000000;
                printf("rate %fM/s\n", rate / (1 << 20));
                fflush(stdout);

                _exit(0);
        }

        close(sks[1]);

        remain = MSGS;
        do write(*sks, msg, sizeof(msg)); while (--remain);

        close(*sks);
        wait(NULL);

        return 0;
}
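A side note on why a SOCK_SEQPACKET pair exercises the code path under
discussion: on AF_UNIX, seqpacket sends are funneled through
unix_dgram_sendmsg, including its flow control. Roughly, paraphrasing
the 4.3-era source (details unverified):

static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
                                  size_t len)
{
        struct sock *sk = sock->sk;
        int err;

        err = sock_error(sk);
        if (err)
                return err;

        if (sk->sk_state != TCP_ESTABLISHED)
                return -ENOTCONN;

        if (msg->msg_namelen)
                msg->msg_namelen = 0;

        /* the dgram send path, including the receive queue flow
         * control measured by this benchmark
         */
        return unix_dgram_sendmsg(sock, msg, len);
}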
Rainer Weikusat <[email protected]> writes:
[...]
> Some more information on this: Running the test program included below
> 20 times on my 'work' system (otherwise idle, after logging in via VT
> with no GUI running; quadcore AMD A10-5700 @ 3393.984 MHz) with a
> patched 4.3 resulted in the following throughput statistics[*]:
Since the results were too variable with only 20 runs, I've also tested
this with 100 runs for three kernels: stock 4.3, 4.3 plus the published
patch, and 4.3 plus the published patch plus the "just return EAGAIN"
modification. The 1st and the 3rd perform about identically for the
test program I used (slightly modified version included below); the 2nd
is markedly slower. This is most easily visible when grouping the
printed data rates (B/s) 'by millions':
stock 4.3
---------
13000000.000-13999999.000 3 (3%)
14000000.000-14999999.000 82 (82%)
15000000.000-15999999.000 15 (15%)
4.3 + patch
-----------
13000000.000-13999999.000 54 (54%)
14000000.000-14999999.000 35 (35%)
15000000.000-15999999.000 7 (7%)
16000000.000-16999999.000 1 (1%)
18000000.000-18999999.000 1 (1%)
22000000.000-22999999.000 2 (2%)
4.3 + modified patch
--------------------
13000000.000-13999999.000 3 (3%)
14000000.000-14999999.000 82 (82%)
15000000.000-15999999.000 14 (14%)
24000000.000-24999999.000 1 (1%)
IMHO, the 3rd option would be the way to go if this was considered an
acceptable option (ie, despite returning spurious errors in 'rare
cases').
modified test program
=====================
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
        MSG_SZ = 16,
        MSGS = 1000000
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
        uint64_t u;

        u = tv->tv_sec;
        u *= 1000000;
        return u + tv->tv_usec;
}

int main(void)
{
        struct timeval start, stop;
        uint64_t t_diff;
        double rate;
        int sks[2];
        unsigned remain;
        char buf[MSG_SZ];

        socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

        if (fork() == 0) {
                close(*sks);

                gettimeofday(&start, 0);
                while (read(sks[1], buf, sizeof(buf)) > 0);
                gettimeofday(&stop, 0);

                t_diff = tv2u(&stop);
                t_diff -= tv2u(&start);

                rate = MSG_SZ * MSGS;
                rate /= t_diff;
                rate *= 1000000;
                printf("%f\n", rate);
                fflush(stdout);

                _exit(0);
        }

        close(sks[1]);

        remain = MSGS;
        do write(*sks, msg, sizeof(msg)); while (--remain);

        close(*sks);
        wait(NULL);

        return 0;
}
Rainer Weikusat <[email protected]> writes:
> Rainer Weikusat <[email protected]> writes:
>
> [...]
>
>> The basic options would be
>>
>> - return EAGAIN even if sending became possible (Jason's most
>> recent suggestions)
>>
>> - retry sending a limited number of times, eg, once, before
>> returning EAGAIN, on the grounds that this is nicer to the
>> application and that redoing all the stuff up to the _lock in
>> dgram_sendmsg can possibly/ likely be avoided
>
> A third option:
A fourth option, and even one that's reasonably simple to implement: In
case other became ready during the checks, drop the other lock, do a
double-lock of sk and other, set a flag variable indicating this and
restart the procedure after the unix_state_lock(other), using the value
of the flag to lock/ unlock sk as needed. Should other still be ready to
receive data, execution can then continue with the 'queue it' code as
the other lock was held all the time this time. Combined with a few
unlikely annotations in places where they're IMHO appropriate, this is
speed-wise comparable to the stock kernel.
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll routine does a second sock_poll_wait call with the
peer_wait wait queue of the server socket as queue argument
(unix_dgram_recvmsg does a wake up on this queue after a datagram was
received). This is inherently problematic because the server socket is
only guaranteed to remain alive for as long as the client still holds a
reference to it. In case the connection is dissolved via connect or by
the dead peer detection logic in unix_dgram_sendmsg, the server socket
may be freed even though "the polling mechanism" (in particular, epoll)
still holds a pointer to the corresponding peer_wait queue. There's no
way to forcibly deregister a wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, the client socket
reconnects, a dead peer is detected, or the client socket is itself
closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---
This has been created around midnight at the end of a work day. I've
read through it a couple of times and found no errors. But I'm not
"ideally awake and attentive" right now.
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..6962ff1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ q);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* preconditions:
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +536,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +772,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1140,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1470,6 +1586,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
struct scm_cookie scm;
int max_level;
int data_len = 0;
+ int sk_locked;
wait_for_unix_gc();
err = scm_send(sock, msg, &scm, false);
@@ -1548,12 +1665,14 @@ restart:
goto out_free;
}
+ sk_locked = 0;
unix_state_lock(other);
+restart_locked:
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
- if (sock_flag(other, SOCK_DEAD)) {
+ if (unlikely(sock_flag(other, SOCK_DEAD))) {
/*
* Check with 1003.1g - what should
* datagram error
@@ -1561,10 +1680,19 @@ restart:
unix_state_unlock(other);
sock_put(other);
+ if (!sk_locked)
+ unix_state_lock(sk);
+
err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1590,21 +1718,35 @@ restart:
goto out_unlock;
}
- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
+ if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+ if (timeo) {
+ timeo = unix_wait_for_peer(other, timeo);
+
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
+
+ goto restart;
+ }
+
+ if (unix_peer(sk) != other ||
+ unix_dgram_peer_wake_me(sk, other)) {
err = -EAGAIN;
goto out_unlock;
}
- timeo = unix_wait_for_peer(other, timeo);
+ if (!sk_locked) {
+ unix_state_unlock(other);
+ unix_state_double_lock(sk, other);
- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
-
- goto restart;
+ sk_locked = 1;
+ goto restart_locked;
+ }
}
+ if (unlikely(sk_locked))
+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,6 +1760,8 @@ restart:
return len;
out_unlock:
+ if (sk_locked)
+ unix_state_unlock(sk);
unix_state_unlock(other);
out_free:
kfree_skb(skb);
@@ -2453,14 +2597,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other && unix_peer(other) != sk &&
+ unix_recvq_full(other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
On 11/19/2015 06:52 PM, Rainer Weikusat wrote:
[...]
> @@ -1590,21 +1718,35 @@ restart:
> goto out_unlock;
> }
>
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> + if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
> + if (timeo) {
> + timeo = unix_wait_for_peer(other, timeo);
> +
> + err = sock_intr_errno(timeo);
> + if (signal_pending(current))
> + goto out_free;
> +
> + goto restart;
> + }
> +
> + if (unix_peer(sk) != other ||
> + unix_dgram_peer_wake_me(sk, other)) {
> err = -EAGAIN;
> goto out_unlock;
> }
Hi,
So here we are calling unix_dgram_peer_wake_me() without the sk lock the first time
through - right? In that case, we can end up registering on the queue of other for
the callback but we might have already connected to a different remote. In that case,
the wakeup will crash if 'sk' has been freed in the meantime.
Thanks,
-Jason
Jason Baron <[email protected]> writes:
> On 11/19/2015 06:52 PM, Rainer Weikusat wrote:
>
> [...]
>
>> @@ -1590,21 +1718,35 @@ restart:
>> goto out_unlock;
>> }
>>
>> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> - if (!timeo) {
>> + if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
>> + if (timeo) {
>> + timeo = unix_wait_for_peer(other, timeo);
>> +
>> + err = sock_intr_errno(timeo);
>> + if (signal_pending(current))
>> + goto out_free;
>> +
>> + goto restart;
>> + }
>> +
>> + if (unix_peer(sk) != other ||
>> + unix_dgram_peer_wake_me(sk, other)) {
>> err = -EAGAIN;
>> goto out_unlock;
>> }
>
> Hi,
>
> So here we are calling unix_dgram_peer_wake_me() without the sk lock the first time
> through - right?
Yes. And this is obviously wrong. I spent most of the 'evening time'
(some people would call that 'night time') testing this and didn't
get to read through it again yet. Thank you for pointing this out. I'll
send an updated patch shortly.
Rainer Weikusat <[email protected]> writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll routine does a second sock_poll_wait call with the
peer_wait wait queue of the server socket as queue argument
(unix_dgram_recvmsg does a wake up on this queue after a datagram was
received). This is inherently problematic because the server socket is
only guaranteed to remain alive for as long as the client still holds a
reference to it. In case the connection is dissolved via connect or by
the dead peer detection logic in unix_dgram_sendmsg, the server socket
may be freed even though "the polling mechanism" (in particular, epoll)
still holds a pointer to the corresponding peer_wait queue. There's no
way to forcibly deregister a wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via the wake function. The connection to the peer wait queue is
dissolved again if a wake up is about to be relayed, the client socket
reconnects, a dead peer is detected, or the client socket is itself
closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <[email protected]>
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---
- uninvert the lock/ check code in _dgram_sendmsg
- introduce a unix_dgram_peer_wake_disconnect_wakeup helper
function as there were two calls with a wakeup immediately
following and two without
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};
static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3d93b0d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,118 @@ found:
return s;
}
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ q);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static void unix_dgram_peer_wake_disconnect(struct sock *sk,
+ struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+}
+
+static void unix_dgram_peer_wake_disconnect_wakeup(struct sock *sk,
+ struct sock *other)
+{
+ unix_dgram_peer_wake_disconnect(sk, other);
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+}
+
+/* preconditions:
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +542,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +778,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1146,8 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+ unix_dgram_peer_wake_disconnect_wakeup(sk, old_peer);
+
unix_state_double_unlock(sk, other);
if (other != old_peer)
@@ -1470,6 +1587,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
struct scm_cookie scm;
int max_level;
int data_len = 0;
+ int sk_locked;
wait_for_unix_gc();
err = scm_send(sock, msg, &scm, false);
@@ -1548,12 +1666,14 @@ restart:
goto out_free;
}
+ sk_locked = 0;
unix_state_lock(other);
+restart_locked:
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
- if (sock_flag(other, SOCK_DEAD)) {
+ if (unlikely(sock_flag(other, SOCK_DEAD))) {
/*
* Check with 1003.1g - what should
* datagram error
@@ -1561,10 +1681,14 @@ restart:
unix_state_unlock(other);
sock_put(other);
+ if (!sk_locked)
+ unix_state_lock(sk);
+
err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+ unix_dgram_peer_wake_disconnect_wakeup(sk, other);
+
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
@@ -1590,21 +1714,38 @@ restart:
goto out_unlock;
}
- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
+ if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+ if (timeo) {
+ timeo = unix_wait_for_peer(other, timeo);
+
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;
+
+ goto restart;
}
- timeo = unix_wait_for_peer(other, timeo);
+ if (!sk_locked) {
+ unix_state_unlock(other);
+ unix_state_double_lock(sk, other);
+ }
- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ if (unix_peer(sk) != other ||
+ unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ sk_locked = 1;
+ goto out_unlock;
+ }
- goto restart;
+ if (!sk_locked) {
+ sk_locked = 1;
+ goto restart_locked;
+ }
}
+ if (unlikely(sk_locked))
+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,6 +1759,8 @@ restart:
return len;
out_unlock:
+ if (sk_locked)
+ unix_state_unlock(sk);
unix_state_unlock(other);
out_free:
kfree_skb(skb);
@@ -2453,14 +2596,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;
writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other && unix_peer(other) != sk &&
+ unix_recvq_full(other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}
if (writable)
Rainer Weikusat <[email protected]> writes:
[AF_UNIX SOCK_DGRAM throughput]
> It may be possible to improve this by tuning/ changing the flow
> control mechanism. Out of my head, I'd suggest making the queue longer
> (the default value is 10) and delaying wake ups until the server
> actually did catch up, IOW, the receive queue is empty or almost
> empty. But this ought to be done with a different patch.
Because I was curious about the effects, I implemented this using a
slightly modified design compared to the one I originally suggested, to
account for the different uses of the 'is the receive queue full'
check. The code uses a datagram-specific checking function,
static int unix_dgram_recvq_full(struct sock const *sk)
{
        struct unix_sock *u;

        u = unix_sk(sk);
        if (test_bit(UNIX_DG_FULL, &u->flags))
                return 1;

        if (!unix_recvq_full(sk))
                return 0;

        __set_bit(UNIX_DG_FULL, &u->flags);
        return 1;
}
which gets called instead of unix_recvq_full for the n:1 datagram
checks, and a
if (test_bit(UNIX_DG_FULL, &u->flags) &&
    !skb_queue_len(&sk->sk_receive_queue)) {
        __clear_bit(UNIX_DG_FULL, &u->flags);
        wake_up_interruptible_sync_poll(&u->peer_wait,
                                        POLLOUT | POLLWRNORM |
                                        POLLWRBAND);
}
in unix_dgram_recvmsg to delay wakeups until the queued datagrams have
been consumed if the queue overflowed before. This has the additional,
nice side effect that wakeups won't ever be done for 1:1 connected
datagram sockets (both SOCK_DGRAM and SOCK_SEQPACKET) where they're of
no use, anyway.
Compared to a 'stock' 4.3 running the test program I posted (supposed to
make the overhead noticeable by sending lots of small messages), the
average number of bytes sent per second increased by about 782,961.79
(ca 764.61K), about 5.32% of the 4.3 number (14,714,579.91), with a
fairly simple code change.
On 11/20/2015 05:07 PM, Rainer Weikusat wrote:
> Rainer Weikusat <[email protected]> writes:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
> receive queue of this socket contains at most sk_max_ack_backlog
> datagrams.
[...]
>
> Signed-off-by: Rainer Weikusat <[email protected]>
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Looks good to me.
Reviewed-by: Jason Baron <[email protected]>
Thanks,
-Jason
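For illustration, here is a minimal sketch of the relay mechanism the
description above refers to: a wake function attached to a wait_queue_t
that the client keeps enqueued on the server's peer_wait queue. The
names peer_wake and unix_dgram_peer_wake_relay follow the posted patch,
but treat the details here as an illustrative sketch rather than the
authoritative code:

	/* Called with the peer_wait queue's spinlock held, as wake
	 * functions always are, so the dequeue below is safe.
	 */
	static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode,
					      int flags, void *key)
	{
		struct unix_sock *u = container_of(q, struct unix_sock,
						   peer_wake);
		wait_queue_head_t *u_sleep;

		/* Dissolve the connection to the peer's wait queue
		 * first; the relay fires at most once per enqueue.
		 */
		__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
				    &u->peer_wake);
		u->peer_wake.private = NULL;

		/* Forward the wakeup to whatever is polling or sleeping
		 * on the client socket itself.
		 */
		u_sleep = sk_sleep(&u->sk);
		if (u_sleep)
			wake_up_interruptible_poll(u_sleep, key);

		return 0;
	}

The key point is that the wakeup is ultimately delivered through the
client's own wait queue, which epoll knows how to detach from, instead
of through the server's peer_wait queue, which it does not.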
From: Rainer Weikusat <[email protected]>
Date: Fri, 20 Nov 2015 22:07:23 +0000
> Rainer Weikusat <[email protected]> writes:
> An AF_UNIX datagram socket being the client in an n:1 association with
> some server socket is only allowed to send messages to the server if the
[...]
> Signed-off-by: Rainer Weikusat <[email protected]>
> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Applied and queued up for -stable, thanks to you and Jason for all of
your hard work on this fix.
David Miller <[email protected]> writes:
> From: Rainer Weikusat <[email protected]>
>> Rainer Weikusat <[email protected]> writes:
>> An AF_UNIX datagram socket being the client in an n:1 association
[...]
> Applied and queued up for -stable,
I'm sorry for this 13th-hour request/suggestion, but while thinking
about a reply to Dmitry, it occurred to me that the restart_locked/
sk_locked logic could be avoided by moving the test for this condition
in front of all the others while leaving the 'act on it' code at the
end, i.e., by reorganizing unix_dgram_sendmsg so that it looks like this:
	unix_state_lock(other);

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		need_wait = 1;

		if (!timeo) {
			unix_state_unlock(other);
			unix_state_double_lock(sk, other);

			if (unix_peer(other) == sk ||
			    (unix_peer(sk) == other &&
			     !unix_dgram_peer_wake_me(sk, other)))
				need_wait = 0;

			unix_state_unlock(sk);
		}
	}

	/* original code here */

	if (need_wait) {
		if (timeo) {
			timeo = unix_wait_for_peer(other, timeo);

			err = sock_intr_errno(timeo);
			if (signal_pending(current))
				goto out_free;

			goto restart;
		}

		err = -EAGAIN;
		goto out_unlock;
	}

	/* original tail here */
This might cause a socket to be enqueued to the peer even though it's
not allowed to send to it, but I don't think this matters much. It is a
less conservative modification, but one which results in simpler code
overall. The kernel I'm currently running has been modified like this
and has 'survived' the usual tests.
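One design note on the sketch above: unix_state_double_lock has to
impose a global order on the two state locks, otherwise two sockets
double-locking each other could deadlock. A rough sketch of such a
helper, modelled on the af_unix one (the details here are assumptions,
not the verbatim kernel source):

	static void unix_state_double_lock(struct sock *sk1,
					   struct sock *sk2)
	{
		/* Lock both sockets in address order so that two CPUs
		 * double-locking the same pair cannot deadlock (ABBA).
		 */
		if (unlikely(sk1 == sk2) || !sk2) {
			unix_state_lock(sk1);
			return;
		}

		if (sk1 < sk2) {
			unix_state_lock(sk1);
			unix_state_lock_nested(sk2);
		} else {
			unix_state_lock(sk2);
			unix_state_lock_nested(sk1);
		}
	}

The _nested variant for the second lock tells lockdep that taking two
locks of the same class here is intentional, not a self-deadlock.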
Rainer Weikusat <[email protected]> writes:
> David Miller <[email protected]> writes:
[...]
> I'm sorry for this 13th hour request/ suggestion but while thinking
> about a reply to Dmitry, it occurred to me that the restart_locked/
> sk_locked logic could be avoided by moving the test for this condition
> in front of all the others while leaving the 'act on it' code at its
> back, ie, reorganize unix_dgram_sendmsg such that it looks like this:
[...]
Just in case this is unclear on its own: if this were considered an
improvement by someone other than me, I could supply either a "complete"
patch with this rearrangement or a cleanup delta patch amending the
previous change.