Hi, Karsten
Thanks for your reply. The previous discussion about the issue of socket
wait queue mismatch in SMC fallback can be referred from:
https://lore.kernel.org/all/[email protected]/
This set of patches includes two RFC patches, they are both aimed to fix
the same issue, the mismatch of socket wait queue in SMC fallback.
In your last reply, I am suggested to add the complete description about
the intention of initial patch in order that readers can understand the
idea behind it. This has been done in "[RFC PATCH net v2 0/2] net/smc: Fix
socket wait queue mismatch issue caused by fallback" of this mail.
Unfortunately, I found a defect later in the solution of the initial patch
or the v2 patch mentioned above. The defect is about fasync_list and related
to 67f562e3e14 ("net/smc: transfer fasync_list in case of fallback").
When user applications use sock_fasync() to insert entries into fasync_list,
the wait queue they operate is smc socket->wq. But in initial patch or
the v2 patch, I swapped sk->sk_wq of smc socket and clcsocket in smc_create(),
thus the sk_data_ready / sk_write_space.. of smc will wake up clcsocket->wq
finally. So the entries added into smc socket->wq.fasync_list won't be woken
up at all before fallback.
So the solution in initial patch or the v2 patch of this mail by swapping
sk->sk_wq of smc socket and clcsocket seems a bad way to fix this issue.
Therefore, I tried another solution by removing the wait queue entries from
smc socket->wq to clcsocket->wq during the fallback, which is described in the
"[RFC PATCH net 2/2] net/smc: Transfer remaining wait queue entries" of this
mail. In our test environment, this patch can fix the fallback issue well.
I am looking forward to hear your opinions. Thank you.
Cheers,
Wen Gu
Wen Gu (2):
net/smc: Fix socket wait queue mismatch issue caused by fallback
net/smc: Transfer remaining wait queue entries
There may be a mismatch of socket wait queue caused by SMC fallback.
When user applications used SMC to replace TCP, they might add
an eppoll_entry into smc socket->wq and expect to be notified if
epoll events like EPOLL_IN/EPOLL_OUT occurred on the smc socket.
However, once a fallback occurred, user applications start to use
clcsocket. So it is clcsocket->wq instead of smc socket->wq which
is woken up when there is data to be processed or sending space
available. Thus the eppoll_entry which was added into smc socket->wq
before fallback does not work anymore.
[if not fallback]
application
^
(poll)
|
smc socket->wq clcsocket->wq
(has eppoll_entry) .
. .
. .
smc socket->sk->sk_wq clcsocket->sk->sk_wq
^ ^
| (data flow) |(tcp handshake in rendezvous)
| |
|-------------------------------------------|
| sk_data_ready / sk_write_space .. |
|-------------------------------------------|
[if fallback]
application <------------------|
(can't poll anything)
|
smc socket->wq clcsocket->wq
(has eppoll_entry) .
. .
. .
smc socket->sk->sk_wq clcsocket->sk->sk_wq
^
|(data flow)
|
|-------------------------------------------|
| sk_data_ready / sk_write_space .. |
|-------------------------------------------|
For example, in nginx/wrk benchmark, this issue will cause an
all-zeros test result:
server: nginx -g 'daemon off;'
client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html
Running 5s test @ http://11.200.15.93/index.html
1 threads and 1 connections
Thread Stats Avg Stdev Max ± Stdev
Latency 0.00us 0.00us 0.00us -nan%
Req/Sec 0.00 0.00 0.00 -nan%
0 requests in 5.00s, 0.00B read
Requests/sec: 0.00
Transfer/sec: 0.00B
The main idea of this patch is that since wait queue is switched
from smc socket->wq to clcsocket->wq after fallback, disabling the
eppoll_entry in smc socket->wq, maybe we can try to use clcsocket->wq
from the beginning.
So this patch set smc socket->sk->sk_wq to clcsocket->wq when smc
socket is created and reset clcsocket->sk->sk_wq to clcsocket->wq
once fallback occurrs, making clcsocket->wq the constant wait queue
that user applications use. Thus the user application can be
notified by the eppoll_entry in clcsocket->wq no matter whether
a fallback occurrs.
After this patch:
[if not fallback]
application <------------------|
(poll)
|
smc socket->wq clcsocket->wq
. (has eppoll_entry)
` . .
` . . `
. ` ` .
` `
smc socket->sk->sk_wq clcsocket->sk->sk_wq
^ ^
|(data flow) |(tcp handshake in rendezvous)
| |
|-------------------------------------------|
| sk_data_ready / sk_write_space .. |
|-------------------------------------------|
[if fallback]
application <------------------|
(poll)
|
smc socket->wq clcsocket->wq
. (has eppoll_entry)
` . .
` . .
` . .
`
smc socket->sk->sk_wq clcsocket->sk->sk_wq
^
|(data flow)
|
|-------------------------------------------|
| sk_data_ready / sk_write_space .. |
|-------------------------------------------|
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
---
v1 -> v2
- Add the complete description about the intention of initial patch.
---
net/smc/af_smc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 0cf7ed2..fe5a2cd 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -566,6 +566,20 @@ static void smc_switch_to_fallback(struct smc_sock *smc, int reason_code)
smc->fallback_rsn = reason_code;
smc_stat_fallback(smc);
trace_smc_switch_to_fallback(smc, reason_code);
+
+ /* We swapped sk->sk_wq of clcsocket and smc socket in smc_create()
+ * to prevent mismatch of wait queue caused by fallback.
+ *
+ * If fallback really occurred, user application starts to use clcsocket.
+ * Accordingly, clcsocket->sk->sk_wq should be reset to clcsocket->wq
+ * in order that user application still uses the same wait queue as what
+ * was used before fallback.
+ *
+ * After fallback, the relation between socket->wq and socket->sk->sk_wq is:
+ * - clcsocket->sk->sk_wq --> clcsocket->wq
+ * - smcsocket->sk->sk_wq --> clcsocket->wq
+ */
+ rcu_assign_pointer(smc->clcsock->sk->sk_wq, &smc->clcsock->wq);
if (smc->sk.sk_socket && smc->sk.sk_socket->file) {
smc->clcsock->file = smc->sk.sk_socket->file;
smc->clcsock->file->private_data = smc->clcsock;
@@ -2174,6 +2188,12 @@ static int smc_accept(struct socket *sock, struct socket *new_sock,
if (rc)
goto out;
+ /* The new accepted smc sock sets sk->sk_wq to clcsocket->wq like what
+ * smc_create() did when the fallback does not occur.
+ */
+ if (!smc_sk(nsk)->use_fallback)
+ rcu_assign_pointer(nsk->sk_wq, &smc_sk(nsk)->clcsock->wq);
+
if (lsmc->sockopt_defer_accept && !(flags & O_NONBLOCK)) {
/* wait till data arrives on the socket */
timeo = msecs_to_jiffies(lsmc->sockopt_defer_accept *
@@ -2310,8 +2330,15 @@ static __poll_t smc_poll(struct file *file, struct socket *sock,
mask = smc->clcsock->ops->poll(file, smc->clcsock, wait);
sk->sk_err = smc->clcsock->sk->sk_err;
} else {
- if (sk->sk_state != SMC_CLOSED)
- sock_poll_wait(file, sock, wait);
+ if (sk->sk_state != SMC_CLOSED) {
+ /* SMC always uses clcsocket->wq for the call to
+ * sock_poll_wait(), which is the same wait queue
+ * as what smc socket->sk->sk_wq points to before
+ * fallback or what clcsocket->sk->sk_wq points to
+ * after fallback.
+ */
+ sock_poll_wait(file, smc->clcsock, wait);
+ }
if (sk->sk_err)
mask |= EPOLLERR;
if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
@@ -2707,6 +2734,31 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
+ /* To ensure that user applications consistently use the same wait queue
+ * before and after fallback, we set smc socket->sk->sk_wq to clcsocket->wq
+ * here and reset clcsocket->sk->sk_wq to clcsocket->wq in
+ * smc_switch_to_fallback() if fallback occurrs, making clcsocket->wq the
+ * constant wait queue which user applications use.
+ *
+ * here:
+ * - Set smc socket->sk->sk_wq to clcsocket->wq
+ * So that the sk_data_ready()/sk_write_space.. will wake up clcsocket->wq
+ * and user applications will be notified of epoll events if they added
+ * eppoll_entry into clcsocket->wq through smc_poll().
+ *
+ * - Set clcsocket->sk->sk_wq to smc socket->wq
+ * Since clcsocket->wq is occupied by smcsocket->sk->sk_wq, clcsocket->
+ * sk->sk_wq have to use another wait queue (smc socket->wq) during TCP
+ * handshake or CLC messages transmission in rendezvous.
+ *
+ * So the relation between socket->wq and socket->sk->sk_wq before fallback is:
+ *
+ * - smc socket->sk->sk_wq --> clcsocket->wq
+ * - clcsocket->sk->sk_wq --> smc socket->wq
+ */
+ rcu_assign_pointer(sk->sk_wq, &smc->clcsock->wq);
+ rcu_assign_pointer(smc->clcsock->sk->sk_wq, &smc->sk.sk_socket->wq);
+
out:
return rc;
}
--
1.8.3.1
The SMC fallback is incomplete currently. There may be some
wait queue entries remaining in smc socket->wq, which should
be removed to clcsocket->wq during the fallback.
For example, in nginx/wrk benchmark, this issue causes an
all-zeros test result:
server: nginx -g 'daemon off;'
client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html
Running 5s test @ http://11.200.15.93/index.html
1 threads and 1 connections
Thread Stats Avg Stdev Max ± Stdev
Latency 0.00us 0.00us 0.00us -nan%
Req/Sec 0.00 0.00 0.00 -nan%
0 requests in 5.00s, 0.00B read
Requests/sec: 0.00
Transfer/sec: 0.00B
The reason for this all-zeros result is that when wrk used SMC
to replace TCP, it added an eppoll_entry into smc socket->wq
and expected to be notified if epoll events like EPOLL_IN/
EPOLL_OUT occurred on the smc socket.
However, once a fallback occurred, wrk switches to use clcsocket.
Now it is clcsocket->wq instead of smc socket->wq which will
be woken up. The eppoll_entry remaining in smc socket->wq does
not work anymore and wrk stops the test.
This patch fixes this issue by removing remaining wait queue
entries from smc socket->wq to clcsocket->wq during the fallback.
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
---
net/smc/af_smc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 0cf7ed2..11a966a 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -562,6 +562,10 @@ static void smc_stat_fallback(struct smc_sock *smc)
static void smc_switch_to_fallback(struct smc_sock *smc, int reason_code)
{
+ wait_queue_head_t *smc_wait = sk_sleep(&smc->sk);
+ wait_queue_head_t *clc_wait = sk_sleep(smc->clcsock->sk);
+ unsigned long flags;
+
smc->use_fallback = true;
smc->fallback_rsn = reason_code;
smc_stat_fallback(smc);
@@ -571,6 +575,16 @@ static void smc_switch_to_fallback(struct smc_sock *smc, int reason_code)
smc->clcsock->file->private_data = smc->clcsock;
smc->clcsock->wq.fasync_list =
smc->sk.sk_socket->wq.fasync_list;
+
+ /* There might be some wait queue entries remaining
+ * in smc socket->wq, which should be removed to
+ * clcsocket->wq during the fallback.
+ */
+ spin_lock_irqsave(&smc_wait->lock, flags);
+ spin_lock_irqsave(&clc_wait->lock, flags);
+ list_splice_init(&smc_wait->head, &clc_wait->head);
+ spin_unlock_irqrestore(&clc_wait->lock, flags);
+ spin_unlock_irqrestore(&smc_wait->lock, flags);
}
}
--
1.8.3.1
On 10/11/2021 13:50, Wen Gu wrote:
> Hi, Karsten
>
> Thanks for your reply. The previous discussion about the issue of socket
> wait queue mismatch in SMC fallback can be referred from:
> https://lore.kernel.org/all/[email protected]/
>
> This set of patches includes two RFC patches, they are both aimed to fix
> the same issue, the mismatch of socket wait queue in SMC fallback.
>
> In your last reply, I am suggested to add the complete description about
> the intention of initial patch in order that readers can understand the
> idea behind it. This has been done in "[RFC PATCH net v2 0/2] net/smc: Fix
> socket wait queue mismatch issue caused by fallback" of this mail.
>
> Unfortunately, I found a defect later in the solution of the initial patch
> or the v2 patch mentioned above. The defect is about fasync_list and related
> to 67f562e3e14 ("net/smc: transfer fasync_list in case of fallback").
>
> When user applications use sock_fasync() to insert entries into fasync_list,
> the wait queue they operate is smc socket->wq. But in initial patch or
> the v2 patch, I swapped sk->sk_wq of smc socket and clcsocket in smc_create(),
> thus the sk_data_ready / sk_write_space.. of smc will wake up clcsocket->wq
> finally. So the entries added into smc socket->wq.fasync_list won't be woken
> up at all before fallback.
>
> So the solution in initial patch or the v2 patch of this mail by swapping
> sk->sk_wq of smc socket and clcsocket seems a bad way to fix this issue.
>
> Therefore, I tried another solution by removing the wait queue entries from
> smc socket->wq to clcsocket->wq during the fallback, which is described in the
> "[RFC PATCH net 2/2] net/smc: Transfer remaining wait queue entries" of this
> mail. In our test environment, this patch can fix the fallback issue well.
Still running final tests but overall its working well here, too.
Until we maybe find a 'cleaner' solution if this I would like to go with your
current fixes. But I would like to improve the wording of the commit message and
the comments a little bit if you are okay with that.
If you send a new series with the 2 patches then I would take them and post them
to the list again with my changes.
What do you think?
On 2021/11/11 10:21 pm, Karsten Graul wrote:
> On 10/11/2021 13:50, Wen Gu wrote:
>> Hi, Karsten
>>
>> Thanks for your reply. The previous discussion about the issue of socket
>> wait queue mismatch in SMC fallback can be referred from:
>> https://lore.kernel.org/all/[email protected]/
>>
>> This set of patches includes two RFC patches, they are both aimed to fix
>> the same issue, the mismatch of socket wait queue in SMC fallback.
>>
>> In your last reply, I am suggested to add the complete description about
>> the intention of initial patch in order that readers can understand the
>> idea behind it. This has been done in "[RFC PATCH net v2 0/2] net/smc: Fix
>> socket wait queue mismatch issue caused by fallback" of this mail.
>>
>> Unfortunately, I found a defect later in the solution of the initial patch
>> or the v2 patch mentioned above. The defect is about fasync_list and related
>> to 67f562e3e14 ("net/smc: transfer fasync_list in case of fallback").
>>
>> When user applications use sock_fasync() to insert entries into fasync_list,
>> the wait queue they operate is smc socket->wq. But in initial patch or
>> the v2 patch, I swapped sk->sk_wq of smc socket and clcsocket in smc_create(),
>> thus the sk_data_ready / sk_write_space.. of smc will wake up clcsocket->wq
>> finally. So the entries added into smc socket->wq.fasync_list won't be woken
>> up at all before fallback.
>>
>> So the solution in initial patch or the v2 patch of this mail by swapping
>> sk->sk_wq of smc socket and clcsocket seems a bad way to fix this issue.
>>
>> Therefore, I tried another solution by removing the wait queue entries from
>> smc socket->wq to clcsocket->wq during the fallback, which is described in the
>> "[RFC PATCH net 2/2] net/smc: Transfer remaining wait queue entries" of this
>> mail. In our test environment, this patch can fix the fallback issue well.
>
> Still running final tests but overall its working well here, too.
> Until we maybe find a 'cleaner' solution if this I would like to go with your
> current fixes. But I would like to improve the wording of the commit message and
> the comments a little bit if you are okay with that.
>
> If you send a new series with the 2 patches then I would take them and post them
> to the list again with my changes.
Seems just the second patch alone will fix the issue.
>
> What do you think?
>
Thanks for your reply. I am glad that the second patch works well.
To avoid there being any misunderstanding between us, I want to explain
that just the second patch "[RFC PATCH net 2/2] net/smc: Transfer
remaining wait queue entries" alone will fix the issue well.
Because it transfers the remaining entries in smc socket->wq to
clcsocket->wq during the fallback, so that the entries added into smc
socket->wq before fallback will still works after fallback, even though
user applications start to use clcsocket.
The first patch "[RFC PATCH net v2 0/2] net/smc: Fix socket wait queue
mismatch issue caused by fallback" should be abandoned.
I sent it only to better explain the defect I found in my initial patch
or this v2 patch. Hope it didn't bother you. Swapping the sk->sk_wq
seems a bad way to fix the issue because it can not handle the
fasync_list well. Unfortunately I found this defect until I almost
finished it :(
So, I think maybe it is fine that just send the second patch "[RFC PATCH
net 2/2] net/smc: Transfer remaining wait queue entries" again. I will
send it later.
And, it is okay for me if you want to improve the commit messages or
comments.
Thank you.
Cheers,
Wen Gu