LinuxLists.cc - [PATCH] DLM: increase socket backlog to avoid hangs with 16 nodes

2023-02-02 15:24:32

Subject: [PATCH] DLM: increase socket backlog to avoid hangs with 16 nodes

On a 16 node virtual cluster with e1000 NICs joining the 12th node prints
SYN flood warnings for the DLM port:
Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters.

And then joining a DLM lockspace hangs:
```
Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds. │
Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1 │
Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. │
Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000 │
Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000 │
Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000 │
Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10
Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f │
Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 │
Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110 │
Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80 │
Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm] │
Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100 │
Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm] │
Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0 │
Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0 │
Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0 ┤
Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190 │
Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420 │
Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0 │
Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71
```

The previous limit of 5 seems like an arbitrary number, that doesn't match any
known DLM cluster size upper bound limit.

Signed-off-by: Edwin Török <[email protected]>
Cc: Christine Caulfield <[email protected]>
Cc: David Teigland <[email protected]>
Cc: [email protected]
---
Notes from 2023:
This patch was initially developed on 21 Dec 2017, and in production use ever since.
I expected to drop out of our patchqueue at the next kernel upgrade, however it
hasn't, so I probably forgot to send it.

I haven't noticed this bug again with the patch applied, and the previous value
of '5' seems like an arbitrary limit not matching any supported upper bounds
on DLM cluster sizes, so this patch has (unintentionally) had a 5 year test
cycle.

Although the join hanging forever like that may still be a bug, if the SYN cookies
consistently trigger it lets try to avoid the bug by avoiding the SYN cookies.
---
fs/dlm/lowcomms.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 4450721ec..105c9138b 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1774,7 +1774,7 @@ static int dlm_listen_for_all(void)
sock->sk->sk_data_ready = lowcomms_listen_data_ready;
release_sock(sock->sk);

- result = sock->ops->listen(sock, 5);
+ result = sock->ops->listen(sock, 128);
if (result < 0) {
dlm_close_sock(&listen_con.sock);
return result;
--
2.34.1

2023-02-02 22:26:09

by Alexander Aring

[permalink] [raw]

Subject: Re: [Cluster-devel] [PATCH] DLM: increase socket backlog to avoid hangs with 16 nodes

Hi,

On Thu, Feb 2, 2023 at 10:37 AM Edwin Török <[email protected]> wrote:
>
> On a 16 node virtual cluster with e1000 NICs joining the 12th node prints
> SYN flood warnings for the DLM port:
> Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters.
>
> And then joining a DLM lockspace hangs:
> ```
> Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds. │
> Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1 │
> Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. │
> Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10
> Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f │
> Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm] │
> Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm] │
> Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0 ┤
> Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0 │
> Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71
> ```
>
> The previous limit of 5 seems like an arbitrary number, that doesn't match any
> known DLM cluster size upper bound limit.
>
> Signed-off-by: Edwin Török <[email protected]>
> Cc: Christine Caulfield <[email protected]>
> Cc: David Teigland <[email protected]>
> Cc: [email protected]
> ---
> Notes from 2023:
> This patch was initially developed on 21 Dec 2017, and in production use ever since.
> I expected to drop out of our patchqueue at the next kernel upgrade, however it
> hasn't, so I probably forgot to send it.
>
> I haven't noticed this bug again with the patch applied, and the previous value
> of '5' seems like an arbitrary limit not matching any supported upper bounds
> on DLM cluster sizes, so this patch has (unintentionally) had a 5 year test
> cycle.

Correct, I guess the 128 coming from dlm_controld 128 max node
boundary [0] and I think it's okay to change it to 128, especially if
you start a cluster every "mostly" node wants to connect() at the same
time and the backlog of 5 could be not enough to handle them.
From my understanding the connect() will return -ECONNREFUSED in the
case of backlog is full. As I see the code of kernel v4.0 dlm at this
point will _not_ slow down a reattempt to run connect() again, see
[1].

>
> Although the join hanging forever like that may still be a bug, if the SYN cookies
> consistently trigger it lets try to avoid the bug by avoiding the SYN cookies.

right, it should work even with 1 as a backlog... after digging into
the code. My guess is that [2] ran into the MAX_CONNECT_RETRIES
condition and we do _not_ reset con->retries to 0 if connect() [3]
fails. If this happens we are kind of stuck and we would not try to
connect() again. Means you run into the hung task situation, because
e.g. ping_members() will not succeed.

The current upstream code does not have any limitations of trying
retries and will always wait some time to try a connect() again if it
fails. It should retry to connect so long the cluster manager tells
dlm kernel that a specific node is still part of a lockspace resource
and there is a transmission happening to this particular node. However
I think pumping this value to the maximum amount of possible nodes in
a cluster is not a bad idea.

There is still a question of "why seeing SYN flooding" warning, but as
I said there are a lot of connect() happening at the beginning at
mostly the same time starting a cluster and some of them retrying
connect() fast.

...

- Alex

[0] https://pagure.io/dlm/blob/c1e2ee574794e0dea68fd45783991ef7b1835901/f/dlm_controld/dlm_daemon.h#_177
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lowcomms.c?h=v4.0#n1226
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lowcomms.c?h=v4.0#n1168
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lowcomms.c?h=v4.0#n1220

2023-02-03 15:50:24

by Edwin Torok

[permalink] [raw]

Subject: Re: [Cluster-devel] [PATCH] DLM: increase socket backlog to avoid hangs with 16 nodes

> On 2 Feb 2023, at 22:25, Alexander Aring <[email protected]> wrote:
>
> [CAUTION - EXTERNAL EMAIL] DO NOT reply, click links, or open attachments unless you have verified the sender and know the content is safe.
>
> Hi,
>
> On Thu, Feb 2, 2023 at 10:37 AM Edwin Török <[email protected]> wrote:
>>
>> On a 16 node virtual cluster with e1000 NICs joining the 12th node prints
>> SYN flood warnings for the DLM port:
>> Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters.
>>
>> And then joining a DLM lockspace hangs:
>> ```
>> Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds. │
>> Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10
>> Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm] │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm] │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0 ┤
>> Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0 │
>> Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71
>> ```
>>
>> The previous limit of 5 seems like an arbitrary number, that doesn't match any
>> known DLM cluster size upper bound limit.
>>
>> Signed-off-by: Edwin Török <[email protected]>
>> Cc: Christine Caulfield <[email protected]>
>> Cc: David Teigland <[email protected]>
>> Cc: [email protected]
>> ---
>> Notes from 2023:
>> This patch was initially developed on 21 Dec 2017, and in production use ever since.
>> I expected to drop out of our patchqueue at the next kernel upgrade, however it
>> hasn't, so I probably forgot to send it.
>>
>> I haven't noticed this bug again with the patch applied, and the previous value
>> of '5' seems like an arbitrary limit not matching any supported upper bounds
>> on DLM cluster sizes, so this patch has (unintentionally) had a 5 year test
>> cycle.
>
> Correct, I guess the 128 coming from dlm_controld 128 max node
> boundary [0] and I think it's okay to change it to 128, especially if
> you start a cluster every "mostly" node wants to connect() at the same
> time and the backlog of 5 could be not enough to handle them.
> From my understanding the connect() will return -ECONNREFUSED in the
> case of backlog is full. As I see the code of kernel v4.0 dlm at this
> point will _not_ slow down a reattempt to run connect() again, see
> [1].

Thanks for digging into the code, that explains why 'SYN cookies' is a symptom rather than a bug here.

>
>>
>> Although the join hanging forever like that may still be a bug, if the SYN cookies
>> consistently trigger it lets try to avoid the bug by avoiding the SYN cookies.
>
> right, it should work even with 1 as a backlog... after digging into
> the code. My guess is that [2] ran into the MAX_CONNECT_RETRIES
> condition and we do _not_ reset con->retries to 0 if connect() [3]
> fails. If this happens we are kind of stuck and we would not try to
> connect() again. Means you run into the hung task situation, because
> e.g. ping_members() will not succeed.
>
> The current upstream code does not have any limitations of trying
> retries and will always wait some time to try a connect() again if it
> fails. It should retry to connect so long the cluster manager tells
> dlm kernel that a specific node is still part of a lockspace resource
> and there is a transmission happening to this particular node. However
> I think pumping this value to the maximum amount of possible nodes in
> a cluster is not a bad idea.

Although there are improvements possible in how connect failures are handled, this situation is quite rare outside of
a stress test under lab conditions, and this one-liner is probably the easiest way to improve robustness in this area.

>
> There is still a question of "why seeing SYN flooding" warning, but as
> I said there are a lot of connect() happening at the beginning at
> mostly the same time starting a cluster and some of them retrying
> connect() fast.

Yes, I think the SYN flood warning means just that: too many connects in a very short timeframe
and backlog exceeded, i.e. a symptom rather than a cause.

Best regards,
--Edwin

>
> ...
>
> - Alex
>
> [0] https://pagure.io/dlm/blob/c1e2ee574794e0dea68fd45783991ef7b1835901/f/dlm_controld/dlm_daemon.h#_177
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lowcomms.c?h=v4.0#n1226
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lowcomms.c?h=v4.0#n1168
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dlm/lowcomms.c?h=v4.0#n1220
>