We’re investigating a problem where a number of Ubuntu 18.04 Linux clients bombard our NFS servers with new TCP connections: roughly 400 connection attempts per second to the NFS server port, each from a unique source port, which makes the firewall state tracking tables on the server grow out of control.
The stream of IP packets we are seeing looks like this (times in seconds):
0.001280 Client -> Server SYN
0.001289 Server -> Client SYN+ACK
0.001516 Client -> Server ACK
0.003609 Client -> Server FIN+ACK
0.003615 Server -> Client ACK
0.003620 Server -> Client FIN+ACK
0.003841 Client -> Server ACK
<repeat 200-400 times/s>
I.e., the client initiates a new connection and then immediately disconnects again.
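For reference, this churn is easy to observe on the server side; something along these lines (the interface name and the standard NFS port 2049 are assumptions on my part) shows the incoming bare SYNs, and conntrack shows the state table growing, if conntrack-tools is installed:

  # SYN-only segments arriving at the NFS port (eth0 is just an example)
  tcpdump -ni eth0 'tcp dst port 2049 and (tcp[tcpflags] & tcp-syn) != 0 and (tcp[tcpflags] & tcp-ack) == 0'

  # current number of entries in the connection tracking table
  conntrack -C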
On the client this coincides with a user who has their home directory mounted via NFSv4 with sec=krb5, an expired Kerberos ticket, and an “evolution” process running. This somehow causes rpc.gssd to run a lot...
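The expired credentials are easy to confirm from the user’s side with klist; the cache path below is only an example of the default FILE: cache location, so adjust it to whatever the site actually uses:

  # list the (expired) tickets in the user's default credential cache
  sudo -u olaha93 klist
  # or point klist at a specific cache file
  klist -c /tmp/krb5cc_1036985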
According to top and strace, rpc.gssd seems to be doing this:
> Tasks: 266 total, 1 running, 193 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 5.0 us, 1.5 sy, 0.0 ni, 93.1 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
> KiB Mem : 16303264 total, 955844 free, 3912820 used, 11434600 buff/cache
> KiB Swap: 7812092 total, 7811580 free, 512 used. 11481708 avail Mem
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 705 root 20 0 328888 4924 3568 S 20.8 0.0 1520:27 rpc.gssd
> epoll_wait(4, [{EPOLLIN, {u32=9, u64=9}}], 32, -1) = 1
> read(9, "mech=krb5 uid=1036985 enctypes=1"..., 32768) = 50
> clone(child_stack=0x7f4746f5ffb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f4746f609d0, tls=0x7f4746f60700, child_tidptr=0x7f4746f609d0) = 1310
> epoll_wait(4, [{EPOLLIN, {u32=9, u64=9}}], 32, -1) = 1
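The strace excerpt was captured roughly as below (the PID is from the top output above). The read on fd 9 appears to be rpc.gssd picking up a GSS-API upcall for uid 1036985 from the kernel via rpc_pipefs, and it spawns a worker thread per upcall; running the daemon in the foreground with extra verbosity should show why each of those upcalls then fails:

  # follow rpc.gssd's upcall handling (PID taken from top above)
  strace -f -p 705 -e trace=epoll_wait,read,clone

  # run the daemon in the foreground with extra verbosity
  rpc.gssd -f -vvv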
Evolution has a number of file descriptors open:
> evolution 26916 olaha93 cwd DIR 0,68 45 4 /home/olaha93 (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 mem REG 0,68 229909 /home/olaha93/.cache/mesa_shader_cache/index (filur04.it.liu.se:/staff/olaha93) (stat: Permission denied)
> evolution 26916 olaha93 mem REG 0,68 175429 /home/olaha93/.config/dconf/user (filur04.it.liu.se:/staff/olaha93) (stat: Permission denied)
> evolution 26916 olaha93 mem REG 0,68 171739 /home/olaha93/.local/share/gvfs-metadata/.nfs0000000000029edb00000333 (filur04.it.liu.se:/staff/olaha93) (stat: Permission denied)
> evolution 26916 olaha93 15u REG 0,68 28672 230539 /home/olaha93/.pki/nssdb/cert9.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 16u REG 0,68 28672 230541 /home/olaha93/.pki/nssdb/key4.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 18u unknown 0,68 /home/olaha93/.cache/evolution/mail/e013fa56fec8685e5cb71733b66216b51495a944/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 19u unknown 0,68 /home/olaha93/.cache/evolution/mail/1e49c4d51a101a775373e6455c4ce75b1bcc87a9/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 21u unknown 0,68 /home/olaha93/.cache/evolution/mail/13f710d8659a79fd42ea3aff53b116f6a67bbbeb/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 22u unknown 0,68 /home/olaha93/.cache/evolution/mail/94e55bfdbf6fe8f93ceedef1a7ae247fb97ea2ff/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 23u unknown 0,68 /home/olaha93/.cache/evolution/mail/16e1d5d4fcf3c26e45b366759fdd06df1b55b979/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 24u unknown 0,68 /home/olaha93/.local/share/evolution/mail/local/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 25u unknown 0,68 /home/olaha93/.local/share/evolution/mail/vfolder/folders.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 26u unknown 0,68 /home/olaha93/.config/evolution/mail/remote-content.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 27u unknown 0,68 /home/olaha93/.config/evolution/mail/properties.db (filur04.it.liu.se:/staff/olaha93)
> evolution 26916 olaha93 29r unknown 0,68 /home/olaha93/.local/share/gvfs-metadata/.nfs0000000000029edb00000333 (filur04.it.liu.se:/staff/olaha93)
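The listing above is lsof output, roughly as below. The “unknown” types and “(stat: Permission denied)” notes are expected here, presumably because with sec=krb5 root has no Kerberos credentials for the mount and so lsof cannot stat those files:

  # open files of the evolution process (PID from above);
  # lsof shows the NFS mount in parentheses in the NAME column
  lsof -p 26916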
Repeated errors from evolution in the log files:
> Sep 13 16:01:48 lille71 evolution[26916]: camel_store_summary_disconnect_folder_summary: Store summary 0x559b44194390 is not connected to folder summary 0x7fcaf40428e0
> Sep 13 16:01:48 lille71 evolution[26916]: Unable to load summary: disk I/O error
> Sep 13 16:01:48 lille71 evolution[26916]: GError set over the top of a previous GError or uninitialized memory.
> This indicates a bug in someone's code. You must ensure an error is NULL before it's set.
> The overwriting error message was: disk I/O error
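Those messages come from the user’s evolution process via syslog. The folders.db files listed above are SQLite databases living in the NFS home directory, and “disk I/O error” is how SQLite reports the failed reads, so the errors can be followed live with something like:

  # follow evolution's log messages (identifier taken from the log lines above)
  journalctl -f -t evolution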
Restarting rpc.gssd doesn’t seem to help, and neither does removing the expired Kerberos credential cache file. So my guess is that evolution somehow triggers new NFS server connections, but the kernel immediately terminates them without ever sending any data to the NFS server.
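For completeness, the cleanup attempts were roughly the following (the systemd unit name is what Ubuntu 18.04’s nfs-common ships, and the cache path is again just an example):

  # restart the userspace GSS daemon
  systemctl restart rpc-gssd.service

  # destroy the user's expired credential cache
  kdestroy -c /tmp/krb5cc_1036985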
(Rebooting the client works - but.. :-)
It would have been nice if an expired Kerberos cache didn’t cause this aggressive behaviour…
- Peter
This is off topic, but it sounds like you might be accessing IMAP
folders over NFS? This is a bad idea if you are -- just use the IMAP
protocol and use local storage on the IMAP server. Anything else is a
recipe for lost mail and endless headaches.
On 9/16/19 1:33 AM, Peter Eriksson wrote:
> [...]