A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
the write_queue was not empty as determined by tcp_write_queue_empty() but the
sk_buff containing the FIN flag had been freed and the socket was zombied in
that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.
Some instrumentation was added to the kernel and it was found that there is a
timing window where tcp_sendmsg() can run after tcp_send_fin().
tcp_sendmsg() will hit an error, for example:
1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩
1270 ▹ ▹ goto do_error;↩
tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.
If the other side sends a FIN packet the socket will transition to CLOSING and
remain that way until the system is rebooted.
Fix this by checking for the FIN flag in the sk_buff and don't free it if that
is the case. Testing confirmed that fixed the issue.
Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases")
Signed-off-by: Jon Maxwell <[email protected]>
---
net/ipv4/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c2d9830136d2..d2b06d8f0c37 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -938,7 +938,7 @@ int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
*/
void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
{
- if (skb && !skb->len) {
+ if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
tcp_unlink_write_queue(skb, sk);
if (tcp_write_queue_empty(sk))
tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
--
2.27.0
On Wed, Oct 20, 2021 at 4:25 PM Jon Maxwell <[email protected]> wrote:
>
> A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
> the write_queue was not empty as determined by tcp_write_queue_empty() but the
> sk_buff containing the FIN flag had been freed and the socket was zombied in
> that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.
>
> Some instrumentation was added to the kernel and it was found that there is a
> timing window where tcp_sendmsg() can run after tcp_send_fin().
>
> tcp_sendmsg() will hit an error, for example:
>
> 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩
> 1270 ▹ ▹ goto do_error;↩
>
> tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
> TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.
>
> If the other side sends a FIN packet the socket will transition to CLOSING and
> remain that way until the system is rebooted.
>
> Fix this by checking for the FIN flag in the sk_buff and don't free it if that
> is the case. Testing confirmed that fixed the issue.
>
> Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases")
> Signed-off-by: Jon Maxwell <[email protected]>
> ---
> net/ipv4/tcp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c2d9830136d2..d2b06d8f0c37 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -938,7 +938,7 @@ int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
> */
> void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
> {
> - if (skb && !skb->len) {
> + if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> tcp_unlink_write_queue(skb, sk);
> if (tcp_write_queue_empty(sk))
> tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
>
Very nice catch !
The FIN flag is a really special case here.
What we need is to make sure the skb is 'empty' .
What about using a single condition ?
if (skb && TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
On Thu, Oct 21, 2021 at 12:11 PM Eric Dumazet <[email protected]> wrote:
>
> On Wed, Oct 20, 2021 at 4:25 PM Jon Maxwell <[email protected]> wrote:
> >
> > A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that
> > the write_queue was not empty as determined by tcp_write_queue_empty() but the
> > sk_buff containing the FIN flag had been freed and the socket was zombied in
> > that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.
> >
> > Some instrumentation was added to the kernel and it was found that there is a
> > timing window where tcp_sendmsg() can run after tcp_send_fin().
> >
> > tcp_sendmsg() will hit an error, for example:
> >
> > 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩
> > 1270 ▹ ▹ goto do_error;↩
> >
> > tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
> > TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.
> >
> > If the other side sends a FIN packet the socket will transition to CLOSING and
> > remain that way until the system is rebooted.
> >
> > Fix this by checking for the FIN flag in the sk_buff and don't free it if that
> > is the case. Testing confirmed that fixed the issue.
> >
> > Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases")
> > Signed-off-by: Jon Maxwell <[email protected]>
> > ---
> > net/ipv4/tcp.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index c2d9830136d2..d2b06d8f0c37 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -938,7 +938,7 @@ int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
> > */
> > void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
> > {
> > - if (skb && !skb->len) {
> > + if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > tcp_unlink_write_queue(skb, sk);
> > if (tcp_write_queue_empty(sk))
> > tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> >
>
> Very nice catch !
>
Thanks Eric.
> The FIN flag is a really special case here.
>
> What we need is to make sure the skb is 'empty' .
>
> What about using a single condition ?
>
> if (skb && TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
Good call as the end_seq will be +1 for a FIN. So that's better.
Let me give the customer a kernel with your idea:
--- net/ipv4/tcp.c 2021-10-20 22:50:35.836001950 +0530
+++ net/ipv4/tcp.c.patch 2021-10-21 01:42:08.493569483 +0530
@@ -955,7 +955,7 @@
*/
void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
{
- if (skb && !skb->len) {
+ if (skb && TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
tcp_unlink_write_queue(skb, sk);
if (tcp_write_queue_empty(sk))
tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
I'll ask the customer to confirm that the v1 patch as above also resolves
the issue. Although I expect it will.
Then I'll resubmit a v1 patch with your suggestion probably early next week.
Regards
Jon
Hi Jon,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Jon-Maxwell/tcp-don-t-free-a-FIN-sk_buff-in-tcp_remove_empty_skb/20211021-072644
base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 2641b62d2fab52648e34cdc6994b2eacde2d27c1
config: i386-randconfig-a004-20211021 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 3cea2505fd8d99a9ba0cb625aecfe28a47c4e3f8)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/87f5ea309107de5183ec0bd7d0b48ec90546d350
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jon-Maxwell/tcp-don-t-free-a-FIN-sk_buff-in-tcp_remove_empty_skb/20211021-072644
git checkout 87f5ea309107de5183ec0bd7d0b48ec90546d350
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> net/ipv4/tcp.c:941:26: warning: logical not is only applied to the left hand side of this bitwise operator [-Wlogical-not-parentheses]
if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
^ ~
net/ipv4/tcp.c:941:26: note: add parentheses after the '!' to evaluate the bitwise operator first
if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
^
( )
net/ipv4/tcp.c:941:26: note: add parentheses around left hand side expression to silence this warning
if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
^
( )
1 warning generated.
vim +941 net/ipv4/tcp.c
932
933 /* In some cases, both sendpage() and sendmsg() could have added
934 * an skb to the write queue, but failed adding payload on it.
935 * We need to remove it to consume less memory, but more
936 * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
937 * users.
938 */
939 void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
940 {
> 941 if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
942 tcp_unlink_write_queue(skb, sk);
943 if (tcp_write_queue_empty(sk))
944 tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
945 sk_wmem_free_skb(sk, skb);
946 }
947 }
948
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Jon,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Jon-Maxwell/tcp-don-t-free-a-FIN-sk_buff-in-tcp_remove_empty_skb/20211021-072644
base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 2641b62d2fab52648e34cdc6994b2eacde2d27c1
config: i386-randconfig-s002-20211021 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/0day-ci/linux/commit/87f5ea309107de5183ec0bd7d0b48ec90546d350
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jon-Maxwell/tcp-don-t-free-a-FIN-sk_buff-in-tcp_remove_empty_skb/20211021-072644
git checkout 87f5ea309107de5183ec0bd7d0b48ec90546d350
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> net/ipv4/tcp.c:941:61: sparse: sparse: dubious: !x & y
vim +941 net/ipv4/tcp.c
0c54b85f282812 Ilpo Järvinen 2009-03-14 932
fdfc5c8594c24c Eric Dumazet 2019-08-26 933 /* In some cases, both sendpage() and sendmsg() could have added
fdfc5c8594c24c Eric Dumazet 2019-08-26 934 * an skb to the write queue, but failed adding payload on it.
fdfc5c8594c24c Eric Dumazet 2019-08-26 935 * We need to remove it to consume less memory, but more
fdfc5c8594c24c Eric Dumazet 2019-08-26 936 * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
fdfc5c8594c24c Eric Dumazet 2019-08-26 937 * users.
fdfc5c8594c24c Eric Dumazet 2019-08-26 938 */
b796d04bd014fd Paolo Abeni 2020-11-16 939 void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
fdfc5c8594c24c Eric Dumazet 2019-08-26 940 {
87f5ea309107de Jon Maxwell 2021-10-21 @941 if (skb && !skb->len && !TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
fdfc5c8594c24c Eric Dumazet 2019-08-26 942 tcp_unlink_write_queue(skb, sk);
fdfc5c8594c24c Eric Dumazet 2019-08-26 943 if (tcp_write_queue_empty(sk))
fdfc5c8594c24c Eric Dumazet 2019-08-26 944 tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
fdfc5c8594c24c Eric Dumazet 2019-08-26 945 sk_wmem_free_skb(sk, skb);
fdfc5c8594c24c Eric Dumazet 2019-08-26 946 }
fdfc5c8594c24c Eric Dumazet 2019-08-26 947 }
fdfc5c8594c24c Eric Dumazet 2019-08-26 948
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]