Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754837AbbK0Pcv (ORCPT ); Fri, 27 Nov 2015 10:32:51 -0500 Received: from mail.univention.de ([82.198.197.8]:2788 "EHLO mail.univention.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753608AbbK0Pcr (ORCPT ); Fri, 27 Nov 2015 10:32:47 -0500 X-Greylist: delayed 520 seconds by postgrey-1.27 at vger.kernel.org; Fri, 27 Nov 2015 10:32:47 EST Message-ID: <5658758A.3020602@univention.de> Date: Fri, 27 Nov 2015 16:23:54 +0100 From: Philipp Hahn Organization: Univention GmbH User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.8.0 MIME-Version: 1.0 To: linux-kernel@vger.kernel.org, Herbert Xu , Tejun Heo CC: Linus Torvalds , "David S. Miller" Subject: Strange PF_NETLINK NETLINK_ROUTE stall: netlink: Fix autobind race condition that leads to zero port ID Content-Type: multipart/mixed; boundary="------------000209020708070006050008" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11479 Lines: 283 This is a multi-part message in MIME format. --------------000209020708070006050008 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Hello, I observed a strange stall in a multi-threaded process (originally OpenLDAP slapd), which stalls during a call to getaddrinfo() after upgrading the kernel from 3.16.x to 4.1.12: > #0 0x00007f8e6d97b87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #1 0x00007f8e6d99bfbc in make_request (fd=12, pid=2871, seen_ipv4=, seen_ipv6=, in6ai=, in6ailen=) at ../sysdeps/unix/sysv/linux/check_pf.c:119 > #2 0x00007f8e6d99c42a in __check_pf (seen_ipv4=0x7f8e6a0969ff, seen_ipv6=0x7f8e6a0969fe, in6ai=0x7f8e6a0969e0, in6ailen=0x7f8e6a0969d8) at ../sysdeps/unix/sysv/linux/check_pf.c:272 > #3 0x00007f8e6d959007 in *__GI_getaddrinfo (name=0x7f8e6a096a80 "0.0.0.0", service=0x400d04 "7636", hints=0x7f8e6a096a50, pai=0x7f8e6a096a48) at ../sysdeps/posix/getaddrinfo.c:2389 (The glibc tries to get the routing table to see via netlink, if it should support IPv4 and/or IPv6. Neither nor fixed the problem.) I've been able to reproduce it by writing the attached program, which calls getaddrinfo() from 10 thread 10^6 times. On my 8-core Intel I see several threads hanging in the above call after some seconds - you don't have to wait for the process to terminate): > gdb --batch --ex 'set pagination off' --ex 'thread apply all bt' -p `pidof a.out` | grep rec ... > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 > #0 0x00007f3817bcc87d in recvmsg () at ../sysdeps/unix/syscall-template.S:82 Using "git bisect start v4.1.11 v.4.10" I've been able to identify 4e27762417669cb459971635be550eb7b5598286 (v4.1.9-26-g4e27762), which is a back-port of 1f770c0a09da855a2b51af6d19de97fb955eca85 (v4.2-11228-g1f770c0). I've attached the log. > # git log -1 4e27762417669cb459971635be550eb7b5598286 > commit 4e27762417669cb459971635be550eb7b5598286 > Author: Herbert Xu > Date: Fri Sep 18 19:16:50 2015 +0800 > > netlink: Fix autobind race condition that leads to zero port ID > > [ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ] > > The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink: > Reset portid after netlink_insert failure") introduced a race > condition where if two threads try to autobind the same socket > one of them may end up with a zero port ID. This led to kernel > deadlocks that were observed by multiple people. > > This patch reverts that commit and instead fixes it by introducing > a separte rhash_portid variable so that the real portid is only set > after the socket has been successfully hashed. > > Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure") > Reported-by: Tejun Heo > Reported-by: Linus Torvalds > Signed-off-by: Herbert Xu > Signed-off-by: David S. Miller > Signed-off-by: Greg Kroah-Hartman I verified 4e27762417669cb459971635be550eb7b5598286~1 does not show the problem. v4.1.13 still has the problem. > # git log --oneline v4.1..v4.1.13 -- net/netlink/af_netlink.c > e962218 netlink: Trim skb to alloc size to avoid MSG_TRUNC > d486236 netlink: Replace rhash_portid with bound > 4e27762 netlink: Fix autobind race condition that leads to zero port ID > 65d48c6 netlink, mmap: transform mmap skb into full skb on taps > d397617 netlink: make sure -EBUSY won't escape from netlink_insert > b265c30 netlink: don't hold mutex in rcu callback when releasing mmapd ring v4.3 does not show the problem. > # git log --oneline v4.2..v4.3 -- net/netlink/af_netlink.c > 47191d6 netlink: fix locking around NETLINK_LIST_MEMBERSHIPS > db65a3a netlink: Trim skb to alloc size to avoid MSG_TRUNC > da314c9 netlink: Replace rhash_portid with bound > 1f770c0 netlink: Fix autobind race condition that leads to zero port ID > 1853c94 netlink, mmap: transform mmap skb into full skb on taps > 6bb0fef netlink, mmap: fix edge-case leakages in nf queue zero-copy > a66e365 netlink, mmap: don't walk rx ring on poll if receive queue non-empty > 0ef7077 netlink: rx mmap: fix POLLIN condition > 7084a31 netlink: mmap: fix lookup frame position > 0a6a3a2 netlink: add NETLINK_CAP_ACK socket option v4.2 does not show the problem (as expected, as the patch isn't in 4.2.0 > # git log --oneline v4.2..v4.2.6 -- net/netlink/af_netlink.c > 326ce2c netlink: Trim skb to alloc size to avoid MSG_TRUNC > e2a3131 netlink: Replace rhash_portid with bound > 6e32e73 netlink: Fix autobind race condition that leads to zero port ID > 62f43b5 netlink, mmap: transform mmap skb into full skb on taps ring I no longer have an idea where to look next, as something outside of af_netlink.c seems to have fixed the issue for at least 4.3. Any idea? Kernel: 4.1.12 (Debian) GLIBC: 2.13-38+deb7u8 CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz Thank you and a (hopefully) happy weekend. Philipp -- Philipp Hahn Open Source Software Engineer Univention GmbH be open. Mary-Somerville-Str. 1 D-28359 Bremen Tel.: +49 421 22232-0 Fax : +49 421 22232-99 hahn@univention.de http://www.univention.de/ Geschäftsführer: Peter H. Ganten HRB 20755 Amtsgericht Bremen Steuer-Nr.: 71-597-02876 --------------000209020708070006050008 Content-Type: text/x-csrc; name="40059_glic-NL.c" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="40059_glic-NL.c" /* * */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #define ARRAY_SIZE(a) ((sizeof (a)) / sizeof (*(a))) #define BUFLEN 4096 // cyrus-sasl int check(void) { const char *addr =3D "0.0.0.0;7636"; char hbuf[NI_MAXHOST]; struct addrinfo hints, *ai =3D NULL; int i, j; /* Parse the address */ for (i =3D 0; addr[i] !=3D '\0' && addr[i] !=3D ';'; i++) { if (i >=3D NI_MAXHOST) return 3; hbuf[i] =3D addr[i]; } hbuf[i] =3D '\0'; if (addr[i] =3D=3D ';') i++; /* XXX: Do we need this check? */ for (j =3D i; addr[j] !=3D '\0'; j++) if (!isdigit((int)(addr[j]))) return 2; memset(&hints, 0, sizeof(hints)); hints.ai_family =3D PF_UNSPEC; hints.ai_socktype =3D SOCK_STREAM; hints.ai_flags =3D AI_PASSIVE | AI_NUMERICHOST; if (getaddrinfo(hbuf, &addr[i], &hints, &ai) !=3D 0) return 1; freeaddrinfo(ai); return 0; } static pthread_t thread[10]; static void *run(void *data) { unsigned int round, j; unsigned int self =3D (unsigned int)(unsigned long)data; pthread_t tid =3D pthread_self(); printf("[%d] POSIX thread ID is %u\n", self, (unsigned int)tid); for (round =3D 0; round < 1000; round++) { for (j =3D 0; j < 1000; j++) { int i; i =3D check(); if (i) break; } printf("%d\n", round); unsigned int usecs =3D (round * (unsigned int)tid) % 1000000; usleep(usecs); if (round % 10 =3D=3D 0) putchar('0' + (self % 10)); } printf("[%d] end\n", self); pthread_exit(NULL); } int main(int argc, char *argv[], char *envp[]) { pid_t pid =3D getpid(); printf("pid=3D%d\n", pid); unsigned int t; for (t =3D 0; t < ARRAY_SIZE(thread); t++) pthread_create(&thread[t], NULL, run, (void *)(long)t); int ret =3D 0; void *value_ptr; for (t =3D 0; t < ARRAY_SIZE(thread); t++) ret |=3D pthread_join(thread[t], &value_ptr); return ret; } /* :!gcc -O0 -g -Wall -lpthread % */ --------------000209020708070006050008 Content-Type: text/plain; charset=UTF-8; name="40059_linux-NL.bisect" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="40059_linux-NL.bisect" Z2l0IGJpc2VjdCBzdGFydAojIGJhZDogWzIwNWE4NTE0ZTYzYTIyMWMzYTUwNzFmMjc1MjEw MTM0NDRlNDNlNWZdIExpbnV4IDQuMS4xMQpnaXQgYmlzZWN0IGJhZCAyMDVhODUxNGU2M2Ey MjFjM2E1MDcxZjI3NTIxMDEzNDQ0ZTQzZTVmCiMgZ29vZDogW2I5NTNjMGQyMzRiYzcyZTg0 ODlkM2JmNTFhMjc2YzVjNGVjODUzNDVdIExpbnV4IDQuMQpnaXQgYmlzZWN0IGdvb2QgYjk1 M2MwZDIzNGJjNzJlODQ4OWQzYmY1MWEyNzZjNWM0ZWM4NTM0NQojIGdvb2Q6IFs0YjZiZjE3 MGU4M2U2NjY5NzRjZTAwZDcyNTFkZDBkNGI1Y2E1NDM5XSBvY2ZzMjogZml4IHNoaWZ0IGxl ZnQgb3ZlcmZsb3cKZ2l0IGJpc2VjdCBnb29kIDRiNmJmMTcwZTgzZTY2Njk3NGNlMDBkNzI1 MWRkMGQ0YjVjYTU0MzkKIyBnb29kOiBbMDc5NmY1NWJhNGRiMjY4ZjNhNmE2YzZiYjllNDgx Y2U5ZWFmNWQ0OF0gaWRlYXBhZC1sYXB0b3A6IEFkZCBMZW5vdm8gWW9nYSAzIDE0IHRvIG5v X2h3X3Jma2lsbCBkbWkgbGlzdApnaXQgYmlzZWN0IGdvb2QgMDc5NmY1NWJhNGRiMjY4ZjNh NmE2YzZiYjllNDgxY2U5ZWFmNWQ0OAojIGJhZDogW2RjN2EzZDcwN2M5Nzg0ZjI2MDQ4ZDZm MDU5ZTc0NjdhOGE5MmUwNGJdIHNwaTogRml4IGRvY3VtZW50YXRpb24gb2Ygc3BpX2FsbG9j X21hc3RlcigpCmdpdCBiaXNlY3QgYmFkIGRjN2EzZDcwN2M5Nzg0ZjI2MDQ4ZDZmMDU5ZTc0 NjdhOGE5MmUwNGIKIyBnb29kOiBbOGJiOTIyNWE3YmQzYTA3YWFkZDU4YjE5YTYzM2YxZjdh ODg4NGQwMV0gbmV0OiBldGg6IGFsdGVyYTogZml4IG5hcGkgcG9sbF9saXN0IGNvcnJ1cHRp b24KZ2l0IGJpc2VjdCBnb29kIDhiYjkyMjVhN2JkM2EwN2FhZGQ1OGIxOWE2MzNmMWY3YTg4 ODRkMDEKIyBiYWQ6IFtiNmFjM2FlZTFjNzNkNWIxMDdhZTZlYzZlMzcxNWIzOWRiNzgyNzgx XSBpc2VyLXRhcmdldDogUHV0IHRoZSByZWZlcmVuY2Ugb24gY29tbWFuZHMgd2FpdGluZyBm b3IgdW5zb2wgZGF0YQpnaXQgYmlzZWN0IGJhZCBiNmFjM2FlZTFjNzNkNWIxMDdhZTZlYzZl MzcxNWIzOWRiNzgyNzgxCiMgYmFkOiBbNGUyNzc2MjQxNzY2OWNiNDU5OTcxNjM1YmU1NTBl YjdiNTU5ODI4Nl0gbmV0bGluazogRml4IGF1dG9iaW5kIHJhY2UgY29uZGl0aW9uIHRoYXQg bGVhZHMgdG8gemVybyBwb3J0IElECmdpdCBiaXNlY3QgYmFkIDRlMjc3NjI0MTc2NjljYjQ1 OTk3MTYzNWJlNTUwZWI3YjU1OTgyODYKIyBnb29kOiBbY2Y5Y2Y2YmMyNTNjNWQ4M2E4Y2I1 OTNkNGVkNzA0MmJiNGYwNGU3Y10gbWFjdnRhcDogZml4IFRVTlNFVFNOREJVRiB2YWx1ZXMg PiA2NGsKZ2l0IGJpc2VjdCBnb29kIGNmOWNmNmJjMjUzYzVkODNhOGNiNTkzZDRlZDcwNDJi YjRmMDRlN2MKIyBnb29kOiBbZmNlMTM0NjQ0MzAzYTRlYmUwODkyZWRjOWFmN2Q0NmUwZWJk YTlhYl0gZmliX3J1bGVzOiBmaXggZmliIHJ1bGUgZHVtcHMgYWNyb3NzIG11bHRpcGxlIHNr YnMKZ2l0IGJpc2VjdCBnb29kIGZjZTEzNDY0NDMwM2E0ZWJlMDg5MmVkYzlhZjdkNDZlMGVi ZGE5YWIKIyBnb29kOiBbMjgyMTE3YWNkZmFjN2I1ZDYyMWIxOGYwMDJhOTc1MWQ3ZjU4M2Rh Zl0gbmV0OiBwaHk6IGZpeGVkX3BoeTogaGFuZGxlIGxpbmstZG93biBjYXNlCmdpdCBiaXNl Y3QgZ29vZCAyODIxMTdhY2RmYWM3YjVkNjIxYjE4ZjAwMmE5NzUxZDdmNTgzZGFmCiMgZ29v ZDogW2Q2MDAxNzY0NjFlODVhMGJmM2I5NjM1ODdlNGNhOTJkNDk0YTdhNDddIG12bmV0YTog dXNlIGluYmFuZCBzdGF0dXMgb25seSB3aGVuIGV4cGxpY2l0bHkgZW5hYmxlZApnaXQgYmlz ZWN0IGdvb2QgZDYwMDE3NjQ2MWU4NWEwYmYzYjk2MzU4N2U0Y2E5MmQ0OTRhN2E0Nwo= --------------000209020708070006050008-- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/