Subject: Asterisk deadlocks since Kernel 4.1

Hello,

Since upgrading our Asterisk system from kernel 3.18.17 to 4.1.13, it
deadlocks every few hours (kill -9 is the only thing that works). Booting
with 3.18 again lets it run smoothly.

An strace shows asterisk is looping like this:

[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0
[pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
2, 1000) = 1 ([{fd=8, revents=POLLIN}])
[pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
it_value={0, 140592906050976}}) = 0

fd 8 is:

lrwx------ 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 -> anon_inode:[timerfd]


# cat /proc/6025/stack
[<ffffffff941d60e9>] poll_schedule_timeout+0x49/0x70
[<ffffffff941d76c7>] do_sys_poll+0x3d7/0x590
[<ffffffff941d78bc>] do_restart_poll+0x3c/0x70
[<ffffffff9408e7ff>] sys_restart_syscall+0x1f/0x30
[<ffffffff9463f1ee>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

Any ideas how to debug this?

Greets,
Stefan


2015-11-17 19:15:49

by Thomas Gleixner

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:
> Since upgrading our Asterisk system from kernel 3.18.17 to 4.1.13, it
> deadlocks every few hours (kill -9 is the only thing that works). Booting
> with 3.18 again lets it run smoothly.
>
> An strace shows asterisk is looping like this:
>
> [pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
> it_value={0, 140592906050976}}) = 0
> [pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
> 2, 1000) = 1 ([{fd=8, revents=POLLIN}])
> [pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
> it_value={0, 140592906050976}}) = 0
> [pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
>
> fd 8 is:
>
> lrwx------ 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
> anon_inode:[timerfd]
>
>
> # cat /proc/6025/stack
> [<ffffffff941d60e9>] poll_schedule_timeout+0x49/0x70
> [<ffffffff941d76c7>] do_sys_poll+0x3d7/0x590
> [<ffffffff941d78bc>] do_restart_poll+0x3c/0x70
> [<ffffffff9408e7ff>] sys_restart_syscall+0x1f/0x30
> [<ffffffff9463f1ee>] system_call_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Any ideas how to debug this?

fd 8 is probably not really interesting. That looks like an interval
timer firing periodically. So it probably waits for fd 9 ...

Thanks,

tglx

Subject: Re: Asterisk deadlocks since Kernel 4.1

Am 17.11.2015 um 20:15 schrieb Thomas Gleixner:
> On Tue, 17 Nov 2015, Stefan Priebe - Profihost AG wrote:
>> Since upgrading our Asterisk system from kernel 3.18.17 to 4.1.13, it
>> deadlocks every few hours (kill -9 is the only thing that works). Booting
>> with 3.18 again lets it run smoothly.
>>
>> An strace shows asterisk is looping like this:
>>
>> [pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
>> it_value={0, 140592906050976}}) = 0
>> [pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
>> [pid 6068] poll([{fd=9, events=POLLIN}, {fd=8, events=POLLIN|POLLPRI}],
>> 2, 1000) = 1 ([{fd=8, revents=POLLIN}])
>> [pid 6068] timerfd_gettime(8, , {it_interval={0, 20000000},
>> it_value={0, 140592906050976}}) = 0
>> [pid 6068] read(8, "\1\0\0\0\0\0\0\0", 8) = 8
>>
>> fd 8 is:
>>
>> lrwx------ 1 root root 64 Nov 17 15:27 /proc/6025/fd/8 ->
>> anon_inode:[timerfd]
>>
>>
>> # cat /proc/6025/stack
>> [<ffffffff941d60e9>] poll_schedule_timeout+0x49/0x70
>> [<ffffffff941d76c7>] do_sys_poll+0x3d7/0x590
>> [<ffffffff941d78bc>] do_restart_poll+0x3c/0x70
>> [<ffffffff9408e7ff>] sys_restart_syscall+0x1f/0x30
>> [<ffffffff9463f1ee>] system_call_fastpath+0x12/0x71
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Any ideas how to debug this?
>
> fd 8 is probably not really interesting. That looks like an interval
> timer firing periodically. So it probably waits for fd 9 ...

I've now also two gdb backtraces from two crashes:
http://pastebin.com/raw.php?i=yih5jNt8

http://pastebin.com/raw.php?i=kGEcvH4T

Stefan

>
> Thanks,
>
> tglx
>

2015-11-17 19:44:17

by Thomas Gleixner

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Tue, 17 Nov 2015, Stefan Priebe wrote:
> I've now also two gdb backtraces from two crashes:
> http://pastebin.com/raw.php?i=yih5jNt8
>
> http://pastebin.com/raw.php?i=kGEcvH4T

They don't tell me anything, as I have no idea of the inner workings of
Asterisk. You might be better off talking to the Asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well-defined area.

Thanks,

tglx

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>> I've now also two gdb backtraces from two crashes:
>> http://pastebin.com/raw.php?i=yih5jNt8
>>
>> http://pastebin.com/raw.php?i=kGEcvH4T
>
> They don't tell me anything, as I have no idea of the inner workings of
> Asterisk. You might be better off talking to the Asterisk folks to help
> you track down what that thing is waiting for, so we can actually look
> at a well-defined area.

The Asterisk guys told me it's a livelock: Asterisk is waiting in
getaddrinfo / recvmsg.

Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
#0 0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fbeb9e92007 in getaddrinfo () from
/lib/x86_64-linux-gnu/libc.so.6
#4 0x00000000004f2e81 in ast_sockaddr_resolve
(addrs=addrs@entry=0x7fbe989c48d8, str=<optimized out>,
str@entry=0x7fbe989c4a80 "10.12.12.92:2052", flags=flags@entry=0,
family=2) at netsock2.c:268
#5 0x00007fbeb095fb8b in ast_sockaddr_resolve_first_af
(family=<optimized out>, flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",
addr=0x7fbeb4114948) at chan_sip.c:30797
#6 ast_sockaddr_resolve_first_transport (transport=<optimized out>,
flag=0, name=0x7fbe989c4a80 "10.12.12.92:2052",
addr=0x7fbeb4114948) at chan_sip.c:30828
#7 set_destination (uri=<optimized out>, p=0x7fbeb4114438) at
chan_sip.c:10602
#8 reqprep (req=req@entry=0x7fbe989c4fc0, p=p@entry=0x7fbeb4114438,
sipmethod=sipmethod@entry=8, seqno=<optimized out>,
seqno@entry=0, newbranch=newbranch@entry=1) at chan_sip.c:10925
---Type <return> to continue, or q <return> to quit---
#9 0x00007fbeb0969968 in transmit_request_with_auth
(p=p@entry=0x7fbeb4114438, newbranch=1, reliable=XMIT_RELIABLE, seqno=0,
sipmethod=8) at chan_sip.c:14296
#10 0x00007fbeb098b704 in sip_hangup (ast=<optimized out>) at
chan_sip.c:6822
#11 0x0000000000477b55 in ast_hangup (chan=chan@entry=0x7fbeb40ef018) at
channel.c:2887
#12 0x000000000050ce1b in __ast_pbx_run (c=c@entry=0x7fbeb40ef018,
args=args@entry=0x0) at pbx.c:5729
#13 0x000000000050e426 in pbx_thread (data=data@entry=0x7fbeb40ef018) at
pbx.c:5821
#14 0x0000000000548aaa in dummy_start (data=<optimized out>) at utils.c:1151
#15 0x00007fbeb965eb50 in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007fbeb9eb395d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000000000000000 in ?? ()

In both cases, the thread is waiting in recvmsg.

Stefan

>
> Thanks,
>
> tglx
>

2015-11-18 21:00:29

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1



On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:
>
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
> > On Tue, 17 Nov 2015, Stefan Priebe wrote:
> >> I've now also two gdb backtraces from two crashes:
> >> http://pastebin.com/raw.php?i=yih5jNt8
> >>
> >> http://pastebin.com/raw.php?i=kGEcvH4T
> >
> > They don't tell me anything, as I have no idea of the inner workings of
> > Asterisk. You might be better off talking to the Asterisk folks to help
> > you track down what that thing is waiting for, so we can actually look
> > at a well-defined area.
>
> The Asterisk guys told me it's a livelock: Asterisk is waiting in
> getaddrinfo / recvmsg.
>
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0 0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2 0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3 0x00007fbeb9e92007 in getaddrinfo () from
> /lib/x86_64-linux-gnu/libc.so.6

Florian Weimer recently patched glibc to detect file descriptors which
have been reused incorrectly in the glibc name-resolution routines:
<https://sourceware.org/ml/libc-alpha/2015-10/msg00865.html>

Could you determine the type of file descriptor in /proc/pid/fd and
/proc/net/netlink by matching inodes?

Thanks,
Hannes

2015-11-18 21:18:11

by Florian Weimer

Subject: Re: Asterisk deadlocks since Kernel 4.1

On 11/18/2015 09:23 PM, Stefan Priebe wrote:
>
> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>> I've now also two gdb backtraces from two crashes:
>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>
>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>
>> They don't tell me anything, as I have no idea of the inner workings of
>> Asterisk. You might be better off talking to the Asterisk folks to help
>> you track down what that thing is waiting for, so we can actually look
>> at a well-defined area.
>
> The Asterisk guys told me it's a livelock: Asterisk is waiting in
> getaddrinfo / recvmsg.
>
> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
> #0 0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2 0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3 0x00007fbeb9e92007 in getaddrinfo () from
> /lib/x86_64-linux-gnu/libc.so.6

Stefan,

please try to get a backtrace with debugging information. It is likely
that this is the make_request/__check_pf functionality in glibc, but it
would be nice to get some certainty.

Which glibc version do you use? Has it got a fix for CVE-2013-7423?

So far, the only known cause for a hang in this place (that is, lack of
return from recvmsg) is incorrect file descriptor use. (CVE-2013-7423
is such an issue in glibc itself.) The kernel upgrade could change
scheduling behavior, and the actual bug might have been latent before.
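
To make the bug class concrete, here is an illustrative sketch (my
construction, not Asterisk's code, and the address is a placeholder) of
how a stale descriptor number can steal the resolver's netlink traffic:

#include <netdb.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int stale_fd;                   /* fd number cached after close() */

static void *resolver(void *arg)
{
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;       /* makes getaddrinfo run __check_pf */
    if (getaddrinfo("192.0.2.1", "5060", &hints, &res) == 0)
        freeaddrinfo(res);             /* never reached if the reply is stolen */
    return NULL;
}

int main(void)
{
    char buf[256];
    pthread_t t;

    stale_fd = dup(0);
    close(stale_fd);                   /* the number is now free for reuse */
    pthread_create(&t, NULL, resolver, NULL);
    usleep(1000);                      /* let the resolver open its sockets */
    /* BUG: reading a closed fd number; if glibc's internal netlink
     * socket was assigned this number, the read consumes the response
     * that recvmsg() inside getaddrinfo() is still waiting for. */
    read(stale_fd, buf, sizeof(buf));
    pthread_join(t, NULL);
    return 0;
}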

Theoretically, recvmsg could also hang if the Netlink query was dropped
by the kernel, or the final packet in the response was dropped. We
never saw that happen, even under extreme load, but I didn't test with
recent kernels.

The glibc change Hannes mentioned won't detect the hang, but if there is
incorrect file descriptor reuse going on, it is possible that the new
assert catches it.

Florian

Subject: Re: Asterisk deadlocks since Kernel 4.1

Am 18.11.2015 um 22:00 schrieb Hannes Frederic Sowa:
>
>
> On Wed, Nov 18, 2015, at 21:23, Stefan Priebe wrote:
>>
>> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>>> I've now also two gdb backtraces from two crashes:
>>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>>
>>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>>
>>> They don't tell me anything, as I have no idea of the inner workings of
>>> Asterisk. You might be better off talking to the Asterisk folks to help
>>> you track down what that thing is waiting for, so we can actually look
>>> at a well-defined area.
>>
>> The Asterisk guys told me it's a livelock: Asterisk is waiting in
>> getaddrinfo / recvmsg.
>>
>> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
>> #0 0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
>> #1 0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #2 0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3 0x00007fbeb9e92007 in getaddrinfo () from
>> /lib/x86_64-linux-gnu/libc.so.6
>
> Florian Weimer recently patched glibc to detect file descriptors which
> have been reused incorrectly in the glibc name-resolution routines:
> <https://sourceware.org/ml/libc-alpha/2015-10/msg00865.html>
>
> Could you determine the type of file descriptor in /proc/pid/fd and
> /proc/net/netlink by matching inodes?

you mean just:
la /proc/$pid/fd

and

cat /proc/net/netlink

Stefan

> Thanks,
> Hannes
>

2015-11-18 21:23:18

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
> you mean just:
> la /proc/$pid/fd

ls -l /proc/pid/fd/

The numbers in brackets in the readlink output are the inode numbers.

> and
>
> cat /proc/net/netlink

Exactly; the last column is the inode number.
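
For example (numbers are hypothetical): if ls -l shows

lrwx------ 1 root root 64 Nov 19 10:56 35 -> socket:[530712]

then the row in /proc/net/netlink whose last column is 530712 is that
socket, and its Eth column is the netlink protocol (0 being
NETLINK_ROUTE, the one the resolver uses).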

Bye,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 18.11.2015 um 22:18 schrieb Florian Weimer:
> On 11/18/2015 09:23 PM, Stefan Priebe wrote:
>>
>> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>>> I've now also two gdb backtraces from two crashes:
>>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>>
>>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>>
>>> They don't tell me anything, as I have no idea of the inner workings of
>>> Asterisk. You might be better off talking to the Asterisk folks to help
>>> you track down what that thing is waiting for, so we can actually look
>>> at a well-defined area.
>>
>> The Asterisk guys told me it's a livelock: Asterisk is waiting in
>> getaddrinfo / recvmsg.
>>
>> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
>> #0 0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
>> #1 0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #2 0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3 0x00007fbeb9e92007 in getaddrinfo () from
>> /lib/x86_64-linux-gnu/libc.so.6
>
> Stefan,
>
> please try to get a backtrace with debugging information. It is likely
> that this is the make_request/__check_pf functionality in glibc, but it
> would be nice to get some certainty.
>
> Which glibc version do you use? Has it got a fix for CVE-2013-7423?

It's Debian's 2.13-38+deb7u8. Debian's issue tracker says it is fixed:
https://security-tracker.debian.org/tracker/CVE-2013-7423

> So far, the only known cause for a hang in this place (that is, lack of
> return from recvmsg) is incorrect file descriptor use. (CVE-2013-7423
> is such an issue in glibc itself.) The kernel upgrade could change
> scheduling behavior, and the actual bug might have been latent before.
>
> Theoretically, recvmsg could also hang if the Netlink query was dropped
> by the kernel, or the final packet in the response was dropped. We
> never saw that happen, even under extreme load, but I didn't test with
> recent kernels.

The load is very low on this system: just 30 phones and only 1-6 calls at a time.

> The glibc change Hannes mentioned won't detect the hang, but if there is
> incorrect file descriptor reuse going on, it is possible that the new
> assert catches it.
>
> Florian
>


Stefan

Subject: Re: Asterisk deadlocks since Kernel 4.1

Am 18.11.2015 um 22:18 schrieb Florian Weimer:
> On 11/18/2015 09:23 PM, Stefan Priebe wrote:
>>
>> Am 17.11.2015 um 20:43 schrieb Thomas Gleixner:
>>> On Tue, 17 Nov 2015, Stefan Priebe wrote:
>>>> I've now also two gdb backtraces from two crashes:
>>>> http://pastebin.com/raw.php?i=yih5jNt8
>>>>
>>>> http://pastebin.com/raw.php?i=kGEcvH4T
>>>
>>> They don't tell me anything, as I have no idea of the inner workings of
>>> Asterisk. You might be better off talking to the Asterisk folks to help
>>> you track down what that thing is waiting for, so we can actually look
>>> at a well-defined area.
>>
>> The Asterisk guys told me it's a livelock: Asterisk is waiting in
>> getaddrinfo / recvmsg.
>>
>> Thread 2 (Thread 0x7fbe989c6700 (LWP 12890)):
>> #0 0x00007fbeb9eb487d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
>> #1 0x00007fbeb9ed4fcc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #2 0x00007fbeb9ed544a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #3 0x00007fbeb9e92007 in getaddrinfo () from
>> /lib/x86_64-linux-gnu/libc.so.6
>
> Stefan,
>
> please try to get a backtrace with debugging information. It is likely
> that this is the make_request/__check_pf functionality in glibc, but it
> would be nice to get some certainty.

Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
have IPv6 except for link-local. Could it be this one?

https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
#0 0x00007f295de3287d in recvmsg () at
../sysdeps/unix/syscall-template.S:82
#1 0x00007f295de52fcc in make_request (fd=35, pid=26631,
seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
in6ai=<optimized out>, in6ailen=<optimized out>) at
../sysdeps/unix/sysv/linux/check_pf.c:119
#2 0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
in6ailen=0x7f295c00e838) at ../sysdeps/unix/sysv/linux/check_pf.c:271
#3 0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
"10.12.12.55", service=0x7f295c00e8bc "2135",
hints=0x7f295c00e910, pai=0x7f295c00e908) at
../sysdeps/posix/getaddrinfo.c:2389
#4 0x000000000050287e in ast_sockaddr_resolve (addrs=0x7f295c00e9d0,
str=0x7f295c00ea30 "10.12.12.55:2135", flags=0, family=2)
at netsock2.c:268
#5 0x00007f2958963ba2 in ast_sockaddr_resolve_first_af
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135", flag=0,
family=2) at chan_sip.c:30689
#6 0x00007f2958963cb5 in ast_sockaddr_resolve_first_transport
(addr=0x7f29300591d8, name=0x7f295c00ea30 "10.12.12.55:2135",
flag=0, transport=1) at chan_sip.c:30720
#7 0x00007f29588fd3cc in set_destination (p=0x7f2930058cc8,
uri=0x7f29300576e8 "sip:[email protected]:2135;line=to7a729l")
at chan_sip.c:10455
#8 0x00007f29588fe6e0 in reqprep (req=0x7f295c00fee0, p=0x7f2930058cc8,
sipmethod=4, seqno=287, newbranch=1) at chan_sip.c:10778
#9 0x00007f295890a201 in transmit_state_notify (p=0x7f2930058cc8,
state=1, full=1, timeout=0) at chan_sip.c:13259
#10 0x00007f29589141bb in cb_extensionstate (context=0x7f295c010cd0
"hints", exten=0x7f295c010c80 "9052QS", state=1,
data=0x7f2930058cc8) at chan_sip.c:15117
#11 0x000000000050ebf6 in handle_statechange (datap=0x7f293acef830) at
pbx.c:4972
#12 0x0000000000555f8e in tps_processing_function (data=0x1f24f28) at
taskprocessor.c:327
#13 0x0000000000569280 in dummy_start (data=0x1ed76f0) at utils.c:1173
#14 0x00007f295d5dcb50 in start_thread (arg=<optimized out>) at
pthread_create.c:304
#15 0x00007f295de3195d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#16 0x0000000000000000 in ?? ()

>
> Which glibc version do you use? Has it got a fix for CVE-2013-7423?
>
> So far, the only known cause for a hang in this place (that is, lack of
> return from recvmsg) is incorrect file descriptor use. (CVE-2013-7423
> is such an issue in glibc itself.) The kernel upgrade could change
> scheduling behavior, and the actual bug might have been latent before.
>
> Theoretically, recvmsg could also hang if the Netlink query was dropped
> by the kernel, or the final packet in the response was dropped. We
> never saw that happen, even under extreme load, but I didn't test with
> recent kernels.
>
> The glibc change Hannes mentioned won't detect the hang, but if there is
> incorrect file descriptor reuse going on, it is possible that the new
> assert catches it.
>
> Florian
>

2015-11-18 21:40:29

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
> have IPv6 except for link-local. Could it be this one?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
>
> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> #0 0x00007f295de3287d in recvmsg () at
> ../sysdeps/unix/syscall-template.S:82
> #1 0x00007f295de52fcc in make_request (fd=35, pid=26631,
> seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
> in6ai=<optimized out>, in6ailen=<optimized out>) at
> ../sysdeps/unix/sysv/linux/check_pf.c:119
> #2 0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
> in6ailen=0x7f295c00e838) at
> ../sysdeps/unix/sysv/linux/check_pf.c:271
> #3 0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
> "10.12.12.55", service=0x7f295c00e8bc "2135",
> hints=0x7f295c00e910, pai=0x7f295c00e908) at
> ../sysdeps/posix/getaddrinfo.c:2389

Can you also get a /proc/pid/stack stacktrace?

Thanks,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:
> On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
>> Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
>> have IPv6 except for link-local. Could it be this one?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
>>
>> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
>> #0 0x00007f295de3287d in recvmsg () at
>> ../sysdeps/unix/syscall-template.S:82
>> #1 0x00007f295de52fcc in make_request (fd=35, pid=26631,
>> seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
>> in6ai=<optimized out>, in6ailen=<optimized out>) at
>> ../sysdeps/unix/sysv/linux/check_pf.c:119
>> #2 0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
>> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
>> in6ailen=0x7f295c00e838) at
>> ../sysdeps/unix/sysv/linux/check_pf.c:271
>> #3 0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
>> "10.12.12.55", service=0x7f295c00e8bc "2135",
>> hints=0x7f295c00e910, pai=0x7f295c00e908) at
>> ../sysdeps/posix/getaddrinfo.c:2389
>
> Can you also get a /proc/pid/stack stacktrace?

Sure:

For the thread:
# cat /proc/26631/task/26654/stack
[<ffffffffb34e5cdb>] __skb_recv_datagram+0x52b/0x5d0
[<ffffffffb34e5db2>] skb_recv_datagram+0x32/0x40
[<ffffffffb35227bc>] netlink_recvmsg+0x4c/0x300
[<ffffffffb34d3863>] sock_recvmsg+0x13/0x20
[<ffffffffb34d679e>] ___sys_recvmsg+0xee/0x230
[<ffffffffb34d7739>] __sys_recvmsg+0x49/0x90
[<ffffffffb34d7792>] SyS_recvmsg+0x12/0x20
[<ffffffffb363f1ee>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

Stefan

>
> Thanks,
> Hannes
>

2015-11-18 21:58:52

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Wed, Nov 18, 2015, at 22:42, Stefan Priebe wrote:
>
> Am 18.11.2015 um 22:40 schrieb Hannes Frederic Sowa:
> > On Wed, Nov 18, 2015, at 22:36, Stefan Priebe wrote:
> >> Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
> >> have IPv6 except for link-local. Could it be this one?
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
> >>
> >> Thread 31 (Thread 0x7f295c011700 (LWP 26654)):
> >> #0 0x00007f295de3287d in recvmsg () at
> >> ../sysdeps/unix/syscall-template.S:82
> >> #1 0x00007f295de52fcc in make_request (fd=35, pid=26631,
> >> seen_ipv4=<optimized out>, seen_ipv6=<optimized out>,
> >> in6ai=<optimized out>, in6ailen=<optimized out>) at
> >> ../sysdeps/unix/sysv/linux/check_pf.c:119
> >> #2 0x00007f295de5344a in __check_pf (seen_ipv4=0x7f295c00e85f,
> >> seen_ipv6=0x7f295c00e85e, in6ai=0x7f295c00e840,
> >> in6ailen=0x7f295c00e838) at
> >> ../sysdeps/unix/sysv/linux/check_pf.c:271
> >> #3 0x00007f295de10007 in *__GI_getaddrinfo (name=0x7f295c00e8b0
> >> "10.12.12.55", service=0x7f295c00e8bc "2135",
> >> hints=0x7f295c00e910, pai=0x7f295c00e908) at
> >> ../sysdeps/posix/getaddrinfo.c:2389
> >
> > Can you also get a /proc/pid/stack stacktrace?
>
> Sure:
>
> For the thread:
> # cat /proc/26631/task/26654/stack
> [<ffffffffb34e5cdb>] __skb_recv_datagram+0x52b/0x5d0
> [<ffffffffb34e5db2>] skb_recv_datagram+0x32/0x40
> [<ffffffffb35227bc>] netlink_recvmsg+0x4c/0x300
> [<ffffffffb34d3863>] sock_recvmsg+0x13/0x20
> [<ffffffffb34d679e>] ___sys_recvmsg+0xee/0x230
> [<ffffffffb34d7739>] __sys_recvmsg+0x49/0x90
> [<ffffffffb34d7792>] SyS_recvmsg+0x12/0x20
> [<ffffffffb363f1ee>] system_call_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff

OK, we are definitely in netlink code and thus have a correct (or at
least a netlink) fd which is blocking and waiting for an answer. I
researched the glibc code a little bit, but I think the occurrences of
IPv6 parameters are just generic glibc code.

Let me try to reproduce that.

Thanks,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 18.11.2015 um 22:22 schrieb Hannes Frederic Sowa:
> On Wed, Nov 18, 2015, at 22:20, Stefan Priebe wrote:
>> you mean just:
>> la /proc/$pid/fd
>
> ls -l /proc/pid/fd/
>
> The numbers in brackets in the readlink output are the inode numbers.
>
>> and
>>
>> cat /proc/net/netlink
>
> Exactly; the last column is the inode number.

It has stopped locking up since I've set:

net.ipv6.conf.all.disable_ipv6=1

Stefan

2015-11-19 09:39:42

by Florian Weimer

[permalink] [raw]
Subject: Re: Asterisk deadlocks since Kernel 4.1

On 11/18/2015 10:23 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information. It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
>>
>> Which glibc version do you use? Has it got a fix for CVE-2013-7423?
>
> It's Debian's 2.13-38+deb7u8. Debian's issue tracker says it is fixed:
> https://security-tracker.debian.org/tracker/CVE-2013-7423

I checked, and the patch is there and applied.

>> Theoretically, recvmsg could also hang if the Netlink query was dropped
>> by the kernel, or the final packet in the response was dropped. We
>> never saw that happen, even under extreme load, but I didn't test with
>> recent kernels.
>
> The load is very low on this system: just 30 phones and only 1-6 calls at a time.

This is rather odd.

I'm not sure if we can do anything simple on the glibc side to help
debug this. If something in the process consumes the Netlink message
from the kernel (incorrectly reading from the internal glibc file
descriptor), this is fairly difficult to detect inside glibc (it would
require rather large changes which do not currently exist at all).

Florian

2015-11-19 09:45:06

by Florian Weimer

Subject: Re: Asterisk deadlocks since Kernel 4.1

On 11/18/2015 10:36 PM, Stefan Priebe wrote:

>> please try to get a backtrace with debugging information. It is likely
>> that this is the make_request/__check_pf functionality in glibc, but it
>> would be nice to get some certainty.
>
>> Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
>> have IPv6 except for link-local.

glibc needs to know if the system has global unicast addresses if it
receives AAAA records.

It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
away. Even with that setting, the kernel seems to send two Netlink
responses. So either this is enough to narrow the window for the race
so that it no longer triggers, or there is a genuine kernel issue with
supplying the requested IPv6 Netlink response.
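
For reference, what __check_pf is waiting on is a single RTM_GETADDR
dump; in outline the exchange looks like this (a sketch of the pattern
only, not glibc's exact code):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* One RTM_GETADDR dump request, then recv() until the kernel ends the
 * dump with NLMSG_DONE.  If that final message is never delivered, the
 * receive call blocks forever, which is the hang in the backtraces. */
static int dump_addresses(void)
{
    struct {
        struct nlmsghdr nlh;
        struct rtgenmsg g;
    } req;
    char buf[8192];
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.g));
    req.nlh.nlmsg_type = RTM_GETADDR;          /* dump all addresses */
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.g.rtgen_family = AF_UNSPEC;            /* both IPv4 and IPv6 */
    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0)
        goto fail;

    for (;;) {
        int len = (int)recv(fd, buf, sizeof(buf), 0);   /* blocks here */
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

        if (len <= 0)
            goto fail;
        for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
            if (nlh->nlmsg_type == NLMSG_DONE) {        /* end of dump */
                close(fd);
                return 0;
            }
            /* real code parses RTM_NEWADDR payloads here */
        }
    }
fail:
    close(fd);
    return -1;
}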

> Could it be this one?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79

No, that's on the DNS/UDP side, not in the Netlink code.

Florian

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 19.11.2015 um 10:44 schrieb Florian Weimer:
> On 11/18/2015 10:36 PM, Stefan Priebe wrote:
>
>>> please try to get a backtrace with debugging information. It is likely
>>> that this is the make_request/__check_pf functionality in glibc, but it
>>> would be nice to get some certainty.
>>
>> Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
>> have IPv6 except for link-local.
>
> glibc needs to know if the system has global unicast addresses if it
> receives AAAA records.
>
> It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
> away. Even with that setting, the kernel seems to send two Netlink
> responses. So either this is enough to narrow the window for the race
> so that it no longer triggers, or there is a genuine kernel issue with
> supplying the requested IPv6 Netlink response.

No idea; it also goes away by downgrading to 3.18 again.

Stefan

>> Could it be this one?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
>
> No, that's on the DNS/UDP side, not in the Netlink code.
>
> Florian
>

Subject: Re: Asterisk deadlocks since Kernel 4.1

OK it had a livelock again. It just took more time.

So here is the data:

# la /proc/2598/fd
total 0
dr-x------ 2 root root 0 Nov 19 06:53 .
dr-xr-xr-x 7 callweaver callweaver 0 Nov 18 22:38 ..
lrwx------ 1 root root 64 Nov 19 06:54 0 -> /dev/null
lrwx------ 1 root root 64 Nov 19 06:54 1 -> /dev/null
lrwx------ 1 root root 64 Nov 19 06:54 10 -> socket:[12066]
lr-x------ 1 root root 64 Nov 19 06:54 11 -> anon_inode:inotify
lr-x------ 1 root root 64 Nov 19 06:54 12 -> pipe:[12181]
l-wx------ 1 root root 64 Nov 19 06:54 13 -> pipe:[12181]
lrwx------ 1 root root 64 Nov 19 10:56 14 -> socket:[510853]
lrwx------ 1 root root 64 Nov 19 10:56 15 -> socket:[510854]
lrwx------ 1 root root 64 Nov 19 10:56 16 -> anon_inode:[timerfd]
lr-x------ 1 root root 64 Nov 19 10:56 17 -> pipe:[510856]
l-wx------ 1 root root 64 Nov 19 10:56 18 -> pipe:[510856]
lrwx------ 1 root root 64 Nov 19 10:56 19 -> socket:[208723]
lrwx------ 1 root root 64 Nov 19 06:54 2 -> /dev/null
l-wx------ 1 root root 64 Nov 19 10:56 20 -> /var/log/asterisk/queue_log
lrwx------ 1 root root 64 Nov 19 10:56 21 -> socket:[199595]
lrwx------ 1 root root 64 Nov 19 10:56 22 -> socket:[510873]
lr-x------ 1 root root 64 Nov 19 10:56 23 -> anon_inode:inotify
lrwx------ 1 root root 64 Nov 19 10:56 24 -> socket:[525349]
lrwx------ 1 root root 64 Nov 19 10:56 25 -> socket:[525350]
lrwx------ 1 root root 64 Nov 19 10:56 26 -> socket:[510874]
lrwx------ 1 root root 64 Nov 19 10:56 27 -> anon_inode:[timerfd]
lr-x------ 1 root root 64 Nov 19 10:56 28 -> pipe:[510876]
l-wx------ 1 root root 64 Nov 19 10:56 29 -> pipe:[510876]
lr-x------ 1 root root 64 Nov 19 06:54 3 -> /dev/urandom
lrwx------ 1 root root 64 Nov 19 10:56 30 -> socket:[527569]
lrwx------ 1 root root 64 Nov 19 10:56 31 -> socket:[527570]
lrwx------ 1 root root 64 Nov 19 10:56 32 -> socket:[528123]
lrwx------ 1 root root 64 Nov 19 10:56 33 -> socket:[528124]
lrwx------ 1 root root 64 Nov 19 10:56 34 -> socket:[530711]
lrwx------ 1 root root 64 Nov 19 10:56 35 -> socket:[530712]
lrwx------ 1 root root 64 Nov 19 10:56 36 -> socket:[533366]
lrwx------ 1 root root 64 Nov 19 10:56 37 -> socket:[533367]
lrwx------ 1 root root 64 Nov 19 10:56 38 -> socket:[535390]
lrwx------ 1 root root 64 Nov 19 10:56 39 -> socket:[531056]
lrwx------ 1 root root 64 Nov 19 06:54 4 -> socket:[11726]
lrwx------ 1 root root 64 Nov 19 10:56 40 -> socket:[531057]
lrwx------ 1 root root 64 Nov 19 10:56 41 -> socket:[535391]
lrwx------ 1 root root 64 Nov 19 10:56 42 -> socket:[537751]
lrwx------ 1 root root 64 Nov 19 10:56 43 -> socket:[533468]
lrwx------ 1 root root 64 Nov 19 10:56 44 -> socket:[531154]
lrwx------ 1 root root 64 Nov 19 10:56 45 -> socket:[531155]
lrwx------ 1 root root 64 Nov 19 10:56 46 -> socket:[533469]
lrwx------ 1 root root 64 Nov 19 10:56 47 -> socket:[536172]
lrwx------ 1 root root 64 Nov 19 10:56 48 -> socket:[536173]
lrwx------ 1 root root 64 Nov 19 10:56 49 -> socket:[537877]
l-wx------ 1 root root 64 Nov 19 06:54 5 -> /var/log/asterisk/messages
lrwx------ 1 root root 64 Nov 19 10:56 50 -> socket:[537752]
lrwx------ 1 root root 64 Nov 19 10:56 51 -> socket:[539817]
lrwx------ 1 root root 64 Nov 19 10:56 52 -> socket:[537878]
lrwx------ 1 root root 64 Nov 19 10:56 53 -> socket:[539818]
lrwx------ 1 root root 64 Nov 19 10:56 54 -> socket:[541781]
lrwx------ 1 root root 64 Nov 19 10:56 55 -> socket:[541782]
lrwx------ 1 root root 64 Nov 19 10:56 56 -> socket:[543462]
lrwx------ 1 root root 64 Nov 19 10:56 57 -> socket:[545171]
lrwx------ 1 root root 64 Nov 19 10:56 58 -> socket:[537432]
lrwx------ 1 root root 64 Nov 19 10:56 59 -> socket:[537433]
l-wx------ 1 root root 64 Nov 19 06:54 6 -> /var/log/asterisk/debug.log
lrwx------ 1 root root 64 Nov 19 10:56 60 -> socket:[545172]
lrwx------ 1 root root 64 Nov 19 10:56 61 -> anon_inode:[timerfd]
lrwx------ 1 root root 64 Nov 19 10:56 62 -> socket:[541196]
lrwx------ 1 root root 64 Nov 19 10:56 63 -> socket:[538319]
lrwx------ 1 root root 64 Nov 19 10:56 64 -> socket:[538320]
lrwx------ 1 root root 64 Nov 19 10:56 65 -> socket:[474586]
lrwx------ 1 root root 64 Nov 19 10:56 66 -> socket:[541197]
lrwx------ 1 root root 64 Nov 19 10:56 67 -> socket:[542437]
lrwx------ 1 root root 64 Nov 19 10:56 68 -> socket:[542438]
lr-x------ 1 root root 64 Nov 19 10:56 69 -> pipe:[545174]
lrwx------ 1 root root 64 Nov 19 06:54 7 -> /var/lib/asterisk/astdb
lrwx------ 1 root root 64 Nov 19 10:56 70 -> socket:[543463]
l-wx------ 1 root root 64 Nov 19 10:56 71 -> pipe:[545174]
lrwx------ 1 root root 64 Nov 19 10:56 76 -> socket:[543659]
lrwx------ 1 root root 64 Nov 19 10:56 77 -> socket:[543660]
lrwx------ 1 root root 64 Nov 19 10:56 78 -> anon_inode:[timerfd]
lr-x------ 1 root root 64 Nov 19 10:56 79 -> pipe:[543662]
lrwx------ 1 root root 64 Nov 19 06:54 8 -> anon_inode:[timerfd]
l-wx------ 1 root root 64 Nov 19 10:56 80 -> pipe:[543662]
lrwx------ 1 root root 64 Nov 19 06:54 9 -> socket:[12052]

[asterisksnom: ~]# cat /proc/net/netlink
sk               Eth Pid        Groups   Rmem Wmem Dump Locks Drops Inode
ffff8800bac17000 0   0          00000000 0    0    0    2     0     3
ffff8800b5ef8000 4   0          00000000 0    0    0    2     0     6201
ffff8800b71cf000 10  0          00000000 0    0    0    2     0     5455
ffff8800b7176000 11  0          00000000 0    0    0    2     0     12
ffff8800b1169000 15  4294962899 00000000 0    0    0    2     0     7979
ffff8800b16cf000 15  441        00000001 0    0    0    2     0     1542
ffff8800b1168800 15  4294962900 00000000 0    0    0    2     0     7978
ffff8800b7088800 15  0          00000000 0    0    0    2     0     5
ffff8800b71c9800 16  0          00000000 0    0    0    2     0     15
ffff8800b16ca000 16  2362       00000002 0    0    0    2     0     12313

Stefan

Am 19.11.2015 um 10:49 schrieb Stefan Priebe - Profihost AG:
>
> Am 19.11.2015 um 10:44 schrieb Florian Weimer:
>> On 11/18/2015 10:36 PM, Stefan Priebe wrote:
>>
>>>> please try to get a backtrace with debugging information. It is likely
>>>> that this is the make_request/__check_pf functionality in glibc, but it
>>>> would be nice to get some certainty.
>>>
>>> Sorry, here it is. What I'm wondering is why there is IPv6 stuff; I don't
>>> have IPv6 except for link-local.
>>
>> glibc needs to know if the system has global unicast addresses if it
>> receives AAAA records.
>>
>> It's curious that net.ipv6.conf.all.disable_ipv6=1 makes the problem go
>> away. Even with that setting, the kernel seems to send two Netlink
>> responses. So either this is enough to narrow the window for the race
>> so that it no longer triggers, or there is a genuine kernel issue with
>> supplying the requested IPv6 Netlink response.
>
> No idea; it also goes away by downgrading to 3.18 again.
>
> Stefan
>
>>> Could it be this one?
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=505105#c79
>>
>> No, that's on the DNS/UDP side, not in the Netlink code.
>>
>> Florian
>>

2015-11-19 11:41:37

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> OK it had a livelock again. It just took more time.
>
> So here is the data:

Thanks, I couldn't reproduce it so far with a simple threaded resolver
loop on your kernel. :/
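
For reference, the sort of loop I mean, as an illustrative sketch (not
the exact test program; build with "cc -O2 -pthread repro.c"):

#include <netdb.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void *worker(void *arg)
{
    struct addrinfo hints, *res;

    for (;;) {
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;   /* makes getaddrinfo run __check_pf */
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("10.12.12.92", "2052", &hints, &res) == 0)
            freeaddrinfo(res);         /* a hanging thread never gets here */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int i, nthreads = argc > 1 ? atoi(argv[1]) : 32;
    pthread_t t;

    for (i = 0; i < nthreads; i++)
        pthread_create(&t, NULL, worker, NULL);
    pause();   /* run until a thread livelocks or the process is killed */
    return 0;
}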

Your data is useless if you don't also provide the file descriptor which
you are blocking on right now. ;)

Thanks,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
>> OK it had a livelock again. It just took more time.
>>
>> So here is the data:
>
> Thanks, I couldn't reproduce it so far with simple threaded resolver
> loop on your kernel. :/
>
> Your data is useless if you don't also provide the file descriptor which
> you are blocking on right now. ;)
>
> Thanks,
> Hannes

Ah sorry. So we need the gdb backtrace with the recvmsg and then the fd
list + netlink list?

Stefan

2015-11-19 12:41:45

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
>
> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
> > On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
> >> OK it had a livelock again. It just took more time.
> >>
> >> So here is the data:
> >
> > Thanks, I couldn't reproduce it so far with simple threaded resolver
> > loop on your kernel. :/
> >
> > Your data is useless if you don't also provide the file descriptor which
> > you are blocking on right now. ;)
> >
> > Thanks,
> > Hannes
>
> Ah sorry. So we need the gdb backtrace with the recvmsg and then the fd
> list + netlink list?

Yes, albeit the probability that we find something new is minimal. We
already know it blocks deep down in netlink code. I tried to reproduce
it in a multithreaded environment with no results so far. Could you try
a newer kernel?

Bye,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 19.11.2015 um 13:41 schrieb Hannes Frederic Sowa:
> On Thu, Nov 19, 2015, at 12:43, Stefan Priebe - Profihost AG wrote:
>>
>> Am 19.11.2015 um 12:41 schrieb Hannes Frederic Sowa:
>>> On Thu, Nov 19, 2015, at 10:56, Stefan Priebe - Profihost AG wrote:
>>>> OK it had a livelock again. It just took more time.
>>>>
>>>> So here is the data:
>>>
>>> Thanks, I couldn't reproduce it so far with simple threaded resolver
>>> loop on your kernel. :/
>>>
>>> Your data is useless if you don't also provide the file descriptor which
>>> you are blocking on right now. ;)
>>>
>>> Thanks,
>>> Hannes
>>
>> ah sorry. So we need the gdb backtrace with the rcvmsg and then the fd
>> list + netlink list?
>
> Yes, albeit the probability that we find something new is minimal. We
> already know it blocks deep down in netlink code. I tried to reproduce
> it in a multithreaded environment with no results so far. Could you try
> a newer kernel?

I can try Kernel 4.4-rc1 next week. Or something else?

Stefan

>
> Bye,
> Hannes
>

2015-11-19 13:19:22

by Florian Weimer

Subject: Re: Asterisk deadlocks since Kernel 4.1

On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:

> I can try Kernel 4.4-rc1 next week. Or something else?

I found this bug report which indicates that 4.1.10 works:

<https://issues.asterisk.org/jira/browse/ASTERISK-25251>

But in your original report, you said that 4.1.13 is broken.

This backtrace:

<https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>

shows a lot of waiting on quite different netlink sockets. So if this
is due to a race in Asterisk, it must have happened several times in a row.

Thanks,
Florian

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 19.11.2015 um 14:19 schrieb Florian Weimer:
> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>
>> I can try Kernel 4.4-rc1 next week. Or something else?
>
> I found this bug report which indicates that 4.1.10 works:
>
> <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>
> But in your original report, you said that 4.1.13 is broken.

That's correct, I'm running 4.1.13.

> This backtrace:
>
> <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>
> shows a lot of waiting on quite different netlink sockets. So if this
> is due to a race in Asterisk, it must have happened several times in a row.

Might be; I've always got only one waiting.

Stefan


> Thanks,
> Florian
>

Subject: Re: Asterisk deadlocks since Kernel 4.1

Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>
> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>
>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>
>> I found this bug report which indicates that 4.1.10 works:
>>
>> <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>>
>> But in your original report, you said that 4.1.13 is broken.
>
> That's correct i'm running 4.1.13.
>
>> This backtrace:
>>
>> <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>>
>> shows a lot of waiting on quite different netlink sockets. So if this
>> is due to a race in Asterisk, it must have happened several times in a
>> row.

Kernel 4.4-rc2 works fine. How can we get an idea of which change is
causing the issue in 4.1? It's an LTS kernel, so it should be fixed!

Stefan

2015-11-23 12:57:36

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
> >
> > Am 19.11.2015 um 14:19 schrieb Florian Weimer:
> >> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
> >>
> >>> I can try Kernel 4.4-rc1 next week. Or something else?
> >>
> >> I found this bug report which indicates that 4.1.10 works:
> >>
> >> <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
> >>
> >> But in your original report, you said that 4.1.13 is broken.
> >
> > That's correct i'm running 4.1.13.
> >
> >> This backtrace:
> >>
> >> <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
> >>
> >> shows a lot of waiting on quite different netlink sockets. So if this
> >> is due to a race in Asterisk, it must have happened several times in a
> >> row.
>
> Kernel 4.4-rc2 works fine. How can we get an idea of which change is
> causing the issue in 4.1? It's an LTS kernel, so it should be fixed!

Thanks for testing. I was not able to reproduce it at all, with as much
parallelism and threads as possible on any kernel. Could you try to do a
git bisect?

Thanks,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1


Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>>>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>>>
>>>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>>>
>>>> I found this bug report which indicates that 4.1.10 works:
>>>>
>>>> <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>>>>
>>>> But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
>>>> This backtrace:
>>>>
>>>> <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>>>>
>>>> shows a lot of waiting on quite different netlink sockets. So if this
>>>> is due to a race in Asterisk, it must have happened several times in a
>>>> row.
>>
>> Kernel 4.4-rc2 works fine. How can we get an idea of which change is
>> causing the issue in 4.1? It's an LTS kernel, so it should be fixed!
>
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?

OK, I'll start digging around with major versions first to have a working
and a non-working version.

I'll try 4.3 and 4.2 first to get an idea where it got fixed. Sadly I
can't bisect from 4.1.13 to 4.X, so I first have to verify that it also
happens with 4.1.

OK?

Stefan

>
> Thanks,
> Hannes
>

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hi,

here are the results.

It works with 4.1.
It works with 4.2.
It does not work with 4.1.13.

git bisect tells me it stopped working after those two commits were applied:

commit d48623677191e0f035d7afd344f92cf880b01f8e
Author: Herbert Xu <[email protected]>
Date: Tue Sep 22 11:38:56 2015 +0800

netlink: Replace rhash_portid with bound

commit 4e27762417669cb459971635be550eb7b5598286
Author: Herbert Xu <[email protected]>
Date: Fri Sep 18 19:16:50 2015 +0800

netlink: Fix autobind race condition that leads to zero port ID

Stefan
Am 23.11.2015 um 13:57 schrieb Hannes Frederic Sowa:
> On Mon, Nov 23, 2015, at 13:44, Stefan Priebe - Profihost AG wrote:
>> Am 19.11.2015 um 20:51 schrieb Stefan Priebe:
>>>
>>> Am 19.11.2015 um 14:19 schrieb Florian Weimer:
>>>> On 11/19/2015 01:46 PM, Stefan Priebe - Profihost AG wrote:
>>>>
>>>>> I can try Kernel 4.4-rc1 next week. Or something else?
>>>>
>>>> I found this bug report which indicates that 4.1.10 works:
>>>>
>>>> <https://issues.asterisk.org/jira/browse/ASTERISK-25251>
>>>>
>>>> But in your original report, you said that 4.1.13 is broken.
>>>
>>> That's correct i'm running 4.1.13.
>>>
>>>> This backtrace:
>>>>
>>>> <https://issues.asterisk.org/jira/secure/attachment/52702/bt.txt>
>>>>
>>>> shows a lot of waiting on quite different netlink sockets. So if this
>>>> is due to a race in Asterisk, it must have happened several times in a
>>>> row.
>>
>> Kernel 4.4-rc2 works fine. How can we get an idea of which change is
>> causing the issue in 4.1? It's an LTS kernel, so it should be fixed!
>
> Thanks for testing. I was not able to reproduce it at all, with as much
> parallelism and threads as possible on any kernel. Could you try to do a
> git bisect?
>
> Thanks,
> Hannes
>

2015-12-02 11:40:37

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hello Stefan,

Stefan Priebe - Profihost AG <[email protected]> writes:


> here are the results.
>
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.
>
> git bisect tells me it stopped working after those two commits were applied:
>
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu <[email protected]>
> Date: Tue Sep 22 11:38:56 2015 +0800
>
> netlink: Replace rhash_portid with bound
>
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu <[email protected]>
> Date: Fri Sep 18 19:16:50 2015 +0800
>
> netlink: Fix autobind race condition that leads to zero port ID

Cool, thanks a lot. Does this patch make a difference?

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 59651af..278e94c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)

/* We need to ensure that the socket is hashed and visible. */
smp_wmb();
- nlk_sk(sk)->bound = portid;
+ nlk_sk(sk)->bound = true;

err:
release_sock(sk);
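
For the record, what the flag is doing: netlink_insert() publishes the
hashed socket to lockless readers, and the smp_wmb() above pairs with a
read-side barrier on the bind path. With "bound = portid", a socket
whose autobind ends up with port ID 0 is published as unbound even
though it is already hashed. A self-contained C11 sketch of the pattern
(my illustration, not the kernel source; release/acquire atomics stand
in for the kernel's barriers):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_nlk {
    unsigned int portid;    /* 0 was reachable via the autobind race */
    atomic_bool  bound;     /* the flag the patch above touches */
};

static void insert(struct fake_nlk *nlk, unsigned int portid)
{
    nlk->portid = portid;
    /* release store: everything written before it is visible to any
     * reader that observes bound == true (kernel: smp_wmb() + store) */
    atomic_store_explicit(&nlk->bound, true, memory_order_release);
}

static bool lookup(struct fake_nlk *nlk, unsigned int *portid)
{
    /* acquire load, pairing with the release store in insert() */
    if (!atomic_load_explicit(&nlk->bound, memory_order_acquire))
        return false;   /* with "bound = portid" and portid == 0, a
                         * hashed socket would fall through here */
    *portid = nlk->portid;
    return true;
}

int main(void)
{
    static struct fake_nlk nlk;         /* zero-initialized: unbound */
    unsigned int p;

    insert(&nlk, 0);                    /* the pathological port ID */
    printf("visible as bound: %s\n", lookup(&nlk, &p) ? "yes" : "no");
    return 0;
}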

2015-12-02 17:58:44

by Philipp Hahn

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hi,

Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> here are the results.
>
> It works with 4.1.
> It works with 4.2.
> It does not work with 4.1.13.

The patches were first committed in v4.3-rc3 and appear as backports only
since v4.2.3 and v4.1.10.

> git bisect tells me it stopped working after those two commits were applied:
>
> commit d48623677191e0f035d7afd344f92cf880b01f8e
> Author: Herbert Xu <[email protected]>
> Date: Tue Sep 22 11:38:56 2015 +0800
>
> netlink: Replace rhash_portid with bound
>
> commit 4e27762417669cb459971635be550eb7b5598286
> Author: Herbert Xu <[email protected]>
> Date: Fri Sep 18 19:16:50 2015 +0800
>
> netlink: Fix autobind race condition that leads to zero port ID

I identified the same two patches for our problem; see the mail from
2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
netlink: Fix autobind race condition that leads to zero port ID"

<http://marc.info/?l=linux-kernel&m=144863837825887&w=2>

Philipp

2015-12-02 17:58:43

by Philipp Hahn

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hi,

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:
> Cool, thanks a lot. Does this patch make a difference?
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>
> /* We need to ensure that the socket is hashed and visible. */
> smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>
> err:
> release_sock(sk);

Will test that tomorrow.

What confuses me is that it works with 4.3. I compared
net/netlink/af_netlink.[ch] for differences between v4.1.10 and v4.3 and
didn't find anything which would explain only 4.1 getting stuck.

Philipp

2015-12-02 18:23:48

by Hannes Frederic Sowa

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hello,

On Wed, Dec 2, 2015, at 18:15, Philipp Hahn wrote:
> Hi,
>
> Am 02.12.2015 um 10:45 schrieb Stefan Priebe - Profihost AG:
> > here are the results.
> >
> > It works with 4.1.
> > It works with 4.2.
> > It does not work with 4.1.13.
>
> The patches were first committed in v4.3-rc3 and appear as backports only
> since v4.2.3 and v4.1.10.
>
> > git bisect tells me it stopped working after those two commits were applied:
> >
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu <[email protected]>
> > Date: Tue Sep 22 11:38:56 2015 +0800
> >
> > netlink: Replace rhash_portid with bound
> >
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu <[email protected]>
> > Date: Fri Sep 18 19:16:50 2015 +0800
> >
> > netlink: Fix autobind race condition that leads to zero port ID
>
> I identified the same two patches for our problem; see the mail from
> 2015-11-17 16:53 with subject "Strange PF_NETLINK NETLINK_ROUTE stall:
> netlink: Fix autobind race condition that leads to zero port ID"
>
> <http://marc.info/?l=linux-kernel&m=144863837825887&w=2>

Thanks, I missed this mail. I'll have a further look. I couldn't reproduce
it myself but will check whether your test program does it.

Thanks a lot,
Hannes

Subject: Re: Asterisk deadlocks since Kernel 4.1

> Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa <[email protected]>:
>
> Hello Stefan,
>
> Stefan Priebe - Profihost AG <[email protected]> writes:
>
>
>> here are the results.
>>
>> It works with 4.1.
>> It works with 4.2.
>> It does not work with 4.1.13.
>>
>> git bisect tells me it stopped working after those two commits were applied:
>>
>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>> Author: Herbert Xu <[email protected]>
>> Date: Tue Sep 22 11:38:56 2015 +0800
>>
>> netlink: Replace rhash_portid with bound
>>
>> commit 4e27762417669cb459971635be550eb7b5598286
>> Author: Herbert Xu <[email protected]>
>> Date: Fri Sep 18 19:16:50 2015 +0800
>>
>> netlink: Fix autobind race condition that leads to zero port ID
>
> Cool, thanks a lot. Does this patch make a difference?
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>
> /* We need to ensure that the socket is hashed and visible. */
> smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>
> err:
> release_sock(sk);

Will try that soon, but I already started another test: I reverted all netlink
patches from 4.1.13 back to 4.1 and applied all netlink patches from 4.1 up to 4.3.

Greets,
Stefan

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hi,

I got it fixed (at least it no longer live- or deadlocks) by applying the
following patch, which is the diff of the commits below on top of 4.1.13.

patch:
http://pastebin.com/raw.php?i=hiuq4bsW

all commits / changes in reverse order:

* 0ceb380 - (6 weeks ago) netlink: fix locking around NETLINK_LIST_MEMBERSHIPS - David Herrmann (HEAD)
* c3f272b - (7 weeks ago) netlink: Trim skb to alloc size to avoid MSG_TRUNC - Arad, Ronen
* 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound - Herbert Xu
* 35e9890 - (3 months ago) netlink: Fix autobind race condition that leads to zero port ID - Herbert Xu
* f1d1215 - (3 months ago) netlink, mmap: transform mmap skb into full skb on taps - Daniel Borkmann
* faad871 - (3 months ago) netlink, mmap: fix edge-case leakages in nf queue zero-copy - Daniel Borkmann
* fb18c94 - (3 months ago) netlink, mmap: don't walk rx ring on poll if receive queue non-empty - Daniel Borkmann
* da13789 - (3 months ago) netlink: rx mmap: fix POLLIN condition - Ken-ichirou MATSUZAWA
* 808071f - (3 months ago) netlink: mmap: fix lookup frame position - Ken-ichirou MATSUZAWA
* 589bfd5 - (3 months ago) netlink: add NETLINK_CAP_ACK socket option - Christophe Ricard
* d23c4eb - (4 months ago) netlink: mmap: fix tx type check - Ken-ichirou MATSUZAWA
* 5dcc50a - (4 months ago) netlink: make sure -EBUSY won't escape from netlink_insert - Daniel Borkmann
* ada2b3e - (5 months ago) netlink: don't hold mutex in rcu callback when releasing mmapd ring - Florian Westphal
* e0f54a3 - (5 months ago) netlink: Delete an unnecessary check before the function call "module_put" - Markus Elfring
* 0a5bdaf - (6 months ago) netlink: add API to retrieve all group memberships - David Herrmann
* 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu
* 021a670 - (7 months ago) netlink: Create kernel netlink sockets in the proper network namespace - Eric W. Biederman
* e1b01b4 - (7 months ago) net: Pass kern from net_proto_family.create to sk_alloc - Eric W. Biederman
* dd4b3c9 - (7 months ago) netlink: rename private flags and states - Nicolas Dichtel
* 0356126 - (2 days ago) Revert "netlink: don't hold mutex in rcu callback when releasing mmapd ring" - Stefan Priebe
* 231d0da - (2 days ago) Revert "netlink: make sure -EBUSY won't escape from netlink_insert" - Stefan Priebe
* e0f56af1 - (2 days ago) Revert "netlink, mmap: transform mmap skb into full skb on taps" - Stefan Priebe
* 23a0326 - (2 days ago) Revert "netlink: Fix autobind race condition that leads to zero port ID" - Stefan Priebe
* 97f4677 - (2 days ago) Revert "netlink: Replace rhash_portid with bound" - Stefan Priebe
* 40c851fe - (2 days ago) Revert "netlink: Trim skb to alloc size to avoid MSG_TRUNC" - Stefan Priebe
* 1f2ce4a - (4 weeks ago) Linux 4.1.13 - Greg Kroah-Hartman (v4.1.13, origin/linux-4.1.y)

So the netlink code is in line with 4.3.

Stefan

Am 02.12.2015 um 12:40 schrieb Hannes Frederic Sowa:
> Hello Stefan,
>
> Stefan Priebe - Profihost AG <[email protected]> writes:
>
>
>> here are the results.
>>
>> It works with 4.1.
>> It works with 4.2.
>> It does not work with 4.1.13.
>>
>> git bisect tells me it stopped working after those two commits were applied:
>>
>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>> Author: Herbert Xu <[email protected]>
>> Date: Tue Sep 22 11:38:56 2015 +0800
>>
>> netlink: Replace rhash_portid with bound
>>
>> commit 4e27762417669cb459971635be550eb7b5598286
>> Author: Herbert Xu <[email protected]>
>> Date: Fri Sep 18 19:16:50 2015 +0800
>>
>> netlink: Fix autobind race condition that leads to zero port ID
>
> Cool, thanks a lot. Does this patch make a difference?
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>
> /* We need to ensure that the socket is hashed and visible. */
> smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>
> err:
> release_sock(sk);
>
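
The idea this one-liner probes, presumably: netlink consults the bound
flag to decide whether autobind still has to run, and the assignment
"bound = portid" leaves the flag unset whenever portid happens to be 0,
while "bound = true" cannot.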

2015-12-05 01:08:38

by Herbert Xu

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:
>
> * 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound - Herbert Xu
> * 35e9890 - (3 months ago) netlink: Fix autobind race condition that leads to zero port ID - Herbert Xu
> * 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu

These three patches are absolutely required in any kernel where the
netlink insertion is lockless. So yes, they should be applied to
4.1.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
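
For context, the race closed by these patches is between concurrent
autobind attempts on a netlink socket, which could leave a socket bound
with port ID 0. A minimal userspace exerciser could look like the sketch
below -- an assumption of what such a test might look like, not the
actual test program from this thread (build with: gcc -pthread):

/* Sketch: race two autobind attempts on one netlink socket, then
 * check whether the socket ended up bound with port ID 0. */
#include <linux/netlink.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static void *autobind(void *arg)
{
	int fd = *(int *)arg;
	struct sockaddr_nl addr = { .nl_family = AF_NETLINK };

	/* nl_pid == 0 asks the kernel to pick a port ID (autobind);
	 * the thread losing the race is expected to get an error. */
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
		if (fd < 0) {
			perror("socket");
			return 1;
		}

		pthread_t t1, t2;
		pthread_create(&t1, NULL, autobind, &fd);
		pthread_create(&t2, NULL, autobind, &fd);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);

		struct sockaddr_nl out;
		socklen_t len = sizeof(out);
		if (getsockname(fd, (struct sockaddr *)&out, &len) == 0 &&
		    out.nl_pid == 0) {
			fprintf(stderr, "round %d: bound with port ID 0\n", i);
			return 1;
		}
		close(fd);
	}
	puts("no zero port ID observed");
	return 0;
}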

2015-12-05 14:19:12

by Philipp Hahn

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hello Hannes,

On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:
> > git bisect tells me it stopped working after those two commits were applied:
> >
> > commit d48623677191e0f035d7afd344f92cf880b01f8e
> > Author: Herbert Xu <[email protected]>
> > Date: Tue Sep 22 11:38:56 2015 +0800
> >
> > netlink: Replace rhash_portid with bound
> >
> > commit 4e27762417669cb459971635be550eb7b5598286
> > Author: Herbert Xu <[email protected]>
> > Date: Fri Sep 18 19:16:50 2015 +0800
> >
> > netlink: Fix autobind race condition that leads to zero port ID
>
> Cool, thanks a lot. Does this patch make a difference?
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 59651af..278e94c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>
> /* We need to ensure that the socket is hashed and visible. */
> smp_wmb();
> - nlk_sk(sk)->bound = portid;
> + nlk_sk(sk)->bound = true;
>
> err:
> release_sock(sk);

Didn't help, test program still gets stuck; sorry :-(

Philipp

--
/ / (_)__ __ ____ __ Philipp Hahn
/ /__/ / _ \/ // /\ \/ /
/____/_/_//_/\_,_/ /_/\_\ [email protected]

by Stefan Priebe

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hello Philipp,

On 05.12.2015 at 15:19, Philipp Matthias Hahn wrote:
> Hello Hannes,
>
> On Wed, Dec 02, 2015 at 12:40:32PM +0100, Hannes Frederic Sowa wrote:
>>> git bisect tells me it stopped working after those two commits were applied:
>>>
>>> commit d48623677191e0f035d7afd344f92cf880b01f8e
>>> Author: Herbert Xu <[email protected]>
>>> Date: Tue Sep 22 11:38:56 2015 +0800
>>>
>>> netlink: Replace rhash_portid with bound
>>>
>>> commit 4e27762417669cb459971635be550eb7b5598286
>>> Author: Herbert Xu <[email protected]>
>>> Date: Fri Sep 18 19:16:50 2015 +0800
>>>
>>> netlink: Fix autobind race condition that leads to zero port ID
>>
>> Cool, thanks a lot. Does this patch make a difference?
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index 59651af..278e94c 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -1137,7 +1137,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
>>
>> /* We need to ensure that the socket is hashed and visible. */
>> smp_wmb();
>> - nlk_sk(sk)->bound = portid;
>> + nlk_sk(sk)->bound = true;
>>
>> err:
>> release_sock(sk);
>
> Didn't help, test program still gets stuck; sorry :-(

Can you please try this patch on top of 4.1.13 (which fixes the hang for me):
http://pastebin.com/raw.php?i=hiuq4bsW

Stefan

>
> Philipp
>

by Stefan Priebe

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hi Herbert,

I think I found the issue with netlink in 4.1. Somebody made a mistake
while backporting or cherry-picking your patch "netlink: Fix autobind
race condition that leads to zero port ID" to 4.1: the backport is
missing a goto.

This is the goto that is missing in 4.1:

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4017e12..f15c001 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
if (err == -EEXIST)
err = -EADDRINUSE;
sock_put(sk);
- goto err;
}

/* We need to ensure that the socket is hashed and visible. */

Can you please confirm that this is not correct and might cause those
issues?

Stefan
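
For illustration, here is a small self-contained model of the control
flow around that statement -- the names mirror net/netlink/af_netlink.c,
but every body below is a stub invented for this sketch, not the real
kernel code:

/* Toy model of netlink_insert(); NOT kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define EEXIST     17
#define EADDRINUSE 98

static bool bound;	/* stands in for nlk_sk(sk)->bound */

/* Stub: pretend the hash-table insert failed (port already taken). */
static int __netlink_insert(void)
{
	return -EEXIST;
}

static int netlink_insert(bool with_goto)
{
	int err = __netlink_insert();

	if (err) {
		if (err == -EEXIST)
			err = -EADDRINUSE;
		/* the real code drops a socket reference here */
		if (with_goto)
			goto out;	/* the statement lost in the backport */
	}

	/* Upstream reaches this point only on success; without the goto,
	 * the failure path falls through and marks the socket bound even
	 * though it was never hashed. */
	bound = true;

out:
	return err;
}

int main(void)
{
	bound = false;
	netlink_insert(true);
	printf("with goto (upstream):  bound=%d\n", bound);	/* 0 */

	bound = false;
	netlink_insert(false);
	printf("without goto (4.1.13): bound=%d\n", bound);	/* 1 */
	return 0;
}

A socket that is flagged as bound but absent from the hash table is
presumably how netlink ends up breaking subtly here: later calls skip
autobind and keep using a port ID the kernel cannot look up.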

On 05.12.2015 at 02:08, Herbert Xu wrote:
> On Fri, Dec 04, 2015 at 07:26:12PM +0100, Stefan Priebe wrote:
>>
>> * 9f87e0c - (2 months ago) netlink: Replace rhash_portid with bound - Herbert Xu
>> * 35e9890 - (3 months ago) netlink: Fix autobind race condition that leads to zero port ID - Herbert Xu
>> * 30c6472 - (7 months ago) netlink: Use random autobind rover - Herbert Xu
>
> These three patches are absolutely required in any kernel where the
> netlink insertion is lockless. So yes, they should be applied to
> 4.1.
>
> Thanks,
>

2015-12-07 01:21:08

by Herbert Xu

Subject: Re: Asterisk deadlocks since Kernel 4.1

On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
> Hi Herbert,
>
> I think I found the issue with netlink in 4.1. Somebody made a
> mistake while backporting or cherry-picking your patch "netlink: Fix
> autobind race condition that leads to zero port ID" to 4.1: the
> backport is missing a goto.
>
> This is the goto that is missing in 4.1:
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> - goto err;
> }
>
> /* We need to ensure that the socket is hashed and visible. */
>
> Can you please confirm that this is not correct and might cause
> those issues?

Well spotted! Yes, this would be a fatal error and can cause the
problems you guys are seeing.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

by Stefan Priebe

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hi Herbert,

On 07.12.2015 at 02:20, Herbert Xu wrote:
> On Sun, Dec 06, 2015 at 09:56:34PM +0100, Stefan Priebe wrote:
>> Hi Herbert,
>>
>> I think I found the issue with netlink in 4.1. Somebody made a
>> mistake while backporting or cherry-picking your patch "netlink: Fix
>> autobind race condition that leads to zero port ID" to 4.1: the
>> backport is missing a goto.
>>
>> This is the goto that is missing in 4.1:
>>
>> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
>> index 4017e12..f15c001 100644
>> --- a/net/netlink/af_netlink.c
>> +++ b/net/netlink/af_netlink.c
>> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
>> if (err == -EEXIST)
>> err = -EADDRINUSE;
>> sock_put(sk);
>> - goto err;
>> }
>>
>> /* We need to ensure that the socket is hashed and visible. */
>>
>> Can you please confirm that this is not correct and might cause
>> those issues?
>
> Well spotted! Yes, this would be a fatal error and can cause the
> problems you guys are seeing.

Thanks, good. Can you help me get this fix into the stable lines?

Stefan

>
> Thanks,
>

2015-12-07 07:41:44

by Philipp Hahn

Subject: Re: Asterisk deadlocks since Kernel 4.1

Hello Stefan,

On 06.12.2015 at 21:56, Stefan Priebe wrote:
> I think I found the issue with netlink in 4.1. Somebody made a mistake
> while backporting or cherry-picking your patch "netlink: Fix autobind
> race condition that leads to zero port ID" to 4.1: the backport is
> missing a goto.
>
> This is the goto that is missing in 4.1:
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4017e12..f15c001 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1132,7 +1132,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> - goto err;
> }
>
> /* We need to ensure that the socket is hashed and visible. */
>
> Can you please confirm that this is not correct and might cause those
> issues?

I just tested that patch and it seems to fix our hang. Thank you for
your good work.

Philipp

PS: I guess I can skip your other test request, since this simple patch
is already part of your hiuq4bsW patch. If I should still test it, just
send a note.

2015-12-08 06:13:46

by Herbert Xu

Subject: netlink: Add missing goto statement to netlink_insert

On Mon, Dec 07, 2015 at 07:58:25AM +0100, Stefan Priebe - Profihost AG wrote:
>
> Thanks, good. Can you help me get this fix into the stable lines?

Sure. Greg, please apply this patch to fix up the backport for 4.1.

---8<---
The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") missed a
goto statement, which causes netlink to break subtly.

This was discovered by Stefan Priebe <[email protected]>.

Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
Reported-by: Stefan Priebe <[email protected]>
Reported-by: Philipp Hahn <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d139c43..0d6038c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1118,6 +1118,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
if (err == -EEXIST)
err = -EADDRINUSE;
sock_put(sk);
+ goto err;
}

/* We need to ensure that the socket is hashed and visible. */

--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-08 16:21:54

by David Miller

Subject: Re: netlink: Add missing goto statement to netlink_insert

From: Herbert Xu <[email protected]>
Date: Tue, 8 Dec 2015 14:13:19 +0800

> On Mon, Dec 07, 2015 at 07:58:25AM +0100, Stefan Priebe - Profihost AG wrote:
>>
>> Thanks, good. Can you help me get this fix into the stable lines?
>
> Sure. Greg, please apply this patch to fix up the backport for 4.1.

Herbert, thanks so much for taking care of this.

> ---8<---
> The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> Fix autobind race condition that leads to zero port ID") missed a
> goto statement, which causes netlink to break subtly.
>
> This was discovered by Stefan Priebe <[email protected]>.
>
> Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
> Reported-by: Stefan Priebe <[email protected]>
> Reported-by: Philipp Hahn <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>

Acked-by: David S. Miller <[email protected]>

>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index d139c43..0d6038c 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -1118,6 +1118,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
> if (err == -EEXIST)
> err = -EADDRINUSE;
> sock_put(sk);
> + goto err;
> }
>
> /* We need to ensure that the socket is hashed and visible. */
>
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-12-09 03:29:56

by Greg KH

Subject: Re: netlink: Add missing goto statement to netlink_insert

On Tue, Dec 08, 2015 at 02:13:19PM +0800, Herbert Xu wrote:
> On Mon, Dec 07, 2015 at 07:58:25AM +0100, Stefan Priebe - Profihost AG wrote:
> >
> > Thanks, good. Can you help me get this fix into the stable lines?
>
> Sure. Greg, please apply this patch to fix up the backport for 4.1.

Now queued up, thanks for the fix.

greg k-h