LinuxLists.cc - Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

2007-09-11 12:18:19

Subject: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

Hi,

Since upgrading from SLES9 SP3 to SLES10 SP1 I see kernel segfaults which seem
network-related: most notably, slapd does not run any more, and my sendmail-milter
based virus scanner terminates now and then with a kernel segfault.

The current kernel from SLES10 SP1 is:

# cat /proc/version
Linux version 2.6.16.53-0.8-smp (geeko@buildhost) (gcc version 4.1.2 20070115
(prerelease) (SUSE Linux)) #1 SMP Fri Aug 31 13:07:27 UTC 2007

The effects in syslog are:
Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 0000000000000008 rip
000000000042c17a rsp 00007fffea55de00 error 4
Aug 31 15:14:57 kgate1 kernel: slapd[5296]: segfault at 0000555555981000 rip
00002ad35ffee46b rsp 00007fff4bf58c28 error 4
Aug 31 15:17:13 kgate1 kernel: powersaved[4747]: segfault at 0000000000000008 rip
000000000042c17a rsp 00007ffff434e260 error 4
Aug 31 17:50:48 kgate1 kernel: slapd[5561]: segfault at 0000555555986000 rip
00002b8fa3cf3483 rsp 00007fff08252808 error 4
Sep 3 09:02:04 kgate1 kernel: slapd[22654]: segfault at 0000555555992000 rip
00002afd6f7b4483 rsp 00007fff3c790458 error 4
Sep 3 13:14:45 kgate1 kernel: slapd[28324]: segfault at 0000555555962000 rip
00002b5c0ae00483 rsp 00007fffa1144e58 error 4
Sep 7 07:48:26 kgate1 kernel: hscan[1142]: segfault at 0000000000000003 rip
00002afac0581650 rsp 0000000041000928 error 4
Sep 7 09:12:24 kgate1 kernel: slapd[6022]: segfault at 00005555559b3000 rip
00002b1c15539483 rsp 00007fff96a0c978 error 4
Sep 10 17:02:35 kgate1 kernel: hscan[6795]: segfault at 0000000000000004 rip
00002b59c0300650 rsp 0000000042002928 error 4
Sep 11 08:43:43 kgate1 kernel: hscan[3456]: segfault at 0000000000000004 rip
00002adcd625d650 rsp 0000000043004928 error 4
Sep 11 10:45:38 kgate1 kernel: hscan[28343]: segfault at 0000000000000003 rip
00002b17020de650 rsp 0000000042803928 error 4

I know that this kind of report is not very helpful to you guys, but Novell does
not allow me to report a kernel bug directly. (I've told the person who is allowed to
do so, but I'm unsure whether something is in progress already.)

Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
Linux version 2.6.16.53-0.8-default (geeko@buildhost) (gcc version 4.1.2 20070115
(prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007

Regards,
Ulrich

2007-09-11 13:01:57

by Eric Dumazet

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On Tue, 11 Sep 2007 11:30:38 +0200
"Ulrich Windl" <[email protected]> wrote:

> [...]
> Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
> Linux version 2.6.16.53-0.8-default (geeko@buildhost) (gcc version 4.1.2 20070115
> (prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007

Are you sure?

Segfaults are logged to syslog only on 64-bit kernels.

Maybe your slapd/hscan processes are doing bad things that make them
dump core without notice on a 32-bit kernel.

Eric

2007-09-11 15:16:18

by Ulrich Windl

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On 11 Sep 2007 at 15:01, Eric Dumazet wrote:

[...]
> > Also note that the i586 (32-bit, non-SMP) kernel does not have that problem.
> > Linux version 2.6.16.53-0.8-default (geeko@buildhost) (gcc version 4.1.2 20070115
> > (prerelease) (SUSE Linux)) #1 Fri Aug 31 13:07:27 UTC 2007
>
> Are you sure?

Not any more ;-)

>
> Segfaults are logged to syslog only on 64-bit kernels.
>
> Maybe your slapd/hscan processes are doing bad things that make them
> dump core without notice on a 32-bit kernel.

I'm using the sendmail milter library, which does the socket communication. So any
bad behavior should be looked for there.

I tend to think that the same program, when compiled as a 32-bit executable,
does not cause these segfaults on a 64-bit kernel.

I also tried to use ksymoops to get a disassembly of the corresponding kernel
code, but the result did not look good to me.

Is there a deeper reason why the kernel does not provide more info (like a call
trace) on segfaults?

Would an strace of the program be helpful? Unfortunately it is multi-threaded, as
slapd most likely is too.

When I tried it for slapd, the (rest of the) strace was:
9931 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
9931 connect(3, {sa_family=AF_INET, sin_port=htons(427), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
9931 setsockopt(3, SOL_SOCKET, SO_RCVLOWAT, [18], 4) = 0
9931 setsockopt(3, SOL_SOCKET, SO_SNDLOWAT, [18], 4) = -1 ENOPROTOOPT (Protocol not available)
9931 mmap(NULL, 1434435584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaae32000
9931 --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Regards,
Ulrich

2007-09-11 15:55:20

by Ulrich Windl

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On 11 Sep 2007 at 15:01, Eric Dumazet wrote:

> On Tue, 11 Sep 2007 11:30:38 +0200
> "Ulrich Windl" <[email protected]> wrote:
>
> > Hi,
> >
> > Since upgrading from SLES9 SP3 to SLES10 SP1 I see kernel segfaults which seem
> > network-related: most notably, slapd does not run any more, and my sendmail-milter
> > based virus scanner terminates now and then with a kernel segfault.
> >
> > The current kernel from SLES10 SP1 is:
> >
> > # cat /proc/version
> > Linux version 2.6.16.53-0.8-smp (geeko@buildhost) (gcc version 4.1.2 20070115
> > (prerelease) (SUSE Linux)) #1 SMP Fri Aug 31 13:07:27 UTC 2007
> >
> > The effects in syslog are:
> > Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 0000000000000008 rip
> > 000000000042c17a rsp 00007fffea55de00 error 4
[...]
> Segfaults are logged to syslog only on 64-bit kernels.
>
> Maybe your slapd/hscan processes are doing bad things that make them
> dump core without notice on a 32-bit kernel.

A very wild guess: AFAIK SUSE distributions have recently been XENified, that is,
they ship libraries that treat thread-local storage differently from the default. If
these programs (powersaved, slapd, hscan) are all multithreaded, could the cause of
the problem lie in that area?

If not, any clues on debugging/tracing? There's a
/usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".

I also learned that the error code is only documented for i386 arch (thanks to
Emacs ediff):
* error_code:
* bit 0 == 0 means no page found, 1 means protection fault
* bit 1 == 0 means read, 1 means write
* bit 2 == 0 means kernel, 1 means user-mode

So the problem (error 4) looks a bit like a read on a NULL-pointer dereference,
right? And the "rip" is user space, correct?

Regards,
Ulrich

2007-09-11 15:57:35

by Eric Dumazet

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On Tue, 11 Sep 2007 17:15:26 +0200
"Ulrich Windl" <[email protected]> wrote:

> On 11 Sep 2007 at 15:01, Eric Dumazet wrote:
>
> [...]
> When I tried it for slapd, the (rest of the) strace was:
> 9931 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
> 9931 connect(3, {sa_family=AF_INET, sin_port=htons(427), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> 9931 setsockopt(3, SOL_SOCKET, SO_RCVLOWAT, [18], 4) = 0
> 9931 setsockopt(3, SOL_SOCKET, SO_SNDLOWAT, [18], 4) = -1 ENOPROTOOPT (Protocol not available)
> 9931 mmap(NULL, 1434435584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaae32000
> 9931 --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Definitely a user-mode problem, dereferencing a NULL pointer.

Try attaching gdb to this process instead of stracing it; a "bt" command should
then tell you something useful.

The strange thing here is that this program wants a huge block of memory (1434435584 bytes),
so maybe some file is corrupted; you should probably check database integrity first.
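
For a sense of scale, the mmap length from the strace can be converted to pages and
MiB. This is only a small Python sketch; the 4096-byte page size is an assumption
(the usual x86_64 default), not something stated in the thread, and the helper name
is invented here.

```python
PAGE_SIZE = 4096  # assumption: standard 4 KiB pages on x86_64


def describe_mapping(length, page_size=PAGE_SIZE):
    """Return (whole pages, leftover bytes, size in MiB) for a mapping length."""
    pages, leftover = divmod(length, page_size)
    return pages, leftover, length / 2**20


if __name__ == "__main__":
    length = 1434435584  # second argument of the mmap() call in the strace
    pages, leftover, mib = describe_mapping(length)
    print(f"{pages} pages, {leftover} leftover bytes, {mib:.1f} MiB, {hex(length)}")
```

Under that assumption the request is exactly 350204 pages (0x557fc000 bytes), about
1.34 GiB in a single anonymous mapping.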

2007-09-11 16:04:46

by Al Viro

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On Tue, Sep 11, 2007 at 05:54:38PM +0200, Ulrich Windl wrote:

> If not, any clues on debugging/tracing? There's a
> /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".

That would be because it has fsck-all to do with the kernel. Get the
coredump, then use gdb to deal with it.

2007-09-11 16:14:22

by Jan Engelhardt

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On Sep 11 2007 17:54, Ulrich Windl wrote:
>> > Aug 31 15:04:40 kgate1 kernel: powersaved[10102]: segfault at 0000000000000008 rip
>> > 000000000042c17a rsp 00007fffea55de00 error 4
>[...]
>> Segfaults are logged to syslog only on 64-bit kernels.
>>
>> Maybe your slapd/hscan processes are doing bad things that make them
>> dump core without notice on a 32-bit kernel.
>
>A very wild guess: AFAIK SUSE Distributions are XENified recently,

Not only recently..

>I also learned that the error code is only documented for i386 arch (thanks to
>Emacs ediff):
> * error_code:
> * bit 0 == 0 means no page found, 1 means protection fault
> * bit 1 == 0 means read, 1 means write
> * bit 2 == 0 means kernel, 1 means user-mode
>
>So the problem (error 4) looks a bit like a read on a NULL-pointer
>dereference, right? And the "rip" is user space, correct?

rip points to userspace. If you suspect a bad dereference, look at
rax. If it is 0, it is usually obvious what happened.
If it is slightly above 0, someone tried an access like foo->bar
where foo == NULL.
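
The "slightly above 0" case can be illustrated with field offsets. In the Python
sketch below, ctypes lays out a hypothetical struct with two pointer-sized members
(the names Foo, first and bar are invented for this example); the offset of bar is
the address that foo->bar would touch if foo were NULL.

```python
import ctypes


class Foo(ctypes.Structure):
    # Hypothetical layout: two pointer-sized members, purely for illustration.
    _fields_ = [("first", ctypes.c_void_p), ("bar", ctypes.c_void_p)]


def null_member_address(struct_type, member):
    """Address accessed by ((struct_type *)NULL)->member: base 0 plus field offset."""
    return 0 + getattr(struct_type, member).offset


if __name__ == "__main__":
    print(hex(null_member_address(Foo, "bar")))
```

On a platform with 8-byte pointers this prints 0x8, the same value as the powersaved
fault address in the log. That match alone does not identify the structure actually
involved.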

2007-09-12 06:36:50

by Ulrich Windl

Subject: Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

On 11 Sep 2007 at 17:04, Al Viro wrote:

> On Tue, Sep 11, 2007 at 05:54:38PM +0200, Ulrich Windl wrote:
>
> > If not, any clues on debugging/tracing? There's a
> > /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".
>
> That would be because it has fsck-all to do with the kernel. Get the
> coredump, then use gdb to deal with it.

Ok, but why is the message there at all? I think in Windows XP the offending code
and the registers are shown on such occasions. I'd say either drop the message or
improve it. It's also difficult to find the code after the program is gone, because
of the way shared libraries are mapped. I did manage to get a core dump of the
application, however, and I modified some code. I'll report once I have results.

Maybe it's "mea culpa" for my program, but powersaved and slapd are still to be
examined.

Regards,
Ulrich