2005-11-23 14:37:05

by Fabio Coatti

Subject: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

Hi all,
I'm seeing several segfaults on a couple of HP DL585 Dual Opterons, 8 GB RAM
each.

The segfaults are like this:
factorial[17031]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffe287e0 error 4
factorial[17034]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffc6f450 error 4
factorial[17038]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffdbd060 error 4
factorial[17044]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffb48fa0 error 4
factorial[17046]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffc2a7f0 error 4
ld[3997]: segfault at 0000000000000020 rip 00002aaaaad1a525 rsp
00007fffffa8e960 error 4
ld[4234]: segfault at 0000000000000020 rip 00002aaaaad1a525 rsp
00007fffffc3a1e0 error 4

This is only an example; sed also often segfaults during a "make" (!).
I've seen this with 2.6.12, 2.6.13.4, 2.6.14.2

distro: gentoo
gcc (GCC) 3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0, pie-8.7.8)

I've attached config.gz; I use the Discontiguous Memory model, since Sparse
Memory simply won't boot.

I've googled a bit for help, but I've found only reports of problems like
mine and, unfortunately, no hints.

Can someone give me some help?

Of course, more details are available.


--
Fabio "Cova" Coatti http://members.ferrara.linux.it/cova
Ferrara Linux Users Group http://ferrara.linux.it
GnuPG fp:9765 A5B6 6843 17BC A646 BE8C FA56 373A 5374 C703
Old SysOps never die... they simply forget their password.


Attachments:
(No filename) (1.50 kB)
config.gz (6.54 kB)

2005-11-23 14:41:08

by Arjan van de Ven

Subject: Re: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

On Wed, 2005-11-23 at 15:37 +0100, Fabio Coatti wrote:
> Hi all,
> I'm seeing several segfaults on a couple of HP DL585 Dual Opterons, 8 GB RAM
> each.


are you using the gentoo buildstuff for this? eg libjail or whatever
it's called?

2005-11-23 15:48:54

by Fabio Coatti

Subject: Re: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

At 15:41 on Wednesday 23 November 2005, Arjan van de Ven wrote:
> On Wed, 2005-11-23 at 15:37 +0100, Fabio Coatti wrote:
> > Hi all,
> > I'm seeing several segfaults on a couple of HP DL585 Dual Opterons, 8 GB
> > RAM each.
>
> are you using the gentoo buildstuff for this? eg libjail or whatever
> it's called?

ldd /usr/bin/sed
libc.so.6 => /lib/libc.so.6 (0x00002aaaaabc1000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

The kernel is compiled in the usual way, with vanilla sources, no distro patches.


2005-11-23 22:56:06

by Andrew Walrond

Subject: Re: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

On Wednesday 23 November 2005 14:37, Fabio Coatti wrote:
> Hi all,
> I'm seeing several segfaults on a couple of HP DL585 Dual Opterons, 8 GB RAM
> each.
>
> The segfaults are like this:
> factorial[17031]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffe287e0 error 4
> factorial[17034]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffc6f450 error 4
> factorial[17038]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffdbd060 error 4
> factorial[17044]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffb48fa0 error 4
> factorial[17046]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffc2a7f0 error 4
> ld[3997]: segfault at 0000000000000020 rip 00002aaaaad1a525 rsp
> 00007fffffa8e960 error 4
> ld[4234]: segfault at 0000000000000020 rip 00002aaaaad1a525 rsp
> 00007fffffc3a1e0 error 4
>
> This is only an example; sed also often segfaults during a "make" (!).
> I've seen this with 2.6.12, 2.6.13.4, 2.6.14.2
>

The symptoms look just like the TLB flush filter erratum, which affected SMP
x86_64 kernels up to (at least) 2.6.13.4. IIRC it was fixed for 2.6.14 (at
least I stopped using the patch after 2.6.13.4).

Are you sure you saw this with 2.6.14+?

Andrew Walrond

2005-11-23 23:26:46

by Fabio Coatti

Subject: Re: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

At 23:55 on Wednesday 23 November 2005, Andrew Walrond wrote:
> On Wednesday 23 November 2005 14:37, Fabio Coatti wrote:
> > Hi all,
> > I'm seeing several segfaults on a couple of HP DL585 Dual Opterons, 8 GB
> > RAM each.
> >
> > The segfaults are like this:
> > factorial[17031]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffe287e0 error 4
> > factorial[17034]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffc6f450 error 4
> > factorial[17038]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffdbd060 error 4
> > factorial[17044]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffb48fa0 error 4
> > factorial[17046]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffc2a7f0 error 4
> > ld[3997]: segfault at 0000000000000020 rip 00002aaaaad1a525 rsp
> > 00007fffffa8e960 error 4
> > ld[4234]: segfault at 0000000000000020 rip 00002aaaaad1a525 rsp
> > 00007fffffc3a1e0 error 4
> >
> > This is only an example; sed also often segfaults during a "make" (!).
> > I've seen this with 2.6.12, 2.6.13.4, 2.6.14.2
>
> The symptoms look just like the TLB flush filter erratum, which affected SMP
> x86_64 kernels up to (at least) 2.6.13.4. IIRC it was fixed for 2.6.14 (at
> least I stopped using the patch after 2.6.13.4).
>
> Are you sure you saw this with 2.6.14+?


yes, uname says 2.6.14.2; on a second identical machine, I've just seen this:


factorial[2352]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffbfaf60 error 4
factorial[2354]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffe3fc70 error 4
factorial[2361]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffb07c50 error 4
factorial[2358]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffb07c50 error 4
factorial[2363]: segfault at 0000000000020f31 rip 00000000004035ae rsp
00007fffffe6d270 error 4

the kernel and HW are the same.

2005-11-24 00:03:55

by Andrea Arcangeli

Subject: Re: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

Hello Fabio,

On Thu, Nov 24, 2005 at 12:26:41AM +0100, Fabio Coatti wrote:
> yes, uname says 2.6.14.2; on a second identical machine, I've just seen this:
>
>
> factorial[2352]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffbfaf60 error 4
> factorial[2354]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffe3fc70 error 4
> factorial[2361]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffb07c50 error 4
> factorial[2358]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffb07c50 error 4
> factorial[2363]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> 00007fffffe6d270 error 4
>
> the kernel and HW are the same.

Error 4 means a userland read of an unmapped area.
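
As a rough illustration of how that number is read: on x86 the value printed after "error" is the hardware page-fault error code, which is a small bit field. The sketch below decodes only its three lowest bits; the function name `decode_fault_error` is made up for this example and is not part of any kernel or library interface.

```python
def decode_fault_error(code: int) -> dict:
    """Decode the three lowest bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = fault raised in kernel mode, 1 = in user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    # The value reported for every segfault quoted in this thread.
    print(decode_fault_error(4))
```

With code 4 only bit 2 is set, which matches the reading above: a user-mode read of a page that is not present.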

The above isn't necessarily a kernel or hardware problem; it looks like
a userland bug if it segfaults at such a low address (0x20f31). Nothing is
mapped below 0x400000, precisely to catch this kind of bug.
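
A minimal sketch of that check, assuming the log format quoted in this thread (it may differ on other kernel versions): parse a segfault line and test whether the faulting address lies below 0x400000. The names `parse_segfault`, `is_low_address` and `LOWEST_MAPPED` are hypothetical helpers invented for this illustration.

```python
import re

# Lowest mapped address mentioned in the message above.
LOWEST_MAPPED = 0x400000

# Matches the segfault lines as they appear in this thread.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def parse_segfault(line: str) -> dict:
    """Parse one segfault log entry, tolerating mail line wrapping."""
    normalized = " ".join(line.split())  # rejoin entries wrapped over two lines
    m = SEGFAULT_RE.search(normalized)
    if m is None:
        raise ValueError("not a segfault line")
    return {
        "prog": m["prog"],
        "pid": int(m["pid"]),
        "addr": int(m["addr"], 16),
        "rip": int(m["rip"], 16),
        "rsp": int(m["rsp"], 16),
        "error": int(m["err"]),
    }


def is_low_address(fault: dict) -> bool:
    """True if the faulting address is below the lowest mapped address."""
    return fault["addr"] < LOWEST_MAPPED


if __name__ == "__main__":
    entry = ("factorial[2352]: segfault at 0000000000020f31 rip 00000000004035ae rsp\n"
             "00007fffffbfaf60 error 4")
    fault = parse_segfault(entry)
    print(hex(fault["addr"]), hex(fault["rip"]), is_low_address(fault))
```

Both the factorial entries (address 0x20f31) and the ld entries (address 0x20) fall below that boundary.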

You should debug the program and check what code is at address
0x4035ae. You can check it with gdb or objdump -d. There's probably a
64-bit bug in the program that doesn't trigger on 32-bit x86 (or you may
not be noticing the segfault on 32 bits because it wouldn't be logged in
the syslog).

Hope this helps ;)

2005-11-24 08:07:36

by Andrew Walrond

Subject: Re: Dual opteron various segfaults with 2.6.14.2 and earlier kernels

On Thursday 24 November 2005 00:03, Andrea Arcangeli wrote:
> Hello Fabio,
>
> On Thu, Nov 24, 2005 at 12:26:41AM +0100, Fabio Coatti wrote:
> > yes, uname says 2.6.14.2; on a second identical machine, I've just seen
> > this:
> >
> >
> > factorial[2352]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffbfaf60 error 4
> > factorial[2354]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffe3fc70 error 4
> > factorial[2361]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffb07c50 error 4
> > factorial[2358]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffb07c50 error 4
> > factorial[2363]: segfault at 0000000000020f31 rip 00000000004035ae rsp
> > 00007fffffe6d270 error 4
> >
> > the kernel and HW are the same.
>
> Error 4 means a userland read of an unmapped area.
>
> The above isn't necessarily a kernel or hardware problem; it looks like
> a userland bug if it segfaults at such a low address (0x20f31). Nothing is
> mapped below 0x400000, precisely to catch this kind of bug.

Which makes sense; the sed failures seen during 'make' runs were probably a
result of the TLB flush filter erratum on kernels prior to 2.6.14, whereas
the above is a userland bug which occurs with all kernels.

Andrew Walrond