2006-03-26 19:54:29

by Paolo Ornati

Subject: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

I was compiling Qt 4.1.1 on my Gentoo box when I got "internal
compiler error: segmentation fault".

Looking at dmesg:

[ 2398.483016] slab error in slab_destroy_objs(): cache `radix_tree_node': end of a freed object was overwritten
[ 2398.483021]
[ 2398.483022] Call Trace: <ffffffff80159581>{__slab_error+33} <ffffffff8015a8ab>{slab_destroy+168}
[ 2398.483032] <ffffffff8015aa84>{free_block+349} <ffffffff8015abdc>{cache_flusharray+116}
[ 2398.483040] <ffffffff8015a7e5>{kmem_cache_free+84} <ffffffff801e84ef>{radix_tree_delete+509}
[ 2398.483047] <ffffffff80160946>{free_buffer_head+53} <ffffffff801609be>{try_to_free_buffers+118}
[ 2398.483053] <ffffffff80141104>{__remove_from_page_cache+28} <ffffffff8014985f>{shrink_zone+2676}
[ 2398.483061] <ffffffff801596c4>{kmem_freepages+196} <ffffffff8015a908>{slab_destroy+261}
[ 2398.483070] <ffffffff8014a001>{balance_pgdat+571} <ffffffff8014a26b>{kswapd+270}
[ 2398.483077] <ffffffff801361ce>{autoremove_wake_function+0} <ffffffff801361ce>{autoremove_wake_function+0}
[ 2398.483084] <ffffffff8010b1b6>{child_rip+8} <ffffffff8014a15d>{kswapd+0}
[ 2398.483090] <ffffffff8010b1ae>{child_rip+0}
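
For readers who want to post-process a trace like the one above, here is a minimal sketch that pulls the `<address>{symbol+offset}` frames out of the log text. The function name `parse_frames` is hypothetical, and the sketch assumes the offsets are printed in decimal, which is not confirmed in this thread.

```python
import re

# One frame looks like <ffffffff80159581>{__slab_error+33}.
# Assumption: the offset after '+' is decimal.
FRAME = re.compile(r"<([0-9a-f]{16})>\{(\w+)\+(\d+)\}")


def parse_frames(text):
    """Return (address, symbol, offset) tuples in the order they appear."""
    return [(int(addr, 16), sym, int(off)) for addr, sym, off in FRAME.findall(text)]


sample = (
    "[ 2398.483022] Call Trace: <ffffffff80159581>{__slab_error+33} "
    "<ffffffff8015a8ab>{slab_destroy+168}"
)

for addr, sym, off in parse_frames(sample):
    # addr - off is where the function would start under the decimal assumption.
    print(f"{sym}+{off} (function start {addr - off:#x})")
```

Feeding the whole dmesg excerpt to `parse_frames` gives the frames in the order they were logged, which makes it easier to compare traces from different boots.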


Stripped config, dmesg, meminfo and slabinfo (grabbed shortly after the
gcc segfault) are attached.

PS: I've got another "gcc segfault" trying to build Qt again after a
reboot but I don't think this is a memory problem (actually I have
a memory problem (single bit error) but it should be cured with
memmap=1K$214014K ;).
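
As an aside on the boot option quoted above: `memmap=nn$ss` marks the region from `ss` to `ss+nn` as reserved, so the kernel does not use it. The sketch below only decodes the byte range that `memmap=1K$214014K` covers. The helper names are hypothetical, and the parser is a simplification that accepts decimal numbers with an optional K, M or G suffix, not everything the kernel accepts.

```python
import re

# Binary multipliers for the size suffixes handled by this simplified parser.
UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a string such as '214014K' to bytes (decimal digits only)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def reserved_range(option):
    """Return (start, end) in bytes for a 'memmap=nn$ss' option; end is exclusive."""
    value = option.split("=", 1)[1]
    size, start = value.split("$", 1)
    start_bytes = parse_size(start)
    return start_bytes, start_bytes + parse_size(size)


start, end = reserved_range("memmap=1K$214014K")
print(f"reserved: {start:#x}-{end:#x} ({end - start} bytes)")
```

The reserved window is 1 KiB long and starts 214014 KiB into physical memory, which is how a single bad address can be fenced off without removing the module.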

This time there's nothing strange in dmesg, so maybe the gcc segfaults
have nothing to do with the slab error.

--
Paolo Ornati
Linux 2.6.16 on x86_64


Attachments:
(No filename) (1.58 kB)
dmesg (19.18 kB)
meminfo (604.00 B)
slabinfo (23.56 kB)
stripped_config (5.96 kB)

2006-03-28 07:57:05

by Paolo Ornati

Subject: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

On Sun, 26 Mar 2006 21:53:46 +0200
Paolo Ornati <[email protected]> wrote:

> PS: I've got another "gcc segfault" trying to build Qt again after a
> reboot but I don't think this is a memory problem (actually I have
> a memory problem (single bit error) but it should be cured with
> memmap=1K$214014K ;).

I've got other NON-reproducible gcc segfaults, usually while compiling
some huge CPP source.

Now I'm back to 2.6.15.6 and I'm stress testing GCC, no segfaults so
far.

Doing a git-bisect is maybe possible... but it will take ages since I
don't have a test case :(!
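
To put a rough number on "ages": a bisection needs about log2(N) test runs for N candidate commits. The sketch below is only a back-of-the-envelope estimate; the commit count and the time per run are hypothetical placeholders, not figures from this thread.

```python
import math


def bisect_steps(n_commits):
    """Approximate number of kernels to build and test when bisecting n_commits."""
    if n_commits < 2:
        return 0
    return math.ceil(math.log2(n_commits))


def bisect_hours(n_commits, minutes_per_test):
    """Total testing time in hours, ignoring build time and repeated runs."""
    return bisect_steps(n_commits) * minutes_per_test / 60


# Hypothetical inputs: 5000 commits in the range, one hour per test run.
print(bisect_steps(5000), "steps,", bisect_hours(5000, 60), "hours")
```

With a failure that does not trigger on every run, each "good" verdict needs several passes before it can be trusted, so the real cost is a multiple of this estimate.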

Additionally there's the slab error (seen just one time)...


If anyone has an idea like try-to-revert-this-patch, let me know.

--
Paolo Ornati
Linux 2.6.15.6 on x86_64

2006-03-28 08:41:44

by Andrew Morton

Subject: Re: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

Paolo Ornati <[email protected]> wrote:
>
> On Sun, 26 Mar 2006 21:53:46 +0200
> Paolo Ornati <[email protected]> wrote:
>
> > PS: I've got another "gcc segfault" trying to build Qt again after a
> > reboot but I don't think this is a memory problem (actually I have
> > a memory problem (single bit error) but it should be cured with
> > memmap=1K$214014K ;).
>
> I've got other NON-reproducible gcc segfaults, usually while compiling
> some huge CPP source.

If those errors had no corresponding kernel messages then what you have is
a classic symptom of failing memory hardware. Suggest you grab memtest86,
run it for 24 hours.

2006-03-28 09:22:45

by Paolo Ornati

Subject: Re: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

On Tue, 28 Mar 2006 00:41:37 -0800
Andrew Morton <[email protected]> wrote:

> If those errors had no corresponding kernel messages then what you have is
> a classic symptom of failing memory hardware. Suggest you grab memtest86,
> run it for 24 hours.

I've already run memtest86+ for hours (not 24 ok... "only" 4/5h) and I
found this:

An easily reproducible memory failure (single bit flipping always at
the same address) <---- this one goes AWAY when I disable bank
interleaving in the BIOS.

Another memory failure (different address, always one bit flipping)
isn't found by memtest86+: I found it with CONFIG_DEBUG_SLAB and
I "fixed" it with memmap=... boot option.
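
A single-bit failure of the kind described above shows up as exactly one set bit when the expected pattern is XORed with the value read back. The sketch below illustrates that check; the function names and the sample values are hypothetical and are not taken from the actual memtest86+ or CONFIG_DEBUG_SLAB reports.

```python
def flipped_bits(expected, actual):
    """Return the positions (0 = least significant) where the two values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def is_single_bit_error(expected, actual):
    """True when the values differ in exactly one bit."""
    return len(flipped_bits(expected, actual)) == 1


# Hypothetical example: a byte written as 0x6b is read back as 0x6f.
written, read_back = 0x6B, 0x6F
print(flipped_bits(written, read_back), is_single_bit_error(written, read_back))
```

If the same bit position flips at the same address on every run, a defective cell is a more likely explanation than random software corruption.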


Now, these 2 problems are both in my first 256MB memory module, so maybe
it is really another memory failure.

BUT now that I'm back on 2.6.15.6 I'm compiling a LOT of big CPP
projects and I haven't seen a single GCC segfault yet.

Maybe I should retry with 2.6.16 and if I can reproduce the problem I
can start testing 2.6.16-rc1 and so on...

--
Paolo Ornati
Linux 2.6.15.6 on x86_64

2006-03-28 11:47:10

by Paolo Ornati

Subject: Re: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

On Tue, 28 Mar 2006 11:22:41 +0200
Paolo Ornati <[email protected]> wrote:

> Maybe I should retry with 2.6.16 and if I can reproduce the problem I
> can start testing 2.6.16-rc1 and so on...

Reproduced with 2.6.16...

"TESTCASE" (I'm on Gentoo)
ebuild /usr/portage/x11-libs/qt/qt-4.1.1.ebuild clean
ebuild /usr/portage/x11-libs/qt/qt-4.1.1.ebuild compile


# time ./TESTCASE

...

g++ -c -m64 -pipe -march=athlon64 -O2 -pipe -Wall -W -D_REENTRANT -DQT_KEYWORDS -DQT_NO_DEBUG -DQT_XML_LIB -DQT_GUI_LIB -DQT_NETWORK_LIB -DQT_CORE_LIB -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -DQT_SHARED -I../../mkspecs/linux-g++-64 -I. -I../../include/QtCore -I../../include/QtNetwork -I../../include/QtGui -I../../include/QtXml -I../../include -I.moc/release-shared -I. -o .obj/release-shared/docuparser.o docuparser.cpp
/var/tmp/portage/qt-4.1.1/work/qt-x11-opensource-src-4.1.1/bin/uic settingsdialog.ui -o ui_settingsdialog.h
g++ -c -m64 -pipe -march=athlon64 -O2 -pipe -Wall -W -D_REENTRANT -DQT_KEYWORDS -DQT_NO_DEBUG -DQT_XML_LIB -DQT_GUI_LIB -DQT_NETWORK_LIB -DQT_CORE_LIB -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -DQT_SHARED -I../../mkspecs/linux-g++-64 -I. -I../../include/QtCore -I../../include/QtNetwork -I../../include/QtGui -I../../include/QtXml -I../../include -I.moc/release-shared -I. -o .obj/release-shared/index.o index.cpp
docuparser.cpp: In member function `virtual bool DocuParser310::startElement(const QString&, const QString&, const QString&, const QXmlAttributes&)':
docuparser.cpp:166: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugs.gentoo.org/> for instructions.
g++ -c -m64 -pipe -march=athlon64 -O2 -pipe -Wall -W -D_REENTRANT -DQT_KEYWORDS -DQT_NO_DEBUG -DQT_XML_LIB -DQT_GUI_LIB -DQT_NETWORK_LIB -DQT_CORE_LIB -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -DQT_SHARED -I../../mkspecs/linux-g++-64 -I. -I../../include/QtCore -I../../include/QtNetwork -I../../include/QtGui -I../../include/QtXml -I../../include -I.moc/release-shared -I. -o .obj/release-shared/profile.o profile.cpp
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [.obj/release-shared/docuparser.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/var/tmp/portage/qt-4.1.1/work/qt-x11-opensource-src-4.1.1/tools/assistant'
make[1]: *** [sub-assistant-all-ordered] Error 2
make[1]: Leaving directory `/var/tmp/portage/qt-4.1.1/work/qt-x11-opensource-src-4.1.1/tools'
make: *** [sub-tools-all-ordered] Error 2

...

real 49m35.866s
user 29m19.670s
sys 16m7.792s
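
When a test case like this is run repeatedly, it helps to scan the saved build output for the failure line instead of reading it by eye. Here is a minimal sketch that looks for gcc's "internal compiler error" message in a log; the function name `find_ices` is hypothetical, and the sample text is shortened from the output above.

```python
import re

# Matches lines such as: docuparser.cpp:166: internal compiler error: Segmentation fault
ICE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+): internal compiler error: (?P<what>.*)$"
)


def find_ices(log_text):
    """Return (file, line, message) for every internal compiler error in the log."""
    hits = []
    for raw in log_text.splitlines():
        match = ICE.match(raw.strip())
        if match:
            hits.append((match.group("file"), int(match.group("line")), match.group("what")))
    return hits


sample_log = """\
docuparser.cpp: In member function ...
docuparser.cpp:166: internal compiler error: Segmentation fault
Please submit a full bug report,
"""

print(find_ices(sample_log))
```

Recording which file and line failed on each run shows whether the crash moves around, which is what one would expect from a hardware fault rather than a compiler bug.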


I'm going to build/test 2.6.16-rc1.

--
Paolo Ornati
Linux 2.6.16 on x86_64

2006-03-28 13:54:10

by Pavel Machek

Subject: Re: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

Hi!

> On Tue, 28 Mar 2006 00:41:37 -0800
> Andrew Morton <[email protected]> wrote:
>
> > If those errors had no corresponding kernel messages then what you have is
> > a classic symptom of failing memory hardware. Suggest you grab memtest86,
> > run it for 24 hours.
>
> I've already run memtest86+ for hours (not 24 ok... "only" 4/5h) and I
> found this:
>
> An easily reproducible memory failure (single bit flipping always at
> the same address) <---- this one goes AWAY when I disable bank
> interleaving in the BIOS.
>
> Another memory failure (different address, always one bit flipping)
> isn't found by memtest86+: I found it with CONFIG_DEBUG_SLAB and
> I "fixed" it with memmap=... boot option.
>
>
> Now, these 2 problems are both in my first 256MB memory module, so maybe
> it is really another memory failure.
>
> BUT now that I'm back on 2.6.15.6 I'm compiling a LOT of big CPP
> projects and I haven't seen a single GCC segfault yet.
>
> Maybe I should retry with 2.6.16 and if I can reproduce the problem I
> can start testing 2.6.16-rc1 and so on...

I'd really get new RAM... If the machine is "known bad", debugging on
it is likely a waste of time.
Pavel
--
Picture of sleeping (Linux) penguin wanted...

2006-03-28 14:29:21

by Paolo Ornati

Subject: Re: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

On Tue, 28 Mar 2006 15:23:46 +0200
Pavel Machek <[email protected]> wrote:

> I'd really get new RAM... If the machine is "known bad", debugging on
> it is likely a waste of time.

I know.

The fact is that when I was having memory problems I also had
filesystem corruption associated with them.

After fixing the first problem (easily reproducible) the filesystem
corruption became rarer.

After fixing the second problem (address detected by DEBUG_SLAB) I have
not seen a single filesystem corruption.

Additionally I have tested 2.6.16-rc1 (found BAD after 20 min) and now
I'm re-testing with 2.6.15.6 --> it has been compiling for some hours
without a single segfault.

So, I think it could be:

1) a memory problem exposed by the different behaviour of the kernel

2) a kernel BUG somewhere between 2.6.15 / 2.6.16.

Maybe, before using git-bisect, I can simply try to reproduce the
problem using only the first memory module (the bad one) and then try
with only the second one (good).

This should reveal if it is a memory problem or not (or maybe the
combination of GCC eating a lot of memory AND only 256MB of RAM instead
of 512MB will make the system swap a lot resulting in less memory stress
and thus make me unable to reproduce the problem ;)

--
Paolo Ornati
Linux 2.6.15.6 on x86_64

2006-03-28 14:37:43

by Paolo Ornati

Subject: Re: Random GCC segfaults -- Was: [2.6.16] slab error in slab_destroy_objs(): cache `radix_tree_node'...

On Tue, 28 Mar 2006 16:30:27 +0200
Paolo Ornati <[email protected]> wrote:

> I'm re-testing with 2.6.15.6 --> it has been compiling for some hours
> without a single segfault.

Maybe I exaggerated here; it has finished now:

real 106m35.548s
user 56m11.111s
sys 33m48.371s

No problems as expected.

--
Paolo Ornati
Linux 2.6.15.6 on x86_64

2006-03-31 14:57:18

by Paolo Ornati

Subject: Re: Random GCC segfaults --> Just Bad Memory

On Tue, 28 Mar 2006 16:30:27 +0200
Paolo Ornati <[email protected]> wrote:

> Additionally I have tested 2.6.16-rc1 (found BAD after 20 min) and now
> I'm re-testing with 2.6.15.6 --> it has been compiling for some hours
> without a single segfault.
>
> So, I think it could be:
>
> 1) a memory problem exposed by the different behaviour of the kernel
>
> 2) a kernel BUG somewhere between 2.6.15 / 2.6.16.

After replacing the problematic memory module I'm unable to reproduce
the segfaults... so 2.6.16 is OK :)

--
Paolo Ornati
Linux 2.6.16.1 on x86_64