2008-08-24 22:30:17

by Sitsofe Wheeler

Subject: Reproducible rRootage segfault with 2.6.25 and above (regression?)

I've found that running certain levels in the game rRootage on kernels
later than 2.6.24 causes a segfault. The segfault does not occur on
2.6.24 (and below), though...

Frustratingly, I have been unable to bisect my way to the kernel change
because I hit a USB timeout issue while bisecting between 2.6.24 and
2.6.25 which made booting impossible. Further, it seems there are a
number of conditions that need to be met before the problem manifests
itself:

1. Compiler optimisation used to compile rRootage must be -O1 or higher
(-Os also triggers the problem)
2. The running kernel (going by release) must be 2.6.25 or later.
3. The gcc used to compile the game must (seemingly) not be 3.3 (4.2
shows the problem; other versions may show it too).
4. Not every level in every mode will show the problem (it seems linked
to certain patterns). I have found level 9A in the green "GigaWing" mode
is usually quick to trigger the issue, but you may have to kill the first
enemy once to see the problem (if you can get to even that part, it is
likely the problem is not present).

I have seen the issue on a range of 2.6.25+ kernels (both hand-compiled
kernels on openSUSE and a pre-shipped 2.6.26-5 from Ubuntu 8.10).

The segfault in question is due to an array being accessed beyond its
bounds (the array sctbl on this line
http://www.koders.com/cpp/fid93F842B399CA68D754CADEC374AE934EED72C07D.aspx#L246
). Running the game under valgrind on a 2.6.24 kernel did not generate
any warnings about that array (using MALLOC_CHECK_=2 didn't generate
any warnings either). The problem has been reproduced on two different
machines (a Thinkpad T60 and an eeePC).

Finally, this also afflicts a prebuilt binary from 2004 (which probably
wasn't built using gcc4.x
http://sourceforge.net/project/showfiles.php?group_id=112441 ).

The issue is fiddly but reproducible. All help in pinpointing the
problem source is appreciated.

--
Sitsofe | http://sucs.org/~sits/


2008-08-25 12:33:43

by Alan

Subject: Re: Reproducible rRootage segfault with 2.6.25 and above (regression?)

> The issue is fiddly but reproducible. All help in pinpointing the
> problem source is appreciated.

For the kernel bisect if you get stuck at a point it fails remember that
point and then lie either yes/no to it working and carry on. If need be
you can go back the other way.

Another completely off the wall guess would be that your client code is
causing gcc to generate something where it is using data which has ended
up below the stack pointer and the timings have changed. Either through
gcc bug or passing around the address of an object that is out of
context. At that point a signal will rewrite the data in fun ways
producing results like you describe.
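
A minimal sketch of the second possibility (the address of an object
that has gone out of scope). The struct and function names are
hypothetical and are not taken from rRootage. The buggy variant is left
commented out because using its result is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type, for illustration only. */
struct vec2 { int x, y; };

/*
 * Buggy pattern: the returned pointer refers to a local variable.
 * Once the function returns, that stack slot sits below the stack
 * pointer and may be overwritten by the next call or by a signal
 * frame, so whether the stale data survives depends on timing.
 *
 * struct vec2 *make_vec_bad(int x, int y)
 * {
 *     struct vec2 v = { x, y };
 *     return &v;
 * }
 */

/* Safe pattern: the caller owns the storage, so it stays live. */
void make_vec(struct vec2 *out, int x, int y)
{
    assert(out != NULL);
    out->x = x;
    out->y = y;
}
```

With the safe variant the caller declares a struct vec2 of its own and
passes its address, so nothing refers to a dead stack frame.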

Alan

2008-08-25 20:32:01

by Sitsofe Wheeler

Subject: Re: Reproducible rRootage segfault with 2.6.25 and above (solved)



Alan Cox wrote:
> For the kernel bisect if you get stuck at a point it fails remember that
> point and then lie either yes/no to it working and carry on. If need be
> you can go back the other way.

I tried this quite a few times before posting (you can always use replay
and edit out the lie), and used gitk to pick commits too, but it seems
like huge swathes of what I was interested in were inside this USB
issue. Eventually I broke down and used a loan laptop that didn't need
to boot from USB. I narrowed the issue down to 10 or so patches (from
8a423ff0c4a0472607bbed6790fdaeec54af2ebb to
0249c9c1e7505c2b020bcc6deaf1e0415de9943e which covers patches that
randomize brk and change vDSO) but after then incorrectly bisecting
further to a single patch, it looks like the segfault was totally
legit...

> Another completely off the wall guess would be that your client code is
> causing gcc to generate something where it is using data which has ended
> up below the stack pointer and the timings have changed. Either through
> gcc bug or passing around the address of an object that is out of
> context. At that point a signal will rewrite the data in fun ways
> producing results like you describe.

After reading this I went back and stuffed a bunch of asserts into the
rRootage code to see what was going on and found what looks like a bug
in rRootage. I guess valgrind can't do array bounds checking - in fact this
is what I get for not reading the FAQ -
http://valgrind.org/docs/manual/faq.html#faq.overruns . A workaround
seems to be to do capping on the value used to index the array -
https://bugs.launchpad.net/ubuntu/+source/rrootage/+bug/261189/comments/4
. I even just tried using mudflap but that brought up so many spurious
warnings (supposedly it doesn't currently do well with C++) it wasn't
helpful.
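
For illustration, here is a rough sketch of the two techniques mentioned
above: an assert that traps an out-of-range index, and a cap that forces
the index back into range. The table name, its size and its contents are
made up for the example; they are not the real sctbl, and this is not
the patch from the Launchpad bug.

```c
#include <assert.h>

/* Hypothetical stand-in for a lookup table such as sctbl. */
#define DEMO_TBL_LEN 4
static const int demo_tbl[DEMO_TBL_LEN] = { 10, 20, 40, 80 };

/* Debugging aid: abort at the faulty access instead of reading
 * whatever happens to sit next to the array. */
int lookup_checked(int idx)
{
    assert(idx >= 0 && idx < DEMO_TBL_LEN);
    return demo_tbl[idx];
}

/* Workaround: cap the index to the valid range before using it. */
int lookup_capped(int idx)
{
    if (idx < 0)
        idx = 0;
    if (idx >= DEMO_TBL_LEN)
        idx = DEMO_TBL_LEN - 1;
    return demo_tbl[idx];
}
```

The assert version is only useful for finding the bad caller (and is
compiled out with -DNDEBUG). The capped version hides the symptom but
does not explain why the index went out of range in the first place.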


--
Sitsofe | http://sucs.org/~sits/