2011-02-01 22:00:18

by Meelis Roos

Subject: 2.6.38-rc3 regression on parisc: segfaults

I have been testing devel kernels on an SMP L1000 successfully up to and
including 2.6.38-rc2-00324-g70d1f36. The testing means booting the new
kernel and running aptitude to update to current Debian unstable.

Now I tried 2.6.38-rc3 and got a crash from aptitude on 2 out of 2
tries. Maybe aptitude was broken in between, but it looks like a kernel
bug. I retried 2.6.38-rc2-00324-g70d1f36 and that seemed to work fine, so
it's more likely a kernel problem.

What additional information can I provide?

[ 74.590000]
[ 74.590000] do_page_fault() pid=979 command='aptitude' type=15 address=0x0000002d
[ 74.590000]
[ 74.590000] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[ 74.590000] PSW: 00000000000001001111111100001111 Not tainted
[ 74.590000] r00-03 000000ff0004ff0f 000000004027b5ac 00000000405df23b 000000004067e884
[ 74.590000] r04-07 000000004067c860 000000004067e6d0 000000004067e880 00000000c014b7d0
[ 74.590000] r08-11 0000000000000001 0000000000000001 000000004067c860 0000000041b082c8
[ 74.590000] r12-15 000000004067e730 000000004067e6d0 000000004067c860 000000004067c860
[ 74.590000] r16-19 000000004067c860 000000004067e060 0000000000000000 000000004067c860
[ 74.590000] r20-23 0000000000000229 0000000000000000 0000000000000000 0000000000000000
[ 74.590000] r24-27 fffffffffffffff5 ffffffffffffffd3 000000004067e730 00000000004227a4
[ 74.590000] r28-31 000000000000002d 0000000000000000 00000000c014b8c0 00000000402688db
[ 74.590000] sr00-03 0000000000228800 0000000000228800 0000000000000000 0000000000228800
[ 74.590000] sr04-07 0000000000228800 0000000000228800 0000000000228800 0000000000228800
[ 74.590000]
[ 74.590000] VZOUICununcqcqcqcqcqcrmunTDVZOUI
[ 74.590000] FPSR: 00001000001000100010000000000000
[ 74.590000] FPER1: 00000000
[ 74.590000] fr00-03 0822200000000000 0000000000000000 0000000000000000 0000000000000000
[ 74.590000] fr04-07 0000000a00000000 0000000000000000 0000000000000000 0000000000000000
[ 74.590000] fr08-11 0000000000000000 00000000406cf120 00000000401563e8 00000000404c59d8
[ 74.590000] fr12-15 000000000804000f 000000000800000f 00000000401563e8 00000000ffc60460
[ 74.590000] fr16-19 00000000406cf120 0000000040639d54 0000000000000046 0000000040599294
[ 74.590000] fr20-23 00000000ffc60348 00000000406dd920 0000000000000038 4038000000000000
[ 74.590000] fr24-27 0000000000000000 0000000000000000 3ff0000000000000 412e848c00000000
[ 74.590000] fr28-31 0000000040599250 00000000ffc60357 00000000ffc60357 00000000405dfba8
[ 74.590000]
[ 74.590000] IASQ: 0000000000228800 0000000000228800 IAOQ: 00000000405df25b 00000000405df25f
[ 74.590000] IIR: 0f80108b ISR: 0000000000228800 IOR: 000000000000002d
[ 74.590000] CPU: 0 CR30: 00000000fe050000 CR31: 0000000000008020
[ 74.590000] ORIG_R28: 0000000000000080
[ 74.590000] IAOQ[0]: 00000000405df25b
[ 74.590000] IAOQ[1]: 00000000405df25f
[ 74.590000] RP(r2): 00000000405df23b


--
Meelis Roos ([email protected])


2011-02-01 22:12:44

by James Bottomley

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

On Wed, 2011-02-02 at 00:00 +0200, Meelis Roos wrote:
> I have been testing devel kernels on SMP L1000 successfully until
> 2.6.38-rc2-00324-g70d1f36 included. The testing means booting the new
> kernel and running aptitude to update to current debian unstable.
>
> Now I tried 2.6.38-rc3 and got a crash from aptitude on 2 out of 2
> tries. Maybe aptitude was broken in between but it looks like a kernel
> bug. Retried 2.6.38-rc2-00324-g70d1f36 and that seemed to work fine so
> it's more likely a kernel problem.
>
> What additional information can I provide?

Probably a bisection, if you could. There have been no parisc patches
between -rc2 and -rc3, so it's coming from outside the architecture.

Thanks,

James

2011-02-01 22:16:45

by Carlos O'Donell

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

On Tue, Feb 1, 2011 at 5:00 PM, Meelis Roos <[email protected]> wrote:
> I have been testing devel kernels on SMP L1000 successfully until
> 2.6.38-rc2-00324-g70d1f36 included. The testing means booting the new
> kernel and running aptitude to update to current debian unstable.
>
> Now I tried 2.6.38-rc3 and got a crash from aptitude on 2 out of 2
> tries. Maybe aptitude was broken in between but it looks like a kernel
> bug. Retried 2.6.38-rc2-00324-g70d1f36 and that seemed to work fine so
> it's more likely a kernel problem.
>
> What additional information can I provide?
>
> [ 74.590000]
> [ 74.590000] do_page_fault() pid=979 command='aptitude' type=15 address=0x0000002d
> [ 74.590000]
> [ 74.590000] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> [ 74.590000] PSW: 00000000000001001111111100001111 Not tainted
> [ 74.590000] r00-03 000000ff0004ff0f 000000004027b5ac 00000000405df23b 000000004067e884
> [ 74.590000] r04-07 000000004067c860 000000004067e6d0 000000004067e880 00000000c014b7d0
> [ 74.590000] r08-11 0000000000000001 0000000000000001 000000004067c860 0000000041b082c8
> [ 74.590000] r12-15 000000004067e730 000000004067e6d0 000000004067c860 000000004067c860
> [ 74.590000] r16-19 000000004067c860 000000004067e060 0000000000000000 000000004067c860
> [ 74.590000] r20-23 0000000000000229 0000000000000000 0000000000000000 0000000000000000
> [ 74.590000] r24-27 fffffffffffffff5 ffffffffffffffd3 000000004067e730 00000000004227a4
> [ 74.590000] r28-31 000000000000002d 0000000000000000 00000000c014b8c0 00000000402688db
> [ 74.590000] sr00-03 0000000000228800 0000000000228800 0000000000000000 0000000000228800
> [ 74.590000] sr04-07 0000000000228800 0000000000228800 0000000000228800 0000000000228800
> [ 74.590000]
> [ 74.590000] VZOUICununcqcqcqcqcqcrmunTDVZOUI
> [ 74.590000] FPSR: 00001000001000100010000000000000
> [ 74.590000] FPER1: 00000000
> [ 74.590000] fr00-03 0822200000000000 0000000000000000 0000000000000000 0000000000000000
> [ 74.590000] fr04-07 0000000a00000000 0000000000000000 0000000000000000 0000000000000000
> [ 74.590000] fr08-11 0000000000000000 00000000406cf120 00000000401563e8 00000000404c59d8
> [ 74.590000] fr12-15 000000000804000f 000000000800000f 00000000401563e8 00000000ffc60460
> [ 74.590000] fr16-19 00000000406cf120 0000000040639d54 0000000000000046 0000000040599294
> [ 74.590000] fr20-23 00000000ffc60348 00000000406dd920 0000000000000038 4038000000000000
> [ 74.590000] fr24-27 0000000000000000 0000000000000000 3ff0000000000000 412e848c00000000
> [ 74.590000] fr28-31 0000000040599250 00000000ffc60357 00000000ffc60357 00000000405dfba8
> [ 74.590000]
> [ 74.590000] IASQ: 0000000000228800 0000000000228800 IAOQ: 00000000405df25b 00000000405df25f
> [ 74.590000] IIR: 0f80108b ISR: 0000000000228800 IOR: 000000000000002d
> [ 74.590000] CPU: 0 CR30: 00000000fe050000 CR31: 0000000000008020
> [ 74.590000] ORIG_R28: 0000000000000080
> [ 74.590000] IAOQ[0]: 00000000405df25b
> [ 74.590000] IAOQ[1]: 00000000405df25f
> [ 74.590000] RP(r2): 00000000405df23b

The rp (return pointer) is pointing back into what appears to be a
shared library (always loaded around 0x4???????).

The iir (interrupting instruction register) is instruction "0: 0f 80
10 8b ldw 0(ret0),r11" (you can do this yourself with "disasm"
from http://cvs.parisc-linux.org/build-tools/disasm?revision=1.1&view=markup).

You can see that ret0 is indeed 0x2d (the address of the fault), and
loading 0x0 + 0x2d will cause a fault and kill your program.
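
The decode described above can also be checked by hand. What follows is a minimal Python sketch, not the disasm tool itself. It assumes the PA-RISC short-displacement load layout as I read it from the architecture manuals (6-bit major opcode, 5-bit base register, 5-bit low-sign-extended displacement, 4-bit extension field, 5-bit target register). The function names and dictionary keys are hypothetical and chosen only for illustration.

```python
# Sketch: split a PA-RISC instruction word into the fields of a
# short-displacement load (major opcode 0x03).
# Assumption: the field layout below is my reading of the PA-RISC
# architecture manuals; the names are hypothetical.

def low_sign_ext(value, bits):
    """Decode a low-sign-extended field: the sign lives in the lowest bit."""
    magnitude = value >> 1
    if value & 1:
        magnitude -= 1 << (bits - 1)
    return magnitude


def decode_short_load(word):
    """Return the fields of a short-displacement load instruction word."""
    return {
        "opcode": (word >> 26) & 0x3F,                 # 0x03 for this load/store group
        "base": (word >> 21) & 0x1F,                   # base register number
        "im5": low_sign_ext((word >> 16) & 0x1F, 5),   # displacement
        "short": (word >> 12) & 1,                     # 1 = immediate displacement form
        "ext4": (word >> 6) & 0xF,                     # 2 selects a word load (ldw)
        "target": word & 0x1F,                         # destination register number
    }


fields = decode_short_load(0x0F80108B)   # IIR value from the register dump
print(fields)
```

For the IIR value in the dump this gives base register 28 (ret0), displacement 0 and target register 11, which agrees with "ldw 0(ret0),r11". The dump shows r28 holding 0x2d, the faulting address.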

However, the failure probably happened earlier.

As James says, you should try to bisect exactly which commit caused the failure.

Cheers,
Carlos.

2011-02-03 02:32:10

by John David Anglin

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

> I have been testing devel kernels on SMP L1000 successfully until
> 2.6.38-rc2-00324-g70d1f36 included. The testing means booting the new
> kernel and running aptitude to update to current debian unstable.
>
> Now I tried 2.6.38-rc3 and got a crash from aptitude on 2 out of 2
> tries. Maybe aptitude was broken in between but it looks like a kernel
> bug. Retried 2.6.38-rc2-00324-g70d1f36 and that seemed to work fine so
> it's more likely a kernel problem.

If aptitude fails consistently, it should be possible to debug or
isolate to a particular kernel change. Usually, SMP segvs don't
provide much information as to the cause of the problem. strace
output and a gdb backtrace would be useful.

I have seen improved SMP stability building with GCC 4.5.3 (try a
recent snap). This fixes an asm/branch problem. It seems like James'
flush patch hasn't been pulled.

Dave
--
J. David Anglin [email protected]
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)

2011-02-03 07:03:23

by Meelis Roos

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

> If aptitude fails consistently, it should be possible to debug or
> isolate to a particular kernel change. Usually, SMP segvs don't
> provide much information as to the cause of the problem. strace
> output and a gdb backtrace would be useful.

It's not failing consistently - it fails in different places. I'm
bisecting now.

--
Meelis Roos ([email protected])

2011-02-03 22:36:36

by Meelis Roos

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

> > I have been testing devel kernels on SMP L1000 successfully until
> > 2.6.38-rc2-00324-g70d1f36 included. The testing means booting the new
> > kernel and running aptitude to update to current debian unstable.
> >
> > Now I tried 2.6.38-rc3 and got a crash from aptitude on 2 out of 2
> > tries. Maybe aptitude was broken in between but it looks like a kernel
> > bug. Retried 2.6.38-rc2-00324-g70d1f36 and that seemed to work fine so
> > it's more likely a kernel problem.
> >
> > What additional information can I provide?
>
> Probably a bisection, if you could. There have been no parisc patches
> between -rc2 and -rc3, so it's coming from outside the architecture.

The result is strange :(

6b28405395f7ec492ea69f541cc774adcb9e00ca is the first bad commit
commit 6b28405395f7ec492ea69f541cc774adcb9e00ca
Author: Axel Köllhofer <[email protected]>
Date: Sat Jan 22 14:33:50 2011 -0600

staging: r8712u: Add new device IDs

This patch adds several new device ids to the r8712u staging driver.
The new ids were retrieved from latest vendor driver (v2.6.6.0.20101111)
downloadable from http://www.realtek.com.tw

Signed-off-by: Axel Koellhofer <[email protected]>
Signed-off-by: Larry Finger <[email protected]>
Cc: Stable <[email protected]> [2.6.37]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

:040000 040000 185c3d2c1e98cc99009bfb772ed0779410784110 f5aa903931116f28f803003f594fec3b2a29a6f6 M drivers

Seems absolutely unrelated - I do not have staging enabled, so there is
no CONFIG_R8712U either.

The "bad" bisects were clearly bad and failed quickly during aptitude
list update but the good ones might have needed more stress... or it is
some alignment-like problem. Will try again starting from these bad
bisects to narrow it down, and stress seemingly good ones better.
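
The concern about under-stressed "good" results can be sketched as follows. This is a hedged illustration in Python, not something used in this thread: it repeats a test several times and reports a commit as bad on the first failure, so that a single lucky pass is not taken as "good". The function names and the default of 5 attempts are hypothetical. The 0/1 return values follow the `git bisect run` convention, although in a kernel bisect where every step needs a reboot the verdict would be entered by hand with `git bisect good` or `git bisect bad`.

```python
# Sketch: repeat a flaky test so one lucky pass does not mark a commit "good".
# Return values follow the `git bisect run` convention: 0 = good, 1 = bad.
# The function names and the default of 5 attempts are hypothetical.
import subprocess

GOOD, BAD = 0, 1


def classify(run_once, attempts=5):
    """Return BAD on the first failing attempt, GOOD if every attempt passes."""
    for _ in range(attempts):
        if not run_once():
            return BAD
    return GOOD


def command_runner(argv):
    """Build a run_once callable that reports whether argv exited with status 0."""
    def run_once():
        # A process killed by a signal such as SIGSEGV has a negative
        # returncode, so it counts as a failure here.
        return subprocess.run(argv).returncode == 0
    return run_once


def main(argv, attempts=5):
    """argv is the test command, for example an aptitude list update."""
    return classify(command_runner(argv), attempts)
```

More attempts lower the chance of a false "good" but lengthen each bisect step, so the count is a trade-off rather than a fixed rule.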

--
Meelis Roos ([email protected])

2011-02-04 10:12:02

by Meelis Roos

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

> If aptitude fails consistently, it should be possible to debug or
> isolate to a particular kernel change. Usually, SMP segvs don't
> provide much information as to the cause of the problem. strace
> output and a gdb backtrace would be useful.

strace works but does not tell much to me:

2349 _newselect(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
2349 rt_sigaction(SIGTSTP, {0x40664bea, [RT_1 RT_4 RT_5 RT_7 RT_11 RT_12 RT_15 RT_16 RT_18 RT_26], SA_RESTART}, NULL, 8) = 0
2349 futex(0x458e0858, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x458e0850, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = -1 ENOSYS (Function not implemented)
2349 futex(0x458e0858, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
2363 <... futex resumed> ) = 0
2363 futex(0x458e0850, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
2349 <... futex resumed> ) = 1
2349 futex(0x458e0850, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
2363 <... futex resumed> ) = 0
2363 futex(0x458e0850, FUTEX_WAKE_PRIVATE, 1) = 0
2363 futex(0x458e0820, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
2349 <... futex resumed> ) = 1
2349 futex(0x458e0820, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
2363 <... futex resumed> ) = 0
2363 futex(0x458e0820, FUTEX_WAKE_PRIVATE, 1) = 0
2363 rename("/var/lib/apt/lists/partial/ftp.ee.debian.org_debian_dists_unstable_main_source_Sources.diff_2011-02-02-0207.41.decomp", "/var/lib/apt/lists/ftp.ee.debian.org_debian_dists_unstable_main_source_Sources.ed") = 0
2363 stat64("/usr/lib/apt/methods/rred", {st_mode=0, st_size=0, ...}) = 0
2363 pipe([26, 28]) = 0
2363 pipe([30, 31]) = 0
2363 fcntl64(26, F_SETFD, FD_CLOEXEC) = 0
2363 fcntl64(28, F_SETFD, FD_CLOEXEC) = 0
2363 fcntl64(30, F_SETFD, FD_CLOEXEC) = 0
2363 fcntl64(31, F_SETFD, FD_CLOEXEC) = 0
2363 clone( <unfinished ...>
2349 <... futex resumed> ) = 1
2363 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x460df4a8) = 2372
2349 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
2372 rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTART}, <unfinished ...>
2349 write(1, "\33[56;1H\33[34h\33[?25h", 18 <unfinished ...>

Something futex-related. Full log temporarily available at
http://www.cs.ut.ee/~mroos/aptitude-strace.txt

gdb does not seem to work well:

root@hernes:~# gdb aptitude
GNU gdb (GDB) 7.2-debian
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "hppa-linux-gnu".
For bug reporting instructions, please see:<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/aptitude...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/aptitude
[Thread debugging using libthread_db enabled]
warning: Can't attach LWP 1075813436: No such process
/tmp/buildd/gdb-7.2/gdb/linux-thread-db.c:392: internal-error: thread_get_info_callback: Assertion `inout->thread_info != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

--
Meelis Roos ([email protected])

2011-02-04 15:07:08

by John David Anglin

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

On Fri, 04 Feb 2011, Meelis Roos wrote:

> 2363 clone( <unfinished ...>
> 2349 <... futex resumed> ) = 1
> 2363 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x460df4a8) = 2372
> 2349 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> 2372 rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTART}, <unfinished ...>
> 2349 write(1, "\33[56;1H\33[34h\33[?25h", 18 <unfinished ...>
>
> Something futex-related. Full log temporarily available at
> http://www.cs.ut.ee/~mroos/aptitude-strace.txt

This is possibly the infamous COW bug.

> gdb does not seem to work well:

I think the segv is in the dynamic loader. Try running gdb on the
dynamic loader with aptitude as the run argument. I also suggest adding
/usr/lib/debug to LD_LIBRARY_PATH.

Dave
--
J. David Anglin [email protected]
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)

2011-02-04 15:20:10

by Carlos O'Donell

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

On Fri, Feb 4, 2011 at 10:07 AM, John David Anglin
<[email protected]> wrote:
> On Fri, 04 Feb 2011, Meelis Roos wrote:
>
>> 2363 clone( <unfinished ...>
>> 2349 <... futex resumed> ) = 1
>> 2363 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x460df4a8) = 2372
>> 2349 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> 2372 rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTART}, <unfinished ...>
>> 2349 write(1, "\33[56;1H\33[34h\33[?25h", 18 <unfinished ...>
>>
>> Something futex-related. Full log temporarily available at
>> http://www.cs.ut.ee/~mroos/aptitude-strace.txt
>
> This is possibly the infamous COW bug.

The COW bug that is triggered by a COW from an LWS-CAS? The solution
to which is to use locks around the LWS-CAS even on UP? I'd forgotten
about this issue actually, I should push that patch out to James.

Cheers,
Carlos.

2011-02-04 16:17:10

by John David Anglin

Subject: Re: 2.6.38-rc3 regression on parisc: segfaults

> > This is possibly the infamous COW bug.
>
> The COW bug that is triggered by a COW from an LWS-CAS? The solution
> to which is to use locks around the LWS-CAS even on UP? I'd forgotten
> about this issue actually, I should push that patch out to James.

I was actually thinking of the fork/clone race for which there are various
testcases on the wiki. The aptitude segv is with a SMP kernel, so I don't
think this is the UP LWS_CAS issue. However, I agree you should push
the change.

Dave
--
J. David Anglin [email protected]
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)