2000-12-28 13:51:52

by chris

Subject: Repeatable Oops in 2.4t13p4ac2

Hi - we are seeing the following repeatable Oops in 2.4t13p4ac2 compiled using
gcc 2.95.2 for PIII running on IDE disks. Occurs whilst copying lots of files
to/from remote filesystems.

Thank you

Chris

Unable to handle kernel paging request at virtual address 00040000
c0131891
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0131891>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: d8702d80 ecx: d8702d80 edx: 00040000
esi: d8702d80 edi: d8702d80 ebp: 00000003 esp: c1893f88
ds: 0018 es: 0018 ss: 0018
Process bdflush (pid: 5, stackpage=c1893000)
Stack: c01344e3 d8702d80 c167db44 c167db60 00000000 0001bbca 00000018 00000000
c0129918 c167db44 00000000 c1892000 0000005a c1892332 0008e000 00000000
00000000 00000000 00000016 00000000 c0134ad4 00000003 00000000 00010f00
Call Trace: [<c01344e3>] [<c0129918>] [<c0134ad4>] [<c0107480>]
Code: 89 02 c7 41 30 00 00 00 00 0f b7 41 0a 50 51 e8 0f ff ff ff

>>EIP; c0131891 <__remove_from_queues+19/34> <=====
Trace; c01344e3 <try_to_free_buffers+b3/1e0>
Trace; c0129918 <page_launder+3a8/8bc>
Trace; c0134ad4 <bdflush+94/e8>
Trace; c0107480 <kernel_thread+28/38>
Code; c0131891 <__remove_from_queues+19/34>
00000000 <_EIP>:
Code; c0131891 <__remove_from_queues+19/34> <=====
0: 89 02 mov %eax,(%edx) <=====
Code; c0131893 <__remove_from_queues+1b/34>
2: c7 41 30 00 00 00 00 movl $0x0,0x30(%ecx)
Code; c013189a <__remove_from_queues+22/34>
9: 0f b7 41 0a movzwl 0xa(%ecx),%eax
Code; c013189e <__remove_from_queues+26/34>
d: 50 push %eax
Code; c013189f <__remove_from_queues+27/34>
e: 51 push %ecx
Code; c01318a0 <__remove_from_queues+28/34>
f: e8 0f ff ff ff call ffffff23 <_EIP+0xffffff23> c01317b4 <__remove_from_lru_list+0/68>
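
The "Oops: 0002" line above is the i386 page-fault error code, and its low three bits can be read mechanically. The following is a minimal sketch, not part of the original report; the function name decode_oops_code is hypothetical, and only the three low bits (present/protection, read/write, kernel/user) are interpreted.

```python
def decode_oops_code(code_hex):
    """Decode the low three bits of an i386 page-fault error code.

    bit 0: 0 = page not present, 1 = protection violation
    bit 1: 0 = read access,      1 = write access
    bit 2: 0 = kernel mode,      1 = user mode
    """
    code = int(code_hex, 16)
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


# The value reported in the trace above.
print(decode_oops_code("0002"))
```

Read this way, 0002 describes a kernel-mode write to a page that is not mapped, which agrees with the flagged instruction "mov %eax,(%edx)" storing through edx = 00040000.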



-------------------------------------------------
Everyone should have http://www.freedom2surf.net/


2000-12-28 14:19:24

by Marcelo Tosatti

Subject: Re: Repeatable Oops in 2.4t13p4ac2


On Thu, 28 Dec 2000, chris wrote:

> Hi - we are seeing the following repeatable Oops in 2.4t13p4ac2 compiled using
> gcc 2.95.2 for PIII running on IDE disks. Occurs whilst copying lots of files
> to/from remote filesystems.
>
> Thank you
>
> Chris
>
> Unable to handle kernel paging request at virtual address 00040000

Just to confirm: it always oopses on virtual address 00040000 ?

Thanks

2000-12-28 16:17:29

by Alan

Subject: Re: Repeatable Oops in 2.4t13p4ac2

> Hi - we are seeing the following repeatable Oops in 2.4t13p4ac2 compiled using
> gcc 2.95.2 for PIII running on IDE disks. Occurs whilst copying lots of files
> to/from remote filesystems.

I've had a couple of reports like this. Can you test 2.4t13p4 without the -ac
changes. If the -ac changes cause it then I need to know, but with the -ac
changes nobody else will care ;)

So the first way to narrow it down is that

2000-12-28 17:00:24

by Marcelo Tosatti

Subject: Re: Repeatable Oops in 2.4t13p4ac2


On Thu, 28 Dec 2000, Alan Cox wrote:

> > Hi - we are seeing the following repeatable Oops in 2.4t13p4ac2 compiled using
> > gcc 2.95.2 for PIII running on IDE disks. Occurs whilst copying lots of files
> > to/from remote filesystems.
>
> I've had a couple of reports like this. Can you test 2.4t13p4 without the -ac
> changes. If the -ac changes cause it then I need to know, but with the -ac
> changes nobody else will care ;)
>
> So the first way to narrow it down is that

Alan,

Do you remember if the reports you've got always oopsed the same
address (0040000) ?


2000-12-28 17:15:18

by Alan

Subject: Re: Repeatable Oops in 2.4t13p4ac2

> Alan,
>
> Do you remember if the reports you've got always oopsed the same
> address (0040000) ?

They vary in report

2000-12-28 17:34:20

by Marcelo Tosatti

Subject: Re: Repeatable Oops in 2.4t13p4ac2



On Thu, 28 Dec 2000, Alan Cox wrote:

> > Alan,
> >
> > Do you remember if the reports you've got always oopsed the same
> > address (0040000) ?
>
> They vary in report

Doesn't it sound like memory problems?

2000-12-28 17:47:43

by Alan

Subject: Re: Repeatable Oops in 2.4t13p4ac2

> > > Alan,
> > >
> > > Do you remember if the reports you've got always oopsed the same
> > > address (0040000) ?
> >
> > They vary in report
>
> Doesn't it sound like memory problems?

For -ac I'm working on the assumption I introduced a bug into the mm code

2000-12-28 18:39:47

by chris

Subject: Re: Repeatable Oops in 2.4t13p4ac2

> > > >
> > > > Do you remember if the reports you've got always oopsed the same
> > > > address (0040000) ?
> > >

Hi - Here's another Oops from the same machine. It looks to be in a totally
different place in the code which probably means it's a memory problem? I'll
try installing on another box to confirm.

Thank you for your help!

Chris

Unable to handle kernel NULL pointer dereference at virtual address 00000120
c0145914
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0145914>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010207
eax: 00000000 ebx: 00000100 ecx: 0000001e edx: 00000c0c
esi: 00000100 edi: 00000000 ebp: 0025dbb1 esp: c333fe5c
ds: 0018 es: 0018 ss: 0018
Process nfsd (pid: 194, stackpage=c333f000)
Stack: 00041182 dff86060 0025dbb1 c18ee000 c0145d3e c18ee000 0025dbb1 dff86060
00000000 00000000 00041182 c337a200 c3345ec0 c93da800 c0167b01 c18ee000
0025dbb1 00000000 00000000 00000003 c337a200 c0167f41 c18ee000 0025dbb1
Call Trace: [<c0145d3e>] [<c0167b01>] [<c0167f41>] [<c01eaef6>] [<c01684b0>]
[<c0168a3a>] [<c0166d39
[<c01666c3>] [<c01f8715>] [<c01664ed>] [<c0107480>]
Code: 39 6e 20 75 ef 8b 44 24 14 39 86 90 00 00 00 75 e3 85 ff 74

>>EIP; c0145914 <find_inode+1c/48> <=====
Trace; c0145d3e <iget4+52/e8>
Trace; c0167b01 <nfsd_iget+19/f0>
Trace; c0167f41 <find_fh_dentry+21/324>
Trace; c01eaef6 <inet_sendmsg+3e/44>
Trace; c01684b0 <fh_verify+26c/48c>
Trace; c0168a3a <nfsd_lookup+6a/4f8>
Trace; c01666c3 <nfsd_dispatch+cb/168>
Trace; c01f8715 <svc_process+28d/4c8>
Trace; c01664ed <nfsd+215/320>
Trace; c0107480 <kernel_thread+28/38>
Code; c0145914 <find_inode+1c/48>
00000000 <_EIP>:
Code; c0145914 <find_inode+1c/48> <=====
0: 39 6e 20 cmp %ebp,0x20(%esi) <=====
Code; c0145917 <find_inode+1f/48>
3: 75 ef jne fffffff4 <_EIP+0xfffffff4> c0145908 <find_inode+10/48>
Code; c0145919 <find_inode+21/48>
5: 8b 44 24 14 mov 0x14(%esp,1),%eax
Code; c014591d <find_inode+25/48>
9: 39 86 90 00 00 00 cmp %eax,0x90(%esi)
Code; c0145923 <find_inode+2b/48>
f: 75 e3 jne fffffff4 <_EIP+0xfffffff4> c0145908 <find_inode+10/48>
Code; c0145925 <find_inode+2d/48>
11: 85 ff test %edi,%edi
Code; c0145927 <find_inode+2f/48>
13: 74 00 je 15 <_EIP+0x15> c0145929 <find_inode+31/48>
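
The faulting address in each trace can be cross-checked against the register dump and the flagged instruction. The sketch below is illustrative only; the helper names effective_address and has_single_bit_set are hypothetical. It recomputes both addresses from the values printed in the two reports and checks whether the offending register values have exactly one bit set, which is a pattern a flipped bit can produce.

```python
def effective_address(base, displacement=0):
    """Address accessed by a 32-bit base+displacement operand."""
    return (base + displacement) & 0xFFFFFFFF


def has_single_bit_set(value):
    """True if exactly one bit of value is 1."""
    return value != 0 and value & (value - 1) == 0


# First report:  mov %eax,(%edx)     with edx = 00040000
first = effective_address(0x00040000)
# Second report: cmp %ebp,0x20(%esi) with esi = 00000100
second = effective_address(0x00000100, 0x20)

print(hex(first), hex(second))
print(has_single_bit_set(0x00040000), has_single_bit_set(0x00000100))
```

Both computed addresses match the "virtual address" lines in the reports (00040000 and 00000120). The single-bit pattern is only suggestive: two samples cannot distinguish bad RAM from a kernel bug that scribbles on memory.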



2000-12-28 18:54:43

by Marcelo Tosatti

Subject: Re: Repeatable Oops in 2.4t13p4ac2



On Thu, 28 Dec 2000, chris wrote:

> > > > >
> > > > > Do you remember if the reports you've got always oopsed the same
> > > > > address (0040000) ?
> > > >
>
> Hi - Here's another Oops from the same machine. It looks to be in a totally
> different place in the code which probably means it's a memory problem?

Not necessarily, but it may be a memory problem.

> I'll try installing on another box to confirm.

You can run the memtest86 tool (you can find it at
http://reality.sgi.com/cbrady_denver/memtest86/), which is a more reliable
way to find out if it's really a memory bug.

2000-12-29 02:16:55

by Chris Wedgwood

Subject: Re: Repeatable Oops in 2.4t13p4ac2

On Thu, Dec 28, 2000 at 05:18:30PM +0000, Alan Cox wrote:

> For -ac I'm working on the assumption I introduced a bug into the
> mm code

after discussion with you and further pounding with t13p4+cw (not ac)
I am fairly confident something in ac2 is fishy. I can repeatably get
ac2 to fail with PCMCIA and also reiserfs under load, I absolutely
cannot get these failures without ac2.

This is totally repeatable so if you want further diagnostics please
let me know....



--cw

P.S. Using 2.95.2 if that matters; it more or less always works, but
maybe ac2 triggers something there, as some people seem to find
ac2 pretty good?

2000-12-29 02:21:26

by Alan

Subject: Re: Repeatable Oops in 2.4t13p4ac2

> I am fairly confident something in ac2 is fishy. I can repeatably get
> ac2 to fail with PCMCIA and also reiserfs under load, I absolutely
> cannot get these failures without ac2.

The PCMCIA thing is unlikely to be related (there are no changes on any PCMCIA
that actually worked on 13pre4). Reiserfs might be the trigger because the
quota code changed, but if it did touch it I'd expect it to have failed
to compile

> This is totally repeatable so if you want further diagnostics please
> let me know....

I'm going to go and do a detailed audit of the mm bits I have differing from
Linus. For one I'd be much happier to differ in drivers with Linus and avoid
differing in mm/vm internals stuff.

Alan

2000-12-29 02:27:26

by Chris Wedgwood

Subject: Re: Repeatable Oops in 2.4t13p4ac2

On Fri, Dec 29, 2000 at 01:52:07AM +0000, Alan Cox wrote:

> The PCMCIA thing is unlikely to be related (there are no changes
> on any PCMCIA that actually worked on 13pre4).

Oh, I'm sure it is unrelated -- it's just a good trigger for the
problem on my laptop.

> Reiserfs might be the trigger because the quota code changed, but
> if it did touch it I'd expect it to have failed to compile

No quotas, local version of reiserfs but probably not too divergent
from what else is out there.

> I'm going to go and do a detailed audit of the mm bits I have
> differing from Linus. For one I'd be much happier to differ in
> drivers with Linus and avoid differing in mm/vm internals stuff.

I'll resync to t13p5 and any new reiserfs change that might be
relevant and see how that goes; if that works I'll start merging in
ac2 and see when it breaks.



--cw
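
The plan above (start from a working tree, merge in pieces of ac2, and see when it breaks) can be run as a binary search rather than one piece at a time. This is a minimal sketch under stated assumptions: the patch names are made up for illustration, is_broken is a hypothetical callable standing in for "build, boot, and run the load test", the tree with no pieces applied works, the tree with all pieces applied fails, and the failure persists once the offending piece is in.

```python
def first_bad_patch(patches, is_broken):
    """Return the index of the first patch that introduces the failure.

    Invariant: applying the first `good` patches works, applying the
    first `bad` patches fails. The patch list must not be empty.
    """
    good, bad = 0, len(patches)
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_broken(patches[:mid]):
            bad = mid    # failure is already present in the first `mid` patches
        else:
            good = mid   # first `mid` patches are fine
    return bad - 1


# Hypothetical split of a patch set into ordered pieces.
pieces = ["drivers", "net", "fs", "mm", "arch"]
# Stand-in for a real test run: pretend the "mm" piece is at fault.
culprit = first_bad_patch(pieces, lambda applied: "mm" in applied)
print(pieces[culprit])
```

With n pieces this needs roughly log2(n) test runs, which matters when each run means a rebuild and a long stress test. It does not help if the failure depends on an interaction between pieces that are far apart in the ordering.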