2004-11-04 00:44:55

by Ray Van Dolson

Subject: kernel BUG at mm/prio_tree.c:377

Description of problem:
Running on an HP DL140 with dual 2.4GHz Xeons, 1GB of ECC DDR, and Fedora
Core 2.

This server operates as a PPTP concentrator running the PoPToP server
(1.2.1) along with pppd 2.4.3. We have tried this system using both
the onboard Broadcom gigabit NICs and a dual Intel EEPro 100.

Usually within 24 hours of bootup, the following oops occurs:

kernel BUG at mm/prio_tree.c:377!
invalid operand: 0000 [#1]
SMP
Modules linked in: ipt_LOG(U) sch_tbf(U) ppp_mppe(U) ppp_async(U)
crc_ccitt(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U)
ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U)
ip_tables(U) md5(U) ipv6(U) sunrpc(U) e100(U) mii(U)
sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U)
battery(U) asus_acpi(U) ac(U) ext3(U) jbd(U)
CPU: 1
EIP: 0060:[<021425de>] Tainted: P
EFLAGS: 00010202 (2.6.8-1.521custom)
EIP is at prio_tree_right+0x85/0xc5
eax: 00000009 ebx: 0cf1acf8 ecx: 00000000 edx: 12da3d00
esi: 00000000 edi: 00000004 ebp: 404a6d78 esp: 0cf1ac90
ds: 007b es: 007b ss: 0068
Process yum (pid: 24194, threadinfo=0cf1a000 task=12e4ecb0)
Stack: 0cf1acf8 00000004 00000004 404a6d78 021427ae 00000004 0cf1acb0
0cf1acb4 00000000 00000043 0cf1acf8 404a6d78 00000004 08ec1ac4 02142968
00000004 0000007b 404a6d54 034fac80 02150cf7 00000004 00000004 00000004
00000001
Call Trace:
[<021427ae>] prio_tree_next+0x89/0x9b
[<02142968>] vma_prio_tree_next+0x4b/0x63
[<02150cf7>] page_referenced+0x14d/0x18d
[<021478cd>] refill_inactive_zone+0x245/0x6a0
[<0211b29e>] activate_task+0x86/0x93
[<02147db5>] shrink_zone+0x8d/0xb4
[<02147e1f>] shrink_caches+0x43/0x4e
[<02147edd>] try_to_free_pages+0xb3/0x16c
[<02140369>] __alloc_pages+0x1c8/0x2be
[<0214bd83>] do_anonymous_page+0xb6/0x241
[<0214bf77>] do_no_page+0x69/0x3a0
[<0214c460>] handle_mm_fault+0xdf/0x1d4
[<0211955b>] do_page_fault+0x17c/0x58b
[<0214e81d>] unmap_vma_list+0xe/0x17
[<0214ebd5>] do_munmap+0x17a/0x186
[<0214fcef>] move_page_tables+0x3f/0x4c
[<0214fded>] move_vma+0xf1/0x175
[<0215017a>] do_mremap+0x309/0x32c
[<021193df>] do_page_fault+0x0/0x58b
Code: 0f 0b 79 01 cf fa 2e 02 39 52 04 74 08 0f 0b 7a 01 cf fa 2e

The system continues to function for approximately another minute. I
see messages such as the following on the console repeatedly:

dst cache overflow

Eventually the system becomes completely unresponsive. When I hit the
power button, ACPI tries to power down the system, but hangs after
killing a few processes and I must hard reset it.

I do not think this is bad hardware, as we have approximately 11
DL140s and this happens on all of them, although more quickly on
the ones with higher load (network traffic, CPU usage, etc.).

Hoping someone can offer some suggestions, or tell me if this is more
likely to be a hardware issue after all... I just can't imagine getting
that many bad servers. :)

Thanks in advance.
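
Note for readers triaging similar reports: the oops above already names the
failing assertion, the function that hit it, and the path that led there (page
reclaim walking the VMA prio tree via page_referenced). A minimal sketch of
pulling those fields out of a pasted 2.6-era i386 oops follows. The function
name summarize_oops and the regular expressions are illustrative assumptions
tuned to the format shown in this thread, not a general oops parser.

```python
import re

# Patterns written against the i386 oops text quoted in this thread.
BUG_RE = re.compile(r"kernel BUG at (\S+?):(\d+)!")
EIP_RE = re.compile(r"EIP is at (\w+)\+0x")
TAINT_RE = re.compile(r"Tainted:\s*(.*)$")
# Trace lines look like "[<021427ae>] prio_tree_next+0x89/0x9b".
TRACE_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Extract the BUG location, faulting symbol, taint string and call trace.

    Only the first BUG/EIP/Tainted occurrence is kept, so feeding in several
    concatenated oopses summarizes the first one and merges all trace symbols.
    """
    summary = {"bug": None, "eip": None, "tainted": None, "trace": []}
    for line in text.splitlines():
        bug = BUG_RE.search(line)
        if bug and summary["bug"] is None:
            summary["bug"] = (bug.group(1), int(bug.group(2)))
        eip = EIP_RE.search(line)
        if eip and summary["eip"] is None:
            summary["eip"] = eip.group(1)
        taint = TAINT_RE.search(line)
        if taint and summary["tainted"] is None:
            summary["tainted"] = taint.group(1).strip()
        trace = TRACE_RE.search(line)
        if trace:
            summary["trace"].append(trace.group(1))
    return summary
```

Calling summarize_oops() on the text of a report gives a compact dictionary
that is easier to compare across machines than the raw dump.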


Subject: Re: kernel BUG at mm/prio_tree.c:377


Hi Ray,

Can you please apply the patch I recently posted and report
back.

http://marc.theaimsgroup.com/?l=linux-kernel&m=109926628920398

The patch fixes a bug reported earlier. However, earlier
oops were triggered at mm/prio_tree.c:538.

I haven't looked at the trace carefully. I will do so.
Please report back if the previous patch fixes your problem.

Thanks,
Rajesh


2004-11-04 08:04:42

by Arjan van de Ven

Subject: Re: kernel BUG at mm/prio_tree.c:377

On Wed, 2004-11-03 at 16:36 -0800, Ray Van Dolson wrote:
> Description of problem:
> Running on an HP DL140 with dual 2.4GHz Xeons, 1GB of ECC DDR, and Fedora
> Core 2.

> EIP: 0060:[<021425de>] Tainted: P

Which binary-only driver are you using?

2004-11-04 16:15:24

by Ray Van Dolson

Subject: Re: kernel BUG at mm/prio_tree.c:377

ppp_mppe patch from the pppd package. Lots of people use it without
problems. If it is the source of troubles, that won't be good as we need
it for our clients to connect. :)

On Thu, Nov 04, 2004 at 09:04:16AM +0100, Arjan van de Ven wrote:
> On Wed, 2004-11-03 at 16:36 -0800, Ray Van Dolson wrote:
> > Description of problem:
> > Running on an HP DL140 with dual 2.4GHz Xeons, 1GB of ECC DDR, and Fedora
> > Core 2.
> > EIP: 0060:[<021425de>] Tainted: P
> Which binary-only driver are you using?

2004-11-04 16:33:48

by Ray Van Dolson

Subject: Re: kernel BUG at mm/prio_tree.c:377

I should note that the reason it taints the kernel is that it uses the BSD
license.
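
For context on why a BSD-licensed module sets the "P" taint flag: the module
loader compares the module's MODULE_LICENSE string against a short list of
GPL-compatible strings (license_is_gpl_compatible() in include/linux/module.h)
and taints the kernel for anything else. The sketch below mirrors that check
under stated assumptions: the list is recalled from 2.6-era kernels and should
be verified against your own tree, and the "BSD" string is only a stand-in for
whatever the ppp_mppe patch actually declares.

```python
# Assumption: license strings accepted as GPL-compatible by 2.6-era kernels.
# Verify against include/linux/module.h in the tree you are running.
GPL_COMPATIBLE = frozenset({
    "GPL",
    "GPL v2",
    "GPL and additional rights",
    "Dual BSD/GPL",
    "Dual MPL/GPL",
})


def taints_kernel(license_string):
    """Return True if loading a module with this license string would set 'P'.

    The comparison is an exact string match, as in the kernel's own check,
    so a plain "BSD" declaration taints while "Dual BSD/GPL" does not.
    """
    return license_string not in GPL_COMPATIBLE
```

For example, taints_kernel("BSD") is True and taints_kernel("Dual BSD/GPL") is
False, which is consistent with the taint here coming from the license
declaration rather than from a binary-only driver.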

On Thu, Nov 04, 2004 at 08:14:30AM -0800, Ray Van Dolson wrote:
> ppp_mppe patch from the pppd package. Lots of people use it without
> problems. If it is the source of troubles, that won't be good as we need
> it for our clients to connect. :)
>
> On Thu, Nov 04, 2004 at 09:04:16AM +0100, Arjan van de Ven wrote:
> > On Wed, 2004-11-03 at 16:36 -0800, Ray Van Dolson wrote:
> > > Description of problem:
> > > Running on an HP DL140 with dual 2.4GHz Xeons, 1GB of ECC DDR, and Fedora
> > > Core 2.
> > > EIP: 0060:[<021425de>] Tainted: P
> > Which binary-only driver are you using?

2004-11-09 00:45:02

by Ray Van Dolson

Subject: Re: kernel BUG at mm/prio_tree.c:377

Rajesh, I applied your patch and it definitely seems to have helped. The
server lasted nearly three days. :-) In fact, it didn't really seem to
hard lock this time, but I had to reset it to get things working again
after the latest crash.

Details:


kernel BUG at kernel/exit.c:842!
invalid operand: 0000 [#1]
SMP
Modules linked in: sch_tbf(U) ppp_async(U) crc_ccitt(U) ppp_mppe(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) ac(U) ext3(U) jbd(U)
CPU: 2
EIP: 0060:[<02121d10>] Tainted: P VLI
EFLAGS: 00010246 (2.6.9-1.1_FC2custom)
EIP is at do_exit+0x3b3/0x3bd
eax: 00000000 ebx: 26506560 ecx: 26506000 edx: 0381dd60
esi: 41fec340 edi: 26506030 ebp: 00001000 esp: 23b82f98
ds: 007b es: 007b ss: 0068
Process pppd (pid: 27091, threadinfo=23b82000 task=26506030)
Stack: 0d611e00 00001000 23b82000 23b82000 02121e05 00001000 23b82fc4 00000010
f6f32684 23b82000 fffec200 00000010 00000000 00000000 00000010 f6f32684
fef87938 000000fc 0000007b 0000007b 000000fc f6fa37a2 00000073 00000246
Call Trace:
[<02121e05>] sys_exit_group+0x0/0xd
Code: c1 e0 07 8d 04 10 ff 88 00 01 00 00 83 3a 02 75 0b 8b 82 08 11 00 00 e8 d8 95 ff ff 89 6f 7c 89 f8 e8 88 f5 ff ff e8 bc 74 19 00 <0f> 0b 4a 03 96 09 2d 02 eb fe 53 85 c0 89 d3 74 05 e8 35 ab ff
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000024
printing eip:
0211ddb0
*pde = 00004001
Oops: 0000 [#2]
SMP
Modules linked in: sch_tbf(U) ppp_async(U) crc_ccitt(U) ppp_mppe(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) ac(U) ext3(U) jbd(U)
CPU: 2
EIP: 0060:[<0211ddb0>] Tainted: P VLI
EFLAGS: 00010286 (2.6.9-1.1_FC2custom)
EIP is at mm_release+0x33/0x70
eax: 00000000 ebx: 26506030 ecx: 00000000 edx: 00000000
esi: f6ff6828 edi: 00000000 ebp: 0000000b esp: 23b82e50
ds: 007b es: 007b ss: 0068
Process pppd (pid: 27091, threadinfo=23b82000 task=26506030)
Stack: 00000000 00000000 23b82f64 26506030 02121a20 23b82000 23b82f64 00000000
022c9112 021064a2 0000000b 23b82f64 022c9112 00000000 000000ff 0000000b
00000000 02106784 00001000 23b82f64 00000000 02106784 00001000 02106850
Call Trace:
[<02121a20>] do_exit+0xc3/0x3bd
[<021064a2>] do_divide_error+0x0/0xea
[<02106784>] do_invalid_op+0x0/0xd5
[<02106784>] do_invalid_op+0x0/0xd5
[<02106850>] do_invalid_op+0xcc/0xd5
[<0211bff5>] load_balance+0x27/0x135
[<02121d10>] do_exit+0x3b3/0x3bd
[<022b9a4a>] schedule+0x87e/0x8aa
[<0217e45d>] proc_delete_inode+0x0/0x61
[<022b9a4a>] schedule+0x87e/0x8aa
[<02121d10>] do_exit+0x3b3/0x3bd
[<02121e05>] sys_exit_group+0x0/0xd
Code: 8b 90 14 01 00 00 31 c0 8e e0 8e e8 85 d2 74 11 c7 83 14 01 00 00 00 00 00 00 89 d0 e8 b5 ea ff ff 8b b3 1c 01 00 00 85 f6 74 38 <8b> 47 24 48 7e 32 c7 83 1c 01 00 00 00 00 00 00 89 e2 89 f1 c7
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000024
printing eip:
0211ddb0
*pde = 00004001
Oops: 0000 [#3]
SMP
Modules linked in: sch_tbf(U) ppp_async(U) crc_ccitt(U) ppp_mppe(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) ac(U) ext3(U) jbd(U)
CPU: 2
EIP: 0060:[<0211ddb0>] Tainted: P VLI
EFLAGS: 00010286 (2.6.9-1.1_FC2custom)
EIP is at mm_release+0x33/0x70
eax: 00000000 ebx: 26506030 ecx: 00000000 edx: 00000000
esi: f6ff6828 edi: 00000000 ebp: 0000000b esp: 23b82cc8
ds: 007b es: 007b ss: 0068

I have also started noticing "Neighbour table overflow" error messages.
This server makes heavy use of proxy ARP, so I wonder if I need to tweak
the gc_thresh* and the other gc_* variables in /proc...
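
A quick way to check whether the neighbour table is actually near its limits
is to compare the number of ARP entries with the three gc_thresh values. The
sketch below assumes the usual IPv4 locations (/proc/sys/net/ipv4/neigh/default
and /proc/net/arp); the helper names are made up for illustration, and the row
count from /proc/net/arp is only an approximation of what the kernel counts
against the thresholds.

```python
from pathlib import Path

# Assumed standard locations for the IPv4 neighbour settings and ARP table.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def read_thresholds(base=NEIGH_DIR):
    """Read gc_thresh1, gc_thresh2 and gc_thresh3 as a tuple of ints."""
    return tuple(int((base / f"gc_thresh{n}").read_text()) for n in (1, 2, 3))


def count_arp_entries(arp_text):
    """Count data rows in /proc/net/arp-style text; the first line is a header."""
    rows = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(rows) - 1, 0)


def classify(entries, thresholds):
    """Place the entry count relative to the three thresholds.

    gc_thresh1: below this the garbage collector leaves entries alone.
    gc_thresh2: soft limit, may be exceeded only briefly.
    gc_thresh3: hard limit, beyond which new entries are refused.
    """
    thresh1, thresh2, thresh3 = thresholds
    if entries >= thresh3:
        return "hard-limit"
    if entries >= thresh2:
        return "soft-limit"
    if entries >= thresh1:
        return "gc-eligible"
    return "ok"


def report():
    """Print the current table usage on a live Linux system."""
    limits = read_thresholds()
    used = count_arp_entries(ARP_TABLE.read_text())
    print(f"{used} entries, thresholds {limits}: {classify(used, limits)}")
```

Running report() periodically on the concentrator would show whether the
overflow messages coincide with the entry count approaching gc_thresh3, which
is the case where raising the thresholds is worth trying.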

The weird thing is that even after these "oopses" happened, the box was
still functioning. I could access the web server running on it, it was
still passing traffic for existing tunnels, but I could not establish new
ones. Couldn't ssh in, etc (thus I had to hard reset it).

As you can see, this is running kernel 2.6.9 (from the Fedora Core 2 testing
update tree) with the patch you mentioned below.

Any ideas?

On Wed, Nov 03, 2004 at 08:27:09PM -0500, Rajesh Venkatasubramanian wrote:
> Hi Ray,
>
> Can you please apply the patch I recently posted and report
> back.
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109926628920398
>
> The patch fixes a bug reported earlier. However, earlier
> oops were triggered at mm/prio_tree.c:538.
>
> I haven't looked at the trace carefully. I will do so.
> Please report back if the previous patch fixes your problem.
>
> Thanks,
> Rajesh
>