2003-08-08 14:39:55

by Xiaogang Wang

Subject: page_alloc.c bug and heavy I/O

Hi,

My hardware and software:

Asus P4P800, 2 GB memory, 2.8 GHz P4 with HT enabled.
On-board 3Com Gigabit network card
1 parallel ATA 160 GB Maxtor disk
NVIDIA GeForce4 MX440-8x graphics card (Asus V9180)

Red Hat 7.3, original kernel 2.4.18-3
Intel Fortran Compiler 7.1
Intel Math Kernel Library 6.0

My problem is that one of my Fortran codes always crashes after 10-24 hours.
This code is I/O-heavy: it writes out a 5 MB binary file every minute.

The error message in /var/log/messages is (coulson is the name of the computer):

Aug 7 21:11:29 coulson kernel: kernel BUG at page_alloc.c:226!
Aug 7 21:11:29 coulson kernel: invalid operand: 0000
Aug 7 21:11:29 coulson kernel: nfsd lockd sunrpc binfmt_misc sr_mod soundcore parport_pc lp parport autofs 3c
....

The line number in page_alloc.c varies between crashes (it is not always 226).


I have run a couple of tests to find the cause, so far without success.

1) I have unloaded (rmmod) the 3c2000.o network driver. I downloaded the driver
source from the Asus website and compiled it myself. The crash still occurs.

2) I have heard that the NVIDIA binary driver can cause page_alloc.c kernel BUGs.
I do have an NVIDIA GeForce4 MX440-8x graphics card (Asus V9180), but I did
not use the NVIDIA binary driver; instead I used the VESA driver that comes with
Red Hat 7.3. Nevertheless, I switched to an old PCI ATI Rage graphics card, but
the crash still occurs.

3) When the crash occurs, my local X server is up and running. I have not tried
running with the local X server shut down.

I also got the same crash when I ran the code on a second computer with the
same hardware and software. This makes a hardware defect less likely to be the
cause.

I am thinking of compiling a new kernel, but for now I am focusing on whether
the crash is caused by incompatible driver modules.

I would appreciate your input on this. Please cc your answer to me; I am not on
the list.

Xiaogang



------------------------------------------------
Dr Xiaogang Wang
Departement de chimie
Universite de Montreal
C.P. 6128, succursale Centre-ville
Montreal (Quebec) H3C 3J7

Tel. (514) 3436111 ext 3947 (office)
FAX (514) 3437586 (office)
e-mail: [email protected]
homepage: http://www.esi.umontreal.ca/~wangx
------------------------------------------------



2003-08-08 15:09:07

by Zwane Mwaikambo

Subject: Re: page_alloc.c bug and heavy I/O

On Fri, 8 Aug 2003, Xiaogang Wang wrote:

> Hi,
>
> My hardware and software:
>
> Asus P4P800, 2 GB memory, 2.8 GHz P4 with HT enabled.
> On-board 3Com Gigabit network card
> 1 parallel ATA 160 GB Maxtor disk
> NVIDIA GeForce4 MX440-8x graphics card (Asus V9180)
>
> Red Hat 7.3, original kernel 2.4.18-3

You might want to update your RedHat kernel and if the problem persists
report it on their bugzilla (bugzilla.redhat.com).

> Intel Fortran Compiler 7.1
> Intel Math Kernel Library 6.0
>
> My problem is that one of my Fortran codes always crashes after 10-24 hours.
> This code is I/O-heavy: it writes out a 5 MB binary file every minute.
>
> The error message in /var/log/messages is (coulson is the name of the computer):
>
> Aug 7 21:11:29 coulson kernel: kernel BUG at page_alloc.c:226!
> Aug 7 21:11:29 coulson kernel: invalid operand: 0000
> Aug 7 21:11:29 coulson kernel: nfsd lockd sunrpc binfmt_misc sr_mod soundcore

--
function.linuxpower.ca

2003-08-08 17:19:48

by Alan

Subject: Re: page_alloc.c bug and heavy I/O

On Fri, 2003-08-08 at 15:39, Xiaogang Wang wrote:
> Hi,
>
> My hardware and software:
>
> Asus P4P800, 2 GB memory, 2.8 GHz P4 with HT enabled.
> On-board 3Com Gigabit network card
> 1 parallel ATA 160 GB Maxtor disk
> NVIDIA GeForce4 MX440-8x graphics card (Asus V9180)
>
> Red Hat 7.3, original kernel 2.4.18-3

How about updating to the errata kernel?

2003-08-11 13:29:52

by Xiaogang Wang

Subject: Re: page_alloc.c bug and heavy I/O

On Fri, 8 Aug 2003, Zwane Mwaikambo wrote:

> On Fri, 8 Aug 2003, Xiaogang Wang wrote:
>
> > Hi,
> >
> > My hardware and software:
> >
> > Asus P4P800, 2 GB memory, 2.8 GHz P4 with HT enabled.
> > On-board 3Com Gigabit network card
> > 1 parallel ATA 160 GB Maxtor disk
> > NVIDIA GeForce4 MX440-8x graphics card (Asus V9180)
> >
> > Red Hat 7.3, original kernel 2.4.18-3
>
> You might want to update your RedHat kernel and if the problem persists
> report it on their bugzilla (bugzilla.redhat.com).
>

I have updated the kernel from linux-2.4.18-3 to linux-2.4.20-19.7, without
compiling the 3c2000.o network driver and with the old PCI ATI Rage graphics
card installed.

The kernel BUG now occurs in vmscan.c instead of page_alloc.c:


Aug 9 20:38:19 coulson kernel: ------------[ cut here ]------------
Aug 9 20:38:19 coulson kernel: kernel BUG at vmscan.c:747!
Aug 9 20:38:19 coulson kernel: invalid operand: 0000
Aug 9 20:38:19 coulson kernel:
Aug 9 20:38:19 coulson kernel: CPU: 0

I still have no clue.

Xiaogang
