2007-10-16 19:11:33

by Dave Jones

Subject: Re: error: Eeek! page_mapcount(page) went negative! (-1) with different process and kernels

On Tue, Oct 16, 2007 at 07:17:32PM +0200, Arnaud Fontaine wrote:
> Hello,
>
> We often get the following error from the kernel:
>

> sshd[1551] trap invalid opcode rip:2aeacc0677a0 rsp:7fffe0c7e688 error:0
> Eeek! page_mapcount(page) went negative! (-1)
> page pfn = 7f7a8
> page->flags = 400000000001002c
> page->count = 1
> page->mapping = ffff810056170550
> vma->vm_ops = 0xffffffff80667ba0
> vma->vm_ops->nopage = _stext+0x7fdf7000/0x20
> vma->vm_ops->fault = filemap_fault+0x0/0x450
> vma->vm_file->f_op->mmap = generic_file_mmap+0x0/0x50
> ....
>
> We have tested with different kernels (2.6.23.1 and 2.6.22), and the same
> error happens with different processes. Any idea what could cause this
> error?

Many of these that I've seen have turned out to be a hardware problem.
Try running memtest86+ on that machine for a while.
It doesn't catch all problems, but it will highlight more common memory faults.
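To illustrate the write-then-verify idea behind a pattern test, here is a minimal Python sketch. It is a hypothetical illustration, not a substitute for memtest86+: it runs in userspace, can only touch memory the OS happens to hand the process, and cannot target specific physical addresses. The function name `pattern_test` and the choice of patterns are assumptions made for this example.

```python
# Hypothetical userspace illustration of a write-then-verify pattern test.
# NOT a substitute for memtest86+: it only exercises memory the OS gives
# this process and cannot choose physical addresses.

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def _patterns(size: int):
    """Yield one full-size buffer per pattern, generated lazily."""
    for p in PATTERNS:
        yield bytes([p]) * size
    # "Own address" style pattern: each byte holds the low 8 bits of its offset.
    yield bytes(range(256)) * (size // 256) + bytes(range(size % 256))


def pattern_test(size_bytes: int, passes: int = 1) -> list[tuple[int, int, int, int]]:
    """Return mismatches as (pass_no, offset, expected, actual) tuples."""
    buf = bytearray(size_bytes)
    errors = []
    for pass_no in range(passes):
        for expected in _patterns(size_bytes):
            buf[:] = expected          # write the whole buffer
            if buf != expected:        # fast whole-buffer read-back check
                # Only on a mismatch, locate the differing bytes one by one.
                for off, (want, got) in enumerate(zip(expected, buf)):
                    if want != got:
                        errors.append((pass_no, off, want, got))
    return errors
```

For example, `print(pattern_test(64 * 1024 * 1024, passes=2) or "no mismatches found")`. On healthy hardware the list is empty, and, as noted in this thread, a clean result does not rule out faults that only appear under load.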

Dave

--
http://www.codemonkey.org.uk

2007-10-16 23:03:19

by Arnaud Fontaine

Subject: Re: error: Eeek! page_mapcount(page) went negative! (-1) with different process and kernels

>>>>> "Dave" == Dave Jones <[email protected]> writes:

Dave> Many of these that I've seen have turned out to be a hardware
Dave> problem. Try running memtest86+ on that machine for a while.
Dave> It doesn't catch all problems, but it will highlight more
Dave> common memory faults.

Hello,

We ran memtest86+ before putting the machine into production, about a
month ago. Do you think it could still be a hardware problem?

Regards,
Arnaud Fontaine

2007-10-17 02:36:56

by Dave Jones

Subject: Re: error: Eeek! page_mapcount(page) went negative! (-1) with different process and kernels

On Wed, Oct 17, 2007 at 01:03:02AM +0200, Arnaud Fontaine wrote:
> >>>>> "Dave" == Dave Jones <[email protected]> writes:
>
> Dave> Many of these that I've seen have turned out to be a hardware
> Dave> problem. Try running memtest86+ on that machine for a while.
> Dave> It doesn't catch all problems, but it will highlight more
> Dave> common memory faults.
>
> Hello,
>
> We ran memtest86+ before putting the machine into production, about a
> month ago. Do you think it could still be a hardware problem?

Not impossible. Hardware failures can occur at any time.
Somewhat unlikely though. As I mentioned, memtest also doesn't trap
all hardware problems. I have a board that passes memtest with flying
colours, yet dies under even slight load. Examination of the board
shows that it has leaking capacitors.

Dave

2007-10-18 11:26:17

by Goswin von Brederlow

Subject: Re: error: Eeek! page_mapcount(page) went negative! (-1) with different process and kernels

Arnaud Fontaine <[email protected]> writes:

>>>>>> "Dave" == Dave Jones <[email protected]> writes:
>
> Dave> Many of these that I've seen have turned out to be a hardware
> Dave> problem. Try running memtest86+ on that machine for a while.
> Dave> It doesn't catch all problems, but it will highlight more
> Dave> common memory faults.
>
> Hello,
>
> We ran memtest86+ before putting the machine into production, about a
> month ago. Do you think it could still be a hardware problem?

In my experience memtest often does not reveal an error. Errors only
show up when you combine several sources of load or use random access
patterns, for example compiling a kernel while doing heavy I/O on the
disk. But that might just be me. Errors are rather random occurrences.

Compiling a kernel repeatedly, with several builds running in parallel,
is usually a good test. If it sometimes fails to compile, it is almost
certainly a hardware error.
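The parallel-compile test can be automated. Below is a minimal Python sketch, assuming you have several separate, identical kernel source trees and a build command that starts from a clean tree each time, so every round does real work. The function names and the example paths are hypothetical. A tree whose identical builds sometimes pass and sometimes fail is flagged, matching the "sometimes fails to compile" criterion above; a tree that always fails more likely has a genuine build problem.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_once(cmd: list[str], tree: str) -> int:
    """Run one build inside `tree` and return its exit status."""
    result = subprocess.run(cmd, cwd=tree,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode


def stress_builds(cmd: list[str], trees: list[str], rounds: int) -> dict[str, list[int]]:
    """Build all trees in parallel, `rounds` times; record exit codes per tree."""
    history: dict[str, list[int]] = {tree: [] for tree in trees}
    with ThreadPoolExecutor(max_workers=max(1, len(trees))) as pool:
        for _ in range(rounds):
            # pool.map starts one build per tree concurrently; iterating the
            # results waits for the whole round before the next one starts.
            codes = pool.map(lambda tree: build_once(cmd, tree), trees)
            for tree, code in zip(trees, codes):
                history[tree].append(code)
    return history


def flaky_trees(history: dict[str, list[int]]) -> list[str]:
    """Trees whose identical builds sometimes pass and sometimes fail."""
    return [tree for tree, codes in history.items()
            if 0 in codes and any(code != 0 for code in codes)]
```

For example, with hypothetical trees: `h = stress_builds(["sh", "-c", "make clean && make -j2"], ["/tmp/linux-a", "/tmp/linux-b"], rounds=20)`, then `print(flaky_trees(h))`. Running the builds concurrently while other I/O is going on is the point: it loads the CPU, memory and disk together, which is where memtest alone may come up clean.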

Regards,
Goswin