2009-07-08 11:07:32

by Guennadi Liakhovetski

Subject: [BUG 2.6.30] Bad page map in process

Hi

with a 2.6.30 kernel with only platform-specific modifications and this
avr32 patch:

http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commitdiff;h=bb6e647051a59dca5a72b3deef1e061d7c1c34da

we're seeing kernel BUGs following an application segfault. Here's an
example:

[60254.432000] application[465]: segfault at 4377f876 pc 2aaabbde sp 7faa77f0 ecr 24
[60255.396000] BUG: Bad page map in process application pte:13f26ed4 pmd:92fdd000
[60255.404000] page:902c44c0 flags:0000002c count:1 mapcount:-1 mapping:9345765c index:5
[60255.412000] addr:2ae4f000 vm_flags:08000075 anon_vma:(null) mapping:93454dd4 index:0
[60255.420000] vma->vm_ops->fault: filemap_fault+0x0/0x26c
[60255.424000] vma->vm_file->f_op->mmap: generic_file_readonly_mmap+0x0/0x18
[60255.432000] Call trace:
[60255.432000] [<90027b7c>] dump_stack+0x18/0x20
[60255.432000] [<9005f2e8>] print_bad_pte+0x120/0x13c
[60255.432000] [<90060964>] unmap_vmas+0x230/0x3e4
[60255.432000] [<90061ed2>] exit_mmap+0x5e/0xd0
[60255.432000] [<9002d380>] mmput+0x24/0x7c
[60255.432000] [<9002fd70>] exit_mm+0xb4/0xb8
[60255.432000] [<90030ede>] do_exit+0xde/0x3d8
[60255.432000] [<90031222>] do_group_exit+0x4a/0x64
[60255.432000] [<9003688e>] get_signal_to_deliver+0x22a/0x24c
[60255.432000] [<900272ce>] do_signal+0x52/0x3f0
[60255.432000] [<90027698>] do_notify_resume+0x2c/0xfc
[60255.432000] [<900233d2>] fault_exit_work+0x24/0x36
[60255.432000]
[60255.432000] Disabling lock debugging due to kernel taint
[60255.432000] BUG: Bad page state in process application pfn:13f26
[60255.440000] page:902c44c0 flags:0000000c count:0 mapcount:-1 mapping:9345765c index:5
[60255.448000] Call trace:
[60255.448000] [<90027b7c>] dump_stack+0x18/0x20
[60255.448000] [<90054c76>] bad_page+0xa6/0xd0
[60255.448000] [<900557e6>] free_hot_cold_page+0xa2/0x160
[60255.448000] [<900558dc>] free_hot_page+0x8/0xc
[60255.448000] [<90057bae>] put_page+0xca/0xe8
[60255.448000] [<900662a4>] free_page_and_swap_cache+0x38/0x3c
[60255.448000] [<90060972>] unmap_vmas+0x23e/0x3e4
[60255.448000] [<90061ed2>] exit_mmap+0x5e/0xd0
[60255.448000] [<9002d380>] mmput+0x24/0x7c
[60255.448000] [<9002fd70>] exit_mm+0xb4/0xb8
[60255.448000] [<90030ede>] do_exit+0xde/0x3d8
[60255.448000] [<90031222>] do_group_exit+0x4a/0x64
[60255.448000] [<9003688e>] get_signal_to_deliver+0x22a/0x24c
[60255.448000] [<900272ce>] do_signal+0x52/0x3f0
[60255.448000] [<90027698>] do_notify_resume+0x2c/0xfc
[60255.448000] [<900233d2>] fault_exit_work+0x24/0x36
[60255.448000]

Questions: can this BUG be caused by the segfault (it better not)? If not,
what can be the reason? The problem occurs sporadically, I've only had one
such case since yesterday. Yet one more application segfault last night
didn't produce a BUG. This is with a kernel configured with SLAB. With
SLUB we also observed similar BUGs on application exit but without signal
handling path in the backtrace. But, I think, I've had other problems with
SLUB before, so, we switched back to SLAB for now...

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/


2009-07-08 11:23:30

by Hans-Christian Egtvedt

Subject: Re: [BUG 2.6.30] Bad page map in process

On Wed, 8 Jul 2009 13:07:31 +0200 (CEST)
Guennadi Liakhovetski <[email protected]> wrote:

Hi Guennadi,

> with a 2.6.30 kernel
>

Could you give a short description of the rest of your setup as well?

libc library and version number? Latest known to be good is uClibc
v0.9.30.1.

binutils version? Latest known to be good is binutils version
2.18.atmel.1.0.1.buildroot.1.

gcc version? Latest known to be good is gcc version
4.2.2-atmel.1.1.3.buildroot.1.

<snip link to patch and BUG output>

--
Best regards,
Hans-Christian Egtvedt

2009-07-08 12:28:24

by Guennadi Liakhovetski

Subject: Re: [BUG 2.6.30] Bad page map in process

On Wed, 8 Jul 2009, Hans-Christian Egtvedt wrote:

> On Wed, 8 Jul 2009 13:07:31 +0200 (CEST)
> Guennadi Liakhovetski <[email protected]> wrote:
>
> Hi Guennadi,
>
> > with a 2.6.30 kernel
> >
>
> Could you give a short description of the rest of your setup as well?

Sure, it is based on buildroot v2.3.0:

> libc library and version number? Latest known to be good is uClibc
> v0.9.30.1.

It's v0.9.30.

> binutils version? Latest known to be good is binutils version
> 2.18.atmel.1.0.1.buildroot.1.

Yep.

> gcc version? Latest known to be good is gcc version
> 4.2.2-atmel.1.1.3.buildroot.1.

Yep.

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/

2009-07-10 18:34:52

by Hugh Dickins

Subject: Re: [BUG 2.6.30] Bad page map in process

On Wed, 8 Jul 2009, Guennadi Liakhovetski wrote:
>
> with a 2.6.30 kernel with only platform-specific modifications and this
> avr32 patch:
>
> http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commitdiff;h=bb6e647051a59dca5a72b3deef1e061d7c1c34da
>
> we're seeing kernel BUGs following an application segfault. Here's an
> example:
>
> [60254.432000] application[465]: segfault at 4377f876 pc 2aaabbde sp 7faa77f0 ecr 24
> [60255.396000] BUG: Bad page map in process application pte:13f26ed4 pmd:92fdd000
> [60255.404000] page:902c44c0 flags:0000002c count:1 mapcount:-1 mapping:9345765c index:5
> [60255.412000] addr:2ae4f000 vm_flags:08000075 anon_vma:(null) mapping:93454dd4 index:0

This is the first time I've seen one of these messages since putting it
into 2.6.29, and nice to see that it's doing its job: the info amidst the
data is that mapcount is -1 when it ought to be 0, and the mapping,index
of the page the pte points to doesn't match up with the mapping,index
which the vma intends at that address: probably the pte is corrupt.
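
The two consistency checks described above can be sketched in a few lines
of Python. This is only an illustration: the helper names parse_fields and
check_report are made up here, and the input is the three report lines from
the original post with the timestamps dropped.

```python
import re

# The three report lines from the original post, timestamps dropped.
REPORT = """\
BUG: Bad page map in process application pte:13f26ed4 pmd:92fdd000
page:902c44c0 flags:0000002c count:1 mapcount:-1 mapping:9345765c index:5
addr:2ae4f000 vm_flags:08000075 anon_vma:(null) mapping:93454dd4 index:0
"""

def parse_fields(line):
    """Return the key:value pairs of one report line as a dict of strings."""
    return dict(re.findall(r"(\w+):(\S+)", line))

def check_report(text):
    """List the inconsistencies between the page line and the vma line."""
    _pte_line, page_line, vma_line = text.strip().splitlines()
    page = parse_fields(page_line)
    vma = parse_fields(vma_line)
    problems = []
    # mapcount should not be negative for a page that is being unmapped.
    if int(page["mapcount"]) < 0:
        problems.append("mapcount is negative")
    # The page the pte points to should belong to the file and offset
    # that the vma maps at this address.
    if (page["mapping"], page["index"]) != (vma["mapping"], vma["index"]):
        problems.append("page mapping,index differs from vma mapping,index")
    return problems

print(check_report(REPORT))
```

For the posted report both checks fire, which is what points at a corrupt
pte rather than at a problem in the page or the vma alone.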

I've not looked up avr32 pte layout, is 13f26ed4 good or bad?
I hope avr32 people can tell more about the likely cause.

Also, the addr mapped by this pte (2ae4f000) is not the address
which segfaulted (4377f876): it would have been satisfying if those
had matched up, but I don't think we can conclude anything from the
fact that they don't.

> [60255.420000] vma->vm_ops->fault: filemap_fault+0x0/0x26c
> [60255.424000] vma->vm_file->f_op->mmap: generic_file_readonly_mmap+0x0/0x18
> [60255.432000] Call trace: (exiting)
>
> Questions: can this BUG be caused by the segfault (it better not)?

It better not.

> If not, what can be the reason?

It looks like page table corruption.

> The problem occurs sporadically, I've only had one
> such case since yesterday. Yet one more application segfault last night
> didn't produce a BUG.

I think page table corruption is causing segfaults, and page table
corruption is causing "Bad page map"s when the app exits. Yes,
sometimes you'll see one, sometimes the other, sometimes both.

More might be learnt by comparing all the different such messages
you've seen: for example, we're now printing the "pmd" there, in
case it emerges that all such errors occur in or near the same
physical address.
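
A rough sketch of such a comparison, assuming the messages have been saved
as text: collect the pte and pmd from every "Bad page map" line and count
how often each pmd value recurs. The names collect_reports and
pmd_histogram are hypothetical, and since only one report has been posted
in this thread, that is the only input used here.

```python
import re
from collections import Counter

# Matches the first line of a "Bad page map" report as printed in this thread.
BAD_PTE = re.compile(
    r"BUG: Bad page map in process (\S+)\s+pte:([0-9a-f]+) pmd:([0-9a-f]+)"
)

def collect_reports(log_text):
    """Return (process, pte, pmd) for every "Bad page map" line in log_text."""
    return [
        (m.group(1), int(m.group(2), 16), int(m.group(3), 16))
        for m in BAD_PTE.finditer(log_text)
    ]

def pmd_histogram(reports):
    """Count how often each pmd value occurs across the reports."""
    return Counter(pmd for _proc, _pte, pmd in reports)

# The single report posted so far; more lines would be appended as they occur.
log = (
    "[60255.396000] BUG: Bad page map in process application "
    "pte:13f26ed4 pmd:92fdd000\n"
)
reports = collect_reports(log)
for pmd, n in pmd_histogram(reports).most_common():
    print(f"pmd {pmd:08x}: {n} report(s)")
```

With several reports collected, a pmd value (or a narrow range of them)
that keeps recurring would be the thing to look for.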

> This is with a kernel configured with SLAB. With
> SLUB we also observed similar BUGs on application exit but without signal
> handling path in the backtrace. But, I think, I've had other problems with
> SLUB before, so, we switched back to SLAB for now...

I wouldn't read too much into the SLAB versus SLUB difference here,
suspect just coincidence; but I could be horribly wrong.

Hugh

2009-07-12 07:57:58

by Haavard Skinnemoen

Subject: Re: [BUG 2.6.30] Bad page map in process

On Fri, 10 Jul 2009 19:34:06 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> I've not looked up avr32 pte layout, is 13f26ed4 good or bad?
> I hope avr32 people can tell more about the likely cause.

It looks OK for a user mapping, assuming you have at least 64MB of
SDRAM (the SDRAM starts at 0x10000000) -- all the normal userspace flags
are set and all the kernel-only flags are unset. It's marked as
executable, so it could be that the segfault was caused by the CPU
executing the wrong code.

The virtual address 0x4377f876 is a bit higher than what you normally
see on avr32 systems, but there's not necessarily anything wrong with
it -- userspace goes up to 0x80000000.
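
The address arithmetic behind this can be sketched as follows. It assumes
4 KiB pages (PAGE_SHIFT of 12), which is consistent with the "pfn:13f26"
printed in the Bad page state line of the original report. The flag bits
are not decoded here, and the helper names are made up for illustration.

```python
PAGE_SHIFT = 12            # assumption: 4 KiB pages
SDRAM_BASE = 0x10000000    # SDRAM start, as stated above
SDRAM_SIZE = 64 << 20      # "at least 64MB"
USER_LIMIT = 0x80000000    # userspace goes up to this address

def pte_frame(pte):
    """Split a pte value into (page frame number, low flag bits)."""
    return pte >> PAGE_SHIFT, pte & ((1 << PAGE_SHIFT) - 1)

def in_sdram(pfn):
    """True if the frame lies inside the assumed SDRAM window."""
    phys = pfn << PAGE_SHIFT
    return SDRAM_BASE <= phys < SDRAM_BASE + SDRAM_SIZE

pfn, low = pte_frame(0x13f26ed4)
print(hex(pfn), hex(low), in_sdram(pfn))
print(0x4377f876 < USER_LIMIT)
```

The frame sits a little under 64MB above the SDRAM base, which is why the
pte only looks sane if the board has at least that much memory.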

Btw, is preempt enabled when you see this?

Haavard

2009-07-12 19:59:59

by Guennadi Liakhovetski

Subject: Re: [BUG 2.6.30] Bad page map in process

On Sun, 12 Jul 2009, Haavard Skinnemoen wrote:

> On Fri, 10 Jul 2009 19:34:06 +0100 (BST)
> Hugh Dickins <[email protected]> wrote:
>
> > I've not looked up avr32 pte layout, is 13f26ed4 good or bad?
> > I hope avr32 people can tell more about the likely cause.
>
> It looks OK for a user mapping, assuming you have at least 64MB of
> SDRAM (the SDRAM starts at 0x10000000) -- all the normal userspace flags
> are set and all the kernel-only flags are unset. It's marked as
> executable, so it could be that the segfault was caused by the CPU
> executing the wrong code.
>
> The virtual address 0x4377f876 is a bit higher than what you normally
> see on avr32 systems, but there's not necessarily anything wrong with
> it -- userspace goes up to 0x80000000.
>
> Btw, is preempt enabled when you see this?

No, preempt was off.

I can give a couple more details about the problem:

1. it might well be hardware-related.

2. the specific BUG that I posted originally wasn't very interesting,
because it wasn't the first one. Having read a few posts I wasn't quite
sure how severe this BUG really was, i.e., whether or not it required a
reboot. There used to be a message like "reboot is required" around this
sort of exception, but it has since been removed, so I thought a reboot
wasn't required any more. But the fact is that once one such BUG has
occurred, new ones will come from various applications and eventually the
system will become unusable.

3. What makes it kind of hard to believe that it's a hardware problem
is that up to now we have only been able to produce the _first_ such
segfault and BUG with just one specific user-space application. In
principle the application doesn't do anything critical. It just uses Qt to
draw on the framebuffer. And we have been able to reproduce the problem by
running just a truncated version of the app - just the Qt and local class
initialisation. Running such an application repeatedly eventually produces
a segfault, and at some point also the "bad page map" BUG.
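
A reproduction loop of this kind could look roughly like the sketch below.
It is not the script actually used: the function name run_until_segfault is
made up, and the command in the usage comment is a placeholder for the
truncated application.

```python
import signal
import subprocess

def run_until_segfault(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run that was killed by
    SIGSEGV, or None if no run segfaulted.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # subprocess reports death by signal N as a return code of -N.
        if result.returncode == -signal.SIGSEGV:
            return run
    return None

# Hypothetical usage, with "./application" standing in for the test binary:
#   run_until_segfault(["./application"], max_runs=1000)
```

After a segfault is reported, the kernel log still has to be checked
separately for a "Bad page map" message, since the BUG does not accompany
every segfault.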

We're currently trying to investigate and fix the hardware, will post our
results.

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/

2009-07-13 11:57:01

by Hugh Dickins

Subject: Re: [BUG 2.6.30] Bad page map in process

On Sun, 12 Jul 2009, Guennadi Liakhovetski wrote:
>
> 2. the specific BUG that I posted originally wasn't very interesting,
> because it wasn't the first one. Having read a few posts I wasn't quite
> sure how severe this BUG really was, i.e., whether or not it required a
> reboot. There used to be a message like "reboot is required" around this
> sort of exception, but it has since been removed, so I thought a reboot
> wasn't required any more. But the fact is that once one such BUG has
> occurred, new ones will come from various applications and eventually the
> system will become unusable.

I replaced Bad page state's "reboot is needed" message by just the BUG
prefix: partly because the bad page handling _is_ now more resilient;
but more because I don't like wasting screenlines which could hold
vital info, and because I didn't see how this BUG differs from others
in whether or not you need a reboot.

A BUG means the kernel is in unknown territory: if you're brave and
want to gather more info, you try to keep on running after a BUG;
if you're cautious, you reboot as soon as possible.

(Hmm, but perhaps I should wire these in to panic_on_oops??)

You did the right thing: kept on running, then decided it wasn't
worth it. (But you've only sent the one pair of messages gathered:
okay, let's forget the rest until you've sorted the hardware angle.)

Hugh