2006-03-08 19:36:28

by Fernando Lopez-Lezcano

Subject: 2.6.15-rt20, "bad page state", jackd

Hi all, I reported this in mid January (I thought I had sent to the list
but the report went to Ingo and Steven off list)

I'm seeing the same problem in 2.6.15-rt21 in some of my machines. After
a reboot into the kernel I just login as root in a terminal, start the
jackd sound server ("jackd -d alsa -d hw") and when stopping it (just
doing a <ctrl>c) I get a bunch of messages of this form:

> Trying to fix it up, but a reboot is needed
> Bad page state at __free_pages_ok (in process 'jackd', page c10012fc)

Has anyone else seen this?

I'm in the process of building an -rt21 kernel before posting more
detailed error messages (this kernel is patched with some additional
stuff).

Ingo had suggested at the time (Jan 21 or so) that I enable DEBUG_PAGEALLOC
and DEBUG_SLAB to get more information, but I was never able to get the
machine to boot into that kernel; I kept getting oopses. Weird: I'm
trying to find the post I sent to lkml (I have it in my mailbox) in the
lkml archives to include a link and can't find it...

More details later...
-- Fernando



2006-03-08 21:48:57

by Fernando Lopez-Lezcano

Subject: Re: 2.6.15-rt20, "bad page state", jackd

On Wed, 2006-03-08 at 11:36 -0800, Fernando Lopez-Lezcano wrote:
> Hi all, I reported this in mid January (I thought I had sent to the list
> but the report went to Ingo and Steven off list)
>
> I'm seeing the same problem in 2.6.15-rt21 in some of my machines. After
> a reboot into the kernel I just login as root in a terminal, start the
> jackd sound server ("jackd -d alsa -d hw") and when stopping it (just
> doing a <ctrl>c) I get a bunch of messages of this form:
>
> > Trying to fix it up, but a reboot is needed
> > Bad page state at __free_pages_ok (in process 'jackd', page c10012fc)
>
> Has anyone else seen this?
>
> I'm in the process of building an -rt21 kernel before posting more
> detailed error messages (this kernel is patched with some additional
> stuff).

This is what the errors look like (I'm attaching the whole dmesg output
and the .config file used to build the smp kernel):

Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
flags:0x00000414 mapping:00000000 mapcount:0 count:0
Backtrace:
[<c015947d>] bad_page+0x7d/0xc0 (8)
[<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
[<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
[<c015db47>] release_pages+0x127/0x1a0 (16)
[<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
[<c01681ae>] unmap_region+0x13e/0x160 (28)
[<c0168461>] do_munmap+0xe1/0x120 (48)
[<c01684df>] sys_munmap+0x3f/0x60 (32)
[<c01034a1>] syscall_call+0x7/0xb (16)
Trying to fix it up, but a reboot is needed
Bad page state at __free_pages_ok (in process 'jackd', page c1013ca4)
flags:0x00000414 mapping:00000000 mapcount:0 count:0
Backtrace:
[<c015947d>] bad_page+0x7d/0xc0 (8)
[<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
[<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
[<c015db47>] release_pages+0x127/0x1a0 (16)
[<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
[<c01681ae>] unmap_region+0x13e/0x160 (28)
[<c0168461>] do_munmap+0xe1/0x120 (48)
[<c01684df>] sys_munmap+0x3f/0x60 (32)
[<c01034a1>] syscall_call+0x7/0xb (16)
Trying to fix it up, but a reboot is needed
Bad page state at __free_pages_ok (in process 'jackd', page c1013c68)
flags:0x00000414 mapping:00000000 mapcount:0 count:0
Backtrace:
[<c015947d>] bad_page+0x7d/0xc0 (8)
[<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
[<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
[<c015db47>] release_pages+0x127/0x1a0 (16)
[<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
[<c01681ae>] unmap_region+0x13e/0x160 (28)
[<c0168461>] do_munmap+0xe1/0x120 (48)
[<c01684df>] sys_munmap+0x3f/0x60 (32)
[<c01034a1>] syscall_call+0x7/0xb (16)
Trying to fix it up, but a reboot is needed

I'm now building another -rt21 with DEBUG_PAGEALLOC and DEBUG_SLAB
enabled to see if I can pinpoint in more detail what's happening (if I
can get it to boot!).

-- Fernando


Attachments:
kernel-2.6.15-i686-smp.ccrma.config (59.54 kB)
dmesg.6.post (28.54 kB)

2006-03-09 00:30:50

by Fernando Lopez-Lezcano

Subject: Re: 2.6.15-rt20, "bad page state", jackd

On Wed, 2006-03-08 at 13:48 -0800, Fernando Lopez-Lezcano wrote:
> On Wed, 2006-03-08 at 11:36 -0800, Fernando Lopez-Lezcano wrote:
> > Hi all, I reported this in mid January (I thought I had sent to the list
> > but the report went to Ingo and Steven off list)
> >
> > I'm seeing the same problem in 2.6.15-rt21 in some of my machines. After
> > a reboot into the kernel I just login as root in a terminal, start the
> > jackd sound server ("jackd -d alsa -d hw") and when stopping it (just
> > doing a <ctrl>c) I get a bunch of messages of this form:
> >
> > > Trying to fix it up, but a reboot is needed
> > > Bad page state at __free_pages_ok (in process 'jackd', page c10012fc)
> >
> > Has anyone else seen this?
> >
> > I'm in the process of building an -rt21 kernel before posting more
> > detailed error messages (this kernel is patched with some additional
> > stuff).
>
> This is what the errors look like (I'm attaching the whole dmesg output
> and the .config file used to build the smp kernel):
>
> Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
> flags:0x00000414 mapping:00000000 mapcount:0 count:0
> Backtrace:
> [<c015947d>] bad_page+0x7d/0xc0 (8)
> [<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
> [<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
> [<c015db47>] release_pages+0x127/0x1a0 (16)
> [<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
> [<c01681ae>] unmap_region+0x13e/0x160 (28)
> [<c0168461>] do_munmap+0xe1/0x120 (48)
> [<c01684df>] sys_munmap+0x3f/0x60 (32)
> [<c01034a1>] syscall_call+0x7/0xb (16)
> Trying to fix it up, but a reboot is needed

[MUNCH]

> I'm now building another -rt21 with DEBUG_PAGEALLOC and DEBUG_SLAB
> enabled to see if I can pinpoint in more detail what's happening (if I
> can get it to boot!).

I'm not able to boot with those two options enabled. It looks like it is
hanging after loading the sound drivers - this is on top of fc4. I can't
even get to single user and I'm searching for a serial cable right now
to see if I can get more information in a way that can be posted here.
Arghhh :-)
-- Fernando


2006-03-09 02:19:33

by Fernando Lopez-Lezcano

Subject: Re: 2.6.15-rt20, "bad page state", jackd

On Wed, 2006-03-08 at 16:30 -0800, Fernando Lopez-Lezcano wrote:
> On Wed, 2006-03-08 at 13:48 -0800, Fernando Lopez-Lezcano wrote:
> > On Wed, 2006-03-08 at 11:36 -0800, Fernando Lopez-Lezcano wrote:
> > > Hi all, I reported this in mid January (I thought I had sent to the list
> > > but the report went to Ingo and Steven off list)
> > >
> > > I'm seeing the same problem in 2.6.15-rt21 in some of my machines. After
> > > a reboot into the kernel I just login as root in a terminal, start the
> > > jackd sound server ("jackd -d alsa -d hw") and when stopping it (just
> > > doing a <ctrl>c) I get a bunch of messages of this form:
> > >
> > > > Trying to fix it up, but a reboot is needed
> > > > Bad page state at __free_pages_ok (in process 'jackd', page c10012fc)
> > >
> > > Has anyone else seen this?
> > >
> > > I'm in the process of building an -rt21 kernel before posting more
> > > detailed error messages (this kernel is patched with some additional
> > > stuff).
> >
> > This is what the errors look like (I'm attaching the whole dmesg output
> > and the .config file used to build the smp kernel):
> >
> > Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
> > flags:0x00000414 mapping:00000000 mapcount:0 count:0
> > Backtrace:
> > [<c015947d>] bad_page+0x7d/0xc0 (8)
> > [<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
> > [<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
> > [<c015db47>] release_pages+0x127/0x1a0 (16)
> > [<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
> > [<c01681ae>] unmap_region+0x13e/0x160 (28)
> > [<c0168461>] do_munmap+0xe1/0x120 (48)
> > [<c01684df>] sys_munmap+0x3f/0x60 (32)
> > [<c01034a1>] syscall_call+0x7/0xb (16)
> > Trying to fix it up, but a reboot is needed
>
> [MUNCH]
>
> > I'm now building another -rt21 with DEBUG_PAGEALLOC and DEBUG_SLAB
> > enabled to see if I can pinpoint in more detail what's happening (if I
> > can get it to boot!).
>
> I'm not able to boot with those two options enabled. It looks like it is
> hanging after loading the sound drivers - this is on top of fc4. I can't
> even get to single user and I'm searching for a serial cable right now
> to see if I can get more information in a way that can be posted here.
> Arghhh :-)

A kind soul (thanks Carr!) made the proper null modem cable for me...

So, I'm attaching the result of <ctrl><sys>T when the machine is hung.
This is when trying to boot single user into -rt21 with the previously
posted .config and DEBUG_PAGEALLOC and DEBUG_SLAB enabled. Without those
two options the kernel boots but I get the aforementioned errors when
stopping jackd.

Let me know if I could try something else.
-- Fernando


Attachments:
dump.txt (58.75 kB)

2006-03-09 08:48:07

by Heiko Carstens

Subject: Re: 2.6.15-rt20, "bad page state", jackd

On Wed, Mar 08, 2006 at 11:36:04AM -0800, Fernando Lopez-Lezcano wrote:
> Hi all, I reported this in mid January (I thought I had sent to the list
> but the report went to Ingo and Steven off list)
>
> I'm seeing the same problem in 2.6.15-rt21 in some of my machines. After
> a reboot into the kernel I just login as root in a terminal, start the
> jackd sound server ("jackd -d alsa -d hw") and when stopping it (just
> doing a <ctrl>c) I get a bunch of messages of this form:
>
> > Trying to fix it up, but a reboot is needed
> > Bad page state at __free_pages_ok (in process 'jackd', page c10012fc)
>
> Has anyone else seen this?

Actually I have a bug report that looks quite the same. Happens on s390x
with lots of I/O stress. But that is against vanilla 2.6.16-rc4 + additional
patches. I need to ask to reproduce that with a plain vanilla kernel, so
that a git bisect search might help to figure out what is wrong.
Unfortunately it seems to take hours before we hit the bug.

<0>Bad page state in process 'blast'
<0>page:0000000000507d00 flags:0x000000060000002a mapping:00000000007570b0 mapcount:1 count:8
<0>Trying to fix it up, but a reboot is needed
<0>Backtrace:
<4>0000000006e93750 0000000000000000 0000000000773780 0700000000007c7a
<4> 0000000000000001 000000000025f878 000000000025f878 0000000000104840
<4> 0000000000000000 000000060000002a 0000000000000000 0000000000518d50
<4> 000000000000000c 0000000000000008 0000000006e936f8 0000000006e93770
<4> 000000000044e1f0 0000000000104840 0000000006e936f8 0000000006e93738
<4>Call Trace:
<4>([<0000000000104870>] dump_stack+0x2b8/0x374)
<4> [<00000000001a97de>] get_page_from_freelist+0x72e/0x8e8
<4> [<00000000001a9aa8>] __alloc_pages+0x110/0x324
<4> [<00000000001b37a0>] page_cache_readahead+0xf6c/0x11e4
<4> [<000000000019f870>] do_generic_mapping_read+0x150/0x828
<4> [<00000000001a07f4>] generic_file_aio_read+0x1f8/0x258
<4> [<00000000001fa844>] do_sync_read+0x130/0x1bc
<4> [<00000000001fc230>] sys_read+0x170/0x3b8
<4> [<000000000010fb20>] sysc_tracego+0xe/0x14
<4> [<0000020000043a84>] 0x20000043a84

Heiko

2006-03-09 09:07:46

by Nick Piggin

Subject: Re: 2.6.15-rt20, "bad page state", jackd

Heiko Carstens wrote:

> Actually I have a bug report that looks quite the same. Happens on s390x
> with lots of I/O stress. But that is against vanilla 2.6.16-rc4 + additional
> patches. I need to ask to reproduce that with a plain vanilla kernel, so
> that a git bisect search might help to figure out what is wrong.
> Unfortunately it seems to take hours before we hit the bug.
>

It is different because yours is coming from __alloc_pages, so if there
are no previous warnings then it means it wasn't corrupted before last
being freed, so it is a use-after-free bug. I'm surprised at how many
fields have been modified though (mapping, count, mapcount).

PG_error is set, which might mean it is a bug in an error handling path
that doesn't trigger very often.
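
For readers following along, the flags words in these reports can be decoded bit by bit. The sketch below is only an illustration: the bit positions are an assumption based on include/linux/page-flags.h as it looked around 2.6.15/2.6.16 and should be checked against the tree actually in use, and decode_page_flags() is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: low page-flag bit positions as in include/linux/page-flags.h
 * around 2.6.15/2.6.16. Verify against your own kernel tree. */
static const char *const pg_names[] = {
    "PG_locked", "PG_error", "PG_referenced", "PG_uptodate",
    "PG_dirty", "PG_lru", "PG_active", "PG_slab",
    "PG_checked", "PG_arch_1", "PG_reserved", "PG_private",
};

/* Hypothetical helper: write the names of the low flag bits set in `flags`
 * into `out`, separated by single spaces. Bits above the table are ignored. */
static void decode_page_flags(unsigned long long flags, char *out, size_t len)
{
    size_t n = sizeof(pg_names) / sizeof(pg_names[0]);

    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (flags & (1ULL << i)) {
            if (out[0] != '\0')
                strncat(out, " ", len - strlen(out) - 1);
            strncat(out, pg_names[i], len - strlen(out) - 1);
        }
    }
}
```

Under that assumed layout, the low bits 0x2a of the s390x report decode to PG_error, PG_uptodate and PG_lru, and the 0x00000414 from the jackd report decodes to PG_referenced, PG_dirty and PG_reserved.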

> <0>Bad page state in process 'blast'
> <0>page:0000000000507d00 flags:0x000000060000002a mapping:00000000007570b0 mapcount:1 count:8
> <0>Trying to fix it up, but a reboot is needed
> <0>Backtrace:
> <4>0000000006e93750 0000000000000000 0000000000773780 0700000000007c7a
> <4> 0000000000000001 000000000025f878 000000000025f878 0000000000104840
> <4> 0000000000000000 000000060000002a 0000000000000000 0000000000518d50
> <4> 000000000000000c 0000000000000008 0000000006e936f8 0000000006e93770
> <4> 000000000044e1f0 0000000000104840 0000000006e936f8 0000000006e93738
> <4>Call Trace:
> <4>([<0000000000104870>] dump_stack+0x2b8/0x374)
> <4> [<00000000001a97de>] get_page_from_freelist+0x72e/0x8e8
> <4> [<00000000001a9aa8>] __alloc_pages+0x110/0x324
> <4> [<00000000001b37a0>] page_cache_readahead+0xf6c/0x11e4
> <4> [<000000000019f870>] do_generic_mapping_read+0x150/0x828
> <4> [<00000000001a07f4>] generic_file_aio_read+0x1f8/0x258
> <4> [<00000000001fa844>] do_sync_read+0x130/0x1bc
> <4> [<00000000001fc230>] sys_read+0x170/0x3b8
> <4> [<000000000010fb20>] sysc_tracego+0xe/0x14
> <4> [<0000020000043a84>] 0x20000043a84
>
> Heiko

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 21:08:34

by Fernando Lopez-Lezcano

Subject: Re: 2.6.15-rt20, "bad page state", jackd

On Thu, 2006-03-09 at 09:47 +0100, Heiko Carstens wrote:
> On Wed, Mar 08, 2006 at 11:36:04AM -0800, Fernando Lopez-Lezcano wrote:
> > Hi all, I reported this in mid January (I thought I had sent to the list
> > but the report went to Ingo and Steven off list)
> >
> > I'm seeing the same problem in 2.6.15-rt21 in some of my machines. After
> > a reboot into the kernel I just login as root in a terminal, start the
> > jackd sound server ("jackd -d alsa -d hw") and when stopping it (just
> > doing a <ctrl>c) I get a bunch of messages of this form:
> >
> > > Trying to fix it up, but a reboot is needed
> > > Bad page state at __free_pages_ok (in process 'jackd', page c10012fc)
> >
> > Has anyone else seen this?
>
> Actually I have a bug report that looks quite the same. Happens on s390x
> with lots of I/O stress. But that is against vanilla 2.6.16-rc4 + additional
> patches. I need to ask to reproduce that with a plain vanilla kernel, so
> that a git bisect search might help to figure out what is wrong.
> Unfortunately it seems to take hours before we hit the bug.

In my case it is completely repeatable.
Boot, start jackd, stop jackd -> problem appears.

This does not happen on all computers so it would seem to me it is
related to the sound drivers. I'll try to see if there is a correlation
with the sound card being used.

Is there anything else I could do to try to help resolve this?
-- Fernando


2006-03-09 22:57:39

by Nick Piggin

Subject: Re: 2.6.15-rt20, "bad page state", jackd

Fernando Lopez-Lezcano wrote:

> In my case it is completely repeatable.
> Boot, start jackd, stop jackd -> problem appears.
>
> This does not happen on all computers so it would seem to me it is
> related to the sound drivers. I'll try to see if there is a correlation
> with the sound card being used.
>
> Is there anything else I could do to try to help resolve this?

Can you test with the latest mainline -git snapshot, or is it only
the -rt tree that causes the warnings?

--
SUSE Labs, Novell Inc.

2006-03-10 02:48:19

by Fernando Lopez-Lezcano

Subject: Re: 2.6.15-rt20, "bad page state", jackd

On Fri, 2006-03-10 at 09:57 +1100, Nick Piggin wrote:
> Fernando Lopez-Lezcano wrote:
>
> > In my case it is completely repeatable.
> > Boot, start jackd, stop jackd -> problem appears.
> >
> > This does not happen on all computers so it would seem to me it is
> > related to the sound drivers. I'll try to see if there is a correlation
> > with the sound card being used.
> >
> > Is there anything else I could do to try to help resolve this?
>
> Can you test with the latest mainline -git snapshot, or is it only
> the -rt tree that causes the warnings?

I found something strange although I don't know why it happens yet:

Fedora Core 4 kernel (2.6.15 + patches) works fine.
Fedora Core 4 kernel + -rt21, [ahem... sorry], works fine.
Fedora Core 4 kernel + -rt21 + alsa kernel modules from 1.0.10 or
1.0.11rc3, fails[*]
Plain vanilla 2.6.15 + -rt21, works fine
Plain vanilla 2.6.15 + -rt21 + alsa kernel modules from 1.0.10 or
1.0.11rc3, fails[*]

So, it looks like it is some weird interaction between kernel modules
that were not compiled as part of the kernel and the kernel itself. The
"updated" modules are installed in a separate location (not on top of
the built in kernel modules) and are found before the ones in the kernel
tree.

I have been building this combination for a long long time with no
problems, I don't know what might have happened that changed things.

Could be:
- configuration problems?
- the alsa tree is somehow incompatible with the kernel alsa tree, is
that even possible?

I have _no_ idea on what to start looking for... but, oh well, this is a
start. Suggestions welcome. Thanks for somehow pointing me in the right
direction.

-- Fernando

[*] I need that because of cards not yet included in the kernel proper,
and fixes that have not yet percolated to the kernel alsa tree.


2006-03-10 05:08:17

by Nick Piggin

Subject: Re: 2.6.15-rt20, "bad page state", jackd

Fernando Lopez-Lezcano wrote:
> On Fri, 2006-03-10 at 09:57 +1100, Nick Piggin wrote:
>
>>Fernando Lopez-Lezcano wrote:

>>Can you test with the latest mainline -git snapshot, or is it only
>>the -rt tree that causes the warnings?
>
>
> I found something strange although I don't know why it happens yet:
>
> Fedora Core 4 kernel (2.6.15 + patches) works fine.
> Fedora Core 4 kernel + -rt21, [ahem... sorry], works fine.
> Fedora Core 4 kernel + -rt21 + alsa kernel modules from 1.0.10 or
> 1.0.11rc3, fails[*]
> Plain vanilla 2.6.15 + -rt21, works fine
> Plain vanilla 2.6.15 + -rt21 + alsa kernel modules from 1.0.10 or
> 1.0.11rc3, fails[*]
>
> So, it looks like it is some weird interaction between kernel modules
> that were not compiled as part of the kernel and the kernel itself. The
> "updated" modules are installed in a separate location (not on top of
> the built in kernel modules) and are found before the ones in the kernel
> tree.
>
> I have been building this combination for a long long time with no
> problems, I don't know what might have happened that changed things.
>
> Could be:
> - configuration problems?

No. It shouldn't do this even if there is a configuration problem.

> - the alsa tree is somehow incompatible with the kernel alsa tree, is
> that even possible?
>

Yes. Most likely this. It should be fixed before the new ALSA code is
pushed upstream.

It is probably not so much a matter of somebody breaking the ALSA code
as that it hasn't been updated for the new kernel refcounting rules.

--
SUSE Labs, Novell Inc.

2006-03-10 18:50:48

by Fernando Lopez-Lezcano

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd

On Fri, 2006-03-10 at 16:08 +1100, Nick Piggin wrote:
> Fernando Lopez-Lezcano wrote:
> > On Fri, 2006-03-10 at 09:57 +1100, Nick Piggin wrote:
> >>Fernando Lopez-Lezcano wrote:
> >>Can you test with the latest mainline -git snapshot, or is it only
> >>the -rt tree that causes the warnings?
> >
> > I found something strange although I don't know why it happens yet:
> >
> > Fedora Core 4 kernel (2.6.15 + patches) works fine.
> > Fedora Core 4 kernel + -rt21, [ahem... sorry], works fine.
> > Fedora Core 4 kernel + -rt21 + alsa kernel modules from 1.0.10 or
> > 1.0.11rc3, fails[*]
> > Plain vanilla 2.6.15 + -rt21, works fine
> > Plain vanilla 2.6.15 + -rt21 + alsa kernel modules from 1.0.10 or
> > 1.0.11rc3, fails[*]
> >
> > So, it looks like it is some weird interaction between kernel modules
> > that were not compiled as part of the kernel and the kernel itself. The
> > "updated" modules are installed in a separate location (not on top of
> > the built in kernel modules) and are found before the ones in the kernel
> > tree.
> >
> > I have been building this combination for a long long time with no
> > problems, I don't know what might have happened that changed things.
> >
> > Could be:
> > - configuration problems?
>
> No. It shouldn't do this even if there is a configuration problem.
>
> > - the alsa tree is somehow incompatible with the kernel alsa tree, is
> > that even possible?
>
> Yes. Most likely this. It should be fixed before the new ALSA code is
> pushed upstream.
>
> It is probably not so much a matter of somebody breaking the ALSA code
> as that it hasn't been updated for the new kernel refcounting rules.

Takashi and other gurus in alsa-devel, any comments on this? The
original problem - not quoted in this email - is that when I stop jackd
in the affected configurations I get errors similar to this one:

> Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
> flags:0x00000414 mapping:00000000 mapcount:0 count:0
> Backtrace:
> [<c015947d>] bad_page+0x7d/0xc0 (8)
> [<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
> [<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
> [<c015db47>] release_pages+0x127/0x1a0 (16)
> [<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
> [<c01681ae>] unmap_region+0x13e/0x160 (28)
> [<c0168461>] do_munmap+0xe1/0x120 (48)
> [<c01684df>] sys_munmap+0x3f/0x60 (32)
> [<c01034a1>] syscall_call+0x7/0xb (16)
> Trying to fix it up, but a reboot is needed

One other thing occurred to me (not tested yet)

- userspace regression in the module load code (so that in the end
modules from the in kernel tree get mixed with modules coming from the
externally compiled alsa tree). Very unlikely, I think, I could test for
this by removing the in kernel modules temporarily.

I have problems in both:
snd-ice1712 (midiman delta 66)
snd-hdsp (rme hdsp)
but this seems to work fine:
snd-echo3g (gina3g)

The interesting thing is that the one that works (snd-echo3g) has no
counterpart in the in kernel alsa tree - that is, only exists in the
add-on modules compiled externally. Coincidence?

-- Fernando


2006-03-10 23:10:42

by Fernando Lopez-Lezcano

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd

On Fri, 2006-03-10 at 10:50 -0800, Fernando Lopez-Lezcano wrote:
> On Fri, 2006-03-10 at 16:08 +1100, Nick Piggin wrote:
> > Fernando Lopez-Lezcano wrote:
> > > On Fri, 2006-03-10 at 09:57 +1100, Nick Piggin wrote:
> > >>Fernando Lopez-Lezcano wrote:
> > >>Can you test with the latest mainline -git snapshot, or is it only
> > >>the -rt tree that causes the warnings?
> > >
> > > I found something strange although I don't know why it happens yet:
> > >
> > > Fedora Core 4 kernel (2.6.15 + patches) works fine.
> > > Fedora Core 4 kernel + -rt21, [ahem... sorry], works fine.
> > > Fedora Core 4 kernel + -rt21 + alsa kernel modules from 1.0.10 or
> > > 1.0.11rc3, fails[*]
> > > Plain vanilla 2.6.15 + -rt21, works fine
> > > Plain vanilla 2.6.15 + -rt21 + alsa kernel modules from 1.0.10 or
> > > 1.0.11rc3, fails[*]
> > >
> > > So, it looks like it is some weird interaction between kernel modules
> > > that were not compiled as part of the kernel and the kernel itself. The
> > > "updated" modules are installed in a separate location (not on top of
> > > the built in kernel modules) and are found before the ones in the kernel
> > > tree.
> > >
> > > I have been building this combination for a long long time with no
> > > problems, I don't know what might have happened that changed things.
> > >
> > > Could be:
> > > - configuration problems?
> >
> > No. It shouldn't do this even if there is a configuration problem.
> >
> > > - the alsa tree is somehow incompatible with the kernel alsa tree, is
> > > that even possible?
> >
> > Yes. Most likely this. It should be fixed before the new ALSA code is
> > pushed upstream.
> >
> > It is probably not so much a matter of somebody breaking the ALSA code
> > as that it hasn't been updated for the new kernel refcounting rules.
>
> Takashi and other gurus in alsa-devel, any comments on this? The
> original problem - not quoted in this email - is that when I stop jackd
> in the affected configurations I get errors similar to this one:
>
> > Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
> > flags:0x00000414 mapping:00000000 mapcount:0 count:0
> > Backtrace:
> > [<c015947d>] bad_page+0x7d/0xc0 (8)
> > [<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
> > [<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
> > [<c015db47>] release_pages+0x127/0x1a0 (16)
> > [<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
> > [<c01681ae>] unmap_region+0x13e/0x160 (28)
> > [<c0168461>] do_munmap+0xe1/0x120 (48)
> > [<c01684df>] sys_munmap+0x3f/0x60 (32)
> > [<c01034a1>] syscall_call+0x7/0xb (16)
> > Trying to fix it up, but a reboot is needed
>
> One other thing occurred to me (not tested yet)
>
> - userspace regression in the module load code (so that in the end
> modules from the in kernel tree get mixed with modules coming from the
> externally compiled alsa tree). Very unlikely, I think, I could test for
> this by removing the in kernel modules temporarily.

I just tested this and no, it is not the problem. I removed all
in-kernel modules that started with snd-* and reloaded alsa (making sure
that nothing remained loaded from the previous drivers): same problem.
It really starts to look like it is an incompatibility between the
current alsa tree (outside of the kernel) and the current kernels.

-- Fernando


2006-03-11 00:01:31

by Nick Piggin

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd

Fernando Lopez-Lezcano wrote:

> Takashi and other gurus in alsa-devel, any comments on this? The
> original problem - not quoted in this email - is that when I stop jackd
> in the affected configurations I get errors similar to this one:
>
>
>>Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
>>flags:0x00000414 mapping:00000000 mapcount:0 count:0
>>Backtrace:
>> [<c015947d>] bad_page+0x7d/0xc0 (8)
>> [<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
>> [<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
>> [<c015db47>] release_pages+0x127/0x1a0 (16)
>> [<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
>> [<c01681ae>] unmap_region+0x13e/0x160 (28)
>> [<c0168461>] do_munmap+0xe1/0x120 (48)
>> [<c01684df>] sys_munmap+0x3f/0x60 (32)
>> [<c01034a1>] syscall_call+0x7/0xb (16)
>>Trying to fix it up, but a reboot is needed
>

FWIW, this is a PageReserved page being freed. PageReserved no longer does
anything, and you instead need to ensure the page count is incremented
in your ->nopage handler (i.e. via get_page()).
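
The refcounting rule described here can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not kernel code: struct toy_page and every toy_* function are hypothetical stand-ins, and the model only assumes that unmapping drops one reference on each page it was handed.

```c
#include <stdbool.h>

/* Toy stand-in for struct page: only the reference count matters here. */
struct toy_page {
    int count;   /* references currently held */
    bool freed;  /* set once the count drops to zero */
};

static void toy_get_page(struct toy_page *p) { p->count++; }

static void toy_put_page(struct toy_page *p)
{
    if (--p->count == 0)
        p->freed = true;
}

/* Model of a fault handler handing a driver-owned page to user space.
 * take_ref selects whether the handler pins the page before returning it. */
static struct toy_page *toy_nopage(struct toy_page *p, bool take_ref)
{
    if (take_ref)
        toy_get_page(p);
    return p;
}

/* Model of munmap: drops the reference the mapping is assumed to hold. */
static void toy_unmap(struct toy_page *p) { toy_put_page(p); }

/* Returns true if the driver's buffer page is still alive after one
 * map/unmap cycle. The driver itself starts out holding one reference. */
static bool buffer_survives_unmap(bool take_ref)
{
    struct toy_page page = { .count = 1, .freed = false };

    toy_unmap(toy_nopage(&page, take_ref));
    return !page.freed;
}
```

In this model buffer_survives_unmap(true) holds, while buffer_survives_unmap(false) lets the unmap path free a page the driver still owns, which is the kind of situation the bad page check complains about.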

>
> One other thing occurred to me (not tested yet)
>
> - userspace regression in the module load code (so that in the end
> modules from the in kernel tree get mixed with modules coming from the
> externally compiled alsa tree). Very unlikely, I think, I could test for
> this by removing the in kernel modules temporarily.
>
> I have problems in both:
> snd-ice1712 (midiman delta 66)
> snd-hdsp (rme hdsp)
> but this seems to work fine:
> snd-echo3g (gina3g)
>
> The interesting thing is that the one that works (snd-echo3g) has no
> counterpart in the in kernel alsa tree - that is, only exists in the
> add-on modules compiled externally. Coincidence?
>
> -- Fernando
>
>
>


--
SUSE Labs, Novell Inc.

2006-03-13 00:50:16

by Fernando Lopez-Lezcano

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

On Sat, 2006-03-11 at 11:01 +1100, Nick Piggin wrote:
> Fernando Lopez-Lezcano wrote:
>
> > Takashi and other gurus in alsa-devel, any comments on this? The
> > original problem - not quoted in this email - is that when I stop jackd
> > in the affected configurations I get errors similar to this one:
> >
> >
> >>Bad page state at __free_pages_ok (in process 'jackd', page c1013ce0)
> >>flags:0x00000414 mapping:00000000 mapcount:0 count:0
> >>Backtrace:
> >> [<c015947d>] bad_page+0x7d/0xc0 (8)
> >> [<c01598fd>] __free_pages_ok+0x9d/0x180 (36)
> >> [<c015a5ac>] __pagevec_free+0x3c/0x50 (40)
> >> [<c015db47>] release_pages+0x127/0x1a0 (16)
> >> [<c016c93d>] free_pages_and_swap_cache+0x7d/0xc0 (80)
> >> [<c01681ae>] unmap_region+0x13e/0x160 (28)
> >> [<c0168461>] do_munmap+0xe1/0x120 (48)
> >> [<c01684df>] sys_munmap+0x3f/0x60 (32)
> >> [<c01034a1>] syscall_call+0x7/0xb (16)
> >>Trying to fix it up, but a reboot is needed
> >
>
> FWIW, this is a PageReserved page being freed. PageReserved no longer does
> anything, and you instead need to ensure the page count is incremented
> in your ->nopage handler (i.e. via get_page()).
>
> >
> > One other thing occurred to me (not tested yet)
> >
> > - userspace regression in the module load code (so that in the end
> > modules from the in kernel tree get mixed with modules coming from the
> > externally compiled alsa tree). Very unlikely, I think, I could test for
> > this by removing the in kernel modules temporarily.
> >
> > I have problems in both:
> > snd-ice1712 (midiman delta 66)
> > snd-hdsp (rme hdsp)
> > but this seems to work fine:
> > snd-echo3g (gina3g)
> >
> > The interesting thing is that the one that works (snd-echo3g) has no
> > counterpart in the in kernel alsa tree - that is, only exists in the
> > add-on modules compiled externally. Coincidence?

Well, I found it. Finally. I diffed memalloc.c in the alsa kernel tree
with alsa stable 1.0.10 and googled for the obvious two chunks that
stood out :-)

It's apparently an old issue, see here and follow the thread:
http://lkml.org/lkml/2005/11/21/9
So, 1.0.10 obviously did not include these two patches:

========
--- linux-2.6.15-old/acore/memalloc.c 2006-03-10 15:13:36.636282832 -0800
+++ linux-2.6.15/sound/core/memalloc.c 2006-01-02 19:21:10.000000000 -0800
@@ -267,6 +197,7 @@
 
 	snd_assert(size > 0, return NULL);
 	snd_assert(gfp_flags != 0, return NULL);
+	gfp_flags |= __GFP_COMP; /* compound page lets parts be mapped */
 	pg = get_order(size);
 	if ((res = (void *) __get_free_pages(gfp_flags, pg)) != NULL) {
 		mark_pages(virt_to_page(res), pg);
@@ -311,6 +242,7 @@
 	snd_assert(dma != NULL, return NULL);
 	pg = get_order(size);
 	gfp_flags = GFP_KERNEL
+		| __GFP_COMP /* compound page lets parts be mapped */
 		| __GFP_NORETRY /* don't trigger OOM-killer */
 		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
 	res = dma_alloc_coherent(dev, PAGE_SIZE << pg, dma, gfp_flags);
========

With this in a short remote test 1.0.10 on top of 2.6.15-rt21 does not
generate the bad page messages I originally reported. Woohoo!
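
Both patched allocations are sized with get_order(size), so any buffer larger than one page becomes a multi-page (order > 0) block, which is where the "compound page lets parts be mapped" comment applies. A rough user-space sketch of that size-to-order calculation follows; it assumes 4 KiB pages, and toy_get_order() is a hypothetical stand-in rather than the kernel's own get_order().

```c
#include <stddef.h>

#define TOY_PAGE_SHIFT 12  /* ASSUMPTION: 4 KiB pages */
#define TOY_PAGE_SIZE  ((size_t)1 << TOY_PAGE_SHIFT)

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size.
 * The caller must pass size > 0, as the snd_assert() in the patch requires. */
static int toy_get_order(size_t size)
{
    int order = 0;
    size_t pages = (size - 1) >> TOY_PAGE_SHIFT;  /* page count minus one */

    while (pages > 0) {
        order++;
        pages >>= 1;
    }
    return order;
}
```

With these assumptions a 4096-byte request stays at order 0, 4097 bytes already needs order 1 (two pages), and a 64 KiB buffer is an order-4 block of sixteen pages.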

And 1.0.11rc3 apparently only includes one of the two patches.

]# find . -type f -exec grep GFP_COMP {} \; -print
#ifndef __GFP_COMP
#define __GFP_COMP 0
./include/adriver.h
gfp_flags |= __GFP_COMP; /* compound page lets parts be mapped */
| __GFP_COMP /* compound page lets parts be mapped */
./alsa-kernel/core/memalloc.c
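
The adriver.h hit above is a compatibility fallback: where the headers do not provide __GFP_COMP, it is defined as 0, so OR-ing it into the allocation mask changes nothing. A minimal sketch of that pattern, compiled here outside the kernel where the macro is indeed missing (toy_with_comp() is a hypothetical helper, not part of ALSA):

```c
/* Fallback in the style of the adriver.h lines found by the grep:
 * if the headers do not define __GFP_COMP, make it a no-op value. */
#ifndef __GFP_COMP
#define __GFP_COMP 0
#endif

/* Hypothetical helper: add the compound-page flag to an allocation mask. */
static unsigned int toy_with_comp(unsigned int gfp_flags)
{
    return gfp_flags | __GFP_COMP;
}
```

Built in user space, toy_with_comp() returns its argument unchanged, which shows why the same source still compiles against headers that lack the flag; whether the flag takes effect depends on the kernel headers the modules are built against.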

I'll test 1.0.11rc3 asap to confirm whether adding the missing bit makes
a difference or not. I think I was getting the exact same error on
1.0.11rc3 but I have to make sure.

Thanks for all the help!
-- Fernando


2006-03-13 02:42:14

by Nick Piggin

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

Fernando Lopez-Lezcano wrote:

> Well, I found it. Finally. I diffed memalloc.c in the alsa kernel tree
> with alsa stable 1.0.10 and googled for the obvious two chunks that
> stood out :-)
>

Well, good work on tracking it down. I guess you should forward
your patch to the ALSA guys.

[snip]

Nick

--
SUSE Labs, Novell Inc.

2006-03-13 03:27:13

by Fernando Lopez-Lezcano

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

On Mon, 2006-03-13 at 13:42 +1100, Nick Piggin wrote:
> Fernando Lopez-Lezcano wrote:
>
> > Well, I found it. Finally. I diffed memalloc.c in the alsa kernel tree
> > with alsa stable 1.0.10 and googled for the obvious two chunks that
> > stood out :-)
> >
>
> Well, good work on tracking it down. I guess you should forward
> your patch to the ALSA guys.

It fixes 1.0.10 with recent kernels but I guess 1.0.10 is old so maybe
it will not get patched (just a guess) - what would that be, 1.0.10a?
1.0.11rc3 did not trigger the problem in a quick test but I could swear
it did before, I'll have to retest again tomorrow (maybe it was
happening with a different card).

-- Fernando


2006-03-13 03:53:45

by Lee Revell

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

On Sun, 2006-03-12 at 19:39 -0800, Fernando Lopez-Lezcano wrote:
> On Sun, 2006-03-12 at 22:31 -0500, Lee Revell wrote:
>
> > Older ALSA with a newer kernel has never been supported. Why would you
> > want to replace the ALSA in the kernel with an old version?
>
> Because it is not an older version?
> "cat /proc/asound/version" for the 2.6.15 in kernel tree prints this:
> Advanced Linux Sound Architecture Driver Version 1.0.10rc3
> That should be older than 1.0.10 final.

Ah, sorry. Then you're right, this patch must have slipped through the
cracks.

> (plus 1.0.10 has drivers that are not yet in the kernel tree AFAIK)

Yeah I never liked this practice, I think all ALSA drivers should be in
the kernel. IMHO an immature driver is better than no driver.

Lee

2006-03-13 03:32:01

by Lee Revell

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

On Sun, 2006-03-12 at 19:26 -0800, Fernando Lopez-Lezcano wrote:
> On Mon, 2006-03-13 at 13:42 +1100, Nick Piggin wrote:
> > Fernando Lopez-Lezcano wrote:
> >
> > > Well, I found it. Finally. I diffed memalloc.c in the alsa kernel tree
> > > with alsa stable 1.0.10 and googled for the obvious two chunks that
> > > stood out :-)
> > >
> >
> > Well, good work on tracking it down. I guess you should forward
> > your patch to the ALSA guys.
>
> It fixes 1.0.10 with recent kernels but I guess 1.0.10 is old so maybe
> it will not get patched (just a guess) - what would that be, 1.0.10a?
> 1.0.11rc3 did not trigger the problem in a quick test but I could swear
> it did before, I'll have to retest again tomorrow (maybe it was
> happening with a different card).

Older ALSA with a newer kernel has never been supported. Why would you
want to replace the ALSA in the kernel with an old version?

Lee

2006-03-13 03:39:31

by Fernando Lopez-Lezcano

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

On Sun, 2006-03-12 at 22:31 -0500, Lee Revell wrote:
> On Sun, 2006-03-12 at 19:26 -0800, Fernando Lopez-Lezcano wrote:
> > On Mon, 2006-03-13 at 13:42 +1100, Nick Piggin wrote:
> > > Fernando Lopez-Lezcano wrote:
> > >
> > > > Well, I found it. Finally. I diffed memalloc.c in the alsa kernel tree
> > > > with alsa stable 1.0.10 and googled for the obvious two chunks that
> > > > stood out :-)
> > > >
> > >
> > > Well, good work on tracking it down. I guess you should forward
> > > your patch to the ALSA guys.
> >
> > It fixes 1.0.10 with recent kernels but I guess 1.0.10 is old so maybe
> > it will not get patched (just a guess) - what would that be, 1.0.10a?
> > 1.0.11rc3 did not trigger the problem in a quick test but I could swear
> > it did before, I'll have to retest again tomorrow (maybe it was
> > happening with a different card).
>
> Older ALSA with a newer kernel has never been supported. Why would you
> want to replace the ALSA in the kernel with an old version?

Because it is not an older version?
"cat /proc/asound/version" for the 2.6.15 in kernel tree prints this:
Advanced Linux Sound Architecture Driver Version 1.0.10rc3
That should be older than 1.0.10 final.
(plus 1.0.10 has drivers that are not yet in the kernel tree AFAIK)

-- Fernando


2006-03-13 09:28:35

by Ingo Molnar

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)


* Fernando Lopez-Lezcano <[email protected]> wrote:

> Well, I found it. Finally. I diffed memalloc.c in the alsa kernel tree
> with alsa stable 1.0.10 and googled for the obvious two chunks that
> stood out :-)
>
> It's apparently an old issue, see here and follow the thread:
> http://lkml.org/lkml/2005/11/21/9
> So, 1.0.10 obviously did not include these two patches:

ah, great detective work! FYI, current upstream has the fix included, so
2.6.16-rc6-rt2 should not have this particular problem.

Ingo

2006-03-13 11:05:27

by Takashi Iwai

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

At Sun, 12 Mar 2006 22:53:41 -0500,
Lee Revell wrote:
>
> On Sun, 2006-03-12 at 19:39 -0800, Fernando Lopez-Lezcano wrote:
> > On Sun, 2006-03-12 at 22:31 -0500, Lee Revell wrote:
> >
> > > Older ALSA with a newer kernel has never been supported. Why would you
> > > want to replace the ALSA in the kernel with an old version?
> >
> > Because it is not an older version?
> > "cat /proc/asound/version" for the 2.6.15 in kernel tree prints this:
> > Advanced Linux Sound Architecture Driver Version 1.0.10rc3
> > That should be older than 1.0.10 final.
>
> Ah, sorry. Then you're right, this patch must have slipped through the
> cracks.

Well, ALSA 1.0.10-final was already released last November,
i.e. before 2.6.15. When 2.6.15 was released, we had ALSA 1.0.11rc2.

> > (plus 1.0.10 has drivers that are not yet in the kernel tree AFAIK)
>
> Yeah I never liked this practice, I think all ALSA drivers should be in
> the kernel. IMHO an immature driver is better than no driver.

These drivers are either ones that are pretty experimental or broken,
or ones that are not confirmed to work with the latest 2.6
kernel. In the latter case, I can push them to the kernel tree at any
time once someone tests them and reports back to me.


Takashi

2006-03-13 17:34:18

by Fernando Lopez-Lezcano

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

On Mon, 2006-03-13 at 12:05 +0100, Takashi Iwai wrote:
> At Sun, 12 Mar 2006 22:53:41 -0500,
> Lee Revell wrote:
> > On Sun, 2006-03-12 at 19:39 -0800, Fernando Lopez-Lezcano wrote:
> > > On Sun, 2006-03-12 at 22:31 -0500, Lee Revell wrote:
> > >
> > > > Older ALSA with a newer kernel has never been supported. Why would you
> > > > want to replace the ALSA in the kernel with an old version?
> > >
> > > Because it is not an older version?
> > > "cat /proc/asound/version" for the 2.6.15 in kernel tree prints this:
> > > Advanced Linux Sound Architecture Driver Version 1.0.10rc3
> > > That should be older than 1.0.10 final.
> >
> > Ah, sorry. Then you're right, this patch must have slipped through the
> > cracks.
>
> Well, ALSA 1.0.10-final was already released last November,
> i.e. before 2.6.15. When 2.6.15 was released, we had ALSA 1.0.11rc2.

I understand. Still, 2.6.15 has 1.0.10rc3 and current alsa "stable" does
not work out of the box with it (at least for some of the cards and in
my tests - hmmm, maybe this only happens when running with the -rt
patches?).

There's one additional tiny patch needed in alsa 1.0.10 if you want
snd-rtctimer to be detected by configure and subsequently built under
2.6.15+:

========
--- alsa-driver-1.0.10/configure~ 2005-11-16 09:41:17.000000000 -0500
+++ alsa-driver-1.0.10/configure 2006-03-06 20:48:03.152744160 -0500
@@ -8260,7 +8260,7 @@
echo "$as_me:$LINENO: checking for RTC callback support in kernel" >&5
echo $ECHO_N "checking for RTC callback support in kernel... $ECHO_C" >&6
rtcsup=""
-if test "$kversion.$kpatchlevel" = "2.6" -a "$kpatchlevel" -ge 15; then
+if test "$kversion.$kpatchlevel" = "2.6" -a "$ksublevel" -ge 15; then
ac_save_CFLAGS="$CFLAGS"
ac_save_CC=$CC
CFLAGS="$KERNEL_CHECK_CFLAGS"
========
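
For illustration, the effect of that one-word change can be reproduced outside of configure. This is only a sketch: split_kernel_version, rtc_check_old and rtc_check_new are hypothetical helpers, and the way the three variables are derived from the release string is an assumption, not copied from the real configure script. The two checks use "[ ... ] && [ ... ]", which behaves the same as the "test ... -a ..." form in the patch.

```shell
#!/bin/sh
# Split a kernel release such as "2.6.15-rt21" into three numeric parts.
# The variable names mirror the ones used in the configure hunk; the
# parsing itself is an assumption made for this sketch.
split_kernel_version() {
    release=${1%%-*}            # drop a suffix such as "-rt21"
    kversion=${release%%.*}     # "2"
    rest=${release#*.}          # "6.15"
    kpatchlevel=${rest%%.*}     # "6"
    ksublevel=${rest#*.}        # "15"
    ksublevel=${ksublevel%%.*}  # tolerate a fourth component, e.g. 2.6.15.4
}

# Unpatched check: compares kpatchlevel (6 on any 2.6 kernel) against 15,
# so it can never succeed.
rtc_check_old() {
    split_kernel_version "$1"
    [ "$kversion.$kpatchlevel" = "2.6" ] && [ "$kpatchlevel" -ge 15 ]
}

# Patched check: compares ksublevel against 15 instead.
rtc_check_new() {
    split_kernel_version "$1"
    [ "$kversion.$kpatchlevel" = "2.6" ] && [ "$ksublevel" -ge 15 ]
}

for rel in 2.6.14 2.6.15-rt21 2.6.16; do
    if rtc_check_old "$rel"; then old=yes; else old=no; fi
    if rtc_check_new "$rel"; then new=yes; else new=no; fi
    echo "$rel old=$old new=$new"
done
```

With the unpatched comparison the RTC callback check is skipped on every 2.6 kernel, which matches snd-rtctimer not being detected by configure; with the patched comparison it runs on 2.6.15 and later.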

-- Fernando


2006-03-13 17:41:13

by Takashi Iwai

Subject: Re: [Alsa-devel] Re: 2.6.15-rt20, "bad page state", jackd (alsa 1.0.10 vs. recent kernels)

At Mon, 13 Mar 2006 09:33:50 -0800,
Fernando Lopez-Lezcano wrote:
>
> On Mon, 2006-03-13 at 12:05 +0100, Takashi Iwai wrote:
> > At Sun, 12 Mar 2006 22:53:41 -0500,
> > Lee Revell wrote:
> > > On Sun, 2006-03-12 at 19:39 -0800, Fernando Lopez-Lezcano wrote:
> > > > On Sun, 2006-03-12 at 22:31 -0500, Lee Revell wrote:
> > > >
> > > > > Older ALSA with a newer kernel has never been supported. Why would you
> > > > > want to replace the ALSA in the kernel with an old version?
> > > >
> > > > Because it is not an older version?
> > > > "cat /proc/asound/version" for the 2.6.15 in kernel tree prints this:
> > > > Advanced Linux Sound Architecture Driver Version 1.0.10rc3
> > > > That should be older than 1.0.10 final.
> > >
> > > Ah, sorry. Then you're right, this patch must have slipped through the
> > > cracks.
> >
> > Well, ALSA 1.0.10-final was already released last November,
> > i.e. before 2.6.15. When 2.6.15 was released, we had ALSA 1.0.11rc2.
>
> I understand. Still, 2.6.15 has 1.0.10rc3 and current alsa "stable" does
> not work out of the box with it (at least for some of the cards and in
> my tests - hmmm, maybe this only happens when running with the -rt
> patches?).

I hoped to release 1.0.11-final much earlier to make it available for
the 2.6.15 kernel, but it has been delayed quite a lot so far.
The current release cycle is really, really bad...


Takashi