2006-02-10 02:10:24

by Jonathan Woithe

[permalink] [raw]
Subject: 2.6.15-rt16: possible sound-related side-effect

Hi all

In the last week I have updated the kernel on our laptop to 2.6.15-rt16.
By and large this worked well and had the attractive side effect of making
the clock run at the correct speed once more.

During development of an ALSA patch I had the need to remove and reinsert
the hda-intel and hda-codec modules on numerous occasions. Every so often
(perhaps once every 5 or 6 times on average) the initialisation sequence of
hda-intel would get hung up and the associated insmod would never return. A
reboot was required to clear the problem. The following messages were
written to syslog repeatedly and often:

Feb 5 21:36:24: ALSA sound/pci/hda/hda_intel.c:511: azx_get_response timeout
Feb 5 21:36:26 halite last message repeated 9 times
Feb 5 21:36:29 halite kernel: printk: 31 messages suppressed.

I have noticed the "azx_get_response timeout" messages in earlier kernels
as well, but up until now the hda initialisation hasn't gotten hung up.

The latching up of the hda-intel initialisation does not appear to occur
when doing the same thing under a non-RT 2.6.15 kernel. Furthermore, I have
had an instance where the lockup occured while cold-booting an unmodified
2.6.15-rt16, which rules out any changes I made to ALSA as the cause of the
problem. In any case the changes I was making to ALSA don't affect the
initialisation code.

Prior to this kernel I was running an unmodified 2.6.14-rt21 kernel and
while these messages did occur they didn't cause hda-intel to lock up.

Any suggestions as to what might be causing this and/or of further tests
which might help narrow down the cause?

Regards
jonathan


2006-02-10 08:05:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect



On Fri, 10 Feb 2006, Jonathan Woithe wrote:

> Hi all
>
> In the last week I have updated the kernel on our laptop to 2.6.15-rt16.
> By and large this worked well and had the attractive side effect of making
> the clock run at the correct speed once more.
>
> During development of an ALSA patch I had the need to remove and reinsert
> the hda-intel and hda-codec modules on numerous occasions. Every so often
> (perhaps once every 5 or 6 times on average) the initialisation sequence of
> hda-intel would get hung up and the associated insmod would never return. A
> reboot was required to clear the problem. The following messages were
> written to syslog repeatedly and often:
>
> Feb 5 21:36:24: ALSA sound/pci/hda/hda_intel.c:511: azx_get_response timeout
> Feb 5 21:36:26 halite last message repeated 9 times
> Feb 5 21:36:29 halite kernel: printk: 31 messages suppressed.
>
> I have noticed the "azx_get_response timeout" messages in earlier kernels
> as well, but up until now the hda initialisation hasn't gotten hung up.
>
> The latching up of the hda-intel initialisation does not appear to occur
> when doing the same thing under a non-RT 2.6.15 kernel. Furthermore, I have
> had an instance where the lockup occured while cold-booting an unmodified
> 2.6.15-rt16, which rules out any changes I made to ALSA as the cause of the
> problem. In any case the changes I was making to ALSA don't affect the
> initialisation code.
>
> Prior to this kernel I was running an unmodified 2.6.14-rt21 kernel and
> while these messages did occur they didn't cause hda-intel to lock up.
>
> Any suggestions as to what might be causing this and/or of further tests
> which might help narrow down the cause?

Could you turn on nmi_watchdog as well as softlockup_detect.

nmi_watchedog:
make sure CONFIG_X86_LOCAL_APIC is set, and then pass in the command
line "nmi_watchdog=2 lapic" (lapic may or may not be needed, but should
not hurt to include it).

softlockup_detect:
set CONFIG_DETECT_SOFTLOCKUP.

these may point out better what is locked up.

-- Steve

2006-02-15 23:37:34

by Jonathan Woithe

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect

Hi Steve

> > In the last week I have updated the kernel on our laptop to 2.6.15-rt16.
> > By and large this worked well and had the attractive side effect of making
> > the clock run at the correct speed once more.
> >
> > During development of an ALSA patch I had the need to remove and reinsert
> > the hda-intel and hda-codec modules on numerous occasions. Every so often
> > (perhaps once every 5 or 6 times on average) the initialisation sequence of
> > hda-intel would get hung up and the associated insmod would never return. A
> > reboot was required to clear the problem. The following messages were
> > written to syslog repeatedly and often:
> >
> > Feb 5 21:36:24: ALSA sound/pci/hda/hda_intel.c:511: azx_get_response timeout
> > Feb 5 21:36:26 halite last message repeated 9 times
> > Feb 5 21:36:29 halite kernel: printk: 31 messages suppressed.
> >
> > I have noticed the "azx_get_response timeout" messages in earlier kernels
> > as well, but up until now the hda initialisation hasn't gotten hung up.
> >
> > The latching up of the hda-intel initialisation does not appear to occur
> > when doing the same thing under a non-RT 2.6.15 kernel. Furthermore, I have
> > had an instance where the lockup occured while cold-booting an unmodified
> > 2.6.15-rt16, which rules out any changes I made to ALSA as the cause of the
> > problem. In any case the changes I was making to ALSA don't affect the
> > initialisation code.
> >
> > Prior to this kernel I was running an unmodified 2.6.14-rt21 kernel and
> > while these messages did occur they didn't cause hda-intel to lock up.
> >
> > Any suggestions as to what might be causing this and/or of further tests
> > which might help narrow down the cause?
>
> Could you turn on nmi_watchdog as well as softlockup_detect.
>
> nmi_watchedog:
> make sure CONFIG_X86_LOCAL_APIC is set, and then pass in the command
> line "nmi_watchdog=2 lapic" (lapic may or may not be needed, but should
> not hurt to include it).
>
> softlockup_detect:
> set CONFIG_DETECT_SOFTLOCKUP.
>
> these may point out better what is locked up.

The NMI watchdog triggers only when the kernel as a whole locks up. That
isn't what's happening here. What is occuring is that the HDA side of
things is entering a seemingly endless loop, but the rest of the kernel
continues to function normally.

In any case, I did the above and booted with "nmi_watchdog=2 lapic". I
triggered the fault but neither the soft lockup nor NMI watchdog triggered,
which given the above was of no surprise.

The "azx_get_response timeout" is only produced by a single function:
azx_get_response() in hda_intel.c. This in turn seems to be called from
only one location - azx_codec_create() in hda_intel.c. azx_probe() calls
azx_codec_create() exactly once, and azx_probe() is the .probe method for
the PCI driver definition. Given that the above call chain in hda-intel.c
does not contain any loops, I conclude that the PCI code is repeatedly
calling the probe method due (presumedly) to the repeated timeouts. I can't
see how else azx_get_response() could be called repeatedly.

As mentioned in the original post, other kernels display the timeout
message. However, they do not loop endlessly and the driver load succeeds
(presumedly on the second attempt, since only one timeout message was ever
seen at a time). Applying rt16 to 2.6.15 seems to change something which
causes the endless timeouts. The condition doesn't exist all the time; once
it does exist however it persists until reboot. Note too that the condition
seems to only affect the HDA side of things - all other components of the
kernel appear to function correctly.

The machine in question is an i915-based laptop using PCI-express internally.

Any other suggestions as to tests which might narrow down the problem's cause?

Regards
jonathan

2006-02-16 07:55:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect


On Thu, 16 Feb 2006, Jonathan Woithe wrote:

> Hi Steve
>
> > > In the last week I have updated the kernel on our laptop to 2.6.15-rt16.
> > > By and large this worked well and had the attractive side effect of making
> > > the clock run at the correct speed once more.
> > >
> > > During development of an ALSA patch I had the need to remove and reinsert
> > > the hda-intel and hda-codec modules on numerous occasions. Every so often
> > > (perhaps once every 5 or 6 times on average) the initialisation sequence of
> > > hda-intel would get hung up and the associated insmod would never return. A
> > > reboot was required to clear the problem. The following messages were
> > > written to syslog repeatedly and often:
> > >
> > > Feb 5 21:36:24: ALSA sound/pci/hda/hda_intel.c:511: azx_get_response timeout
> > > Feb 5 21:36:26 halite last message repeated 9 times
> > > Feb 5 21:36:29 halite kernel: printk: 31 messages suppressed.
> > >
> > > I have noticed the "azx_get_response timeout" messages in earlier kernels
> > > as well, but up until now the hda initialisation hasn't gotten hung up.
> > >
> > > The latching up of the hda-intel initialisation does not appear to occur
> > > when doing the same thing under a non-RT 2.6.15 kernel. Furthermore, I have
> > > had an instance where the lockup occured while cold-booting an unmodified
> > > 2.6.15-rt16, which rules out any changes I made to ALSA as the cause of the
> > > problem. In any case the changes I was making to ALSA don't affect the
> > > initialisation code.
> > >
> > > Prior to this kernel I was running an unmodified 2.6.14-rt21 kernel and
> > > while these messages did occur they didn't cause hda-intel to lock up.
> > >
> > > Any suggestions as to what might be causing this and/or of further tests
> > > which might help narrow down the cause?
> >
> > Could you turn on nmi_watchdog as well as softlockup_detect.
> >
> > nmi_watchedog:
> > make sure CONFIG_X86_LOCAL_APIC is set, and then pass in the command
> > line "nmi_watchdog=2 lapic" (lapic may or may not be needed, but should
> > not hurt to include it).
> >
> > softlockup_detect:
> > set CONFIG_DETECT_SOFTLOCKUP.
> >
> > these may point out better what is locked up.
>
> The NMI watchdog triggers only when the kernel as a whole locks up. That
> isn't what's happening here. What is occuring is that the HDA side of
> things is entering a seemingly endless loop, but the rest of the kernel
> continues to function normally.

Sorry, I was thinking your kernel locked up, not just the insmod. OK,
forget about what I asked.


>
> In any case, I did the above and booted with "nmi_watchdog=2 lapic". I
> triggered the fault but neither the soft lockup nor NMI watchdog triggered,
> which given the above was of no surprise.
>
> The "azx_get_response timeout" is only produced by a single function:
> azx_get_response() in hda_intel.c. This in turn seems to be called from
> only one location - azx_codec_create() in hda_intel.c. azx_probe() calls

It's not called there. It's setup as a function pointer, so it is called
when another fuction calls the get_response pointer.

> azx_codec_create() exactly once, and azx_probe() is the .probe method for
> the PCI driver definition. Given that the above call chain in hda-intel.c
> does not contain any loops, I conclude that the PCI code is repeatedly
> calling the probe method due (presumedly) to the repeated timeouts. I can't
> see how else azx_get_response() could be called repeatedly.

If you want to know where it is being called from, do add the following:

static unsigned int azx_get_response(struct hda_codec *codec)
{
azx_t *chip = codec->bus->private_data;
int timeout = 50;

while (chip->rirb.cmds) {
if (! --timeout) {
snd_printk(KERN_ERR "azx_get_response timeout\n");
+ dump_stack();
chip->rirb.rp = azx_readb(chip, RIRBWP);
chip->rirb.cmds = 0;
return -1;
}
msleep(1);
}
return chip->rirb.res; /* the last value */
}

that will show the call trace of the function.

>
> As mentioned in the original post, other kernels display the timeout
> message. However, they do not loop endlessly and the driver load succeeds
> (presumedly on the second attempt, since only one timeout message was ever
> seen at a time). Applying rt16 to 2.6.15 seems to change something which
> causes the endless timeouts. The condition doesn't exist all the time; once
> it does exist however it persists until reboot. Note too that the condition
> seems to only affect the HDA side of things - all other components of the
> kernel appear to function correctly.
>
> The machine in question is an i915-based laptop using PCI-express internally.
>
> Any other suggestions as to tests which might narrow down the problem's cause?
>

Yeah, could you just add that dump_stack to see who is calling this. Then
we can look into that.

Thanks,

-- Steve

2006-02-16 09:43:43

by Bill Huey

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect

On Thu, Feb 16, 2006 at 02:55:17AM -0500, Steven Rostedt wrote:
> Yeah, could you just add that dump_stack to see who is calling this. Then
> we can look into that.

I'm also getting a funny x86_64 crash on boot with what looks like the
slab.c changes and the cache_grow() function. Tracking it down has been
rather odd in the it's complaining about a fault deferencing a bogus
value on a up_mutex. I'll see what I can do in the next day or so.

bill

2006-02-17 01:43:11

by Jonathan Woithe

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect

Hi all

> > > > In the last week I have updated the kernel on our laptop to 2.6.15-rt16.
> > > > By and large this worked well and had the attractive side effect of making
> > > > the clock run at the correct speed once more.
> > > >
> > > > During development of an ALSA patch I had the need to remove and reinsert
> > > > the hda-intel and hda-codec modules on numerous occasions. Every so often
> > > > (perhaps once every 5 or 6 times on average) the initialisation sequence of
> > > > hda-intel would get hung up and the associated insmod would never return. A
> > > > reboot was required to clear the problem. The following messages were
> > > > written to syslog repeatedly and often:
> > > >
> > > > Feb 5 21:36:24: ALSA sound/pci/hda/hda_intel.c:511: azx_get_response timeout
> > > > Feb 5 21:36:26 halite last message repeated 9 times
> > > > Feb 5 21:36:29 halite kernel: printk: 31 messages suppressed.
> > > >
> > > > I have noticed the "azx_get_response timeout" messages in earlier kernels
> > > > as well, but up until now the hda initialisation hasn't gotten hung up.
> > > >
> > > > The latching up of the hda-intel initialisation does not appear to occur
> > > > when doing the same thing under a non-RT 2.6.15 kernel. Furthermore, I have
> > > > had an instance where the lockup occured while cold-booting an unmodified
> > > > 2.6.15-rt16, which rules out any changes I made to ALSA as the cause of the
> > > > problem. In any case the changes I was making to ALSA don't affect the
> > > > initialisation code.
> > > >
> > > > Prior to this kernel I was running an unmodified 2.6.14-rt21 kernel and
> > > > while these messages did occur they didn't cause hda-intel to lock up.
> > > >
> > > > Any suggestions as to what might be causing this and/or of further tests
> > > > which might help narrow down the cause?
:
> Sorry, I was thinking your kernel locked up, not just the insmod. OK,
> forget about what I asked.

No problem.

> > The "azx_get_response timeout" is only produced by a single function:
> > azx_get_response() in hda_intel.c. This in turn seems to be called from
> > only one location - azx_codec_create() in hda_intel.c. azx_probe() calls
>
> It's not called there. It's setup as a function pointer, so it is called
> when another fuction calls the get_response pointer.

Yep, you're right - I misread the code the other day, probably because I
was in a hurry. My turn to be sorry. :)

> If you want to know where it is being called from, do add the following:
>
> static unsigned int azx_get_response(struct hda_codec *codec)
> {
> azx_t *chip = codec->bus->private_data;
> int timeout = 50;
>
> while (chip->rirb.cmds) {
> if (! --timeout) {
> snd_printk(KERN_ERR "azx_get_response timeout\n");
> + dump_stack();
> chip->rirb.rp = azx_readb(chip, RIRBWP);
> chip->rirb.cmds = 0;
> return -1;
> }
> msleep(1);
> }
> return chip->rirb.res; /* the last value */
> }
>
> that will show the call trace of the function.
> :
> > Any other suggestions as to tests which might narrow down the problem's cause?
>
> Yeah, could you just add that dump_stack to see who is calling this. Then
> we can look into that.

No problem. I'll do this sometime over the weekend and report the findings
early next week.

Regards
jonathan

2006-02-24 01:28:51

by Jonathan Woithe

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect

Hi guys

> > > > In the last week I have updated the kernel on our laptop to 2.6.15-rt16.
> > > > By and large this worked well and had the attractive side effect of making
> > > > the clock run at the correct speed once more.
> > > >
> > > > During development of an ALSA patch I had the need to remove and reinsert
> > > > the hda-intel and hda-codec modules on numerous occasions. Every so often
> > > > (perhaps once every 5 or 6 times on average) the initialisation sequence of
> > > > hda-intel would get hung up and the associated insmod would never return. A
> > > > reboot was required to clear the problem. The following messages were
> > > > written to syslog repeatedly and often:
> > > >
> > > > Feb 5 21:36:24: ALSA sound/pci/hda/hda_intel.c:511: azx_get_response timeout
> > > > Feb 5 21:36:26 halite last message repeated 9 times
> > > > Feb 5 21:36:29 halite kernel: printk: 31 messages suppressed.
> > > >
> > > > I have noticed the "azx_get_response timeout" messages in earlier kernels
> > > > as well, but up until now the hda initialisation hasn't gotten hung up.
:
> > In any case, I did the above and booted with "nmi_watchdog=2 lapic". I
> > triggered the fault but neither the soft lockup nor NMI watchdog triggered,
> > which given the above was of no surprise.
> >
> > The "azx_get_response timeout" is only produced by a single function:
> > azx_get_response() in hda_intel.c. This in turn seems to be called from
> > only one location - azx_codec_create() in hda_intel.c. azx_probe() calls
>
> It's not called there. It's setup as a function pointer, so it is called
> when another fuction calls the get_response pointer.
>
> > azx_codec_create() exactly once, and azx_probe() is the .probe method for
> > the PCI driver definition. Given that the above call chain in hda-intel.c
> > does not contain any loops, I conclude that the PCI code is repeatedly
> > calling the probe method due (presumedly) to the repeated timeouts. I can't
> > see how else azx_get_response() could be called repeatedly.
>
> If you want to know where it is being called from, do add the following:
>
> static unsigned int azx_get_response(struct hda_codec *codec)
> {
> azx_t *chip = codec->bus->private_data;
> int timeout = 50;
>
> while (chip->rirb.cmds) {
> if (! --timeout) {
> snd_printk(KERN_ERR "azx_get_response timeout\n");
> + dump_stack();
> chip->rirb.rp = azx_readb(chip, RIRBWP);
> chip->rirb.cmds = 0;
> return -1;
> }
> msleep(1);
> }
> return chip->rirb.res; /* the last value */
> }
>
> that will show the call trace of the function.
:
> > Any other suggestions as to tests which might narrow down the problem's cause?
>
> Yeah, could you just add that dump_stack to see who is calling this. Then
> we can look into that.

I finally got around to doing this and the first two dumps are at the
end of this message. Many many more were sent to the logs though - as
one would expect, since we know that in the fault condition this function
is called very frequently without any end in site.

I've put the *full* dump up at

http://www.atrad.com.au/~jwoithe/kernel/full_timeout_dump.bz2

if anyone's interested. Similarly, I moved the dump_stack() call
outside the while() loop and obtained a dump in the case of things
working correctly. You can grab this from

http://www.atrad.com.au/~jwoithe/kernel/dump_loaded_ok.bz2

Oh yes, the dump_stack() call was put *before* the snd_printk() call
when I did the tests. This makes no practical difference except that it
explains why the output from this came after the dumps.

Over the weekend I'll try to test 2.6.15-rt17 since that's recently come
out.

What next?

Regards
jonathan

Stack dump following the first timeout:
Feb 22 21:34:15 halite kernel: [<f8ccd22d>] azx_get_response+0x21/0x7c [snd_hda_intel] (8)
Feb 22 21:34:15 halite kernel: [<f8d5b038>] snd_hda_codec_read+0x38/0x4a [snd_hda_codec] (12)
Feb 22 21:34:15 halite kernel: [<f8d5b0cf>] snd_hda_get_sub_nodes+0x18/0x33 [snd_hda_codec] (16)
Feb 22 21:34:15 halite kernel: [<f8d5c6f9>] hda_set_power_state+0x2d/0xaf [snd_hda_codec] (24)
Feb 22 21:34:15 halite kernel: [<f8d5c7df>] snd_hda_build_controls+0x64/0x8b [snd_hda_codec] (56)
Feb 22 21:34:15 halite kernel: [<f8cce015>] azx_mixer_create+0xf/0x11 [snd_hda_intel] (28)
Feb 22 21:34:15 halite kernel: [<f8cce6d6>] azx_probe+0x98/0xcb [snd_hda_intel] (8)
Feb 22 21:34:15 halite kernel: [<c021e95b>] pci_call_probe+0xf/0x12 (24)
Feb 22 21:34:15 halite kernel: [<c021e991>] __pci_device_probe+0x33/0x47 (12)
Feb 22 21:34:15 halite kernel: [<c021e9c4>] pci_device_probe+0x1f/0x34 (28)
Feb 22 21:34:15 halite kernel: [<c0264893>] driver_probe_device+0x3a/0x84 (24)
Feb 22 21:34:15 halite kernel: [<c026493d>] __driver_attach+0x0/0x34 (16)
Feb 22 21:34:15 halite kernel: [<c0264963>] __driver_attach+0x26/0x34 (4)
Feb 22 21:34:15 halite kernel: [<c02640d1>] bus_for_each_dev+0x4a/0x70 (20)
Feb 22 21:34:15 halite kernel: [<c02169c0>] kobject_add+0x82/0xa1 (24)
Feb 22 21:34:15 halite kernel: [<c0264985>] driver_attach+0x14/0x18 (20)
Feb 22 21:34:15 halite kernel: [<c026493d>] __driver_attach+0x0/0x34 (16)
Feb 22 21:34:15 halite kernel: [<c02644c1>] bus_add_driver+0x57/0x9f (4)
Feb 22 21:34:15 halite kernel: [<c0264c62>] driver_register+0x4d/0x53 (24)
Feb 22 21:34:15 halite kernel: [<c021eba7>] __pci_register_driver+0x7a/0xa9 (20)
Feb 22 21:34:15 halite kernel: [<c021ebc5>] __pci_register_driver+0x98/0xa9 (12)
Feb 22 21:34:15 halite kernel: [<f8c8f00f>] alsa_card_azx_init+0xf/0x12 [snd_hda_intel] (28)
Feb 22 21:34:15 halite kernel: [<c01317c1>] sys_init_module+0xac/0x19c (12)
Feb 22 21:34:15 halite kernel: [<c0102cad>] syscall_call+0x7/0xb (16)
Feb 22 21:34:15 halite kernel: ALSA /mnt/hd/alsa-1.0.11rc3/alsa-driver-1.0.11rc3/alsa-kernel/pci/hda/hda_intel.c:516: azx_get_response timeout

Second dump following the start of the timeouts:
Feb 22 21:34:16 halite kernel: [<f8ccd22d>] azx_get_response+0x21/0x7c [snd_hda_intel] (8)
Feb 22 21:34:16 halite kernel: [<f8d5b038>] snd_hda_codec_read+0x38/0x4a [snd_hda_codec] (12)
Feb 22 21:34:16 halite kernel: [<f8d5c72f>] hda_set_power_state+0x63/0xaf [snd_hda_codec] (16)
Feb 22 21:34:16 halite kernel: [<f8d5c7df>] snd_hda_build_controls+0x64/0x8b [snd_hda_codec] (44)
Feb 22 21:34:16 halite kernel: [<f8cce015>] azx_mixer_create+0xf/0x11 [snd_hda_intel] (28)
Feb 22 21:34:16 halite kernel: [<f8cce6d6>] azx_probe+0x98/0xcb [snd_hda_intel] (8)
Feb 22 21:34:16 halite kernel: [<c021e95b>] pci_call_probe+0xf/0x12 (24)
Feb 22 21:34:16 halite kernel: [<c021e991>] __pci_device_probe+0x33/0x47 (12)
Feb 22 21:34:16 halite kernel: [<c021e9c4>] pci_device_probe+0x1f/0x34 (28)
Feb 22 21:34:16 halite kernel: [<c0264893>] driver_probe_device+0x3a/0x84 (24)
Feb 22 21:34:16 halite kernel: [<c026493d>] __driver_attach+0x0/0x34 (16)
Feb 22 21:34:16 halite kernel: [<c0264963>] __driver_attach+0x26/0x34 (4)
Feb 22 21:34:16 halite kernel: [<c02640d1>] bus_for_each_dev+0x4a/0x70 (20)
Feb 22 21:34:16 halite kernel: [<c02169c0>] kobject_add+0x82/0xa1 (24)
Feb 22 21:34:16 halite kernel: [<c0264985>] driver_attach+0x14/0x18 (20)
Feb 22 21:34:16 halite kernel: [<c026493d>] __driver_attach+0x0/0x34 (16)
Feb 22 21:34:16 halite kernel: [<c02644c1>] bus_add_driver+0x57/0x9f (4)
Feb 22 21:34:16 halite kernel: [<c0264c62>] driver_register+0x4d/0x53 (24)
Feb 22 21:34:16 halite kernel: [<c021eba7>] __pci_register_driver+0x7a/0xa9 (20)
Feb 22 21:34:16 halite kernel: [<c021ebc5>] __pci_register_driver+0x98/0xa9 (12)
Feb 22 21:34:16 halite kernel: [<f8c8f00f>] alsa_card_azx_init+0xf/0x12 [snd_hda_intel] (28)
Feb 22 21:34:16 halite kernel: [<c01317c1>] sys_init_module+0xac/0x19c (12)
Feb 22 21:34:16 halite kernel: [<c0102cad>] syscall_call+0x7/0xb (16)
Feb 22 21:34:16 halite kernel: ALSA /mnt/hd/alsa-1.0.11rc3/alsa-driver-1.0.11rc3/alsa-kernel/pci/hda/hda_intel.c:516: azx_get_response timeout

2006-02-24 07:38:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect



On Fri, 24 Feb 2006, Jonathan Woithe wrote:

>
> I finally got around to doing this and the first two dumps are at the
> end of this message. Many many more were sent to the logs though - as
> one would expect, since we know that in the fault condition this function
> is called very frequently without any end in site.

Thanks, it didn't find the bug for me, but it did get me a better idea of
what it's doing. Now that you modified the kernel, I hope you are
comfortable in doing more of that :)

Remove the dump stack, and apply the following patch (should work with
both -rt16 and -rt17). This is _not_ a bug fix (obviously) but will do
a lot more prints. I don't know how many interrupts this driver produces
so beware that this can really fill the logs.

Apply the patch and then can you do the same and post the results? This
will give me a good idea at what is happening. You'll want to remove the
patch afterwards, since it produces a lot of noise.

>
> I've put the *full* dump up at
>
> http://www.atrad.com.au/~jwoithe/kernel/full_timeout_dump.bz2
>
> if anyone's interested. Similarly, I moved the dump_stack() call
> outside the while() loop and obtained a dump in the case of things
> working correctly. You can grab this from
>
> http://www.atrad.com.au/~jwoithe/kernel/dump_loaded_ok.bz2
>
> Oh yes, the dump_stack() call was put *before* the snd_printk() call
> when I did the tests. This makes no practical difference except that it
> explains why the output from this came after the dumps.
>
> Over the weekend I'll try to test 2.6.15-rt17 since that's recently come
> out.
>
> What next?
>

Here's the patch.

-- Steve

Index: linux-2.6.15-rt17/sound/pci/hda/hda_intel.c
===================================================================
--- linux-2.6.15-rt17.orig/sound/pci/hda/hda_intel.c 2006-02-08 08:34:07.000000000 -0500
+++ linux-2.6.15-rt17/sound/pci/hda/hda_intel.c 2006-02-24 02:28:24.000000000 -0500
@@ -486,10 +486,14 @@ static void azx_update_rirb(azx_t *chip)
u32 res, res_ex;

wp = azx_readb(chip, RIRBWP);
- if (wp == chip->rirb.wp)
+ printk("%s: wp 0x%x\n",__FUNCTION__,wp);
+ if (wp == chip->rirb.wp) {
+ printk("%s: ?? bye bye\n",__FUNCTION__);
return;
+ }
chip->rirb.wp = wp;

+ printk("%s: rp=0x%x\n",__FUNCTION__,chip->rirb.rp);
while (chip->rirb.rp != wp) {
chip->rirb.rp++;
chip->rirb.rp %= ICH6_MAX_RIRB_ENTRIES;
@@ -497,13 +501,16 @@ static void azx_update_rirb(azx_t *chip)
rp = chip->rirb.rp << 1; /* an RIRB entry is 8-bytes */
res_ex = le32_to_cpu(chip->rirb.buf[rp + 1]);
res = le32_to_cpu(chip->rirb.buf[rp]);
- if (res_ex & ICH6_RIRB_EX_UNSOL_EV)
+ if (res_ex & ICH6_RIRB_EX_UNSOL_EV) {
+ printk("%s: unsol event\n",__FUNCTION__);
snd_hda_queue_unsol_event(chip->bus, res, res_ex);
- else if (chip->rirb.cmds) {
+ } else if (chip->rirb.cmds) {
chip->rirb.cmds--;
+ printk("%s: chip->rirb.cmds=0x%x\n",__FUNCTION__,chip->rirb.cmds);
chip->rirb.res = res;
}
}
+ printk("%s: out\n",__FUNCTION__);
}

/* receive a response */
@@ -767,11 +774,14 @@ static irqreturn_t azx_interrupt(int irq
u32 status;
int i;

+ printk("%s: irq %d\n",__FUNCTION__,irq);
spin_lock(&chip->reg_lock);

status = azx_readl(chip, INTSTS);
+ printk("%s: status INTSTS 0x%x\n",__FUNCTION__,status);
if (status == 0) {
spin_unlock(&chip->reg_lock);
+ printk("%s: no irq?\n",__FUNCTION__);
return IRQ_NONE;
}

@@ -791,6 +801,7 @@ static irqreturn_t azx_interrupt(int irq

/* clear rirb int */
status = azx_readb(chip, RIRBSTS);
+ printk("%s: status RIRBSTS 0x%x\n",__FUNCTION__,status);
if (status & RIRB_INT_MASK) {
if (status & RIRB_INT_RESPONSE)
azx_update_rirb(chip);
@@ -804,6 +815,7 @@ static irqreturn_t azx_interrupt(int irq
#endif
spin_unlock(&chip->reg_lock);

+ printk("%s: out\n",__FUNCTION__);
return IRQ_HANDLED;
}

2006-02-26 23:25:01

by Jonathan Woithe

[permalink] [raw]
Subject: Re: 2.6.15-rt16: possible sound-related side-effect

Hi Steve

> > I finally got around to doing this and the first two dumps are at the
> > end of this message. Many many more were sent to the logs though - as
> > one would expect, since we know that in the fault condition this function
> > is called very frequently without any end in site.
>
> Thanks, it didn't find the bug for me, but it did get me a better idea of
> what it's doing. Now that you modified the kernel, I hope you are
> comfortable in doing more of that :)

As I said, this issue first came to light while I worked on an ALSA driver
patch. I'm fine with patches. :)

> Remove the dump stack, and apply the following patch (should work with
> both -rt16 and -rt17). This is _not_ a bug fix (obviously) but will do
> a lot more prints. I don't know how many interrupts this driver produces
> so beware that this can really fill the logs.

FYI it applied with some manual intervention against -rt16. I ran out of
time to try -rt17 though. It'll be a day or so before I get to that.

The minor problem is that with the patch applied I was unable to trigger the
bug. I then went back to what I *think* was a totally unmodified 2.6.15-rt16
tree, recompiled that from scratch and was then unable to reproduce the
problem with that one either.

Now, this was quite late last night and there were a few other strange
issues occuring at the same time, so before declaring this whole thing the
result of an out-of-sync kernel compile I want to do everything again from
scratch (especially since I can't quite see how things could have gotten out
of sync). I'll also test -rt17 for completeness.

Another possibility is that it's CVS-alsa which is doing something odd to
the hardware. I'll test for this as well.

I'll try to do all this tonight, so hopefully by tomorrow I'll have some
definitive news one way or the other.

Regards
jonathan