Hi,
Rather suddenly, my desktop PC started failing to resume. I have
redirected the console to ttyS0 and managed to capture the oops
(attached). I am not a disassembly expert, and I built my
kernel without full debugging symbols, but here is what I found
(at least for the first trace in the attached oops.txt).
The failing code is somewhere around line 2400 of
drivers/firewire/ohci.c (the latest mainline). There is a comment
there about some values being NULL during the resume process, but it
appears there are more NULLs than expected.
%rbx holds the pointer to the ohci structure.
I have attached:
oops.txt - full dump of the oops from the console.
oops_code.txt - disassembled Code from the oops.
ohci_enable_disassembled.txt - disassembled ohci_enable function
from my kernel (gentoo v3.18.8, but as far as I can tell there
haven't been many changes in this area since).
I have marked the failing instruction in the disassembler dumps
with "-->".
There are two conditions I *suspect* may be responsible for this
situation.
Hardware failure. There was a storm a week ago which might have
damaged the hardware. It appears to have slightly affected my SB Audigy
(the card's PCI interface appears OK, but the AC97 codec glitches
when mixer registers are set).
Hardware bug in the on-board firewire controller *and* a bug in the
driver. The code around line 2400 appears to handle multiple
firewire ports (if I read the variable names correctly, e.g.
next_config_rom). Now, without the SB card, I have only one
firewire port, so this is what has changed.
Please tell me how I can help further to debug this problem. (Testing
the firewire port itself may be difficult, because I don't have any
firewire devices.)
Kind regards,
--
It was a great pleasure. Your eyes like me
>Łukasz< and that will be my undoing (c)SNL
Lukasz Stelmach wrote:
> Rather suddenly, my desktop PC started failing to resume. [...]
> The failing code is somewhere around line 2400 of
> drivers/firewire/ohci.c (the latest mainline).
> 0x000000000000003f <+31>: callq 0xffffffffffffb037 <copy_config_rom>
> 0x0000000000000044 <+36>: mov 0x898(%rbx),%rax
> -->0x000000000000004b <+43>: mov (%rax),%edx <--
(The copy_config_rom call was not actually executed; the else branch
jumped straight to <+36>, i.e. offset 0x44.)
ohci->next_config_rom is NULL because ohci->config_rom is NULL.
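To make the failure mode concrete, here is a heavily simplified C model of the resume branch. It is only a sketch: struct ohci_model and model_enable_on_resume() are hypothetical names, not the driver's code. It assumes, per the disassembly above, that the faulting "mov (%rax),%edx" reads the first 32-bit word through next_config_rom, and that on resume next_config_rom simply inherits ohci->config_rom. Where the real code dereferences, the model returns -1 instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the ohci structure: only the
 * two config ROM pointers discussed in this thread. */
struct ohci_model {
	uint32_t *config_rom;      /* ROM currently in use */
	uint32_t *next_config_rom; /* ROM to activate on the next bus reset */
	uint32_t next_header;      /* copy of the first quadlet of next_config_rom */
};

/*
 * Models the resume case: no new ROM is supplied, so the old one is
 * reused and next_config_rom inherits whatever config_rom holds.
 * The real code then reads next_config_rom[0] (the faulting load);
 * this sketch checks for NULL first and reports failure instead.
 */
static int model_enable_on_resume(struct ohci_model *ohci)
{
	ohci->next_config_rom = ohci->config_rom;

	if (ohci->next_config_rom == NULL)
		return -1; /* here the real driver oopses */

	ohci->next_header = ohci->next_config_rom[0];
	return 0;
}
```

With a populated config_rom the call succeeds and copies the first quadlet; with config_rom left NULL (a controller whose initialization never completed) it returns -1, which is exactly the spot where the real code oopses.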
> The code around the line 2400 appears to handle multiple
> firewire ports (if I recognise variable names correctly, e.g.
> next_config_rom).
No, this code handles multiple versions of the same data structure.
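A rough C sketch of "two versions of the same data structure": a live copy plus a staged copy that only becomes live when promoted. The names rom_versions, stage_next() and finish_bus_reset() are hypothetical. The sketch also assumes (consistent with the bus_reset_work() explanation in this thread, but not verified against the driver source) that promotion happens at the end of bus-reset handling, so an aborted reset leaves the live pointer at its old value, which is NULL if no reset ever completed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-version holder modelled on config_rom/next_config_rom. */
struct rom_versions {
	uint32_t *current; /* version in use */
	uint32_t *next;    /* staged version, not yet in use */
};

/* Stage a new version; it does not take effect until promoted. */
static void stage_next(struct rom_versions *v, uint32_t *rom)
{
	v->next = rom;
}

/*
 * Promote the staged version at the end of a (modelled) bus reset.
 * If reset handling aborts early, e.g. on a bad self ID, nothing is
 * promoted and "current" keeps its previous value.
 */
static bool finish_bus_reset(struct rom_versions *v, bool self_ids_ok)
{
	if (!self_ids_ok || v->next == NULL)
		return false;
	v->current = v->next;
	v->next = NULL;
	return true;
}
```

So there is one data structure with an old and a new revision, not one structure per port.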
> Hardware bug in the on-board firewire controller *and* a bug in the
> driver.
Indeed; this appears to be the culprit:
> [ 232.855042] firewire_ohci 0000:04:03.0: added OHCI v1.0 device as card 0, 8 IR + 8 IT contexts, quirks 0x0
> [ 232.864724] firewire_ohci 0000:04:03.0: bad self ID 0/1 (00000000 != ~00000000)
With the "bad self ID", bus_reset_work() just aborts, and the controller
is never completely initialized (therefore the unexpected NULL).
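For reference, in IEEE 1394 every self-ID packet is a data quadlet followed by its bitwise inverse, and the log line above shows a pair that fails this test: 00000000 is not the inverse of 00000000. A minimal sketch of such a check follows; self_id_pair_ok() and first_bad_self_id() are hypothetical helpers, not the driver's functions, and the buffer layout (pairs stored back to back) is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* A self-ID pair is valid when the second quadlet is the bitwise
 * inverse of the first. */
static bool self_id_pair_ok(uint32_t quadlet, uint32_t inverse)
{
	return quadlet == (uint32_t)~inverse;
}

/*
 * Scan "count" pairs stored back to back (quadlet, inverse, ...).
 * Returns the index of the first bad pair, or -1 if all are valid.
 */
static int first_bad_self_id(const uint32_t *buf, int count)
{
	for (int i = 0; i < count; i++) {
		if (!self_id_pair_ok(buf[2 * i], buf[2 * i + 1]))
			return i;
	}
	return -1;
}
```

An all-zero pair, as in the log, is rejected at index 0. Any pair whose second quadlet is the complement of the first passes.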
Try unloading and reloading the firewire-ohci module to see whether you
can ever avoid the "bad self ID" error. If the error persists, your
hardware does indeed appear to be broken.
Regards,
Clemens