2006-08-12 10:08:46

by Rafael J. Wysocki

Subject: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

Hi,

On 2.6.18-rc3-mm2 with hotfixes I get things like the appended one on attempts
to suspend to disk. It occurs while devices are being suspended and is fairly
reproducible.

Greetings,
Rafael


Suspending device 0000:01:00.0
Suspending device 0000:02:02.0
Suspending device 0000:02:01.4
Suspending device 0000:02:01.3
Suspending device 0000:02:01.2
Suspending device 0000:02:01.1
Suspending device 0000:02:01.0
Suspending device 0000:02:00.0
skge Ram read data parity error
skge Ram write data parity error
skge eth0: receive queue parity error
skge <NULL>: receive queue parity error
skge 0000:02:00.0: PCI error cmd=0x110 status=0x2b0
general protection fault: 0000 [1] PREEMPT
last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/subsystem_device
CPU 0
Modules linked in: ide_cd cdrom usbserial asus_acpi thermal ipv6 processor fan button battery ac af_packet snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device bcm43xx ieee80211softmac ieee80211 ieee80211_crypt pcmcia firmware_class ohci1394 ieee1394 skge yenta_socket rsrc_nonstatic pcmcia_core usbhid ff_memless snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd ohci_hcd i2c_nforce2 i2c_core parport_pc lp parport dm_mod
Pid: 4, comm: events/0 Not tainted 2.6.18-rc3-mm2 #17
RIP: 0010:[<ffffffff88107287>] [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
RSP: 0018:ffffffff80621e70 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: 0000000000000040
RDX: ffff81005addf128 RSI: ffffffff80621eec RDI: ffff81005addeb60
RBP: ffffffff80621ed0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000040 R11: 0000000000000000 R12: ffff81005addf0a0
R13: 0000000000000000 R14: ffff810057fe9180 R15: 00000000ffffffff
FS: 00002b4b98df4b00(0000) GS:ffffffff808c2000(0000) knlGS:00000000558b4d00
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002adeb0d7d0b0 CR3: 0000000025147000 CR4: 00000000000006e0
Process events/0 (pid: 4, threadinfo ffff810037f44000, task ffff810037fef100)
Stack: ffffffff80621eb0 ffffffff80621eec ffff81005addeb60 ffff81005ad61488
ffff81005addf128 000000400000000a 00000001008e6a25 0000000000000000
ffff81005addeb60 0000000000000000 00000001008e6a25 00000000ffffffff
Call Trace:
[<ffffffff8040b1ba>] net_rx_action+0xba/0x1f0
[<ffffffff80233640>] __do_softirq+0x70/0xf0
[<ffffffff8020aa7c>] call_softirq+0x1c/0x30
DWARF2 unwinder stuck at call_softirq+0x1c/0x30
Leftover inexact backtrace:
<IRQ> [<ffffffff8020ca4d>] do_softirq+0x3d/0xb0
[<ffffffff8023349e>] irq_exit+0x4e/0x60
[<ffffffff8020cbf5>] do_IRQ+0x135/0x140
[<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
[<ffffffff8020a266>] ret_from_intr+0x0/0xf
<EOI> [<ffffffff80233367>] local_bh_enable_ip+0xe7/0x110
[<ffffffff804718b9>] _spin_unlock_bh+0x39/0x40
[<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
[<ffffffff80427c8b>] rt_cache_flush+0xab/0x100
[<ffffffff8045a1c9>] fib_netdev_event+0xa9/0xc0
[<ffffffff8023c2af>] notifier_call_chain+0x2f/0x50
[<ffffffff8023c4b9>] raw_notifier_call_chain+0x9/0x10
[<ffffffff80409789>] netdev_state_change+0x29/0x40
[<ffffffff80415122>] linkwatch_run_queue+0x162/0x190
[<ffffffff8041517a>] linkwatch_event+0x2a/0x40
[<ffffffff8023fd72>] run_workqueue+0xc2/0x120
[<ffffffff80415150>] linkwatch_event+0x0/0x40
[<ffffffff8023fff1>] worker_thread+0x121/0x160
[<ffffffff80229370>] default_wake_function+0x0/0x10
[<ffffffff8023fed0>] worker_thread+0x0/0x160
[<ffffffff802436f9>] kthread+0xd9/0x110
[<ffffffff8024b1ad>] trace_hardirqs_on+0x11d/0x150
[<ffffffff8020a706>] child_rip+0x8/0x12
[<ffffffff80471e5b>] _spin_unlock_irq+0x2b/0x60
[<ffffffff8020a2c0>] restore_args+0x0/0x30
[<ffffffff80243620>] kthread+0x0/0x110
[<ffffffff8020a6fe>] child_rip+0x0/0x12


Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
RIP [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
RSP <ffffffff80621e70>
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!


2006-08-12 12:29:00

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Sat, 12 Aug 2006 12:07:42 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> Hi,
>
> On 2.6.18-rc3-mm2 with hotfixes I get things like the appended one on attempts
> to suspend to disk. It occurs while devices are being suspended and is fairly
> reproducible.
>
> Greetings,
> Rafael
>
>
> Suspending device 0000:01:00.0
> Suspending device 0000:02:02.0
> Suspending device 0000:02:01.4
> Suspending device 0000:02:01.3
> Suspending device 0000:02:01.2
> Suspending device 0000:02:01.1
> Suspending device 0000:02:01.0
> Suspending device 0000:02:00.0
> skge Ram read data parity error
> skge Ram write data parity error
> skge eth0: receive queue parity error
> skge <NULL>: receive queue parity error
> skge 0000:02:00.0: PCI error cmd=0x110 status=0x2b0
> general protection fault: 0000 [1] PREEMPT
> last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/subsystem_device
> CPU 0
> Modules linked in: ide_cd cdrom usbserial asus_acpi thermal ipv6 processor fan button battery ac af_packet snd_pcm_oss snd_mixer_oss snd_seq
> snd_seq_device bcm43xx ieee80211softmac ieee80211 ieee80211_crypt pcmcia firmware_class ohci1394 ieee1394 skge yenta_socket rsrc_nonstatic pc
> mcia_core usbhid ff_memless snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd ohci_hcd i2c_nfo
> rce2 i2c_core parport_pc lp parport dm_mod
> Pid: 4, comm: events/0 Not tainted 2.6.18-rc3-mm2 #17
> RIP: 0010:[<ffffffff88107287>] [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> RSP: 0018:ffffffff80621e70 EFLAGS: 00010202
> RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: 0000000000000040

RAX doesn't look good.

> RDX: ffff81005addf128 RSI: ffffffff80621eec RDI: ffff81005addeb60
> RBP: ffffffff80621ed0 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000040 R11: 0000000000000000 R12: ffff81005addf0a0
> R13: 0000000000000000 R14: ffff810057fe9180 R15: 00000000ffffffff
> FS: 00002b4b98df4b00(0000) GS:ffffffff808c2000(0000) knlGS:00000000558b4d00
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002adeb0d7d0b0 CR3: 0000000025147000 CR4: 00000000000006e0
> Process events/0 (pid: 4, threadinfo ffff810037f44000, task ffff810037fef100)
> Stack: ffffffff80621eb0 ffffffff80621eec ffff81005addeb60 ffff81005ad61488
> ffff81005addf128 000000400000000a 00000001008e6a25 0000000000000000
> ffff81005addeb60 0000000000000000 00000001008e6a25 00000000ffffffff
> Call Trace:
> [<ffffffff8040b1ba>] net_rx_action+0xba/0x1f0
> [<ffffffff80233640>] __do_softirq+0x70/0xf0
> [<ffffffff8020aa7c>] call_softirq+0x1c/0x30
> DWARF2 unwinder stuck at call_softirq+0x1c/0x30
> Leftover inexact backtrace:
> <IRQ> [<ffffffff8020ca4d>] do_softirq+0x3d/0xb0
> [<ffffffff8023349e>] irq_exit+0x4e/0x60
> [<ffffffff8020cbf5>] do_IRQ+0x135/0x140
> [<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
> [<ffffffff8020a266>] ret_from_intr+0x0/0xf
> <EOI> [<ffffffff80233367>] local_bh_enable_ip+0xe7/0x110
> [<ffffffff804718b9>] _spin_unlock_bh+0x39/0x40
> [<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
> [<ffffffff80427c8b>] rt_cache_flush+0xab/0x100
> [<ffffffff8045a1c9>] fib_netdev_event+0xa9/0xc0
> [<ffffffff8023c2af>] notifier_call_chain+0x2f/0x50
> [<ffffffff8023c4b9>] raw_notifier_call_chain+0x9/0x10
> [<ffffffff80409789>] netdev_state_change+0x29/0x40
> [<ffffffff80415122>] linkwatch_run_queue+0x162/0x190
> [<ffffffff8041517a>] linkwatch_event+0x2a/0x40
> [<ffffffff8023fd72>] run_workqueue+0xc2/0x120
> [<ffffffff80415150>] linkwatch_event+0x0/0x40
> [<ffffffff8023fff1>] worker_thread+0x121/0x160
> [<ffffffff80229370>] default_wake_function+0x0/0x10
> [<ffffffff8023fed0>] worker_thread+0x0/0x160
> [<ffffffff802436f9>] kthread+0xd9/0x110
> [<ffffffff8024b1ad>] trace_hardirqs_on+0x11d/0x150
> [<ffffffff8020a706>] child_rip+0x8/0x12
> [<ffffffff80471e5b>] _spin_unlock_irq+0x2b/0x60
> [<ffffffff8020a2c0>] restore_args+0x0/0x30
> [<ffffffff80243620>] kthread+0x0/0x110
> [<ffffffff8020a6fe>] child_rip+0x0/0x12
> Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
> RIP [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> RSP <ffffffff80621e70>
> <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

ksymoops says:

Code; ffffffff88107287 <_end+7ac9287/7efc2000>
00000000 <_EIP>:
Code; ffffffff88107287 <_end+7ac9287/7efc2000> <=====
0: 44 inc %esp <=====
Code; ffffffff88107288 <_end+7ac9288/7efc2000>
1: 8b 28 mov (%eax),%ebp
Code; ffffffff8810728a <_end+7ac928a/7efc2000>
3: c7 45 d0 00 00 00 00 movl $0x0,0xffffffd0(%ebp)
Code; ffffffff88107291 <_end+7ac9291/7efc2000>
a: 45 inc %ebp
Code; ffffffff88107292 <_end+7ac9292/7efc2000>
b: 85 ed test %ebp,%ebp
Code; ffffffff88107294 <_end+7ac9294/7efc2000>
d: 0f 89 29 fb ff ff jns fffffb3c <_EIP+0xfffffb3c>
Code; ffffffff8810729a <_end+7ac929a/7efc2000>
13: e9 00 00 00 00 jmp 18 <_EIP+0x18>

So even if we didn't deref a kfree'd pointer, we're about to.

It would be good if you could poke around in gdb, work out exactly which
statement it's oopsing at, please.
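
For reference, the RAX value is the 0x6b byte repeated eight times. With slab debugging enabled the kernel fills freed objects with 0x6b (POISON_FREE), which is why it suggests a pointer loaded from kfree'd memory. The value is also not a canonical x86-64 address, which would explain a general protection fault rather than a page fault. A minimal user-space sketch of both checks follows; the helper names are hypothetical and 48-bit virtual addresses are assumed:

```c
#include <stdbool.h>
#include <stdint.h>

#define POISON_FREE 0x6b /* slab debug fill byte for freed objects */

/* Hypothetical helper: true if every byte of the value is POISON_FREE. */
static bool looks_like_freed_slab(uint64_t value)
{
    for (int i = 0; i < 8; i++) {
        if (((value >> (8 * i)) & 0xff) != POISON_FREE)
            return false;
    }
    return true;
}

/* Hypothetical helper: canonical means bits 63..47 are all equal
 * (assumes 48-bit virtual addresses). */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47; /* the upper 17 bits */
    return top == 0 || top == 0x1ffff;
}
```

Feeding it the registers from the oops, RAX (0x6b6b6b6b6b6b6b6b) matches the poison pattern and is non-canonical, while RDX (0xffff81005addf128) is an ordinary kernel address.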

2006-08-12 13:40:08

by Jeff Garzik

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

Andrew Morton wrote:
> It would be good if you could poke around in gdb, work out exactly which
> statement it's oopsing at, please.

I'm also interested to know if the problem goes away when you disable
preempt...

Jeff


2006-08-12 14:32:37

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Saturday 12 August 2006 14:28, Andrew Morton wrote:
> On Sat, 12 Aug 2006 12:07:42 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > Hi,
> >
> > On 2.6.18-rc3-mm2 with hotfixes I get things like the appended one on attempts
> > to suspend to disk. It occurs while devices are being suspended and is fairly
> > reproducible.
> >
> > Greetings,
> > Rafael
> >
> >
> > Suspending device 0000:01:00.0
> > Suspending device 0000:02:02.0
> > Suspending device 0000:02:01.4
> > Suspending device 0000:02:01.3
> > Suspending device 0000:02:01.2
> > Suspending device 0000:02:01.1
> > Suspending device 0000:02:01.0
> > Suspending device 0000:02:00.0
> > skge Ram read data parity error
> > skge Ram write data parity error
> > skge eth0: receive queue parity error
> > skge <NULL>: receive queue parity error

This stuff comes from the interrupt handler which apparently races with
something.

> > skge 0000:02:00.0: PCI error cmd=0x110 status=0x2b0
> > general protection fault: 0000 [1] PREEMPT
> > last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/subsystem_device
> > CPU 0
> > Modules linked in: ide_cd cdrom usbserial asus_acpi thermal ipv6 processor fan button battery ac af_packet snd_pcm_oss snd_mixer_oss snd_seq
> > snd_seq_device bcm43xx ieee80211softmac ieee80211 ieee80211_crypt pcmcia firmware_class ohci1394 ieee1394 skge yenta_socket rsrc_nonstatic pc
> > mcia_core usbhid ff_memless snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd ohci_hcd i2c_nfo
> > rce2 i2c_core parport_pc lp parport dm_mod
> > Pid: 4, comm: events/0 Not tainted 2.6.18-rc3-mm2 #17
> > RIP: 0010:[<ffffffff88107287>] [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> > RSP: 0018:ffffffff80621e70 EFLAGS: 00010202
> > RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: 0000000000000040
>
> RAX doesn't look good.

Yup.

> > RDX: ffff81005addf128 RSI: ffffffff80621eec RDI: ffff81005addeb60
> > RBP: ffffffff80621ed0 R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000040 R11: 0000000000000000 R12: ffff81005addf0a0
> > R13: 0000000000000000 R14: ffff810057fe9180 R15: 00000000ffffffff
> > FS: 00002b4b98df4b00(0000) GS:ffffffff808c2000(0000) knlGS:00000000558b4d00
> > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > CR2: 00002adeb0d7d0b0 CR3: 0000000025147000 CR4: 00000000000006e0
> > Process events/0 (pid: 4, threadinfo ffff810037f44000, task ffff810037fef100)
> > Stack: ffffffff80621eb0 ffffffff80621eec ffff81005addeb60 ffff81005ad61488
> > ffff81005addf128 000000400000000a 00000001008e6a25 0000000000000000
> > ffff81005addeb60 0000000000000000 00000001008e6a25 00000000ffffffff
> > Call Trace:
> > [<ffffffff8040b1ba>] net_rx_action+0xba/0x1f0
> > [<ffffffff80233640>] __do_softirq+0x70/0xf0
> > [<ffffffff8020aa7c>] call_softirq+0x1c/0x30
> > DWARF2 unwinder stuck at call_softirq+0x1c/0x30
> > Leftover inexact backtrace:
> > <IRQ> [<ffffffff8020ca4d>] do_softirq+0x3d/0xb0
> > [<ffffffff8023349e>] irq_exit+0x4e/0x60
> > [<ffffffff8020cbf5>] do_IRQ+0x135/0x140
> > [<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
> > [<ffffffff8020a266>] ret_from_intr+0x0/0xf
> > <EOI> [<ffffffff80233367>] local_bh_enable_ip+0xe7/0x110
> > [<ffffffff804718b9>] _spin_unlock_bh+0x39/0x40
> > [<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
> > [<ffffffff80427c8b>] rt_cache_flush+0xab/0x100
> > [<ffffffff8045a1c9>] fib_netdev_event+0xa9/0xc0
> > [<ffffffff8023c2af>] notifier_call_chain+0x2f/0x50
> > [<ffffffff8023c4b9>] raw_notifier_call_chain+0x9/0x10
> > [<ffffffff80409789>] netdev_state_change+0x29/0x40
> > [<ffffffff80415122>] linkwatch_run_queue+0x162/0x190
> > [<ffffffff8041517a>] linkwatch_event+0x2a/0x40
> > [<ffffffff8023fd72>] run_workqueue+0xc2/0x120
> > [<ffffffff80415150>] linkwatch_event+0x0/0x40
> > [<ffffffff8023fff1>] worker_thread+0x121/0x160
> > [<ffffffff80229370>] default_wake_function+0x0/0x10
> > [<ffffffff8023fed0>] worker_thread+0x0/0x160
> > [<ffffffff802436f9>] kthread+0xd9/0x110
> > [<ffffffff8024b1ad>] trace_hardirqs_on+0x11d/0x150
> > [<ffffffff8020a706>] child_rip+0x8/0x12
> > [<ffffffff80471e5b>] _spin_unlock_irq+0x2b/0x60
> > [<ffffffff8020a2c0>] restore_args+0x0/0x30
> > [<ffffffff80243620>] kthread+0x0/0x110
> > [<ffffffff8020a6fe>] child_rip+0x0/0x12
> > Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
> > RIP [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> > RSP <ffffffff80621e70>
> > <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
>
> ksymoops says:
>
> Code; ffffffff88107287 <_end+7ac9287/7efc2000>
> 00000000 <_EIP>:
> Code; ffffffff88107287 <_end+7ac9287/7efc2000> <=====
> 0: 44 inc %esp <=====
> Code; ffffffff88107288 <_end+7ac9288/7efc2000>
> 1: 8b 28 mov (%eax),%ebp
> Code; ffffffff8810728a <_end+7ac928a/7efc2000>
> 3: c7 45 d0 00 00 00 00 movl $0x0,0xffffffd0(%ebp)
> Code; ffffffff88107291 <_end+7ac9291/7efc2000>
> a: 45 inc %ebp
> Code; ffffffff88107292 <_end+7ac9292/7efc2000>
> b: 85 ed test %ebp,%ebp
> Code; ffffffff88107294 <_end+7ac9294/7efc2000>
> d: 0f 89 29 fb ff ff jns fffffb3c <_EIP+0xfffffb3c>
> Code; ffffffff8810729a <_end+7ac929a/7efc2000>
> 13: e9 00 00 00 00 jmp 18 <_EIP+0x18>
>
> So even if we didn't deref a kfree'd pointer, we're about to.

Hm, but the code should be 64-bit?

> It would be good if you could poke around in gdb, work out exactly which
> statement it's oopsing at, please.

(gdb) l *skge_poll+0x547
0x5287 is in skge_poll (skge.c:2719).
2714 struct skge_rx_desc *rd = e->desc;
2715 struct sk_buff *skb;
2716 u32 control;
2717
2718 rmb();
2719 control = rd->control;
2720 if (control & BMU_OWN)
2721 break;
2722
2723 skb = skge_rx_get(skge, e, control, rd->status, rd->csum2);

2006-08-12 14:33:06

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Saturday 12 August 2006 15:39, Jeff Garzik wrote:
> Andrew Morton wrote:
> > It would be good if you could poke around in gdb, work out exactly which
> > statement it's oopsing at, please.
>
> I'm also interested to know if the problem goes away when you disable
> preempt...

That will take some time to test. :-)

Rafael

2006-08-12 16:12:57

by Edgar E. Iglesias

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Sat, Aug 12, 2006 at 04:31:18PM +0200, Rafael J. Wysocki wrote:
> On Saturday 12 August 2006 14:28, Andrew Morton wrote:
> > On Sat, 12 Aug 2006 12:07:42 +0200
> > "Rafael J. Wysocki" <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > On 2.6.18-rc3-mm2 with hotfixes I get things like the appended one on attempts
> > > to suspend to disk. It occurs while devices are being suspended and is fairly
> > > reproducible.
> > >
> > > Greetings,
> > > Rafael
> > >
> > >
> > > Suspending device 0000:01:00.0
> > > Suspending device 0000:02:02.0
> > > Suspending device 0000:02:01.4
> > > Suspending device 0000:02:01.3
> > > Suspending device 0000:02:01.2
> > > Suspending device 0000:02:01.1
> > > Suspending device 0000:02:01.0
> > > Suspending device 0000:02:00.0
> > > skge Ram read data parity error
> > > skge Ram write data parity error
> > > skge eth0: receive queue parity error
> > > skge <NULL>: receive queue parity error
>
> This stuff comes from the interrupt handler which apparently races with
> something.

Maybe the skge driver is not doing netif_poll_disable before clearing the rx
ring at suspend/down?

Best regards
--
Programmer
Edgar E. Iglesias <[email protected]> 46.46.272.1946

2006-08-12 17:14:15

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Saturday 12 August 2006 18:12, Edgar E. Iglesias wrote:
> On Sat, Aug 12, 2006 at 04:31:18PM +0200, Rafael J. Wysocki wrote:
> > On Saturday 12 August 2006 14:28, Andrew Morton wrote:
> > > On Sat, 12 Aug 2006 12:07:42 +0200
> > > "Rafael J. Wysocki" <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > On 2.6.18-rc3-mm2 with hotfixes I get things like the appended one on attempts
> > > > to suspend to disk. It occurs while devices are being suspended and is fairly
> > > > reproducible.
> > > >
> > > > Greetings,
> > > > Rafael
> > > >
> > > >
> > > > Suspending device 0000:01:00.0
> > > > Suspending device 0000:02:02.0
> > > > Suspending device 0000:02:01.4
> > > > Suspending device 0000:02:01.3
> > > > Suspending device 0000:02:01.2
> > > > Suspending device 0000:02:01.1
> > > > Suspending device 0000:02:01.0
> > > > Suspending device 0000:02:00.0
> > > > skge Ram read data parity error
> > > > skge Ram write data parity error
> > > > skge eth0: receive queue parity error
> > > > skge <NULL>: receive queue parity error
> >
> > This stuff comes from the interrupt handler which apparently races with
> > something.
>
> Maybe the skge driver is not doing netif_poll_disable before clearing the rx
> ring at suspend/down?

Apparently it doesn't.

At least netif_poll_disable is not referenced anywhere in skge.c .

Greetings,
Rafael

2006-08-12 18:16:07

by Edgar E. Iglesias

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Sat, Aug 12, 2006 at 07:13:01PM +0200, Rafael J. Wysocki wrote:
> Apparently it doesn't.

Hi, could you try and see if this helps?

Best regards
--
Programmer
Edgar E. Iglesias <[email protected]> 46.46.272.1946

Signed-off-by: Edgar E. Iglesias <[email protected]>

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index 7de9a07..accefab 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -2211,6 +2211,7 @@ static int skge_up(struct net_device *de
skge_write8(hw, Q_ADDR(rxqaddr[port], Q_CSR), CSR_START | CSR_IRQ_CL_F);
skge_led(skge, LED_MODE_ON);

+ netif_poll_enable(dev);
return 0;

free_rx_ring:
@@ -2279,6 +2280,7 @@ static int skge_down(struct net_device *

skge_led(skge, LED_MODE_OFF);

+ netif_poll_disable(dev);
skge_tx_clean(skge);
skge_rx_clean(skge);
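
The ordering the patch relies on can be modelled in user space. This is only a toy sketch, not the driver: the `toy_*` names and the single flag are assumptions standing in for the real poll scheduling state, and it does not model waiting for a poll that is already running. It only shows why polling is switched off before the receive ring is freed and switched back on after the ring exists again.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a NIC with a receive ring (hypothetical type). */
struct toy_nic {
    bool poll_enabled; /* models the effect of poll enable/disable */
    int *rx_ring;      /* models the descriptor ring */
};

/* Bring the device up: allocate the ring first, then allow polling. */
static int toy_up(struct toy_nic *nic, size_t entries)
{
    nic->rx_ring = calloc(entries, sizeof(*nic->rx_ring));
    if (!nic->rx_ring)
        return -1;
    nic->poll_enabled = true; /* mirrors netif_poll_enable() in skge_up() */
    return 0;
}

/* Poll routine: returns true only if it actually touched the ring. */
static bool toy_poll(struct toy_nic *nic)
{
    if (!nic->poll_enabled)
        return false; /* never reaches the (possibly freed) ring */
    (void)nic->rx_ring[0]; /* models reading rd->control */
    return true;
}

/* Take the device down: stop polling first, then free the ring. */
static void toy_down(struct toy_nic *nic)
{
    nic->poll_enabled = false; /* mirrors netif_poll_disable() in skge_down() */
    free(nic->rx_ring);
    nic->rx_ring = NULL;
}
```

If the two statements in `toy_down()` were swapped, or the flag were missing, a late `toy_poll()` would read freed memory, which is the pattern the oops points at.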


2006-08-12 19:36:10

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Saturday 12 August 2006 20:16, Edgar E. Iglesias wrote:
> On Sat, Aug 12, 2006 at 07:13:01PM +0200, Rafael J. Wysocki wrote:
> > Apparently it doesn't.
>
> Hi, could you try and see if this helps?

With the patch I can't reproduce the problem. I sometimes get the error
messages from the interrupt handler, but then it doesn't blow up in
skge_poll(), so I think the patch helps.

Thanks,
Rafael

2006-08-12 19:36:15

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Saturday 12 August 2006 16:32, Rafael J. Wysocki wrote:
> On Saturday 12 August 2006 15:39, Jeff Garzik wrote:
> > Andrew Morton wrote:
> > > It would be good if you could poke around in gdb, work out exactly which
> > > statement it's oopsing at, please.
> >
> > I'm also interested to know if the problem goes away when you disable
> > preempt...
>
> That will take some time to test. :-)

It's also reproducible with PREEMPT disabled.

2006-08-13 05:56:58

by Stephen Hemminger

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Sat, 12 Aug 2006 20:16:03 +0200
"Edgar E. Iglesias" <[email protected]> wrote:

> On Sat, Aug 12, 2006 at 07:13:01PM +0200, Rafael J. Wysocki wrote:
> > Apparently it doesn't.
>
> Hi, could you try and see if this helps?
>
> Best regards

That looks good, but needs a few more changes for full safety.
Kind of like the sky2 changes needed to get Mac Mini to work.

The machines I have with skge boards don't suspend properly, but that is
because of other problems.

2006-08-13 08:58:21

by Chuck Ebbert

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

In-Reply-To: <[email protected]>

On Sat, 12 Aug 2006 05:28:53 -0700, Andrew Morton wrote:

> > general protection fault: 0000 [1] PREEMPT
> > last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/subsystem_device
> > CPU 0
> > Modules linked in: ide_cd cdrom usbserial asus_acpi thermal ipv6 processor fan button battery ac af_packet snd_pcm_oss snd_mixer_oss snd_seq
> > snd_seq_device bcm43xx ieee80211softmac ieee80211 ieee80211_crypt pcmcia firmware_class ohci1394 ieee1394 skge yenta_socket rsrc_nonstatic pc
> > mcia_core usbhid ff_memless snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd ohci_hcd i2c_nfo
> > rce2 i2c_core parport_pc lp parport dm_mod
> > Pid: 4, comm: events/0 Not tainted 2.6.18-rc3-mm2 #17
> > RIP: 0010:[<ffffffff88107287>] [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> > RSP: 0018:ffffffff80621e70 EFLAGS: 00010202
> > RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: 0000000000000040
>
> RAX doesn't look good.
>
> > RDX: ffff81005addf128 RSI: ffffffff80621eec RDI: ffff81005addeb60
> > RBP: ffffffff80621ed0 R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000040 R11: 0000000000000000 R12: ffff81005addf0a0
> > R13: 0000000000000000 R14: ffff810057fe9180 R15: 00000000ffffffff
> > FS: 00002b4b98df4b00(0000) GS:ffffffff808c2000(0000) knlGS:00000000558b4d00
> > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > CR2: 00002adeb0d7d0b0 CR3: 0000000025147000 CR4: 00000000000006e0
> > Process events/0 (pid: 4, threadinfo ffff810037f44000, task ffff810037fef100)
> > Stack: ffffffff80621eb0 ffffffff80621eec ffff81005addeb60 ffff81005ad61488
> > ffff81005addf128 000000400000000a 00000001008e6a25 0000000000000000
> > ffff81005addeb60 0000000000000000 00000001008e6a25 00000000ffffffff
> > Call Trace:
> > [<ffffffff8040b1ba>] net_rx_action+0xba/0x1f0
> > [<ffffffff80233640>] __do_softirq+0x70/0xf0
> > [<ffffffff8020aa7c>] call_softirq+0x1c/0x30
> > DWARF2 unwinder stuck at call_softirq+0x1c/0x30
> > Leftover inexact backtrace:
> > <IRQ> [<ffffffff8020ca4d>] do_softirq+0x3d/0xb0
> > [<ffffffff8023349e>] irq_exit+0x4e/0x60
> > [<ffffffff8020cbf5>] do_IRQ+0x135/0x140
> > [<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
> > [<ffffffff8020a266>] ret_from_intr+0x0/0xf
> > <EOI> [<ffffffff80233367>] local_bh_enable_ip+0xe7/0x110
> > [<ffffffff804718b9>] _spin_unlock_bh+0x39/0x40
> > [<ffffffff80427b9e>] rt_run_flush+0x8e/0xd0
> > [<ffffffff80427c8b>] rt_cache_flush+0xab/0x100
> > [<ffffffff8045a1c9>] fib_netdev_event+0xa9/0xc0
> > [<ffffffff8023c2af>] notifier_call_chain+0x2f/0x50
> > [<ffffffff8023c4b9>] raw_notifier_call_chain+0x9/0x10
> > [<ffffffff80409789>] netdev_state_change+0x29/0x40
> > [<ffffffff80415122>] linkwatch_run_queue+0x162/0x190
> > [<ffffffff8041517a>] linkwatch_event+0x2a/0x40
> > [<ffffffff8023fd72>] run_workqueue+0xc2/0x120
> > [<ffffffff80415150>] linkwatch_event+0x0/0x40
> > [<ffffffff8023fff1>] worker_thread+0x121/0x160
> > [<ffffffff80229370>] default_wake_function+0x0/0x10
> > [<ffffffff8023fed0>] worker_thread+0x0/0x160
> > [<ffffffff802436f9>] kthread+0xd9/0x110
> > [<ffffffff8024b1ad>] trace_hardirqs_on+0x11d/0x150
> > [<ffffffff8020a706>] child_rip+0x8/0x12
> > [<ffffffff80471e5b>] _spin_unlock_irq+0x2b/0x60
> > [<ffffffff8020a2c0>] restore_args+0x0/0x30
> > [<ffffffff80243620>] kthread+0x0/0x110
> > [<ffffffff8020a6fe>] child_rip+0x0/0x12
> > Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
> > RIP [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> > RSP <ffffffff80621e70>
>
> ksymoops says:
>
> Code; ffffffff88107287 <_end+7ac9287/7efc2000>
> 00000000 <_EIP>:
> Code; ffffffff88107287 <_end+7ac9287/7efc2000> <=====
> 0: 44 inc %esp <=====
> Code; ffffffff88107288 <_end+7ac9288/7efc2000>
> 1: 8b 28 mov (%eax),%ebp

0x44 is a REX prefix in 64-bit mode, so somehow ksymoops got it
wrong and gave you an i386-mode decode instead of 64-bit mode.
Did you run it on an i386 machine and it assumed i386? Maybe you
need to use "-a x86-64"? (I can't make it work on my setup.)

So it's really "mov (%r8),%ebp" if I am reading the manual right.

--
Chuck

2006-08-13 17:39:04

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Sun, 13 Aug 2006 04:53:09 -0400
Chuck Ebbert <[email protected]> wrote:

> > > Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
> > > RIP [<ffffffff88107287>] :skge:skge_poll+0x547/0x570
> > > RSP <ffffffff80621e70>
> >
> > ksymoops says:
> >
> > Code; ffffffff88107287 <_end+7ac9287/7efc2000>
> > 00000000 <_EIP>:
> > Code; ffffffff88107287 <_end+7ac9287/7efc2000> <=====
> > 0: 44 inc %esp <=====
> > Code; ffffffff88107288 <_end+7ac9288/7efc2000>
> > 1: 8b 28 mov (%eax),%ebp
>
> 0x44 is a REX prefix in 64-bit mode, so somehow ksymoops got it
> wrong and gave you an i386-mode decode instead of 64-bit mode.
> Did you run it on a i386 machine and it assumed i386? Maybe you
> need to use "-a x86-64"? (I can't make it work on my setup.)
>
> So it's really "mov (%r8),%ebp" if I am reading the manual right.

I don't know what ksymoops's problem is. I noticed that without `-a' it
gave x86 code so I gave it `-a i386:x86-64' and didn't bother to read the
output ;) Seems that nothing I can do will persuade it to not treat this as
i386 code.

2006-08-14 00:21:49

by Keith Owens

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

Chuck Ebbert (on Sun, 13 Aug 2006 04:53:09 -0400) wrote:
>In-Reply-To: <[email protected]>
>
>On Sat, 12 Aug 2006 05:28:53 -0700, Andrew Morton wrote:
>
>> > Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
>>
>> ksymoops says:
>>
>> Code; ffffffff88107287 <_end+7ac9287/7efc2000>
>> 00000000 <_EIP>:
>> Code; ffffffff88107287 <_end+7ac9287/7efc2000> <=====
>> 0: 44 inc %esp <=====
>> Code; ffffffff88107288 <_end+7ac9288/7efc2000>
>> 1: 8b 28 mov (%eax),%ebp
>
>0x44 is a REX prefix in 64-bit mode, so somehow ksymoops got it
>wrong and gave you an i386-mode decode instead of 64-bit mode.
>Did you run it on a i386 machine and it assumed i386? Maybe you
>need to use "-a x86-64"? (I can't make it work on my setup.)
>
>So it's really "mov (%r8),%ebp" if I am reading the manual right.

ksymoops -VKLMO -t elf64-x86-64 -a i386:x86-64

ksymoops 2.4.11 on i686 2.6.16.21-0.13-smp. Options used
-V (specified)
-K (specified)
-L (specified)
-O (specified)
-M (specified)
-t elf64-x86-64 -a i386:x86-64

Warning (merge_maps): no symbols in merged map
Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9

Code; 0000000000000000 No symbols available
0000000000000000 <_RIP>:
Code; 0000000000000000 No symbols available
0: 44 8b 28 mov (%rax),%r13d
Code; 0000000000000003 No symbols available
3: c7 45 d0 00 00 00 00 movl $0x0,0xffffffffffffffd0(%rbp)
Code; 000000000000000a No symbols available
a: 45 85 ed test %r13d,%r13d
Code; 000000000000000d No symbols available
d: 0f 89 29 fb ff ff jns fffffffffffffb3c <_RIP+0xfffffffffffffb3c>
Code; 0000000000000013 No symbols available
13: e9 00 00 00 00 jmpq 18 <_RIP+0x18>
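
The first instruction of that decode can be checked by hand. In 64-bit mode a byte in the range 0x40..0x4f is a REX prefix with the layout 0100WRXB: REX.R extends the ModRM reg field and REX.B extends the ModRM r/m field. The sketch below is a small, assumption-laden helper (the struct and function names are made up here) that handles only an optional REX prefix followed by a one-byte opcode and a ModRM byte with no SIB byte:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for the fields of interest. */
struct modrm_regs {
    bool    has_rex;
    uint8_t mod; /* addressing mode, 0 means plain (base) with no displacement */
    uint8_t reg; /* register operand number, 0..15 */
    uint8_t rm;  /* base register number, 0..15 */
};

/* Decode [REX] opcode ModRM; the opcode byte itself is not interpreted. */
static struct modrm_regs decode_rex_modrm(const uint8_t *code)
{
    struct modrm_regs out = {0};
    uint8_t rex = 0;

    if ((code[0] & 0xf0) == 0x40) { /* REX prefix present */
        rex = code[0];
        out.has_rex = true;
        code++;
    }

    uint8_t modrm = code[1]; /* code[0] is the opcode */
    out.mod = modrm >> 6;
    out.reg = ((modrm >> 3) & 7) | ((rex & 0x04) ? 8 : 0); /* REX.R */
    out.rm  = (modrm & 7) | ((rex & 0x01) ? 8 : 0);        /* REX.B */
    return out;
}
```

For the bytes 44 8b 28, REX.R is set and REX.B is clear, so the register operand is number 13 (%r13d) and the base register is number 0 (%rax), which agrees with the `mov (%rax),%r13d` line in the 64-bit ksymoops output above.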

2006-08-14 00:35:30

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Mon, 14 Aug 2006 10:21:55 +1000
Keith Owens <[email protected]> wrote:

> ksymoops -VKLMO -t elf64-x86-64 -a i386:x86-64

box:/home/akpm> ksymoops -VKLMO -t elf64-x86-64 -a i386:x86-64 < x
ksymoops 2.4.11 on x86_64 2.6.17-rc5. Options used
-V (specified)
-K (specified)
-L (specified)
-O (specified)
-M (specified)
-t elf64-x86-64 -a i386:x86-64

Warning (merge_maps): no symbols in merged map
CPU 0
...
[<ffffffff80471e5b>] _spin_unlock_irq+0x2b/0x60
[<ffffffff8020a2c0>] restore_args+0x0/0x30
[<ffffffff80243620>] kthread+0x0/0x110
[<ffffffff8020a6fe>] child_rip+0x0/0x12
Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
Error (Oops_bfd_perror): /tmp/ksymoops.0lrVNY Invalid bfd target

box:/home/akpm> rpm -qi ksymoops
Name : ksymoops Relocations: (not relocatable)
Version : 2.4.11 Vendor: (none)
Release : 1 Build Date: Sat Jan 8 05:43:45 2005
Install Date: Wed Jun 28 16:59:45 2006 Build Host: ocs3.ocs.com.au
Group : Utilities/System Source RPM: ksymoops-2.4.11-1.src.rpm
Size : 542288 License: GPL
Signature : (none)
Summary : Kernel oops and error message decoder
Description :
The Linux kernel produces error messages that contain machine specific
numbers which are meaningless for debugging. ksymoops reads machine
specific files and the error log and converts the addresses to
meaningful symbols and offsets.

2006-08-14 00:54:16

by Keith Owens

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

Andrew Morton (on Sun, 13 Aug 2006 17:35:03 -0700) wrote:
>On Mon, 14 Aug 2006 10:21:55 +1000
>Keith Owens <[email protected]> wrote:
>
>> ksymoops -VKLMO -t elf64-x86-64 -a i386:x86-64
>
>box:/home/akpm> ksymoops -VKLMO -t elf64-x86-64 -a i386:x86-64 < x
>ksymoops 2.4.11 on x86_64 2.6.17-rc5. Options used
> -V (specified)
> -K (specified)
> -L (specified)
> -O (specified)
> -M (specified)
> -t elf64-x86-64 -a i386:x86-64
>
>Warning (merge_maps): no symbols in merged map
>CPU 0
>...
> [<ffffffff80471e5b>] _spin_unlock_irq+0x2b/0x60
> [<ffffffff8020a2c0>] restore_args+0x0/0x30
> [<ffffffff80243620>] kthread+0x0/0x110
> [<ffffffff8020a6fe>] child_rip+0x0/0x12
>Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
>Error (Oops_bfd_perror): /tmp/ksymoops.0lrVNY Invalid bfd target
>
>box:/home/akpm> rpm -qi ksymoops
>Name : ksymoops Relocations: (not relocatable)
>Version : 2.4.11 Vendor: (none)
>Release : 1 Build Date: Sat Jan 8 05:43:45 2005
>Install Date: Wed Jun 28 16:59:45 2006 Build Host: ocs3.ocs.com.au
>Group : Utilities/System Source RPM: ksymoops-2.4.11-1.src.rpm

Back in 2000 there were a lot of version problems between ksymoops and
libbfd and libiberty, so I statically link against these libraries when
I build the rpm. You have an i386 version of ksymoops, which was built
against an i386-only version of libbfd, so it does not support target
elf64-x86-64. Grab the ksymoops src.rpm and rebuild on x86_64, or use
a binary rpm from an x86_64 distribution.

2006-08-14 01:06:26

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

On Mon, 14 Aug 2006 10:54:21 +1000
Keith Owens <[email protected]> wrote:

> >Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
> >Error (Oops_bfd_perror): /tmp/ksymoops.0lrVNY Invalid bfd target
> >
> >box:/home/akpm> rpm -qi ksymoops
> >Name : ksymoops Relocations: (not relocatable)
> >Version : 2.4.11 Vendor: (none)
> >Release : 1 Build Date: Sat Jan 8 05:43:45 2005
> >Install Date: Wed Jun 28 16:59:45 2006 Build Host: ocs3.ocs.com.au
> >Group : Utilities/System Source RPM: ksymoops-2.4.11-1.src.rpm
>
> Back in 2000 there were a lot of version problems between ksymoops and
> libbfd and libiberty, so I statically link against these libraries when
> I build the rpm. You have an i386 version of ksymoops, which was built
> against an i386 only version of libbfd, it does not support target
> elf64-x86-64. Grab the ksymoops src.rpm and rebuild on x86_64, or use
> a binary rpm from an x86_64 distribution.

But would such a binary be able to decode i386 oopses?

ftp://ftp.kernel.org/pub/linux/utils/kernel/ksymoops/v2.4/ksymoops-2.4.11-1.src.rpm
fails to build, btw. Had to do s/Copyright/License/ in the spec file.

2006-08-14 01:10:15

by Keith Owens

Subject: Re: 2.6.18-rc3-mm2 (+ hotfixes): GPF related to skge on suspend

Andrew Morton (on Sun, 13 Aug 2006 18:06:02 -0700) wrote:
>On Mon, 14 Aug 2006 10:54:21 +1000
>Keith Owens <[email protected]> wrote:
>
>> >Code: 44 8b 28 c7 45 d0 00 00 00 00 45 85 ed 0f 89 29 fb ff ff e9
>> >Error (Oops_bfd_perror): /tmp/ksymoops.0lrVNY Invalid bfd target
>> >
>> >box:/home/akpm> rpm -qi ksymoops
>> >Name : ksymoops Relocations: (not relocatable)
>> >Version : 2.4.11 Vendor: (none)
>> >Release : 1 Build Date: Sat Jan 8 05:43:45 2005
>> >Install Date: Wed Jun 28 16:59:45 2006 Build Host: ocs3.ocs.com.au
>> >Group : Utilities/System Source RPM: ksymoops-2.4.11-1.src.rpm
>>
>> Back in 2000 there were a lot of version problems between ksymoops and
>> libbfd and libiberty, so I statically link against these libraries when
>> I build the rpm. You have an i386 version of ksymoops, which was built
>> against an i386 only version of libbfd, it does not support target
>> elf64-x86-64. Grab the ksymoops src.rpm and rebuild on x86_64, or use
>> a binary rpm from an x86_64 distribution.
>
>But would such a binary be able to decode i386 oopses?

It depends on your versions of bfdutils and binutils. ksymoops does
not decode the object itself, it uses bfd and objdump to do the work.
FWIW, the version of ksymoops in suselinux 10.0 for x86_64 will handle
both i386 and x86_64.

>ftp://ftp.kernel.org/pub/linux/utils/kernel/ksymoops/v2.4/ksymoops-2.4.11-1.src.rpm
>fails to build, btw. Had to do s/Copyright/License/ in the spec file.

Ah, the joys of changing RPM standards.