Hello,
Running hpl with openmpi over Infiniband gets me a crash.
Using hpl, openmpi 1.0.2, openib, and the 2.6.17-rc3 kernel.
I don't see the crash under ip over ib (ran for over an hour),
the crash occurs immediately upon attempting to start xhpl.
Here is the crash captured via the serial port:
[ 144.713555] ----------- [cut here ] --------- [please bite here ]
---------
[ 144.720550] Kernel BUG at drivers/infiniband/hw/ipath/ipath_layer.c:757
[ 144.727205] invalid opcode: 0000 [1] SMP
[ 144.731334] CPU 0
[ 144.733419] Modules linked in: ipv6 autofs4 adm1026 hwmon_vid
i2c_piix4 nfs lockd nfs_acl sunrpc dm_mirror dm_multipath dm_mod button
battery ac ohci_hcd ehci_hcd i2c_nforce2 i2c_core shpchp snd_intel8x0
snd_ac97_codec snd_ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_timer
snd soundcore snd_page_alloc ib_ipoib ib_ipath ipath_core ib_uverbs
ib_umad ib_ucm ib_sa ib_cm ib_mad ib_core tg3 floppy sata_svw ext3 jbd
sata_nv libata sd_mod scsi_mod
[ 144.774643] Pid: 4771, comm: xhpl Not tainted 2.6.17-rc3 #1
[ 144.780244] RIP: 0010:[<ffffffff880f6984>]
<ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
[ 144.788858] RSP: 0018:ffffffff8051be38 EFLAGS: 00010246
[ 144.794409] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff8100df4a0150
[ 144.801574] RDX: ffffc200003b1078 RSI: 0000000000000000 RDI:
ffff8100df4a0150
[ 144.808742] RBP: 0000000000000000 R08: ffff8100df4a0158 R09:
0000000000000018
[ 144.815910] R10: 0000000000000018 R11: 0000000000000246 R12:
ffffc2000026f020
[ 144.823071] R13: 0000000000000000 R14: 0000000000000018 R15:
0000000000000000
[ 144.830230] FS: 00002b750d6fcca0(0000) GS:ffffffff805ad000(0000)
knlGS:0000000000000000
[ 144.838398] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 144.844190] CR2: 000000000047f050 CR3: 000000000cdb1000 CR4:
00000000000006e0
[ 144.851370] Process xhpl (pid: 4771, threadinfo ffff81000ceba000,
task ffff81000cc9e880)
[ 144.859504] Stack: ffffffff8059d900 ffff8100df4a0150 00000018dfef1000
ffff8100df4a0120
[ 144.867549] ffff8100df4a0000 ffffffff805f7d88 ffff8100df4a0098
0000000000000038
[ 144.875829] 0000000000000400 ffffffff8811869e
[ 144.881079] Call Trace: <IRQ>
<ffffffff8811869e>{:ib_ipath:ipath_do_rc_send+348}
[ 144.888727] <ffffffff80232548>{do_timer+58}
<ffffffff8020d0bb>{main_timer_handler+493}
[ 144.897498] <ffffffff8022efc6>{tasklet_hi_action+105}
<ffffffff8022ebc4>{__do_softirq+80}
[ 144.906525] <ffffffff8020aa5a>{call_softirq+30}
<ffffffff8020bc0a>{do_softirq+47}
[ 144.914854] <ffffffff8020bbd1>{do_IRQ+62}
<ffffffff80209b96>{ret_from_intr+0} <EOI>
[ 144.923395] <ffffffff8026aafe>{kfree+417}
<ffffffff880e14ff>{:ib_uverbs:ib_uverbs_poll_cq+409}
[ 144.932867] <ffffffff880dfa27>{:ib_uverbs:ib_uverbs_write+196}
<ffffffff8026f746>{vfs_write+212}
[ 144.942509] <ffffffff8026f897>{sys_write+69}
<ffffffff80209612>{system_call+126}
[ 144.950997]
[ 144.950998] Code: 0f 0b 68 84 21 10 88 c2 f5 02 eb 07 44 39 f3 41 0f
47 de 48
[ 144.960709] RIP <ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
RSP <ffffffff8051be38>
[ 144.969212] <3>BUG: sleeping function called from invalid context at
include/linux/rwsem.h:43
[ 144.977952] in_atomic():1, irqs_disabled():0
[ 144.982261]
[ 144.982262] Call Trace: <IRQ> <ffffffff80221daa>{__might_sleep+190}
[ 144.990056] <ffffffff80216103>{flat_send_IPI_mask+0}
<ffffffff80236073>{blocking_notifier_call_chain+31}
[ 145.000411] <ffffffff8022c2ee>{do_exit+34}
<ffffffff80423c6f>{_spin_unlock_irqrestore+11}
[ 145.009454] <ffffffff8020b027>{do_divide_error+0}
<ffffffff8020b22e>{do_invalid_op+145}
[ 145.018334] <ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
[ 145.025102] <ffffffff803f7d02>{tcp_v4_do_rcv+43}
<ffffffff88092128>{:tg3:tg3_interrupt_tagged+51}
[ 145.034840] <ffffffff8020a551>{error_exit+0}
<ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
[ 145.044606] <ffffffff880f6b40>{:ipath_core:ipath_verbs_send+806}
[ 145.051390] <ffffffff8811869e>{:ib_ipath:ipath_do_rc_send+348}
<ffffffff80232548>{do_timer+58}
[ 145.060897] <ffffffff8020d0bb>{main_timer_handler+493}
<ffffffff8022efc6>{tasklet_hi_action+105}
[ 145.070569] <ffffffff8022ebc4>{__do_softirq+80}
<ffffffff8020aa5a>{call_softirq+30}
[ 145.079121] <ffffffff8020bc0a>{do_softirq+47}
<ffffffff8020bbd1>{do_IRQ+62}
[ 145.086947] <ffffffff80209b96>{ret_from_intr+0} <EOI>
<ffffffff8026aafe>{kfree+417}
[ 145.095523] <ffffffff880e14ff>{:ib_uverbs:ib_uverbs_poll_cq+409}
[ 145.102291] <ffffffff880dfa27>{:ib_uverbs:ib_uverbs_write+196}
<ffffffff8026f746>{vfs_write+212}
[ 145.111972] <ffffffff8026f897>{sys_write+69}
<ffffffff80209612>{system_call+126}
[ 145.120482] Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 145.127265]
/proc/interrupts looks like this:
CPU0 CPU1 CPU2 CPU3
0: 107714 110040 109206 113504 IO-APIC-edge timer
1: 417 1287 405 1627 IO-APIC-edge i8042
8: 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 IO-APIC-level acpi
15: 50 0 0 23 IO-APIC-edge ide1
50: 0 0 0 0 IO-APIC-level
libata, ohci_hcd:usb2
58: 0 0 0 0 IO-APIC-level libata
66: 0 0 0 0 IO-APIC-level libata
74: 15625 0 0 11 IO-APIC-level eth0
90: 551 0 0 0 IO-APIC-level
ipath_core
98: 0 0 0 0 IO-APIC-level
NVidia CK804
233: 249 904 1161 4180 IO-APIC-level
libata, ehci_hcd:usb1
NMI: 107 124 406 483
LOC: 440388 440365 440341 440317
ERR: 0
MIS: 0
Any ideas?
Roger
On Mon, 2006-05-08 at 16:36 -0500, Roger Heflin wrote:
> I don't see the crash under ip over ib (ran for over an hour),
> the crash occurs immediately upon attempting to start xhpl.
Hi, Roger -
Thanks for the report. We've recently fixed some locking problems that
may help with this, but I haven't fed them to Roland yet. If you need
some updated code to test before I sort out and send patches to Roland,
please let me know.
Thanks,
<b