2002-11-27 00:25:39

by Steffen Persvold

Subject: Processor stuck in smp_call_function (arch/i386/kernel/smp.c)

Dear kernel experts,

On a couple of Dual Xeon E7500 based machines (SuperMicro motherboards) we
have been experiencing frequent lockups running compute and I/O intensive
tasks. It came to a point where I patched the 2.4.20-rc2 kernel with kdb
v2.5 by Keith Owens trying to find out what was happening.

Now, when the systems become unresponsive I'm able to enter kdb and get a
backtrace. It looks like this (trimmed down to what is IMHO the useful
info):

smp_call_function+0x83 (0xc01141e0, 0x0, 0x1, 0x1)
flush_tlb_all+0x14 ()
vmfree_area_pages+0x180 (0xf8c00000, 0x11000)
vfree+0x39 (0xf8a5e000)
release_segments+0x47 (0xf6435880)
exit_mmap+0x12 (0xf6435880)
mmput+0x5d (0xf6435880)
do_exit+0xd0 (0x200)

smp_call_function() is looping here :

    /* Wait for response */
    while (atomic_read(&data.started) != cpus)
        barrier();

So it seems a process is about to exit and the TLB is to be flushed.
However when the active cpu (cpu 0) waits for the flush_tlb_all_ipi()
function to start on cpu 1, it loops forever. In addition, trying to
switch to cpu 1 with kdb (with the 'cpu 1' command) results in an
'Invalid cpu number' error and I was told by Keith that this must be
because the other cpu hasn't responded to the kdb_ipi NMI.
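
To make the hang concrete, here is a minimal user-space sketch of the
handshake as I understand it. It is not the kernel code: the names
call_data, ipi_handler, flush_stub and wait_for_started are made up for
illustration, and the spin limit (max_spins) is an assumption added so the
sketch can report a missing CPU instead of looping forever the way the real
loop does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the data shared between the calling CPU
 * and the CPUs that receive the IPI. */
struct call_data {
    void (*func)(void *info);   /* function to run on the other CPUs */
    void *info;                 /* argument passed to func */
    atomic_int started;         /* number of CPUs that have picked up the call */
};

/* Hypothetical callback standing in for the TLB flush; it only counts calls. */
static void flush_stub(void *info)
{
    *(int *)info += 1;
}

/* What a responding CPU would do on receipt of the IPI:
 * acknowledge first, then run the requested function. */
static void ipi_handler(struct call_data *data)
{
    void (*func)(void *) = data->func;
    void *info = data->info;

    atomic_fetch_add(&data->started, 1);
    func(info);
}

/* Bounded version of the "Wait for response" loop. Returns true once
 * `cpus` CPUs have acknowledged, false if the spin budget runs out. */
static bool wait_for_started(struct call_data *data, int cpus, long max_spins)
{
    for (long i = 0; i < max_spins; i++) {
        if (atomic_load(&data->started) == cpus)
            return true;
    }
    return false;
}
```

If ipi_handler() never runs for cpu 1, started stays at 0 and
wait_for_started(&data, 1, ...) can only give up. The unbounded kernel loop
has no such exit, which matches what I see on cpu 0.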

It looks like cpu 1 has simply died for some reason. Why?

Has anyone experienced the same behaviour?

I would really appreciate any input on this.

PS.

I've attached the output of dmesg and lspci; I hope it is helpful. As you
can see, the system has a lot of IO-APICs. The reason is that on these
systems all of the E7500 MCH hub interfaces are populated:

Hub Interface A : ICH3 (main APIC)
Hub Interface B, C, D : P64H2 (each with two PCI-X busses and APICs).

IOAPIC #2 is the APIC on the ICH3
IOAPIC #3 is the 1st APIC on Hub interface B
IOAPIC #4 is the 2nd APIC on Hub interface B
IOAPIC #5 is the 1st APIC on Hub interface C
IOAPIC #8 is the 2nd APIC on Hub interface C
IOAPIC #9 is the 1st APIC on Hub interface D
IOAPIC #10 is the 2nd APIC on Hub interface D

The PCI device where most of the IO is performed is located on the 2nd
APIC on Hub interface D.

DS

Thanks,
--
Steffen Persvold | Scali AS
mailto:[email protected] | http://www.scali.com
Tel: (+47) 2262 8950 | Olaf Helsets vei 6
Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY



Attachments:
dmesg (22.47 kB)
lspci (1.89 kB)