2013-03-13 02:31:29

by Daniel Kahn Gillmor

Subject: Linux IPMI subsystem hang

Hi Linux kernel folks (Corey is explicitly listed here because he is
the only contact i could find in IPMI.txt)--

I am working with a Lenovo ThinkCentre M78, model 4865-A14, and it seems
to have trouble with the IPMI subsystem.

udev seems to hang for about 3 minutes at startup, ultimately failing
with the following messages:

udevd[416]: worker [495] unexpectedly returned with status 0x0100
udevd[416]: worker [495] failed while handling '/devices/pci0000:00/0000:00:15.2/0000:03:00.3'

This hang happens whether i'm running linux kernel 3.2 or 3.8, using
either x86 or x86_64 kernels.

If i blacklist the following modules via /etc/modprobe.d/blacklist.conf,
then there is no hang on startup:

blacklist ipmi_si
blacklist ipmi_msghandler

The device in question appears this way to lspci:

03:00.3 IPMI SMIC interface [0c07]: Realtek Semiconductor Co., Ltd. Device [10ec:816c] (rev 01) (prog-if 01)
        Subsystem: Lenovo Device [17aa:3089]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin D routed to IRQ 17
        Region 0: I/O ports at e000 [size=256]
        Region 2: Memory at fea10000 (64-bit, non-prefetchable) [size=256]
        Region 4: Memory at fea00000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: ipmi_si

This machine is a workstation, and i would be surprised if it actually
has a full-blown IPMI subsystem. I can find no mention of IPMI in the
BIOS, or in Lenovo's hardware maintenance manual [0] or user guide [1]
for this model.

So i suspect that this device is being mistaken for an IPMI device, or
that it is actually a broken/malformed IPMI device that shouldn't be
initialized. Either that, or the IPMI module itself is doing something
wrong.

Can i do anything to help resolve this problem so that the machine boots
without delay and the ipmi subsystem doesn't get hung on this device?
Is there debugging output i can provide that would be useful, or any
other test you'd like me to run?

I'd be happy to report any requested details. Also, if you want to
redirect me to a better forum for resolving this problem that would be
appreciated.

I am not subscribed to LKML, so please CC me on any replies.

Regards,

--dkg

[0] http://download.lenovo.com/ibmdl/pub/pc/pccbbs/thinkcentre_pdf/m78_hmm.pdf
[1] http://download.lenovo.com/ibmdl/pub/pc/pccbbs/thinkcentre_pdf/m78_sff_ug_en.pdf



2013-03-15 18:57:39

by Daniel Kahn Gillmor

Subject: Re: Linux IPMI subsystem hang

On Tue 2013-03-12 22:23:37 -0400, Daniel Kahn Gillmor wrote:

> I am working with a Lenovo ThinkCentre M78, model 4865-A14, and it seems
> to have trouble with the IPMI subsystem.
>
> udev seems to hang for about 3 minutes at startup, ultimately failing
> with the following messages:
>
> udevd[416]: worker [495] unexpectedly returned with status 0x0100
> udevd[416]: worker [495] failed while handling '/devices/pci0000:00/0000:00:15.2/0000:03:00.3'
>
> This hang happens whether i'm running linux kernel 3.2 or 3.8, using
> either x86 or x86_64 kernels.

trying with udev 175-7.1 (from debian unstable) and kernel 3.2, i see
that the failure message is:

udevd[548]: timeout: killing '/sbin/modprobe -b pci:v000010ECd0000816Csv000017AAsd00003089bc0Csc07i01' [623]

and:

[ 5.650931] ipmi message handler version 39.2
[ 5.916958] IPMI System Interface driver.
[ 5.921153] ipmi_si 0000:03:00.3: probing via PCI
[ 5.925851] ipmi_si 0000:03:00.3: [io 0xe000-0xe0ff] regsize 1 spacing 1 irq 17
[ 5.933727] ipmi_si: Adding PCI-specified kcs state machine
[ 5.939554] ipmi_si: Trying PCI-specified kcs state machine at i/o address 0xe000, slave address 0x0, irq 17
[ 406.916061] ipmi_si: There appears to be no BMC at this location

with kernel 3.8, the last line ("There appears to be no BMC at this
location") isn't emitted, but the delay/hang with modprobe still
happens.

I think the first alias in ipmi_si.ko is what is causing this to be triggered:

0 krazy:~# modinfo ipmi_si | grep ^alias
alias: pci:v*d*sv*sd*bc0Csc07i*
alias: pci:v0000103Cd0000121Asv*sd*bc*sc*i*
0 krazy:~#

since the bc0Csc07 matches the [0c07] identifier from lspci:

> 03:00.3 IPMI SMIC interface [0c07]: Realtek Semiconductor Co., Ltd. Device [10ec:816c] (rev 01) (prog-if 01)
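
The match can be checked by hand. What follows is only a minimal Python sketch, not how modprobe itself is implemented: it assembles the modalias string from the IDs in the lspci line and compares it against the two ipmi_si aliases using shell-style wildcards. The helper name pci_modalias is made up for illustration.

```python
from fnmatch import fnmatchcase


def pci_modalias(vendor, device, subvendor, subdevice, base_class, subclass, prog_if):
    """Build a PCI modalias string (hypothetical helper, for illustration only)."""
    return (
        f"pci:v{vendor:08X}d{device:08X}"
        f"sv{subvendor:08X}sd{subdevice:08X}"
        f"bc{base_class:02X}sc{subclass:02X}i{prog_if:02X}"
    )


# IDs taken from the lspci output: device [10ec:816c], subsystem [17aa:3089],
# class [0c07], prog-if 01.
modalias = pci_modalias(0x10EC, 0x816C, 0x17AA, 0x3089, 0x0C, 0x07, 0x01)

# The two aliases reported by "modinfo ipmi_si".
ALIASES = [
    "pci:v*d*sv*sd*bc0Csc07i*",
    "pci:v0000103Cd0000121Asv*sd*bc*sc*i*",
]

# Keep only the aliases whose wildcard pattern matches this device.
matches = [alias for alias in ALIASES if fnmatchcase(modalias, alias)]

print(modalias)
print(matches)
```

The generated string is the same one udev passes to modprobe in the timeout message, and only the class-based alias (bc0Csc07) matches it; the second alias is tied to a different vendor and device ID.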

It seems like there are four plausible cases:

0) this is actually an IPMI device, but the hardware is broken.

1) this is an IPMI device, but it does not implement some part of the
   IPMI spec that ipmi_si.ko expects to be implemented, and ipmi_si.ko
   cannot detect this cleanly.

2) this device is not an IPMI device at all, and is mislabeled in its
   PCI identifiers somehow.

3) this device is not an IPMI device at all, it is properly labeled,
   and the module's internal aliasing (and lspci's index?) is
   overgeneral and misidentifies the device.

How can i distinguish between these cases?

> I am not subscribed to LKML, so please CC me on any replies.

Regards,

--dkg



2013-03-18 14:09:04

by Corey Minyard

Subject: Re: Linux IPMI subsystem hang

On 03/15/2013 01:57 PM, Daniel Kahn Gillmor wrote:
> On Tue 2013-03-12 22:23:37 -0400, Daniel Kahn Gillmor wrote:
>
>> I am working with a Lenovo ThinkCentre M78, model 4865-A14, and it seems
>> to have trouble with the IPMI subsystem.
>>
>> udev seems to hang for about 3 minutes at startup, ultimately failing
>> with the following messages:
>>
>> udevd[416]: worker [495] unexpectedly returned with status 0x0100
>> udevd[416]: worker [495] failed while handling '/devices/pci0000:00/0000:00:15.2/0000:03:00.3'
>>
>> This hang happens whether i'm running linux kernel 3.2 or 3.8, using
>> either x86 or x86_64 kernels.
> trying with udev 175-7.1 (from debian unstable) and kernel 3.2, i see
> that the failure message is:
>
> udevd[548]: timeout: killing '/sbin/modprobe -b pci:v000010ECd0000816Csv000017AAsd00003089bc0Csc07i01' [623]
>
> and:
>
> [ 5.650931] ipmi message handler version 39.2
> [ 5.916958] IPMI System Interface driver.
> [ 5.921153] ipmi_si 0000:03:00.3: probing via PCI
> [ 5.925851] ipmi_si 0000:03:00.3: [io 0xe000-0xe0ff] regsize 1 spacing 1 irq 17
> [ 5.933727] ipmi_si: Adding PCI-specified kcs state machine
> [ 5.939554] ipmi_si: Trying PCI-specified kcs state machine at i/o address 0xe000, slave address 0x0, irq 17
> [ 406.916061] ipmi_si: There appears to be no BMC at this location
>
> with kernel 3.8, the last line ("There appears to be no BMC at this
> location") isn't emitted, but the delay/hang with modprobe still
> happens.
>
> I think the first alias in ipmi_si.ko is what is causing this to be triggered:
>
> 0 krazy:~# modinfo ipmi_si | grep ^alias
> alias: pci:v*d*sv*sd*bc0Csc07i*
> alias: pci:v0000103Cd0000121Asv*sd*bc*sc*i*
> 0 krazy:~#
>
> since the bc0Csc07 matches the [0c07] identifier from lspci:
>
>> 03:00.3 IPMI SMIC interface [0c07]: Realtek Semiconductor Co., Ltd. Device [10ec:816c] (rev 01) (prog-if 01)
> It seems like there are four plausible cases:
>
> 0) this is actually an IPMI device, but the hardware is broken.
>
> 1) this is an IPMI device, but it does not implement some part of the
>    IPMI spec that ipmi_si.ko expects to be implemented, and ipmi_si.ko
>    cannot detect this cleanly.
>
> 2) this device is not an IPMI device at all, and is mislabeled in its
>    PCI identifiers somehow.
>
> 3) this device is not an IPMI device at all, it is properly labeled,
>    and the module's internal aliasing (and lspci's index?) is
>    overgeneral and misidentifies the device.
>
> How can i distinguish between these cases?

I would guess that the register spacing is wrong. The spec has a
protocol for determining register spacing, but according to the spec it
only works for KCS interfaces. Since this is a SMIC interface, it's not
implemented.

You can hardcode values in ipmi_pci_probe_regspacing() in
drivers/char/ipmi/ipmi_si_intf.c to see if that makes a difference. I'd
guess 4, but it might be 16. I can think about trying the protocol on
SMIC, perhaps it will work there, too.
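
To illustrate why the spacing matters, here is a minimal Python sketch. It assumes that the driver addresses register n at base + n * regspacing, and it looks at only the first two registers; both are assumptions made for illustration, and the function name register_ports is made up. The base address and region size come from the kernel log and lspci output quoted above.

```python
BASE = 0xE000        # I/O base from the kernel log ("i/o address 0xe000")
REGION_SIZE = 0x100  # size of PCI region 0 from lspci (256 bytes)


def register_ports(base, regspacing, nregs=2):
    """Return the port address of each register: base + index * regspacing."""
    return [base + index * regspacing for index in range(nregs)]


for spacing in (1, 4, 16):
    ports = register_ports(BASE, spacing)
    # Every candidate layout must still fit inside the 256-byte I/O region.
    assert all(port < BASE + REGION_SIZE for port in ports)
    print(spacing, [hex(port) for port in ports])
```

With spacing 1 the second register is expected at 0xe001, with spacing 4 at 0xe004, and with spacing 16 at 0xe010. If the hardware uses one of the wider layouts, a driver probing with spacing 1 would be reading and writing the wrong port for every register after the first.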

-corey