We have a number of Intel x3550 servers (Intel 5000-series). They've
been running 3.7.2 fine.
In the last week I've run 3.8.11, 3.8.12 and 3.9.2 on them. All have
long hangs at boot, and later hung tasks in modprobe.
I've only just started doing proper diagnostics on 3.9.2. In the first
run the hang was loading the "radeon" and "tpm" modules. I blacklisted
these and rebooted. Now it's stuck on "i5000_edac" and "i801_smbus". So
this is starting to smell like something lower-level, maybe down in the
chipset somewhere.
I'm about to try a bisect. First time I've done so on the kernel, so I
don't quite know how it will go. I'll report back if I manage to figure
out where the problem is coming from. I'd still appreciate any guidance
you might have though!
Cheers,
Rob N.
On Mon, May 13, 2013 at 11:22:32AM +1000, Robert Norris wrote:
> We have a number of Intel x3550 servers (Intel 5000-series). They've
> been running 3.7.2 fine.
>
> In the last week I've run 3.8.11, 3.8.12 and 3.9.2 on them. All have
> long hangs at boot, and later hung tasks in modprobe.
I bisected this and tracked it to this commit:
commit 6676a847d48ac48908cf467b42da9045b5463a6e
Author: Jean Delvare <[email protected]>
Date: Sun Dec 16 21:11:55 2012 +0100
i2c-i801: Enable interrupts for all post-ICH5 chips
I did not receive a single bug report after interrupt support was
added for a limited number of chips. So I'd say the code is good and
should be enabled for all supported chips, that is: ICH5 and later.
Signed-off-by: Jean Delvare <[email protected]>
Reviewed-by: Daniel Kurtz <[email protected]>
I've tested by building 3.9.2 with that single commit reverted, and it
boots without issue.
According to lspci I have:
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
Which has PCI ID 0x269b (ie PCI_DEVICE_ID_INTEL_ESB2_17).
For now I will either revert this commit in my kernel builds or
blacklist the module on these machines (I haven't decided which I prefer
yet).
Obviously, I can reproduce this reliably, and am happy to test.
Cheers,
Rob N.
Hi Robert,
Adding the linux-i2c list to Cc.
On Wed, 15 May 2013 09:16:26 +1000, Robert Norris wrote:
> On Mon, May 13, 2013 at 11:22:32AM +1000, Robert Norris wrote:
> > We have a number of Intel x3550 servers (Intel 5000-series). They've
> > been running 3.7.2 fine.
> >
> > In the last week I've run 3.8.11, 3.8.12 and 3.9.2 on them. All have
> > long hangs at boot, and later hung tasks in modprobe.
>
> I bisected this and tracked it to this commit:
>
> commit 6676a847d48ac48908cf467b42da9045b5463a6e
> Author: Jean Delvare <[email protected]>
> Date: Sun Dec 16 21:11:55 2012 +0100
>
> i2c-i801: Enable interrupts for all post-ICH5 chips
>
> I did not receive a single bug report after interrupt support was
> added for a limited number of chips. So I'd say the code is good and
> should be enabled for all supported chips, that is: ICH5 and later.
>
> Signed-off-by: Jean Delvare <[email protected]>
> Reviewed-by: Daniel Kurtz <[email protected]>
>
> I've tested by building 3.9.2 with that single commit reverted, and it
> boots without issue.
Thanks a lot for reporting and even more for bisecting it, I know it
takes time. I apologize for the trouble. I suppose I should have been a
bit more cautious with the 63xxESB chips as they are a different family
of hardware.
> According to lspci I have:
>
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
>
> Which has PCI ID 0x269b (ie PCI_DEVICE_ID_INTEL_ESB2_17).
Can you share the full output of lspci -s 00:1f.3 -vv?
I'm also curious if the SMBus controller shares its interrupt line with
another chip. /proc/interrupts should tell but you'll have to make one
of your systems hang again.
> For now I will either revert this commit in my kernel builds or
> blacklist the module on these machines (I haven't decided which I prefer
> yet).
You can also pass the parameter disable_features=0x10 to the i2c-i801
driver; this will disable interrupt support without having to rebuild
the driver. I suppose this could be documented in more detail in
modinfo, I'll work on that.
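For reference, a minimal sketch of how the disable_features bitmask could be read. Only 0x10 (interrupts) is confirmed in this thread; the other bit meanings are recalled from the i2c-i801 module parameter description and should be treated as assumptions. The helper names are hypothetical.

```python
# Bit meanings: only 0x10 is confirmed by this thread; the others are
# assumptions recalled from the i2c-i801 modinfo text.
DISABLE_FEATURE_BITS = {
    0x01: "SMBus PEC",
    0x02: "block buffer",
    0x08: "I2C block read",
    0x10: "interrupts",
}


def describe_disable_features(mask):
    """Return the names of the features a disable_features mask would turn off."""
    return [name for bit, name in sorted(DISABLE_FEATURE_BITS.items()) if mask & bit]


def modprobe_options_line(mask):
    """Build an 'options' line of the kind placed in a file under /etc/modprobe.d/."""
    return f"options i2c-i801 disable_features={mask:#04x}"


print(describe_disable_features(0x10))
print(modprobe_options_line(0x10))
```

Putting the generated options line in a modprobe configuration file would make the setting persistent across reboots, which avoids having to pass it by hand on every modprobe.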
> Obviously, I can reproduce this reliably, and am happy to test.
Thanks for the offer. Right now I am stuck in bed and must take some
rest. When I feel better I'll see if I can gain access to systems with
Intel 63xxESB chips to try and reproduce the hang you're seeing. I'll
also take a look at the datasheets again to see if any difference stands
out.
For the time being I plan to simply disable interrupt support again for
the ESB chips, until we fully understand what happens on your systems.
As far as debugging goes, please tell me if you have any I2C/SMBus
slave device driver loaded (check in /sys/bus/i2c/drivers.) Loading the
i2c-i801 driver doesn't do much on its own if there are no slave device
drivers using it.
Thanks,
--
Jean Delvare
Hi Jean,
On Wed, May 15, 2013 at 11:20:44AM +0200, Jean Delvare wrote:
> Thanks a lot for reporting and even more for bisecting it, I know it
> takes time. I apologize for the trouble. I suppose I should have been
> a bit more cautious with the 63xxESB chips as they are a different
> family of hardware.
No problem! It was kind of fun actually ;)
> Can you share the full output of lspci -s 00:1f.3 -vv?
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
        Subsystem: IBM Device 02dd
        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin B routed to IRQ 0
        Region 4: I/O ports at 0440 [size=32]
> I'm also curious if the SMBus controller shares its interrupt line
> with another chip. /proc/interrupts should tell but you'll have to
> make one of your systems hang again.
I'm not sure how to read it, so here it is (3.9.2, immediately after
boot, no options to i2c_i801):
           CPU0       CPU1       CPU2       CPU3
  0:         42          0          0          0   IO-APIC-edge      timer
  1:          0          0          0          0   IO-APIC-edge      i8042
  4:          1          1          0          0   IO-APIC-edge
  8:          0          1          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 14:          0          0          0          0   IO-APIC-edge      ata_piix
 15:          0          0          0          0   IO-APIC-edge      ata_piix
 17:       1225       1124       1113       1111   IO-APIC-fasteoi   aacraid
 20:          0          0          0          0   IO-APIC-fasteoi   i801_smbus
 22:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2, radeon
 23:         25         21         27         29   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb4
 41:         79          8          5          4   PCI-MSI-edge      eth2
 42:          1          2          1          4   PCI-MSI-edge      eth3
 43:          0          2          1          1   PCI-MSI-edge      ioat-msi
 44:         98        107        111        111   PCI-MSI-edge      eth1
 45:       1178       1210       1218       1215   PCI-MSI-edge      eth0
NMI:          4          5          3          4   Non-maskable interrupts
LOC:       3685       3953       6895       8014   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          4          5          3          4   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RTR:          0          0          0          0   APIC ICR read retries
RES:       6352       5546       6942       7790   Rescheduling interrupts
CAL:        975       1256        973       1488   Function call interrupts
TLB:        682        964        732       1003   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:          1          1          1          1   Machine check polls
ERR:          0
MIS:          0
> You can also pass parameter disable_features=0x10 to the i2c-i801
> driver, this will disable interrupt support without having to rebuild
> the driver. I suppose this could be documented in more detail in
> modinfo, I'll work on that.
I went with blacklisting for now because this driver doesn't appear to
be doing anything useful for us (sensors etc are working without it).
I'll confess to not really knowing much about its purpose though.
> Thanks for the offer. Right now I am stuck in bed and must take some
> rest. When I feel better I'll see if I can gain access to systems with
> Intel 63xxESB chips to try and reproduce the hang you're seeing. I'll
> also take a look at the datasheets again to see if any difference
> stands out.
We'd be happy to give you access to one of our x3550s if you like (the
same one I did the bisect on). We'd move it outside our production
network and reinstall it and you'd be free to poke and prod and crash it
as much as you like. Let me know when/if you're interested and we'll
make it happen. No hurry from our end though, it's a barely-used machine
and will happily sit there waiting. Get your rest first!
> As far as debugging goes, please tell me if you have any I2C/SMBus
> slave device driver loaded (check in /sys/bus/i2c/drivers.) Loading the
> i2c-i801 driver doesn't do much on its own if there are no slave device
> drivers using it.
$ modprobe i2c-i801 disable_features=0x10
$ dmesg | tail
...
[28876.193408] i801_smbus 0000:00:1f.3: Interrupt disabled by user
[28876.201168] ics932s401 4-0069: ics932s401 chip found
$ ls /sys/bus/i2c/drivers
dummy ics932s401
Thanks for your help!
Cheers,
Rob.
Robert,
On Wed, 15 May 2013 21:27:41 +1000, Robert Norris wrote:
> On Wed, May 15, 2013 at 11:20:44AM +0200, Jean Delvare wrote:
> > Can you share the full output of lspci -s 00:1f.3 -vv?
>
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
> Subsystem: IBM Device 02dd
> Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin B routed to IRQ 0
Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
reason for this hang. Was it with the i2c-i801 driver loaded, or
blacklisted? Please check if it makes a difference.
Do you see the same (and more generally, this issue) on one, some or
all of your x3550 servers?
Are you using IPMI on these machines?
> Region 4: I/O ports at 0440 [size=32]
>
> > I'm also curious if the SMBus controller shares its interrupt line
> > with another chip. /proc/interrupts should tell but you'll have to
> > make one of your systems hang again.
>
> I'm not sure how to read it, so here it is (3.9.2, immediately after
> boot, no options to i2c_i801):
>
> CPU0 CPU1 CPU2 CPU3
> (...)
> 20: 0 0 0 0 IO-APIC-fasteoi i801_smbus
Here the IRQ looks correct, and it isn't shared. But I am surprised
that the counters are all 0. If an SMBus transaction had been
attempted, there should be a 1 somewhere, even if the transaction
ultimately failed.
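For readers unsure how to read /proc/interrupts, here is a minimal sketch that parses text in that format and reports, for each numbered IRQ, whether it is shared and whether it has ever fired. The function name and the embedded sample (three rows taken from the dump above) are illustrative only.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: (per_cpu_counts, handlers)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        fields = rest.split()
        # Keep only numbered IRQ rows that carry one counter per CPU.
        if not label.isdigit() or len(fields) < ncpus:
            continue
        counts = [int(f) for f in fields[:ncpus]]
        # fields[ncpus] is the controller type (e.g. IO-APIC-fasteoi);
        # the remainder is the comma-separated list of handlers.
        names = " ".join(fields[ncpus + 1:])
        handlers = [h.strip() for h in names.split(",") if h.strip()]
        table[label] = (counts, handlers)
    return table


SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 17:       1225       1124       1113       1111   IO-APIC-fasteoi   aacraid
 20:          0          0          0          0   IO-APIC-fasteoi   i801_smbus
 23:         25         21         27         29   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb4
"""

for irq, (counts, handlers) in parse_interrupts(SAMPLE).items():
    shared = "shared" if len(handlers) > 1 else "not shared"
    fired = "has fired" if sum(counts) else "never fired"
    print(f"IRQ {irq}: {', '.join(handlers)} ({shared}, {fired})")
```

A line with more than one handler name means the IRQ is shared; a row of zeros means no interrupt has been counted on any CPU for that line.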
> (...)
> I went with blacklisting for now because this driver doesn't appear to
> be doing anything useful for us (sensors etc are working without it).
> I'll confess to not really knowing much about its purpose though.
It all depends on what I2C/SMBus slaves are connected to the SMBus.
Often there are the SPD EEPROMs from your memory modules, sometimes
with integrated thermal sensors (on DDR3 only - driver is jc42.) And in
your case a clock chip as well, for which IBM contributed a driver.
> > (...)
> > As far as debugging goes, please tell me if you have any I2C/SMBus
> > slave device driver loaded (check in /sys/bus/i2c/drivers.) Loading the
> > i2c-i801 driver doesn't do much on its own if there are no slave device
> > drivers using it.
>
> $ modprobe i2c-i801 disable_features=0x10
> $ dmesg | tail
> ...
> [28876.193408] i801_smbus 0000:00:1f.3: Interrupt disabled by user
> [28876.201168] ics932s401 4-0069: ics932s401 chip found
> $ ls /sys/bus/i2c/drivers
> dummy ics932s401
The dummy driver is a helper stub for i2c-core, it doesn't actually
access the SMBus. ics932s401 is for the clock chip, and I know clock
chips can be tricky and error prone. OTOH I can only guess that IBM had
a good reason to contribute the driver and make it auto-load on the
x3550.
I would appreciate if you could test the following:
* Blacklist i2c-i801 and ics932s401 so that none of them get
auto-loaded.
* Manually load i2c-i801 with interrupts enabled, and see what happens.
* If no hang happens, load i2c-dev, find the i801 bus number with
i2cdetect -l (from the i2c-tools package - it should be 4 according
to what you reported so far but there is no guarantee that it won't
change across reboots.) Then do a simple read from a random address
with:
# i2cget 4 0x50 0x00
(Adjust the bus number as needed.)
I am curious if this will hang as well or only when accessing the
clock chip at address 0x69.
Thanks,
--
Jean Delvare
On Wed, May 15, 2013 at 09:49:23PM +0200, Jean Delvare wrote:
> > Interrupt: pin B routed to IRQ 0
>
> Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
> reason for this hang. Was it with the i2c-i801 driver loaded, or
> blacklisted? Please check if it makes a difference.
That was without the driver loaded (blacklisted). After loading (with
interrupts enabled) we get:
Interrupt: pin B routed to IRQ 20
> Do you see the same (and more generally, this issue) on one, some or
> all of your x3550 servers?
The issue has occurred on at least three x3550s (we have 11). I haven't
tested more, because knowingly crashing production machines sucks.
This appears to be the case on other machines. With the module
blacklisted (never loaded), lspci shows IRQ 0. After load, IRQ 20.
(tested on 3.4 and 3.9).
> Are you using IPMI on these machines?
Yes, but only for monitoring/sensors, if that makes a difference.
> I would appreciate if you could test the following:
> * Blacklist i2c-i801 and ics932s401 so that none of them get
> auto-loaded.
Done.
> * Manually load i2c-i801 with interrupts enabled, and see what
> happens.
Returned immediately:
[ 60.527140] i801_smbus 0000:00:1f.3: SMBus using PCI Interrupt
> * If no hang happens, load i2c-dev, find the i801 bus number with
> i2cdetect -l (from the i2c-tools package - it should be 4 according
> to what you reported so far but there is no guarantee that it won't
> change across reboots.)
$ i2cdetect -l
i2c-0 i2c Radeon i2c bit bus DVI_DDC I2C adapter
i2c-1 i2c Radeon i2c bit bus VGA_DDC I2C adapter
i2c-2 i2c Radeon i2c bit bus MONID I2C adapter
i2c-3 i2c Radeon i2c bit bus CRT2_DDC I2C adapter
i2c-4 smbus SMBus I801 adapter at 0440 SMBus adapter
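Since the bus number may change across reboots, the lookup can be scripted. The following is a minimal sketch that finds the I801 adapter in `i2cdetect -l` output; the function name is hypothetical and the sample text is the listing above with its whitespace as it appears in this thread.

```python
import re


def find_i801_bus(i2cdetect_output):
    """Return the bus number of the first adapter mentioning I801, else None."""
    for line in i2cdetect_output.splitlines():
        m = re.match(r"i2c-(\d+)\s+(\S+)\s+(.*)", line.strip())
        if m and "I801" in m.group(3):
            return int(m.group(1))
    return None


SAMPLE = """\
i2c-0 i2c Radeon i2c bit bus DVI_DDC I2C adapter
i2c-1 i2c Radeon i2c bit bus VGA_DDC I2C adapter
i2c-2 i2c Radeon i2c bit bus MONID I2C adapter
i2c-3 i2c Radeon i2c bit bus CRT2_DDC I2C adapter
i2c-4 smbus SMBus I801 adapter at 0440 SMBus adapter
"""

bus = find_i801_bus(SAMPLE)
if bus is not None:
    # Only prints the command to run; it does not touch the bus itself.
    print(f"i2cget {bus} 0x50 0x00")
```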
> Then do a simple read from a random address
> with:
> # i2cget 4 0x50 0x00
> (Adjust the bus number as needed.)
> I am curious if this will hang as well or only when accessing the
> clock chip at address 0x69.
Yep, that one hangs. The hung task handler picked it up after a few
minutes.
Cheers,
Rob.
Hi Robert,
On Thu, 16 May 2013 13:44:55 +1000, Robert Norris wrote:
> On Wed, May 15, 2013 at 09:49:23PM +0200, Jean Delvare wrote:
> > > Interrupt: pin B routed to IRQ 0
> >
> > Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
> > reason for this hang. Was it with the i2c-i801 driver loaded, or
> > blacklisted? Please check if it makes a difference.
>
> That was without the driver loaded (blacklisted). After loading (with
> interrupts enabled) we get:
>
> Interrupt: pin B routed to IRQ 20
For the record, I also see the IRQ value change after loading the
i2c-i801 driver on my system (with an ICH10 south bridge.) From 14 to
22 in my case. So it's a bit different (no IRQ 0) but still
somewhat similar, so I'm still not sure if this has anything to do with
your issue.
>
> > Do you see the same (and more generally, this issue) on one, some or
> > all of your x3550 servers?
>
> The issue has occurred on at least three x3550s (we have 11). I haven't
> tested more, because knowingly crashing production machines sucks.
Yes of course, I understand, I did not expect you to do that ;)
> This appears to be the case on other machines. With the module
> blacklisted (never loaded), lspci shows IRQ 0. After load, IRQ 20.
> (tested on 3.4 and 3.9).
OK.
> > Are you using IPMI on these machines?
>
> Yes, but only for monitoring/sensors, if that makes a difference.
IPMI is still likely to access the SMBus controller. If there's a BMC
in the machine, it can also access the SMBus slave with its own
controller. It would be good to rule this out by disabling IPMI
completely, removing the BMC from the machine if it has one, and
checking if it makes the issue go away or not.
> > I would appreciate if you could test the following:
> > * Blacklist i2c-i801 and ics932s401 so that none of them get
> > auto-loaded.
>
> Done.
>
> > * Manually load i2c-i801 with interrupts enabled, and see what
> > happens.
>
> Returned immediately:
>
> [ 60.527140] i801_smbus 0000:00:1f.3: SMBus using PCI Interrupt
This confirms that the i2c-i801 driver loading itself isn't the problem.
> > * If no hang happens, load i2c-dev, find the i801 bus number with
> > i2cdetect -l (from the i2c-tools package - it should be 4 according
> > to what you reported so far but there is no guarantee that it won't
> > change across reboots.)
>
> $ i2cdetect -l
> i2c-0 i2c Radeon i2c bit bus DVI_DDC I2C adapter
> i2c-1 i2c Radeon i2c bit bus VGA_DDC I2C adapter
> i2c-2 i2c Radeon i2c bit bus MONID I2C adapter
> i2c-3 i2c Radeon i2c bit bus CRT2_DDC I2C adapter
> i2c-4 smbus SMBus I801 adapter at 0440 SMBus adapter
>
> > Then do a simple read from a random address
> > with:
> > # i2cget 4 0x50 0x00
> > (Adjust the bus number as needed.)
> > I am curious if this will hang as well or only when accessing the
> > clock chip at address 0x69.
>
> Yep, that one hangs. The hung task handler picked it up after a few
> minutes.
OK, this means that any transaction request to the SMBus controller
causes the hang.
The i2c-i801 driver is optimistically using wait_event() when waiting
for an interrupt to arrive. I suppose that the interrupt is never
delivered in your case (all 0 in /proc/interrupts.)
Daniel, shouldn't we use wait_event_timeout() instead to catch issues
like this and fail cleanly? Maybe even fall back to polling
automatically?
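To illustrate the idea, here is a user-space sketch in Python, not kernel code: threading.Event.wait() with a timeout stands in for wait_event_timeout(), and a status callback stands in for polling the controller. All names and timing values are hypothetical.

```python
import threading
import time


def wait_for_transaction(irq_event, status_done, timeout_s=0.05,
                         poll_interval_s=0.005, max_polls=10):
    """Wait for a completion 'interrupt'; if none arrives in time, poll instead.

    Returns "irq", "polled" or "timeout" so the caller can fail cleanly.
    """
    # Analogue of a timed wait: returns False if the event never fires.
    if irq_event.wait(timeout_s):
        return "irq"
    # Fallback: check the status callback a bounded number of times.
    for _ in range(max_polls):
        if status_done():
            return "polled"
        time.sleep(poll_interval_s)
    return "timeout"


never_fires = threading.Event()  # models an interrupt that is never delivered
print(wait_for_transaction(never_fires, status_done=lambda: True))
```

With an unbounded wait the caller would block forever in the first step; the bounded wait turns the same situation into either a polled completion or a clean error.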
--
Jean Delvare
On Thu, 16 May 2013 13:44:55 +1000, Robert Norris wrote:
> > Then do a simple read from a random address
> > with:
> > # i2cget 4 0x50 0x00
> > (Adjust the bus number as needed.)
> > I am curious if this will hang as well or only when accessing the
> > clock chip at address 0x69.
>
> Yep, that one hangs. The hung task handler picked it up after a few
> minutes.
Hmm, can you please dump the PCI configuration space of the SMBus
controller?
# /sbin/lspci -s 00:1f.3 -xxx
It might be that you have interrupts routed to SMI# instead of the
regular IRQ line. The driver is supposed to disable interrupt support
in that case, but I have never tested it.
Thanks,
--
Jean Delvare
Hi,
while you are chasing a problem with i2c_i801, I would like to mention
that I never got an answer on the thread https://lkml.org/lkml/2013/1/23/405
about a kmemleak reported by the kernel. Maybe this could give you a hint?
If the two do not overlap, I would still be glad to receive an answer in
the original thread I started.
Thank you,
Martin
On Fri, 17 May 2013 11:22:17 +0200, Martin Mokrejs wrote:
> Hi,
> while you are chasing a problem with i2c_i801, I would like to mention
> that I never got an answer on the thread https://lkml.org/lkml/2013/1/23/405
> about a kmemleak reported by the kernel. Maybe this could give you a hint?
> If the two do not overlap, I would still be glad to receive an answer in
> the original thread I started.
> Thank you,
> Martin
I have no clue what the problem is nor how to investigate it, and in
fact I strongly suspect that this is either a false positive or a
problem in lower layers - driver core, sysfs etc. so nothing I can help
with.
So until someone comes up with evidence that there is an actual memory
leak in the i2c-i801 driver itself I'm not going to pay any attention
to your report, sorry.
--
Jean Delvare
On Fri, May 17, 2013 at 4:36 PM, Jean Delvare <[email protected]> wrote:
> OK, this means that any transaction request to the SMBus controller
> causes the hang.
>
> The i2c-i801 driver is optimistically using wait_event() when waiting
> for an interrupt to arrive. I suppose that the interrupt is never
> delivered in your case (all 0 in /proc/interrupts.)
>
> Daniel, shouldn't we use wait_event_timeout() instead to catch issues
> like this and fail cleanly? Maybe even fallback to polling
> automatically?
We could try to do something like that, I guess. The only question is
how long to wait, b/c SMBus can be pretty slow.
But that kind of hack sounds more like something you'd do if irqs were
getting sporadically lost on an otherwise correctly configured system.
In this case, it sounds like there are never interrupts, but we are
expecting some due to incorrectly assuming that irqs are supported.
What is different about his configuration where there would be no
IRQs?
Was Robert able to get the system working without hangs by disabling
the IRQ feature of i2c-i801 module when it was builtin?
On Fri, May 17, 2013 at 10:36:22AM +0200, Jean Delvare wrote:
> IPMI is still likely to access the SMBus controller. If there's a BMC
> in the machine, it can also access the SMBus slave with its own
> controller. It would be good to rule this out by disabling IPMI
> completely, removing the BMC from the machine if it has one, and
> checking if it makes the issue go away or not.
I think they do have a BMC (dmidecode would suggest so). I'll need to
confirm this, and then get the datacentre support guys to pull it (yeah,
messy - inherited machines on the other side of the world). It might
take a couple of days, I'll let you know how it goes.
Cheers,
Rob.
On Fri, May 17, 2013 at 05:54:33PM +0800, Daniel Kurtz wrote:
> Was Robert able to get the system working without hangs by disabling
> the IRQ feature of i2c-i801 module when it was builtin?
Yes. There are no hangs when interrupts are explicitly disabled with
disable_features=0x10 or when 6676a847 is reverted and the module
rebuilt.
Cheers,
Rob.
On Fri, May 17, 2013 at 10:49:28AM +0200, Jean Delvare wrote:
> Hmm, can you please dump the PCI configuration space of the SMBus
> controller?
>
> # /sbin/lspci -s 00:1f.3 -xxx
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
00: 86 80 9b 26 41 05 80 02 09 00 05 0c 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 41 04 00 00 00 00 00 00 00 00 00 00 14 10 dd 02
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00
40: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 80 0f 01 00 00 00 00 00
Rob.
On Fri, 17 May 2013 20:27:06 +1000, Robert Norris wrote:
> On Fri, May 17, 2013 at 10:49:28AM +0200, Jean Delvare wrote:
> > Hmm, can you please dump the PCI configuration space of the SMBus
> > controller?
> >
> > # /sbin/lspci -s 00:1f.3 -xxx
>
> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
> 00: 86 80 9b 26 41 05 80 02 09 00 05 0c 00 00 00 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 41 04 00 00 00 00 00 00 00 00 00 00 14 10 dd 02
> 30: 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00
> 40: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ^^
Hmm, no, SMI# isn't enabled. Wrong theory.
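For reference, a minimal sketch of how the dump can be decoded by hand. The register offsets for vendor, device, command and interrupt line/pin are standard PCI; the host configuration byte at 0x40 and its bit names (host enable 0x01, SMI enable 0x02, I2C mode 0x04) follow the constants as recalled from i2c-i801.c and should be treated as assumptions. Only the first five rows of the dump are included.

```python
LSPCI_DUMP = """\
00: 86 80 9b 26 41 05 80 02 09 00 05 0c 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 41 04 00 00 00 00 00 00 00 00 00 00 14 10 dd 02
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00
40: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def parse_lspci_hex(dump):
    """Turn 'lspci -xxx' hex rows ('00: 86 80 ...') into 256 config bytes."""
    cfg = bytearray(256)
    for line in dump.splitlines():
        offset, sep, row = line.partition(":")
        if not sep:
            continue
        start = int(offset.strip(), 16)
        data = bytes(int(b, 16) for b in row.split())
        cfg[start:start + len(data)] = data
    return bytes(cfg)


def decode_smbus_config(cfg):
    """Pick out the fields relevant to interrupt routing."""
    command = int.from_bytes(cfg[0x04:0x06], "little")
    hostc = cfg[0x40]  # assumed: host configuration register
    return {
        "vendor_id": int.from_bytes(cfg[0x00:0x02], "little"),
        "device_id": int.from_bytes(cfg[0x02:0x04], "little"),
        "intx_disabled": bool(command & 0x0400),  # shown by lspci as DisINTx
        "irq_line": cfg[0x3C],
        "irq_pin": {1: "A", 2: "B", 3: "C", 4: "D"}.get(cfg[0x3D]),
        "host_enabled": bool(hostc & 0x01),
        "smi_enabled": bool(hostc & 0x02),
        "i2c_mode": bool(hostc & 0x04),
    }


info = decode_smbus_config(parse_lspci_hex(LSPCI_DUMP))
for key, value in info.items():
    print(f"{key}: {value:#06x}" if key.endswith("_id") else f"{key}: {value}")
```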
--
Jean Delvare
On Fri, May 17, 2013 at 10:36:22AM +0200, Jean Delvare wrote:
> IPMI is still likely to access the SMBus controller. If there's a BMC
> in the machine, it can also access the SMBus slave with its own
> controller. It would be good to rule this out by disabling IPMI
> completely, removing the BMC from the machine if it has one, and
> checking if it makes the issue go away or not.
This ended up being easier than I thought. The BMC can't be physically
removed, but there is a jumper on the board to disable it. We flipped it
and got a message during POST about it not being present. Additionally
all IPMI functions did nothing (hung, but interruptible) which is what
you'd expect. I think it really is disabled.
In this state, I re-ran the previous tests, with identical results. That
is:
- "modprobe i2c_i801" succeeds
- "modprobe i2c_i801 disable_features=0x10" succeeds
- With interrupts disabled, "modprobe ics932s401" succeeds
- With interrupts enabled, "modprobe ics932s401" hangs
- With interrupts enabled, "i2cget 4 0x50 0x00" hangs
I'll leave the BMC disabled for now in case that's important for further
testing. If you need other tests run with the BMC enabled, I'll use a
different machine.
Cheers,
Rob.