Hello list,
I have 36 servers, all running a vanilla 3.18.18 kernel, under very
high disk and network load.
For the past few days I have regularly seen the following error messages,
and quite often disk I/O hangs completely:
[535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
[548400.353679] do_IRQ: 2.109 No irq handler for vector (irq -1)
[551624.894507] do_IRQ: 4.84 No irq handler for vector (irq -1)
[557524.288691] do_IRQ: 1.158 No irq handler for vector (irq -1)
[559786.928441] do_IRQ: 3.172 No irq handler for vector (irq -1)
[572906.281394] do_IRQ: 3.72 No irq handler for vector (irq -1)
[576611.808128] do_IRQ: 3.118 No irq handler for vector (irq -1)
[577242.682643] do_IRQ: 2.45 No irq handler for vector (irq -1)
[578524.584545] do_IRQ: 5.190 No irq handler for vector (irq -1)
[602109.548268] do_IRQ: 3.101 No irq handler for vector (irq -1)
All systems are single E5 Xeons, and I'm running irqbalance on them.
Chipset: Intel C602J
Is there anything I can do to fix this? Is there perhaps a kernel patch
available?
Thanks!
Greets,
Stefan
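
To see whether the stray vectors cluster on particular CPUs, the messages above can be tallied with a short script. This is only a sketch: it assumes the two numbers after "do_IRQ:" are the CPU number and the vector number (which is how the x86 do_IRQ message is formatted), and the names parse_spurious_vectors and count_by_cpu are made up for this illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
# Assumption: the "0.126" field is "<cpu>.<vector>".
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+do_IRQ:\s+(?P<cpu>\d+)\.(?P<vector>\d+)\s+"
    r"No irq handler for vector"
)

# The ten messages quoted in the report.
SAMPLE = """\
[535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
[548400.353679] do_IRQ: 2.109 No irq handler for vector (irq -1)
[551624.894507] do_IRQ: 4.84 No irq handler for vector (irq -1)
[557524.288691] do_IRQ: 1.158 No irq handler for vector (irq -1)
[559786.928441] do_IRQ: 3.172 No irq handler for vector (irq -1)
[572906.281394] do_IRQ: 3.72 No irq handler for vector (irq -1)
[576611.808128] do_IRQ: 3.118 No irq handler for vector (irq -1)
[577242.682643] do_IRQ: 2.45 No irq handler for vector (irq -1)
[578524.584545] do_IRQ: 5.190 No irq handler for vector (irq -1)
[602109.548268] do_IRQ: 3.101 No irq handler for vector (irq -1)
"""


def parse_spurious_vectors(log_text):
    """Return a list of (timestamp, cpu, vector) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            events.append(
                (
                    float(match.group("ts")),
                    int(match.group("cpu")),
                    int(match.group("vector")),
                )
            )
    return events


def count_by_cpu(events):
    """Count how many unhandled-vector messages each CPU reported."""
    return Counter(cpu for _, cpu, _ in events)


if __name__ == "__main__":
    events = parse_spurious_vectors(SAMPLE)
    for cpu, hits in sorted(count_by_cpu(events).items()):
        print(f"CPU {cpu}: {hits} message(s)")
```

On a live machine the same function could be fed the output of dmesg instead of SAMPLE.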
On Mon, 20 Jul 2015, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> i've 36 servers all running vanilla 3.18.18 kernel which have a very
> high disk and network load.
>
> Since a few days i encounter regular the following error messages and
> pretty often completely hanging disk i/o:
> [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
Did this happen right after you updated to 3.18.18?
Which kernel version were you using before that?
Have you observed such error messages before the update?
> All systems are Single E5 Xeons and I'm running irqbalance on them.
Does it stop if you disable irqbalance?
Thanks,
tglx
On 20.07.2015 at 12:53, Thomas Gleixner wrote:
> On Mon, 20 Jul 2015, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>>
>> i've 36 servers all running vanilla 3.18.18 kernel which have a very
>> high disk and network load.
>>
>> Since a few days i encounter regular the following error messages and
>> pretty often completely hanging disk i/o:
>> [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
>
> Did this happen right after you updated to 3.18.18?
> Which kernel version were you using before that?
> Have you observed such error messages before the update?
No, it always worked before, or at least I never noticed a device
hang or such messages. The previous kernel was vanilla 3.10.78.
>> All systems are Single E5 Xeons and I'm running irqbalance on them.
>
> Does it stop if you disable irqbalance ?
Will try that today.
Stefan
On 20.07.2015 at 12:53, Thomas Gleixner wrote:
> On Mon, 20 Jul 2015, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>>
>> i've 36 servers all running vanilla 3.18.18 kernel which have a very
>> high disk and network load.
>>
>> Since a few days i encounter regular the following error messages and
>> pretty often completely hanging disk i/o:
>> [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
>
> Did this happen right after you updated to 3.18.18?
Seems so.
> Which kernel version were you using before that?
3.10.78
> Have you observed such error messages before the update?
No.
>> All systems are Single E5 Xeons and I'm running irqbalance on them.
>
> Does it stop if you disable irqbalance ?
No. The machines still crash.
Stefan
>
> Thanks,
>
> tglx
>
On Tue, 21 Jul 2015, Stefan Priebe wrote:
> On 20.07.2015 at 12:53, Thomas Gleixner wrote:
> > On Mon, 20 Jul 2015, Stefan Priebe - Profihost AG wrote:
> > > Hello list,
> > >
> > > i've 36 servers all running vanilla 3.18.18 kernel which have a very
> > > high disk and network load.
> > >
> > > Since a few days i encounter regular the following error messages and
> > > pretty often completely hanging disk i/o:
> > > [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
> > >
> > > All systems are Single E5 Xeons and I'm running irqbalance on them.
> >
> > Does it stop if you disable irqbalance ?
>
> No. The machines still crash.
Crash as in running into a BUG? Or is it just that disk I/O is stalled?
Can you please provide the full dmesg output of such a machine?
I'll cook up a debug patch for that against 3.18.18.
Thanks,
tglx
On 21.07.2015 at 23:15, Thomas Gleixner wrote:
> On Tue, 21 Jul 2015, Stefan Priebe wrote:
>> On 20.07.2015 at 12:53, Thomas Gleixner wrote:
>>> On Mon, 20 Jul 2015, Stefan Priebe - Profihost AG wrote:
>>>> Hello list,
>>>>
>>>> i've 36 servers all running vanilla 3.18.18 kernel which have a very
>>>> high disk and network load.
>>>>
>>>> Since a few days i encounter regular the following error messages and
>>>> pretty often completely hanging disk i/o:
>>>> [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
>>>>
>>>> All systems are Single E5 Xeons and I'm running irqbalance on them.
>>>
>>> Does it stop if you disable irqbalance ?
>>
>> No. The machines still crash.
>
> crash as in running into a BUG? Or is it just that disk I/O is stalled?
Sorry, I meant that I/O is stalled. It looks like a crash to me because
I can't log in anymore due to the hanging I/O.
> Can you please provide the full dmesg output of such a machine?
Yes (this time from a machine using 3.18.14) =>
http://pastebin.com/raw.php?i=S6kAk0iS
> I'll cook up a debug patch for that against 3.18.18.
That would be great!
Stefan
> Thanks,
>
> tglx
>
On 22.07.2015 at 09:23, Stefan Priebe - Profihost AG wrote:
>
> On 21.07.2015 at 23:15, Thomas Gleixner wrote:
>> On Tue, 21 Jul 2015, Stefan Priebe wrote:
>>> On 20.07.2015 at 12:53, Thomas Gleixner wrote:
>>>> On Mon, 20 Jul 2015, Stefan Priebe - Profihost AG wrote:
>>>>> Hello list,
>>>>>
>>>>> i've 36 servers all running vanilla 3.18.18 kernel which have a very
>>>>> high disk and network load.
>>>>>
>>>>> Since a few days i encounter regular the following error messages and
>>>>> pretty often completely hanging disk i/o:
>>>>> [535040.439859] do_IRQ: 0.126 No irq handler for vector (irq -1)
>>>>>
>>>>> All systems are Single E5 Xeons and I'm running irqbalance on them.
>>>>
>>>> Does it stop if you disable irqbalance ?
>>>
>>> No. The machines still crash.
>>
>> crash as in running into a BUG? Or is it just that disk I/O is stalled?
>
> Sorry i meant I/O is stalled. It crashes to me as i can't login anymore
> due to hanging I/O.
>
>> Can you please provide the full dmesg output of such a machine?
>
> Yes (this time from a machine using 3.18.14) =>
> http://pastebin.com/raw.php?i=S6kAk0iS
>
>> I'll cook up a debug patch for that against 3.18.18.
Do you have any special upstream commits in mind?
Stefan
>
> That would be great!
>
> Stefan
>
>> Thanks,
>>
>> tglx
>>
On Thu, 23 Jul 2015, Stefan Priebe wrote:
> Do you have any special upstream commits in mind?
Not yet. Could you run a test with 3.16 and 3.17 please? That would
narrow down the problem space quite a bit.
Thanks,
tglx
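
The narrowing suggested above can be sketched as follows. The thread only establishes that 3.10 behaved well and 3.18 does not; the outcomes for 3.16 and 3.17 used in the usage example are purely hypothetical placeholders, and narrow_range is a made-up helper name. The sketch also assumes the problem, once introduced, persists in every later release.

```python
# Mainline release series between the last known-good and first known-bad kernel.
VERSIONS = ["3.10", "3.11", "3.12", "3.13", "3.14", "3.15", "3.16", "3.17", "3.18"]


def narrow_range(versions, results):
    """Return (last_good, first_bad) given partial test results.

    versions: release series ordered from oldest to newest.
    results:  dict mapping a version to True (good) or False (bad);
              untested versions are simply absent.
    Assumes a regression stays present in all later versions.
    """
    first_bad_index = None
    for index, version in enumerate(versions):
        if results.get(version) is False:
            first_bad_index = index
            break

    # Only versions older than the first bad one can bound the range from below.
    upper = len(versions) if first_bad_index is None else first_bad_index
    last_good = None
    for version in versions[:upper]:
        if results.get(version) is True:
            last_good = version

    first_bad = None if first_bad_index is None else versions[first_bad_index]
    return last_good, first_bad


if __name__ == "__main__":
    # Known from the thread: 3.10 good, 3.18 bad.
    known = {"3.10": True, "3.18": False}
    print(narrow_range(VERSIONS, known))

    # HYPOTHETICAL outcomes for the two requested test runs.
    hypothetical = dict(known, **{"3.16": True, "3.17": False})
    print(narrow_range(VERSIONS, hypothetical))
```

With only the known results the suspect range spans everything between 3.10 and 3.18; two extra data points in the middle can shrink it to a single release cycle, which is the point of the request.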