2007-05-08 08:35:44

by Michal Piotrowski

[permalink] [raw]
Subject: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

Hi,

/ filesystem was full

[39525.460000] BUG: NMI Watchdog detected LOCKUP on CPU1, eip 08056990, registers:
[39525.468000] Modules linked in: loop ipt_MASQUERADE iptable_nat nf_nat autofs4 af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss evdev snd_pcm intel_agp snd_timer snd agpgart soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix
[39525.518000] CPU: 1
[39525.518000] EIP: 0073:[<08056990>] Not tainted VLI
[39525.518000] EFLAGS: 00000202 (2.6.21-ga989705c #187)
[39525.529000] EIP is at 0x8056990
[39525.529000] eax: 6e560d60 ebx: 0000000b ecx: 00000000 edx: 000dd15e
[39525.541000] esi: 00000000 edi: 6e560220 ebp: bfeb0a58 esp: bfeb0990
[39525.547000] ds: 007b es: 007b fs: 0000 gs: 0033 ss: 007b
[39525.553000] Process line (pid: 4277, ti=cf200000 task=f6f560b0 task.ti=cf200000)
[39525.560000] Kernel panic - not syncing: Aiee, killing interrupt handler!

http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-console.log
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-config

Regards,
Michal

--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)


2007-05-08 10:07:16

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

On Tue, 08 May 2007 10:35:14 +0200 Michal Piotrowski <[email protected]> wrote:

> Hi,
>
> / filesystem was full
>
> [39525.460000] BUG: NMI Watchdog detected LOCKUP on CPU1, eip 08056990, registers:
> [39525.468000] Modules linked in: loop ipt_MASQUERADE iptable_nat nf_nat autofs4 af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss evdev snd_pcm intel_agp snd_timer snd agpgart soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix
> [39525.518000] CPU: 1
> [39525.518000] EIP: 0073:[<08056990>] Not tainted VLI
> [39525.518000] EFLAGS: 00000202 (2.6.21-ga989705c #187)
> [39525.529000] EIP is at 0x8056990
> [39525.529000] eax: 6e560d60 ebx: 0000000b ecx: 00000000 edx: 000dd15e
> [39525.541000] esi: 00000000 edi: 6e560220 ebp: bfeb0a58 esp: bfeb0990
> [39525.547000] ds: 007b es: 007b fs: 0000 gs: 0033 ss: 007b
> [39525.553000] Process line (pid: 4277, ti=cf200000 task=f6f560b0 task.ti=cf200000)
> [39525.560000] Kernel panic - not syncing: Aiee, killing interrupt handler!
>
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-console.log
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-config
>

I don't know what caused the CPU to jump into hyperspace like that, but Patrick
tells me that this:

> [38773.921000] printk: 15909 messages suppressed.
> [38773.926000] ipt_hook: happy cracking.
> [38778.921000] printk: 16332 messages suppressed.
> [38778.925000] ipt_hook: happy cracking.
> [38783.921000] printk: 16175 messages suppressed.
> [38783.926000] ipt_hook: happy cracking.
> [38788.921000] printk: 16390 messages suppressed.
> [38788.925000] ipt_hook: happy cracking.
> [38793.921000] printk: 16289 messages suppressed.
> [38793.925000] ipt_hook: happy cracking.
> [38798.921000] printk: 16172 messages suppressed.
> [38798.926000] ipt_hook: happy cracking.
> [38803.921000] printk: 15738 messages suppressed.
> [38803.925000] ipt_hook: happy cracking.
> [38808.921000] printk: 14731 messages suppressed.

happens when a local process sends packets with invalid IP headers
through raw sockets.

[ 5225.195000] UDP: short packet: From 37.126.206.54:46544 39671/1182
to 127.0.0.1:40761

This seems to indicate something on the local machine (packets are not
routed to 127.0.0.1) is sending invalid packets, probably with
incorrectly set up skb pointers.

I'd suggest to add a WARN_ON(1) in ipt_local_hook().

So can you please add the appropriate WARN_ON?

Whatever happens, that printk should be toned down, shouldn't it? We
prefer to not let unprivileged apps spam the logs.

2007-05-08 10:21:48

by Patrick McHardy

[permalink] [raw]
Subject: Re: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

Andrew Morton wrote:
> Whatever happens, that printk should be toned down, shouldn't it? We
> prefer to not let unprivileged apps spam the logs.


Only priviledged apps can send these packets. I've never seen it in
practice except for one case that was a bug in the network stack, so
I'd prefer to keep it. The original intention was to give the user
a hint why these packets are not affected by iptables rules, so a
more descriptive message would make sense. I'll queue a patch for
this.

2007-05-08 11:05:39

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

On 08/05/07, Andrew Morton <[email protected]> wrote:
> On Tue, 08 May 2007 10:35:14 +0200 Michal Piotrowski <[email protected]> wrote:
>
> > Hi,
> >
> > / filesystem was full
> >
> > [39525.460000] BUG: NMI Watchdog detected LOCKUP on CPU1, eip 08056990, registers:
> > [39525.468000] Modules linked in: loop ipt_MASQUERADE iptable_nat nf_nat autofs4 af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss evdev snd_pcm intel_agp snd_timer snd agpgart soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix
> > [39525.518000] CPU: 1
> > [39525.518000] EIP: 0073:[<08056990>] Not tainted VLI
> > [39525.518000] EFLAGS: 00000202 (2.6.21-ga989705c #187)
> > [39525.529000] EIP is at 0x8056990
> > [39525.529000] eax: 6e560d60 ebx: 0000000b ecx: 00000000 edx: 000dd15e
> > [39525.541000] esi: 00000000 edi: 6e560220 ebp: bfeb0a58 esp: bfeb0990
> > [39525.547000] ds: 007b es: 007b fs: 0000 gs: 0033 ss: 007b
> > [39525.553000] Process line (pid: 4277, ti=cf200000 task=f6f560b0 task.ti=cf200000)
> > [39525.560000] Kernel panic - not syncing: Aiee, killing interrupt handler!
> >
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-console.log
> > http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-config
> >
>
> I don't know what caused the CPU to jump into hyperspace like that, but Patrick
> tells me that this:
>
> > [38773.921000] printk: 15909 messages suppressed.
> > [38773.926000] ipt_hook: happy cracking.
> > [38778.921000] printk: 16332 messages suppressed.
> > [38778.925000] ipt_hook: happy cracking.
> > [38783.921000] printk: 16175 messages suppressed.
> > [38783.926000] ipt_hook: happy cracking.
> > [38788.921000] printk: 16390 messages suppressed.
> > [38788.925000] ipt_hook: happy cracking.
> > [38793.921000] printk: 16289 messages suppressed.
> > [38793.925000] ipt_hook: happy cracking.
> > [38798.921000] printk: 16172 messages suppressed.
> > [38798.926000] ipt_hook: happy cracking.
> > [38803.921000] printk: 15738 messages suppressed.
> > [38803.925000] ipt_hook: happy cracking.
> > [38808.921000] printk: 14731 messages suppressed.
>
> happens when a local process sends packets with invalid IP headers
> through raw sockets.

Yes, it was an isic session.

>
> [ 5225.195000] UDP: short packet: From 37.126.206.54:46544 39671/1182
> to 127.0.0.1:40761
>
> This seems to indicate something on the local machine (packets are not
> routed to 127.0.0.1) is sending invalid packets, probably with
> incorrectly set up skb pointers.
>
> I'd suggest to add a WARN_ON(1) in ipt_local_hook().
>
> So can you please add the appropriate WARN_ON?
>
> Whatever happens, that printk should be toned down, shouldn't it? We
> prefer to not let unprivileged apps spam the logs.
>
>

[39293.925000] ipt_hook: happy cracking.
[39429.024000] printk: 15828 messages suppressed.
[39429.028000] nf_conntrack: table full, dropping packet.
[39430.034000] nf_conntrack: table full, dropping packet.
[39431.039000] nf_conntrack: table full, dropping packet.
[39432.044000] nf_conntrack: table full, dropping packet.
[39444.056000] nf_conntrack: table full, dropping packet.
[39445.061000] nf_conntrack: table full, dropping packet.
[39525.460000] BUG: NMI Watchdog detected LOCKUP on CPU1, eip
08056990, registers:

This lockup occurred after an isic test. Hmmm... linus_stress?

FAIL aio_dio_bugs Command
<LD_LIBRARY_PATH=/usr/local/autotest/client/deps/libaio/lib/
/usr/local/autotest/client/tests/aio_dio_bugs/src/aio-dio-extend-stat
file> failed, rc=32512
GOOD aiostress completed successfully
GOOD bonnie completed successfully
GOOD cpu_hotplug completed successfully
GOOD cyclictest completed successfully
GOOD dbench completed successfully
FAIL disktest running test disktest
<--[random error]-->
FAIL fs_mark Command <./fs_mark -d /mnt -s 10240 -n 1000> failed, rc=256
GOOD fsfuzzer completed successfully
GOOD fsx completed successfully
FAIL interbench Command
</usr/local/autotest/client/tests/interbench/src/interbench -m 'run
#0' -c> failed, rc=256
GOOD iozone completed successfully
FAIL isic running test job
Traceback (most recent call last):
File "/usr/local/autotest/client/bin/job.py", line 179, in __runtest
test.runtest(self, url, tag, args, dargs)
File "/usr/local/autotest/client/bin/test.py", line 195, in runtest
fork_waitfor(job.resultdir, pid)
File "/usr/local/autotest/client/bin/parallel.py", line 40, in fork_waitfor
(pid, status) = os.waitpid(pid, 0)
KeyboardInterrupt
GOOD linus_stress completed successfully

I don't remember what was the next test.

I'll try to find out how to reproduce this lockup. Anyway, IMO it's
not a network related problem.

Regards,
Michal

--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)

2007-05-08 15:24:35

by Chuck Ebbert

[permalink] [raw]
Subject: Re: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

Michal Piotrowski wrote:
> Hi,
>
> / filesystem was full
>
> [39525.460000] BUG: NMI Watchdog detected LOCKUP on CPU1, eip 08056990, registers:
> [39525.468000] Modules linked in: loop ipt_MASQUERADE iptable_nat nf_nat autofs4 af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss evdev snd_pcm intel_agp snd_timer snd agpgart soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix
> [39525.518000] CPU: 1
> [39525.518000] EIP: 0073:[<08056990>] Not tainted VLI
> [39525.518000] EFLAGS: 00000202 (2.6.21-ga989705c #187)
> [39525.529000] EIP is at 0x8056990
> [39525.529000] eax: 6e560d60 ebx: 0000000b ecx: 00000000 edx: 000dd15e
> [39525.541000] esi: 00000000 edi: 6e560220 ebp: bfeb0a58 esp: bfeb0990
> [39525.547000] ds: 007b es: 007b fs: 0000 gs: 0033 ss: 007b
> [39525.553000] Process line (pid: 4277, ti=cf200000 task=f6f560b0 task.ti=cf200000)
> [39525.560000] Kernel panic - not syncing: Aiee, killing interrupt handler!
>
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-console.log
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-config
>

Nice. System was in user mode with interrupts enabled but none happened
for a long time. Maybe something forgot to schedule a timer interrupt?

(What mechanism is there to prevent this from happening?)

2007-05-08 16:09:56

by Jiri Kosina

[permalink] [raw]
Subject: Re: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

On Tue, 8 May 2007, Andrew Morton wrote:

> happens when a local process sends packets with invalid IP headers
> through raw sockets.
[...]
> Whatever happens, that printk should be toned down, shouldn't it? We
> prefer to not let unprivileged apps spam the logs.

Isn't "unprivileged app sending packet through raw socket" an oxymoron? As
Patrick pointed out, the message is there for a reason, I'd suggest using
printk_ratelimit() in this case.

--
Jiri Kosina

2007-05-08 20:31:44

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: 2.6.21-git8+ BUG: NMI Watchdog detected LOCKUP on CPU1

Chuck Ebbert wrote:
> Michal Piotrowski wrote:
>
>> Hi,
>>
>> / filesystem was full
>>
>> [39525.460000] BUG: NMI Watchdog detected LOCKUP on CPU1, eip 08056990, registers:
>> [39525.468000] Modules linked in: loop ipt_MASQUERADE iptable_nat nf_nat autofs4 af_packet nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 binfmt_misc thermal processor fan container nvram snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss evdev snd_pcm intel_agp snd_timer snd agpgart soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc unix
>> [39525.518000] CPU: 1
>> [39525.518000] EIP: 0073:[<08056990>] Not tainted VLI
>> [39525.518000] EFLAGS: 00000202 (2.6.21-ga989705c #187)
>> [39525.529000] EIP is at 0x8056990
>> [39525.529000] eax: 6e560d60 ebx: 0000000b ecx: 00000000 edx: 000dd15e
>> [39525.541000] esi: 00000000 edi: 6e560220 ebp: bfeb0a58 esp: bfeb0990
>> [39525.547000] ds: 007b es: 007b fs: 0000 gs: 0033 ss: 007b
>> [39525.553000] Process line (pid: 4277, ti=cf200000 task=f6f560b0 task.ti=cf200000)
>> [39525.560000] Kernel panic - not syncing: Aiee, killing interrupt handler!
>>
>> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-console.log
>> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-git8/git-config
>>
>>
>
> Nice. System was in user mode with interrupts enabled but none happened
> for a long time. Maybe something forgot to schedule a timer interrupt?
>
> (What mechanism is there to prevent this from happening?)

I think its supposed to be dealt with by explicitly poking at the
softlockup watchdog when exiting a tickless sleep - if we're exiting the
sleep to run the watchdog process, then it should see that it has been
freshly poked and not complain.

But its all fairly subtle and complex, and I couldn't convince myself
that it was correct. I had an experimental patch to explicitly
disable/enable the watchdog when entering a tickless sleep, but it got
overly complex and I dropped it.

I don't think my softlockup watchdog changes have hit git yet (not
-git8+, anyway), so I don't have to worry about being responsible yet ;)

J