2006-10-14 01:51:11

by Nishanth Aravamudan

[permalink] [raw]
Subject: ips: scheduling while atomic in 2.6.18

Hi all,

A server I administer just dumped three scheduling while atomics before
(sort of) hanging hard. Still responds to ping, but ssh is now dead and
the serial console stopped logging.

8-way PIII, 2.6.18 with the 3:1 split. Wanted to get my report out there
before I reset the box, though.

ips 0000:0d:06.0: Resetting controller.
BUG: scheduling while atomic: ipssend/0x00000001/11199
[<c0352e4d>] schedule+0x8ad/0x920
[<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
[<c011c443>] release_console_sem+0x203/0x220
[<c011caee>] vprintk+0x29e/0x380
[<c0126ab0>] lock_timer_base+0x20/0x50
[<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
[<c0126c0c>] __mod_timer+0x9c/0xc0
[<c03536c7>] schedule_timeout+0x57/0xd0
[<c0125fb0>] process_timeout+0x0/0x10
[<c0126e58>] msleep+0x28/0x40
[<c02ba521>] ips_reset_copperhead_memio+0x21/0x60
[<c02b803c>] __ips_eh_reset+0x17c/0x380
[<c028f880>] scsi_done+0x0/0x30
[<c02bba2e>] ips_queue+0x17e/0x1b0
[<c028fdc1>] scsi_dispatch_cmd+0x161/0x260
[<c028f880>] scsi_done+0x0/0x30
[<c0292a90>] scsi_times_out+0x0/0x80
[<c0294ea7>] scsi_request_fn+0x187/0x2f0
[<c0212b0e>] blk_execute_rq_nowait+0x6e/0xc0
[<c0294c11>] scsi_execute_async+0x2b1/0x3c0
[<c0294620>] scsi_end_async+0x0/0x60
[<c02c7940>] sg_cmd_done+0x0/0x260
[<c02c7e28>] sg_common_write+0x288/0x700
[<c02c7940>] sg_cmd_done+0x0/0x260
[<c02c9aec>] sg_write+0x21c/0x300
[<c0350000>] sunrpc_cache_lookup+0x140/0x150
[<c0177837>] do_ioctl+0x87/0x90
[<c01638b5>] vfs_write+0xb5/0x190
[<c016408b>] sys_write+0x4b/0x80
[<c010329b>] syscall_call+0x7/0xb
BUG: scheduling while atomic: ipssend/0x00000001/11199
[<c0352e4d>] schedule+0x8ad/0x920
[<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
[<c011c443>] release_console_sem+0x203/0x220
[<c0126ab0>] lock_timer_base+0x20/0x50
[<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
[<c0126c0c>] __mod_timer+0x9c/0xc0
[<c03536c7>] schedule_timeout+0x57/0xd0
[<c0125fb0>] process_timeout+0x0/0x10
[<c0126e58>] msleep+0x28/0x40
[<c02ba537>] ips_reset_copperhead_memio+0x37/0x60
[<c02b803c>] __ips_eh_reset+0x17c/0x380
[<c028f880>] scsi_done+0x0/0x30
[<c02bba2e>] ips_queue+0x17e/0x1b0
[<c028fdc1>] scsi_dispatch_cmd+0x161/0x260
[<c028f880>] scsi_done+0x0/0x30
[<c0292a90>] scsi_times_out+0x0/0x80
[<c0294ea7>] scsi_request_fn+0x187/0x2f0
[<c0212b0e>] blk_execute_rq_nowait+0x6e/0xc0
[<c0294c11>] scsi_execute_async+0x2b1/0x3c0
[<c0294620>] scsi_end_async+0x0/0x60
[<c02c7940>] sg_cmd_done+0x0/0x260
[<c02c7e28>] sg_common_write+0x288/0x700
[<c02c7940>] sg_cmd_done+0x0/0x260
[<c02c9aec>] sg_write+0x21c/0x300
[<c0350000>] sunrpc_cache_lookup+0x140/0x150
[<c0177837>] do_ioctl+0x87/0x90
[<c01638b5>] vfs_write+0xb5/0x190
[<c016408b>] sys_write+0x4b/0x80
[<c010329b>] syscall_call+0x7/0xb
BUG: scheduling while atomic: ipssend/0x00000001/11199
[<c0352e4d>] schedule+0x8ad/0x920
[<c0352947>] schedule+0x3a7/0x920
[<c0126ab0>] lock_timer_base+0x20/0x50
[<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
[<c0126c0c>] __mod_timer+0x9c/0xc0
[<c03536c7>] schedule_timeout+0x57/0xd0
[<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
[<c0125fb0>] process_timeout+0x0/0x10
[<c0126e58>] msleep+0x28/0x40
[<c02ba5e0>] ips_init_copperhead_memio+0x20/0x150
[<c0125fb0>] process_timeout+0x0/0x10
[<c02ba542>] ips_reset_copperhead_memio+0x42/0x60
[<c02b803c>] __ips_eh_reset+0x17c/0x380
[<c028f880>] scsi_done+0x0/0x30
[<c02bba2e>] ips_queue+0x17e/0x1b0
[<c028fdc1>] scsi_dispatch_cmd+0x161/0x260
[<c028f880>] scsi_done+0x0/0x30
[<c0292a90>] scsi_times_out+0x0/0x80
[<c0294ea7>] scsi_request_fn+0x187/0x2f0
[<c0212b0e>] blk_execute_rq_nowait+0x6e/0xc0
[<c0294c11>] scsi_execute_async+0x2b1/0x3c0
[<c0294620>] scsi_end_async+0x0/0x60
[<c02c7940>] sg_cmd_done+0x0/0x260
[<c02c7e28>] sg_common_write+0x288/0x700
[<c02c7940>] sg_cmd_done+0x0/0x260
[<c02c9aec>] sg_write+0x21c/0x300
[<c0350000>] sunrpc_cache_lookup+0x140/0x150
[<c0177837>] do_ioctl+0x87/0x90
[<c01638b5>] vfs_write+0xb5/0x190
[<c016408b>] sys_write+0x4b/0x80
[<c010329b>] syscall_call+0x7/0xb

Thanks,
Nish

--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center


2006-10-15 15:20:16

by Nick Piggin

[permalink] [raw]
Subject: Re: ips: scheduling while atomic in 2.6.18

Nishanth Aravamudan wrote:
> Hi all,
>
> A server I administer just dumped three scheduling while atomics before
> (sort of) hanging hard. Still responds to ping, but ssh is now dead and
> the serial console stopped logging.
>
> 8-way PIII, 2.6.18 with the 3:1 split. Wanted to get my report out there
> before I reset the box, though.

Thanks for the report. The messages are caused by this commit (cc'ed author):

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=15084a4a63bc300c18b28a8a9afac870c552abce

Not sure whether they are the cause of your hang, but the from the changelog
it doesn't look like the commit was strictly a bugfix so you could try
changing msleep calls in the driver back to MDELAY.

>
> ips 0000:0d:06.0: Resetting controller.
> BUG: scheduling while atomic: ipssend/0x00000001/11199
> [<c0352e4d>] schedule+0x8ad/0x920
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c011c443>] release_console_sem+0x203/0x220
> [<c011caee>] vprintk+0x29e/0x380
> [<c0126ab0>] lock_timer_base+0x20/0x50
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c0126c0c>] __mod_timer+0x9c/0xc0
> [<c03536c7>] schedule_timeout+0x57/0xd0
> [<c0125fb0>] process_timeout+0x0/0x10
> [<c0126e58>] msleep+0x28/0x40
> [<c02ba521>] ips_reset_copperhead_memio+0x21/0x60
> [<c02b803c>] __ips_eh_reset+0x17c/0x380
> [<c028f880>] scsi_done+0x0/0x30
> [<c02bba2e>] ips_queue+0x17e/0x1b0
> [<c028fdc1>] scsi_dispatch_cmd+0x161/0x260
> [<c028f880>] scsi_done+0x0/0x30
> [<c0292a90>] scsi_times_out+0x0/0x80
> [<c0294ea7>] scsi_request_fn+0x187/0x2f0
> [<c0212b0e>] blk_execute_rq_nowait+0x6e/0xc0
> [<c0294c11>] scsi_execute_async+0x2b1/0x3c0
> [<c0294620>] scsi_end_async+0x0/0x60
> [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c7e28>] sg_common_write+0x288/0x700
> [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c9aec>] sg_write+0x21c/0x300
> [<c0350000>] sunrpc_cache_lookup+0x140/0x150
> [<c0177837>] do_ioctl+0x87/0x90
> [<c01638b5>] vfs_write+0xb5/0x190
> [<c016408b>] sys_write+0x4b/0x80
> [<c010329b>] syscall_call+0x7/0xb
> BUG: scheduling while atomic: ipssend/0x00000001/11199
> [<c0352e4d>] schedule+0x8ad/0x920
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c011c443>] release_console_sem+0x203/0x220
> [<c0126ab0>] lock_timer_base+0x20/0x50
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c0126c0c>] __mod_timer+0x9c/0xc0
> [<c03536c7>] schedule_timeout+0x57/0xd0
> [<c0125fb0>] process_timeout+0x0/0x10
> [<c0126e58>] msleep+0x28/0x40
> [<c02ba537>] ips_reset_copperhead_memio+0x37/0x60
> [<c02b803c>] __ips_eh_reset+0x17c/0x380
> [<c028f880>] scsi_done+0x0/0x30
> [<c02bba2e>] ips_queue+0x17e/0x1b0
> [<c028fdc1>] scsi_dispatch_cmd+0x161/0x260
> [<c028f880>] scsi_done+0x0/0x30
> [<c0292a90>] scsi_times_out+0x0/0x80
> [<c0294ea7>] scsi_request_fn+0x187/0x2f0
> [<c0212b0e>] blk_execute_rq_nowait+0x6e/0xc0
> [<c0294c11>] scsi_execute_async+0x2b1/0x3c0
> [<c0294620>] scsi_end_async+0x0/0x60
> [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c7e28>] sg_common_write+0x288/0x700
> [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c9aec>] sg_write+0x21c/0x300
> [<c0350000>] sunrpc_cache_lookup+0x140/0x150
> [<c0177837>] do_ioctl+0x87/0x90
> [<c01638b5>] vfs_write+0xb5/0x190
> [<c016408b>] sys_write+0x4b/0x80
> [<c010329b>] syscall_call+0x7/0xb
> BUG: scheduling while atomic: ipssend/0x00000001/11199
> [<c0352e4d>] schedule+0x8ad/0x920
> [<c0352947>] schedule+0x3a7/0x920
> [<c0126ab0>] lock_timer_base+0x20/0x50
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c0126c0c>] __mod_timer+0x9c/0xc0
> [<c03536c7>] schedule_timeout+0x57/0xd0
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c0125fb0>] process_timeout+0x0/0x10
> [<c0126e58>] msleep+0x28/0x40
> [<c02ba5e0>] ips_init_copperhead_memio+0x20/0x150
> [<c0125fb0>] process_timeout+0x0/0x10
> [<c02ba542>] ips_reset_copperhead_memio+0x42/0x60
> [<c02b803c>] __ips_eh_reset+0x17c/0x380
> [<c028f880>] scsi_done+0x0/0x30
> [<c02bba2e>] ips_queue+0x17e/0x1b0
> [<c028fdc1>] scsi_dispatch_cmd+0x161/0x260
> [<c028f880>] scsi_done+0x0/0x30
> [<c0292a90>] scsi_times_out+0x0/0x80
> [<c0294ea7>] scsi_request_fn+0x187/0x2f0
> [<c0212b0e>] blk_execute_rq_nowait+0x6e/0xc0
> [<c0294c11>] scsi_execute_async+0x2b1/0x3c0
> [<c0294620>] scsi_end_async+0x0/0x60
> [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c7e28>] sg_common_write+0x288/0x700
> [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c9aec>] sg_write+0x21c/0x300
> [<c0350000>] sunrpc_cache_lookup+0x140/0x150
> [<c0177837>] do_ioctl+0x87/0x90
> [<c01638b5>] vfs_write+0xb5/0x190
> [<c016408b>] sys_write+0x4b/0x80
> [<c010329b>] syscall_call+0x7/0xb
>
> Thanks,
> Nish
>


--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-15 19:17:52

by Nishanth Aravamudan

[permalink] [raw]
Subject: Re: ips: scheduling while atomic in 2.6.18

On 16.10.2006 [01:20:04 +1000], Nick Piggin wrote:
> Nishanth Aravamudan wrote:
> >Hi all,
> >
> >A server I administer just dumped three scheduling while atomics before
> >(sort of) hanging hard. Still responds to ping, but ssh is now dead and
> >the serial console stopped logging.
> >
> >8-way PIII, 2.6.18 with the 3:1 split. Wanted to get my report out there
> >before I reset the box, though.
>
> Thanks for the report. The messages are caused by this commit (cc'ed
> author):
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=15084a4a63bc300c18b28a8a9afac870c552abce
>
> Not sure whether they are the cause of your hang, but the from the
> changelog it doesn't look like the commit was strictly a bugfix so you
> could try changing msleep calls in the driver back to MDELAY.

Ah yes, that was what I was going to try next, if I didn't hear back
from anyone :)

I'll let you know if that changes anything. This box has been hanging
for a little while now, on an inconsistent basis -- this was just the
first time I was able to grab these traces. And by hanging hard, I mean
no console, serial or physical, no SysRq, nothing. It's all rather odd.

Will keep you posted.

Thanks,
Nish

--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center

2006-10-16 13:40:08

by Hammer, Jack

[permalink] [raw]
Subject: RE: ips: scheduling while atomic in 2.6.18


The MDELAY/msleep changes are part of a critical bug fix, so if you
change them all back, you're re-introducing the bug. Without them, you
can cause a lock up ( caught by the 2.6 softlock watchdog ) during a
reset.

But you should still try it, and maybe we have to think of another
solution if this is causing your problem.

Let me know what happens if you change it back ...

Jack


-----Original Message-----
From: Nick Piggin [mailto:[email protected]]
Sent: Sunday, October 15, 2006 11:20 AM
To: Nishanth Aravamudan
Cc: IpsLinux; LKML; Hammer, Jack
Subject: Re: ips: scheduling while atomic in 2.6.18

Nishanth Aravamudan wrote:
> Hi all,
>
> A server I administer just dumped three scheduling while atomics
> before (sort of) hanging hard. Still responds to ping, but ssh is now
> dead and the serial console stopped logging.
>
> 8-way PIII, 2.6.18 with the 3:1 split. Wanted to get my report out
> there before I reset the box, though.

Thanks for the report. The messages are caused by this commit (cc'ed
author):

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c
ommit;h=15084a4a63bc300c18b28a8a9afac870c552abce

Not sure whether they are the cause of your hang, but the from the
changelog it doesn't look like the commit was strictly a bugfix so you
could try changing msleep calls in the driver back to MDELAY.

>
> ips 0000:0d:06.0: Resetting controller.
> BUG: scheduling while atomic: ipssend/0x00000001/11199 [<c0352e4d>]
> schedule+0x8ad/0x920 [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c011c443>] release_console_sem+0x203/0x220 [<c011caee>]
> vprintk+0x29e/0x380 [<c0126ab0>] lock_timer_base+0x20/0x50
> [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30 [<c0126c0c>]
> __mod_timer+0x9c/0xc0 [<c03536c7>] schedule_timeout+0x57/0xd0
> [<c0125fb0>] process_timeout+0x0/0x10 [<c0126e58>] msleep+0x28/0x40
> [<c02ba521>] ips_reset_copperhead_memio+0x21/0x60
> [<c02b803c>] __ips_eh_reset+0x17c/0x380 [<c028f880>]
> scsi_done+0x0/0x30 [<c02bba2e>] ips_queue+0x17e/0x1b0 [<c028fdc1>]
> scsi_dispatch_cmd+0x161/0x260 [<c028f880>] scsi_done+0x0/0x30
> [<c0292a90>] scsi_times_out+0x0/0x80 [<c0294ea7>]
> scsi_request_fn+0x187/0x2f0 [<c0212b0e>]
> blk_execute_rq_nowait+0x6e/0xc0 [<c0294c11>]
> scsi_execute_async+0x2b1/0x3c0 [<c0294620>] scsi_end_async+0x0/0x60
> [<c02c7940>] sg_cmd_done+0x0/0x260 [<c02c7e28>]
> sg_common_write+0x288/0x700 [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c9aec>] sg_write+0x21c/0x300 [<c0350000>]
> sunrpc_cache_lookup+0x140/0x150 [<c0177837>] do_ioctl+0x87/0x90
> [<c01638b5>] vfs_write+0xb5/0x190 [<c016408b>] sys_write+0x4b/0x80
> [<c010329b>] syscall_call+0x7/0xb
> BUG: scheduling while atomic: ipssend/0x00000001/11199 [<c0352e4d>]
> schedule+0x8ad/0x920 [<c03550ef>] _spin_unlock_irqrestore+0xf/0x30
> [<c011c443>] release_console_sem+0x203/0x220 [<c0126ab0>]
> lock_timer_base+0x20/0x50 [<c03550ef>]
> _spin_unlock_irqrestore+0xf/0x30 [<c0126c0c>] __mod_timer+0x9c/0xc0
> [<c03536c7>] schedule_timeout+0x57/0xd0 [<c0125fb0>]
> process_timeout+0x0/0x10 [<c0126e58>] msleep+0x28/0x40 [<c02ba537>]
> ips_reset_copperhead_memio+0x37/0x60
> [<c02b803c>] __ips_eh_reset+0x17c/0x380 [<c028f880>]
> scsi_done+0x0/0x30 [<c02bba2e>] ips_queue+0x17e/0x1b0 [<c028fdc1>]
> scsi_dispatch_cmd+0x161/0x260 [<c028f880>] scsi_done+0x0/0x30
> [<c0292a90>] scsi_times_out+0x0/0x80 [<c0294ea7>]
> scsi_request_fn+0x187/0x2f0 [<c0212b0e>]
> blk_execute_rq_nowait+0x6e/0xc0 [<c0294c11>]
> scsi_execute_async+0x2b1/0x3c0 [<c0294620>] scsi_end_async+0x0/0x60
> [<c02c7940>] sg_cmd_done+0x0/0x260 [<c02c7e28>]
> sg_common_write+0x288/0x700 [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c9aec>] sg_write+0x21c/0x300 [<c0350000>]
> sunrpc_cache_lookup+0x140/0x150 [<c0177837>] do_ioctl+0x87/0x90
> [<c01638b5>] vfs_write+0xb5/0x190 [<c016408b>] sys_write+0x4b/0x80
> [<c010329b>] syscall_call+0x7/0xb
> BUG: scheduling while atomic: ipssend/0x00000001/11199 [<c0352e4d>]
> schedule+0x8ad/0x920 [<c0352947>] schedule+0x3a7/0x920 [<c0126ab0>]
> lock_timer_base+0x20/0x50 [<c03550ef>]
> _spin_unlock_irqrestore+0xf/0x30 [<c0126c0c>] __mod_timer+0x9c/0xc0
> [<c03536c7>] schedule_timeout+0x57/0xd0 [<c03550ef>]
> _spin_unlock_irqrestore+0xf/0x30 [<c0125fb0>]
> process_timeout+0x0/0x10 [<c0126e58>] msleep+0x28/0x40 [<c02ba5e0>]
> ips_init_copperhead_memio+0x20/0x150
> [<c0125fb0>] process_timeout+0x0/0x10 [<c02ba542>]
> ips_reset_copperhead_memio+0x42/0x60
> [<c02b803c>] __ips_eh_reset+0x17c/0x380 [<c028f880>]
> scsi_done+0x0/0x30 [<c02bba2e>] ips_queue+0x17e/0x1b0 [<c028fdc1>]
> scsi_dispatch_cmd+0x161/0x260 [<c028f880>] scsi_done+0x0/0x30
> [<c0292a90>] scsi_times_out+0x0/0x80 [<c0294ea7>]
> scsi_request_fn+0x187/0x2f0 [<c0212b0e>]
> blk_execute_rq_nowait+0x6e/0xc0 [<c0294c11>]
> scsi_execute_async+0x2b1/0x3c0 [<c0294620>] scsi_end_async+0x0/0x60
> [<c02c7940>] sg_cmd_done+0x0/0x260 [<c02c7e28>]
> sg_common_write+0x288/0x700 [<c02c7940>] sg_cmd_done+0x0/0x260
> [<c02c9aec>] sg_write+0x21c/0x300 [<c0350000>]
> sunrpc_cache_lookup+0x140/0x150 [<c0177837>] do_ioctl+0x87/0x90
> [<c01638b5>] vfs_write+0xb5/0x190 [<c016408b>] sys_write+0x4b/0x80
> [<c010329b>] syscall_call+0x7/0xb
>
> Thanks,
> Nish
>


--
SUSE Labs, Novell Inc.
Send instant messages to your online friends
http://au.messenger.yahoo.com