2004-11-02 07:04:29

by Chris Stromsoe

Subject: deadlock with 2.6.9

The machine collects remote syslog, roughly 1000 packets per second. The
logs are burned to dvd several times per week. The hanging may have
coincided with the completion of one of the log burning sessions. Before
the last crash I was using ide-scsi to do the burning. Since the last
crash, I switched over to directly using the ide device and have disabled
ide-scsi.

The machine did not crash. Remote access with ssh would hang after
authentication. Already running processes that didn't fork anything
responded fine to the network. I could not fork a shell from a serial
console. I was able to use sysrq to pull debugging information. I was
also able to kill off all running processes (sysrq+e and sysrq+i), then
log in from the serial console and restart things.

I had the same problem with 2.6.8.1.

Most of the processes seem to be stuck in schedule_timeout.

The full sysrq and .config are at http://hashbrown.cts.ucla.edu/deadlock/


cbs:~ > lsmod
Module                  Size  Used by
sg                     35040  0
sr_mod                 14884  0
cdrom                  38460  1 sr_mod
ide_scsi               15332  0
e100                   29760  0
bonding                64552  0



telnet> send brk
SysRq : Show Regs

Pid: 0, comm: swapper
EIP: 0060:[<c01022cf>] CPU: 0
EIP is at default_idle+0x2f/0x40
EFLAGS: 00000246 Not tainted (2.6.9)
EAX: 00000000 EBX: c039b000 ECX: c01022a0 EDX: c039b000
ESI: c04080a0 EDI: c04081a0 EBP: c039bfc4 DS: 007b ES: 007b
CR0: 8005003b CR2: bffffd66 CR3: 0fb3f000 CR4: 000006d0
[<c0102516>] show_regs+0x146/0x170
[<c0207a81>] __handle_sysrq+0x71/0xf0
[<c021a5aa>] receive_chars+0x11a/0x230
[<c021a9bd>] serial8250_interrupt+0xdd/0xe0
[<c0106c96>] handle_IRQ_event+0x36/0x70
[<c0107063>] do_IRQ+0xe3/0x1b0
[<c0104cf0>] common_interrupt+0x18/0x20
[<c010235b>] cpu_idle+0x3b/0x50
[<c039cb8b>] start_kernel+0x16b/0x190
[<c0100211>] 0xc0100211



telnet> send brk
SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 30, high 90, batch 15
cpu 0 cold: low 0, high 30, batch 15
HighMem per-cpu: empty

Free pages: 1348kB (0kB HighMem)
Active:26008 inactive:2118 dirty:10 writeback:0 unstable:0 free:337
slab:16168 mapped:25917 pagetables:12564
DMA free:172kB min:32kB low:64kB high:96kB active:160kB inactive:48kB
present:16384kB
protections[]: 0 0 0
Normal free:1176kB min:480kB low:960kB high:1440kB active:103872kB
inactive:8424kB present:245760kB
protections[]: 0 0 0
HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB
present:0kB
protections[]: 0 0 0
DMA: 23*4kB 2*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 172kB
Normal: 54*4kB 8*8kB 14*16kB 5*32kB 2*64kB 1*128kB 1*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 1176kB
HighMem: empty
Swap cache: add 66873, delete 65915, find 54916/55208, race 0+0
Free swap: 805560kB
65536 pages of RAM
0 pages of HIGHMEM
1786 reserved pages
600111 pages shared
958 pages swap cached




cron S C1205F60 0 2614 435 2628 2613 (NOTLB)
cde0bc9c 00000082 00000000 c1205f60 c12068c0 c4dba97c cde0bc94 c014584e
c4dba940 c57d3468 b7fe7000 00000001 c1205f60 001aff4a d6c5bd65 0003d278
c8c5e530 cf9aed40 7fffffff 00000001 cde0bcd8 c02f9f04 b7fe7000 00000001
Call Trace:
[<c02f9f04>] schedule_timeout+0xb4/0xc0
[<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
[<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
[<c0293cab>] sock_sendmsg+0xbb/0xe0
[<c0295161>] sys_sendto+0xe1/0x100
[<c02951b2>] sys_send+0x32/0x40
[<c0295a1a>] sys_socketcall+0x13a/0x250
[<c0104383>] syscall_call+0x7/0xb
cron S C1205F60 0 2628 435 2640 2614 (NOTLB)
cc738c9c 00000086 c8c5e930 c1205f60 c82e1b7c c4dba73c cc738c94 00000000
c4dba700 00000007 cc738cac cebb0690 c1205f60 00289ad9 af39036a 0003d2be
c8c5ea90 cf9aed40 7fffffff 00000001 cc738cd8 c02f9f04 00000246 000000d0
Call Trace:
[<c02f9f04>] schedule_timeout+0xb4/0xc0
[<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
[<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
[<c0293cab>] sock_sendmsg+0xbb/0xe0
[<c0295161>] sys_sendto+0xe1/0x100
[<c02951b2>] sys_send+0x32/0x40
[<c0295a1a>] sys_socketcall+0x13a/0x250
[<c0104383>] syscall_call+0x7/0xb

cron S C1205F60 0 2640 435 2655 2628 (NOTLB)
c2bfbc9c 00000082 c721b9d0 c1205f60 cd21db7c c4dba07c c2bfbc94 00000000
c4dba040 00000007 cf06b200 cebb0690 c1205f60 0027ced4 87b993a6 0003d304
c721bb30 cf9aed40 7fffffff 00000001 c2bfbcd8 c02f9f04 c138f6c0 c9bdea4c
Call Trace:
[<c02f9f04>] schedule_timeout+0xb4/0xc0
[<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
[<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
[<c0293cab>] sock_sendmsg+0xbb/0xe0
[<c0295161>] sys_sendto+0xe1/0x100
[<c02951b2>] sys_send+0x32/0x40
[<c0295a1a>] sys_socketcall+0x13a/0x250
[<c0104383>] syscall_call+0x7/0xb
sshd S C1205F60 0 2653 1410 2654 17528 (NOTLB)
c163dc9c 00000086 c721b470 c1205f60 00000000 00000000 00000000 00000000
c4dba4c0 00000007 cf06b560 c721af10 c1205f60 00039ddd 67910d34 0003d32a
c721b5d0 cf9aed40 7fffffff 00000001 c163dcd8 c02f9f04 c138f6c0 c9bdea4c
Call Trace:
[<c02f9f04>] schedule_timeout+0xb4/0xc0
[<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
[<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
[<c0293cab>] sock_sendmsg+0xbb/0xe0
[<c0295161>] sys_sendto+0xe1/0x100
[<c02951b2>] sys_send+0x32/0x40
[<c0295a1a>] sys_socketcall+0x13a/0x250
[<c0104383>] syscall_call+0x7/0xb





-Chris


2004-11-04 17:02:21

by Chris Stromsoe

Subject: Re: deadlock with 2.6.9

I had another deadlock after the completion of a dvd writing session last
night. The dvd was written using the ide interface with growisofs.

sysrq+p, sysrq+m, and a partial sysrq+t are at
http://hashbrown.cts.ucla.edu/deadlock/sysrq+t-20041104.

Virtually everything is stuck in schedule_timeout+0xb4/0xc0. It looks
very similar to the last deadlock I had (sysrq+t output is in the same
directory as the above as sysrq+t-20041101).

cbs:~ > lsmod
Module                  Size  Used by
ide_cd                 39328  0
cdrom                  38460  1 ide_cd
e100                   29760  0
bonding                64552  0

Any ideas? Anything in particular to check?


-Chris


2004-11-06 22:41:29

by Chris Stromsoe

Subject: Re: deadlock with 2.6.9

I had a third lockup, this time not related to burning a dvd. As before,
the bulk of the processes that were hung were cron, and looked like:

cron S C1205F60 0 5023 444 5036 5022 (NOTLB)
c721ec9c 00000082 cd1b27f0 c1205f60 ca82ab7c c252a97c c721ec94 c014584e
c252a940 c1fe8900 b7fe7000 ce607150 c1205f60 00239d5b fac8e72b 00003496
cd1b2950 cf9b1d40 7fffffff 00000001 c721ecd8 c02f9f04 b7fe7000 00000001
Call Trace:
[<c02f9f04>] schedule_timeout+0xb4/0xc0
[<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
[<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
[<c0293cab>] sock_sendmsg+0xbb/0xe0
[<c0295161>] sys_sendto+0xe1/0x100
[<c02951b2>] sys_send+0x32/0x40
[<c0295a1a>] sys_socketcall+0x13a/0x250
[<c0104383>] syscall_call+0x7/0xb


Full sysrq+t dump is at <http://hashbrown.cts.ucla.edu/deadlock/>.


Several others that were in the same state:

sendmail-mta S C1205F60 0 416 1 434 334 (NOTLB)
ceaaceac 00000082 00000000 c1205f60 c12068c0 ceaacf44 ceaace84 c0139bb5
cfba6b80 00000246 ceaaceac c01236d6 c1205f60 00005a1e cb12fa34 0000acdf
ce9f97f0 0b4fcb33 ceaacec0 00000007 ceaacee8 c02f9eb5 ceaacec0 0b4fcb33
Call Trace:
[<c02f9eb5>] schedule_timeout+0x65/0xc0
[<c01671ee>] do_select+0x16e/0x2a0
[<c016760e>] sys_select+0x2ae/0x4c0
[<c0104383>] syscall_call+0x7/0xb


ntpd S C1205F60 0 434 1 437 416 (NOTLB)
cf7e1eac 00000086 00000000 c1205f60 c12068c0 00000010 c034eb80 00000000
cfba6700 bffffa10 000000d0 cfb1a2e0 c1205f60 000008d0 0c7e1cd2 0000ace0
ce9f87d0 00000000 7fffffff 00000008 cf7e1ee8 c02f9f04 00000246 cf7e1ed0
Call Trace:
[<c02f9f04>] schedule_timeout+0xb4/0xc0
[<c01671ee>] do_select+0x16e/0x2a0
[<c016760e>] sys_select+0x2ae/0x4c0
[<c0104383>] syscall_call+0x7/0xb


mdadm S C1205F60 0 437 1 440 434 (NOTLB)
cf707eac 00000086 00000000 c1205f60 c12068c0 cfb3c610 00000003 00000000
cf8204c0 00000246 cf707eac c01236d6 c1205f60 00000448 b1f651f9 0000acd3
cfb3d790 0b4fd705 cf707ec0 00000005 cf707ee8 c02f9eb5 cf707ec0 0b4fd705
Call Trace:
[<c02f9eb5>] schedule_timeout+0x65/0xc0
[<c01671ee>] do_select+0x16e/0x2a0
[<c016760e>] sys_select+0x2ae/0x4c0
[<c0104383>] syscall_call+0x7/0xb



atd S C1205F60 0 440 1 444 437 (NOTLB)
ce5e0f48 00000082 00000000 c1205f60 c12068c0 00000000 00000000 00000000
cfba6280 00000246 ce5e0f48 c01236d6 c1205f60 00006970 daf6df4d 0000aa54
ce9f8270 0b5bfcbc ce5e0f5c 000f41a7 ce5e0f84 c02f9eb5 ce5e0f5c 0b5bfcbc
Call Trace:
[<c02f9eb5>] schedule_timeout+0x65/0xc0
[<c01243ce>] sys_nanosleep+0xde/0x160
[<c0104383>] syscall_call+0x7/0xb



The box is P3 SMP, 256MB RAM, dual Intel eepro100 controllers, bonded into
a single failover device. The box is a syslog server for ~100 other
machines logging at debug level. UDP traffic is constant. syslog logs to a
stripe of two mirrors, built with mdadm.

Each of the three hangs has been exactly the same. Nothing forks, but I
can use sysrq to dump the process table, or to kill everything, log in,
and restart.

After I logged in today, all of the partitions were mounted ro, and one of
the mirrors had kicked a partition out. Tests on the partition and the
disk that it's on so far show no errors.

What would cause everything on the box to hang in schedule_timeout?
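One thing worth noting when reading the traces: schedule_timeout() by itself
says very little, since the sendmail, ntpd, mdadm and atd sleeps above reach
it from ordinary select() and nanosleep() calls. The interesting part is the
caller one frame up, which the kernel also exposes per task in
/proc/<pid>/wchan (symbolic when kallsyms is enabled). Purely as an
illustrative, hypothetical helper that is not part of this thread, something
along these lines dumps pid, state, command and wait channel for every task
from a shell that can still run commands:

/* wchanscan.c: hypothetical example, not from this thread.
 * Walk /proc and print each task's pid, state, command name and wait
 * channel, roughly what checking /proc/<pid>/wchan by hand gives you.
 *
 * Build:  cc -o wchanscan wchanscan.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <dirent.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    char path[64], comm[64], wchan[128];
    char state;
    int pid;
    FILE *f;

    if (!proc) {
        perror("/proc");
        return 1;
    }

    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;
        pid = atoi(de->d_name);

        /* Field 2 of /proc/<pid>/stat is the command name in parentheses,
         * field 3 is the one-letter state (S, D, Z, ...). */
        snprintf(path, sizeof(path), "/proc/%d/stat", pid);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fscanf(f, "%*d (%63[^)]) %c", comm, &state) != 2) {
            fclose(f);
            continue;
        }
        fclose(f);

        /* /proc/<pid>/wchan holds the wait channel the task is sleeping
         * in, e.g. unix_wait_for_peer or do_select. */
        snprintf(path, sizeof(path), "/proc/%d/wchan", pid);
        wchan[0] = '\0';
        f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%127s", wchan) != 1)
                wchan[0] = '\0';
            fclose(f);
        }

        printf("%5d %c %-16s %s\n", pid, state, comm, wchan);
    }
    closedir(proc);
    return 0;
}

In a hang like the one described here, tasks parked in do_select or
sys_nanosleep are just idle daemons; a pile of tasks all showing
unix_wait_for_peer would instead point at a single AF_UNIX datagram
receiver that has stopped reading.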


-Chris


2004-11-07 05:58:24

by Chuck Ebbert

Subject: Re: deadlock with 2.6.9

Chris Stromsoe wrote:

> I had a third lockup, this time not related to burning a dvd. As before,
> the bulk of the processes that were hung were cron

Why so many cron processes? Is this normal on your system, or does it
look like cron keeps spawning processes because it gets no response on the
sockets?

> The box is P3 SMP

Can you try a uniprocessor kernel?

> syslog logs to a stripe of two mirrors, built with mdadm.

Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.) They are
much more reliable than software RAID.


--Chuck Ebbert 07-Nov-04 00:28:44

2004-11-07 06:48:32

by Bernd Eckenfels

Subject: OT: cron filling process table (was: deadlock with 2.6.9)

Hello,

just a thought on cron, not very kernel-related:

In article <[email protected]> you wrote:
> Why so many cron processes? Is this normal on your system, or does it
> look like cron keeps spawning processes because it gets no response on the
> sockets?

If you have a cron job that is executed very often, cron will spawn a new
child every time the deadline is reached. If the child is stuck for some
reason (i.e. an uninterruptible sleep while accessing a broken resource), it
will soon fill up your system.

Personally, that's why I prefer a cron-like system that can be configured
with a maximum concurrency (or that always serializes the jobs of a given
type). This has problems with hanging jobs, but it is generally more stable
for the system.
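
A minimal sketch of that serialize-the-jobs idea, written as a hypothetical
stand-alone wrapper rather than a feature of any particular cron
implementation (the name, lock path and usage are made up for illustration):
it takes a lock file and a command, and simply skips the run when a previous
instance, stuck or not, still holds the lock.

/* cronlock.c: hypothetical example, not from this thread.
 * Run a job under an exclusive, non-blocking file lock so a wedged run
 * cannot pile up new copies and fill the process table.
 *
 * Build:  cc -o cronlock cronlock.c
 * Usage:  cronlock /var/lock/myjob.lock /path/to/job [args...]
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(int argc, char **argv)
{
    int fd;

    if (argc < 3) {
        fprintf(stderr, "usage: %s <lockfile> <command> [args...]\n",
                argv[0]);
        return 2;
    }

    fd = open(argv[1], O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror(argv[1]);
        return 2;
    }

    /* Non-blocking exclusive lock: if the previous run is still alive
     * (or stuck in D state somewhere), give up instead of adding yet
     * another process. */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        fprintf(stderr, "%s: previous instance still running, skipping\n",
                argv[0]);
        return 0;
    }

    /* The lock fd is inherited across exec, so the lock is held for the
     * whole lifetime of the job and released automatically when it exits
     * (or is killed). */
    execvp(argv[2], &argv[2]);
    perror(argv[2]);
    return 127;
}

A crontab entry would then start the job through the wrapper instead of
directly, capping each job type at one concurrent instance.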

Greetings
Bernd

2004-11-07 07:49:23

by Chris Stromsoe

Subject: Re: deadlock with 2.6.9

On Sun, 7 Nov 2004, Chuck Ebbert wrote:

> Chris Stromsoe wrote:
>
>> I had a third lockup, this time not related to burning a dvd. As
>> before, the bulk of the processes that were hung were cron
>
> Why so many cron processes? Is this normal on your system, or does it
> look like cron keeps spawning processes because it gets no response on
> the sockets?

I'm guessing there are so many processes because every time one gets
started, it ends up stuck in schedule_timeout() until the system stops
spawning new processes. There are many of them because the system had been
running for a day or so before it became unresponsive.

>> The box is P3 SMP
>
> Can you try a uniprocessor kernel?

Boot with nosmp or boot a kernel compiled for up?

>> syslog logs to a stripe of two mirrors, built with mdadm.
>
> Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.)
> They are much more reliable than software RAID.

I've had more problems with reliable hardware raid controllers than I have
with software raid. Your mileage may vary.


-Chris

2004-11-07 07:53:24

by Chris Stromsoe

Subject: Re: deadlock with 2.6.9

On Sat, 6 Nov 2004, Chris Stromsoe wrote:

> On Sun, 7 Nov 2004, Chuck Ebbert wrote:
>
>> Chris Stromsoe wrote:
>>
>>> I had a third lockup, this time not related to burning a dvd. As
>>> before, the bulk of the processes that were hung were cron
>>
>> Why so many cron processes? Is this normal on your system, or does it
>> look like cron keeps spawning processes because it gets no response on
>> the sockets?
>
> I'm guessing there are so many processes because every time one gets
> started, it ends up stuck in schedule_timeout() until the system stops
> spawning new processes. There are many of them because the system had
> been running for a day or so before it became unresponsive.

That wasn't very clear. The high number of cron processes is extremely
abnormal. There are generally never more than a handful running at any
one time, and rarely anything more than the parent cron process. The
most frequent cron job on the system runs every 5 minutes.


-Chris

2004-11-07 08:22:18

by Bernd Eckenfels

Subject: Re: deadlock with 2.6.9

In article <[email protected]> you wrote:
> Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.) They are
> much more reliable than software RAID.

On what sample do you base this claim?

Generally, hardware raid sooner or later causes problems (especially if you
run raid5 in degraded mode or try to rebuild by replacing a disk with a
different/old signature). Also, bus hangs are commonly not handled very well
by hw raid firmware or drivers.

Issues with software raid are usually quicker to spot.

Greetings
Bernd

2004-11-08 01:13:46

by Chuck Ebbert

Subject: Re: deadlock with 2.6.9

On Sat, 6 Nov 2004 at 23:49:14 -0800 Chris Stromsoe wrote:


> Boot with nosmp or boot a kernel compiled for up?

nosmp should work, but I'd build a non-smp kernel to get rid of the
locking overhead.

>> Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.)
>> They are much more reliable than software RAID.
>
> I've had more problems with reliable hardware raid controllers than I have
> with software raid. Your mileage may vary.

I've never had trouble with either kind, but recovery has been much less
painful with hardware RAID. It takes just a few seconds to hotswap a drive.


--Chuck Ebbert 07-Nov-04 08:54:08

2004-11-13 03:48:15

by Chuck Ebbert

Subject: Re: deadlock with 2.6.9

On Sun, 07 Nov 2004 at 09:22:14 +0100 Bernd Eckenfels wrote:

> > Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.) They are
> > much more reliable than software RAID.
>
> On what sample do you base this claim?
>
> Generally, hardware raid sooner or later causes problems (especially if you
> run raid5 in degraded mode or try to rebuild by replacing a disk with a
> different/old signature). Also, bus hangs are commonly not handled very well
> by hw raid firmware or drivers.

I had 28 mirror sets on Compaq SMART2/p controllers in one server (four
controllers, two SCSI channels each, seven disks per channel.) All the disks
on channel A of each controller were mirrored to those on channel B, so even
complete failure of one channel didn't cause a problem.

Once a disk was marked 'failed' in the controller NVRAM there was no way to
convince it that some newly-inserted disk contained valid data.

Booting up with SCSI cables connected wrongly (channels A and B swapped) got you
a nice error message informing you of this fact. Swapping SCSI IDs on different
disks on the same channel was also detected and reported nicely.

And attempting to boot with a bad cable (bent pin) gave a message saying 'either
power down NOW and check cables or I will mark every disk on that channel as
failed.'

Of course this system was 100% Compaq; even the disks had Compaq firmware
though the labels said IBM. And it was very expensive...


--Chuck Ebbert 12-Nov-04 22:42:32

2004-11-15 20:19:52

by Bill Davidsen

Subject: Re: deadlock with 2.6.9

Chuck Ebbert wrote:
> On Sun, 07 Nov 2004 at 09:22:14 +0100 Bernd Eckenfels wrote:
>
>
>>>Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.) They are
>>>much more reliable than software RAID.
>>
>>On what sample do you base this claim?
>>
>>Generally, hardware raid sooner or later causes problems (especially if you
>>run raid5 in degraded mode or try to rebuild by replacing a disk with a
>>different/old signature). Also, bus hangs are commonly not handled very well
>>by hw raid firmware or drivers.
>
>
> I had 28 mirror sets on Compaq SMART2/p controllers in one server (four
> controllers, two SCSI channels each, seven disks per channel.) All the disks
> on channel A of each controller were mirrored to those on channel B, so even
> complete failure of one channel didn't cause a problem.
>
> Once a disk was marked 'failed' in the controller NVRAM there was no way to
> convince it that some newly-inserted disk contained valid data.
>
> Booting up with SCSI cables connected wrongly (channels A and B swapped) got you
> a nice error message informing you of this fact. Swapping SCSI IDs on different
> disks on the same channel was also detected and reported nicely.
>
> And attempting to boot with a bad cable (bent pin) gave a message saying 'either
> power down NOW and check cables or I will mark every disk on that channel as
> failed.'
>
> Of course this system was 100% Compaq; even the disks had Compaq firmware
> though the labels said IBM. And it was very expensive...

I have had quite good luck with the IBM "ServeRAID" controllers, running
30+ systems with one or two of them and between 1 and 10TB of storage. To
convince the controller that a drive marked bad really is good does
require shutting down the system and saying reassuring things to the
firmware from the console, but it can be done. I'm told that I may be
able to do that with the system running, using the control application,
but given that the servers are spread over four timezones I have no urge
to try the remote version.

I would like to have some form of hardware RAID on at least what I need
to boot, so a failure of one of the boot drives doesn't leave the
machine unable to boot. Particularly with (S)ATA systems, which tend to
fail over to the mirror only when the first drive is *dead*, not when it's
too sick to give good data but is still trying.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-11-20 03:05:07

by Chris Stromsoe

Subject: deadlock also with 2.6.9 nosmp, and 2.6.10-rc1 nosmp, was Re: deadlock with 2.6.9

I am still seeing this with 2.6.9 nosmp and with 2.6.10-rc1 nosmp. The
symptoms are the same. Every so often, things stop forking. sysrq+t
shows almost every running process in schedule_timeout. It seems to
happen every few days.

The system is a syslog collector for >100 devices that log everything at
debug level. Every 3 or 4 days (twice a week) it writes all of the logs
to dvd using growisofs.

Every few days -- maybe 3 or 4, but not tied in any way that I can see to
the dvd writing -- the system becomes unresponsive; nothing forks. I can
use sysrq+e and sysrq+i to kill everything and then regain console access
and restart everything through /etc/rc*.d/* manually. Life continues on
until the next hang.

I have had the same problem with 2.6.8.1 smp, 2.6.9 smp, 2.6.9 smp booted
with nosmp, and now 2.6.10-rc1 smp booted with nosmp.

I'm not sure where to look next. This is fairly reproducible; it just
takes a few days.


-Chris



2004-11-23 18:23:03

by Chris Stromsoe

Subject: Re: deadlock also with 2.6.9 nosmp, and 2.6.10-rc1 nosmp, was Re: deadlock with 2.6.9

I managed to get access to the system while it was exhibiting the problem.
There were two cron processes, both in D state. I was not able to attach to
either process with gdb. /proc/pid/wchan showed do_fork for both
processes.

I was able to 'killall -9 cron' and kill all of the cron tasks.

I had several other processes that /proc/pid/wchan showed in
unix_wait_for_peer (sshd), futex_wait (syslogd), and do_exit (zombie
sshd).

When I killed syslogd, the zombie sshd processes disappeared and
everything started magically working again.
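
That outcome lines up with the earlier traces: if syslogd stops draining
/dev/log (an AF_UNIX SOCK_DGRAM socket), its receive queue fills, and every
process that then calls syslog(), cron firing a job or sshd logging a login,
blocks in its send to /dev/log. That is the unix_dgram_sendmsg ->
unix_wait_for_peer -> schedule_timeout path in the sysrq+t dumps, and it
would explain why killing the non-reading syslogd let everything continue.
Purely as an illustrative, hypothetical sketch (the socket path and message
size are made up; this is not code from the thread), the following
stand-alone program shows the same kind of wedge in user space: the sender
blocks forever in sendto() once the never-read datagram queue is full.

/* stucklog.c: hypothetical example, not from this thread.
 * Demonstrates how a blocking AF_UNIX SOCK_DGRAM sender wedges when the
 * receiver stops reading: once the receiver's small per-socket datagram
 * queue is full, sendto() sleeps with no timeout, showing up in
 * /proc/<pid>/wchan as unix_wait_for_peer.
 *
 * Build:  cc -o stucklog stucklog.c   (interrupt it with ^C)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define SOCK_PATH "/tmp/stucklog-demo"

int main(void)
{
    struct sockaddr_un addr;
    char msg[128];
    int rx, tx, i;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(SOCK_PATH);

    /* The "syslogd" side: bind the datagram socket but never read it. */
    rx = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (rx < 0 || bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* The "cron"/"sshd" side: a plain blocking sender.  The first few
     * datagrams are queued; after that each sendto() hangs until the
     * receiver reads or goes away. */
    tx = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (tx < 0) {
        perror("socket");
        return 1;
    }
    memset(msg, 'x', sizeof(msg));

    for (i = 0; ; i++) {
        printf("sending datagram %d\n", i);
        fflush(stdout);
        if (sendto(tx, msg, sizeof(msg), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("sendto");
            break;
        }
    }
    return 0;
}

Checked with /proc/<pid>/wchan or sysrq+t, the hung sender sits in the same
place as the cron and sshd processes above.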


-Chris
