2004-03-09 17:38:23

by Anders K. Pedersen

Subject: 2.6.3 userspace freeze

Hello,

Last night I upgraded two of our webservers from Linux 2.4 to 2.6.3.
During the night both of them rebooted spontaneously (i.e. with no
indication of why in the log files) several times, so this morning I
attached a serial console to capture the kernel messages when they rebooted.

What I found was that all of a sudden my SSH connections to the server
and the local vtys would freeze, and it would stop responding to TCP
connections, while still responding to ICMP echo requests. Apparently
all userspace processes just froze. After approximately 60 seconds, it
logged "SOFTDOG: Initiating system reboot.", and rebooted. This was the
only kernel message, except for the boot messages. This happened
repeatedly on both servers.

Both servers run (mainly) Apache 1.3 and Sun Chili ASP (several hundred
processes each), and the freezes seemed to happen during high load
peaks.

I have attached the kernel .config (same on both servers) and the kernel
boot messages including the software watchdog reboot message. Both
servers are identical IBM xSeries 345 servers. I have other similar
servers running 2.6.3 for other purposes without any problems (so far).

Any ideas on what's wrong, or how to find out, would be much
appreciated.

--
Med venlig hilsen - Best regards

Anders K. Pedersen
Network Engineer
------------------------------------------------
Cohaesio A/S - Maglebjergvej 5D - DK-2800 Lyngby
Phone: +45 45 880 888 - Fax: +45 45 880 777
Mail: [email protected] - http://www.cohaesio.com
------------------------------------------------


Attachments:
2.6.3-config (22.00 kB)
2.6.3-freeze.log (10.07 kB)
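
A note on the "SOFTDOG" message in the report above: the software watchdog reboots the machine when nothing in userspace has written to the watchdog device within its timeout, which fits the observation that the kernel still answered ICMP while userspace was frozen. The following is only a minimal sketch of such a keepalive loop, not the daemon used on these servers. The device path, the 10 second interval and the function name `pet_watchdog` are assumptions chosen for illustration.

```python
import time


def pet_watchdog(dev, interval_s, rounds, sleep=time.sleep):
    """Write one byte to the watchdog device every interval_s seconds.

    Each write resets the kernel's countdown. If this loop stops being
    scheduled (for example because userspace is frozen), the writes stop
    and the watchdog timer eventually expires.
    """
    for _ in range(rounds):
        dev.write(b"\0")  # any write counts as a keepalive
        dev.flush()
        sleep(interval_s)
    # Writing 'V' before closing is the "magic close" convention: it asks
    # the driver to disarm rather than reboot when the file is closed.
    dev.write(b"V")
    dev.flush()


if __name__ == "__main__":
    # Assumed values: /dev/watchdog is the conventional device node and
    # 10 s is an arbitrary interval well below a 60 s timeout.
    with open("/dev/watchdog", "wb", buffering=0) as dev:
        pet_watchdog(dev, interval_s=10, rounds=6)
```

Passing the device as a file-like object and the sleep function as a parameter keeps the loop easy to exercise without touching a real watchdog.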

2004-03-09 22:51:13

by Andrew Morton

Subject: Re: 2.6.3 userspace freeze

"Anders K. Pedersen" <[email protected]> wrote:
>
> Last night I upgraded two of our webservers from Linux 2.4 to 2.6.3.
> During the night both of them rebooted spontaneously (i.e. no indication
> of why in the log files) several times, so this morning I attached a
> serial console to capture the kernel messages, when they rebooted.
>
> What I found was that all of a sudden my SSH connections to the server
> and the local vtys would freeze, and it would stop responding to TCP
> connections, while still responding to ICMP echo requests. Apparently
> all userspace processes just froze. After approximately 60 seconds, it
> logged "SOFTDOG: Initiating system reboot.", and rebooted. This was the
> only kernel message, except for the boot messages. This happened
> repeatedly on both servers.
>
> Both servers run (mainly) Apache 1.3 and Sun Chili ASP (several hundred
> processes each), and the freezes seemed to happen during high load
> peaks.
>
> I have attached the kernel .config (same on both servers) and the kernel
> boot messages including the software watchdog reboot message. Both
> servers are identical IBM xSeries 345 servers. I have other similar
> servers running 2.6.3 for other purposes without any problems (so far).
>
> Any ideas on what's wrong, or how to find out, would be much
> appreciated.

It could be a kernel deadlock, or a memory leak, or a disk device driver
bug.

Would it be possible to run a `vmstat 1' somewhere and capture the last
thirty or so lines prior to the reboot?
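
One way to capture the request above without a serial console is to keep only the most recent samples in a file that is flushed to disk after every line. This is a rough sketch under assumptions: the output path and the helper name `keep_last_lines` are made up for illustration, and if the freeze also stalls disk writes the file may lag behind, so a serial console remains the more robust option.

```python
import collections
import os
import subprocess


def keep_last_lines(lines, path, n=30):
    """Keep only the most recent n lines of `lines` in the file at `path`.

    The file is rewritten and fsync'ed after every line, as a best-effort
    way to have the latest samples on disk if the machine is reset.
    """
    recent = collections.deque(maxlen=n)
    for line in lines:
        recent.append(line.rstrip("\n"))
        with open(path, "w") as f:
            f.write("\n".join(recent) + "\n")
            f.flush()
            os.fsync(f.fileno())
    return list(recent)


if __name__ == "__main__":
    # `vmstat 1` prints one sample per second until it is killed.
    proc = subprocess.Popen(["vmstat", "1"], stdout=subprocess.PIPE, text=True)
    # Hypothetical output location; pick a file system you trust.
    keep_last_lines(proc.stdout, "/var/tmp/vmstat-tail.log")
```

Rewriting a 30-line file once per second is cheap, and the bounded deque means the file never grows.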

2004-03-10 09:22:53

by Anders K. Pedersen

Subject: RE: 2.6.3 userspace freeze

Hello,

"Andrew Morton" <[email protected]> wrote:
> It could be a kernel deadlock, or a memory leak, or a disk device driver
> bug.
>
> Would it be possible to run a `vmstat 1' somewhere and capture the last
> thirty or so lines prior to the reboot?

I ran it again tonight - here are the last lines from 'vmstat 1' on the
serial console:

  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 1  0  0      0   55736  25252 552272   0   0    44    80  1302   614   9   2  90
 0  0  0      0   56952  25284 552308   0   0     0   244  1189   495   4   1  95
 0  0  0      0   56376  25284 552308   0   0     0     0  1205   598   4   1  95
 0  0  0      0   54832  25316 552276   0   0     0   132  1136   528   2   8  91
 0  0  0      0   51488  25324 552268   0   0     0   160  1124   384   4   3  94
 0  0  0      0   51416  25332 552260   0   0     0    48  1091   348   1   0  99
 0  0  0      0   51272  25332 552260   0   0     0     0  1055   290   1   0  98
 0  0  0      0   51072  25332 552260   0   0     0     0  1078   345   1   0  99
 0  0  0      0   50616  25340 552320   0   0     0    40  1156   491   7   9  85
 0  0  0      0   50232  25348 552312   0   0     0    68  1188   509   4   0  95
 0  0  0      0   50232  25356 552372   0   0    16    56  1101   395   1   1  99
 0  0  0      0   50232  25356 552372   0   0     0     0  1075   314   0   0  99
 1  0  0      0   50104  25356 552372   0   0    24     0  1094   323   2   0  98
 0  0  0      0   50232  25364 552364   0   0     0    40  1150   516   3   8  89
 0  0  0      0   50120  25388 552408   0   0    36   324  1151   374   2   1  98
 0  0  0      0   50056  25396 552468   0   0    32    48  1142   401   1   0  99
 2  0  0      0   49032  25396 552604   0   0   108     0  1146   380   3   1  96
 0  0  0      0   49800  25404 552596   0   0    12    32  1250   516   1   0  98
 0  0  0      0   49688  25404 552596   0   0     0     0  1209   532   6   7  87
 0  0  0      0   49368  25412 552588   0   0     0   200  1347   780   6   1  92
 0  0  0      0   49368  25432 552704   0   0    76    64  1296   696  10   2  88
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  0  0      0   48024  25456 552748   0   0    52     0  1148   418   4   3  94
 0  0  0      0   41560  25456 552748   0   0    12     0  1213   483  10   3  87
 0  0  0      0   35504  25464 552740   0   0    32   116  1227   631   8  10  82
 2  0  0      0   26656  25464 552740   0   0     0     0  1274   491  16   6  78
 0  0  0      0   12040  23296 552936   0   0     0   248  1322   691  12  11  77
 4  0  1      0    6128  21452 547368   0   0     4     0  1370   704  16  10  74
 0  0  0      0    6448  17008 533520   0   0     0     0  1138   517   9  18  73
 2  0  0      0    5512  16900 531996   0   0     0   120  1200   560   6   8  86
 0  0  0      0    5832  16792 529724   0   0     8   192  1197   524   7   6  88
 1  0  0      0    6848  16804 529168   0   0    12    68  1148   411   3   2  95
 0  0  0      0    6976  16836 529136   0   0     0   212  1178   428   6   1  93
 0  0  0      0    6976  16836 529136   0   0     0     0  1051   262   0   0 100
 0  0  0      0    6912  16884 529088   0   0    12   116  1137   404   2   1  98
 0  0  0      0    6464  16892 529080   0   0     8     0  1096   364   2   8  90
 0  0  0      0    6016  16788 528504   0   0     8   180  1147   384   4   1  95
 0  0  0      0    6784  16796 527544   0   0     8     0  1105   344   4   1  95
 0  0  0      0    7104  16808 523316   0   0     8     0  1173   493  12   2  86
 0  2  0      0    5952  16608 519980   0   0  4260   200  1860   912  29  10  61
 0  1  0      0    5568  16624 520780   0   0  3852     8  1926   616  16  21  62
 0  1  0      0    5944  16676 518348   0   0  2864   976  1498   829  17   4  79
 1  0  0      0    5680  16524 516324   0   0  1072     4  1405  1003  39   8  53
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  1  0      0    5936  16504 511584   0   0  1188     0  1452  1003  40   7  53
 2  0  0      0    6608  16568 508664   0   0  1116   184  1476  1034  42   6  52
 1  0  0      0    5520  16628 506156   0   0  1056     0  1475  1086  37  22  41
 0  1  0      0    5520  16716 504300   0   0  1176   408  1418  1026  32   7  61
 2  0  0      0    5648  16636 499824   0   0  1172   144  1444  1003  39   6  56
 0  1  0      0    5648  16692 498612   0   0  1176     0  1405   955  33   5  61
 0  1  0      0    6040  16708 494584   0   0  1072   344  1422  1069  33   7  59
 0  1  0      0    6056  16728 491300   0   0   852     0  1462   976  38  23  39
 0  1  0      0    7080  16660 488580   0   0  1112   336  1492  1050  27   6  67
 1  0  0      0    5864  16700 486976   0   0  1200   116  1502  1209  20   5  75
 0  1  0      0    6568  16584 482060   0   0  1592     0  1514  1204  24   8  68
 9  0  9      0  508980  16616 469516   0   0  1452   708  3347  1647  14  68  18
 6  0  0      0  476088  17520 470924   0   0  2396   316  1449  2154  80  19   1
 3  0  0      0  474096  17680 470968   0   0   140     0  1131  4116  87  13   0
 4  0  0      0  471792  17716 471000   0   0     0  1564  1080  5567  87  13   0
 1  1  0      0  463344  17804 477372   0   0  5856     0  1438  3048  87  10   3
 3  0  0      0  456880  17980 481140   0   0  4605     0  1758  2237  78  18   4
 2  0  0      0  455408  18616 481184   0   0   584  2080  1259   731  77  20   3
 2  0  2      0  442136  19096 481180   0   0   484     0  1158   467  45  55   0
 3  0  1      0  430520  19300 481248   0   0   172   284  1119   329  11  89   0
 0  2  0      0  418720  19508 481244   0   0   200     8  1085   301  12  88   0
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 2  0  0      0  399552  20080 482032   0   0  1255   822  1232   691  44  44  12
 7  3  0      0  331768  20208 482448   0   0   588     0  1424   788  60  40   0
 0  1  0      0  293128  20464 482668   0   0   412     0  1576   980  56  34   9
 0  1  1      0  289544  21364 482856   0   0  1056   252  1526  1380   9   7  85
 2  1  0      0  285000  22376 483068   0   0  1164    24  1551  1259  16   5  79
 0  1  1      0  277976  23024 483372   0   0   924    60  1674  1442  30   7  64
 0  1  0      0  273560  23828 483520   0   0   900  1128  1480  1149   9  21  70
 0  1  0      0  268832  24656 483576   0   0   864     0  1373   844   9   5  86
SOFTDOG: Initiating system reboot.

I also had top running - this is the last output before it froze:

04:02:33 up 3:33, 2 users, load average: 1.27, 0.44, 0.35
266 processes: 265 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 24.4% user 15.1% system 0.1% nice 22.0% iowait 38.2% idle
CPU1 states: 25.2% user 13.2% system 0.2% nice 49.2% iowait 11.1% idle
Mem: 2072988k av, 1795780k used, 277208k free, 0k shrd, 23232k buff
     573192k active, 372544k inactive
Swap: 8384880k av, 0k used, 8384880k free 483436k cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM  TIME CPU COMMAND
      httpd     16   0  111M  104M 11320 S    62.8  5.1  0:25   0 httpd
 4306 root      16   0  2476  1464  1936 R     3.3  0.0  5:57   1 top
29761 root      34  19  1464   520  1368 D N   2.5  0.0  0:00   0 updatedb
29889 root      16   0 29628   27M  1712 S     2.1  1.3  0:00   1 rotatelogspsoft
 1515 root      16   0  372M   65M 44444 S     0.7  3.2  1:10   1 caspeng
 1973 root      16   0  1416   472  1356 S     0.7  0.0  1:40   1 vmstat
29891 root      16   0  2584  1300  1712 S     0.5  0.0  0:00   1 rotatelogspsoft
 1559 root      16   0  2068  1064  1864 S     0.3  0.0  0:44   1 soagent
29890 root      16   0  2684  1400  1712 S     0.3  0.0  0:00   1 rotatelogspsoft
    1 root      16   0  1376   484  1328 S     0.0  0.0  0:01   0 init
    2 root      RT   0     0     0     0 SW    0.0  0.0  0:00   0 migration/0
    3 root      34  19     0     0     0 SWN   0.0  0.0  0:00   0 ksoftirqd/0
    4 root      RT   0     0     0     0 SW    0.0  0.0  0:00   1 migration/1
    5 root      34  19     0     0     0 SWN   0.0  0.0  0:00   1 ksoftirqd/1
    6 root       5 -10     0     0     0 SW<   0.0  0.0  0:00   0 events/0
    7 root       5 -10     0     0     0 SW<   0.0  0.0  0:00   1 events/1
    8 root       5 -10     0     0     0 SW<   0.0  0.0  0:00   0 kblockd/0
    9 root       5 -10     0     0     0 SW<   0.0  0.0  0:00   1 kblockd/1
   10 root      15   0     0     0     0 SW    0.0  0.0  0:00   0 khubd
   14 root      15   0     0     0     0 SW    0.0  0.0  0:07   1 kswapd0
   13 root      25   0     0     0     0 SW    0.0  0.0  0:00   0 pdflush
   12 root      15   0     0     0     0 SW    0.0  0.0  0:00   1 pdflush
   11 root      15   0     0     0     0 SW    0.0  0.0  0:00   0 kirqd
   15 root      15 -10     0     0     0 SW<   0.0  0.0  0:00   0 aio/0
   16 root      10 -10     0     0     0 SW<   0.0  0.0  0:00   1 aio/1
   17 root      25   0     0     0     0 SW    0.0  0.0  0:00   1 scsi_eh_0
   18 root      19   0     0     0     0 SW    0.0  0.0  0:00   0 kseriod
   19 root      15   0     0     0     0 SW    0.0  0.0  0:02   1 kjournald
  155 root      15   0     0     0     0 SW    0.0  0.0  0:00   1 kjournald
  156 root      15   0     0     0     0 DW    0.0  0.0  0:00   1 kjournald
  157 root      15   0     0     0     0 SW    0.0  0.0  0:03   1 kjournald
  236 root      18   0  1352   404  1312 S     0.0  0.0  0:00   0 mingetty
  303 root       6 -10  1356   420  1308 S <   0.0  0.0  0:00   0 watchdogd
  591 root      16   0  1428   540  1368 S     0.0  0.0  0:01   1 syslogd
  596 root      16   0  1368   444  1324 S     0.0  0.0  0:00   0 klogd
  760 ntp       16   0  1896  1892  1740 S     0.0  0.0  0:00   1 ntpd
  781 named     15   0 52648   50M  2056 S     0.0  2.4  0:12   1 named
  799 root      15   0  3468  1484  3184 S     0.0  0.0  0:00   1 sshd
  902 root      16   0  3708  1200  3416 S     0.0  0.0  0:00   1 master

If you need anything further, please let me know.

Med venlig hilsen - Best regards

Anders K. Pedersen

2004-03-10 13:59:36

by Jan Kara

Subject: Re: 2.6.3 userspace freeze

Hello,

I don't understand the vmstat output very well, but just to rule out one
possibility: can you check whether the deadlocks also occur when quotas
are turned off (I've seen that you have at least compiled them into the
kernel)?

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2004-03-10 15:30:25

by Anders K. Pedersen

Subject: RE: 2.6.3 userspace freeze

Hello,

I will try this tonight; just to make sure I understand you correctly:
you just want me to turn off quotas on all file systems (currently they
are in use on one of them), and it is not necessary to recompile the
kernel without quota support?

Regards,
Anders K. Pedersen

> -----Original Message-----
> From: Jan Kara [mailto:[email protected]]
> Sent: Wednesday, March 10, 2004 2:59 PM
> To: Anders K. Pedersen
> Cc: Andrew Morton; [email protected]
> Subject: Re: 2.6.3 userspace freeze
>
>
> Hello,
>
> I don't understand much the output of vmstat but just to rule out one
> possibility - can you try whether you will observe deadlocks also when
> you will not have quotas turned on (I've seen that you have at least
> compiled them in the kernel)?
>
> Honza

2004-03-10 15:44:06

by Jan Kara

Subject: Re: 2.6.3 userspace freeze

Hello,

> I will try this tonight; just to make sure I understand you correctly:
> You just want me to turn off quotas on all file systems (currently they
> are in use on one of them), and it is not necessary to recompile the
> kernel without quota support?
Yes, just turning quotas off with quotaoff(8) should be enough to rule
out possible deadlocks caused by quotas.

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs
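
As a companion to the quotaoff(8) suggestion above, the following sketch lists mounted file systems whose mount options still mention quotas. It is an illustration under assumptions, not part of the thread: the option names checked and the helper name `quota_mounts` are my own choices, the device names in the usage example are invented, and a quota mount option only shows that quotas can be enabled on that file system, not that they currently are.

```python
# Option prefixes treated as "quota related" in this sketch (an assumption).
QUOTA_PREFIXES = ("usrquota", "grpquota", "usrjquota", "grpjquota", "quota")


def quota_mounts(mounts_text):
    """Return mount points whose options mention a quota option.

    `mounts_text` uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        options = fields[3].split(",")
        if any(opt.startswith(QUOTA_PREFIXES) for opt in options):
            found.append(fields[1])
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mountpoint in quota_mounts(f.read()):
            print("quota option present on", mountpoint)
```

For example, given a line such as `/dev/sda5 /home ext3 rw,usrquota,grpquota 0 0` (a made-up device), the function reports `/home`, while a `noquota` option is not matched.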

2004-03-11 00:56:15

by Anders K. Pedersen

Subject: RE: 2.6.3 userspace freeze

Hello,

> -----Original Message-----
> From: Jan Kara [mailto:[email protected]]
> Sent: Wednesday, March 10, 2004 4:44 PM
> To: Anders K. Pedersen
> Cc: Andrew Morton; [email protected]
> Subject: Re: 2.6.3 userspace freeze

> > I will try this tonight; just to make sure I understand you correctly:
> > You just want me to turn off quotas on all file systems (currently they
> > are in use on one of them), and it is not necessary to recompile the
> > kernel without quota support?
> Yes, just turning quotas off with quotaoff(8) should be enough to rule
> out possible deadlocks caused by quotas.

I just tried booting the 2.6.3 kernel without any quotas enabled in
fstab, and it failed as before, after only 17 minutes.

Output from vmstat 1 before the freeze:

  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  1  0      0 1388352  67052 138104   0   0   440   364  1309   806  15   5  80
 0  1  0      0 1386304  67496 138068   0   0   436   668  1410  1008  24   5  71
 0  1  0      0 1384512  67980 138196   0   0   584   692  1583  1295  33   5  62
 0  1  1      0 1385024  68436 138284   0   0   500   492  1404   995  17   2  80
 1  1  0      0 1383888  68880 138316   0   0   448   568  1422  1099  24  22  53
 0  1  0      0 1381968  69276 138328   0   0   396   316  1382   929  26   4  71
 0  1  0      0 1381016  69712 138300   0   0   428  1488  1415  1095  12   3  85
 0  1  0      0 1379864  70136 138352   0   0   424   500  1275   873  11   2  87
 0  1  1      0 1378792  70600 138296   0   0   460   660  1320   967  11   1  88
 0  1  0      0 1377704  71020 138352   0   0   416   884  1445  1245  25  24  51
 0  1  0      0 1376616  71440 138340   0   0   420   556  1372   991  23   5  72
 0  1  0      0 1374632  71884 138372   0   0   436  1308  1423  1065  17   4  79
 0  1  0      0 1373672  72324 138340   0   0   444   448  1319   983   9   3  88
 0  1  0      0 1372584  72752 138320   0   0   428   316  1309   840  16   2  82
 0  1  0      0 1371496  73208 138340   0   0   448   688  1315  1085   9  13  78
 0  3  0      0 1370216  73676 138552   0   0   672   604  1436  1172  13   3  84
 0  1  0      0 1367976  74088 138616   0   0   448   524  1442  1035  22   4  74
 0  1  0      0 1365280  74536 138644   0   0   448   564  1306   918   9   4  87
 0  1  0      0 1363808  74960 138696   0   0   424   360  1291   804  11   2  87
 0  1  1      0 1362848  75428 138704   0   0   460   772  1321  1117  18  27  55
 0  1  0      0 1361504  75916 138692   0   0   488   464  1322   928  15   2  83
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  1  0      0 1360224  76376 138708   0   0   452   432  1286   783  15   3  83
 1  0  0      0 1358712  76812 138680   0   0   436   604  1366  1091  11  12  76
 0  1  0      0 1357624  77256 138780   0   0   556   740  1363  1025   7   3  90
 0  1  0      0 1356600  77708 138804   0   0   452   492  1303   943  10  17  74
 0  1  0      0 1355064  78196 138860   0   0   480   244  1322   928  24   5  71
 1  0  0      0 1353976  78668 138864   0   0   472   272  1323   837  11   2  87
 0  1  0      0 1352824  79132 138876   0   0   464   560  1365   993  11   3  86
 0  1  0      0 1350152  79592 138824   0   0   460   508  1434  1044  22   3  75
 0  1  0      0 1350344  80060 138900   0   0   500   348  1279   894  10  17  73
 0  1  0      0 1349200  80544 138892   0   0   480   668  1387  1124  20   5  75
 0  1  0      0 1348240  80988 138924   0   0   436  1256  1427  1256   6   2  92
 0  1  0      0 1345440  81488 138968   0   0   524   500  1368   940  14   4  82
 0  1  0      0 1344272  81952 138912   0   0   464   440  1352   918  13   4  83
 2  0  0      0 1344784  82428 138980   0   0   476   552  1310   914   6   8  86
 0  1  0      0 1343632  82904 138912   0   0   468   664  1331  1163   6  15  79
 0  1  0      0 1342352  83376 138916   0   0   464   816  1354  1085   5   2  93
 0  1  0      0 1341264  83820 138948   0   0   444   316  1216   678   3   1  96
 0  1  0      0 1340112  84248 138996   0   0   428   392  1233   752   3   2  95
 0  1  0      0 1338824  84660 139060   0   0   464   524  1363   949  11   1  88
 0  1  1      0 1337352  85132 139064   0   0   512   608  1374  1143   9  21  70
 0  1  0      0 1335992  85616 139124   0   0   516  1012  1655  1483  16   4  79
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 1  0  0      0 1334904  86100 139116   0   0   488   352  1248   788   3   2  95
 0  1  0      0 1334008  86552 139140   0   0   448   696  1359  1045   3   1  96
 0  1  0      0 1333048  87036 139200   0   0   484   492  1423  1021   4   2  94
 0  1  1      0 1331960  87528 139184   0   0   476   596  1295   990   4  22  74
 1  0  0      0 1369848  87920 139268   0   0   392   664  1291   837  14  10  76
 1  0  0      0 1361848  87920 139268   0   0     0   488  1216   648  54   6  41
 1  0  0      0 1357048  87928 139260   0   0    16   300  1147   513  49   4  47
 1  1  1      0 1351096  87936 139320   0   0    80   704  1276   840  50   6  44
 1  1  1      0 1346168  87988 139336   0   0    44  1308  1337   996  49  16  35
 1  1  1      0 1336312  88012 139312   0   0    72   840  1408   972  51   6  43
18  1 17      0 1567736  88012 139380   0   0     0   512  1401   718  26  50  23
 1  0  1      0 1547448  88012 139380   0   0     0  1284  1375  1122  49  27  24
 1  0  0      0 1545016  88012 139380   0   0     0   104  1077   298  48   4  49
 1  0  0      0 1542328  88036 139424   0   0     0   904  1170   568  48  11  41
 1  0  0      0 1539208  88036 139424   0   0     0   548  1165   623  49   6  45
 1  0  0      0 1534792  88036 139424   0   0     0   356  1116   457  47   6  47
 1  1  3      0 1521464  88036 139424   0   0     0   420  1119   414  32  32  36
 2  1  2      0 1508904  88036 139424   0   0     0   808  1210   620  12  89   0
 1  0  0      0 1497624  88056 139404   0   0     4   972  1151   515  13  84   3
 0  0  0      0 1497616  88056 139404   0   0     0     0  1037   281   1  11  88
 3  0  0      0 1427536  88056 139404   0   0     0     0  1157   392  39  24  36
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  1  0      0 1385352  88116 139480   0   0   152     0  1190   507  20  28  52
 1  0  0      0 1386952  88124 139472   0   0    16     0  1087   323   1   1  99
SOFTDOG: Initiating system reboot.

Last output from top before it froze:

01:26:32 up 17 min, 2 users, load average: 1.23, 0.52, 0.31
253 processes: 188 sleeping, 4 running, 61 zombie, 0 stopped
CPU0 states: 45.0% user 36.0% system 0.0% nice 3.1% iowait 15.2% idle
CPU1 states: 21.3% user 34.4% system 0.0% nice 0.2% iowait 43.0% idle
Mem: 2072988k av, 569868k used, 1503120k free, 0k shrd, 88036k buff
     403116k active, 101256k inactive
Swap: 8384880k av, 0k used, 8384880k free 139424k cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM  TIME CPU COMMAND
 1109 root      25   0  113M  105M 11316 R    88.6  5.2  0:16   1 httpd
 2018 root      16   0  2360  1348  1936 R     1.5  0.0  0:22   0 top
 1879 root      16   0  1412   468  1356 S     0.7  0.0  0:05   1 vmstat
   19 root      15   0     0     0     0 SW    0.3  0.0  0:01   1 kjournald
  781 named     15   0 52932   50M  2056 S     0.3  2.4  0:02   1 named
 3144 root      25   0  3656   748  3388 S     0.3  0.0  0:00   1 rotatelogs
 3145 root      25   0  3656   748  3388 S     0.3  0.0  0:00   0 rotatelogs
 3147 root      25   0  3656   748  3388 S     0.3  0.0  0:00   0 rotatelogs
 3149 root      25   0  3656   748  3388 S     0.3  0.0  0:00   0 rotatelogs
 3153 root      25   0  3660   752  3388 S     0.3  0.0  0:00   0 rotatelogs
 3154 root      25   0  3660   748  3388 S     0.3  0.0  0:00   1 rotatelogs
 3155 root      25   0  3660   748  3388 S     0.3  0.0  0:00   1 rotatelogs
 3163 root      25   0  3660   752  3388 S     0.3  0.0  0:00   0 rotatelogs
 3167 root      25   0  3660   752  3388 S     0.3  0.0  0:00   0 rotatelogs
 3178 root      25   0  3660   752  3388 S     0.3  0.0  0:00   1 rotatelogs
 3179 root      25   0  3660   752  3388 S     0.3  0.0  0:00   0 rotatelogs
 3181 root      25   0  3660   752  3388 S     0.3  0.0  0:00   0 rotatelogs
 3185 root      25   0  3664   756  3388 S     0.3  0.0  0:00   0 rotatelogs
 3186 root      25   0  3664   752  3388 S     0.3  0.0  0:00   1 rotatelogs
 3191 root      25   0  3664   756  3388 S     0.3  0.0  0:00   0 rotatelogs
 3202 root      25   0  3656   748  3388 S     0.3  0.0  0:00   0 rotatelogs
 3206 root      25   0  3656   748  3388 S     0.3  0.0  0:00   0 rotatelogs
 3207 root      25   0  3656   748  3388 S     0.3  0.0  0:00   1 rotatelogs
 3210 root      25   0  3656   748  3388 S     0.3  0.0  0:00   1 rotatelogs
 3222 root      25   0  3660   752  3388 S     0.3  0.0  0:00   1 rotatelogs
 3224 root      25   0  3660   752  3388 S     0.3  0.0  0:00   1 rotatelogs
 3229 root      25   0   448   136   276 R     0.3  0.0  0:00   0 rotatelogs
  591 root      15   0  1428   540  1368 S     0.1  0.0  0:00   0 syslogd
 1312 ftpd      15   0  2312  1188  1864 S     0.1  0.0  0:00   1 proftpd
 3135 root      25   0  3664   752  3388 S     0.1  0.0  0:00   1 rotatelogs
 3136 root      25   0  3656   748  3388 S     0.1  0.0  0:00   0 rotatelogs
 3137 root      25   0  3656   748  3388 S     0.1  0.0  0:00   1 rotatelogs
 3138 root      25   0  3656   748  3388 S     0.1  0.0  0:00   0 rotatelogs
 3139 root      25   0  3656   748  3388 S     0.1  0.0  0:00   1 rotatelogs
 3140 root      25   0  3656   748  3388 S     0.1  0.0  0:00   0 rotatelogs
 3141 root      25   0  3656   748  3388 S     0.1  0.0  0:00   0 rotatelogs
 3142 root      25   0  3656   748  3388 S     0.1  0.0  0:00   1 rotatelogs
 3143 root      25   0  3656   748  3388 S     0.1  0.0  0:00   0 rotatelogs
 3146 root      25   0  3656   748  3388 S     0.1  0.0  0:00   1 rotatelogs
 3148 root      25   0  3656   748  3388 S     0.1  0.0  0:00   1 rotatelogs
 3150 root      25   0  3660   752  3388 S     0.1  0.0  0:00   1 rotatelogs

Regards,
Anders K. Pedersen

2004-03-12 08:49:05

by Anders K. Pedersen

Subject: RE: 2.6.3 userspace freeze

Hello again,

Tonight I reproduced this issue with Linux 2.6.4 final - I've attached
the dmesg and .config for this kernel.

Output from vmstat 1 before the freeze:

  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  0  0      0  459408 241776 642496   0   0     0   208  1166   548   5   2  92
 0  0  0      0  459408 241780 642492   0   0     0    88  1241   540   9   8  84
 0  0  0      0  458320 241780 642492   0   0     0     0  1202   456   6   1  93
 0  0  0      0  458320 241784 642488   0   0     0    80  1161   440   1   1  98
 0  0  0      0  458384 241784 642488   0   0     0     0  1184   460   5   1  94
 0  0  0      0  458400 241784 642488   0   0     0    56  1189   437   4   1  95
 0  0  0      0  458400 241784 642488   0   0     0    28  1148   424   1   8  91
 0  0  0      0  459936 241784 642556   0   0     0   124  1229   520   7   1  92
 0  0  0      0  459936 241792 642616   0   0    64    60  1185   444   1   1  98
 0  0  0      0  458272 241792 642616   0   0     4     0  1237   501  11   3  86
 0  0  0      0  458272 241792 642616   0   0     0     0  1173   414   3   0  97
 1  0  0      0  460128 241792 642616   0   0     0     0  1162   390   1   7  92
 0  0  0      0  459936 241800 642608   0   0     0   272  1266   646  11   4  86
 0  0  0      0  459936 241808 642600   0   0     0    64  1194   497   1   1  98
 0  0  0      0  459936 241808 642600   0   0     0     0  1169   449   3   1  96
 0  0  0      0  459936 241808 642668   0   0    32     0  1134   374   1   1  99
 1  0  0      0  457952 241816 642660   0   0    20     0  1099   363   4   7  89
 0  0  0      0  457968 241820 642656   0   0     0    88  1118   371   1   2  97
 0  0  0      0  457968 241824 642720   0   0    16   144  1186   444   1   1  99
 0  0  0      0  457840 241824 642720   0   0     0    48  1148   385   0   0  99
 0  0  0      0  457648 241840 642704   0   0    16   116  1163   452   1   1  99
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 1  0  0      0  455464 241840 642704   0   0    24     0  1122   369   4   3  94
 0  0  0      0  457640 241848 642696   0   0     0   136  1198   454   5   8  87
 2  0  0      0  457576 241848 642696   0   0     0    64  1218   482   5   1  94
 0  0  0      0  457576 241848 642696   0   0     0   128  1138   397   1   0  99
 0  0  0      0  457576 241848 642696   0   0     0     0  1132   413   1   0  99
 0  0  0      0  457576 241848 642696   0   0     0    92  1198   511   2   0  98
 1  0  0      0  457256 241852 642760   0   0    12   116  1177   539   5   8  88
 1  0  0      0  457768 241856 642756   0   0     0    64  1224   483   2   1  96
 0  0  0      0  457656 241856 642756   0   0     4    64  1162   456   2   2  96
 0  0  0      0  457656 241856 642756   0   0     0     0  1174   420   3   0  97
 0  0  0      0  457656 241856 642756   0   0     0   120  1358   588   4   1  96
 0  0  0      0  457592 241860 642752   0   0     0   200  1179   437   8   8  85
 0  0  0      0  457592 241860 642752   0   0     0    64  1184   474   1   0  98
 0  0  0      0  457656 241860 642752   0   0     0   212  1176   438   5   1  94
 1  0  0      0  457656 241868 642812   0   0    12     0  1123   391   4   1  96
 0  0  0      0  457464 241868 642812   0   0     0   128  1108   387   2   1  98
 1  0  0      0  453048 241872 642808   0   0     0    60  1160   537  40  10  50
 0  0  0      0  450664 241872 642808   0   0     0    60  1395   681  15   1  84
 0  0  0      0  450664 241872 642808   0   0     4     0  1414   616   6   1  93
 0  0  0      0  450600 241872 642808   0   0     0     0  1159   390   6   0  93
 4  0  0      0  449792 241876 642804   0   0     4    88  1151   405   2   2  96
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 1  0  0      0  437864 241952 651908   0   0  8776   132  1385   754  44  21  35
 1  0  0      0  433376 241956 651904   0   0     4   964  1184   448  48   4  48
 1  0  0      0  428320 241960 651900   0   0     8     8  1092   358  47   6  46
 1  0  0      0  422432 241960 651900   0   0    12     0  1112   345  49   5  46
 1  0  0      0  415264 241960 651900   0   0     0   268  1090   363  48   9  44
 1  0  0      0  405536 241960 651900   0   0     8   128  1160   423  51  18  31
19  0 16      0  687024 241972 651956   0   0     0   568  1341   301  18  65  18
 2  0  0      0  709040 242668 654048   0   0  1952     0  1283  1829  66  34   0
 3  0  0      0  716656 242812 653292   0   0   132  1608  1136  4260  86  14   0
 2  0  0      0  714544 242864 653308   0   0    16     0  1059  5388  88  12   0
 2  0  0      0  708736 242884 657164   0   0  3256    16  1346  3042  80  20   0
 2  0  0      0  709056 243112 660812   0   0  4516   428  1766  2697  67  12  20
 2  0  0      0  702080 243860 661016   0   0   744     0  1220   597  84  16   0
 2  0  0      0  694272 244836 660992   0   0   976  1700  1310   852  86  13   1
 3  0  0      0  705176 245260 660976   0   0   424   144  1179   677  18  79   3
 3  0  0      0  699992 245436 660936   0   0   176     0  1119   347  11  89   0
 0  1  1      0  693464 245944 660972   0   0   480  1314  1180   615  12  65  23
 5  0  0      0  630272 246636 661028   0   0   688   648  1302   744  57  36   7
 0  1  0      0  569640 246884 661120   0   0   352     0  1654  1114  56  42   2
 0  1  0      0  555880 248504 661472   0   0  1880     0  1781  1418  28  16  56
 0  1  0      0  547304 250128 661548   0   0  1624     0  1543  1208  11  22  67
  procs                        memory    swap          io      system         cpu
 r  b  w   swpd    free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  1  0      0  545016 250924 661568   0   0   788  1004  1301   921   3   4  93
 1  1  0      0  541368 251856 661588   0   0   928  1780  1511  1095  25   7  67
SOFTDOG: Initiating system reboot.

Last output from top before it froze:

04:02:21 up 2:11, 2 users, load average: 1.13, 0.36, 0.17
271 processes: 270 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 30.0% user 30.3% system 0.2% nice 14.1% iowait 24.3% idle
CPU1 states: 16.3% user 27.2% system 1.1% nice 49.1% iowait 7.0% idle
Mem: 2073128k av, 1523840k used, 549288k free, 0k shrd, 250020k buff
860220k active, 513392k inactive
Swap: 8384880k av, 0k used, 8384880k free 661520k cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
16 0 114M 106M 11320 S 59.8 5.2 0:58 1 httpd
23536 root 34 19 1664 720 1368 D N 6.0 0.0 0:00 1 updatedb
1994 root 17 0 2416 1396 1936 R 2.7 0.0 3:44 0 top
23699 paradona 23 0 4660 3428 2280 S 2.7 0.1 0:00 0 topic.cgi
23683 root 15 0 30328 28M 1712 S 2.1 1.4 0:00 0 rotatelogspsoft
1853 root 16 0 1420 472 1356 S 0.9 0.0 1:02 0 vmstat
1552 root 16 0 373M 61M 44620 S 0.3 3.0 0:18 0 caspeng
1615 root 16 0 2068 1064 1864 S 0.3 0.0 0:20 1 soagent
23640 root 25 0 3660 752 3388 S 0.3 0.0 0:00 0 rotatelogs
23656 root 25 0 3660 752 3388 S 0.3 0.0 0:00 1 rotatelogs
23684 root 15 0 2696 1416 1712 S 0.3 0.0 0:00 1 rotatelogspsoft
23685 root 15 0 2584 1300 1712 S 0.3 0.0 0:00 1 rotatelogspsoft
939 root 16 0 6132 1824 5788 S 0.1 0.0 0:01 1 sshd
23622 root 25 0 3656 748 3388 S 0.1 0.0 0:00 1 rotatelogs
23623 root 25 0 3656 748 3388 S 0.1 0.0 0:00 0 rotatelogs
23624 root 25 0 3656 748 3388 S 0.1 0.0 0:00 1 rotatelogs
23625 root 25 0 3656 748 3388 S 0.1 0.0 0:00 0 rotatelogs
23626 root 25 0 3656 748 3388 S 0.1 0.0 0:00 1 rotatelogs
23627 root 25 0 3656 748 3388 S 0.1 0.0 0:00 0 rotatelogs
23628 root 25 0 3656 748 3388 S 0.1 0.0 0:00 1 rotatelogs
23629 root 25 0 3656 748 3388 S 0.1 0.0 0:00 0 rotatelogs
23630 root 25 0 3660 752 3388 S 0.1 0.0 0:00 1 rotatelogs
23631 root 25 0 3660 752 3388 S 0.1 0.0 0:00 0 rotatelogs
23632 root 25 0 3660 752 3388 S 0.1 0.0 0:00 1 rotatelogs
23633 root 25 0 3660 752 3388 S 0.1 0.0 0:00 0 rotatelogs
23634 root 25 0 3660 748 3388 S 0.1 0.0 0:00 1 rotatelogs
23635 root 25 0 3660 748 3388 S 0.1 0.0 0:00 0 rotatelogs
23636 root 25 0 3660 748 3388 S 0.1 0.0 0:00 0 rotatelogs
23637 root 25 0 3660 748 3388 S 0.1 0.0 0:00 1 rotatelogs
23638 root 25 0 3660 752 3388 S 0.1 0.0 0:00 0 rotatelogs
23639 root 25 0 3660 752 3388 S 0.1 0.0 0:00 1 rotatelogs
23641 root 25 0 3660 752 3388 S 0.1 0.0 0:00 1 rotatelogs
23642 root 25 0 3660 752 3388 S 0.1 0.0 0:00 0 rotatelogs
23643 root 25 0 3660 752 3388 S 0.1 0.0 0:00 1 rotatelogs

If there is anything I can do to debug the reason for these deadlocks,
please let me know.

Regards,
Anders K. Pedersen

> -----Original Message-----
> From: Anders K. Pedersen
> Sent: Thursday, March 11, 2004 1:46 AM
> To: 'Jan Kara'
> Cc: Andrew Morton; [email protected]
> Subject: RE: 2.6.3 userspace freeze

> > -----Original Message-----
> > From: Jan Kara [mailto:[email protected]]
> > Sent: Wednesday, March 10, 2004 4:44 PM
> > To: Anders K. Pedersen
> > Cc: Andrew Morton; [email protected]
> > Subject: Re: 2.6.3 userspace freeze
>
> > > I will try this tonight; just to make sure I understand you correctly:
> > > You just want me to turn off quotas on all file systems (currently they
> > > are in use on one of them), and it is not necessary to recompile the
> > > kernel without quota support?
> > Yes, just turning quotas off with quotaoff(8) should be enough to rule
> > out possible deadlocks caused by quotas.
>
> I just tried booting the 2.6.3 kernel without any quotas
> enabled in fstab, and it failed like previously after only 17 minutes.


Attachments:
2.6.4-dmesg (9.29 kB)
2.6.4-config (21.13 kB)

2004-03-12 10:09:19

by Con Kolivas

Subject: Re: 2.6.3 userspace freeze

On Fri, 12 Mar 2004 07:47 pm, Anders K. Pedersen wrote:
> Tonight I reproduced this issue with Linux 2.6.4 final - I've attached
> the dmesg and .config for this kernel.

> 23536 root 34 19 1664 720 1368 D N 6.0 0.0 0:00 1 updatedb

Each log you've shown so far shows you getting updatedb stuck in D which
appears to be the common link. It could be your updatedb is busy scanning
directories it probably shouldn't.
Check your updatedb.conf (usually in /etc) and see that you have at least
these entries in PRUNEFS

PRUNEFS="nfs,smbfs,ncpfs,proc,devpts,supermount,vfat,iso9660,udf,usbdevfs,devfs,usbfs,sysfs"

Con
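
For anyone wanting to check for processes stuck in D (uninterruptible sleep) without relying on a top snapshot, the sketch below reads the state field from /proc/<pid>/stat. It is only an illustration and was not posted in this thread: the function names parse_stat_line and list_dstate are made up here, and it assumes a Linux /proc mounted in the usual place.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the first '(' and the last ')'.
    left, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left.strip()), comm, tail.split()[0]


def list_dstate(proc_root="/proc"):
    """Return a sorted list of (pid, command) for processes in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in list_dstate():
        print(pid, comm)
```

Run in a loop next to vmstat on the serial console, this would show whether anything other than updatedb is sitting in D state shortly before a freeze.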

2004-03-12 10:46:30

by Olaf Dietsche

Subject: Re: 2.6.3 userspace freeze

"Anders K. Pedersen" <[email protected]> writes:

> Last output from top before it froze:
>
[...]
> 23640 root 25 0 3660 752 3388 S 0.3 0.0 0:00 0 rotatelogs
> 23656 root 25 0 3660 752 3388 S 0.3 0.0 0:00 1 rotatelogs
> 23684 root 15 0 2696 1416 1712 S 0.3 0.0 0:00 1 rotatelogspsoft
> 23685 root 15 0 2584 1300 1712 S 0.3 0.0 0:00 1 rotatelogspsoft
> 939 root 16 0 6132 1824 5788 S 0.1 0.0 0:01 1 sshd
> 23622 root 25 0 3656 748 3388 S 0.1 0.0 0:00 1 rotatelogs
[many rotatelogs]
> 23643 root 25 0 3660 752 3388 S 0.1 0.0 0:00 1 rotatelogs
>
> If there is anything I can do to debug the reason for these deadlocks,
> please let me know.

There are always many rotatelogs started. Maybe that's a hint for
further investigation.

Regards, Olaf.

2004-03-12 10:52:24

by Anders K. Pedersen

Subject: RE: 2.6.3 userspace freeze

Hello,

> -----Original Message-----
> From: Con Kolivas [mailto:[email protected]]
> Sent: Friday, March 12, 2004 11:09 AM
> To: Anders K. Pedersen
> Cc: Jan Kara; Andrew Morton; [email protected]
> Subject: Re: 2.6.3 userspace freeze
>
>
> On Fri, 12 Mar 2004 07:47 pm, Anders K. Pedersen wrote:
> > Tonight I reproduced this issue with Linux 2.6.4 final - I've attached
> > the dmesg and .config for this kernel.
>
> > 23536 root 34 19 1664 720 1368 D N 6.0 0.0 0:00 1 updatedb
>
> Each log you've shown so far shows you getting updatedb stuck in D which
> appears to be the common link. It could be your updatedb is busy scanning
> directories it probably shouldn't.
> Check your updatedb.conf (usually in /etc) and see that you have at least
> these entries in PRUNEFS
>
> PRUNEFS="nfs,smbfs,ncpfs,proc,devpts,supermount,vfat,iso9660,udf,usbdevfs,devfs,usbfs,sysfs"

Thank you for the suggestion - I had the following in my updatedb.conf:

PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660"

so I have just added:

PRUNEFS="$PRUNEFS ncpfs supermount vfat udf usbdevfs devfs usbfs sysfs"

Of these, the only active one is sysfs. I will give it another try with
these settings tonight. However, while you're correct that top shows
updatedb in D state in the latest test (in the mail dated 2004-mar-12
09:47 - crashed at 04:02:21) as well as the first one I submitted (mail
dated 2004-mar-10 10:13 - crashed at 04:02:33), the one in between (mail
dated 2004-mar-11 01:46) doesn't, as it crashed at 01:26:32, and updatedb
doesn't start until 04:02 as part of cron.daily. Also, when I originally
upgraded this server to 2.6.3 it rebooted four or five times during the
night before we downgraded it to 2.4 again, and updatedb couldn't have
been the cause of more than one of these. One thing I will try, though,
is to run updatedb manually with and without the additions above to see
if it triggers the deadlock immediately.

Regards,
Anders K. Pedersen
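
The comparison between the two PRUNEFS lists quoted above can be done mechanically. The sketch below is an illustration only (the names RECOMMENDED, parse_prunefs and missing_types are invented for it). It collects the filesystem types from PRUNEFS= lines, accepts both the comma-separated and the space-separated form that appear in this thread, and treats a literal $PRUNEFS token as "keep what was set before". Whether a given updatedb accepts commas or spaces is not something this sketch checks.

```python
import re

# The list suggested earlier in the thread, kept in the order it was given.
RECOMMENDED = ("nfs,smbfs,ncpfs,proc,devpts,supermount,vfat,iso9660,"
               "udf,usbdevfs,devfs,usbfs,sysfs").split(",")


def parse_prunefs(conf_text):
    """Return the set of filesystem types named by PRUNEFS= lines."""
    types = set()
    for line in conf_text.splitlines():
        # Commented-out lines start with '#' and therefore do not match.
        m = re.match(r'\s*PRUNEFS="?([^"#]*)"?', line)
        if not m:
            continue
        tokens = [t for t in re.split(r"[,\s]+", m.group(1)) if t]
        if "$PRUNEFS" not in tokens:
            types = set()  # a plain assignment replaces earlier values
        types.update(t for t in tokens if t != "$PRUNEFS")
    return types


def missing_types(conf_text, wanted=RECOMMENDED):
    """List the wanted types that the configuration does not prune."""
    present = parse_prunefs(conf_text)
    return [t for t in wanted if t not in present]


if __name__ == "__main__":
    original = 'PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660"\n'
    print(missing_types(original))
```

Fed the original one-line setting, missing_types reports the same eight types that were appended in the second PRUNEFS line; with both lines present it reports none.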

2004-03-12 17:26:44

by Anders K. Pedersen

Subject: Re: 2.6.3 userspace freeze

Hello,

On Fri, 2004-03-12 at 11:46, Olaf Dietsche wrote:
> There are always many rotatelogs started. Maybe that's a hint for
> further investigation.

The rotatelogs processes are used to write log data from Apache (by use
of CustomLog/ErrorLog directives) to rotating files, so this is quite
normal. I just made the following pstree, which is typical for this
server:

init-+-agent.be---agent.be
     |-agetty
     |-atd
     |-bdflush
     |-caspd---caspd---caspd---caspeng---caspeng---22*[caspeng]
     |-crond
     |-httpd-+-233*[httpd]
     |       |-120*[rotatelogs]
     |       `-3*[rotatelogspsoft]
     |-keventd
     |-khubd
     |-4*[kjournald]
     |-klogd
     |-ksoftirqd_CPU0
     |-ksoftirqd_CPU1
     |-kswapd
     |-kupdated
     |-logger
     |-master-+-2*[cleanup]
     |        |-pickup
     |        |-qmgr
     |        |-4*[smtp]
     |        `-trivial-rewrite
     |-7*[mingetty]
     |-named
     |-ntpd
     |-proftpd---16*[proftpd]
     |-scsi_eh_0
     |-soagent
     |-sshd-+-sshd---script-runner.p
     |      `-sshd---bash---pstree
     |-syslogd
     |-ulogd
     `-watchdogd

--
Med venlig hilsen - Best regards

Anders K. Pedersen
Network Engineer
------------------------------------------------
Cohaesio A/S - Maglebjergvej 5D - DK-2800 Lyngby
Phone: +45 45 880 888 - Fax: +45 45 880 777
Mail: [email protected] - http://www.cohaesio.com
------------------------------------------------

2004-03-18 15:50:30

by Anders K. Pedersen

Subject: RE: 2.6.3 userspace freeze

Hello,

> > -----Original Message-----
> > From: Con Kolivas [mailto:[email protected]]
> > Sent: Friday, March 12, 2004 11:09 AM
> > To: Anders K. Pedersen
> > Cc: Jan Kara; Andrew Morton; [email protected]
> > Subject: Re: 2.6.3 userspace freeze

> > Each log you've shown so far shows you getting updatedb stuck in D which
> > appears to be the common link. It could be your updatedb is busy scanning
> > directories it probably shouldn't.
> > Check your updatedb.conf (usually in /etc) and see that you have at least
> > these entries in PRUNEFS
> >
> > PRUNEFS="nfs,smbfs,ncpfs,proc,devpts,supermount,vfat,iso9660,udf,usbdevfs,devfs,usbfs,sysfs"
>
> Thank you for the suggestion - I had the following in my updatedb.conf:
>
> PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660"
>
> so I have just added:
>
> PRUNEFS="$PRUNEFS ncpfs supermount vfat udf usbdevfs devfs usbfs sysfs"
>
> Of these, the only active one is sysfs. I will give it another try with
> these settings tonight. However, while you're correct that top shows
> updatedb in D state in the latest test (in the mail dated 2004-mar-12
> 09:47 - crashed at 04:02:21) as well as the first one I submitted (mail
> dated 2004-mar-10 10:13 - crashed at 04:02:33), the one in between (mail
> dated 2004-mar-11 01:46) doesn't, as it crashed at 01:26:32, and updatedb
> doesn't start until 04:02 as part of cron.daily. Also, when I originally
> upgraded this server to 2.6.3 it rebooted four or five times during the
> night before we downgraded it to 2.4 again, and updatedb couldn't have
> been the cause of more than one of these. One thing I will try, though,
> is to run updatedb manually with and without the additions above to see
> if it triggers the deadlock immediately.

I've now tried running updatedb manually a few times under 2.6.4, both
with and without the PRUNEFS addition above. All of these runs completed
without any problems.

Also, I left the server running 2.6.4 for a night, during which it rebooted
five times at varying intervals (ranging from ~20 minutes to ~3 hours),
before I switched it back to 2.4. In other words, it doesn't seem to be
any of the cron.daily jobs (i.e. updatedb) that specifically triggers it.

Any other suggestions would be most welcome.

Regards,
Anders K. Pedersen