Have a webserver running Zope (specifically the ZEO db) which dies every
few days with no messages in syslog. Locks up so tight a powercycle is
required to recover. System has 1gb RAM, 2xSMP, kernel configured with
4gb highmem.
Since the kernel doesn't provide any info in syslog when it dies, I just
ran a vmstat 30 to a file and waited for the next untimely demise.
Here's what happened when it died last time. Note the sudden surge in
disk activity (bi)
 procs             memory             swap      io       system        cpu
 r  b  w   swpd   free  buff   cache  si  so   bi   bo    in    cs  us  sy  id
 0  1  0  33312  10356  5252  696940   0   0   20    1  1649  1603   6   2  92
 0  1  0  33312  10352  5236  696564   0   0   26    3  1593  1548   3   2  94
 0  1  0  33312  10276  5236  696600   0   0    9    1  1639  1596   5   2  92
 0  2  0  33312  16932  5236  694304   0   0   11    3  1702  1709   7   4  89
 0  1  0  33312  12644  5236  698784   0   0   10    2  1560  1513   3   2  95
 0  1  0  33312  10456  5236  700600   0   0   14    3  1487  1443   6   2  93
 0  1  0  33504  10452  5236  700652   0   1    3    1  1806  1785   5   2  92
 0  1  0  33504  10468  5236  699992   0   0    5    3  2118  2116   6   3  91
 0  1  0  33692  10484  5232  699312   0   5    1    8  3146  3215   7   5  88
 1  0  0  33692  10544  5232  698832   0   1    0    3  3377  3457  10   5  85
 1  0  0  33692  10468  5232  697804   0   3    2    3  3636  3721   8   5  87
 1  0  0  33692  10420  5232  697876   0   2    2    4  1662  1609  35   3  63
 1  0  0  33692   9540  5232  698940   0   0    7    1   752   624  46   2  52
 1  0  0  33692   9592  5232  698900   0   1    6    4   397   372  50   1  49
 1  0  0  33692   9504  5232  698980   0   0    2    1   136   284  49   1  49
 1  0  0  33692   9492  5292  698992   0   3  741    4   215   467  50   1  49
 1  0  0  34000  12624  5296  695936   0   6  547   10   236   408  49   1  49
 1  0  0  34000  21912  5300  678984   1   0  499    6  1992  2112  55   7  38
 1  0  0  34000   9976  5300  693104   0   0  517   13   320   413  49   1  49
 1  0  0  34000  11916  5300  691128   1   0  561    3   289   413  53   1  46
 1  0  0  34000  10172  5296  692100   0   0  497    5   288   374  49   1  49
 1  0  0  34000  22012  5296  680216   0   0  556    1   309   421  50   1  49
 1  0  0  34000   9544  5296  692804   0   0  584    3   306   433  50   1  49
 1  0  0  34000  10816  5296  696748   0   0  469    1   414   522  51   3  46
<death>
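For anyone who wants to find the point where bi takes off without eyeballing the log, here is a rough Python sketch. The function names, the 5-sample window and the 10x threshold are all made up for illustration and are not from any existing tool. It pools the numbers and regroups them 16 at a time, so output that a mailer wrapped onto two lines per sample still parses.

```python
from typing import Dict, List, Optional

# Column order of the vmstat output shown in this thread.
COLUMNS = ["r", "b", "w", "swpd", "free", "buff", "cache",
           "si", "so", "bi", "bo", "in", "cs", "us", "sy", "id"]


def parse_vmstat(text: str) -> List[Dict[str, int]]:
    """Turn saved vmstat output into one dict per sample.

    Header lines and anything non-numeric are skipped. Numbers are pooled
    and regrouped 16 at a time, so wrapped samples are handled too.
    """
    numbers: List[int] = []
    for line in text.splitlines():
        fields = line.split()
        if fields and all(f.isdigit() for f in fields):
            numbers.extend(int(f) for f in fields)
    usable = len(numbers) - len(numbers) % len(COLUMNS)
    return [dict(zip(COLUMNS, numbers[i:i + len(COLUMNS)]))
            for i in range(0, usable, len(COLUMNS))]


def first_surge(samples: List[Dict[str, int]], column: str = "bi",
                window: int = 5, factor: float = 10.0) -> Optional[int]:
    """Index of the first sample exceeding factor x the mean of the
    previous `window` samples, or None if there is no such sample.

    The window and factor are arbitrary illustrative defaults.
    """
    for i in range(window, len(samples)):
        recent = [s[column] for s in samples[i - window:i]]
        baseline = max(sum(recent) / window, 1.0)  # avoid a zero baseline
        if samples[i][column] > factor * baseline:
            return i
    return None
```

Feed it the contents of the vmstat log file and multiply the returned index by the sampling interval (30 seconds here) to get the offset into the run.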
I'd be more than willing to collect any other data required here, just
let me know what would be of assistance. Note though that I only have
remote access to this box, so getting magic sysrq info could be
difficult/impossible (tho I do have console access if that helps).
Thanks,
Phil Oester
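Since the box locks up hard, whatever is still sitting in a write buffer at that moment is lost. A minimal sketch of a snapshot logger that flushes and fsyncs after every sample follows. The function name and the log header format are invented for illustration, and an fsync is no guarantee if the disk path itself is what hangs.

```python
import os
import time


def append_snapshot(log_path: str, source_path: str = "/proc/meminfo") -> None:
    """Append a timestamped copy of source_path to log_path and force it to disk."""
    with open(source_path) as src:
        body = src.read()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(f"=== {stamp} {source_path}\n{body}\n")
        log.flush()             # push Python's buffer to the kernel
        os.fsync(log.fileno())  # ask the kernel to write it out now
```

Calling it from a loop with time.sleep(30) between calls would match the vmstat interval used above.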
On Thu, Dec 27, 2001 at 11:06:50PM -0800, Phil Oester wrote:
> Have a webserver running Zope (specifically the ZEO db) which dies every
> few days with no messages in syslog. Locks up so tight a powercycle is
> required to recover. System has 1gb RAM, 2xSMP, kernel configured with
> 4gb highmem.
Do you have RAID1 on the disks?
Apparently the "noapic" option helps, i.e. breaking the SYMMETRIC part of SMP.
You may also try "nmi_watchdog=1", if you have a serial console attached
to the box for kernel message logging (and commands).
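After rebooting with those options it is worth confirming they actually reached the kernel. A small sketch, assuming the running kernel's command line is read from /proc/cmdline; the helper name is hypothetical:

```python
from typing import Iterable, List


def missing_boot_options(cmdline: str,
                         wanted: Iterable[str] = ("noapic", "nmi_watchdog=1")) -> List[str]:
    """Return the wanted options that do not appear on the kernel command line."""
    present = set(cmdline.split())
    return [opt for opt in wanted if opt not in present]
```

Typical use would be missing_boot_options(open("/proc/cmdline").read()); an empty list means both options are in effect.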
> Since the kernel doesn't provide any info in syslog when it dies, I just
> ran a vmstat 30 to a file and waited for the next untimely demise.
> Here's what happened when it died last time. Note the sudden surge in
> disk activity (bi)
Yes, looks familiar. My hangups have been during high disc activity too.
My box is located in a place to which I have difficult access, i.e.
I can't use it to collect the debug data, or do magic (press reset)
to recover.
> I'd be more than willing to collect any other data required here, just
> let me know what would be of assistance. Note though that I only have
> remote access to this box, so getting magic sysrq info could be
> difficult/impossible (tho I do have console access if that helps).
>
> Thanks,
>
> Phil Oester
/Matti Aarnio
No RAID1 on disks.
Here's /proc/meminfo within 1 minute of the box dying last night:
total: used: free: shared: buffers: cached:
Mem: 1054371840 1044684800 9687040 0 7802880 834752512
Swap: 535797760 7626752 528171008
MemTotal: 1029660 kB
MemFree: 9460 kB
MemShared: 0 kB
Buffers: 7620 kB
Cached: 811872 kB
SwapCached: 3316 kB
Active: 231880 kB
Inactive: 747344 kB
HighTotal: 131072 kB
HighFree: 1028 kB <--------- See comment below
LowTotal: 898588 kB
LowFree: 8432 kB
SwapTotal: 523240 kB
SwapFree: 515792 kB
The HighFree value was at 2044 for the prior hour. It went to 1028
within 1 minute of the box freezing. Out of HighMem???
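To put the low and high zones side by side, here is a rough Python sketch that parses the "kB" lines of /proc/meminfo as posted above. The function names are invented for illustration, and it assumes the HighTotal/HighFree/LowTotal/LowFree fields are present, as they are on this highmem kernel. It only does arithmetic and does not diagnose anything.

```python
import re
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Collect the 'Name: <n> kB' lines of /proc/meminfo as integers in kB."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s+(\d+)\s+kB\b", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def zone_report(fields: Dict[str, int]) -> Dict[str, float]:
    """Percent free in the low and high zones, plus a consistency check."""
    return {
        "low_free_pct": 100.0 * fields["LowFree"] / fields["LowTotal"],
        "high_free_pct": 100.0 * fields["HighFree"] / fields["HighTotal"],
        # 0 when LowTotal + HighTotal accounts for all of MemTotal.
        "unaccounted_kb": fields["MemTotal"] - fields["LowTotal"] - fields["HighTotal"],
    }
```

With the figures above, LowTotal + HighTotal adds up exactly to MemTotal (898588 + 131072 = 1029660 kB), and both zones are under 1% free: roughly 0.94% low and 0.78% high.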
Here's vmstat within 30 seconds of freezing:
 procs             memory             swap      io       system        cpu
 r  b  w   swpd   free  buff   cache  si  so   bi   bo    in    cs  us  sy  id
 1  0  0   7448   9536  7616  812092   0   0  932    2   235   577  50   1  49
Seems VM related.
-Phil
-----Original Message-----
From: Matti Aarnio [mailto:[email protected]]
Sent: Friday, December 28, 2001 2:02 AM
To: Phil Oester
Cc: [email protected]
Subject: Re: 2.4.17 still croaks under heavy load
On Thu, Dec 27, 2001 at 11:06:50PM -0800, Phil Oester wrote:
> Have a webserver running Zope (specifically the ZEO db) which dies every
> few days with no messages in syslog. Locks up so tight a powercycle is
> required to recover. System has 1gb RAM, 2xSMP, kernel configured with
> 4gb highmem.
Do you have RAID1 on the disks?
Apparently the "noapic" option helps, i.e. breaking the SYMMETRIC part of SMP.
You may also try "nmi_watchdog=1", if you have a serial console attached
to the box for kernel message logging (and commands).
> Since the kernel doesn't provide any info in syslog when it dies, I just
> ran a vmstat 30 to a file and waited for the next untimely demise.
> Here's what happened when it died last time. Note the sudden surge in
> disk activity (bi)
Yes, looks familiar. My hangups have been during high disc activity too.
My box is located in a place to which I have difficult access, i.e.
I can't use it to collect the debug data, or do magic (press reset)
to recover.
> I'd be more than willing to collect any other data required here, just
> let me know what would be of assistance. Note though that I only have
> remote access to this box, so getting magic sysrq info could be
> difficult/impossible (tho I do have console access if that helps).
>
> Thanks,
>
> Phil Oester
/Matti Aarnio
Phil Oester wrote:
>
> No RAID1 on disks.
>
> Here's /proc/meminfo within 1 minute of the box dying last night:
>
> total: used: free: shared: buffers: cached:
> Mem: 1054371840 1044684800 9687040 0 7802880 834752512
> Swap: 535797760 7626752 528171008
> MemTotal: 1029660 kB
> MemFree: 9460 kB
> MemShared: 0 kB
> Buffers: 7620 kB
> Cached: 811872 kB
> SwapCached: 3316 kB
> Active: 231880 kB
> Inactive: 747344 kB
> HighTotal: 131072 kB
> HighFree: 1028 kB <--------- See comment below
> LowTotal: 898588 kB
> LowFree: 8432 kB
> SwapTotal: 523240 kB
> SwapFree: 515792 kB
>
> The HighFree value was at 2044 for the prior hour. It went to 1028
> within 1 minute of the box freezing. Out of HighMem???
>
> Here's vmstat within 30 seconds of freezing:
>
>  procs             memory             swap      io       system        cpu
>  r  b  w   swpd   free  buff   cache  si  so   bi   bo    in    cs  us  sy  id
>  1  0  0   7448   9536  7616  812092   0   0  932    2   235   577  50   1  49
>
> Seems VM related.
Hello Phil,
can you please try Andrea Arcangeli's 10_vm-21?
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17rc2aa2/10_vm-21
I think we need more "pressure" to get these "fixes" into 2.4.18...
Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
> ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17rc2aa2/10_vm-21
>
> I think we need more "pressure" to get these "fixes" into 2.4.18...
Andrea has been asked several times to feed the patches into the tree in
small manageable chunks, each explained. I don't think it's pressure to get
them in you have to worry about.
On Thu, Dec 27, 2001 at 11:06:50PM -0800, Phil Oester wrote:
> Have a webserver running Zope (specifically the ZEO db) which dies every
> few days with no messages in syslog. Locks up so tight a powercycle is
> required to recover. System has 1gb RAM, 2xSMP, kernel configured with
> 4gb highmem.
>
> Since the kernel doesn't provide any info in syslog when it dies, I just
> ran a vmstat 30 to a file and waited for the next untimely demise.
> Here's what happened when it died last time. Note the sudden surge in
> disk activity (bi)
I have been seeing the same kind of deaths on multiple very different SMP
boxes since the 2.2 days. They do not die in the "high load" case, but only
the high load boxes are unstable. I have one "testcase" where the box
crashes at least every 24 hours (mutella). Boxes I have seen this
happening on are:
Box1:
Dual Celeron 400
IDE Raid
SCSI System disk
1GB Ram (No Highmem) (Used to have 512M)
EEPro 100
Box2:
Dual PIII 1Ghz
Serverworks Board
1GB Ram (No Highmem)
ICP Vortex Raid
EEPro 100
I have 3 machines of the exact same type as the latter. All are
unstable and tend to crash, depending on the application, every 24 hours to
every 2-3 weeks.
No notice in the syslog, nothing on the serial console. They are
completely dead without any prior sign. I have tried to capture
information about processes, swap, memory etc. Within 1 minute
prior to the crash the boxes are basically idle.
Flo
--
Florian Lohoff [email protected] +49-5201-669912
Nine nineth on september the 9th Welcome to the new billenium