LinuxLists.cc - syslog full of kernel BUGS, frequent intermittent instability

2003-02-26 09:02:09

Subject: syslog full of kernel BUGS, frequent intermittent instability

Hi folks,

My recently installed Mandrake 9.0 has been unstable since day one. The syslog is full of kernel BUG lines (see below), the crashes are frequent, and I don't know how to reproduce them - recognize no pattern to them.

I have run memtest86 overnight (~13 hours) - it reported no errors. So would I be correct in assuming that my RAM can be ruled out? I have also passed both the "noapic" and "mem=nopentium" parameters to lilo, but that hasn't resulted in any noticeable improvement.

Also, I have a big fan blowing into the open box to cool everything, which fixed an overheating problem I had last summer. Also I recently cleaned out any dust inside the box, which wasn't much anyway.

The system is dual booting with win98 - and while win 98 does crash occasionally, it is much less frequent than what I'm getting with the mdk9 partition.

The hardware is as follows:

MSI Super7 mobo w/ Aladdin 5 chipset
AMD k6-2 350 MHz w/ 3D-now
128m pc-100 RAM (single stick)
/dev/hda = 13 gig Quantum Fireball
/dev/hdc = Ricoh 7060A CDRW
PSU: 235 watt (Kobian)
video: Voodoo Banshee (Creative) AGP/PCI plugged into AGP slot
sound: creative sb16 P&P (vibra) ISA card
USB device: Alcatel SpeedtouchUSB ADSL modem

note: I do not have a floppy drive installed

My BIOS is at the "failsafe" settings, with only one or two exceptions (to keep it from running like a 386). No overclocking or any other tweaking anywhere on my part. The BIOS has been flashed to the latest recommended version for my particular mobo revision.

This is a stock mandrake 9.0 installation.

$ cat /proc/version
Linux version 2.4.19-16mdk ([email protected]) (gcc version 3.2 (Mandrake Linux 9.0 3.2-1mdk)) #1 Fri Sep 20 18:15:05 CEST 2002

The crashes often result in a hard freeze (no reaction to ctl-alt-del or to ctl-alt-bckspc - keyboard LED doesn't respond to caps lock). I am in X11 almost exclusively - so I have not seen whether it will also crash at the console. Sometimes I can go days without a crash, other times it can crash several times in a few hours. Not all of the kernel BUG messages resulted in my system crashing, at times I notice the BUG line in the log but wouldn't have noticed anything unusual as a user in X.

I would appreciate any insight anyone can provide. Let me know if I can provide any more info.

I haven't subscribed to this ML because I don't want the volume in my mailbox - but will monitor this thread through google groups.

sample output of grep -i bug /var/log/syslog (on a bad day)

Jan 4 18:32:38 localhost kernel: kernel BUG at page_alloc.c:224!
Jan 4 18:32:38 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:39:21 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:39:50 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:39:50 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 18:41:48 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:41:48 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 18:42:48 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:42:53 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:42:53 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 18:43:35 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:43:35 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 18:43:43 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:43:43 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 18:43:55 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 18:43:55 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 18:44:42 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:00:14 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:00:14 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 19:00:32 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:00:32 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 19:25:32 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:25:33 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:25:33 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:25:33 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:25:33 localhost kernel: kernel BUG at mmap.c:1245!
Jan 4 19:25:38 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:25:38 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 4 19:36:53 localhost kernel: kernel BUG at page_alloc.c:224!
Jan 4 19:36:55 localhost kernel: kernel BUG at page_alloc.c:224!
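
A listing like the one above is easier to compare between days when it is tallied per source location. A minimal sketch, assuming the log is at /var/log/syslog as in the post; the function name count_bug_sites is made up for illustration and is not part of any package:

```shell
#!/bin/sh
# count_bug_sites: read syslog text on stdin and print how often each
# "kernel BUG at file:line!" location occurs, most frequent first.
# (Hypothetical helper name; uses only grep, sed, sort, uniq and awk.)
count_bug_sites() {
    grep 'kernel BUG at ' |
        sed 's/.*kernel BUG at \([^!]*\)!.*/\1/' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Example: count_bug_sites < /var/log/syslog
```

Each output line is a count followed by the file:line pair, so a location that dominates one bad day can be checked against the next.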

Syslog excerpt providing some context for a single crash;

Jan 13 12:01:01 localhost CROND[2334]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
Jan 13 13:01:00 localhost CROND[2399]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
Jan 13 13:21:20 localhost kernel: kernel BUG at page_alloc.c:224!
Jan 13 13:21:20 localhost kernel: invalid operand: 0000
Jan 13 13:21:20 localhost kernel: CPU: 0
Jan 13 13:21:20 localhost kernel: EIP: 0010:[rmqueue+631/672] Not tainted
Jan 13 13:21:20 localhost kernel: EIP: 0010:[<c0134307>] Not tainted
Jan 13 13:21:20 localhost kernel: EFLAGS: 00013202
Jan 13 13:21:20 localhost kernel: eax: 00000040 ebx: 00007000 ecx: 00001000 edx: 00002c2c
Jan 13 13:21:20 localhost kernel: esi: c100001c edi: c10797ac ebp: c025f5cc esp: c67e1e54
Jan 13 13:21:20 localhost kernel: ds: 0018 es: 0018 ss: 0018
Jan 13 13:21:20 localhost kernel: Process X (pid: 1942, stackpage=c67e1000)
Jan 13 13:21:20 localhost kernel: Stack: 00001000 c10797ac c772d3e0 00001c2c 00003292 00000000 c025f5cc c025f5cc
Jan 13 13:21:20 localhost kernel: c025f788 00000001 c67e1eb4 c0134564 c5725e2c c67e1ec8 00000000 00000018
Jan 13 13:21:20 localhost kdm[1939]: Server for display :0 terminated unexpectedly
Jan 13 13:21:20 localhost kernel: c025f5cc c025f784 00000000 000001d2 c10797d8 00104025 00000025 c35ba4e0
Jan 13 13:21:21 localhost kernel: Call Trace: [__alloc_pages+116/608] [do_anonymous_page+98/272] [handle_mm_fault+87/192] [do_page_fault+529/1397] [ppp_synctty:__insmod_ppp_synctty_O/lib/modules/2.4.19-16mdk/kernel/driv+-493048/96]
Jan 13 13:21:21 localhost kernel: Call Trace: [<c0134564>] [<c012a122>] [<c012a3d7>] [<c0118551>] [<c8862a08>]
Jan 13 13:21:21 localhost kernel: [ppp_synctty:__insmod_ppp_synctty_O/lib/modules/2.4.19-16mdk/kernel/driv+-492738/96] [ppp_synctty:__insmod_ppp_synctty_O/lib/modules/2.4.19-16mdk/kernel/driv+-437849/96] [handle_IRQ_event+55/112] [do_IRQ+123/192] [do_IRQ+154/192] [do_page_fault+0/1397]
Jan 13 13:21:21 localhost kernel: [<c8862b3e>] [<c88701a7>] [<c010a247>] [<c010a3db>] [<c010a3fa>] [<c0118340>]
Jan 13 13:21:21 localhost kernel: [error_code+52/64]
Jan 13 13:21:21 localhost kernel: [<c01090f4>]
Jan 13 13:21:21 localhost kernel:
Jan 13 13:21:21 localhost kernel: Code: 0f 0b e0 00 80 5e 23 c0 8b 47 18 a9 80 00 00 00 74 08 0f 0b
Jan 13 13:21:21 localhost kernel: kernel BUG at page_alloc.c:97!
Jan 13 13:21:21 localhost kernel: invalid operand: 0000
Jan 13 13:21:21 localhost kernel: CPU: 0
Jan 13 13:21:21 localhost kernel: EIP: 0010:[__free_pages_ok+71/848] Not tainted
Jan 13 13:21:21 localhost kernel: EIP: 0010:[<c0133d87>] Not tainted
Jan 13 13:21:21 localhost kernel: EFLAGS: 00210286
Jan 13 13:21:21 localhost kernel: eax: ffffffff ebx: c10797ac ecx: c4ff5e48 edx: 0000462d
Jan 13 13:21:21 localhost kernel: esi: 00000000 edi: 00000000 ebp: c63bbf08 esp: c63bbed0
Jan 13 13:21:21 localhost kernel: ds: 0018 es: 0018 ss: 0018
Jan 13 13:21:21 localhost kernel: Process kpaint (pid: 2459, stackpage=c63bb000)
Jan 13 13:21:21 localhost kernel: Stack: c025f67c fffffffe 00007000 c100001c c10e98d8 c10e98ac c025f5cc c102c01c
Jan 13 13:21:21 localhost kernel: 00200213 ffffffff 00000419 02c2c025 00034000 c552eb04 c63bbf28 c012a92f
Jan 13 13:21:21 localhost kernel: c10797ac c10797ac 00000035 40400000 c78fc404 40400000 c63bbf60 c01291c2
Jan 13 13:21:21 localhost kernel: Call Trace: [zap_pte_range+271/308] [zap_page_range+130/256] [exit_mmap+180/304] [mmput+53/128] [do_exit+133/560]
Jan 13 13:21:21 localhost kernel: Call Trace: [<c012a92f>] [<c01291c2>] [<c012bcb4>] [<c011a385>] [<c011e915>]
Jan 13 13:21:21 localhost kernel: [sys_exit+17/32] [system_call+51/64]
Jan 13 13:21:21 localhost kernel: [<c011eaf1>] [<c0108fe3>]
Jan 13 13:21:21 localhost kernel:
Jan 13 13:21:21 localhost kernel: Code: 0f 0b 61 00 80 5e 23 c0 8b 15 30 86 2c c0 89 d8 29 d0 c1 f8
Jan 13 13:21:32 localhost devfsd[105]: error calling: "unlink" in "GLOBAL"
Jan 13 13:21:37 localhost last message repeated 13 times
Jan 13 13:23:04 localhost init: Switching to runlevel: 6
Jan 13 13:23:07 localhost lisa: Stopping lisa: succeeded
Jan 13 13:23:07 localhost dm: Stopping display manager:

2003-02-26 09:32:55

by John Bradford

Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

> My recently installed Mandrake 9.0 has been unstable since day one.
> The syslog is full of kernel BUG lines (see below), the crashes are
> frequent, and I don't know how to reproduce them - recognize no
> pattern to them.

[snip]

Since you have eliminated a lot of the hardware, I would check whether
the PSU is working correctly, if necessary by swapping in a spare one
for a day or two.

The easiest way to exercise the machine is probably to do kernel
compiles in a loop. Memtest will exercise the memory, but not
particularly exercise the CPU.

John.
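
A loop of the kind John describes could look roughly like this. The function name stress_loop, the pass count and the source directory in the usage comment are assumptions for illustration. The idea is to repeat an identical build and stop at the first failure, since a build that succeeds on one pass and fails on another hints at hardware rather than source code.

```shell
#!/bin/sh
# stress_loop N CMD...: run CMD up to N times, stopping at the first failure.
# Prints which pass failed and returns non-zero, or returns 0 if every
# pass succeeded. (Hypothetical helper, not part of any package.)
stress_loop() {
    passes=$1
    shift
    i=1
    while [ "$i" -le "$passes" ]; do
        if ! "$@"; then
            echo "pass $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $passes passes succeeded"
    return 0
}

# Example (assumed path; the tree must already be configured):
#   cd /usr/src/linux && stress_loop 20 sh -c 'make clean && make bzImage'
```

The result of such a build never needs to be installed; only the exit status of each pass matters here.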

2003-02-26 11:47:55

by Denis Vlasenko

Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On 26 February 2003 11:12, wyleus wrote:
> Hi folks,
>
> My recently installed Mandrake 9.0 has been unstable since day one.
> The syslog is full of kernel BUG lines (see below), the crashes are
> frequent, and I don't know how to reproduce them - recognize no
> pattern to them.
>
> I have run memtest86 overnight (~13 hours) - it reported no errors.
> So would I be correct in assuming that my RAM can be ruled out? I
> have also passed both the "noapic" and "mem=nopentium" parameters to
> lilo, but that hasn't resulted in any noticeable improvement.

cpuburn will help you rule out defective CPU theory.
Also you can start removing/swapping hardware parts.

> I would appreciate any insight anyone can provide. Let me know if I
> can provide any more info.

Test with some vanilla 2.4 kernels, not a distro one. If 2.4.20 crashes,
try some of the earlier kernels too. Compile them for 386 uniprocessor
with debugging and magic SysRq enabled. Provide your .config

Run your klogd with -x to make it stop decoding oopses.
Run oopses thru ksymoops and provide result.
Provide lsmod, lspci output, some of /proc/* files (interrupts etc)

> I haven't subscribed to this ML because I don't want the volume in my
> mailbox - but will monitor this thread through google groups.

People are going to CC you I believe, so don't worry.
--
vda
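
The information requested above can be gathered into one file for posting. A minimal sketch; the function name collect_diag and the particular choice of /proc files are assumptions, and a command that is not installed is noted rather than treated as an error:

```shell
#!/bin/sh
# collect_diag OUTFILE: write the output of lsmod and lspci plus a few
# /proc files into OUTFILE, each under a "== name ==" header. Missing
# commands or unreadable files are noted and skipped.
# (Hypothetical helper; the list of /proc files is an assumption.)
collect_diag() {
    out=$1
    : > "$out"
    for cmd in lsmod lspci; do
        echo "== $cmd ==" >> "$out"
        if command -v "$cmd" >/dev/null 2>&1; then
            "$cmd" >> "$out" 2>&1 || echo "($cmd failed)" >> "$out"
        else
            echo "($cmd not found)" >> "$out"
        fi
    done
    for f in /proc/version /proc/interrupts /proc/cpuinfo /proc/ioports; do
        echo "== $f ==" >> "$out"
        if [ -r "$f" ]; then
            cat "$f" >> "$out" 2>/dev/null || echo "($f could not be read)" >> "$out"
        else
            echo "($f not readable)" >> "$out"
        fi
    done
}

# Example: collect_diag /tmp/diag.txt
```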

2003-02-27 00:14:48

by wyleus

Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

Hi John - this reply is just to thank you for your help and, following
your advice, to CC lkml with the quoted text below in case it helps anyone
else in the future.

On Wed, 26 Feb 2003 22:24:40 +0000 (GMT), John Bradford <[email protected]>
wrote:

>> > Since you have eliminated a lot of the hardware, I would check whether
>> > the PSU is working correctly, if necessary by swapping in a spare one
>> > for a day or two.
>> >
>> > The easiest way to exercise the machine is probably to do kernel
>> > compiles in a loop. Memtest will exercise the memory, but not
>> > particularly exercise the CPU.
>>
>> Thanks for replying, John.
>
> No problem.
>
>> I'm a linux newbie and get heart attacks when someone suggests
>> compiling anything, *especially* messing around with the kernel.
>
> I didn't actually mean compile a new kernel to use, just that going through
> the compile stresses the CPU, cache, and RAM quite extensively.
>
> Programs to do burn-in testing tend to do things in a synthetic
> pattern, whereas a kernel compile stresses the machine with a
> real-life workload.
>
>> Thus far, I've only installed RPMs though the mandrake gui - though
>> I'm starting to feel like I could get brave and tackle my first
>> tarball configure/make/install soon.
>> Hope I'm not making you sick here... :-(
>
> No problems. Bear in mind, though, that LKML is primarily a kernel
> developers mailing list - in general nobody will mind these sorts of
> questions, and in fact they can be very useful in tracking down obscure
> bugs that only occur on certain hardware, but if other developers are
> busy, you might find your post is ignored. Don't take it personally,
> this list typically gets 200 posts a day, and in general, people skim
> through it for subjects that are most relevant to where their skills
> lie.
>
>> Hope you won't mind bearing with some newbie questions;
>>
>> For the PSU, I don't have a spare one available, nor any other
>> computer parts.
>
> OK, that's a pity, because swapping it out is the easiest way to
> eliminate that as a problem, but never mind.
>
>> I received another reply to my post from Denis Vlasenko who
>> suggested, among other things, trying a program called cpuburn. To
>> my delight, I found it was available in RPM format in mandrake
>> contribs. It's running right now as I write this, and has been
>> running for about 50 minutes so far - with no crashes yet. According
>> to the README included with the package, "If sub-spec, your system
>> may lock up after 2-10 minutes", but I'll keep running it some more
>> and try with different options for a more solid result.
>
> Hmmm, to be honest, like I mentioned above, some of these burn-in
> programs can be a bit synthetic, and not trigger something that a real
> workload will trigger within minutes.
>
>> The readme also mentions that it stresses more than just the cpu,
>> but also the mobo, cooling system, and PSU.
>
> Err, well all programs do that to a degree ;-).
>
>> Excerpt: "The goal has been to maximize heat production from the
>> CPU, putting stress on the CPU itself, cooling system, motherboard
>> (especially voltage regulators) and power supply (likely cause of
>> burnBX/MMX errors)."
>
> I would be skeptical of claims like that unless they reference a
> specific CPU - different chipsets handle things differently - what
> stresses a 386 might not particularly stress a 486. Here again, we're
> talking synthetic loads. Having said that, Denis Vlasenko is well
> respected on this mailing list, so it might be a particularly good
> utility for burn-in testing, I just don't have personal experience of
> it.
>
>> dumb newbie question #1: I'm replying to your email personally.
>> Since I'm not subscribed to lkml, would it be possible (or advisable)
>> to reply to your message but have it posted on lkml?
>
> The normal protocol is to always CC LKML in on your replies, unless
> you're going off topic.
>
>> Example; if I just sent a new message to lkml with the subject line
>> "Re: syslog full of kernel BUGS, frequent intermittent instability"
>> would that have the same effect as replying to your message from
>> lkml? (I'm using the sylpheed-claws mail client if that makes any
>> difference).
>
> No, most people use threaded mailreaders, and you don't want to break
> the thread. The subject line is not really important, but if you
> change it, most people put something like:
>
> Underrated PSUs (was: syslog full of kernel BUGS)
>
> and it's an unofficial standard to put a one word 'sub-subject' in
> square brackets, E.G. [PATCH], or [BUG].
>
> I use ELM, and just use reply, then manually add anybody and the list
> in to the CC list as necessary.
>
> Do not worry about CCing people in who you know are subscribed - if
> you want a developer's attention, CC them. Most of us delete list
> mail unread if we're too busy to read it.
>
>> dumb question #2: If I eventually have no choice but to try a
>> different kernel, would it be possible to install one as easily as
>> installing any old RPM - and if so would it be just as easy to
>> restore my system after "trying out" the new kernel. Wouldn't
>> changing the kernel break some of the specific settings mandrake set
>> for my system (e.g. my USB/DSL modem)? I don't want to mess up what
>> I already have - don't want to reinstall from scratch if I break
>> everything - and I don't feel comfortable enough to replace
>> something as critical to my system as the kernel with all of the
>> arcane options and nuances that I don't have a clue about.
>
> I am not the best person to ask distro-specific questions like this,
> as I don't follow any particular distribution, but just install
> everything from source, but with some distributions, it *is* very easy
> to mess up that particular distribution's own way of doing things, by
> trying to install something from source. In that respect, you would
> be best off asking another Mandrake user for guidance, rather than a
> general Linux person.
>
>> Sorry for the dumb questions, John,
>
> No problem.
>
>> it's frustrating to be a clueless newbie - seems like you have to
>> read hours/days/weeks of endless how-to's and man pages just to get
>> the simplest little things done. I'm slowly learning, but it's
>> verrrry slooow progress. :-(
>
> If you've got time, skim through this mailing list's archives - there
> is a lot of info buried amongst flamewars, arguments, and
> occasionally development work...
>
> John.

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

2003-02-27 13:13:38

by wyleus

Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

> cpuburn will help you rule out defective CPU theory.
> Also you can start removing/swapping hardware parts.
>
Thanks for taking the time to reply, Denis. I ran cpuburn as you suggested. Specifically, I ran the burnK6 binary for about an hour (from an xterm, and I also had other stuff like galeon running simultaneously) and didn't get any hiccups. Monitoring with top, I saw that I had 0% free cpu during the test.

However I also later ran the burnMMX binary, and that one would quit after running a few minutes without printing any error messages to the screen - nor did it leave any messages in /var/log/*. Dunno if that's normal? Shouldn't it keep running until I kill it?

Can I draw any conclusions thus far? Could I conclude now that the CPU and RAM are OK (because of the memtest86 and cpuburn tests)? Also, assuming that all of my hardware is OK (yes a big assumption), could the problems I have be due to a misconfiguration or otherwise software problem?

As for swapping parts, I don't have any extra computer parts. Given the nature of these kernel bug messages in the syslog, can I use that fact to logically limit the range of hardware that can cause these problems? I.E. can I say for example that malfunctioning PSU, RAM, CPU or Mobo can cause these symptoms, but that the ISA sound card, or the video card would not? Or do I have no choice but to suspect each and every piece of hardware in my box?

I would like to narrow things down if possible, but don't have the experience/knowledge to make these judgements myself. Which is why I am seeking advice, of course.

> Test with some vanilla 2.4 kernels, not a distro one. If 2.4.20 crashes,
> try some of the earlier kernels too. Compile them for 386 uniprocessor
> with debugging and magic SysRq enabled. Provide your .config
>
As I mentioned to John, I'm still a newbie and don't feel comfortable yet in my competency to replace the kernel. I have a lot of (newbie) time invested in this installation and don't want to break it by messing with stuff that's way beyond my understanding.

> Run your klogd with -x to make it stop decoding oopses.
> Run oopses thru ksymoops and provide result.
>
I'm not clear whether you intended these suggestions to apply after I've changed the kernel, or also to my current setup.

Assuming you mean applying it currently - grepping for klogd in /etc shows that mandrake uses an environment variable KLOGD_OPTIONS to run klogd. This was set to "-2". However the variable is set to this value in no less than 32 different files. 24 of them seem to be related to runlevels (4x6), and 8 unrelated. This means (if my logic is correct) that the same variable is set 9 times for a particular runlevel (rl-3 in my case). Which one should I change that won't get overridden? Or should I change them all, or use another approach?

(Thanks again for your help - wyleus signing off)

> Provide lsmod, lspci output, some of /proc/* files (interrupts etc)

Assuming more is better than less;

# lspci
00:00.0 Host bridge: Acer Laboratories Inc. [ALi] M1541 (rev 04)
00:01.0 PCI bridge: Acer Laboratories Inc. [ALi] M1541 PCI to AGP Controller (rev 04)
00:02.0 USB Controller: Acer Laboratories Inc. [ALi] USB 1.1 Controller (rev 03)
00:07.0 ISA bridge: Acer Laboratories Inc. [ALi] M1533 PCI to ISA Bridge [Aladdin IV] (rev c3)
00:0f.0 IDE interface: Acer Laboratories Inc. [ALi] M5229 IDE (rev c1)
01:00.0 VGA compatible controller: 3Dfx Interactive, Inc. Voodoo Banshee (rev 03)

# lsmod
Module Size Used by Not tainted
tdfx 31524 1
agpgart 31840 0 (autoclean) (unused)
sr_mod 15096 0 (autoclean) (unused)
parport_pc 21672 1 (autoclean)
lp 6720 0 (autoclean)
parport 23936 1 (autoclean) [parport_pc lp]
ipt_TOS 984 12 (autoclean)
ipt_LOG 3384 5 (autoclean)
ipt_REJECT 2744 4 (autoclean)
ipt_state 568 9 (autoclean)
iptable_mangle 2072 1 (autoclean)
ip_nat_irc 2384 0 (unused)
ip_nat_ftp 2992 0 (unused)
iptable_nat 15224 2 [ip_nat_irc ip_nat_ftp]
ip_conntrack_irc 3056 1
ip_conntrack_ftp 3952 1
ip_conntrack 18400 4 [ipt_state ip_nat_irc ip_nat_ftp iptable_nat ip_conntrack_irc ip_conntrack_ftp]
iptable_filter 1644 1 (autoclean)
ip_tables 11672 9 [ipt_TOS ipt_LOG ipt_REJECT ipt_state iptable_mangle iptable_nat iptable_filter]
nfsd 66576 0 (autoclean)
lockd 46480 0 (autoclean) [nfsd]
sunrpc 60188 0 (autoclean) [nfsd lockd]
ppp_synctty 5952 1 (autoclean)
ppp_generic 20064 3 (autoclean) [ppp_synctty]
slhc 5072 0 (autoclean) [ppp_generic]
n_hdlc 6368 1 (autoclean)
nls_iso8859-1 2844 1 (autoclean)
nls_cp850 3580 1 (autoclean)
vfat 9588 1 (autoclean)
fat 31864 0 (autoclean) [vfat]
supermount 14340 1 (autoclean)
ide-cd 28712 0
cdrom 26848 0 [sr_mod ide-cd]
ide-scsi 8212 0
scsi_mod 90372 2 [sr_mod ide-scsi]
sb 7668 0
sb_lib 34958 0 [sb]
uart401 6628 0 [sb_lib]
sound 55732 0 [sb_lib uart401]
soundcore 3780 0 [sb_lib sound]
usb-ohci 18216 0 (unused)
usbcore 58304 1 [usb-ohci]
rtc 6560 0 (autoclean)
ext3 74004 2
jbd 38452 2 [ext3]

# cat /proc/interrupts
CPU0
0: 382609 XT-PIC timer
1: 8129 XT-PIC keyboard
2: 0 XT-PIC cascade
5: 1 XT-PIC soundblaster
8: 1 XT-PIC rtc
9: 1559 XT-PIC usb-ohci
12: 48716 XT-PIC PS/2 Mouse
14: 12691 XT-PIC ide0
15: 7 XT-PIC ide1
NMI: 0
LOC: 0
ERR: 0
MIS: 0

# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 350.810
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow k6_mtrr
bogomips : 699.59

# cat /proc/devices
Character devices:
1 mem
2 pty/m%d
3 pty/s%d
4 tts/%d
5 cua/%d
6 lp
7 vcs
10 misc
14 sound
29 fb
108 ppp
128 ptm
136 pts/%d
162 raw
180 usb
226 drm

Block devices:
1 ramdisk
3 ide0
9 md
11 sr
22 ide1

# cat /proc/dma
1: SoundBlaster8
4: cascade

# cat /proc/filesystems
nodev rootfs
nodev bdev
nodev proc
nodev sockfs
nodev tmpfs
nodev shm
nodev pipefs
ext2
nodev ramfs
nodev devfs
nodev devpts
ext3
nodev usbdevfs
nodev usbfs
nodev supermount
vfat

# cat /proc/iomem
00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000f0000-000fffff : System ROM
00100000-07ffffff : System RAM
00100000-0022614c : Kernel code
0022614d-0029547f : Kernel data
c7c00000-cbcfffff : PCI Bus #01
c8000000-c9ffffff : 3Dfx Interactive, Inc. Voodoo Banshee
cbe00000-cfefffff : PCI Bus #01
cc000000-cdffffff : 3Dfx Interactive, Inc. Voodoo Banshee
dffff000-dfffffff : Acer Laboratories Inc. [ALi] USB 1.1 Controller
dffff000-dfffffff : usb-ohci
e0000000-e3ffffff : Acer Laboratories Inc. [ALi] M1541
fec00000-fec00fff : reserved
fee00000-fee00fff : reserved
fffe0000-ffffffff : reserved

# cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
0213-0213 : isapnp read
0220-022f : soundblaster
0330-0333 : MPU-401 UART
0376-0376 : ide1
0378-037a : parport0
03c0-03df : vga+
03f6-03f6 : ide0
0a79-0a79 : isapnp write
0cf8-0cff : PCI conf1
c000-cfff : PCI Bus #01
cc00-ccff : 3Dfx Interactive, Inc. Voodoo Banshee
ffa0-ffaf : Acer Laboratories Inc. [ALi] M5229 IDE
ffa0-ffa7 : ide0
ffa8-ffaf : ide1

# cat /proc/pci
PCI devices found:
Bus 0, device 0, function 0:
Host bridge: Acer Laboratories Inc. [ALi] M1541 (rev 4).
Master Capable. Latency=64.
Non-prefetchable 32 bit memory at 0xe0000000 [0xe3ffffff].
Bus 0, device 1, function 0:
PCI bridge: Acer Laboratories Inc. [ALi] M1541 PCI to AGP Controller (rev 4).
Master Capable. Latency=64. Min Gnt=11.
Bus 0, device 2, function 0:
USB Controller: Acer Laboratories Inc. [ALi] USB 1.1 Controller (rev 3).
IRQ 9.
Master Capable. Latency=64. Max Lat=80.
Non-prefetchable 32 bit memory at 0xdffff000 [0xdfffffff].
Bus 0, device 7, function 0:
ISA bridge: Acer Laboratories Inc. [ALi] M1533 PCI to ISA Bridge [Aladdin IV] (rev 195).
Bus 0, device 15, function 0:
IDE interface: Acer Laboratories Inc. [ALi] M5229 IDE (rev 193).
IRQ 14.
Master Capable. Latency=32. Min Gnt=2.Max Lat=4.
I/O at 0xffa0 [0xffaf].
Bus 1, device 0, function 0:
VGA compatible controller: 3Dfx Interactive, Inc. Voodoo Banshee (rev 3).
IRQ 11.
Non-prefetchable 32 bit memory at 0xcc000000 [0xcdffffff].
Prefetchable 32 bit memory at 0xc8000000 [0xc9ffffff].
I/O at 0xcc00 [0xccff].

2003-02-28 06:41:53

by Denis Vlasenko

Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On 27 February 2003 15:23, wyleus wrote:
> > cpuburn will help you rule out defective CPU theory.
> > Also you can start removing/swapping hardware parts.
>
> Thanks for taking the time to reply, Denis. I ran cpuburn as
> you suggested. Specifically, I ran the burnK6 binary for about an
> hour (from an xterm, and I also had other stuff like galeon running
> simultaneously) and didn't get any hiccups. Monitoring with top, I
> saw that I had 0% free cpu during the test.
>
> However I also later ran the burnMMX binary, and that one would quit
> after running a few minutes without printing any error messages to
> the screen - nor did it leave any messages in /var/log/*. Dunno if
> that's normal? Shouldn't it keep running until I kill it?

AFAIK there is a README in cpuburn. Let me see... yes, and Design.
Here is an excerpt:
==== Design ====
All at 2 * 5.5 * 97 MHz (26'C ambient). Higher and my CPU1 will lockup
under burnP6 in 5-10 min. Kernel compiles are stable to 99 MHz for
24 h. But 98 MHz will give `burnBX` errors every 5-8 hours, and 95
MHz will give burnMMX D errors every ~6 hours, so now I run 94 MHz.
Errors seem to increase 10x for every 1 MHz.
...
REVISED BURNMMX: I started this project as simply a way for AMD system
owners to check out their systems. I was very surprised when my own
system started throwing errors with the MMX memory moves, and had to
downclock from 2 * 5.5 * 97 MHz to 94 MHz. It would seem that the simple
memory moves are more fragile (less robust to interrupts) than the 2%
higher bandwidth string moves.
==== README ====
TO USE: root privileges are NOT required. It has been designed for ELF
Linux, but also tested under FreeBSD and a.out. Burn Testing
is best done from a ramdisk distribution (tomsrtbt) or with
filesystems unmounted or mounted read-only. Untar the source
in a convenient directory:
`tar zxf cpuburn`
compile executables
`make`
run desired program in background [ _repeat_ for SMP]:
`burnP6 || echo $? &`

Monitor progress of cpuburn by `ps`. When finished, `kill` the burn*
process(es). If you have temperature probes (fingers) or the lm-sensors
package, you can check your CPU temperature and/or system voltages.

If an error occurs in calculations, it will be preserved, and the
program will terminate with error code 254 for an integer/memory error,
and error code 255 for a FP/MMX error. Error checking happens every
10-40 sec for burnP6/K6/K7 and I haven't seen any CPU errors in testing
[lockups occur first]. burnBX and burnMMX check for error every 512 MB
(4-10 sec), and error termination is frequently seen, lockups are rarer.

burnBX and burnMMX are essentially very intense RAM testers. They can
also take an optional parameter indicating the RAM size to be tested:

A = 2 kB E = 32 kB I = 512 kB M = 8 MB
B = 4 F = 64 J = 1 MB N = 16
C = 8 G = 128 K = 2 O = 32
D = 16 H = 256 L = 4 P = 64
================

so, you need to run "burnMMX x; echo $?" to see an error code.
It's definitely bad that it does not run forever.
(btw I'm CCing cpuburn author)
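
The exit codes and size letters from the README excerpt can be wrapped in two small helpers. This is only a sketch that restates the excerpt; the names explain_burn_status and burn_size_kb are hypothetical, and the path in the usage comment is an assumption:

```shell
#!/bin/sh
# Both helpers are hypothetical names written for illustration; they only
# restate the README excerpt quoted above.

# explain_burn_status CODE: translate an exit status from a burn* program.
explain_burn_status() {
    case $1 in
        254) echo "integer/memory error" ;;
        255) echo "FP/MMX error" ;;
        *)   echo "exit status $1 (not covered by the README excerpt)" ;;
    esac
}

# burn_size_kb LETTER: size in kB for a burnBX/burnMMX size letter A..P
# (A = 2 kB, doubling with each letter up to P = 64 MB = 65536 kB).
burn_size_kb() {
    letters=ABCDEFGHIJKLMNOP
    [ "${#1}" -eq 1 ] || return 1
    rest=${letters#*"$1"}          # text after the letter; unchanged if absent
    idx=$((16 - ${#rest}))         # 1 for A ... 16 for P, 0 if not found
    [ "$idx" -ge 1 ] || return 1
    kb=1
    while [ "$idx" -gt 0 ]; do
        kb=$((kb * 2))
        idx=$((idx - 1))
    done
    echo "$kb"
}

# Example (assumes burnMMX is in the current directory):
#   ./burnMMX P; explain_burn_status $?
```

Capturing the status right after the program stops is the important part, because a silent exit leaves no other trace on the screen or in the logs.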

> Can I draw any conclusions thus far? Could I conclude now that the
> CPU and RAM are OK (because of the memtest86 and cpuburn tests)?

As you see above, something in hw is not ok.

> Also, assuming that all of my hardware is OK (yes a big assumption),
> could the problems I have be due to a misconfiguration or otherwise
> software problem?
>
> As for swapping parts, I don't have any extra computer parts. Given
> the nature of these kernel bug messages in the syslog, can I use that
> fact to logically limit the range of hardware that can cause these
> problems? I.E. can I say for example that malfunctioning PSU, RAM,
> CPU or Mobo can cause these symptoms, but that the ISA sound card, or
> the video card would not? Or do I have no choice but to suspect each
> and every piece of hardware in my box?

We don't know. If it's a PSU, removing some seemingly irrelevant part
can reduce wattage and make system stable... There are tons
of possibilities, the only way to know is to experiment.

> > Test with some vanilla 2.4 kernels, not a distro one. If 2.4.20
> > crashes, try some of the earlier kernels too. Compile them for 386
> > uniprocessor with debugging and magic SysRq enabled. Provide your
> > .config
>
> As I mentioned to John, I'm still a newbie and don't feel comfortable
> yet in my competency to replace the kernel. I have a lot of (newbie)
> time invested in this installation and don't want to break it by
> messing with stuff that's way beyond my understanding.

It's not as scary as it seems. Compiling a kernel won't damage a system.
You can damage it only by an improper install. Read the lilo (or whatever
Mandrake is using) docs carefully first.

> > Run your klogd with -x to make it stop decoding oopses.
> > Run oopses thru ksymoops and provide result.
>
> I'm not clear whether you intended these suggestions to apply after
> I've changed the kernel, or also to my current setup.

To current setup.

> Assuming you mean applying it currently - grepping for klogd in /etc
> shows that mandrake uses an environment variable KLOGD_OPTIONS to run
> klogd. This was set to "-2". However the variable is set to this
> value in no less than 32 different files. 24 of them seem to be
> related to runlevels (4x6), and 8 unrelated. This means (if my
> logic is correct) that the same variable is set 9 times for a
> particular runlevel (rl-3 in my case). Which one should I change
> that won't get overridden? Or should I change them all, or use
> another approach?

Wow. Maybe simply turn them all into "-x -2"?
(my man klogd says nil about -2)
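
Before editing 32 files, it may be worth checking how many distinct files are really involved. The sketch below assumes, which nothing in this thread confirms, that the runlevel-related hits are symlinks pointing at a shared init script; it lists each real file once. The function name unique_setters is hypothetical, and readlink -f is the GNU coreutils form:

```shell
#!/bin/sh
# unique_setters VAR DIR: list the distinct real files under DIR that
# mention VAR, resolving symlinks so several links to one script count once.
# (Hypothetical helper; VAR is matched as a fixed string.)
unique_setters() {
    find "$2" \( -type f -o -type l \) -exec grep -Fls -e "$1" {} + 2>/dev/null |
        while IFS= read -r path; do
            readlink -f "$path"
        done | sort -u
}

# Example: unique_setters KLOGD_OPTIONS /etc
```

If the list collapses to one or two paths, changing those should be enough; if it really does show many independent files, changing them all as suggested above remains the fallback.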

> > Provide lsmod, lspci output, some of /proc/* files (interrupts etc)
>
> Assuming more is better than less;
>
> # lsmod
> Module Size Used by Not tainted
> tdfx 31524 1
> agpgart 31840 0 (autoclean) (unused)
> sr_mod 15096 0 (autoclean) (unused)
> parport_pc 21672 1 (autoclean)
> lp 6720 0 (autoclean)
> parport 23936 1 (autoclean) [parport_pc lp]
> ipt_TOS 984 12 (autoclean)
> ipt_LOG 3384 5 (autoclean)
> ipt_REJECT 2744 4 (autoclean)
> ipt_state 568 9 (autoclean)
> iptable_mangle 2072 1 (autoclean)
> ip_nat_irc 2384 0 (unused)
> ip_nat_ftp 2992 0 (unused)
> iptable_nat 15224 2 [ip_nat_irc ip_nat_ftp]
> ip_conntrack_irc 3056 1
> ip_conntrack_ftp 3952 1
> ip_conntrack 18400 4 [ipt_state ip_nat_irc ip_nat_ftp
> iptable_nat ip_conntrack_irc ip_conntrack_ftp]
> iptable_filter 1644 1 (autoclean)
> ip_tables 11672 9 [ipt_TOS ipt_LOG ipt_REJECT
> ipt_state iptable_mangle iptable_nat iptable_filter]
> nfsd 66576 0 (autoclean)
> lockd 46480 0 (autoclean) [nfsd]
> sunrpc 60188 0 (autoclean) [nfsd lockd]
> ppp_synctty 5952 1 (autoclean)
> ppp_generic 20064 3 (autoclean) [ppp_synctty]
> slhc 5072 0 (autoclean) [ppp_generic]
> n_hdlc 6368 1 (autoclean)
> nls_iso8859-1 2844 1 (autoclean)
> nls_cp850 3580 1 (autoclean)
> vfat 9588 1 (autoclean)
> fat 31864 0 (autoclean) [vfat]
> supermount 14340 1 (autoclean)
> ide-cd 28712 0
> cdrom 26848 0 [sr_mod ide-cd]
> ide-scsi 8212 0
> scsi_mod 90372 2 [sr_mod ide-scsi]
> sb 7668 0
> sb_lib 34958 0 [sb]
> uart401 6628 0 [sb_lib]
> sound 55732 0 [sb_lib uart401]
> soundcore 3780 0 [sb_lib sound]
> usb-ohci 18216 0 (unused)
> usbcore 58304 1 [usb-ohci]
> rtc 6560 0 (autoclean)
> ext3 74004 2
> jbd 38452 2 [ext3]

Wow. Do you really use ALL this stuff, or is it the Mandrake install
default?
--
vda

2003-02-28 13:37:07

by Robert Redelmeier


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On Fri, Feb 28, 2003 at 08:41:41AM +0200, Denis Vlasenko wrote:
> On 27 February 2003 15:23, wyleus wrote:
> > Thanks for taking the time to reply, Denis. I ran cpuburn as
> > you suggested. Specifically, I ran the burnK6 binary for about an
> > hour (from an xterm, and I also had other stuff like galeon running
> > simultaneously) and didn't get any hiccups. Monitoring with top, I
> > saw that I had 0% free cpu during the test.
> >
> > However I also later ran the burnMMX binary, and that one would quit
> > after running a few minutes without printing any error messages to
> > the screen - nor did it leave any messages in /var/log/*. Dunno if
> > that's normal? Shouldn't it keep running until I kill it?

No, it's not normal. `burnMMX` should keep running. If it quits,
an `echo $?` will get the return code. burnMMX is more than
an MMX math tester; it also uses MMX to test the RAM controller
and busses. It is specifically designed to induce crosstalk and
catch slow state transitions.

What has happened is that burnMMX has provoked a memory error
and quit. This is not surprising. Many K6s were not capable of
100 MHz bus operation, and even more motherboards could not run at
100 stably. Not to mention the slews of expensive sub-spec RAM
of the era and small AT power supplies.

You have bad hardware. You must expect trouble. Linux runs hardware
pretty hard. Correctness then Performance appears to be Linus'
philosophy. If you are lucky, you can down-clock your bus. If you
are _very_ lucky, a kernel without any K6 optimizations [compiled for
a 386] in the `bzero` and `bcopy` routines might reduce your error
frequency. But if X detects and uses K6 routines, you're hosed.

-- Robert author `cpuburn` http://users.ev1.net/~redelm

2003-02-28 14:02:20

by John Bradford


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

> You have bad hardware. You must expect trouble. Linux runs hardware
> pretty hard. Correctness then Performance appears to be Linus'
> philosophy. If you are lucky, you can down-clock your bus. If you
> are _very_ lucky, a kernel without any K6 optimizations [compiled for
> a 386] in the `bzero` and `bcopy` routines might reduce your error
> frequency. But if X detects and uses K6 routines, you're hosed.

Also, try re-seating your RAM chips, and make sure that the CPU fan
and heatsink are free of dust and properly attached to the CPU.

John.

2003-03-01 13:11:45

by wyleus


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

Before I start, I just want to thank all three of you (Denis, John, and Robert) for your patience and very helpful comments. I'm learning a lot from you guys. I'm CC'ing you all on this reply; please let me know if you'd rather not have me do that.

On Fri, 28 Feb 2003 08:41:41 +0200
Denis Vlasenko <[email protected]> wrote:

> Wow. Do you really use ALL this stuff or it's Mandrake install
> default?

It's the mandrake default AFAIK. I don't know what all that stuff is,
so I don't mess with it. My installation does "feel" bloated (very
unscientific opinion): it "feels" much less responsive in the GUI
(currently icewm - KDE was like molasses, but very nice IMHO) than win98, especially certain apps such as galeon. But that's another problem, and I'll move on to that problem after the stability issue is resolved. I suppose one day, after getting more comfortable with linux, I'll move from this newbie distro to something like debian or gentoo - but I'm far from ready for that at this point in time - right now I need mandrake to hold my hand. :-|

On Fri, 28 Feb 2003 07:47:00 -0600
Robert Redelmeier <[email protected]> wrote:

> What has happened is that burnMMX has provoked a memory error
> and quit. This is not surprising. Many K6s were not capable of
> 100 MHz bus operation, and even more motherboards could not run at
> 100 stably. Not to mention the slews of expensive sub-spec RAM
> of the era and small AT power supplies.

On Fri, 28 Feb 2003 14:13:01 +0000 (GMT)
John Bradford <[email protected]> wrote:

> Also, try re-seating your RAM chips, and make sure that the CPU fan
> and heatsink are free of dust and properly attached to the CPU.

Yesterday I ran burnMMX repeatedly and recorded the results in a text file. Today, I took everything apart and cleaned up any dust and then moved the single RAM stick into the next slot over (I have 3 slots in total). Initially I was elated as I ran three tests for about 20 minutes each with no errors. But my bubble popped on the 4th run. Changing slots does look like it improved things judging from the results, but still not as it should be - at least that's the way it looks to me. I'm still running tests as I write this, but will copy the results so far below and let you judge;

Robert - thanks for this cpuburn program of yours, it's very helpful, and for a scared newbie like me, it sure is simpler than facing a kernel compile :-) It's small, simple, and effective. I guess other people find it useful too, since someone took the time to RPM package it on the mandrake contrib mirrors.

I'm not sure why cpuburn uncovered errors, while memtest86 didn't - in retrospect this may be my own doing because I ran memtest86 without fully reading the docs and just let it run with whatever the default options were without really understanding what it was doing - so I may have got what I deserved. :-(

Thanks again, here are my notes on what I've done so far;

Friday, Feb 28 2003
Results of burnMMX tests

command: burnMMX x; echo $?

where x represents memory size parameter passed to burnMMX as follows;

<small excerpt from cpuburn readme>
burnBX and burnMMX are essentially very intense RAM testers. They can
also take an optional parameter indicating the RAM size to be tested:

A =  2 kB    E =  32 kB    I = 512 kB    M =  8 MB
B =  4       F =  64       J =   1 MB    N = 16
C =  8       G = 128       K =   2       O = 32
D = 16       H = 256       L =   4       P = 64

the default memsize used when none is specified is F=64k

exit codes for burnMMX are as follows;
130 = process killed manually using ctl-c
254 = integer/memory error
255 = FP/MMX error

mem         runtime    exit
size        (minutes)  code

A (2K)       26:00     130
             28:15     130
F (64K)       2:00     254
             11:00     130
              6:00     130
             21:42     130
G (128K)      6:00     130
H (256K)      3:25     254
              2:40     254
              0:45     254  | these
              1:35     254  | are
              0:40     254  | consecutive
              3:45     254  | runs
             33:00     130  |
              7:00     254  |
              7:00     254  |
              5:16     254  |
             17:19     254  |
I (512K)      6:00     254
              1:48     254
              5:34     254
Sat, March 1, 2003

Switched the RAM stick from the first slot (closest
to CPU), to the middle slot;

command: time burnMMX x; echo $?

(using the time command, manual exits using
ctl-c provide exit code 2, but I still list it
here as 130 in the table for consistency)

mem         runtime    exit
size        (minutes)  code

G (128K)      2:46     254
             21:50     130
H (256K)     20:12     130
             33:46     130
I (512K)     20:06     130
             21:58     130
J (1024K)    21:57     130

Only one error so far after 7 runs, which seems much better than before, but still unacceptable I guess...

Where should I go from here? Try another slot? Buy new RAM? More testing?

wyleus

2003-03-01 14:44:14

by John Bradford


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

> It's the mandrake default AFAIK. I don't know what all that stuff is,
> so I don't mess with it. My installation does "feel" bloated (very
> unscientific opinion): it "feels" much less responsive in the GUI

As your machine is quite old, you would probably get a noticeable speed
increase from mounting your filesystems with noatime, which is very
straightforward and shouldn't cause any problems - just edit
/etc/fstab, and add the option noatime after each disk partition, for
example, you might have something like:

/dev/hda2 / ext3 defaults 1 1

which you can change to

/dev/hda2 / ext3 defaults, noatime 1 1

This is a bit off-topic, but in my experience is about the best way to
increase performance on old, (and not so old), hardware, apart from
compiling a custom kernel. Without noatime, every time you read a
file, the current date and time is written to the disk. With noatime,
it's only recorded for a write. Almost no programs use the access
time data.

> Yesterday I ran burnMMX repeatedly and recorded the results in a
> text file. Today, I took everything apart and cleaned up any dust
> and then moved the single RAM stick into the next slot over (I have
> 3 slots in total).

Are you sure there isn't a correct slot that it should be in? Most
motherboard manuals specify that the slots should be used in a
specific order.

> Initially I was elated as I ran three tests for about 20 minutes
> each with no errors. But my bubble popped on the 4th run. Changing
> slots does look like it improved things judging from the results,
> but still not as it should be - at least that's the way it looks to
> me.

I seriously doubt that a single RAM module should be installed in the
middle slot of three. One of the end slots would seem more likely.

> I'm still running tests as I write this, but will copy the
> results so far below and let you judge;

> Where should I go from here? Try another slot? Buy new RAM? More
> testing?

It might have been disconnecting and reconnecting the RAM that
improved things, not the change of slot. Try both end slots.

John.

2003-03-01 14:53:29

by Jan-Benedict Glaw


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On Sat, 2003-03-01 14:55:58 +0000, John Bradford <[email protected]>
wrote in message <[email protected]>:
> > It's the mandrake default AFAIK. I don't know what all that stuff is,
> > so I don't mess with it. My installation does "feel" bloated (very
> > unscientific opinion): it "feels" much less responsive in the GUI
>
> /dev/hda2 / ext3 defaults 1 1
>
> which you can change to
>
> /dev/hda2 / ext3 defaults, noatime 1 1
you lose -----^

> This is a bit off-topic, but in my experience is about the best way to
> increase performance on old, (and not so old), hardware, apart from
> compiling a custom kernel. Without noatime, every time you read a
> file, the current date and time is written to the disk. With noatime,
> it's only recorded for a write. Almost no programs use the access
> time data.

Except some email clients...

MfG, JBG

--
Jan-Benedict Glaw [email protected] . +49-172-7608481
"Eine Freie Meinung in einem Freien Kopf | Gegen Zensur | Gegen Krieg
fuer einen Freien Staat voll Freier B?rger" | im Internet! | im Irak!
ret = do_actions((curr | FREE_SPEECH) & ~(IRAQ_WAR_2 | DRM | TCPA));


2003-03-01 15:23:41

by Robert Redelmeier


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On Sat, Mar 01, 2003 at 02:55:58PM +0000, John Bradford wrote:
> /dev/hda2 / ext3 defaults, noatime 1 1
^

No space. At times Unix is just as fussy as IBM JCL :)
Agreed on the value of `noatime`. It made a huge difference
in responsiveness on an old 486sx25 laptop.
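
For anyone who would rather script the fstab change than edit by hand, here is a minimal sketch in Python. The function name `add_mount_option` is made up for this example. It assumes the usual six-field fstab layout and treats the fourth field as a comma-separated option list with no whitespace in it, which is exactly the point made above.

```python
def add_mount_option(fstab_line, option="noatime"):
    """Return fstab_line with `option` added to its options field.

    Hypothetical helper for illustration; not part of any standard tool.
    """
    stripped = fstab_line.strip()
    # Leave blank lines and comments untouched.
    if not stripped or stripped.startswith("#"):
        return fstab_line
    # Fields: device, mount point, fs type, options, dump, pass.
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line
    options = fields[3].split(",")
    if option not in options:
        options.append(option)
    # Join with a bare comma: a space here would start a new field.
    fields[3] = ",".join(options)
    # Note: this collapses the original column spacing to single spaces.
    return " ".join(fields)


if __name__ == "__main__":
    print(add_mount_option("/dev/hda2 / ext3 defaults 1 1"))
```

Run on the example line from this thread, it prints `/dev/hda2 / ext3 defaults,noatime 1 1`. Try it on a copy of /etc/fstab first rather than on the real file.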

> I seriously doubt that a single RAM module should be installed in the
> middle slot of three. One of the end slots would seem more likely.

Agreed. But on some mobos, it doesn't matter. Then I prefer the
furthest slot from the RAM controller. Sure, signal time-of-flight
is a few ps longer, but at least the bus is terminated and will
have fewer reflections.

> It might have been disconnecting and reconnecting the RAM that
> improved things, not the change of slot. Try both end slots.

Yes. Reseating will create new electrical connections.

-- Robert author `cpuburn` http://users.ev1.net/~redelm

2003-03-01 20:45:12

by wyleus


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On Sat, 1 Mar 2003 14:55:58 +0000 (GMT)
John Bradford <[email protected]> wrote:

> As your machine is quite old, you would probably get a noticeable speed
> increase from mounting your filesystems with noatime, which is very
> straightforward and shouldn't cause any problems - just edit
> /etc/fstab, and add the option noatime after each disk partition, for

Thanks for the tip. I applied it, but haven't noticed any
immediately obvious benefits in the short time I've spent so far. Guess
I won't be able to avoid the dreaded kernel recompile forever. ;-)

> Are you sure there isn't a correct slot that it should be in? Most
> motherboard manuals specify that the slots should be used in a
> specific order.

Guess I've got one of those unusual mobos - it's an MSI-5169. Here's an
excerpt from my printed manual (typed manually);

"2.2-3 Memory Population Rules

1. This mainboard supports Table Free memory, so memory can be installed
in DIMM1, DIMM2, or DIMM3 in any order."

> It might have been disconnecting and reconnecting the RAM that
> improved things, not the change of slot. Try both end slots.

I've been testing it all day, and have collected more data (please see
below) - it's definitely a huge improvement over what I had before,
whatever the underlying reason may be. Looks like a whole order of
magnitude to me. Still not perfect though.

Tomorrow, I will try switching to an end slot as you suggest, and see if
that gets me any further improvement. At this point, I'm thinking of
taking the money and running because I am planning on upgrading my
system in a few months anyway, and think I can tolerate a few crashes
until then - just so long as they're not happening almost daily as they
were before. After all, I'm coming from the world of windows, so I'm not
exactly a virgin in this respect. ;-) I will demand perfection from
the new system when I finally get it though.

It may be more productive for me now to spend my time learning and
investigating and tweaking other stuff, as I have a long list of things
(like some performance issues I mentioned before) I would like to work
on to get this installation working to my liking.

When I get my new system this spring, it'll have a spacious 100 gig hard
disk, which I plan to partition with a few different distros -
probably mandrake again, along with debian and gentoo. I'd like to get
my comfort factor up between now and then, and hope to gain experience
in a lot of different aspects of administration and just plain
understanding how stuff in the linux world works. Hopefully I'll have
a kernel recompile or two under my belt by then, and won't be so clueless
about how the system all fits together.

This experience and your advice have really helped me a lot. I can't
thank you guys enough for your help, you've been great - not just in
helping me solve my problem, but also in learning a great deal by
walking me through it and taking the time to answer my dumb
questions.

My already high respect for the community has gone up yet another notch.
I wish the rest of the world could always be this nice.

best regards

wyleus

....

Here's an updated version of the notes I took;

Friday, Feb 28 2003
Results of burnMMX tests

command: burnMMX x; echo $?

where x represents memory size parameter passed to burnMMX as follows;

<small excerpt from cpuburn readme>
burnBX and burnMMX are essentially very intense RAM testers. They can
also take an optional parameter indicating the RAM size to be tested:

A =  2 kB    E =  32 kB    I = 512 kB    M =  8 MB
B =  4       F =  64       J =   1 MB    N = 16
C =  8       G = 128       K =   2       O = 32
D = 16       H = 256       L =   4       P = 64

the default memsize used when none is specified is F=64k

exit codes for burnMMX are as follows;
130 = process killed manually using ctl-c
254 = integer/memory error
255 = FP/MMX error

mem         runtime    exit
size        (minutes)  code

A (2K)       26:00     130
             28:15     130
F (64K)       2:00     254
             11:00     130
              6:00     130
             21:42     130
G (128K)      6:00     130
H (256K)      3:25     254
              2:40     254
              0:45     254  | these
              1:35     254  | are
              0:40     254  | consecutive
              3:45     254  | runs
             33:00     130  |
              7:00     254  |
              7:00     254  |
              5:16     254  |
             17:19     254  |
I (512K)      6:00     254
              1:48     254
              5:34     254
            =====
Total runtime  ~197 minutes
# of failures  14
ave run/fail   14 minutes

Sat, March 1, 2003

Switched the RAM stick from the first slot (closest
to CPU), to the middle slot;

command: time burnMMX x; echo $?

(using the time command, manual exits using
ctl-c provide exit code 2, but I still list it
as 130 in the table for consistency)

mem         runtime    exit
size        (minutes)  code

F (64K)      22:42     130
             12:20     254
             23:37     130
            192:47     130
G (128K)      2:46     254
             21:50     130
             30:55     130
             43:59     130
H (256K)     20:12     130
             33:46     130
             24:33     130
I (512K)     20:06     130
             21:58     130
             21:03     254
J (1024K)    21:57     130
             28:46     130
            =====
Total runtime  ~543 minutes
# of failures  3
ave run/fail   181 minutes
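
The three summary figures can be re-derived from the table with a few lines of Python. This is only a sketch: the run list is typed in from the March 1 table above, and the names `RUNS`, `to_seconds` and `summarize` are invented for the example. A run counts as a failure when its exit code is 254 or 255, per the exit-code list in the notes.

```python
def to_seconds(mmss):
    """Convert a "minutes:seconds" string such as "192:47" to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (runtime, exit code) pairs copied from the March 1 table.
RUNS = [
    ("22:42", 130), ("12:20", 254), ("23:37", 130), ("192:47", 130),  # F
    ("2:46", 254), ("21:50", 130), ("30:55", 130), ("43:59", 130),    # G
    ("20:12", 130), ("33:46", 130), ("24:33", 130),                   # H
    ("20:06", 130), ("21:58", 130), ("21:03", 254),                   # I
    ("21:57", 130), ("28:46", 130),                                   # J
]


def summarize(runs):
    """Return (total minutes, failure count, minutes of runtime per failure)."""
    total_minutes = sum(to_seconds(t) for t, _ in runs) / 60
    failures = sum(1 for _, code in runs if code in (254, 255))
    per_failure = total_minutes / failures if failures else None
    return total_minutes, failures, per_failure


if __name__ == "__main__":
    total, failures, per_failure = summarize(RUNS)
    print(f"total ~{total:.0f} min, {failures} failures, ~{per_failure:.0f} min per failure")
```

On this data it gives about 543 minutes, 3 failures, and about 181 minutes of runtime per failure, matching the hand-computed totals.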

2003-03-02 02:26:38

by jw schultz


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On Sat, Mar 01, 2003 at 04:03:50PM +0100, Jan-Benedict Glaw wrote:
> On Sat, 2003-03-01 14:55:58 +0000, John Bradford <[email protected]>
> wrote in message <[email protected]>:
> > > It's the mandrake default AFAIK. I don't know what all that stuff is,
> > > so I don't mess with it. My installation does "feel" bloated (very
> > > unscientific opinion): it "feels" much less responsive in the GUI
> >
> > /dev/hda2 / ext3 defaults 1 1
> >
> > which you can change to
> >
> > /dev/hda2 / ext3 defaults, noatime 1 1
> you lose -----^
>
> > This is a bit off-topic, but in my experience is about the best way to
> > increase performance on old, (and not so old), hardware, apart from
> > compiling a custom kernel. Without noatime, every time you read a
> > file, the current date and time is written to the disk. With noatime,
> > it's only recorded for a write. Almost no programs use the access
> > time data.
>
> Except some email clients...

And as you and i already hashed out on the rsync list mutt is
perfectly happy and fully functional with noatime,nodiratime
because it updates the atime manually which still works.
As the headers indicate i'm using mutt, It spots new mail
in other mboxes just fine with noatime turned on. I get the
"Inc:" count, c<tab> works, and folder lists indicate
correctly which folders have been updated since last visit.

Perhaps you are thinking of another MUA or an oooold version
of mutt?

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2003-03-02 11:04:35

by wyleus


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On Sun, 2 Mar 2003 08:54:41 +0000 (GMT)
John Bradford <[email protected]> wrote:

> It's interesting to note that these will be running mainly in cache
> memory, and they have worked much more reliably than the ones below,
> which use main memory more heavily.

I also had this in mind from the first few times I ran the program.
Which is why I chose to run several time-limited runs across different
memory sizes that straddled the L2 cache size (128k), instead of letting
each one run 'till it died.

My thinking was/is that this probably says that the problem lies beyond
the cache, and probably points to the RAM chip itself, or at least
something in the logical path after the cache but before reaching the
RAM. (I'm no expert, so I don't know what those other possibilities
could include).
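
To put a number on the cache-versus-main-memory split, here is a rough Python sketch that groups the Feb 28 runs by whether the test size fits in the L2 cache. It assumes the 128k L2 figure mentioned above and counts a 128K test as fitting; both are assumptions, and the names are made up for the example. The run data is typed in from the Friday table.

```python
L2_CACHE_KB = 128  # L2 size as stated in the post; adjust for other boards

# (test size in kB, runtime "min:sec", exit code), from the Feb 28 notes.
FRIDAY_RUNS = [
    (2, "26:00", 130), (2, "28:15", 130),
    (64, "2:00", 254), (64, "11:00", 130), (64, "6:00", 130), (64, "21:42", 130),
    (128, "6:00", 130),
    (256, "3:25", 254), (256, "2:40", 254), (256, "0:45", 254), (256, "1:35", 254),
    (256, "0:40", 254), (256, "3:45", 254), (256, "33:00", 130), (256, "7:00", 254),
    (256, "7:00", 254), (256, "5:16", 254), (256, "17:19", 254),
    (512, "6:00", 254), (512, "1:48", 254), (512, "5:34", 254),
]


def to_minutes(mmss):
    """Convert a "minutes:seconds" string to fractional minutes."""
    minutes, seconds = mmss.split(":")
    return int(minutes) + int(seconds) / 60


def split_by_cache(runs, cache_kb=L2_CACHE_KB):
    """Sum runtime and count failing runs on each side of the cache size."""
    groups = {"fits in cache": [0.0, 0], "exceeds cache": [0.0, 0]}
    for size_kb, runtime, code in runs:
        key = "fits in cache" if size_kb <= cache_kb else "exceeds cache"
        groups[key][0] += to_minutes(runtime)
        if code in (254, 255):  # burnMMX error exits, per the notes
            groups[key][1] += 1
    return groups


if __name__ == "__main__":
    for name, (minutes, failures) in split_by_cache(FRIDAY_RUNS).items():
        print(f"{name}: {minutes:.0f} min, {failures} failures")
```

On the Friday data this comes out to roughly 101 minutes with 1 failure for the sizes that fit in cache, against roughly 96 minutes with 13 failures for the larger sizes. It is a small sample, so treat it as a hint rather than proof.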

John, I think I may have been premature in my celebrations. After I
wrote that message, I continued testing late into the night, and I
later started getting rapid consecutive failures.

This gave me the cynical thought "what, does it depend on the time of day
now?". But I suppose that's really a possibility, isn't it? It could
even be affected by my local electricity - voltage variations, or
whatever? Just taking stabs in the dark here. I'm a foreigner living
on a small island (Cyprus) and I suppose it could be possible that they
may not be up to the more reliable generation standards in other
countries. Certainly, my local ISP seems to go down more often than I do
(they're running windows). We have a LUG here, maybe I should ask the
local linfolk if they are experiencing any instability. I do remember
making a comment to my wife that ever since arriving here a few years
ago, my windows partition (where I've spent most of my time) seems to
crash more often than it used to.

Some other observations about the crashes I've been getting (it's
been a few days since the last one) - they tend to clump together. That
is, I could go several days without incident, but then have 3 or 4
crashes in one night. I'll have to untar my logs and concatenate them
to verify this numerically, but that's what I remember. Also, they may
happen more at night than they do in the daytime, IIRC. I shut
off my box overnight.

Another observation I'll mention, but I don't know if it's of any
significance, is that when I got the hard freezes, my computer wouldn't
respond to ctl-alt-del nor to the capslock key. But recently I learned
about the Magic SysRq feature, and for the last couple of crashes I've
verified that the kernel DOES respond to those. Does this say anything
helpful?

Man, do I feel confused compared to yesterday.

Here's an updated version of my notes. Guess I should start recording
the time/date of each run from now on. I've been doing this manually so
far, I should probably automate it into a script that loops the test
and appends the output into a text file. (Got to read some man pages to
figure out how to do that)
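
One possible shape for such a loop, sketched in Python rather than shell. The function names, the log layout and the per-run time limit are all assumptions for illustration. Since burnMMX keeps running when it finds no error, each run is stopped after a time limit and logged as error-free. It uses only the standard library.

```python
import os
import subprocess
import time
from datetime import datetime


def run_once(cmd, limit_seconds):
    """Run cmd once; return (exit_code, elapsed_seconds).

    exit_code is None when the command was still running at the time limit.
    """
    start = time.monotonic()
    try:
        code = subprocess.run(cmd, timeout=limit_seconds).returncode
    except subprocess.TimeoutExpired:
        code = None  # subprocess.run kills the child before raising this
    elapsed = time.monotonic() - start
    # Python reports death-by-signal as a negative number; convert it to the
    # shell's 128+signal convention so a SIGINT shows up as 130.
    if code is not None and code < 0:
        code = 128 - code
    return code, elapsed


def log_runs(cmd, logfile, count, limit_seconds):
    """Run cmd `count` times, appending one timestamped line per run."""
    with open(logfile, "a") as log:
        for _ in range(count):
            started = datetime.now().isoformat(timespec="seconds")
            code, elapsed = run_once(cmd, limit_seconds)
            result = "no error before time limit" if code is None else f"exit={code}"
            minutes, seconds = divmod(int(elapsed), 60)
            log.write(f"{started}\t{' '.join(cmd)}\t{minutes}:{seconds:02d}\t{result}\n")
            # Ask the OS to write the line to disk now, since this box is
            # prone to hard freezes.
            log.flush()
            os.fsync(log.fileno())
```

A call might look like `log_runs(["burnMMX", "H"], "burnmmx.log", count=10, limit_seconds=20 * 60)`. The log file name, run count and 20-minute limit are placeholders to adjust.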

Friday, Feb 28 2003
Results of burnMMX tests

command: burnMMX x; echo $?

where x represents memory size parameter passed to burnMMX as follows;

<small excerpt from cpuburn readme>
burnBX and burnMMX are essentially very intense RAM testers. They can
also take an optional parameter indicating the RAM size to be tested:

A =  2 kB    E =  32 kB    I = 512 kB    M =  8 MB
B =  4       F =  64       J =   1 MB    N = 16
C =  8       G = 128       K =   2       O = 32
D = 16       H = 256       L =   4       P = 64

the default memsize used when none is specified is F=64k

exit codes for burnMMX are as follows;
130 = process killed manually using ctl-c
254 = integer/memory error
255 = FP/MMX error

mem         runtime    exit
size        (minutes)  code

A (2K)       26:00     130
             28:15     130
F (64K)       2:00     254
             11:00     130
              6:00     130
             21:42     130
G (128K)      6:00     130
H (256K)      3:25     254
              2:40     254
              0:45     254  | these
              1:35     254  | are
              0:40     254  | consecutive
              3:45     254  | runs
             33:00     130  |
              7:00     254  |
              7:00     254  |
              5:16     254  |
             17:19     254  |
I (512K)      6:00     254
              1:48     254
              5:34     254
            =====
Total runtime  ~197 minutes
# of failures  14
ave run/fail   14 minutes

Sat, March 1, 2003

Switched the RAM stick from the first slot (closest
to CPU), to the middle slot;

command: time burnMMX x; echo $?

(using the time command, manual exits using
ctl-c provide exit code 2, but I still list it
as 130 in the table for consistency)

mem         runtime    exit
size        (minutes)  code

F (64K)      22:42     130
             12:20     254  *
             23:37     130
            192:47     130
G (128K)      2:46     254  *
             21:50     130
             30:55     130
             43:13     254  *
H (256K)     20:12     130
             33:46     130
             24:33     130
             18:59     254  *
I (512K)     20:06     130
             21:58     130
             21:03     254  *
             26:50     254  *
J (1024K)    21:57     130
             28:46     130
              1:38     254  *
              2:10     254  *
             12:50     254  *
              4:47     254  *
              3:39     254  *
            =====
Total runtime  ~604 minutes
# of failures  11
ave run/fail   55 minutes

2003-03-03 15:12:45

by Denis Vlasenko


Subject: Re: syslog full of kernel BUGS, frequent intermittent instability

On 1 March 2003 15:21, wyleus wrote:
> Friday, Feb 28 2003
> Results of burnMMX tests
>
> command: burnMMX x; echo $?
>
> where x represents memory size parameter passed to burnMMX as
> follows;
>
> <small excerpt from cpuburn readme>
> burnBX and burnMMX are essentially very intense RAM testers. They
> can also take an optional parameter indicating the RAM size to be
> tested:
>
> A = 2 kB E = 32 kB I = 512 kB M = 8 MB
> B = 4 F = 64 J = 1 MB N = 16
> C = 8 G = 128 K = 2 O = 32
> D = 16 H = 256 L = 4 P = 64
>
> the default memsize used when none is specified is F=64k
>
> exit codes for burnMMX are as follows;
> 130 = process killed manually using ctl-c
> 254 = integer/memory error
> 255 = FP/MMX error
>
> mem runtime exit
> size (minutes) code
>
> A (2K) 26:00 130
> 28:15 130
> F (64K) 2:00 254
> 11:00 130
> 6:00 130
> 21:42 130
> G (128K) 6:00 130
> H (256K) 3:25 254
> 2:40 254
> 0:45 254 1 these
> 1:35 254 1 are
> 0:40 254 1 consecutive
> 3:45 254 1 runs
> 33:00 130 1
> 7:00 254 1
> 7:00 254 1
> 5:16 254 1
> 17:19 254 1
> I (512K) 6:00 254
> 1:48 254
> 5:34 254
>
> Sat, March 1, 2003
>
> Switched the RAM stick from the first slot (closest
> to CPU), to the middle slot;
>
> command: time burnMMX x; echo $?
>
> (using the time command, manual exits using
> ctl-c provide exit code 2, but I still list it
> here as 130 in the table for consistency)
>
> mem runtime exit
> size (minutes) code
>
> G (128K) 2:46 254
> 21:50 130
> H (256K) 20:12 130
> 33:46 130
> I (512K) 20:06 130
> 21:58 130
> J (1024K) 21:57 130
>
> Only one error so far after 7 runs, which seems much better than
> before, but still unacceptable I guess...
>
> Where should I go from here? Try another slot? Buy new RAM? More
> testing?

You should underclock and/or overvolt your system until it runs these
tests stably.
--
vda