2003-08-26 03:31:46

by Cheetah

Subject: High load under low io with 2.4.19+

The quick version: almost any i/o is slow and can trigger high load.

I'm experiencing abnormally high load averages and matching system
slow-downs when doing even minimal i/o on my system. When this load
issue is bad, the system load can move towards 5 or 10, with little or
no cpu usage, and not much disk access (<1MB/s, no thrashing sounds).
When the load average gets high because of this kind of thing, the
system can become quite unresponsive.
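
For anyone trying to reproduce this: Linux counts processes in
uninterruptible sleep (state D, usually waiting on i/o) toward the load
average, which is how load can climb with little cpu usage. Counting
processes by state while the load is high shows whether that is what is
happening. A minimal sketch, assuming a Linux /proc filesystem; the
function names are illustrative and not from any existing tool:

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state (R = runnable, D = uninterruptible sleep)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, ValueError, IndexError):
            # The process exited between listdir() and open(), or the
            # line was not in the expected format.
            continue
    return counts
```

If count_states()["D"] stays at several processes while the cpus are idle,
the load is coming from tasks blocked on i/o rather than from runnable
tasks.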

Some empirical data points:

Compiling the kernel with -j2 (2-cpu system) while doing nothing else
intensive (various idle processes common to a desktop workstation) can
drive the load up towards 10. During this time, the two cpus in
combination use, on average, 50% of capacity. At a more detailed level,
what I often see is that the two cpus will hit 100% utilization for a
couple seconds, and then drop down to near zero for a couple seconds.
On every other multi-cpu system I've used, make -j ncpus on the kernel
comes very close to fully loading all cpus.

Running find / >/dev/null can bring the system almost to its knees, to
the point where the mouse cursor gets jittery in X. Running find /
without piping it to /dev/null shows frequent pauses in the output for
2-5 seconds at a time.
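
The pauses can be put into numbers by timestamping each line of output as
it arrives and reporting the gaps above a threshold. A rough sketch;
find_stalls, timestamp_lines and the 2-second default are illustrative
choices, not an existing tool:

```python
import time


def find_stalls(timestamps, threshold=2.0):
    """Return (index, gap) for every gap between consecutive timestamps
    that is at least `threshold` seconds long."""
    stalls = []
    previous = None
    for index, stamp in enumerate(timestamps):
        if previous is not None and stamp - previous >= threshold:
            stalls.append((index, stamp - previous))
        previous = stamp
    return stalls


def timestamp_lines(stream):
    """Yield a monotonic timestamp for every line read from `stream`."""
    for _ in stream:
        yield time.monotonic()


# Example, reading lines from a pipe (e.g. `find / | python3 stalls.py`):
#   import sys
#   print(find_stalls(timestamp_lines(sys.stdin)))
```

Output written to a pipe is normally buffered in blocks, so lines arrive
in bursts and the reported gaps are per burst rather than per line. Stalls
of several seconds should still stand out.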

Starting up pine (I have some startup key sequences pine auto-runs
which do a couple things) can take 10+ seconds. On my previous system,
which was nominally much slower, these startup actions would usually
take < 1 sec.

Generally, any i/o that can't be satisfied from cache goes rather slow,
and any large amount of i/o drives the system load way up and makes the
system unresponsive (can take seconds to respond to mouse or keyboard
event).

According to sysstat/sar, my 5-minute load average hasn't dropped below
0.1 since this began, even during the day when the system should be
completely idle. sar is reporting daily load averages almost never
going below 0.5. For the vast majority of the day (when I'm not
actively using it), this machine does almost nothing besides fetching
small amounts of e-mail and logging a couple small IRC channels.
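
The same averages can be read directly from /proc/loadavg without
sysstat. A small sketch of parsing that file, assuming the usual Linux
layout of three averages, a runnable/total task pair and the last pid;
parse_loadavg is an illustrative name:

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict.

    Expected layout: "<1min> <5min> <15min> <runnable>/<total> <last_pid>"
    """
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "runnable": int(runnable),
        "total": int(total),
        "last_pid": int(last_pid),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live load averages (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())
```

Logging read_loadavg()["load5"] once a minute gives a record comparable
to the sar numbers above.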

Debian's 2.4.18-bf24 kernel that came with 3.0r1 doesn't behave
perfectly for me, but does behave quite a bit better. Self-built 2.4.18
through 2.4.22 all exhibit this issue. I've tried the -aa and -rmap
patches on .21, and they made no difference whatsoever. I've tried
using the noapic boot option, and that didn't help. I've tried
disabling HIGHMEM and HIGHIO, and that didn't help.

I verified that the ide-raid adaptor wasn't just going totally slow by
doing things like dd if=/dev/zero of=test, and was seeing ~30MB/sec
sustained (according to stats in /proc/partitions).
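
A throughput figure like this is the difference of two counter readings
divided by the elapsed time. A small sketch of that arithmetic, assuming
the counter is a sectors-written count and that a sector is 512 bytes
(both are assumptions to check against your kernel); the function name is
made up for illustration:

```python
SECTOR_BYTES = 512  # assumption: counters are in 512-byte sectors


def throughput_mb_per_s(sectors_before, sectors_after, seconds):
    """Convert two sector-counter samples taken `seconds` apart into MB/s
    (1 MB = 1024 * 1024 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    written_bytes = (sectors_after - sectors_before) * SECTOR_BYTES
    return written_bytes / seconds / (1024 * 1024)
```

For example, 614400 sectors written over 10 seconds works out to 30 MB/s.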

I've run similarly configured kernels on other systems with similar OS
configurations, but lower end hardware configurations (single cpu,
single ide hd, 512MB ram) and not seen these kinds of issues.

Basic hardware information:
Dual Athlon MP 2400+
MSI K7D-Master L motherboard (AMD MPX Chipset)
1GB ECC RAM (Check+Scrub enabled in BIOS)
3Ware 7500-4 LP ATA-RAID Controller (3x 120GB drive, RAID5)

The BIOS is set to use MP spec 1.1, since I saw some things on the web
that indicated that 1.4 wouldn't work.

Some notes related to some kernel threads I came upon trying to solve
this bug:

MTRRs seem to be enabled OK:
# cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0xe8000000 (3712MB), size= 128MB: write-combining, count=1
reg05: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
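
A quick way to check output like this mechanically is to parse each line
and confirm that a write-back region starting at address 0 spans all of
RAM (1024MB here). A minimal sketch, assuming the line format shown above;
parse_mtrr and write_back_covers are illustrative names, and this only
looks at the ranges listed in /proc/mtrr:

```python
import re

# Matches lines such as:
# reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
MTRR_LINE = re.compile(
    r"reg(?P<reg>\d+): base=0x(?P<base>[0-9a-fA-F]+) \(\s*\d+MB\), "
    r"size=\s*(?P<size_mb>\d+)MB: (?P<type>[\w-]+), count=(?P<count>\d+)"
)


def parse_mtrr(text):
    """Parse /proc/mtrr-style text into a list of region dicts."""
    regions = []
    for line in text.splitlines():
        match = MTRR_LINE.search(line)
        if match:
            regions.append({
                "reg": int(match.group("reg")),
                "base": int(match.group("base"), 16),
                "size_mb": int(match.group("size_mb")),
                "type": match.group("type"),
            })
    return regions


def write_back_covers(regions, ram_mb):
    """True if one write-back region starting at 0 spans `ram_mb` of RAM."""
    return any(
        r["type"] == "write-back" and r["base"] == 0 and r["size_mb"] >= ram_mb
        for r in regions
    )
```

With the three lines above as input, write_back_covers(regions, 1024)
is true, which matches the 1GB of RAM installed.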

However, I'm seeing this in my dmesg every time I boot:
# dmesg | grep mtrr
mtrr: v1.40 (20010327) Richard Gooch ([email protected])
mtrr: detected mtrr type: Intel
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs

3D acceleration on the nvidia card works quite well, however, so I
imagine mtrr has to be working more or less.

The symptoms (i/o pausing for several seconds) seem similar to those
described in the "get_request starvation bug" thread
(http://lkml.org/lkml/2002/2/8/73) and "design locking bug in
wait_on_page/wait_on_buffer/get_request_wait" thread
(http://lkml.org/lkml/2002/11/11/251) but those patches seem to be
integrated already in one form or another by 2.4.22. I don't see the
i/o pausing for hours at a time, but then again I'm not running abusive
benchmark programs trying to trigger such behavior.

Detailed hardware info as suggested by
http://kernel.org/pub/linux/docs/lkml/reporting-bugs.html attached as
hardware_info.txt

PS: please CC me on replies

--
-Cheetah
"Reality is that which, when you stop believing in it, doesn't go away".
-- Philip K. Dick
GPG pubkey fingerprint: A57F B354 FD30 A502 795B 9637 3EF1 3F22 A85E 2AD1


Attachments:
hardware_info.txt (18.02 kB)