Hi,
In March of last year I briefly corresponded with this list about a
performance problem that shows up on the dual Athlon MP running a 2.4 kernel
(apparently any 2.4 kernel, though I happen to be using 2.4.18). I don't
have a solution, but I do have some more information.
First, and probably the reason you haven't heard more complaints about the
problem, its severity is evidently dependent on the size of main memory. At
512MB it doesn't seem to be much of a problem (right, Mathieu?). At 2.5GB,
which is what I have, it can be quite serious. For instance, if I start two
`find' processes at the roots of different filesystems, the system can spend
(according to `top') 95% - 98% of its time in the kernel. It even gets
worse than that, but `top' stops updating -- in fact, the system can seem
completely frozen, but it does recover eventually. Stopping or killing one
of the `find' processes brings it back fairly quickly, though it can take a
while to accomplish that.
The dual Pentium, as you are probably aware, has no trace of the problem.
What I think is going on (after doing some profiling, some lockmeter
measurements, and other stuff I'll describe below) is simply that the
spinlock handoff time is much longer on the Athlon than on the Pentium. So
any spinlock contention hurts much worse on the Athlon.
The kernel spinlock loop includes the Pentium 4 PAUSE instruction (aka "REP
NOP"). The Intel Pentium instruction set reference manual has this to say
about PAUSE:
    When executing a spin-wait loop, a Pentium 4 or Intel Xeon processor
    suffers a severe performance penalty when exiting the loop because it
    detects a possible memory order violation. The PAUSE instruction
    provides a hint to the processor that the code sequence is a spin-wait
    loop. The processor uses this hint to avoid the memory order violation
    in most situations, which greatly improves processor performance.
I don't know enough about cache coherency protocols to know what a memory
order violation is or what it might cost in nanoseconds, but clearly the
Pentium had a bad enough problem that Intel decided to fix it. I speculate
that the Athlon has the same problem.
I urge AMD to do the following:
(1) Figure out whether this is in fact the problem.
(2) Figure out whether a different instruction sequence for the spinlock
inner loop would work better. (The existing sequence, for those at AMD who
may not have the kernel source handy, is simply CMPB 0, [lock address];
PAUSE; JLE [address of CMPB] -- see the sketch just after this list.)
(3) Make sure the Hammer doesn't have the same problem -- or if it does, fix
it! I assure you it will show up in the field.
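To make (2) concrete, here is that inner loop written out as GCC inline
assembly. This is only an illustrative sketch of the 2.4 i386 slow path,
not a proposed change, and the LOCK; DECB acquire fast path around it is
omitted:

    /* Sketch of the 2.4 i386 spin-wait inner loop.  The lock byte is 1
     * when free and <= 0 when held; we spin, with PAUSE (REP NOP) in the
     * loop, until it goes positive, and then the caller retries the
     * LOCK; DECB acquire (not shown). */
    static inline void spin_wait(volatile signed char *lock)
    {
        asm volatile(
            "1:\n\t"
            "cmpb $0, %0\n\t"   /* still held (byte <= 0)? */
            "rep; nop\n\t"      /* PAUSE hint; doesn't touch the flags */
            "jle 1b"            /* yes -- keep spinning */
            : /* no outputs */
            : "m" (*lock)
            : "memory");
    }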
Meanwhile, supposing that (a) this is the problem and (b) there's no
alternative instruction sequence that cures it, the question remains what to
do about it.
One of the first things I found when profiling was that the worst of the
problem seemed to be in the slab allocator. I made some changes to
`mm/slab.c' in 2.4.18 intended to reduce spinlock contention therein. I
believe they were successful (though I should do some before-and-after runs
with lockmeter to verify this), and I think they ameliorate the problem
somewhat. However, it appears that there are other sources of contention
that show up once that one is fixed (the next one appears, according to the
profiler, to be somewhere in `shrink_cache' in `mm/vmscan.c'). (Sorry I
never released my patch to `slab.c'; since it wasn't a complete cure, and I
didn't understand why not, and didn't have any more time to figure it out,
I held onto it. It is stable, though; I've been using it for over a year.)
I could continue to try to remove sources of contention in 2.4, but of
course another interesting question is how much it would help simply to
switch to 2.6. I don't know that the 2.6 beta is quite to the point where
I'm ready to try it; and anyway 2.6 is probably still a few months from
general use. If it's thought that 2.6 reduces contention generally, it
might be worth a shot; but on the other hand it probably would still be
worthwhile to fix this in 2.4, if I can. Comments?
-- Scott
On Mer, 2003-07-30 at 22:50, Scott L. Burson wrote:
> First, and probably the reason you haven't heard more complaints about the
> problem, its severity is evidently dependent on the size of main memory. At
> 512MB it doesn't seem to be much of a problem (right, Mathieu?). At 2.5GB,
> which is what I have, it can be quite serious. For instance, if I start two
> `find' processes at the roots of different filesystems, the system can spend
> (according to `top') 95% - 98% of its time in the kernel. It even gets
> worse than that, but `top' stops updating -- in fact, the system can seem
> completely frozen, but it does recover eventually. Stopping or killing one
> of the `find' processes brings it back fairly quickly, though it can take a
> while to accomplish that.
That's the well-understood DMA bounce buffers problem. It should be
better in current 2.4, or with something like the Red Hat enterprise
kernel, or probably the -aa patches.
It's nothing to do with AMD, although it can in part depend on what I/O
devices your system has and how much data hits the bounce buffers.
Scott L. Burson wrote:
> Hi,
>
> First, and probably the reason you haven't heard more complaints about the
> problem, its severity is evidently dependent on the size of main memory. At
> 512MB it doesn't seem to be much of a problem (right, Mathieu?).
Right. I have 1.5 GB and can reproduce the problem.
And 'append mem=512M' in lilo made things work nicely too.
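(For reference, that is just the usual append line in /etc/lilo.conf --
something like the excerpt below, where the image and label are only
placeholders for whatever your setup uses -- followed by rerunning
/sbin/lilo:)

    # /etc/lilo.conf excerpt -- image/label are placeholders
    image=/boot/vmlinuz-2.4.18
        label=linux-512m
        # limit the kernel's view of RAM to 512 MB
        append="mem=512M"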
> At 2.5GB,
> which is what I have, it can be quite serious. For instance, if I start two
> `find' processes at the roots of different filesystems, the system can spend
> (according to `top') 95% - 98% of its time in the kernel. It even gets
> worse than that, but `top' stops updating -- in fact, the system can seem
> completely frozen, but it does recover eventually. Stopping or killing one
> of the `find' processes brings it back fairly quickly, though it can take a
> while to accomplish that.
In fact, last week I had such bad warm reboots that I opened the box, and
all of a sudden everything was working fine again.
So I would say I have a power supply or fan problem. And I think I have
read some posts about it in the past:
[System Starvation under heavy io load with HIGHMEM4G]
http://www.ussg.iu.edu/hypermail/linux/kernel/0303.2/1435.html
[Tyan 2460/Dual Athlon MP hangs]
http://www.ussg.iu.edu/hypermail/linux/kernel/0207.0/0040.html
Even though I have a Tyan S2460, I read that:
[The Thunder K7 is an Extended ATX board, measuring 12 x 13 inches. It
only supports Registered DDR PC1600/2100 memory, so your old DIMMs won't
work. Your old power supply won't work either. The Thunder K7 needs an
extra 8-pin power connector. It's not the same extra power connector
that Intel Pentium 4 Xeon-based motherboards need either, so you must
get a special power supply that currently only will work with this one
board.]
http://www.linuxjournal.com/bg/advice/ulb_02.php
What do you think of it?
Here is my uptime:
$ uptime
5:15pm up 3 days, 3:18, 13 users, load average: 0.08, 0.25, 0.21
And I have been running rather heavy jobs ('make -j') with a lot of IO...
One final thing: I am pretty much a novice at these things, so my
apologies if I said something completely dumb.
my 2 cents,
mathieu
From: Alan Cox <[email protected]>
Date: 30 Jul 2003 23:59:00 +0100
On Mer, 2003-07-30 at 22:50, Scott L. Burson wrote:
> First, and probably the reason you haven't heard more complaints about the
> problem, its severity is evidently dependent on the size of main memory. At
> 512MB it doesn't seem to be much of a problem (right, Mathieu?). At 2.5GB,
> which is what I have, it can be quite serious. For instance, if I start two
> `find' processes at the roots of different filesystems, the system can spend
> (according to `top') 95% - 98% of its time in the kernel. It even gets
> worse than that, but `top' stops updating -- in fact, the system can seem
> completely frozen, but it does recover eventually. Stopping or killing one
> of the `find' processes brings it back fairly quickly, though it can take a
> while to accomplish that.
That's the well-understood DMA bounce buffers problem.
It's definitely not the bounce buffers problem. I installed the patch and
it doesn't help (well, maybe it helps a little; it's hard to tell).
However, I have pretty strong evidence that it's not the spinlock handoff
time either. I wrote a small benchmark that starts two threads that do
nothing but hand two spinlocks back and forth. The Athlon runs it an order
of magnitude _faster_ than the P4 (5ns vs. 50ns, roughly, per handoff).
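For the record, the benchmark was roughly of the following shape. This is
a simplified sketch rather than the exact code I ran; it uses a naive
XCHG-based userland spinlock with REP NOP in the wait loop, and the two
locks are set up so that each thread always has to wait for a release by
the other:

    /* Two-thread spinlock handoff ("ping-pong") benchmark sketch.
     * Each thread acquires the lock the other thread last released and
     * releases the lock the other thread is spinning on, so every
     * iteration forces a lock handoff between the two CPUs. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define ITERS 1000000

    static volatile int lock0 = 0;      /* 0 = free, 1 = held */
    static volatile int lock1 = 1;      /* starts held, so B waits first */

    static int xchg(volatile int *p, int v)
    {
        asm volatile("xchgl %0, %1" : "+r" (v), "+m" (*p) : : "memory");
        return v;                       /* old value of *p */
    }

    static void spin_lock(volatile int *l)
    {
        while (xchg(l, 1))              /* atomic test-and-set */
            while (*l)
                asm volatile("rep; nop" : : : "memory");    /* PAUSE */
    }

    static void spin_unlock(volatile int *l)
    {
        asm volatile("" : : : "memory");    /* compiler barrier */
        *l = 0;
    }

    static void *thread_a(void *arg)
    {
        int i;
        (void)arg;
        for (i = 0; i < ITERS; i++) {
            spin_lock(&lock0);
            spin_unlock(&lock1);        /* let B go */
        }
        return NULL;
    }

    static void *thread_b(void *arg)
    {
        int i;
        (void)arg;
        for (i = 0; i < ITERS; i++) {
            spin_lock(&lock1);
            spin_unlock(&lock0);        /* let A go */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        struct timeval t0, t1;
        double us;

        gettimeofday(&t0, NULL);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        gettimeofday(&t1, NULL);

        us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("~%.1f ns per handoff\n", us * 1000.0 / (2.0 * ITERS));
        return 0;
    }

Build with something like `gcc -O2 -o handoff handoff.c -lpthread'. The
point is just that every acquisition has to wait for a release by the
other CPU, so the timing is dominated by the lock handoff itself.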
I'm fairly certain that lock contention is involved somehow, though.
Lockmeter reports that lock waiting is consuming about 35% of the CPU cycles
when the problem is happening. This isn't the 90% - 95% number I expected
-- the latter being the percentage of time spent in the kernel, as reported
by `top' -- but it's high enough to wonder about, and it may be artificially
low. Lockmeter has a spinlock that protects its data structures, and the
profile says 36% of the time is being spent in the routine that acquires
that lock. This suggests that lockmeter isn't counting that time.
One oddity pointed up by lockmeter is that `pagemap_lru_lock' is held by
`shrink_cache' some 85% of the time. This seems way too high, and I am
looking into it.
-- Scott