Hi all,
Well, I have found the smoking gun as to what is causing the performance
problems that I have been seeing on my dual Athlon box (Tyan S2466, dual
Athlon MP 2800+, 2.5GB memory).
Indeed, it has nothing to do with the Athlon. The reason I
thought it did was that a dual Pentium box that I have access to (2.4GHz,
2GB memory) has no trace of the problem. That machine is running Red Hat
7.2, which is 2.4.7-based. It didn't occur to me that there might have been
a severe performance problem introduced into the kernel sometime between
2.4.7 and 2.4.18, but that, it turns out, is exactly what happened.
The problem is in `try_to_free_pages' and its associated routines,
`shrink_caches' and `shrink_cache', in `mm/vmscan.c'. After I made some
changes to greatly reduce lock contention in the slab allocator and
`shrink_cache', and then instrumented `shrink_cache' to see what it was
doing, the problem showed up very clearly.
In one approximately 60-second period with the problematic workload running,
`try_to_free_pages' was called 511 times. It made 2597 calls to
`shrink_caches', which made 2592 calls to `shrink_cache' (i.e. it was very
rare for `kmem_cache_reap' to release enough pages itself). The main loop
of `shrink_cache' was executed -- brace yourselves -- 189 million times!
During that time it called `page_cache_release' on only 31265 pages.
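To put those numbers in proportion:

    189,000,000 / 31,265 is roughly 6,000 loop iterations per page actually released
    189,000,000 /  2,592 is roughly 73,000 iterations per call to `shrink_cache'

That is an enormous amount of scanning for very little reclaim.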
`shrink_cache' didn't even exist in 2.4.7. Whatever mechanism 2.4.7 had for
releasing pages was evidently much more time-efficient, at least in the
particular situation I'm looking at.
Clearly the kernel group has been aware of the problems with `shrink_cache',
as I see that it has received quite a bit of attention in the course of 2.5
development. I am hopeful that the problem will be substantially
ameliorated in 2.6.0. (The comment at the top of `try_to_free_pages' --
"This is a fairly lame algorithm - it can result in excessive CPU burning"
-- suggests it won't be cured entirely.)
However, it seems the kernel group may not have been aware of just how bad
the problem can be in recent 2.4 kernels on dual-processor machines with
lots of memory. It's bad enough that running two `find' jobs at the same
time on large filesystems can bring the machine pretty much to its knees.
There are many things about this code I don't understand, but the most
puzzling is this. When `try_to_free_pages' is called, it sets out to free
32 pages (the value of `SWAP_CLUSTER_MAX'). It's prepared to do a very
large amount of work to accomplish this goal, and if it fails, it will call
`out_of_memory'. Given that, what's odd is that it's being called when
memory isn't even close to being full (there's a good 800MB free, according
to `top'). It seems crazy that `out_of_memory' might be called when there
are hundreds of MB of free pages, just because `shrink_caches' couldn't find
32 pages to free. It suggests to me that `try_to_free_pages' is being
called in two contexts: one when a page allocation fails, and the other just
for general cleaning, and that it's the latter context that's causing the
problem.
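For reference, the overall shape of the routine (this is a paraphrase from memory of the 2.4.18 `mm/vmscan.c', so the details may be slightly off) is roughly:

	int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask,
			      unsigned int order)
	{
		int priority = DEF_PRIORITY;
		int nr_pages = SWAP_CLUSTER_MAX;	/* 32 */

		do {
			nr_pages = shrink_caches(classzone, priority,
						 gfp_mask, nr_pages);
			if (nr_pages <= 0)
				return 1;		/* freed enough */
		} while (--priority);

		/* no priority level managed to free 32 pages */
		out_of_memory();
		return 0;
	}

So as far as this routine is concerned, failing to find 32 freeable pages is treated the same way regardless of how much memory is actually free, which is what makes the second, "general cleaning" context look so suspicious.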
I will do some more instrumentation to try to verify this. Comments
solicited.
-- Scott
"Scott L. Burson" <[email protected]> wrote:
>
> The problem is in `try_to_free_pages' and its associated routines,
This is not unusual.
> In one approximately 60-second period with the problematic workload running,
What is the problematic workload? Please describe it in great detail.
> Clearly the kernel group has been aware of the problems with `shrink_cache',
> as I see that it has received quite a bit of attention in the course of 2.5
> development. I am hopeful that the problem will be substantially
> ameliorated in 2.6.0. (The comment at the top of `try_to_free_pages' --
> "This is a fairly lame algorithm - it can result in excessive CPU burning"
> -- suggests it won't be cured entirely.)
That comment has thus far proved to be wrong.
> However, it seems the kernel group may not have been aware of just how bad
> the problem can be in recent 2.4 kernels on dual-processor machines with
> lots of memory. It's bad enough that running two `find' jobs at the same
> time on large filesystems can bring the machine pretty much to its knees.
oh, is that the workload?
Send a copy of /proc/meminfo, captured when the badness is happening. Also
/proc/slabinfo.
Probably you will find that all of the low memory is consumed by inodes and
dentries. ext2 is particularly prone to this because its directory pages
are placed in highmem, and those pages can pin down the dentries (and hence
the inodes).
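A quick way to eyeball the slab numbers is a crude userspace helper along these lines (purely illustrative, not from the kernel tree; it assumes the 2.4 /proc/slabinfo column order of name, active objects, total objects, object size):

	/* slabsum.c: crude summary of memory held by each slab cache */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/slabinfo", "r");
		char line[512], name[64];
		unsigned long active, total, objsize;

		if (!f) {
			perror("/proc/slabinfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			/* skip the header and anything we can't parse */
			if (sscanf(line, "%63s %lu %lu %lu",
				   name, &active, &total, &objsize) != 4)
				continue;
			/* only report caches holding more than ~1MB */
			if (total * objsize >= 1024 * 1024)
				printf("%-20s %8lu objs x %5lu bytes = %7lu KB\n",
				       name, total, objsize,
				       total * objsize / 1024);
		}
		fclose(f);
		return 0;
	}

If inode_cache and dentry_cache dominate that output, you are seeing the situation described above.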
So sigh. It is a problem which has been solved for a year at least. Try
running one of Andrea's kernels, from
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4
The most important patch for you is 10_inode-highmem-2.
On Sat, 2 Aug 2003, Scott L. Burson wrote:
> In one approximately 60-second period with the problematic workload running,
> `try_to_free_pages' was called 511 times. It made 2597 calls to
> `shrink_caches', which made 2592 calls to `shrink_cache' (i.e. it was very
> rare for `kmem_cache_reap' to release enough pages itself). The main loop
> of `shrink_cache' was executed -- brace yourselves -- 189 million times!
> During that time it called `page_cache_release' on only 31265 pages.
Can you reproduce this problem with the -rmap patch for the 2.4 VM?
Arjan, wli, myself and others have done quite a bit of work to make
sure the VM doesn't run around in circles madly when faced with a
large memory configuration.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Hi Scott
On Sun, 3 Aug 2003 06:03, Scott L. Burson wrote:
> In one approximately 60-second period with the problematic workload
> running, `try_to_free_pages' was called 511 times. It made 2597 calls to
> `shrink_caches', which made 2592 calls to `shrink_cache' (i.e. it was very
> rare for `kmem_cache_reap' to release enough pages itself). The main loop
> of `shrink_cache' was executed -- brace yourselves -- 189 million times!
> During that time it called `page_cache_release' on only 31265 pages.
I noticed a curly section of the VM code while playing around with some of the
hacks that are in the -ck kernel. This one might be helpful, since it isn't so
much a hack as a fix, in mm/vmscan.c around line 600. The problem is that when
the priority drops to 1, the code should be doing the most cache reaping, but
instead it bypasses some of it.
You could try this modification and see if it helps.
This isn't a real patch but you should get the idea.
Con
	nr_pages -= kmem_cache_reap(gfp_mask);
-	if (nr_pages <= 0)
-		return 0;
+	if (nr_pages < 1)
+		goto shrinkcheck;

	nr_pages = chunk_size;
	/* try to keep the active list 2/3 of the size of the cache */
	ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
	refill_inactive(ratio);

	nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
-	if (nr_pages <= 0)
-		return 0;
+	/*
+	 * Will return if nr_pages have been freed unless the
+	 * priority managed to reach 1. If the vm is under this much
+	 * pressure then shrink the d/i/dqcaches regardless. CK 2003
+	 */
+shrinkcheck:
+	if (nr_pages < 1) {
+		if (priority > 1)
+			return 0;
+		else
+			nr_pages = 0;
+	}
+
	shrink_dcache_memory(priority, gfp_mask);
	shrink_icache_memory(priority, gfp_mask);
From: Andrew Morton <[email protected]>
Date: Sat, 2 Aug 2003 14:44:22 -0700
It is a problem which has been solved for a year at least. Try
running one of Andrea's kernels, from
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4
The most important patch for you is 10_inode-highmem-2.
I tried that patch by itself and it appeared to make a noticeable
improvement, but it was far from a complete fix.
So next I tried the SuSE 8.2 kernel. It is clearly *much* better, and I see
that Andrea in fact did a bit of work on `mm/vmscan.c'. The key patch
appears to be `05_vm_06_swap_out-3', but it's possible that several or all
of the `05_vm_*' patches are helpful.
I see that even the very most recent Red Hat kernel (2.4.20-19.7, released
only two weeks ago) does not seem to have these fixes. (I wasn't running Red
Hat -- my machine started out with SuSE 7.3, and I hand-upgraded it to
2.4.18 -- but Mathieu Malaterre, who is CCed above and whose query to me
about the problem got me started looking at it again, is using Red Hat.)
So I strongly urge the powers that be to include these patches in 2.4.22.
-- Scott
On Sun, 3 Aug 2003, Scott L. Burson wrote:
> I see that even the very most recent Red Hat kernel (2.4.20-19.7,
> released only two weeks ago) does not seem to have these fixes.
Look again. The kernel that came with RH9 has pretty much
all of the highmem fixes, the update kernels later on have
them all.
The main difference is that the VM in RH9 is closer to that
of 2.5, so the patches don't look the same.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Scott wrote:
>The problem is in `try_to_free_pages' and its associated routines,
>`shrink_caches' and `shrink_cache', in `mm/vmscan.c'. After I made some
>changes to greatly reduce lock contention in the slab allocator and
>`shrink_cache',
>
How did you change the slab locking?
> and then instrumented `shrink_cache' to see what it was
>doing, the problem showed up very clearly.
>
>In one approximately 60-second period with the problematic workload running,
>`try_to_free_pages' was called 511 times. It made 2597 calls to
>`shrink_caches', which made 2592 calls to `shrink_cache' (i.e. it was very
>rare for `kmem_cache_reap' to release enough pages itself).
>
2.6 contains a simple fix: I've removed kmem_cache_reap. Instead the
code checks for empty pages in the slab caches every other second.
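In outline it is a periodic reaper instead of reaping in the allocation path, something like this (a sketch only; free_empty_slabs and reap_work here are illustrative names, not the actual 2.6 identifiers):

	/* Sketch: rather than kmem_cache_reap() being called from the
	 * page reclaim path, a piece of deferred work runs every couple
	 * of seconds and returns fully empty slabs to the page allocator.
	 */
	#define REAP_INTERVAL	(2 * HZ)

	static void cache_reap(void *unused)
	{
		kmem_cache_t *cachep;

		down(&cache_chain_sem);
		list_for_each_entry(cachep, &cache_chain, next)
			free_empty_slabs(cachep);	/* illustrative helper */
		up(&cache_chain_sem);

		schedule_delayed_work(&reap_work, REAP_INTERVAL);
	}

So the expensive walk over the caches happens at a fixed, low rate off to the side, instead of being repeated on every trip through the reclaim path.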
--
Manfred