There have been a lot of issues lately on our Oracle servers with runaway
processes and swapping, enough that I feel compelled to report them here. We
have tried a number of different kernels from 2.4.6 to 2.4.16, and the problem
seems to happen with all of them (but it is more pronounced on certain
kernels).
Basically, what happens is that after an unspecified amount of time, the
boxes become unresponsive and start swapping wildly. At that point, I log
in to the box to see what is going on, and I generally see something
like this:
adam@xpdb:~$ uptime
11:21am up 42 days, 18:53, 3 users, load average: 54.72, 21.21, 17.60
adam@xpdb:~$ free
             total       used       free     shared    buffers     cached
Mem:       5528464    5522744       5720          0        476    5349784
-/+ buffers/cache:     172484    5355980
Swap:      2939804    1302368    1637436
As you can see, there are supposedly 5.3 gigs of memory free (not counting
memory used for cache). However, the box is swapping like mad (about 10 megs
every 2 seconds according to vmstat) and the load is skyrocketing.
Now top, on the other hand, has a very different idea about the amount of
free memory:
CPU states:  0.0% user,  0.1% system,  0.1% nice,  0.0% idle
Mem:  5528464K av, 5523484K used,    4980K free,       0K shrd,     340K buff
Swap: 2939804K av, 1082008K used, 1857796K free                 5351892K cached
So, what am I supposed to believe? Is 'free' a useful tool? Is it providing
accurate results? I'm constantly fielding questions from people who want to
know why a box is swapping, even though 'free' reports a whole bunch of
memory free, and I'm tired of not having an answer for them.
Thanks,
--Adam
--
Adam McKenna <[email protected]> | GPG: 17A4 11F7 5E7E C2E7 08AA
http://flounder.net/publickey.html | 38B0 05D0 8BF7 2C6D 110A
> processes and swapping, enough that I feel compelled to report them here. We
> have tried a number of different kernels from 2.4.6 to 2.4.16, and the
> problem seems to happen with all of them (but it is more pronounced on
> certain kernels).
The only kernels on which you are likely not to see that happen are:
- the 2.4.9 kernel with Rik's patches that Linus didn't take
  (Red Hat 2.4.9-*)
- 2.4.17/18pre with the rmap11/rmap12 patches
- 2.4.17/18pre with the -aa patched VM
  (which I believe is also in the SuSE kernel packages)
- 2.2
The base VM in Linus's tree has been broken since before 2.4.0 and,
while somewhat better now, is still just that - broken. The major
vendors don't ship it for a reason.
Alan
[email protected] ("Adam McKenna") writes:
> Now top, on the other hand, has a very different idea about the amount of
> free memory:
A very different idea? The difference is about 1M if I read it
correctly.
--
When people ask me if I'm a nerd, I always get a bit uncomfortable
and answer somewhat apologetically: "No, I use RedHat".
    -- Allan Olesen on dk.edb.system.unix
On Fri, Feb 01, 2002 at 11:24:16AM -0800, Adam McKenna wrote:
> adam@xpdb:~$ uptime
> 11:21am up 42 days, 18:53, 3 users, load average: 54.72, 21.21, 17.60
> adam@xpdb:~$ free
>              total       used       free     shared    buffers     cached
> Mem:       5528464    5522744       5720          0        476    5349784
> -/+ buffers/cache:     172484    5355980
> Swap:      2939804    1302368    1637436
> As you can see, there are supposedly 5.3 gigs of memory free (not counting
> memory used for cache). However, the box is swapping like mad (about 10 megs
> every 2 seconds according to vmstat) and the load is skyrocketing.
That 5.3GB is without kernel caches. I see 5.7MB...
On Fri, Feb 01, 2002 at 11:24:16AM -0800, Adam McKenna wrote:
> Now top, on the other hand, has a very different idea about the amount of
> free memory:
> CPU states:  0.0% user,  0.1% system,  0.1% nice,  0.0% idle
> Mem:  5528464K av, 5523484K used,    4980K free,       0K shrd,     340K buff
> Swap: 2939804K av, 1082008K used, 1857796K free                 5351892K cached
They actually agree. The line you're reading with 5.3GB in it subtracts
kernel caches from the memory in use.
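To make the agreement concrete, here is the arithmetic both tools are
doing, as a minimal standalone C sketch (the figures are the KB numbers
from the 'free' output above; the variable names are just illustrative,
not anything free uses internally):

freecalc.c
====================================================================
#include <stdio.h>

int main(void)
{
	/* KB figures from the 'free' output quoted above. */
	long total   = 5528464;
	long used    = 5522744;
	long buffers = 476;
	long cached  = 5349784;

	/* free's "-/+ buffers/cache" line: memory the programs are
	 * really using once kernel caches are subtracted out... */
	long used_minus_caches = used - buffers - cached;   /* 172484 */
	/* ...and memory that becomes available if those caches are
	 * shrunk all the way. */
	long free_plus_caches  = total - used_minus_caches; /* 5355980 */

	printf("-/+ buffers/cache: %ld %ld\n",
	       used_minus_caches, free_plus_caches);
	return 0;
}

Feeding in top's figures instead (5523484K used, 340K buff, 5351892K
cached) gives 171252K, so the two tools agree to within about 1MB.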
The fun bit, the swapping like mad, happens because kernel caches are not
being flushed and shrunk properly in response to growth of the working
set. In more concrete terms, the kernel is making decisions which prefer
to keep things like the page cache, the dentry cache, the inode cache,
and the buffer cache in memory over the working sets of your programs.
There is some tradeoff: it is probably also not desirable to let the
working set erode kernel caches to the absolute minimum (or at least not
very easily), but obviously the tradeoffs being made here are suboptimal
for your workload (and generally insufficiently adaptive). It appears
that once the kernel caches are done with you, you've got 172MB out of
5.5GB of physical memory left for your programs' anonymous memory.
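To get a feel for the mechanism, here is a toy userspace model of that
decision. This is not kernel code: only the max_mapped idea mirrors 2.4's
mm/vmscan.c, and the list contents and threshold are invented for
illustration. A reclaim scan walks the inactive list, drops clean cache
pages cheaply, and gives up in favour of swapping once it has seen more
than max_mapped mapped (program) pages:

reclaim-toy.c
====================================================================
#include <stdio.h>

int main(void)
{
	/* An invented inactive list: 0 = cache page, 1 = mapped page. */
	int inactive[] = { 0, 1, 0, 0, 1, 1, 0, 1, 1, 1 };
	int n = sizeof(inactive) / sizeof(inactive[0]);
	int max_mapped = 3;	/* tolerance before we resort to swapping */
	int freed = 0, mapped = 0, i;

	for (i = 0; i < n; i++) {
		if (inactive[i] == 0) {
			freed++;	/* clean cache pages are cheap to drop */
			continue;
		}
		if (++mapped > max_mapped) {
			/* Stock 2.4 behaviour: abandon the scan and swap
			 * program memory instead (the behaviour the patch
			 * later in this thread changes). */
			printf("too many mapped pages (freed %d) -> swap_out()\n",
			       freed);
			return 0;
		}
	}
	printf("freed %d cache pages without swapping\n", freed);
	return 0;
}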
What kernel/VM are you using?
Could you follow up with /proc/slabinfo and /proc/meminfo?
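If it helps to capture that state when a box wedges, a minimal sketch
like the following pulls the interesting lines out of /proc/meminfo (the
tag names assume the 2.4 "Tag: value kB" format; /proc/slabinfo can just
be copied verbatim):

meminfo-grab.c
====================================================================
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Fields relevant to this thread; assumed 2.4 /proc/meminfo
	 * tag names - adjust if your kernel's format differs. */
	static const char *tags[] = {
		"MemTotal:", "MemFree:", "Buffers:", "Cached:",
		"SwapTotal:", "SwapFree:", NULL
	};
	char line[256];
	int i;
	FILE *f = fopen("/proc/meminfo", "r");

	if (f == NULL) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f) != NULL) {
		for (i = 0; tags[i] != NULL; i++)
			if (strncmp(line, tags[i], strlen(tags[i])) == 0)
				fputs(line, stdout);
	}
	fclose(f);
	return 0;
}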
Cheers,
Bill
On Fri, Feb 01, 2002 at 12:11:45PM -0800, William Lee Irwin III wrote:
> What kernel/VM are you using?
2.4.6-xfs, but we've also seen this with 2.4.14-xfs (xfs 1.0.2 release).
> Could you follow up with /proc/slabinfo and /proc/meminfo?
We've already rebooted the box; the next time we experience the problem
I'll send this info.
Meanwhile, is there any way to tune the kernel cache?
--Adam
--
Adam McKenna <[email protected]> | GPG: 17A4 11F7 5E7E C2E7 08AA
http://flounder.net/publickey.html | 38B0 05D0 8BF7 2C6D 110A
On Fri, Feb 01, 2002 at 12:11:45PM -0800, William Lee Irwin III wrote:
>> What kernel/VM are you using?
On Fri, Feb 01, 2002 at 12:32:50PM -0800, Adam McKenna wrote:
> 2.4.6-xfs, but we've also seen this with 2.4.14-xfs (xfs 1.0.2 release).
You appear to be in more trouble than I can get you out of. Could you
try again with -aa or -rmap against a recent kernel? (mainline VM appears
not to behave as well as either of these).
On Fri, Feb 01, 2002 at 12:11:45PM -0800, William Lee Irwin III wrote:
>> Could you follow up with /proc/slabinfo and /proc/meminfo?
On Fri, Feb 01, 2002 at 12:32:50PM -0800, Adam McKenna wrote:
> We've already rebooted the box; the next time we experience the problem
> I'll send this info.
> Meanwhile, is there any way to tune the kernel cache?
Kernel hacking. Until you get yourself a more stable VM it probably won't
be meaningful to try that directly, though making this tunable would be
nice, too. I've heard it's difficult to merge xfs with VM changes because
of how invasive xfs is in the VM, but I've heard of it being done several
times.
Cheers,
Bill
> The only kernels on which you are likely not to see that happen are:
> - the 2.4.9 kernel with Rik's patches that Linus didn't take
>   (Red Hat 2.4.9-*)
> - 2.4.17/18pre with the rmap11/rmap12 patches
> - 2.4.17/18pre with the -aa patched VM
>   (which I believe is also in the SuSE kernel packages)
> - 2.2
> The base VM in Linus's tree has been broken since before 2.4.0 and,
> while somewhat better now, is still just that - broken. The major
> vendors don't ship it for a reason.
Why is this?
Is Linus working toward what he believes will be a better implementation,
or is he just being stubborn? I guess I just can't imagine any reason why
he would want large enterprise applications running poorly when there are
obvious fixes.
Believe it or not, I'm not trying to start a flame war, just trying to
understand the logic.
--Buddy
On 2 February 2002 05:18, Buddy Lumpkin wrote:
> > The base VM in Linus's tree has been broken since before 2.4.0 and,
> > while somewhat better now, is still just that - broken. The major
> > vendors don't ship it for a reason.
>
> Why is this?
>
> Is Linus working toward what he believes will be a better implementation,
> or is he just being stubborn? I guess I just can't imagine any reason why
> he would want large enterprise applications running poorly when there are
> obvious fixes.
>
> Believe it or not, I'm not trying to start a flame war, just trying to
> understand the logic.
Let's not get into politics. I suggest trying the -aa kernels. If that
solves your problems, report back to lkml, Linus, and Marcelo - this will
help Andrea's patches get into mainline 2.5/2.4 faster.
As a minimum, you may try the attached patch: instead of bailing out of
the inactive-list scan and calling swap_out() as soon as it finds too
many mapped pages, it moves referenced pages back to the active list and
only swaps as a last resort.
--
vda
vmscan.patch.2.4.17.d (author: "M.H.VanLeeuwen" <[email protected]>)
====================================================================
--- linux.virgin/mm/vmscan.c Mon Dec 31 12:46:25 2001
+++ linux/mm/vmscan.c Fri Jan 11 18:03:05 2002
@@ -394,9 +394,9 @@
 		if (PageDirty(page) && is_page_cache_freeable(page) && page->mapping) {
 			/*
 			 * It is not critical here to write it only if
-			 * the page is unmapped beause any direct writer
+			 * the page is unmapped because any direct writer
 			 * like O_DIRECT would set the PG_dirty bitflag
-			 * on the phisical page after having successfully
+			 * on the physical page after having successfully
 			 * pinned it and after the I/O to the page is finished,
 			 * so the direct writes to the page cannot get lost.
 			 */
@@ -480,11 +480,14 @@
 			/*
 			 * Alert! We've found too many mapped pages on the
-			 * inactive list, so we start swapping out now!
+			 * inactive list.
+			 * Move referenced pages to the active list.
 			 */
-			spin_unlock(&pagemap_lru_lock);
-			swap_out(priority, gfp_mask, classzone);
-			return nr_pages;
+			if (PageReferenced(page) && !PageLocked(page)) {
+				del_page_from_inactive_list(page);
+				add_page_to_active_list(page);
+			}
+			continue;
 		}
 
 		/*
@@ -521,6 +524,9 @@
 	}
 	spin_unlock(&pagemap_lru_lock);
 
+	if (max_mapped <= 0 && (nr_pages > 0 || priority < DEF_PRIORITY))
+		swap_out(priority, gfp_mask, classzone);
+
 	return nr_pages;
 }
> > The base VM in Linus's tree has been broken since before 2.4.0 and,
> > while somewhat better now, is still just that - broken. The major
> > vendors don't ship it for a reason.
>
> Why is this?
Linus kept ignoring Rik's patches and making other changes, then at 2.4.10
he switched to Andrea's VM and ignored most of the follow-up changes that
made that one work.
> Believe it or not, I'm not trying to start a flame war, just trying to
> understand the logic.
You've got me there. I don't understand either.
Alan
As an outside observer during that time, it looked like Linus was
rejecting patches from Rik (at least in part) because the reasoning
behind them wasn't fully explained, and it appeared to be deteriorating
into 'tweak these magic numbers to fix problem A, discover that that
caused problem B, tweak them again to fix B and cause C ... tweak them
again to fix J and cause A' type circles, with nobody (other than
possibly Rik) understanding what was really causing the problems (at
least if they did understand them, the explanations weren't posted here).

The fundamental problem was that while the VM would work well most of the
time, once in a while it would hit a pathological condition that would
lock up the machine completely. The new VM was seen as not necessarily
being quite as good in the best cases, but avoiding the worst lockups (of
course it had a few problems of its own, but these seemed to be easier to
fix without causing additional problems).
David Lang
> tweak them again to fix B and cause C ... tweak them again to fix J and
> cause A' type circles, with nobody (other than possibly Rik) understanding
> what was really causing the problems (at least if they did understand
> them, the explanations weren't posted here)
Plenty of people understood them. The continual changing was also not
helped by the fact that at least three totally contradictory sets of
patches were being applied (Rik's, the use-once stuff, and Linus's own
changes).
> The fundamental problem was that while the VM would work well most of the
> time, once in a while it would hit a pathological condition that would
> lock up the machine completely. The new VM was seen as not necessarily
> being quite as good in the best cases, but avoiding the worst lockups (of
> course it had a few problems of its own, but these seemed to be easier to
> fix without causing additional problems)
The original Andrea VM was faster on light loads, and even less stable on
anything sane. In the -aa patches it does quite well, but those aren't
merged either.
Alan
On Fri, Feb 01, 2002 at 12:11:45PM -0800, William Lee Irwin III wrote:
> On Fri, Feb 01, 2002 at 11:24:16AM -0800, Adam McKenna wrote:
> > As you can see, there are supposedly 5.3 gigs of memory free (not counting
> > memory used for cache). However, the box is swapping like mad (about 10 megs
> > every 2 seconds according to vmstat) and the load is skyrocketing.
>
> That 5.3GB is without kernel caches. I see 5.7MB...
And this is the problem. Caches should make the system behave better,
not get in its way ...
It is time for one of the approaches to be accepted into the current
"stable" mainline. I do not care much which one it is for 2.4.x. Both rmap
and -aa seem to fix most of the problems. Having one of them accepted
should make it easier to fix up the remaining pathological cases.
Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759