2006-12-04 01:54:51

by Aucoin

Subject: RE: la la la la ... swappiness


I should also have made it clear that under full load the OOM killer takes
out critical data-moving processes because of (what appears to be)
out-of-control memory consumption by the disk I/O cache related to the tar.

As a side note, even now, *hours* after the tar has completed, and even
though I have swappiness set to 0, cache pressure set to 9999, all dirty
timeouts set to 1 and all dirty ratios set to 1, I still have a 360+K
inactive page count and my "free" memory is less than 10% of normal. I'm
not pretending to understand what's happening here, but shouldn't some
kind of expiration have kicked in by now and freed up all those inactive
pages? The *instant* I manually push a "3" into drop_caches, I have 100%
of my normal free memory and the inactive page count drops below 2K.
Maybe I completely misunderstood the purpose of all those dials, but I
really did get the feeling that twisting them all tight would make the
housekeeping algorithms more aggressive.
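
For reference, these are the knobs as I have them (a sketch; I'm reading
"dirty timeouts" and "dirty ratios" as the 2.6 /proc/sys/vm entries below):

    echo 0    > /proc/sys/vm/swappiness                # prefer cache reclaim over swap
    echo 9999 > /proc/sys/vm/vfs_cache_pressure        # reclaim dentries/inodes hard
    echo 1    > /proc/sys/vm/dirty_expire_centisecs    # the "dirty timeouts"
    echo 1    > /proc/sys/vm/dirty_writeback_centisecs
    echo 1    > /proc/sys/vm/dirty_ratio               # the "dirty ratios"
    echo 1    > /proc/sys/vm/dirty_background_ratio

    sync                                               # then the manual step:
    echo 3 > /proc/sys/vm/drop_caches                  # frees everything instantly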

What, if anything, besides manually echoing a "3" to drop_caches will
cause all those inactive pages to be put back on the free list?

-----Original Message-----
From: Aucoin [mailto:[email protected]]
Sent: Sunday, December 03, 2006 5:57 PM
To: 'Tim Schmielau'
Cc: 'Andrew Morton'; '[email protected]'; '[email protected]';
'[email protected]'
Subject: RE: la la la la ... swappiness

We want it to swap less for this particular operation because it is low
priority compared to the rest of what's going on inside the box.

We've considered both artificially manipulating swap on the fly, similar
to your suggestion, as well as a parallel thread that pumps a 3 into
drop_caches every few seconds while the update is running, but these seem
too much like hacks for our liking. Mind you, if we don't have a choice
we'll do what we need to get the job done, but there's a nagging voice in
our conscience that says keep looking for a more elegant solution and work
*with* the kernel rather than working against it or trying to trick it
into doing what we want.

We've already disabled OOM so we can at least keep our testing alive while
searching for a more elegant solution. Although we want to avoid swap in
this particular instance for this particular reason, in our hearts we agree
with Andrew that swap can be your friend and get you out of a jam once in a
while. Even more, we'd like to leave OOM active if we can because we want to
be told when somebody's not being a good memory citizen.

Some background: what we've done is carve up a huge chunk of memory that is
shared between three resident processes as write cache for a proprietary
block system layout that is part of a scalable storage architecture
currently capable of RAID 0, 1, 5 (soon 6) virtualized across multiple
chassis, essentially treating each machine as a "disk" and providing
multipath I/O to multiple iSCSI targets as part of a grid/array storage
solution. Whew! We also have a version that leverages a battery-backed
write cache for higher performance at an additional cost. This software is
installable on any commodity platform with 4-N disks supported by Linux;
I've even put it on an Optiplex with 4 simulated disks. Yawn ... yet
another iSCSI storage solution, but this one scales linearly in capacity
as well as performance. As such, we have no user-level apps on the boxes
and precious little disk to spare for additional swap, so our version of
the swap manipulation solution is to turn swap completely off for the
duration of the update.
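
That part, at least, is simple enough (sketch; the update command is just
a stand-in):

    swapoff -a                      # no swap at all while the update runs
    tar xzf update.tar.gz           # ... untar and verify ...
    swapon -a                       # restore normal swap afterwards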

I hope I haven't muddied things up even more, but basically what we want
to do is find a way to limit the number of pages cached for disk I/O on
the OS filesystem, even if it drastically slows down the untar-and-verify
process, because the disk I/O we really care about is not on any of the
OS partitions.

Louis Aucoin

-----Original Message-----
From: Tim Schmielau [mailto:[email protected]]
Sent: Sunday, December 03, 2006 2:47 PM
To: Aucoin
Cc: 'Andrew Morton'; [email protected]; [email protected];
[email protected]
Subject: RE: la la la la ... swappiness

On Sun, 3 Dec 2006, Aucoin wrote:

> during tar extraction ... inactive pages reach levels as high as ~375000

So why do you want the system to swap _less_? You need to find some free
memory for the additional processes to run in, and you have lots of
inactive pages, so I think you want to swap out _more_ pages.

I'd suggest temporarily adding a swapfile before you update your system.
This can even help bring your memory use back to its previous state if
you do it like this:
- swapon additional swapfile
- update your database software
- swapoff swap partition
- swapon swap partition
- swapoff additional swapfile
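
In shell terms, roughly (the swapfile size and the partition name are
just examples):

    dd if=/dev/zero of=/tmp/swapfile bs=1M count=512
    mkswap /tmp/swapfile
    swapon /tmp/swapfile        # add the temporary swapfile
    # ... update your database software ...
    swapoff /dev/sda2           # empty the real swap partition
    swapon /dev/sda2            #   (its pages move to RAM or the swapfile)
    swapoff /tmp/swapfile       # force everything back out of the swapfile
    rm /tmp/swapfile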

Tim



2006-12-04 04:59:53

by Andrew Morton

Subject: Re: la la la la ... swappiness

On Sun, 3 Dec 2006 19:54:41 -0600
"Aucoin" <[email protected]> wrote:

> What, if anything, besides manually echoing a "3" to drop_caches will
> cause all those inactive pages to be put back on the free list?

There is no reason for the kernel to do that - a clean, inactive page is
immediately reclaimable on demand.

2006-12-04 07:22:33

by Kyle Moffett

Subject: Re: la la la la ... swappiness

On Dec 03, 2006, at 20:54:41, Aucoin wrote:
> As a side note, even now, *hours* after the tar has completed and
> even though I have swappiness set to 0, cache pressure set to 9999,
> all dirty timeouts set to 1 and all dirty ratios set to 1, I still
> have a 360+K inactive page count and my "free" memory is less than
> 10% of normal.

The point you're missing is that an "inactive" page is a free page
that happens to have known clean data on it corresponding to
something on disk. If you need to use the inactive page for
something all you have to do is either zero it or fill it with data
from elsewhere. There is _no_ practical reason for the kernel to
turn an "inactive" page into a "free" page. On my Linux systems
after heavy local-disk and network intensive read-only load I have no
more than 2% "free" memory, most of the rest is "inactive" (in one
case some 2GB of it). There's nothing _wrong_ with that much
"inactive" memory, it just means that you were using it for data at
one point, then didn't need it anymore and haven't reused it since.
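
You can watch this for yourself:

    # how much memory is "free" vs. merely "inactive" right now
    grep -E '^(MemFree|Active|Inactive)' /proc/meminfo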

> I'm not pretending to understand what's happening here but
> shouldn't some kind of expiration have kicked in by now and freed
> up all those inactive pages?

Nope; the pages will continue to contain valid data until you
overwrite them with new data somehow. Now, if they were "dirty"
pages, containing unwritten data, then you would be correct.

> The *instant* I manually push a "3" into drop_caches I have 100% of
> my normal free memory and the inactive page count drops below 2K.
> Maybe I completely misunderstood the purpose of all those dials but
> I really did get the feeling that twisting them all tight would
> make the housekeeping algorithms more aggressive.

In this case you're telling the kernel to go beyond its normal
housekeeping and delete perfectly good data from memory. Usually the
only reason to do that is to make benchmarks mildly more repeatable, and
doing it on a regular basis tends to kill performance.
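
For reference, the drop_caches interface (2.6.16 and later) takes three
values:

    sync                                  # drop_caches only discards *clean* pages
    echo 1 > /proc/sys/vm/drop_caches     # free pagecache
    echo 2 > /proc/sys/vm/drop_caches     # free dentries and inodes
    echo 3 > /proc/sys/vm/drop_caches     # free both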

Cheers,
Kyle Moffett

> [copy of long previous email snipped]

PS: No need to put a copy of the entire message you are replying to
at the end of your post, it just chews up space. If anything please
quote inline immediately before the appropriate portion of your reply
so we can get the gist, much as I have done above.


2006-12-04 14:39:33

by Aucoin

Subject: RE: la la la la ... swappiness

> PS: No need to put a copy of the entire message

Apologies for the lapse in protocol.

> The point you're missing is that an "inactive" page is a free
> page that happens to have known clean data on it

I understand now where the inactive page count is coming from.
What I don't understand is why there is no way for me to make the kernel
prefer to reclaim inactive pages before it resorts to swap.

> In this case you're telling the kernel to go beyond its
> normal housekeeping and delete perfectly good data from
> memory. The only reason to do that is usually to make

The definition of "perfectly good" here may be up for debate, or someone
can explain it to me. This perfectly good data was cached during the tar,
yet hours after the tar has completed the pages are still cached.


2006-12-04 16:09:24

by David Lang

Subject: Re: la la la la ... swappiness

I think that I am seeing two separate issues here that are getting mixed up.

1. while doing the tar + patch the system is choosing to use memory for
caching the tar (pushing program data out to swap).

2. after the tar has completed the data remains in the cache.

The answer to #2 is the one being stated in the response below, namely
that this shouldn't matter: the memory used for the inactive cache is
just as good as free memory (i.e. it can be used immediately for other
purposes with no swap needed), so the fact that it's inactive instead of
free doesn't matter.

However, the real problem that Aucoin is running into is #1, namely that
when the patching process (tar, etc.) kicks off, the system chooses to
use its RAM as a cache instead of leaving it in use for the processes
that are running. If he manually forces the system to drop its cache
(echoing 3 into drop_caches repeatedly during the run of the patch
process) he is able to keep this under control.

From the documentation on swappiness it seems like setting it to 0 would
do what he wants (tell the system not to swap out process memory to make
room for more cache), but he's reporting that this is not working as
expected.

This is the same type of problem that people run into with the nightly
updatedb run pushing inactive programs out of RAM, making the system
sluggish the next morning.

IIRC there is a flag that can be passed to open() that tells the system
the data is 'use once' and should not be cached. Is it possible to do
LD_PRELOAD tricks to force this parameter for all the programs that his
patch script is using?
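
Something like this might do it (a hypothetical, untested sketch; it
hooks close() and uses posix_fadvise(POSIX_FADV_DONTNEED) rather than an
open() flag, since forcing O_DIRECT from outside would break programs
that don't do aligned I/O):

    cat > dropbehind.c <<'EOF'
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int (*real_close)(int);

    /* flush, then drop cached pages for a file just before it is closed */
    int close(int fd)
    {
        if (!real_close)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
        fdatasync(fd);                  /* DONTNEED only drops clean pages */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        return real_close(fd);
    }
    EOF
    gcc -shared -fPIC -o dropbehind.so dropbehind.c -ldl
    LD_PRELOAD=$PWD/dropbehind.so tar xzf update.tar.gz   # example invocation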

David Lang

On Mon, 4 Dec 2006, Kyle Moffett wrote:

> [full copy of the previous message snipped]

2006-12-04 16:11:29

by Chris Friesen

Subject: Re: la la la la ... swappiness

Aucoin wrote:

> The definition of "perfectly good" here may be up for debate, or someone
> can explain it to me. This perfectly good data was cached during the tar,
> yet hours after the tar has completed the pages are still cached.

If nothing else has asked for that memory since the tar, there is no
reason to evict the pages from the cache. The inactive memory is
basically "free, but still contains the previous data".

If anything asks for memory, those pages will be filled with zeros or
the new information. In the meantime, the kernel keeps them in the
cache in case anyone wants the old information.

It doesn't hurt anything to keep the pages around with the old data in
them--and it might help.

Chris

2006-12-04 17:43:12

by Aucoin

Subject: RE: la la la la ... swappiness

> From: David Lang [mailto:[email protected]]
> I think that I am seeing two separate issues here that are getting mixed
> up.

Fair enough.

> however the real problem that Aucoin is running into is #1, namely that
> when the patching process (tar, etc.) kicks off, the system chooses to
> use its RAM as a cache

First name is Louis. Yes, but we haven't resorted to echoing a 3 into
drop_caches in a loop yet.

> from the documentation on swappiness it seems like setting it to 0 would
> do what he wants

That's what I thought, but some responses would seem to indicate that
these two "types" of memory are completely independent of each other and
swappiness has no impact on the type that is currently annoying me. It
just doesn't seem like a fair way to run a kernel when you have a dial to
control swappiness but then there's this rogue memory consumption that
lives outside the control of the swappiness dial and you end up swapping
anyway.

> this is the same type of problem that people run into with the nightly
> updatedb

I would imagine so, yes. But take that example and, instead of programs
going inactive overnight, substitute programs that go inactive for only a
few seconds ... swap thrash, oom-killer, game over.

> IIRC there is a flag that can be passed to open() that tells the system

I'll check into it.