2004-04-28 21:34:14

by Brett E.

Subject: ~500 megs cached yet 2.6.5 goes into swap hell

06:18:52 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
06:18:53 PM 55332 1238644 95.72 14660 497888 450740 79364 14.97 9692
06:18:54 PM 55268 1238708 95.73 14660 497888 450740 79364 14.97 9692
06:18:55 PM 40060 1253916 96.90 14860 512920 450740 79364 14.97 9692
06:18:57 PM 6120 1287856 99.53 15340 546644 450740 79364 14.97 9692
06:18:59 PM 6632 1287344 99.49 15864 550880 450740 79364 14.97 9692
06:19:00 PM 6440 1287536 99.50 16020 552628 450740 79364 14.97 9692
06:19:02 PM 7648 1286328 99.41 15980 548452 450740 79364 14.97 9692
06:19:03 PM 6504 1287472 99.50 16008 548832 450740 79364 14.97 9692
06:19:04 PM 7592 1286384 99.41 15980 530160 450740 79364 14.97 9692
06:19:05 PM 6192 1287784 99.52 15716 499008 450740 79364 14.97 9692
06:19:06 PM 6544 1287432 99.49 15732 494640 450740 79364 14.97 9692
06:19:07 PM 7104 1286872 99.45 15768 488756 450740 79364 14.97 9692
06:19:08 PM 7592 1286384 99.41 15844 488680 450740 79364 14.97 9692
06:19:10 PM 7416 1286560 99.43 15936 479136 450740 79364 14.97 9692
06:19:13 PM 7024 1286952 99.46 15912 467808 450744 79360 14.97 9688
06:19:14 PM 7096 1286880 99.45 15664 427736 450744 79360 14.97 9684
06:19:15 PM 7240 1286736 99.44 15604 415692 450744 79360 14.97 9684
06:19:16 PM 6712 1287264 99.48 15616 414524 450744 79360 14.97 9684
06:19:18 PM 6200 1287776 99.52 15652 409660 450744 79360 14.97 9684
06:19:19 PM 10600 1283376 99.18 15724 407004 450744 79360 14.97 9684


06:18:52 PM pgpgin/s pgpgout/s fault/s majflt/s
06:18:53 PM 0.00 712.00 1236.00 0.00
06:18:54 PM 12.12 8.08 1067.68 0.00
06:18:55 PM 7497.03 11.88 2844.55 0.00
06:18:57 PM 10626.00 310.00 1422.50 0.00
06:18:59 PM 11758.00 196.00 346.50 0.00
06:19:00 PM 7828.00 608.00 136.00 0.00
06:19:02 PM 145.27 1136.32 1108.96 0.00
06:19:03 PM 905.05 13822.22 663.64 0.00
06:19:04 PM 689.11 2384.16 9437.62 0.00
06:19:05 PM 499.01 9572.28 13467.33 0.00
06:19:06 PM 3444.00 1340.00 1825.00 0.00
06:19:07 PM 7720.00 2032.00 3034.00 0.00
06:19:08 PM 5420.00 1304.00 688.00 0.00
06:19:10 PM 4045.77 4304.48 2188.56 0.00
06:19:13 PM 1079.07 5528.68 2046.90 0.00
06:19:14 PM 696.00 920.00 15650.00 0.00
06:19:15 PM 1478.79 1187.88 5046.46 0.00
06:19:16 PM 1000.00 2752.94 539.22 0.00


meminfo:

MemTotal: 1293976 kB
MemFree: 8320 kB
Buffers: 13396 kB
Cached: 436428 kB
SwapCached: 9516 kB
Active: 810472 kB
Inactive: 346816 kB
HighTotal: 393216 kB
HighFree: 1152 kB
LowTotal: 900760 kB
LowFree: 7168 kB
SwapTotal: 530104 kB
SwapFree: 450796 kB
Dirty: 33704 kB
Writeback: 10268 kB
Mapped: 710732 kB
Slab: 115240 kB
Committed_AS: 942592 kB
PageTables: 4612 kB
VmallocTotal: 114680 kB
VmallocUsed: 560 kB
VmallocChunk: 114120 kB



slabinfo - version: 2.0
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
rpc_buffers 8 8 2048 2 1 : tunables 24 12 8 : slabdata 4 4 0
rpc_tasks 8 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0
rpc_inode_cache 12 14 512 7 1 : tunables 54 27 8 : slabdata 2 2 0
unix_sock 192 203 512 7 1 : tunables 54 27 8 : slabdata 29 29 0
ip_conntrack 9926 14860 384 10 1 : tunables 54 27 8 : slabdata 1486 1486 216
tcp_tw_bucket 2028 6450 128 30 1 : tunables 120 60 8 : slabdata 215 215 384
tcp_bind_bucket 207 800 16 200 1 : tunables 120 60 8 : slabdata 4 4 16
tcp_open_request 113 290 64 58 1 : tunables 120 60 8 : slabdata 5 5 3
inet_peer_cache 2 58 64 58 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 18 200 16 200 1 : tunables 120 60 8 : slabdata 1 1 0
ip_dst_cache 23046 23145 256 15 1 : tunables 120 60 8 : slabdata 1543 1543 0
arp_cache 11 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
raw4_sock 0 0 512 7 1 : tunables 54 27 8 : slabdata 0 0 0
udp_sock 10 21 512 7 1 : tunables 54 27 8 : slabdata 3 3 0
tcp_sock 248 408 1024 4 1 : tunables 54 27 8 : slabdata 102 102 0
flow_cache 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
udf_inode_cache 0 0 512 7 1 : tunables 54 27 8 : slabdata 0 0 0
nfs_write_data 36 42 512 7 1 : tunables 54 27 8 : slabdata 6 6 0
nfs_read_data 32 35 512 7 1 : tunables 54 27 8 : slabdata 5 5 0
nfs_inode_cache 15 24 640 6 1 : tunables 54 27 8 : slabdata 4 4 0
nfs_page 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
isofs_inode_cache 0 0 384 10 1 : tunables 54 27 8 : slabdata 0 0 0
fat_inode_cache 0 0 512 7 1 : tunables 54 27 8 : slabdata 0 0 0
ext2_inode_cache 7294 7294 512 7 1 : tunables 54 27 8 : slabdata 1042 1042 0
journal_handle 0 0 28 123 1 : tunables 120 60 8 : slabdata 0 0 0
journal_head 0 0 48 77 1 : tunables 120 60 8 : slabdata 0 0 0
revoke_table 0 0 12 250 1 : tunables 120 60 8 : slabdata 0 0 0
revoke_record 0 0 16 200 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 0 0 512 7 1 : tunables 54 27 8 : slabdata 0 0 0
ext3_xattr 0 0 48 77 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_pwq 0 0 36 99 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
kioctx 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
kiocb 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
dnotify_cache 0 0 20 166 1 : tunables 120 60 8 : slabdata 0 0 0
file_lock_cache 9 40 96 40 1 : tunables 120 60 8 : slabdata 1 1 0
fasync_cache 0 0 16 200 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 3 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0
posix_timers_cache 0 0 88 43 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 5 112 32 112 1 : tunables 120 60 8 : slabdata 1 1 0
sgpool-128 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0
sgpool-64 32 32 1024 4 1 : tunables 54 27 8 : slabdata 8 8 0
sgpool-32 32 32 512 8 1 : tunables 54 27 8 : slabdata 4 4 0
sgpool-16 32 45 256 15 1 : tunables 120 60 8 : slabdata 3 3 0
sgpool-8 32 60 128 30 1 : tunables 120 60 8 : slabdata 2 2 0
deadline_drq 0 0 52 71 1 : tunables 120 60 8 : slabdata 0 0 0
as_arq 296 348 64 58 1 : tunables 120 60 8 : slabdata 6 6 60
blkdev_requests 312 312 160 24 1 : tunables 120 60 8 : slabdata 13 13 60
biovec-BIO_MAX_PAGES 256 256 3072 2 2 : tunables 24 12 8 : slabdata 128 128 0
biovec-128 256 260 1536 5 2 : tunables 24 12 8 : slabdata 52 52 0
biovec-64 629 640 768 5 1 : tunables 54 27 8 : slabdata 128 128 38
biovec-16 315 315 256 15 1 : tunables 120 60 8 : slabdata 21 21 0
biovec-4 348 348 64 58 1 : tunables 120 60 8 : slabdata 6 6 0
biovec-1 520 600 16 200 1 : tunables 120 60 8 : slabdata 3 3 60
bio 870 870 64 58 1 : tunables 120 60 8 : slabdata 15 15 180
sock_inode_cache 573 910 512 7 1 : tunables 54 27 8 : slabdata 130 130 0
skbuff_head_cache 296 870 256 15 1 : tunables 120 60 8 : slabdata 58 58 30
sock 4 10 384 10 1 : tunables 54 27 8 : slabdata 1 1 0
proc_inode_cache 1417 1530 384 10 1 : tunables 54 27 8 : slabdata 153 153 0
sigqueue 130 130 144 26 1 : tunables 120 60 8 : slabdata 5 5 0
radix_tree_node 7117 8955 260 15 1 : tunables 54 27 8 : slabdata 597 597 189
bdev_cache 6 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0
mnt_cache 20 58 64 58 1 : tunables 120 60 8 : slabdata 1 1 0
inode_cache 566 580 384 10 1 : tunables 54 27 8 : slabdata 58 58 0
dentry_cache 167775 176055 256 15 1 : tunables 120 60 8 : slabdata 11737 11737 0
filp 2057 2790 256 15 1 : tunables 120 60 8 : slabdata 186 186 0
names_cache 25 25 4096 1 1 : tunables 24 12 8 : slabdata 25 25 0
idr_layer_cache 3 28 136 28 1 : tunables 120 60 8 : slabdata 1 1 0
buffer_head 35463 50481 52 71 1 : tunables 120 60 8 : slabdata 711 711 0
mm_struct 331 360 640 6 1 : tunables 54 27 8 : slabdata 60 60 0
vm_area_struct 10667 12586 64 58 1 : tunables 120 60 8 : slabdata 217 217 0
fs_cache 331 464 64 58 1 : tunables 120 60 8 : slabdata 8 8 0
files_cache 346 371 512 7 1 : tunables 54 27 8 : slabdata 53 53 0
signal_cache 447 696 64 58 1 : tunables 120 60 8 : slabdata 12 12 0
sighand_cache 345 380 1408 5 2 : tunables 24 12 8 : slabdata 76 76 0
task_struct 434 450 1456 5 2 : tunables 24 12 8 : slabdata 90 90 0
pte_chain 139628 145500 128 30 1 : tunables 120 60 8 : slabdata 4850 4850 0
pgd 330 330 4096 1 1 : tunables 24 12 8 : slabdata 330 330 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-65536 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-32768 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-16384 1 1 16384 1 4 : tunables 8 4 0 : slabdata 1 1 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-8192 446 446 8192 1 2 : tunables 8 4 0 : slabdata 446 446 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 65 66 4096 1 1 : tunables 24 12 8 : slabdata 65 66 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 245 294 2048 2 1 : tunables 24 12 8 : slabdata 147 147 4
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 109 128 1024 4 1 : tunables 54 27 8 : slabdata 32 32 0
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 268 488 512 8 1 : tunables 54 27 8 : slabdata 61 61 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 424 465 256 15 1 : tunables 120 60 8 : slabdata 31 31 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 2387 3090 128 30 1 : tunables 120 60 8 : slabdata 103 103 0
size-64(DMA) 0 0 64 58 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 334 406 64 58 1 : tunables 120 60 8 : slabdata 7 7 0
size-32(DMA) 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
size-32 744 784 32 112 1 : tunables 120 60 8 : slabdata 7 7 0
kmem_cache 104 104 148 26 1 : tunables 120 60 8 : slabdata 4 4 0


Attachments:
attach.1 (14.82 kB)

2004-04-28 23:58:42

by Andrew Morton

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

"Brett E." <[email protected]> wrote:
>
> I attached sar, slabinfo and /proc/meminfo data on the 2.6.5 machine. I
> reproduce this behavior by simply untarring a 260meg file on a
> production server; the machine becomes sluggish as it swaps to disk.

I see no swapout from the info which you sent.

A `vmstat 1' trace would be more useful.

> Is there a way to limit the cache so this machine, which has 1 gigabyte of
> memory, doesn't dip into swap?

Decrease /proc/sys/vm/swappiness?

Swapout is good. It frees up unused memory. I run my desktop machines at
swappiness=100.
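
For reference: swappiness is a runtime tunable between 0 and 100, set by
writing to /proc/sys/vm/swappiness ("echo 0 > /proc/sys/vm/swappiness" as
root). A minimal, illustrative C sketch of the same write, assuming root
privileges; this is not from the original mail:

#include <stdio.h>

int main(void)
{
    /* 0 = prefer reclaiming page cache; 100 = swap anon pages out freely */
    FILE *f = fopen("/proc/sys/vm/swappiness", "w");

    if (!f) {
        perror("/proc/sys/vm/swappiness");  /* typically needs root */
        return 1;
    }
    fprintf(f, "0\n");
    fclose(f);
    return 0;
}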

2004-04-29 00:04:21

by Brett E.

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Brett E. wrote:

> Same thing happens on 2.4.18.
>
> I attached sar, slabinfo and /proc/meminfo data on the 2.6.5 machine. I
> reproduce this behavior by simply untarring a 260meg file on a
> production server; the machine becomes sluggish as it swaps to disk. Is
> there a way to limit the cache so this machine, which has 1 gigabyte of
> memory, doesn't dip into swap?
>
> Thanks,
>
> Brett
>

I created a hack which allocates memory, causing the cache to shrink, then
exits, freeing up the malloc'ed memory. This brings free memory up by
400 megs and brings the cache down to close to 0; of course, the cache
grows right back afterwards. It would be nice to cap the cache data
structures in the kernel, but I've been posting about this since September
to no avail, so my expectations are pretty low.

Here's the code:

#include <stdlib.h>

#define ALLOC_SIZE (1024*1024)
#define NUM_ALLOC 400

int main(void) {
    char *ptr;
    int i, j;

    for (i = 0; i < NUM_ALLOC; i++) {
        ptr = malloc(ALLOC_SIZE);
        if (!ptr)
            break;
        /* touch the allocation every 512 bytes so every page is
           actually committed, forcing the kernel to reclaim cache */
        for (j = 0; j < ALLOC_SIZE; j += 512)
            ptr[j] = 0;
    }

    return 0;  /* exiting frees all of it at once */
}


...
Maybe I can make it a hack of all hacks and have it parse out "Cached"
from /proc/meminfo and allocate that many bytes.
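
A minimal sketch of that meminfo-parsing variant, assuming the "Cached:"
line format shown in the dump above (illustrative, not from the original
mail):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char line[128];
    long cached_kb = -1;
    size_t bytes, j;
    char *p;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f) {
        perror("/proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "Cached: %ld kB", &cached_kb) == 1)
            break;
    fclose(f);
    if (cached_kb < 0)
        return 1;

    /* allocate and touch as many bytes as are currently in page cache */
    bytes = (size_t)cached_kb * 1024;
    p = malloc(bytes);
    if (!p)
        return 1;
    for (j = 0; j < bytes; j += 4096)    /* dirty one byte per page */
        p[j] = 0;
    return 0;    /* exiting frees the whole allocation */
}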


2004-04-29 00:10:31

by Jeff Garzik

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton wrote:
> Swapout is good. It frees up unused memory. I run my desktop machines at
> swappiness=100.


The definition of "unused" is quite subjective and app-dependent...

I've seen reports with increasing frequency about the swappiness of the
2.6.x kernels, from people who were already annoyed at the swappiness of
2.4.x kernels :)

Favorite pathological (and quite common) examples are the various 4am
cron jobs that scan your entire filesystem. Running that process
overnight on a quiet machine practically guarantees a huge burst of
disk activity, with unwanted results:
1) Inode and page caches are blown away
2) A lot of your desktop apps are swapped out

Additionally, a (IMO valid) maxim of sysadmins has been "a properly
configured server doesn't swap". There should be no reason why this
maxim becomes invalid over time. When Linux starts to swap out apps the
sysadmin knows will be useful in an hour, or six hours, or a day just
because it needs a bit more file cache, I get worried.

There IMO should be some way to balance the amount of anon-vma's such
that the sysadmin can say "stop taking 70% of my box's memory for
disposable cache, use it instead for apps you would otherwise swap out,
you memory-hungry kernel you."

Jeff



2004-04-29 00:13:59

by Jeff Garzik

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

#define MEGS 140
#define MEG (1024 * 1024)

int main(int argc, char *argv[])
{
    void **data;
    int i, r;
    size_t megs = MEGS;

    if ((argc >= 2) && (atoi(argv[1]) > 0))
        megs = atoi(argv[1]);

    /* one pointer per megabyte to be allocated */
    data = malloc(megs * sizeof(void *));
    if (!data)
        abort();

    memset(data, 0, megs * sizeof(void *));

    srand(time(NULL));

    /* first pass: allocate and dirty every megabyte */
    for (i = 0; i < megs; i++) {
        data[i] = malloc(MEG);
        if (!data[i])
            abort();
        memset(data[i], i, MEG);
        printf("malloc/memset %03d/%03lu\n", i + 1, (unsigned long)megs);
    }
    /* second pass, in reverse order: dirty everything again so the
       pages do not look like use-once memory */
    for (i = megs - 1; i >= 0; i--) {
        r = rand() % 200;
        memset(data[i], r, MEG);
        printf("memset #2 %03d/%03lu = %d\n", i + 1, (unsigned long)megs, r);
    }
    printf("done\n");
    return 0;
}


Attachments:
fillmem.c (707.00 B)

2004-04-29 00:35:30

by Nick Piggin

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Jeff Garzik wrote:
> Andrew Morton wrote:
>
>> Swapout is good. It frees up unused memory. I run my desktop
>> machines at
>> swappiness=100.
>
>
>
> The definition of "unused" is quite subjective and app-dependent...
>
> I've seen reports with increasing frequency about the swappiness of the
> 2.6.x kernels, from people who were already annoyed at the swappiness of
> 2.4.x kernels :)
>
> Favorite pathological (and quite common) examples are the various 4am
> cron jobs that scan your entire filesystem. Running that process
> overnight on a quiet machine practically guarantees a huge burst of
> disk activity, with unwanted results:
> 1) Inode and page caches are blown away
> 2) A lot of your desktop apps are swapped out
>
> Additionally, a (IMO valid) maxim of sysadmins has been "a properly
> configured server doesn't swap". There should be no reason why this
> maxim becomes invalid over time. When Linux starts to swap out apps the
> sysadmin knows will be useful in an hour, or six hours, or a day just
> because it needs a bit more file cache, I get worried.
>

I don't know. What if you have some huge application that only
runs once per day for 10 minutes? Do you want it to be consuming
100MB of your memory for the other 23 hours and 50 minutes for
no good reason?

Anyway, I have a small set of VM patches which attempt to improve
this sort of behaviour if anyone is brave enough to try them.
Against -mm kernels only I'm afraid (the objrmap work causes some
porting difficulty).

2004-04-29 00:43:30

by Nick Piggin

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Jeff Garzik wrote:
> Brett E. wrote:
>
>> exits, freeing up the malloc'ed memory. This brings free memory up by
>> 400 megs and brings the cache down to close to 0, of course the cache
>
>
> Yeah, I have something similar (attached). Run it like
>
> fillmem <number-of-megabytes>
>
>
>> grows right afterwards. It would be nice to cap the cache
>> datastructures in the kernel but I've been posting about this since
>> September to no avail so my expectations are pretty low.
>
>
> This is a frequent request... although I disagree with a hard cap on
> the cache, I think the request (and similar ones) should hopefully
> indicate to the VM gurus that the kernel likes cache better than anon
> VMAs that must be swapped out.
>

For 2.6.6-rc2-mm2:
http://www.kerneltrap.org/~npiggin/vm-rollup.patch.gz

/proc/sys/vm/mapped_page_cost - indicate which *you* like better ;)

2004-04-29 00:45:05

by Wakko Warner

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

> I don't know. What if you have some huge application that only
> runs once per day for 10 minutes? Do you want it to be consuming
> 100MB of your memory for the other 23 hours and 50 minutes for
> no good reason?

I keep soffice open all the time. The box in question has 512MB of RAM.
This is one app that, even though I use it infrequently, I would prefer
never be swapped out. Mainly, when I want to use it, I *WANT* it now (i.e. not
waiting for it to come back from swap).

This is just my opinion. I personally feel that cache should use available
memory, not already used memory (swapping apps out for more cache).

--
Lab tests show that use of micro$oft causes cancer in lab animals

2004-04-29 00:44:58

by Brett E.

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

swappiness of 0:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 11 168396 9260 16032 508496 1 2 84 120 142 132 37 7 48 8
0 8 168396 6636 16056 510580 0 0 1320 76 1334 1639 13 3 0 84
0 9 168396 6432 16068 510092 0 0 836 60 1242 1124 13 2 0 85
17 9 168396 7200 16084 508580 0 0 1248 148 1318 1351 11 3 7 80
0 10 168396 14488 16104 507064 0 0 1364 904 1488 1977 20 4 0 76
8 8 168396 11992 16116 508752 0 0 1124 88 1304 1345 11 3 0 86
16 8 168396 10008 16140 510768 0 0 1392 172 1434 1970 21 4 0 74
3 11 168396 13592 16152 512524 0 0 1072 1364 1625 2544 32 6 2 60
0 9 168396 6560 16252 519632 0 0 5644 380 1431 2073 19 6 0 76
0 10 168396 6576 16320 519564 0 0 3840 1208 1259 1013 9 4 0 88
0 6 168396 7040 16420 519260 0 0 5408 356 1311 1281 11 4 0 85
0 10 168396 8640 16432 517616 0 0 2020 116 1496 2268 26 6 0 68
0 10 168396 8384 16528 516704 0 0 4972 4124 1526 2278 30 8 4 60
0 7 168396 7744 16528 517248 0 0 552 16 1267 1012 8 2 1 89
0 7 168396 6528 16532 517788 0 0 304 12024 1175 174 1 1 0 98
0 8 168396 7488 16552 515728 0 0 1408 7376 1111 173 1 1 0 98
12 8 168396 8824 16556 514024 0 0 1956 2724 1301 1582 19 5 0 76
12 14 168396 6944 16504 492860 0 0 1524 0 1637 2458 71 12 0 17
0 7 168396 7072 16596 491272 0 0 5624 0 1296 1168 14 5 0 82
0 7 168396 6936 16708 488712 0 0 7520 1496 1287 1737 17 7 0 76
0 6 168396 6328 16756 488324 0 0 4712 2000 1234 786 7 4 0 89
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 12 168396 6448 13400 485600 96 0 8896 29480 8401 6544 11 3 0 86
4 13 168396 10016 12760 487056 0 0 1600 0 1475 2089 21 5 0 74
10 7 168396 6944 12764 486304 0 0 1620 1028 1509 2366 37 6 0 57
0 11 168396 6760 12780 493088 0 0 3880 2240 1455 2034 22 5 0 72
0 6 168396 6240 12796 493480 0 0 8064 2032 1248 1167 9 5 0 87
0 8 168396 7200 12820 492300 0 0 7336 2416 1387 1738 21 7 0 73
1 11 168396 7968 12820 491144 0 0 3628 784 1551 2490 27 8 0 65
0 4 168396 6544 12844 499144 0 0 8948 584 1318 1310 12 6 0 83
5 7 168396 8856 12856 496820 0 0 4536 5792 1288 1336 11 4 0 85
0 6 168396 7640 12856 498112 0 0 920 1352 1180 545 4 1 0 96
0 6 168396 7512 12860 497836 0 0 2053 10676 1152 133 0 1 0 98
0 6 168396 8056 12864 496676 0 0 1664 0 1079 138 1 1 0 99
1 7 168396 6400 12896 495964 0 0 7665 1649 1171 465 4 4 0 93
0 8 168396 7920 12908 495340 0 0 956 672 1410 1851 22 6 0 73
8 13 168396 6936 12908 477456 0 0 684 3299 1643 2473 54 11 0 34
0 15 168396 6584 12912 470516 0 0 832 1340 1376 1703 32 5 0 63
0 18 168396 11184 12912 471128 0 0 620 0 1541 2548 35 6 0 60
0 8 168396 11084 12912 472012 0 0 904 3384 1459 1771 29 6 0 65
1 9 168396 6844 12924 473088 0 0 1064 4 1390 1694 24 3 0 73
1 12 168396 7248 12932 469408 0 0 960 132 1562 2412 43 6 0 51
5 9 168396 13228 12932 468996 4 0 1160 128 1628 2732 38 7 0 55
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 11 168396 11628 12932 469948 0 0 880 1388 1619 2407 29 6 0 66
0 9 168396 16500 12936 471372 0 0 1068 2200 1385 1464 14 3 0 84
1 4 168396 14132 12964 473724 0 0 1712 5432 1516 2168 31 5 0 65
5 12 168396 19000 12972 475824 0 0 1360 160 1468 1918 30 5 0 66
0 9 168396 16720 12976 477724 0 0 1376 128 1599 2943 36 7 0 56
0 8 168396 20260 12976 480920 0 0 1884 224 1785 2982 38 8 0 54
5 12 168396 7368 13000 493136 0 0 10312 648 1372 1743 18 8 1 74
1 12 168396 7480 13008 492312 0 0 2848 1628 1532 2428 26 7 0 67
0 11 168396 7992 13016 498696 0 0 3484 3368 1546 2521 28 7 0 64
1 14 168396 6760 13028 499772 0 20 5536 812 1534 2733 32 7 0 62
0 17 168396 6120 13048 507504 0 0 7944 6596 1539 2217 26 8 0 66
0 5 168396 6424 13048 507164 0 0 564 4352 1229 332 4 1 0 95
0 14 168396 7128 13048 506552 0 0 1576 3280 1374 2009 22 4 0 74
0 10 168396 12536 13048 508184 0 0 1008 0 1283 1488 19 3 0 79
8 10 168392 10104 13056 510256 32 0 1460 192 1628 2668 30 8 0 62
1 6 168392 14904 13076 512960 64 0 1736 0 1696 3413 43 11 0 46
3 4 168392 19384 13088 515804 0 0 1708 0 1483 2337 27 6 0 67
7 5 168392 14328 13096 520624 0 0 2760 280 1659 2531 36 6 0 58
0 5 168392 16124 13108 527004 0 0 3416 276 1319 1308 15 6 0 80
0 5 168392 9724 13124 533448 0 0 3284 3840 1231 739 10 2 0 88
18 9 168392 9532 13124 533516 0 0 64 3108 1314 473 6 2 0 92
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 14 168392 8636 13124 534264 0 0 808 0 1511 2083 24 5 0 72
1 10 168392 8156 13124 535012 0 0 716 6360 1459 2052 21 4 1 76
0 5 168392 7388 13124 535624 0 0 628 2892 1449 1890 23 4 9 64
14 4 168392 14204 13128 536576 64 0 828 2056 1534 2161 31 5 3 61
0 4 168392 13820 13132 536980 0 0 356 3821 1490 2082 29 7 2 61
1 2 168392 20700 13136 537656 0 0 636 1568 1589 2516 31 6 5 58
1 6 168392 19708 13144 538600 0 0 1024 2732 1558 2504 39 7 6 49
6 3 168392 26688 13144 539144 0 0 520 5372 1624 2589 40 7 0 53
0 4 168392 34036 13148 539820 0 0 692 1568 1602 3144 37 7 6 50
2 3 168392 33300 13152 540564 0 0 732 5212 1624 2892 30 5 16 48
0 3 168392 40276 13160 541508 0 0 960 4188 1655 2542 37 7 6 52
1 2 168392 40340 13160 542052 0 0 464 2056 1719 3330 42 8 2 49
3 2 168392 47252 13164 542460 64 0 572 260 1728 3321 43 8 6 42
5 4 168392 54880 13168 542936 64 0 484 232 1725 3352 51 8 9 31




MemTotal: 1293976 kB
MemFree: 77964 kB
Buffers: 15568 kB
Cached: 525740 kB
SwapCached: 38596 kB
Active: 677728 kB
Inactive: 442556 kB
HighTotal: 393216 kB
HighFree: 768 kB
LowTotal: 900760 kB
LowFree: 77196 kB
SwapTotal: 530104 kB
SwapFree: 365860 kB
Dirty: 50036 kB
Writeback: 14036 kB
Mapped: 570488 kB
Slab: 82860 kB
Committed_AS: 853200 kB
PageTables: 4400 kB
VmallocTotal: 114680 kB
VmallocUsed: 560 kB
VmallocChunk: 114120 kB

swappiness of 100:


procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 168304 198344 13736 463148 1 2 85 121 144 137 37 7 48 9
1 0 168304 198200 13736 463216 0 0 44 372 1830 3379 48 8 28 17
0 0 168304 198120 13736 463284 0 0 32 8 1680 3064 41 7 51 1
2 1 168304 145512 14952 514768 0 0 68 864 1586 2377 33 25 34 7
0 4 168304 135336 15176 524744 0 0 248 1092 1410 657 10 6 0 85
0 5 168304 135144 15176 524744 0 0 0 956 1315 178 1 1 0 98
0 5 168304 133544 15188 525616 0 0 412 10604 1267 795 14 3 0 84
0 4 168304 112112 15212 527564 0 0 1012 4768 1630 2875 74 12 0 14
0 4 168304 84432 15504 545836 0 0 6120 12484 1627 2396 54 15 0 32
0 6 168304 57496 16040 568352 0 0 180 32512 2432 1297 16 7 0 77
1 5 168304 36696 16452 585688 0 0 8 8120 1338 259 7 8 0 84
10 6 168304 23752 16712 595220 0 0 64 13304 1317 1029 19 7 0 73
8 8 168304 7248 16936 589440 0 0 284 5380 1607 2234 63 15 0 22
0 17 168304 6192 16944 587324 0 0 932 5328 1381 1636 17 4 0 80
0 15 168304 7400 16940 581888 0 0 600 752 1389 1500 23 4 0 74
3 11 168304 7400 16944 581136 0 0 688 16 1289 877 7 1 0 92
5 23 168304 7304 16944 578892 0 0 888 606 1287 1189 9 3 0 89
2 10 168304 7784 16940 571756 0 0 780 1136 1436 1855 30 4 0 67
0 7 168304 6184 16940 568016 0 0 816 104 1458 1757 23 5 1 71
0 7 168304 6752 16940 556184 0 0 732 188 1441 1726 34 7 0 59
0 11 168304 6488 16940 555368 0 0 652 344 1285 1361 11 2 0 87
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 15 168304 6640 16996 553952 0 0 3316 960 1357 1578 11 4 0 85
1 15 168304 6512 17072 551876 0 0 6064 3804 1289 1132 14 5 0 82
1 14 168304 6896 17212 542792 0 0 6032 3176 1333 1496 26 8 0 66
0 11 168296 6504 17220 527728 0 0 3508 4204 1323 1326 38 8 0 54
1 7 168296 7376 17292 526992 0 0 3476 1120 1236 909 8 3 0 88
1 12 168292 6992 17388 520728 0 0 4268 4776 1354 1409 25 5 0 70
0 13 168292 6544 17388 521068 0 0 424 48 1213 914 6 1 0 93
2 16 168292 6536 17388 519028 0 0 756 4 1308 1582 18 3 0 78
0 7 168292 6456 17388 516784 0 0 668 5016 1308 1289 13 2 0 85
0 7 168292 11168 17388 516036 0 0 644 112 1340 1366 14 3 0 84
0 13 168292 8992 17392 516712 0 0 696 2068 1358 1713 17 3 0 80
5 11 168292 6896 17392 513176 0 0 616 92 1431 1792 35 5 0 61
0 6 168292 7280 17416 512488 0 0 1212 2396 1431 2187 20 3 0 77
0 10 168292 7216 17436 512260 0 0 1472 44 1372 1647 14 4 1 82
0 9 168292 12576 17444 512580 0 0 1044 84 1252 1067 10 2 0 87
0 11 168292 10912 17464 514192 0 0 1136 2640 1304 1353 12 3 0 86
1 4 168292 9312 17472 515816 0 0 1144 720 1244 1212 10 2 0 89
1 9 168292 13744 17488 517976 0 0 1276 96 1417 1750 15 4 0 82
0 11 168292 11760 17508 519656 0 0 1228 1732 1362 2056 17 4 0 79
0 9 168292 9712 17524 521204 0 0 1116 832 1338 1286 12 3 0 86
1 10 168292 7072 17644 524056 0 0 5600 160 1514 2443 24 8 0 68
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 10 168292 8036 17672 530312 0 0 6048 184 1319 1684 16 5 0 79
1 10 168292 7392 17648 531064 0 0 7080 132 1340 1502 11 5 0 83
1 13 168292 6840 17644 531620 0 0 6372 360 1379 1858 15 6 0 80
1 7 168292 7288 17652 528416 0 0 924 5 1317 690 11 2 0 87
0 5 168292 6648 17664 528472 0 0 136 9961 1293 279 1 1 0 98
0 6 168292 8056 17664 526024 0 0 4 6297 1154 107 0 1 0 99
0 14 168292 7416 17680 528388 0 0 1480 2327 1476 2175 24 6 0 70
2 10 168292 6576 17604 504120 0 0 884 3724 1349 1494 56 9 0 35
2 7 168292 10016 17604 504052 0 0 920 1052 1358 1564 25 4 0 71
1 8 168292 12064 17608 506564 0 0 1604 1108 1397 1779 17 4 0 79
0 12 168292 6656 17604 509764 0 0 3904 2488 1413 1661 19 5 0 76
1 15 168292 6848 17516 505296 0 0 5196 2052 1445 2156 29 6 0 65
0 11 168292 8256 17520 507672 0 0 4284 780 1300 1566 19 5 1 75
1 13 168292 7792 17504 508572 0 0 6808 2303 1364 1465 13 5 0 83
0 7 168292 6896 17516 508780 0 0 4516 3256 1315 1473 12 4 0 84
0 6 168292 8160 17524 507604 0 8 2373 448 1196 558 3 2 0 96
0 4 168292 7904 17532 507664 0 0 52 10924 1234 244 3 1 0 96
1 19 168292 6640 17532 508140 0 0 1740 7580 1250 1175 13 4 0 83
0 11 168292 6704 17532 508344 0 0 1104 12 1258 1166 11 2 0 87
1 6 168292 6256 17536 502296 0 0 1724 404 1489 1894 34 7 0 58
0 10 168292 8552 17536 496788 0 0 1824 1548 1561 2694 38 8 3 52
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 5 168292 12036 17544 499432 0 0 1724 148 1515 2175 28 6 0 66
7 6 168292 6380 17544 501608 0 0 1412 128 1291 1381 22 3 0 74
9 8 168292 11580 17540 502428 0 0 1540 148 1436 1863 22 6 1 71
0 14 168292 9276 17540 504264 0 0 1228 660 1616 2405 30 6 0 65
0 10 168292 6280 17556 507220 0 0 7260 1948 1355 1679 16 6 0 78
0 12 168292 7176 17564 506164 0 0 7224 2040 1294 1406 10 6 0 84
0 10 168292 6984 17572 506168 0 0 6608 2340 1316 1511 11 5 0 84
0 17 168292 6108 17584 506836 0 8 6928 4196 1386 1543 13 5 0 82
0 9 168292 6712 17584 506020 0 0 592 12 1188 694 7 1 0 91
0 6 168292 6200 17584 506292 0 0 312 4420 1231 360 1 1 0 98
1 7 168292 6456 17584 502996 32 0 612 8952 1464 1424 17 4 0 79
0 6 168296 3200 17580 496572 0 4 636 500 1592 2571 49 10 0 41
0 17 168304 3124 17580 494836 0 8 868 8 1505 2131 32 6 0 62
0 7 168404 8364 17580 492808 0 220 628 4264 1406 1908 24 4 0 73
0 12 168620 5804 17580 493008 0 340 1108 1312 1517 2096 30 5 0 66
0 10 168620 9884 17584 495112 0 0 1416 0 1424 2218 24 5 0 71
0 12 168620 7324 17588 497624 0 0 1620 0 1441 1633 16 4 0 80
0 14 168620 10580 17588 499460 0 0 1228 144 1383 1637 25 5 0 71
0 8 168632 6784 17608 503888 0 124 4056 124 1432 1655 18 5 0 76
0 11 169212 7348 17628 503444 0 676 3748 864 1383 1622 19 4 5 71
4 13 169212 5236 17628 506012 16 0 1808 3896 1364 1903 20 4 4 73
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 6 169296 3444 17640 507780 0 88 1208 1364 1580 2414 33 7 0 60
0 17 173120 14424 17624 504260 0 3832 956 3836 1383 1716 17 3 0 80
0 9 174236 13336 17624 505592 0 1116 1056 2672 1298 1119 10 3 0 88
0 16 174236 5464 17640 513396 0 0 4254 3100 1367 1253 10 4 0 86
0 12 174732 5176 17624 514112 0 496 776 568 1354 1686 16 4 0 80
0 17 178196 7008 17624 508172 0 3464 696 3480 1366 1636 23 5 0 73
0 14 180832 10424 17620 498148 0 2640 664 2640 1307 1274 24 4 0 72
0 6 180832 9448 17620 498912 52 0 784 12 1316 1169 9 2 2 86
0 13 180832 8360 17620 499584 8 0 680 176 1405 1717 17 4 4 75
5 11 180832 9192 17620 500196 0 0 680 128 1579 2259 34 6 8 51
1 9 180832 8104 17624 501076 0 0 820 340 1355 1412 12 2 13 74
16 7 180832 7016 17632 502020 0 0 972 256 1364 1607 20 3 0 77
0 8 180832 12280 17632 502600 32 0 636 1930 1394 1472 15 3 0 81
2 5 180832 11256 17632 503484 0 0 812 3356 1420 1983 23 4 0 73
0 9 180832 18128 17632 503824 0 0 372 1712 1420 1865 25 5 0 71
0 7 180832 17152 17632 504708 0 0 848 1456 1327 1609 18 3 2 77
4 11 180832 24376 17640 505312 0 0 628 3552 1470 1961 21 3 20 55
0 2 180832 23544 17640 505924 0 0 612 1560 1447 1918 21 4 7 69
0 15 180832 22584 17648 506732 0 0 740 1376 1313 1216 15 3 14 67
0 5 180832 22072 17652 507204 0 0 572 1476 1770 3492 46 9 1 44
1 12 180832 20984 17652 508292 0 0 1040 3232 1741 3186 50 9 5 37
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
8 5 180832 27736 17652 509108 0 0 828 3726 1631 2891 51 11 0 37
4 6 180832 26936 17652 509856 0 0 700 5672 1702 2875 39 9 13 39
0 3 180832 33568 17652 510672 0 0 880 232 1677 2889 42 7 2 49
0 16 180832 32992 17652 511624 0 0 948 136 1559 2325 29 6 5 59
13 9 180832 40240 17652 512236 0 0 604 176 1403 1779 17 3 3 76
1 9 180832 39248 17652 513732 0 0 1476 144 1705 3237 41 9 3 48
3 4 180832 46724 17652 514276 0 0 504 272 1721 3376 39 8 0 53
3 7 180832 45940 17652 515092 0 0 828 840 1753 3187 43 9 6 42
4 8 180832 53268 17652 515704 0 0 560 3424 1545 2224 29 6 12 53
2 2 180832 60276 17652 516384 0 0 664 356 1698 3135 37 7 8 48
5 6 180832 59508 17652 517124 8 0 772 880 1562 2492 28 6 9 58
0 4 180776 67384 17652 517520 0 0 420 3280 1595 3019 36 8 9 48
1 7 180776 66968 17652 517928 0 0 336 1568 1558 2338 28 5 22 45
0 8 180776 66456 17652 518336 0 0 400 332 1501 2040 28 6 5 61
0 5 180776 73240 17652 519280 8 0 980 48 1604 2806 32 8 12 50
0 8 180776 72664 17652 519960 0 0 688 5048 1667 2822 35 6 16 42
1 2 180716 79924 17656 520492 0 0 504 8 1638 2522 37 7 22 35
3 0 180716 79124 17660 520896 0 0 424 8 1683 3526 48 8 27 16
0 0 180604 86020 17664 521592 20 0 696 16 1783 3452 56 11 20 12
1 1 180604 93412 17664 522340 0 0 720 8 1796 3540 52 10 24 15
5 0 180604 92516 17664 523020 0 0 708 240 1799 3012 47 7 18 27
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
8 0 180604 99556 17664 523632 0 0 580 8 1811 3788 51 10 25 13
0 3 180604 107132 17664 524176 0 0 528 8 1798 3622 61 10 13 16
14 1 180604 106524 17668 524716 0 0 572 16 1780 3195 48 11 9 32
0 0 180604 105884 17668 525300 96 0 644 8 1931 4016 66 12 16 5
10 12 180604 113244 17668 525744 32 0 544 651 1733 3057 40 8 17 36
5 0 180604 109020 17672 526592 32 0 868 8 1778 3294 49 9 7 34
8 1 180604 104764 17672 527680 0 0 1012 8 1835 3832 65 11 5 19
1 1 180604 110460 17672 528428 0 0 736 16 1772 3721 56 10 24 11
0 1 180604 109692 17672 529172 4 0 752 8 1791 3388 51 9 28 11
1 3 180604 108988 17676 529844 4 0 708 632 1860 3637 54 9 9 28
0 1 180604 108732 17676 530116 0 0 252 16 1770 2861 46 8 16 32
4 0 180604 108220 17684 530652 0 0 460 8 1650 3194 41 8 36 14
5 0 180604 107772 17684 531128 0 0 512 16 1625 3173 40 7 41 11
3 1 180556 114956 17684 531612 0 0 504 8 1750 3649 51 10 29 11
13 0 180556 122300 17684 532088 0 0 452 724 1835 3344 50 9 14 27
6 1 180556 121836 17684 532564 0 0 496 16 1782 4114 53 10 30 7
1 0 180556 121324 17684 533176 0 0 524 8 1686 3157 43 8 40 10
1 0 180556 128844 17684 533516 0 0 384 16 1720 3460 51 8 31 11
3 0 180556 128652 17684 533712 8 0 216 8 1740 3888 45 8 38 8
13 4 180556 136076 17684 534324 0 0 576 808 1794 3053 43 8 19 30
1 0 180556 135756 17684 534596 0 0 288 16 1728 2986 42 8 25 25
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 180556 135308 17684 535072 0 0 456 8 1715 3644 42 7 41 9



MemTotal: 1293976 kB
MemFree: 7528 kB
Buffers: 16940 kB
Cached: 581684 kB
SwapCached: 32552 kB
Active: 852360 kB
Inactive: 342852 kB
HighTotal: 393216 kB
HighFree: 768 kB
LowTotal: 900760 kB
LowFree: 6760 kB
SwapTotal: 530104 kB
SwapFree: 361800 kB
Dirty: 48908 kB
Writeback: 12996 kB
Mapped: 594156 kB
Slab: 78336 kB
Committed_AS: 883164 kB
PageTables: 4452 kB
VmallocTotal: 114680 kB
VmallocUsed: 560 kB
VmallocChunk: 114120 kB


Attachments:
attach.2 (20.37 kB)

2004-04-29 00:50:27

by Brett E.

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Jeff Garzik wrote:

> Andrew Morton wrote:
>
>> Swapout is good. It frees up unused memory. I run my desktop
>> machines at
>> swappiness=100.
>
>
>
> The definition of "unused" is quite subjective and app-dependent...
>
> I've seen reports with increasing frequency about the swappiness of the
> 2.6.x kernels, from people who were already annoyed at the swappiness of
> 2.4.x kernels :)
>
> Favorite pathological (and quite common) examples are the various 4am
> cron jobs that scan your entire filesystem. Running that process
> overnight on a quiet machine practically guarantees a huge burst of
> disk activity, with unwanted results:
> 1) Inode and page caches are blown away
> 2) A lot of your desktop apps are swapped out
>
> Additionally, a (IMO valid) maxim of sysadmins has been "a properly
> configured server doesn't swap". There should be no reason why this
> maxim becomes invalid over time. When Linux starts to swap out apps the
> sysadmin knows will be useful in an hour, or six hours, or a day just
> because it needs a bit more file cache, I get worried.
>
> There IMO should be some way to balance the amount of anon-vma's such
> that the sysadmin can say "stop taking 70% of my box's memory for
> disposable cache, use it instead for apps you would otherwise swap out,
> you memory-hungry kernel you."
>
> Jeff

Or how about "Use ALL the cache you want Mr. Kernel. But when I want
more physical memory pages, just reap cache pages and only swap out when
the cache is down to a certain size (configurable, say 100megs or
something)."


2004-04-29 00:53:33

by Jeff Garzik

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Wakko Warner wrote:
> This is just my opinion. I personally feel that cache should use available
> memory, not already used memory (swapping apps out for more cache).


Strongly agreed, though there are pathological cases that prevent this
from being something that's easy to implement on a global basis.

Jeff



2004-04-29 00:57:32

by Nick Piggin

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Wakko Warner wrote:
>>I don't know. What if you have some huge application that only
>>runs once per day for 10 minutes? Do you want it to be consuming
>>100MB of your memory for the other 23 hours and 50 minutes for
>>no good reason?
>
>
> I keep soffice open all the time. The box in question has 512MB of RAM.
> This is one app that, even though I use it infrequently, I would prefer
> never be swapped out. Mainly, when I want to use it, I *WANT* it now (i.e. not
> waiting for it to come back from swap).
>
> This is just my opinion. I personally feel that cache should use available
> memory, not already used memory (swapping apps out for more cache).
>

On the other hand, suppose that with soffice resident the entire
time, you don't have enough memory to cache an entire kernel tree
(or video you are editing, or whatever).

Now your find | xargs grep keeps taking 30s every time you run
it, or your video is un-editable...

2004-04-29 00:59:51

by Marc Singer

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
> Anyway, I have a small set of VM patches which attempt to improve
> this sort of behaviour if anyone is brave enough to try them.
> Against -mm kernels only I'm afraid (the objrmap work causes some
> porting difficulty).

Is this the same patch you wanted me to try?

Remember the embedded system where NFS IO was pushing my
application out of memory? Setting swappiness to zero was a
temporary fix.


2004-04-29 01:03:09

by Andrew Morton

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

"Brett E." <[email protected]> wrote:
>
> Or how about "Use ALL the cache you want Mr. Kernel. But when I want
> more physical memory pages, just reap cache pages and only swap out when
> the cache is down to a certain size (configurable, say 100megs or
> something)."

Have you tried decreasing /proc/sys/vm/swappiness? That's what it is for.

My point is that decreasing the tendency of the kernel to swap stuff out is
wrong. You really don't want hundreds of megabytes of BloatyApp's
untouched memory floating about in the machine. Get it out on the disk,
use the memory for something useful.

2004-04-29 01:13:41

by Andrew Morton

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

"Brett E." <[email protected]> wrote:
>
> > I see no swapout from the info which you sent.
>
> pgpgout/s gives the total number of blocks paged out to disk per second,
> it peaks at 13,000 and hovers around 3,000 per the attachment.

Nope. pgpgout is simply writes to disk, of all types.

swapout is accounted for under pswpout and your vmstat trace shows a little
bit of (healthy) swapout with swappiness=100 and negligible swapout with
swappiness=0. In both cases, negligible swapin. That's all just fine.

> Swapping out is good, but when that's coupled with swapping in as is the
> case on my side, it creates a thrashing situation where we swap out to
> disk pages which are being used, we then immediately swap those pages
> back in, etc etc..

Look at your "si" column in vmstat. It's practically all zeroes.

> The usage pattern by the way is on a server which continuously hits a
> database and reads files so I don't know what "swappiness" should be set
> to exactly. Every hour or so it wants to untar tarballs and by then the
> cache is large. From here, the system swaps in and out more while cache
> decreases. Basically, it should do what I believe Solaris does... simply
> reclaim cache and not swap. Capping cache would be good too but the
> best solution IMO is to simply reclaim the cache on an as-needed basis
> before thinking about swapping.

swappiness=100: swaps a lot. swappiness=0: doesn't swap much.

With a funny workload like that you might choose to set swappiness to 0
just around the hourly tar operation, but as the machine seems to not be
swapping there doesn't seem to be a need.

2004-04-29 01:25:08

by Jeff Garzik

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton wrote:
> "Brett E." <[email protected]> wrote:
>
>> Or how about "Use ALL the cache you want Mr. Kernel. But when I want
>> more physical memory pages, just reap cache pages and only swap out when
>> the cache is down to a certain size (configurable, say 100megs or
>> something)."
>
>
> Have you tried decreasing /proc/sys/vm/swappiness? That's what it is for.
>
> My point is that decreasing the tendency of the kernel to swap stuff out is
> wrong. You really don't want hundreds of megabytes of BloatyApp's
> untouched memory floating about in the machine. Get it out on the disk,
> use the memory for something useful.

Well, if it's truly untouched, then it never needs to be allocated a
page or swapped out at all... just accounted for (overcommit on/off,
etc. here)

But I assume you are not talking about that, but instead talking about
_rarely_ used pages, that were filled with some amount of data at some
point in time. These are at the heart of the thread (or my point, at
least) -- BloatyApp may be Oracle with a huge cache of its own, for
which swapping out may be a huge mistake. Or Mozilla. After some
amount of disk IO on my 512MB machine, Mozilla would be swapped out...
when I had only been typing an email minutes before.

BloatyApp? Yes. Should it have been swapped out? Absolutely not. The
'SIZE' in top was only 160M, and there were no other major apps running.

Applications are increasingly playing second fiddle to cache ;-(

Regardless of /proc/sys/vm/swappiness, I think it's a valid concern of
sysadmins who request "hard cache limit", because they are seeing
pathological behavior such that apps get swapped out when cache is over
50% of all available memory.

Jeff


2004-04-29 01:29:49

by Brett E.

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton wrote:

> "Brett E." <[email protected]> wrote:
>
>>>I see no swapout from the info which you sent.
>>
>> pgpgout/s gives the total number of blocks paged out to disk per second,
>> it peaks at 13,000 and hovers around 3,000 per the attachment.
>
>
> Nope. pgpgout is simply writes to disk, of all types.
That is what is confusing me. From the sar man page:

pgpgin/s
Total number of kilobytes the system paged in from disk per second.

pgpgout/s
Total number of kilobytes the system paged out to disk per second.



>
> swapout is accounted for under pswpout and your vmstat trace shows a little
> bit of (healthy) swapout with swappiness=100 and negligible swapout with
> swappiness=0. In both cases, negligible swapin. That's all just fine.
>
>
>> Swapping out is good, but when that's coupled with swapping in as is the
>> case on my side, it creates a thrashing situation where we swap out to
>> disk pages which are being used, we then immediately swap those pages
>> back in, etc etc..
>
>
> Look at your "si" column in vmstat. It's practically all zeroes.
>
>
>> The usage pattern by the way is on a server which continuously hits a
>> database and reads files so I don't know what "swappiness" should be set
>> to exactly. Every hour or so it wants to untar tarballs and by then the
>> cache is large. From here, the system swaps in and out more while cache
>> decreases. Basically, it should do what I believe Solaris does... simply
>> reclaim cache and not swap. Capping cache would be good too but the
>> best solution IMO is to simply reclaim the cache on an as-needed basis
>> before thinking about swapping.
>
>
> swappiness=100: swaps a lot. swappiness=0: doesn't swap much.
>
> With a funny workload like that you might choose to set swappiness to 0
> just around the hourly tar operation, but as the machine seems to not be
> swapping there doesn't seem to be a need.

Yeah, it wouldn't help if paging isn't the problem. I'd like more
clarification on sar before I rule out paging as the culprit.


2004-04-29 01:38:01

by Paul Mackerras

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

I wrote:

> What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
> gigabyte or so of data over to another machine, it then takes several
> seconds to change focus from one window to another. I can see it
> slowly redraw the window title bars. It looks like the window manager
> is getting swapped/paged out.

I meant to add that this is with swappiness = 60.

Paul.

2004-04-29 01:37:39

by Paul Mackerras

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton writes:

> My point is that decreasing the tendency of the kernel to swap stuff out is
> wrong. You really don't want hundreds of megabytes of BloatyApp's
> untouched memory floating about in the machine. Get it out on the disk,
> use the memory for something useful.

What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
gigabyte or so of data over to another machine, it then takes several
seconds to change focus from one window to another. I can see it
slowly redraw the window title bars. It looks like the window manager
is getting swapped/paged out.

This machine has 2.5GB of ram, so I really don't see why it would need
to swap at all. There should be plenty of page cache pages that are
clean and not in use by any process that could be discarded. It seems
like as soon as there is any memory shortage at all it picks on the
window manager and chucks out all its pages. :(

Paul.

2004-04-29 01:42:57

by Tim Connors

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

"Brett E." <[email protected]> said on Wed, 28 Apr 2004 17:49:43 -0700:
> Or how about "Use ALL the cache you want Mr. Kernel. But when I want
> more physical memory pages, just reap cache pages and only swap out when
> the cache is down to a certain size (configurable, say 100megs or
> something)."

Oh how dearly I would love that...

I have a huge app that operates on a large file (but both are a bit
smaller than available memory, by maybe a hundred or two megs - enough
to keep the entire working set in RAM, anyway). I create these
large files over and over (on another host, so cache does absolutely
no good whatsoever, since each file is streamed and read only once), but don't
delete the old ones, so they all remain in cache. So when I close one
copy of the app and open up a new one on a different file, when it
comes time to allocate those several hundred megs, it rather blows
away my mozilla or my X session (! -- since I need it to display the
results) or my window manager, and keeps growing that cache.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
When you are chewing on life's grissle, don't grumble - give a whistle!
This'll help things turn out for the best
Always look on the bright side of life

2004-04-29 01:42:53

by Andrew Morton

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Jeff Garzik <[email protected]> wrote:
>
> Andrew Morton wrote:
> > "Brett E." <[email protected]> wrote:
> >
> >> Or how about "Use ALL the cache you want Mr. Kernel. But when I want
> >> more physical memory pages, just reap cache pages and only swap out when
> >> the cache is down to a certain size (configurable, say 100megs or
> >> something)."
> >
> >
> > Have you tried decreasing /proc/sys/vm/swappiness? That's what it is for.
> >
> > My point is that decreasing the tendency of the kernel to swap stuff out is
> > wrong. You really don't want hundreds of megabytes of BloatyApp's
> > untouched memory floating about in the machine. Get it out on the disk,
> > use the memory for something useful.
>
> Well, if it's truly untouched, then it never needs to be allocated a
> page or swapped out at all... just accounted for (overcommit on/off,
> etc. here)
>
> But I assume you are not talking about that, but instead talking about
> _rarely_ used pages, that were filled with some amount of data at some
> point in time.

Of course. My fairly modest desktop here stabilises at about 300 megs
swapped out, with negligible swapin. That's all just crap which apps
aren't using any more. Getting that memory out on disk, relatively freely,
is an important optimisation.

> These are at the heart of the thread (or my point, at
> least) -- BloatyApp may be Oracle with a huge cache of its own, for
> which swapping out may be a huge mistake. Or Mozilla. After some
> amount of disk IO on my 512MB machine, Mozilla would be swapped out...
> when I had only been typing an email minutes before.

OK, so it takes four seconds to swap mozilla back in, and you noticed it.

Did you notice that those three kernel builds you just did ran in twenty
seconds less time because they had more cache available? Nope.

> Regardless of /proc/sys/vm/swappiness, I think it's a valid concern of
> sysadmins who request "hard cache limit", because they are seeing
> pathological behavior such that apps get swapped out when cache is over
> 50% of all available memory.

We should be sceptical of this. If they can provide *numbers* then fine.
Otherwise, the subjective "oh gee, that took a long time" seat-of-the-pants
stuff does not impress. If they want to feel better about it then sure,
set swappiness to zero and live with less cache for the things which need
it...

Let me point out that the kernel right now, with default swappiness very
much tends to reclaim cache rather than swapping stuff out. The
top-of-thread report was incorrect, due to a misreading of kernel
instrumentation.

2004-04-29 01:47:27

by Rik van Riel

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, 28 Apr 2004, Andrew Morton wrote:

> You really don't want hundreds of megabytes of BloatyApp's untouched
> memory floating about in the machine.

But people do. The point here is LATENCY, when a user comes
back from lunch and continues typing in OpenOffice, his system
should behave just like he left it.

Making the user have very bad interactivity for the first
minute or so is a Bad Thing, even if the computer did run
more efficiently while the user wasn't around to notice...

IMHO, the VM on a desktop system really should be optimised to
have the best interactive behaviour, meaning decent latency
when switching applications.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-29 01:51:09

by Tim Connors

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Nick Piggin <[email protected]> said on Thu, 29 Apr 2004 10:54:36 +1000:
> Wakko Warner wrote:
> >>I don't know. What if you have some huge application that only
> >>runs once per day for 10 minutes? Do you want it to be consuming
> >>100MB of your memory for the other 23 hours and 50 minutes for
> >>no good reason?
> >
> >
> > I keep soffice open all the time. The box in question has 512MB of RAM.
> > This is one app that, even though I use it infrequently, I would prefer
> > never be swapped out. Mainly, when I want to use it, I *WANT* it now (i.e. not
> > waiting for it to come back from swap).
> >
> > This is just my opinion. I personally feel that cache should use available
> > memory, not already used memory (swapping apps out for more cache).
> >
>
> On the other hand, suppose that with soffice resident the entire
> time, you don't have enough memory to cache an entire kernel tree
> (or video you are editing, or whatever).

For the kernel example, I only ever compile once before rebooting[1]
:)

This I think is the kind of thing that a kernel will never
automatically detect. This *must* be in the hands of the
administrator, who will know what they are doing (hopefully).

[1] I have never had enough memory on machines that I use to compile
kernels, to cache an entire tree anyway -- I'd much rather mozilla use
it than a cache which will never be reused

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
[transporting bed]... across several suburbs and a large salt water
harbour. Well, they thoughtfully bridged the harbour in the 1930s, so
the problem was actually transporting it across several suburbs and a
long single span bridge. -- Hipatia

2004-04-29 01:49:03

by Rik van Riel

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, 28 Apr 2004, Andrew Morton wrote:

> OK, so it takes four seconds to swap mozilla back in, and you noticed it.
>
> Did you notice that those three kernel builds you just did ran in twenty
> seconds less time because they had more cache available? Nope.

That's exactly why desktops should be optimised to give
the best performance where the user notices it most...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-29 01:56:19

by Andrew Morton

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Paul Mackerras <[email protected]> wrote:
>
> Andrew Morton writes:
>
> > My point is that decreasing the tendency of the kernel to swap stuff out is
> > wrong. You really don't want hundreds of megabytes of BloatyApp's
> > untouched memory floating about in the machine. Get it out on the disk,
> > use the memory for something useful.
>
> What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
> gigabyte or so of data over to another machine, it then takes several
> seconds to change focus from one window to another. I can see it
> slowly redraw the window title bars. It looks like the window manager
> is getting swapped/paged out.
>
> This machine has 2.5GB of ram, so I really don't see why it would need
> to swap at all. There should be plenty of page cache pages that are
> clean and not in use by any process that could be discarded. It seems
> like as soon as there is any memory shortage at all it picks on the
> window manager and chucks out all its pages. :(
>

I suspect rsync is taking two passes across the source files for its
checksumming thing. If so, this will defeat the pagecache use-once logic.
The kernel sees the second touch of the pages and assumes that there will
be a third touch.

I use scp ;)
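
As an aside, not from the thread: a program that knows its reads are
one-shot can avoid this kind of cache pollution itself with
posix_fadvise(2), which 2.6 kernels support. A minimal sketch of a
read-once consumer:

#define _XOPEN_SOURCE 600    /* for posix_fadvise */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a file once, then hint that its cached pages can be dropped
   rather than left to evict other data. */
static int read_once(const char *path)
{
    char buf[65536];
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror(path);
        return -1;
    }
    while (read(fd, buf, sizeof buf) > 0)
        ;    /* consume the data */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    return close(fd);
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    return read_once(argv[1]) ? 1 : 0;
}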

2004-04-29 02:01:47

by Andrew Morton

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Rik van Riel <[email protected]> wrote:
>
> IMHO, the VM on a desktop system really should be optimised to
> have the best interactive behaviour, meaning decent latency
> when switching applications.

I'm gonna stick my fingers in my ears and sing "la la la" until people tell
me "I set swappiness to zero and it didn't do what I wanted it to do".

2004-04-29 02:19:27

by Tim Connors

Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton <[email protected]> said on Wed, 28 Apr 2004 18:40:08 -0700:
> Jeff Garzik <[email protected]> wrote:
> > These are at the heart of the thread (or my point, at
> > least) -- BloatyApp may be Oracle with a huge cache of its own, for
> > which swapping out may be a huge mistake. Or Mozilla. After some
> > amount of disk IO on my 512MB machine, Mozilla would be swapped out...
> > when I had only been typing an email minutes before.
>
> OK, so it takes four seconds to swap mozilla back in, and you noticed it.

Actually, about 20-30 seconds on all of my boxes (no, I have no idea
why it is so slow even on the P4 I have here - swapping has always seemed
overly slow on this machine, and yes, DMA is turned on) with a ~100MB
mozilla image (plus the parts of X that get swapped out and need to be
swapped in before the user sees any effect - X typically takes up ~100MB
of resident memory here, since I tend to have so many apps with
cached pixmaps open and in current use).

> Did you notice that those three kernel builds you just did ran in twenty
> seconds less time because they had more cache available? Nope.

Nope, because I never run 3 builds before rebooting - I do however run
a lot of software that only ever reads a file once (the file was
written on another host on the cluster, so the caching done at write
time is of no benefit to us here).

This is something that should be up to the admin, because the kernel
*cannot* know what I want. And I don't think /proc/.../swappiness is
enough to define what we want.

> > Regardless of /proc/sys/vm/swappiness, I think it's a valid concern of
> > sysadmins who request "hard cache limit", because they are seeing
> > pathological behavior such that apps get swapped out when cache is over
> > 50% of all available memory.
>
> We should be sceptical of this. If they can provide *numbers* then fine.
> Otherwise, the subjective "oh gee, that took a long time" seat-of-the-pants
> stuff does not impress. If they want to feel better about it then sure,
> set swappiness to zero and live with less cache for the things which need
> it...

OK - I'll try to get around to giving you a vmstat 1 and maybe top
output, and timing things next time I run one of these big
visualisation jobs (it'd be very nice if this was all backported to
2.4, since this is what we are mostly using here -- I think I can find
a 2.6 machine though)...

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
My code is giving me mixed signals. SIGSEGV then SIGILL then SIGBUS. -- me

2004-04-29 02:29:46

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 06:57:20PM -0700, Andrew Morton wrote:
> Rik van Riel <[email protected]> wrote:
> >
> > IMHO, the VM on a desktop system really should be optimised to
> > have the best interactive behaviour, meaning decent latency
> > when switching applications.
>
> I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> me "I set swappiness to zero and it didn't do what I wanted it to do".

It does, but it's a bit too coarse of a solution. It just means that
the page cache always loses.

2004-04-29 02:37:43

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Marc Singer <[email protected]> wrote:
>
> On Wed, Apr 28, 2004 at 06:57:20PM -0700, Andrew Morton wrote:
> > Rik van Riel <[email protected]> wrote:
> > >
> > > IMHO, the VM on a desktop system really should be optimised to
> > > have the best interactive behaviour, meaning decent latency
> > > when switching applications.
> >
> > I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> > me "I set swappiness to zero and it didn't do what I wanted it to do".
>
> It does, but it's a bit too coarse of a solution. It just means that
> the page cache always loses.

That's what people have been asking for. What are you suggesting should
happen instead?

2004-04-29 02:41:56

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton <[email protected]> wrote:
>
> Paul Mackerras <[email protected]> wrote:
> >
> ...
> > What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
> > gigabyte or so of data over to another machine, it then takes several
> > seconds to change focus from one window to another. I can see it
> > slowly redraw the window title bars. It looks like the window manager
> > is getting swapped/paged out.
> >
> > This machine has 2.5GB of ram, so I really don't see why it would need
> > to swap at all. There should be plenty of page cache pages that are
> > clean and not in use by any process that could be discarded. It seems
> > like as soon as there is any memory shortage at all it picks on the
> > window manager and chucks out all its pages. :(
> >
>
> I suspect rsync is taking two passes across the source files for its
> checksumming thing. If so, this will defeat the pagecache use-once logic.
> The kernel sees the second touch of the pages and assumes that there will
> be a third touch.

OK, a bit of fiddling does indicate that if a file is present on both
client and server, and is modified on the client, the rsync client will
indeed touch the pagecache pages twice. Does this describe the files which
you're copying at all?

One thing you could do is to run `watch -n1 cat /proc/meminfo'. Cause lots
of memory to be freed up then do the copy. Monitor the size of the active
and inactive lists. If the active list is growing then we know that rsync
is touching pages twice.

That would be an unfortunate special-case.

2004-04-29 02:42:12

by Rik van Riel

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, 28 Apr 2004, Andrew Morton wrote:
> Rik van Riel <[email protected]> wrote:
> >
> > IMHO, the VM on a desktop system really should be optimised to
> > have the best interactive behaviour, meaning decent latency
> > when switching applications.
>
> I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> me "I set swappiness to zero and it didn't do what I wanted it to do".

Agreed, you shouldn't be the one to fix this problem.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-29 02:45:48

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Rik van Riel <[email protected]> wrote:
>
> On Wed, 28 Apr 2004, Andrew Morton wrote:
> > Rik van Riel <[email protected]> wrote:
> > >
> > > IMHO, the VM on a desktop system really should be optimised to
> > > have the best interactive behaviour, meaning decent latency
> > > when switching applications.
> >
> > I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> > me "I set swappiness to zero and it didn't do what I wanted it to do".
>
> Agreed, you shouldn't be the one to fix this problem.
>

What problem?

2004-04-29 02:58:25

by Paul Mackerras

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton writes:

> OK, a bit of fiddling does indicate that if a file is present on both
> client and server, and is modified on the client, the rsync client will
> indeed touch the pagecache pages twice. Does this describe the files which
> you're copying at all?

The client/server thing is a bit misleading, what matters is the
direction of the transfer. In the case I saw this morning, the G5 was
the sender. In any case I was using the -W switch, which tells it not
to use the rsync algorithm but just transfer the whole file. So I
believe that rsync on the G5 side was just reading the file through
once.

I have also noticed similar behaviour after doing a bk pull on a
kernel tree.

The really strange thing is that the behaviour seems to get worse the
more RAM you have. I haven't noticed any problem at all on my laptop
with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
2.6.2-rc3 though, so I will try a newer kernel on it.)

Regards,
Paul.

2004-04-29 03:10:20

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Paul Mackerras <[email protected]> wrote:
>
> Andrew Morton writes:
>
> > OK, a bit of fiddling does indicate that if a file is present on both
> > client and server, and is modified on the client, the rsync client will
> > indeed touch the pagecache pages twice. Does this describe the files which
> > you're copying at all?
>
> The client/server thing is a bit misleading, what matters is the
> direction of the transfer. In the case I saw this morning, the G5 was
> the sender. In any case I was using the -W switch, which tells it not
> to use the rsync algorithm but just transfer the whole file. So I
> believe that rsync on the G5 side was just reading the file through
> once.
>
> I have also noticed similar behaviour after doing a bk pull on a
> kernel tree.
>
> The really strange thing is that the behaviour seems to get worse the
> more RAM you have. I haven't noticed any problem at all on my laptop
> with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
> 2.6.2-rc3 though, so I will try a newer kernel on it.)
>

Is the laptop x86 or ppc? IIRC there were problems with the pte-referenced
handling on ppc? Or was it ppc64? It shouldn't make any difference in
this case I guess.

To investigate this sort of thing you're better off using just a local `dd'
to ascertain the pattern which is causing the problem. Keep things simple.

What happens if you do a 4G writeout with dd? Is there any swapout? There
shouldn't be much at all. If the big dd indeed does not cause swapout,
then what is different about rsync?
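
Something like `dd if=/dev/zero of=/tmp/big bs=1M count=4096' while
watching the `si'/`so' columns of `vmstat 1' would settle it.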

2004-04-29 03:13:12

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 07:35:41PM -0700, Andrew Morton wrote:
> Marc Singer <[email protected]> wrote:
> >
> > On Wed, Apr 28, 2004 at 06:57:20PM -0700, Andrew Morton wrote:
> > > Rik van Riel <[email protected]> wrote:
> > > >
> > > > IMHO, the VM on a desktop system really should be optimised to
> > > > have the best interactive behaviour, meaning decent latency
> > > > when switching applications.
> > >
> > > I'm gonna stick my fingers in my ears and sing "la la la" until people tell
> > > me "I set swappiness to zero and it didn't do what I wanted it to do".
> >
> > It does, but it's a bit too coarse of a solution. It just means that
> > the page cache always loses.
>
> That's what people have been asking for. What are you suggesting should
> happen instead?

I'm thinking that the problem is that the page cache is greedier than
most people expect. For example, if I could hold the page cache to be
under a specific size, then I could do some performance measurements.
E.g., compile a kernel with a 768K page cache, 512K, 256K and 128K. On a
machine with loads of RAM, where's the optimal page cache size?

2004-04-29 03:15:41

by William Lee Irwin III

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 12:58:13PM +1000, Paul Mackerras wrote:
> The client/server thing is a bit misleading, what matters is the
> direction of the transfer. In the case I saw this morning, the G5 was
> the sender. In any case I was using the -W switch, which tells it not
> to use the rsync algorithm but just transfer the whole file. So I
> believe that rsync on the G5 side was just reading the file through
> once.
> I have also noticed similar behaviour after doing a bk pull on a
> kernel tree.
> The really strange thing is that the behaviour seems to get worse the
> more RAM you have. I haven't noticed any problem at all on my laptop
> with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
> 2.6.2-rc3 though, so I will try a newer kernel on it.)

Looks like you've got a system with an issue. Any chance you could send
logs from an instrumented test run?

Thanks.


-- wli

2004-04-29 03:20:16

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Marc Singer <[email protected]> wrote:
>
> > That's what people have been asking for. What are you suggesting should
> > happen instead?
>
> I'm thinking that the problem is that the page cache is greedier than
> most people expect. For example, if I could hold the page cache to be
> under a specific size, then I could do some performance measurements.
> E.g., compile a kernel with a 768K page cache, 512K, 256K and 128K. On a
> machine with loads of RAM, where's the optimal page cache size?

Nope, there's no point in leaving free memory floating about when the
kernel can and will reclaim clean pagecache on demand.

What you discuss above is just an implementation detail. Forget it. What
are the requirements? Thus far I've seen

a) updatedb causes cache reclaim

b) updatedb causes swapout

c) prefer that openoffice/mozilla not get paged out when there's heavy
pagecache demand.

For a) we don't really have a solution. Some have been proposed but they
could have serious downsides.

For b) and c) we can tune the pageout-vs-cache reclaim tendency with
/proc/sys/vm/swappiness, only nobody seems to know that.
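
(For the record, that is e.g. `echo 0 > /proc/sys/vm/swappiness' at
runtime, or vm.swappiness=0 via sysctl: 0 means hang on to mapped
pages for as long as reclaim can cope, 100 means treat them much like
pagecache.)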

What else is there?

2004-04-29 03:57:47

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton wrote:
> Paul Mackerras <[email protected]> wrote:
>
>>Andrew Morton writes:
>>
>>
>>>My point is that decreasing the tendency of the kernel to swap stuff out is
>>>wrong. You really don't want hundreds of megabytes of BloatyApp's
>>>untouched memory floating about in the machine. Get it out on the disk,
>>>use the memory for something useful.
>>
>>What I have noticed with 2.6.6-rc1 on my dual G5 is that if I rsync a
>>gigabyte or so of data over to another machine, it then takes several
>>seconds to change focus from one window to another. I can see it
>>slowly redraw the window title bars. It looks like the window manager
>>is getting swapped/paged out.
>>
>>This machine has 2.5GB of ram, so I really don't see why it would need
>>to swap at all. There should be plenty of page cache pages that are
>>clean and not in use by any process that could be discarded. It seems
>>like as soon as there is any memory shortage at all it picks on the
>>window manager and chucks out all its pages. :(
>>
>
>
> I suspect rsync is taking two passes across the source files for its
> checksumming thing. If so, this will defeat the pagecache use-once logic.
> The kernel sees the second touch of the pages and assumes that there will
> be a third touch.
>

I'm not very impressed with the pagecache use-once logic, and I
have a patch to remove it completely and treat non-mapped touches
(IMO) more sanely.

2004-04-29 03:58:45

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Marc Singer wrote:
> On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
>
>>Anyway, I have a small set of VM patches which attempt to improve
>>this sort of behaviour if anyone is brave enough to try them.
>>Against -mm kernels only I'm afraid (the objrmap work causes some
>>porting difficulty).
>
>
> Is this the same patch you wanted me to try?
>
> Remember, the embedded system where NFS IO was pushing my
> application out of memory. Setting swappiness to zero was a
> temporary fix.
>
>

Yes this is the same patch I wanted you to try. Yes I
remember your problem!

Didn't anyone come up with a patch for you to test the
stale PTE theory? If so, what were the results?

2004-04-29 04:13:09

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 08:19:24PM -0700, Andrew Morton wrote:
> Marc Singer <[email protected]> wrote:
> >
> > > That's what people have been asking for. What are you suggesting should
> > > happen instead?
> >
> > I'm thinking that the problem is that the page cache is greedier than
> > most people expect. For example, if I could hold the page cache to be
> > under a specific size, then I could do some performance measurements.
> > E.g., compile a kernel with a 768K page cache, 512K, 256K and 128K. On a
> > machine with loads of RAM, where's the optimal page cache size?
>
> Nope, there's no point in leaving free memory floating about when the
> kernel can and will reclaim clean pagecache on demand.

It could work differently from that. For example, if we had 500M
total, we map 200M, then we do 400M of IO. Perhaps we'd like to be
able to say that a 400M page cache is too big. The problem isn't
about reclaiming pagecache, it's about the cost of swapping pages back
in. The page cache can tend to favor swapping mapped pages over
reclaiming its own pages that are less likely to be used. Of course,
it doesn't know that...which is the rub.

If I thought I had a method for doing this, I'd write code to try it
out.

> What you discuss above is just an implementation detail. Forget it. What
> are the requirements? Thus far I've seen

The requirement is that we'd like to see pages aged more gracefully.
A mapped page that is used continuously for ten minutes and then left
to idle for 10 minutes is more valuable than an IO page that was read
once and then not used for ten minutes. As the mapped page ages, its
value decays.

> a) updatedb causes cache reclaim
>
> b) updatedb causes swapout
>
> c) prefer that openoffice/mozilla not get paged out when there's heavy
> pagecache demand.
>
> For a) we don't really have a solution. Some have been proposed but they
> could have serious downsides.
>
> For b) and c) we can tune the pageout-vs-cache reclaim tendency with
> /proc/sys/vm/swappiness, only nobody seems to know that.

I've read the source for where swappiness comes into play. Yet I
cannot make a statement about what it means. Can you?

2004-04-29 04:20:51

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 01:48:02PM +1000, Nick Piggin wrote:
> Marc Singer wrote:
> >On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
> >
> >>Anyway, I have a small set of VM patches which attempt to improve
> >>this sort of behaviour if anyone is brave enough to try them.
> >>Against -mm kernels only I'm afraid (the objrmap work causes some
> >>porting difficulty).
> >
> >
> >Is this the same patch you wanted me to try?
> >
> > Remember, the embedded system where NFS IO was pushing my
> > application out of memory. Setting swappiness to zero was a
> > temporary fix.
> >
> >
>
> Yes this is the same patch I wanted you to try. Yes I
> remember your problem!
>
> Didn't anyone come up with a patch for you to test the
> stale PTE theory? If so, what were the results?

Russell King is working on a lot of things for the MMU code in ARM.
I'm waiting to see where he ends up. I believe he's planning on
removing the lazy PTE release logic.

I hacked at it for some time. And I'm convinced that I correctly
forced the TLBs to be flushed. Still, I was never able to get the
system to behave.

Now, I just read a comment you or WLI made about the page cache
use-once logic. I wonder if that's the real culprit? As I wrote to
Andrew Morton, the kernel seems to be assigning an awful lot of value
to page cache pages that are used once (or twice?). I know that it
would be expensive to perform an HTG aging algorithm where the head of
the LRU list is really LRU. Does your patch pursue this line of
thought?

2004-04-29 04:26:45

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Marc Singer wrote:
> On Thu, Apr 29, 2004 at 01:48:02PM +1000, Nick Piggin wrote:
>
>>Marc Singer wrote:
>>
>>>On Thu, Apr 29, 2004 at 10:21:24AM +1000, Nick Piggin wrote:
>>>
>>>
>>>>Anyway, I have a small set of VM patches which attempt to improve
>>>>this sort of behaviour if anyone is brave enough to try them.
>>>>Against -mm kernels only I'm afraid (the objrmap work causes some
>>>>porting difficulty).
>>>
>>>
>>>Is this the same patch you wanted me to try?
>>>
>>> Remember, the embedded system where NFS IO was pushing my
>>> application out of memory. Setting swappiness to zero was a
>>> temporary fix.
>>>
>>>
>>
>>Yes this is the same patch I wanted you to try. Yes I
>>remember your problem!
>>
>>Didn't anyone come up with a patch for you to test the
>>stale PTE theory? If so, what were the results?
>
>
> Russell King is working on a lot of things for the MMU code in ARM.
> I'm waiting to see where he ends up. I believe he's planning on
> removing the lazy PTE release logic.
>
> I hacked at it for some time. And I'm convinced that I correctly
> forced the TLBs to be flushed. Still, I was never able to get the
> system to behave.
>
> Now, I just read a comment you or WLI made about the page cache
> use-once logic. I wonder if that's the real culprit? As I wrote to
> Andrew Morton, the kernel seems to be assigning an awful lot of value
> to page cache pages that are used once (or twice?). I know that it
> would be expensive to perform an HTG aging algorithm where the head of
> the LRU list is really LRU. Does your patch pursue this line of
> thought?
>

Yes it includes something which should help that. Along with
the "split active lists" that I mentioned might help your
problem when WLI first came up with the change to the
swappiness calculation for your problem.

It would be great if you had time to give my patch a run.
It hasn't been widely stress tested yet though, so no
production systems, of course!

2004-04-29 04:34:58

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Marc Singer <[email protected]> wrote:
>
> It could work differently from that. For example, if we had 500M
> total, we map 200M, then we do 400M of IO. Perhaps we'd like to be
> able to say that a 400M page cache is too big.

Try it - you'll find that the system will leave all of your 200M of mapped
memory in place. You'll be left with 300M of pagecache from that I/O
activity. There may be a small amount of unmapping activity if the I/O is
a write, or if the system has a small highmem zone. Maybe.

Beware that both ARM and NFS seem to be doing odd things, so try it on a
PC+disk first ;)
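
A crude test rig for the mapped side - hypothetical code, not from
any tree, so adjust to taste:

#include <stdlib.h>
#include <unistd.h>

#define SIZE (200UL << 20)              /* ~200M of anonymous memory */

int main(void)
{
        char *p = malloc(SIZE);
        unsigned long i;

        if (!p)
                return 1;
        for (;;) {
                for (i = 0; i < SIZE; i += 4096)
                        p[i]++;         /* touch every page */
                sleep(10);              /* "in use with some frequency" */
        }
}

Run that, push 400M through the pagecache with a big dd or cp in
another shell, and watch the process's RSS in top - it should stay
put.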

> The problem isn't
> about reclaiming pagecache it's about the cost of swapping pages back
> in. The page cache can tend to favor swapping mapped pages over
> reclaiming it's own pages that are less likely to be used. Of course,
> it doesn't know that...which is the rub.

No, the system will only start to unmap pages if reclaim of unmapped
pagecache is getting into difficulty. The threshold of "getting into
difficulty" is controlled by /proc/sys/vm/swappiness.

> The requirement is that we'd like to see pages aged more gracefully.
> A mapped page that is used continuously for ten minutes and then left
> to idle for 10 minutes is more valuable than an IO page that was read
> once and then not used for ten minutes. As the mapped page ages, its
> value decays.

Yes, remembering aging info over that period of time is hard. We only have
six levels of aging: referenced+active, unreferenced+active,
referenced+inactive, unreferenced+inactive, plus position-on-lru*2.

> I've read the source for where swappiness comes into play. Yet I
> cannot make a statement about what it means. Can you?

It controls the level of page reclaim distress at which we decide to start
reclaiming mapped pages.

We prefer to reclaim pagecache, but we have to start swapping at *some*
level of reclaim failure. swappiness sets that level, in rather vague
units.

It might make sense to recast swappiness in terms of
pages_reclaimed/pages_scanned, which is the real metric of page reclaim
distress. But that would only affect the meaning of the actual number - it
wouldn't change the tunable's effect on the system.
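
Concretely, the decision is made in refill_inactive_zone() along these
lines (paraphrased from mm/vmscan.c - a sketch, the details may be
slightly off):

        /* how much trouble reclaim is in: 0 = none, 100 = nearly OOM */
        distress = 100 >> zone->prev_priority;

        /* percentage of memory which is mapped into processes */
        mapped_ratio = (nr_mapped * 100) / total_memory;

        swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

        /* only once the sum crosses 100 do we start unmapping pages */
        if (swap_tendency >= 100)
                reclaim_mapped = 1;

So with swappiness=0, reclaim has to be in real distress (or the
machine almost fully mapped) before mapped pages are touched at all.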

2004-04-29 06:20:35

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell


> The really strange thing is that the behaviour seems to get worse the
> more RAM you have. I haven't noticed any problem at all on my laptop
> with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
> 2.6.2-rc3 though, so I will try a newer kernel on it.)

Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
that the accounting doesn't work properly.

Ben.


2004-04-29 06:24:48

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Benjamin Herrenschmidt <[email protected]> wrote:
>
>
> > The really strange thing is that the behaviour seems to get worse the
> > more RAM you have. I haven't noticed any problem at all on my laptop
> > with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
> > 2.6.2-rc3 though, so I will try a newer kernel on it.)
>
> Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
> that the accounting doesn't work properly.

heh. It should have zone->spanned_pages - zone->present_pages = 2G.

2004-04-29 06:33:12

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, 2004-04-29 at 16:22, Andrew Morton wrote:
> Benjamin Herrenschmidt <[email protected]> wrote:
> >
> >
> > > The really strange thing is that the behaviour seems to get worse the
> > > more RAM you have. I haven't noticed any problem at all on my laptop
> > > with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
> > > 2.6.2-rc3 though, so I will try a newer kernel on it.)
> >
> > Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
> > that the accounting doesn't work properly.
>
> heh. It should have zone->spanned_pages - zone->present_pages = 2G.

That should be fine, I'll check later, I can't reboot mine right now.

I'm initializing the zone with free_area_init_node() and I _am_ passing
the hole size. Paul, also check if you have NUMA enabled in .config, it changes
the way zones are initialized, I may have gotten that case wrong.

Ben.


2004-04-29 06:33:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

At some point in the past, I wrote:
>> The really strange thing is that the behaviour seems to get worse the
>> more RAM you have. I haven't noticed any problem at all on my laptop
>> with 768MB, only on the G5, which has 2.5GB. (The laptop is still on
>> 2.6.2-rc3 though, so I will try a newer kernel on it.)

On Thu, Apr 29, 2004 at 04:12:38PM +1000, Benjamin Herrenschmidt wrote:
> Your G5 also has a 2Gb IO hole in the middle of zone DMA, it's possible
> that the accounting doesn't work properly.

Hmm, ->present_pages vs. ->spanned_pages distinction(s) should cover
this, or should have at one point. How are those being set at the moment?


-- wli

2004-04-29 06:39:50

by William Lee Irwin III

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 09:20:47PM -0700, Marc Singer wrote:
> Now, I just read a comment you or WLI made about the page cache
> use-once logic. I wonder if that's the real culprit? As I wrote to
> Andrew Morton, the kernel seems to be assigning an awful lot of value
> to page cache pages that are used once (or twice?). I know that it
> would be expensive to perform an HTG aging algorithm where the head of
> the LRU list is really LRU. Does your patch pursue this line of
> thought?

I don't recall ever having seen an actual pure LRU patch.

The physical scanning infrastructure should be enough to implement most
global replacement algorithms with. It's always good to compare
alternatives. Also, we should have an implementation of random
replacement just as a control case to verify we do better than random.


-- wli

2004-04-29 07:36:19

by Russell King

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 09:20:47PM -0700, Marc Singer wrote:
> Russell King is working on a lot of things for the MMU code in ARM.
> I'm waiting to see where he ends up. I believe he's planning on
> removing the lazy PTE release logic.

Essentially it came to a grinding halt due to the sheer size of the
task of sorting out the crappy includes, which is far too large for a
stable kernel.

I may go back to the original problem and sort it a different way,
but for the time being, I'm occupied in other areas.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-04-29 08:02:24

by Wichert Akkerman

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Previously Marc Singer wrote:
> I'm thinking that the problem is that the page cache is greedier than
> most people expect. For example, if I could hold the page cache to be
> under a specific size, then I could do some performance measurements.

It is actually greedy enough that when my nightly cron starts I suddenly
see apache and pdns_recursor being killed consistently every day.

Wichert.

--
Wichert Akkerman <[email protected]> It is simple to make things.
http://www.wiggy.net/ It is hard to make things simple.

2004-04-29 09:42:02

by Helge Hafting

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Brett E. wrote:
[...]
> Or how about "Use ALL the cache you want Mr. Kernel. But when I want
> more physical memory pages, just reap cache pages and only swap out when
> the cache is down to a certain size(configurable, say 100megs or
> something)."

Problem: reaping cache is equivalent to swapping in some cases.
The cache isn't merely "files read & written".
It is also all your executable code. Code is no different from
files being read. Dumping too much cache will dump the
code you're executing, and then it has to be reloaded from disk.


Helge Hafting

2004-04-29 10:44:31

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Russell King wrote:
> On Wed, Apr 28, 2004 at 09:20:47PM -0700, Marc Singer wrote:
>
>>Russell King is working on a lot of things for the MMU code in ARM.
>>I'm waiting to see where he ends up. I believe he's planning on
>>removing the lazy PTE release logic.
>
>
> Essentially it came to a grinding halt due to the sheer size of the
> task of sorting out the crappy includes, which is far too large for a
> stable kernel.
>
> I may go back to the original problem and sort it a different way,
> but for the time being, I'm occupied in other areas.
>

Anyway, Marc said he tried flushing the tlb and that didn't
solve his problem.

The problem might be the one identified in the thread:
2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

2004-04-29 11:04:30

by Russell King

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 08:44:25PM +1000, Nick Piggin wrote:
> Anyway, Marc said he tried flushing the tlb and that didn't
> solve his problem.

Nevertheless, when you have a TLB with ASIDs, there will be even less
pressure to flush these entries from the TLB, so in effect we might
as well save the expense of implementing the page aging in the first
place.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-04-29 13:51:59

by Horst H. von Brand

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

"Brett E." <[email protected]> said:

[...]

> I created a hack which allocates memory causing cache to go down, then
> exits, freeing up the malloc'ed memory. This brings free memory up by
> 400 megs and brings the cache down to close to 0, of course the cache
> grows right afterwards. It would be nice to cap the cache datastructures
> in the kernel but I've been posting about this since September to no
> avail so my expectations are pretty low.

Because it is complete nonsense. Keeping stuff around in RAM in case it
is needed again, as long as RAM is not needed for anything else, is a major
win. That is what cache is.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-04-29 14:24:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 10:02:19AM +0200, Wichert Akkerman wrote:
> Previously Marc Singer wrote:
> > I'm thinking that the problem is that the page cache is greedier than
> > most people expect. For example, if I could hold the page cache to be
> > under a specific size, then I could do some performance measurements.
>
> It is actually greedy enough that when my nightly cron starts I suddenly
> see apache and pdns_recursor being killed consistently every day.

Which kernel is that?

They are getting killed because there is no more swap available.
Otherwise it's a bug.

2004-04-29 14:27:46

by Wichert Akkerman

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Previously Marcelo Tosatti wrote:
> Which kernel is that?

That machine is running 2.6.4 at the moment.

> They are getting killed because there is no more swap available.
> Otherwise it's a bug.

It actually killed a bunch of processes a minute ago and right now has
120mb of swap free and 104mb used for cache.

Wichert.

--
Wichert Akkerman <[email protected]> It is simple to make things.
http://www.wiggy.net/ It is hard to make things simple.

2004-04-29 14:32:15

by Rik van Riel

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, 29 Apr 2004, Nick Piggin wrote:

> I'm not very impressed with the pagecache use-once logic, and I
> have a patch to remove it completely and treat non-mapped touches
> (IMO) more sanely.

The basic idea of use-once isn't bad (search for LIRS and
ARC page replacement), however the Linux implementation
doesn't have any of the checks and balances that the
researched replacement algorithms have...

However, adding the checks and balancing required for LIRS,
ARC and CAR(S) isn't easy since it requires keeping track of
a number of recently evicted pages. That could be quite a
bit of infrastructure, though it might be well worth it.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-29 14:45:48

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 09:33:59PM -0700, Andrew Morton wrote:
> Marc Singer <[email protected]> wrote:
> >
> > It could work differently from that. For example, if we had 500M
> > total, we map 200M, then we do 400M of IO. Perhaps we'd like to be
> > able to say that a 400M page cache is too big.
>
> Try it - you'll find that the system will leave all of your 200M of mapped
> memory in place. You'll be left with 300M of pagecache from that I/O
> activity. There may be a small amount of unmapping activity if the I/O is
> a write, or if the system has a small highmem zone. Maybe.

Are you sure? Isn't that what the other posters are whinging about?
They do lots of IO and then they have to wait for the system to page
Mozilla back in.

> Beware that both ARM and NFS seem to be doing odd things, so try it on a
> PC+disk first ;)

Yeah, I know that there is still something odd in ARM-land. I assume
that the other posters are using IA32.

> No, the system will only start to unmap pages if reclaim of unmapped
> pagecache is getting into difficulty. The threshold of "getting into
> difficulty" is controlled by /proc/sys/vm/swappiness.

What constitutes 'difficulty'? Perhaps this is rhetorical.

> > I've read the source for where swappiness comes into play. Yet I
> > cannot make a statement about what it means. Can you?
>
> It controls the level of page reclaim distress at which we decide to start
> reclaiming mapped pages.
>
> We prefer to reclaim pagecache, but we have to start swapping at *some*
> level of reclaim failure. swappiness sets that level, in rather vague
> units.

I'm not sure I see why we have to swap. If half of memory is mapped,
and the user is using those pages with some frequency, perhaps we
should never reclaim mapped pages.

2004-04-29 14:48:12

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 11:43:25AM +0200, Helge Hafting wrote:
> Brett E. wrote:
> [...]
> >Or how about "Use ALL the cache you want Mr. Kernel. But when I want
> >more physical memory pages, just reap cache pages and only swap out when
> >the cache is down to a certain size(configurable, say 100megs or
> >something)."
>
> Problem: reaping cache is equivalent to swapping in some cases.
> The cache isn't merely "files read & written".
> It is also all your executable code. Code is no different from
> files being read. Dumping too much cache will dump the
> code you're executing, and then it has to be reloaded from disk.

Hmm. I was under the impression that mapped pages were code and
unmapped pages were IO page cache. Are you suggesting that code is
duplicated?

2004-04-29 14:49:47

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 02:26:17PM +1000, Nick Piggin wrote:
> Yes it includes something which should help that. Along with
> the "split active lists" that I mentioned might help your
> problem when WLI first came up with the change to the
> swappiness calculation for your problem.
>
> It would be great if you had time to give my patch a run.
> It hasn't been widely stress tested yet though, so no
> production systems, of course!

As I said, I'm game to have a go. The trouble is that it doesn't
apply. My development kernel has an RMK patch applied that seems to
conflict with the MM patch on which you depend.

2004-04-29 14:52:49

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 12:04:19PM +0100, Russell King wrote:
> On Thu, Apr 29, 2004 at 08:44:25PM +1000, Nick Piggin wrote:
> > Anyway, Marc said he tried flushing the tlb and that didn't
> > solve his problem.
>
> Nevertheless, when you have a TLB with ASIDs, there will be even less
> pressure to flush these entries from the TLB, so in effect we might
> as well save the expense of implementing the page aging in the first
> place.

Uh, oh. My FLA translator just broke. Whatsa ASID?

2004-04-29 16:25:13

by Martin J. Bligh

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

>> These are at the heart of the thread (or my point, at
>> least) -- BloatyApp may be Oracle with a huge cache of its own, for
>> which swapping out may be a huge mistake. Or Mozilla. After some
>> amount of disk IO on my 512MB machine, Mozilla would be swapped out...
>> when I had only been typing an email minutes before.
>
> OK, so it takes four seconds to swap mozilla back in, and you noticed it.
>
> Did you notice that those three kernel builds you just did ran in twenty
> seconds less time because they had more cache available? Nope.

The latency for interactive stuff is definitely more noticeable though, and
thus arguably more important. Perhaps we should be tying the scheduler in
more tightly with the VM - we've already decided there which apps are
"interactive" and thus need low latency ... shouldn't we be giving a boost
to their RAM pages as well, and favour keeping those paged in over other
pages (whether other apps, or cache) logically? It's all latency still ...

M.

2004-04-29 16:37:07

by Chris Friesen

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Martin J. Bligh wrote:

> The latency for interactive stuff is definitely more noticeable though, and
> thus arguably more important. Perhaps we should be tying the scheduler in
> more tightly with the VM - we've already decided there which apps are
> "interactive" and thus need low latency ... shouldn't we be giving a boost
> to their RAM pages as well, and favour keeping those paged in over other
> pages (whether other apps, or cache) logically? It's all latency still ...

I like this idea. Maybe make it more general though--tasks with high scheduler priority also get
more of a memory priority boost. This will factor in the static priority as well as the
interactivity bonus.

Chris

2004-04-29 16:51:19

by Andy Isaacson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 08:19:24PM -0700, Andrew Morton wrote:
> What you discuss above is just an implementation detail. Forget it. What
> are the requirements? Thus far I've seen
>
> a) updatedb causes cache reclaim
>
> b) updatedb causes swapout
>
> c) prefer that openoffice/mozilla not get paged out when there's heavy
> pagecache demand.
>
> For a) we don't really have a solution. Some have been proposed but they
> could have serious downsides.
>
> For b) and c) we can tune the pageout-vs-cache reclaim tendency with
> /proc/sys/vm/swappiness, only nobody seems to know that.
>
> What else is there?

What I want is for purely sequential workloads which far exceed cache
size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
thrashing my entire desktop out of memory. I DON'T CARE if the tar
completes in 45 minutes rather than 80. (It wouldn't, anyways, because
it only needs about 5 MB of cache to get every bit of the speedup it was
going to get.) But the additional latency when I un-xlock in the
morning is annoying, and there is no benefit.

For a more useful example, ideally I *should not be able to tell* that
"dd if=/hde1 of=/hdf1" is running. [1] There is *no* benefit to cacheing
more than about 2 pages, under this workload. But with current kernels,
IME, that workload results in a gargantuan buffer cache and lots of
swapout of apps I was using 3 minutes ago. I've taken to walking away
for some coffee, coming back when it's done, and "sudo swapoff
/dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
trying to use bloaty apps.

[1] obviously I'll see some slowdown due to interrupts and PCI
bandwidth; that's not what I'm railing against, here.
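
FWIW there is already a userspace-visible hook for this: a strictly
sequential reader can tell the kernel to drop what it has just read.
A minimal sketch (assuming the fadvise64 syscall is wired up in your
kernel and libc):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char buf[1 << 16];
        off_t done = 0;
        ssize_t n;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s file\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        while ((n = read(fd, buf, sizeof buf)) > 0) {
                done += n;
                /* we will never look at these pages again */
                posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
        }
        close(fd);
        return 0;
}

If tar and updatedb did something like this per file, the rest of the
pagecache would be left alone.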

-andy

2004-04-29 16:53:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

>> I suspect rsync is taking two passes across the source files for its
>> checksumming thing. If so, this will defeat the pagecache use-once logic.
>> The kernel sees the second touch of the pages and assumes that there will
>> be a third touch.
>
> OK, a bit of fiddling does indicate that if a file is present on both
> client and server, and is modified on the client, the rsync client will
> indeed touch the pagecache pages twice. Does this describe the files which
> you're copying at all?
>
> One thing you could do is to run `watch -n1 cat /proc/meminfo'. Cause lots
> of memory to be freed up then do the copy. Monitor the size of the active
> and inactive lists. If the active list is growing then we know that rsync
> is touching pages twice.
>
> That would be an unfortunate special-case.

Personally, I think that the use-twice logic is a bit of a hack that mostly
works. If we moved to a method where we kept an eye on which pages are
associated with which address_space (for mapped pages) or which process
(for anonymous pages) we'd have a much better shot at stopping any one
process / file from monopolizing the whole of system memory.

We'd also be able to favour memory for files that are still open over ones
that have been closed, and recognize linear access scan patterns per file,
and reclaim more aggressively from the overscanned areas, and favour higher
prio tasks over lower prio ones (including, but not limited to interactive).

Global LRU (even with the tweaks it has in Linux) doesn't seem optimal.

M.

2004-04-29 16:57:17

by Martin J. Bligh

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

>> The latency for interactive stuff is definitely more noticeable though, and
>> thus arguably more important. Perhaps we should be tying the scheduler in
>> more tightly with the VM - we've already decided there which apps are
>> "interactive" and thus need low latency ... shouldn't we be giving a boost
>> to their RAM pages as well, and favour keeping those paged in over other
>> pages (whether other apps, or cache) logically? It's all latency still ...
>
> I like this idea. Maybe make it more general though--tasks with high scheduler priority also get more of a memory priority boost. This will factor in the static priority as well as the interactivity bonus.

Yeah, see also my other mail in that thread - if we moved to file-object (address_space) and task anon (mm) based tracking, it should be much easier.
Also fits in nicely with Hugh's anon_mm code.

M.

2004-04-29 17:50:40

by Adam Kropelin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Wed, Apr 28, 2004 at 09:47:45PM -0400, Rik van Riel wrote:
> On Wed, 28 Apr 2004, Andrew Morton wrote:
>
> > OK, so it takes four seconds to swap mozilla back in, and you noticed it.
> >
> > Did you notice that those three kernel builds you just did ran in twenty
> > seconds less time because they had more cache available? Nope.
>
> That's exactly why desktops should be optimised to give
> the best performance where the user notices it most...

Agreed. Looking at it from the standpoint of relative change, the time
to bring the mozilla window to the foreground is increased by orders of
magnitude while the kernel builds improve by a (relatively) small
percent. Humans easily notice change in orders of magnitude and such
changes can feel painful. Benchmarks notice $SMALLNUM percent long
before a human will, especially if s/he has left the room because the
job was going to take 10 minutes anyway. The 30 seconds saved off the
compile run just isn't worth it sometimes if its side-effect is to
disrupt the user's workflow.

The 'swappiness' tunable may well give enough control over the situation
to suit all sorts of users. If nothing else, this thread has raised
awareness that such a tunable exists and can be played with to influence
the kernel's decision-making. Distros, too, should give consideration to
appropriate default settings to serve their intended users.

--Adam

2004-04-29 18:06:00

by Brett E.

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Brett E. wrote:

> Andrew Morton wrote:
>
>> "Brett E." <[email protected]> wrote:
>>
>>>> I see no swapout from the info which you sent.
>>>
>>>
>>> pgpgout/s gives the total number of blocks paged out to disk per
>>> second, it peaks at 13,000 and hovers around 3,000 per the attachment.
>>
>>
>>
>> Nope. pgpgout is simply writes to disk, of all types.
>
> That is what is confusing me.. From the sar man page:
>
> pgpgin/s
> Total number of kilobytes the system paged in from disk per second.
>
> pgpgout/s
> Total number of kilobytes the system paged out to disk per second.
>
>
Anyone know what I should believe? Sar's pgpgin/s and pgpgout/s tell me
that it is paging in/out from/to disk. Yet pswpin/s and pswpout/s are
both 0. Swapping and paging are the same thing, I believe. pgpgin/out
refer to paging, pswpin/out refer to swapping. So I for one am confused.

I guess I could dig through the source but I figured someone might have
encountered this discrepancy in the past.


2004-04-29 18:32:32

by William Lee Irwin III

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 11:05:42AM -0700, Brett E. wrote:
> Anyone know what I should believe? Sar's pgpgin/s and pgpgout/s tell me
> that it is paging in/out from/to disk. Yet pswpin/s and pswpout/s are
> both 0. Swapping and paging are the same thing, I believe. pgpgin/out
> refer to paging, pswpin/out refer to swapping. So I for one am confused.
> I guess I could dig through the source but I figured someone might have
> encountered this discrepancy in the past.

Both are to be believed. They merely describe different things.

Pagein/pageout are counts of VM-initiated IO, regardless of whether this
IO is done on filesystem-backed pages or swap-backed pages. Pagein and
pageout are used more generally to describe VM-initiated IO and don't
exclusively refer to swap IO, but also include IO to filesystems to/from
filesystem-backed memory.

Swapin/swapout are counts of swap IO only, and are considered to apply
only to IO done to swap files/devices to/from swap-backed anonymous memory.

Pagein/pageout are both proper and necessary to have. In fact, you were
requesting that filesystem IO be done preferentially to swap IO, and the
pagein/pageout indicators showing IO while swapin/swapout indicators show
none mean you are getting exactly what you asked for.
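
(On 2.6 both sets of counters are exported directly in /proc/vmstat -
pgpgin, pgpgout, pswpin, pswpout - and sar is just presenting rates
derived from them.)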


-- wli

2004-04-29 18:32:40

by Brett E.

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Horst von Brand wrote:

> "Brett E." <[email protected]> said:
>
> [...]
>
>
>>I created a hack which allocates memory causing cache to go down, then
>>exits, freeing up the malloc'ed memory. This brings free memory up by
>>400 megs and brings the cache down to close to 0, of course the cache
>>grows right afterwards. It would be nice to cap the cache datastructures
>>in the kernel but I've been posting about this since September to no
>>avail so my expectations are pretty low.
>
>
> Because it is complete nonsense. Keeping stuff around in RAM in case it
> is needed again, as long as RAM is not needed for anything else, is a major
> win. That is what cache is.
The key phrase in your post is "as long as RAM is not needed for
anything else." My assertion was that this is not the case and it seems
to favor cache over pages being used. Sar shows heavy paging to/from
disk even though 500 megs are reported in cache. I hope I don't need to
spell out what sustained paging in/out means: we are paging out pages
which we will need again shortly after they are paged out. Sar also
reports no swapping, hence the need to figure out why there is a
discrepancy before continuing.


2004-04-29 20:03:13

by Horst H. von Brand

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Nick Piggin <[email protected]> said:

[...]

> I don't know. What if you have some huge application that only
> runs once per day for 10 minutes? Do you want it to be consuming
> 100MB of your memory for the other 23 hours and 50 minutes for
> no good reason?

How on earth is the kernel supposed to know that for this one particular
job you don't care if it takes 3 hours instead of 10 minutes, just because
you don't want to spare enough preciousss RAM?
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2004-04-29 20:20:37

by Martin J. Bligh

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

--On Thursday, April 29, 2004 16:01:11 -0400 Horst von Brand <[email protected]> wrote:

> Nick Piggin <[email protected]> said:
>
> [...]
>
>> I don't know. What if you have some huge application that only
>> runs once per day for 10 minutes? Do you want it to be consuming
>> 100MB of your memory for the other 23 hours and 50 minutes for
>> no good reason?
>
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?

Nice value is the obvious interface for such information.

M.

2004-04-29 20:46:12

by David B. Stevens

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Horst von Brand wrote:
> Nick Piggin <[email protected]> said:
>
> [...]
>
>
>>I don't know. What if you have some huge application that only
>>runs once per day for 10 minutes? Do you want it to be consuming
>>100MB of your memory for the other 23 hours and 50 minutes for
>>no good reason?
>
>
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?

Maybe the kernel should be told by the apps exactly what they require in
the way of memory and maybe how to slice up what the app gets for memory
from the kernel.

This would not be the first time that applications had to specify such
information.

That was what REGION= and other such parameters were all about in other
operating systems.

Then the kernel would have free use of what was left until the next app
started etc ....

Cheers,
Dave

2004-04-29 20:53:53

by Brett E.

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

William Lee Irwin III wrote:

> On Thu, Apr 29, 2004 at 11:05:42AM -0700, Brett E. wrote:
>
>>Anyone know what I should believe? Sar's pgpgin/s and pgpgout/s tell me
>> that it is paging in/out from/to disk. Yet pswpin/s and pswpout/s are
>>both 0. Swapping and paging are the same thing, I believe. pgpgin/out
>>refer to paging, pswpin/out refer to swapping. So I for one am confused.
>>I guess I could dig through the source but I figured someone might have
>>encountered this discrepancy in the past.
>
>
> Both are to be believed. They merely describe different things.
>
> Pagein/pageout are counts of VM-initiated IO, regardless of whether this
> IO is done on filesystem-backed pages or swap-backed pages. Pagein and
> pageout are used more generally to describe VM-initiated IO and don't
> exclusively refer to swap IO, but also include IO to filesystems to/from
> filesystem-backed memory.
>
> Swapin/swapout are counts of swap IO only, and are considered to apply
> only to IO done to swap files/devices to/from swap-backed anonymous memory.
>
> Pagein/pageout are both proper and necessary to have. In fact, you were
> requesting that filesystem IO be done preferentially to swap IO, and the
> pagein/pageout indicators showing IO while swapin/swapout indicators show
> none mean you are getting exactly what you asked for.
>
>
Thanks, I think it's clear now. In layman's terms, pgpgin/out relate to
disk cache activity vs pswpin/out which relate to swap activity.

2004-04-29 21:07:54

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andy Isaacson <[email protected]> wrote:
>
> What I want is for purely sequential workloads which far exceed cache
> size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
> thrashing my entire desktop out of memory. I DON'T CARE if the tar
> completed in 45 minutes rather than 80. (It wouldn't, anyways, because
> it only needs about 5 MB of cache to get every bit of the speedup it was
> going to get.) But the additional latency when I un-xlock in the
> morning is annoying, and there is no benefit.

What kernel version are you using? If 2.6, what value of
/proc/sys/vm/swappiness?

> For a more useful example, ideally I *should not be able to tell* that
> "dd if=/hde1 of=/hdf1" is running.

I just did a 4GB `dd if=/dev/sda of=/x bs=1M' on a 1GB 2.6.6-rc2-mm2
swappiness=85 machine here and there was no swapout at all.

Probably your machine has less memory. But without real, hard details
nothing can be done.

> There is *no* benefit to cacheing
> more than about 2 pages, under this workload.

Sure, we could do better things with the large streaming files, although
the risk of accidentally screwing up particular workloads is high.

But the use-once logic which we have in there at present does handle these
cases quite well.

> But with current kernels,
> IME, that workload results in a gargantuan buffer cache and lots of
> swapout of apps I was using 3 minutes ago. I've taken to walking away
> for some coffee, coming back when it's done, and "sudo swapoff
> /dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
> trying to use bloaty apps.

What kernel, what system specs, what swappiness setting?

2004-04-29 21:08:00

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes,

I'd pay ten bucks (yeah, I'm a cheapskate) for an option that I could
twiddle that would mark my nightly updatedb and backup jobs as ones to
use reduced memory footprint (both for file caching and backing user
virtual address space), even if it took much longer.

So, rather than protest in mock outrage that it's impossible for the
kernel to know this, instead answer the question as stated in all
seriousness ... well ... how _could_ the kernel know, and what _could_
the kernel do if it knew. What mechanism(s) would be needed so that
the kernel could restrict a job's memory usage?

Heh - indeed perhaps the answer is closer than I realize. For SGI's big
NUMA boxes, managing memory placement is sufficiently critical that we
are inventing or encouraging ways (such as Andi Kleen's numa stuff) to
control memory placement per node per job. Perhaps this needs to be
extended to portions of a node (this job can only use 1 Gb of the memory
on that 2 Gb node) and to other memory uses (file cache, not just user
space memory).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-29 21:24:02

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Paul Jackson <[email protected]> wrote:
>
> > How on earth is the kernel supposed to know that for this one particular
> > job you don't care if it takes 3 hours instead of 10 minutes,
>
> I'd pay ten bucks (yeah, I'm a cheapskate) for an option that I could
> twiddle that would mark my nightly updatedb and backup jobs as ones to
> use reduced memory footprint (both for file caching and backing user
> virtual address space), even if it took much longer.
>
> So, rather than protest in mock outrage that it's impossible for the
> kernel to know this, instead answer the question as stated in all
> seriousness ... well ... how _could_ the kernel know, and what _could_
> the kernel do if it knew. What mechanism(s) would be needed so that
> the kernel could restrict a job's memory usage?

Two things:

a) a knob to say "only reclaim pagecache". We have that now.

b) a knob to say "reclaim vfs caches harder". That's simply a matter of boosting
the return value from shrink_dcache_memory() and perhaps shrink_icache_memory().

It's not quite what you're after, but it's close.

2004-04-29 21:36:57

by Timothy Miller

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell



Paul Jackson wrote:

> Heh - indeed perhaps the answer is closer than I realize. For SGI's big
> NUMA boxes, managing memory placement is sufficiently critical that we
> are inventing or encouraging ways (such as Andi Kleen's numa stuff) to
> control memory placement per node per job. Perhaps this needs to be
> extended to portions of a node (this job can only use 1 Gb of the memory
> on that 2 Gb node) and to other memory uses (file cache, not just user
> space memory).
>

Is updatedb run with a nice level greater than zero?

Perhaps nice level could influence how much a process is allowed to
affect page cache.

2004-04-29 21:39:22

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew wrote:
> Two things:
> a) a knob to say "only reclaim pagecache". We have that now.
> b) a knob to say "reclaim vfs caches harder" ...

Are these knobs system-wide in effect, or per job?
I am presuming system-wide.

When I'm working late, I want my updatedb/backup jobs
to scrunch themselves into a corner, even as my builds
and gui desktop continue to fly and suck up RAM.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-29 21:49:05

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thursday 29 April 2004 03:50, Wakko Warner wrote:
> > I don't know. What if you have some huge application that only
> > runs once per day for 10 minutes? Do you want it to be consuming
> > 100MB of your memory for the other 23 hours and 50 minutes for
> > no good reason?
>
> I keep soffice open all the time. The box in question has 512MB of RAM.
> This is one app that, even though I use it infrequently, I would prefer
> never be swapped out. Mainly when I want to use it, I *WANT* it now (ie
> not waiting for it to come back from swap)

I'm afraid a part of the problem is that there are apps which are
way too bloated. Fighting bloat is thankless and hard, so almost
everybody simply throws RAM at the problem. Well. Having thrown
lotsa RAM at the problem, it may feel 'better' until you realize you
need not only RAM but *also* disk bandwidth to move bloat from disk
to RAM and back.

Come on, let's admit it. The proper fix to the 'I want OpenOffice to be
responsive' problem is to make it several times smaller.
Everything else is more or less a workaround.

It's a pity size optimizations are not too popular even
on lkml.

> This is just my opinion. I personally feel that cache should use
> available memory, not already used memory (swapping apps out for more
> cache).
--
vda

2004-04-29 21:53:07

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Timothy wrote:
> Perhaps nice level could influence how much a process is allowed to
> affect page cache.

I'm from the school that says 'nice' applies to scheduling priority,
not memory usage.

I'd expect a different knob, a per-task inherited value as is 'nice',
to control memory usage.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-29 21:59:04

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Paul Jackson <[email protected]> wrote:
>
> Andrew wrote:
> > Two things:
> > a) a knob to say "only reclaim pagecache". We have that now.
> > b) a knob to say "reclaim vfs caches harder" ...
>
> Are these knobs system-wide in effect, or per job?
> I am presuming system-wide.

yup, system-wide.

> When I'm working late, I want my updatedb/backup jobs
> to scrunch themselves into a corner, even as my builds
> and gui desktop continue to fly and suck up RAM.

Sure. That's not purely a cacheing thing though. Even if the background
activity was clamped to just a few megs of cache you'll find that the
seek activity is a killer, and needs a limitation mechanism, although
the anticipatory scheduler helps here a lot.

2004-04-29 22:13:43

by Timothy Miller

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell



Paul Jackson wrote:
> Timothy wrote:
>
>>Perhaps nice level could influence how much a process is allowed to
>>affect page cache.
>
>
> I'm from the school that says 'nice' applies to scheduling priority,
> not memory usage.
>
> I'd expect a different knob, a per-task inherited value as is 'nice',
> to control memory usage.
>


Linux kernel developers seem to be of the mind that you cannot trust
what applications tell you about themselves, so it's better to use
heuristics to GUESS how to schedule something, rather than to add YET
ANOTHER property to it.

Nick, Con, Ingo, and others have done an impressive job of taking the
guess/heuristic approach to scheduling. I don't see why that can't be
taken further.

Also, there seems to be strong resistance to adding a property to
something which is not easily accessible through existing UNIX tools.
"nice" and "renice" commands have been around forever. Adding another
control requires new commands, new libc functions, changes to "top", etc.

Besides, when would you want to have a sched-nice of -20 and an io-nice
of 20, or a sched-nice of 20 and an io-nice of -20? Things like that
would make no sense.

2004-04-29 22:22:00

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew wrote:
> Even if the background activity was clamped to just a few megs
> of cache you'll find that the seek activity is a killer, and
> needs a limitation mechanism.

True - the seek activity is another critical resource that would need to
be throttled to keep updatedb/backup from interfering with my late
night labours.

Let's see, that's:
1) cpu scheduling ticks
2) memory for virtual address backing store
3) memory for file related caching
4) disk arm motion

Hmmm ... actually not so much a numa-placement extension, but rather a
CKRM opportunity.

CKRM focuses on measuring and restraining how much of specified critical
resources a task is using; numa placement on which cpus or memory nodes
are allowed to be used at all.

See further the CKRM thread of Shailabh Nagar, also running on lkml
today.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-29 22:27:15

by Andy Isaacson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 01:42:22PM -0700, Andrew Morton wrote:
> Andy Isaacson <[email protected]> wrote:
> > What I want is for purely sequential workloads which far exceed cache
> > size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
> > thrashing my entire desktop out of memory. I DON'T CARE if the tar
> > completed in 45 minutes rather than 80. (It wouldn't, anyways, because
> > it only needs about 5 MB of cache to get every bit of the speedup it was
> > going to get.) But the additional latency when I un-xlock in the
> > morning is annoying, and there is no benefit.
>
> What kernel version are you using? If 2.6, what value of
> /proc/sys/vm/swappiness?

2.4.various, including 2.4.25 and 2.4.26. I haven't taken the 2.6
plunge yet. Running on various x86 including
- dual PIII 666 MHz 512 MB
- SpeedStep PIII 700 MHz 128 MB
- Athlon XP 2GHz 512 MB

> > For a more useful example, ideally I *should not be able to tell* that
> > "dd if=/hde1 of=/hdf1" is running.
>
> I just did a 4GB `dd if=/dev/sda of=/x bs=1M' on a 1GB 2.6.6-rc2-mm2
> swappiness=85 machine here and there was no swapout at all.
>
> Probably your machine has less memory. But without real, hard details
> nothing can be done.

I'm pleased to hear that 2.6 is apparently better behaved. In your
test, what was the impact on the file cache? It's a big improvement to
not be paging out to swap, but it's also important that sequential IO
not evict my cached build tree.

An interesting test would be to time a compilation of a source file with
a large number of includes. For example, building
linux-2.4.25/kernel/sysctl.c on my Athlon XP 2GHz, 512MB, 2.4.25 takes
2.8 seconds with (fairly) cold cache. (I didn't reboot, but I did take
fairly extreme measures to force stuff out.) It takes 0.54 seconds with
warm caches. After doing 1GB of sequential IO (wc -w /tmp/bigfile) I'm
back up to 2.08 seconds.

> > There is *no* benefit to cacheing
> > more than about 2 pages, under this workload.
>
> Sure, we could do better things with the large streaming files, although
> the risk of accidentally screwing up particular workloads is high.

Yeah, I agree. For example, I've occasionally used cat(1) or wc(1) to
prefetch files that I knew I was going to be accessing randomly; with my
hypothetical "sequential IO doesn't cause cacheing" it would be much
harder to do effective manual prefetching.

> But the use-once logic which we have in there at present does handle these
> cases quite well.

Where is the use-once logic available? Is it in mainstream 2.6 or only
in some development branches? I've not upgraded from 2.4 mostly because
I didn't see much benefit evident in the discussions, but improved
paging logic would be nice.

> > But with current kernels,
> > IME, that workload results in a gargantuan buffer cache and lots of
> > swapout of apps I was using 3 minutes ago. I've taken to walking away
> > for some coffee, coming back when it's done, and "sudo swapoff
> > /dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
> > trying to use bloaty apps.
>
> What kernel, what system specs, what swappiness setting?

2.4.25, Athlon XP 2 GHz, 512MB. I suppose you're not terribly
interested in 2.4. I'll see if I can reasonably upgrade, if you can
tell me what I should upgrade to for the good stuff.

-andy

2004-04-29 22:48:46

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Timothy wrote:
> Linux kernel developers seem to be of the mind that you cannot trust
> what applications tell you about themselves, so it's better to use
> heuristics to GUESS how to schedule something, rather than to add YET
> ANOTHER property to it.

Both are needed. The thing has to work pretty well, for most people,
most of the time, without human intervention.

And there need to be knobs to optimize performance. Even with no
conscious end-user administration, a knob on the cron job that runs
updatedb, set up by the distribution packager, could have widespread
impact on the responsiveness of a system, when the user sits down with
the first cup of coffee to scan the morning headlines and incoming
email er eh spam.

As to whether it's two nice calls, or one with dual effect, let's not
confuse the kernel API with that seen by the user. The kernel should
provide a minimum spanning set of orthogonal mechanisms, and not be
second guessing whether the user is out of their ever loving mind to be
asking for a hot-cpu, cold-io job.

In other words, I wouldn't agree with your take that it's a matter of
not trusting the application, better to GUESS. Rather I would say that
there is a preference, and a good one at that, to not use an excessive
number of knobs as a cop-out to avoid working hard to get the widest
practical range of cases to behave reasonably, without intervention, and
a preference to keep what knobs are there short, sweet and
minimally interacting.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-29 22:54:54

by Steve Youngs

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

* David B Stevens <[email protected]> writes:

> Maybe the kernel should be told by the apps exactly what they
> require in the way of memory

So what happens when Mr BloatyApp says: "Yo, Mr Kernel, gimme all ya
got baby!"

--
|---<Steve Youngs>---------------<GnuPG KeyID: A94B3003>---|
| Ashes to ashes, dust to dust. |
| The proof of the pudding, is under the crust. |
|----------------------------------<[email protected]>---|

2004-04-29 23:04:43

by Timothy Miller

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell



Paul Jackson wrote:

>
> In other words, I wouldn't agree with your take that it's a matter of
> not trusting the application, better to GUESS.

Okay.

> Rather I would say that
> there is a preference, and a good one at that, to not use an excessive
> number of knobs as a cop-out to avoid working hard to get the widest
> practical range of cases to behave reasonably, without intervention, and
> a preference to keep what knobs that are there short, sweet and
> minimally interacting.
>

Agreed. And this is why I suggested not adding another knob but rather
going with the existing nice value.

Mind you, this shouldn't necessarily be done without some kind of
experimentation. Put two knobs in the kernel and try varying them
relative to each other to see what sorts of jobs, if any, would benefit
from a disparity between cpu-nice and io-nice. If there IS a significant
difference, then add the extra knob. If there isn't, then don't.

Another possibility would be to have one knob that controls cpu-nice,
and another knob that controls io-nice minus cpu-nice, so if you REALLY
want to make them different, you can, but typically, they are set to be
the same.

2004-04-29 23:17:49

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andy Isaacson <[email protected]> wrote:
>
> > What kernel version are you using? If 2.6, what value of
> > /proc/sys/vm/swappiness?
>
> 2.4.various, including 2.4.25 and 2.4.26. I haven't taken the 2.6
> plunge yet.

OK. Please try 2.6 and let us know how it changes things.

> > I just did a 4GB `dd if=/dev/sda of=/x bs=1M' on a 1GB 2.6.6-rc2-mm2
> > swappiness=85 machine here and there was no swapout at all.
> >
> > Probably your machine has less memory. But without real, hard details
> > nothing can be done.
>
> I'm pleased to hear that 2.6 is apparently better behaved. In your
> test, what was the impact on the file cache?

It will have munched everything else.

> It's a big improvement to
> not be paging out to swap, but it's also important that sequential IO
> not evict my cached build tree.

Yup. We don't have any large-streaming-file heuristics in there.

It is the case that if you have recently accessed a file *twice* then its
pages will be preferred over the large streaming file. But if you've
accessed the valuable file only once, the streaming I/O will evict it.
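
[Editorial sketch, not part of the original mail: a toy user-space model of
the two-list behaviour described above, heavily simplified from the kernel's
actual code. A page touched once waits on the inactive list and is reclaimed
when the scan reaches it; a page touched a second time is promoted to the
active list, which is why twice-accessed files survive a streaming read. All
names are made up for illustration.]

#include <stdbool.h>

struct toy_page {
	bool referenced;	/* touched again since entering the list? */
	bool active;		/* promoted to the active list? */
};

/* First read: the page enters the inactive list, unreferenced. */
static void first_touch(struct toy_page *p)
{
	p->referenced = false;
	p->active = false;
}

/* Any later touch just sets the referenced bit. */
static void touch_again(struct toy_page *p)
{
	p->referenced = true;
}

/* Reclaim, at the tail of the inactive list: evict use-once pages,
 * promote pages that were referenced a second time. */
static bool try_to_evict(struct toy_page *p)
{
	if (p->referenced) {
		p->referenced = false;
		p->active = true;	/* used twice: keep it */
		return false;
	}
	return true;			/* used once: streaming data, evict */
}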

> > > There is *no* benefit to cacheing
> > > more than about 2 pages, under this workload.
> >
> > Sure, we could do better things with the large streaming files, although
> > the risk of accidentally screwing up particular workloads is high.
>
> Yeah, I agree. For example, I've occasionally used cat(1) or wc(1) to
> prefetch files that I knew I was going to be accessing randomly; with my
> hypothetical "sequential IO doesn't cause cacheing" it would be much
> harder to do effective manual prefetching.

We have a new syscall in 2.6 (fadvise) with which an app can provide hints
about its access patterns and its desired cache usage. So, for example,
tar or rsync or whatever could (if told to do so by the user) deliberately
throw away the pagecache after having accessed the file. But that does
require application modifications. They're pretty simple though:

+	if (user said to throw away the cache)
+		posix_fadvise64(fd, 0, -1, POSIX_FADV_DONTNEED);
	close(fd);
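
[Editorial sketch, not part of the original mail: the snippet above fleshed
out with the portable posix_fadvise() wrapper; the helper name and flag are
made up. POSIX defines len == 0 as "all data following offset", but kernels
without the len fix that comes up later in this thread treated 0 literally,
which is why the snippet above passes -1 instead.]

#include <fcntl.h>
#include <unistd.h>

/* Copy a file to stdout, then optionally drop its pagecache footprint
 * before closing, as a tar/rsync-style tool might when asked to. */
static int copy_and_maybe_drop(int fd, int drop_cache)
{
	char buf[65536];
	ssize_t n;

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		if (write(STDOUT_FILENO, buf, n) != n)
			return -1;
	if (drop_cache)
		posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	return close(fd);
}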

> > But the use-once logic which we have in there at present does handle these
> > cases quite well.
>
> Where is the use-once logic available? Is it in mainstream 2.6 or only
> in some development branches? I've not upgraded from 2.4 mostly because
> I didn't see much benefits evident in the discussions, but improved
> paging logic would be nice.

It's in 2.4 also. I think 2.4 tends to do the wrong thing because dirty
pages easily make it to the tail of the VM LRU's, which eventually causes
the VM to go off and hunt down mapped pages instead. 2.6 takes more care
to prevent dirty pages from hitting the tail of the LRU.

> > > But with current kernels,
> > > IME, that workload results in a gargantuan buffer cache and lots of
> > > swapout of apps I was using 3 minutes ago. I've taken to walking away
> > > for some coffee, coming back when it's done, and "sudo swapoff
> > > /dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
> > > trying to use bloaty apps.
> >
> > What kernel, what system specs, what swappiness setting?
>
> 2.4.25, Athlon XP 2 GHz, 512MB. I suppose you're not terribly
> interested in 2.4. I'll see if I can reasonably upgrade, if you can
> tell me what I should upgrade to for the good stuff.

2.6.6-rc3 would be suitable.

2004-04-30 00:04:11

by Andy Isaacson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 02:57:25PM -0700, Andrew Morton wrote:
> > When I'm working late, I want my updatedb/backup jobs
> > to scrunch themselves into a corner, even as my builds
> > and gui desktop continue to fly and suck up RAM.
>
> Sure. That's not purely a cacheing thing though. Even if the background
> activity was clamped to just a few megs of cache you'll find that the
> seek activity is a killer, and needs a limitation mechanism, although
> the anticipatory scheduler helps here a lot.

I grant that in the updatedb case (or the backup case), the seeks are
going to suck and they're inherently on the same spindle as the user's
data, so there's no fixing it (short of a real "IO nice").

But in a related case, I have a background daemon that does a lot of IO
(mostly sequential, one page at a time read/modify/write of a multi-GB
file) to a filesystem on a separate spindle from my main filesystems.
I'd like to use a similar mechanism to say "don't let this program eat
my pagecache" that will let the daemon crunch away without severely
impacting my desktop work.

-andy

2004-04-30 00:14:15

by Lincoln Dale

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

At 02:51 AM 30/04/2004, Andy Isaacson wrote:
>What I want is for purely sequential workloads which far exceed cache
>size (dd, updatedb, tar czf /backup/home.nightly.tar.gz /home) to avoid
>thrashing my entire desktop out of memory. I DON'T CARE if the tar
>completed in 45 minutes rather than 80. (It wouldn't, anyways, because
>it only needs about 5 MB of cache to get every bit of the speedup it was
>going to get.) But the additional latency when I un-xlock in the
>morning is annoying, and there is no benefit.
>
>For a more useful example, ideally I *should not be able to tell* that
>"dd if=/hde1 of=/hdf1" is running. [1] There is *no* benefit to cacheing
>more than about 2 pages, under this workload. But with current kernels,
>IME, that workload results in a gargantuan buffer cache and lots of
>swapout of apps I was using 3 minutes ago. I've taken to walking away
>for some coffee, coming back when it's done, and "sudo swapoff
>/dev/hda3; sudo swapon -a" to avoid the latency that is so annoying when
>trying to use bloaty apps.

the mechanism already exists; teach tar/dd and any other app whose data you
don't want polluting the page-cache to use O_DIRECT.

i suspect updatedb is a different case as it's probably filling the system
with dcache/inode entries.
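
[Editorial sketch, not part of the original mail: a minimal illustration of
the O_DIRECT route. On Linux, O_DIRECT needs _GNU_SOURCE, and the buffer,
file offset and transfer size must be aligned; 512-byte alignment is assumed
here, where real code should query the device's logical block size. The
function name is made up.]

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a file sequentially without going through the pagecache. */
static int slurp_direct(const char *path)
{
	void *buf;
	ssize_t n;
	int fd = open(path, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	if (posix_memalign(&buf, 512, 65536)) {	/* aligned buffer */
		close(fd);
		return -1;
	}
	while ((n = read(fd, buf, 65536)) > 0)
		;	/* ... process buf ... */
	free(buf);
	close(fd);
	return n < 0 ? -1 : 0;
}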


cheers,

lincoln.

2004-04-30 00:34:47

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andy Isaacson <[email protected]> wrote:
>
> But in a related case, I have a background daemon that does a lot of IO
> (mostly sequential, one page at a time read/modify/write of a multi-GB
> file) to a filesystem on a separate spindle from my main filesystems.
> I'd like to use a similar mechanism to say "don't let this program eat
> my pagecache" that will let the daemon crunch away without severely
> impacting my desktop work.

fadvise(POSIX_FADV_DONTNEED) is ideal for this. Run it once per megabyte
or so.
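
[Editorial sketch, not part of the original mail: the once-per-megabyte
drop-behind loop Andrew suggests, as the daemon's read path might implement
it. Names are illustrative; note that dirty pages have to be written back
before DONTNEED can actually free them.]

#include <fcntl.h>
#include <unistd.h>

#define DROP_CHUNK	(1024 * 1024)	/* hint once per megabyte */

/* Stream through fd, asking the kernel to discard each megabyte of
 * pagecache once we have finished with it. */
static int stream_and_drop(int fd)
{
	char buf[65536];
	off_t done = 0, mark = 0;
	ssize_t n;

	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		/* ... process buf ... */
		done += n;
		if (done - mark >= DROP_CHUNK) {
			posix_fadvise(fd, mark, done - mark,
				      POSIX_FADV_DONTNEED);
			mark = done;
		}
	}
	return n < 0 ? -1 : 0;
}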

2004-04-30 01:02:15

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew wrote:
> fadvise(POSIX_FADV_DONTNEED) is ideal for this.

Perhaps ... perhaps not.

Just as the knobs "only reclaim pagecache" and "reclaim vfs caches
harder" had too big a scope (system-wide), using fadvise might have too
small a scope (currently cached pages of current task only).

If his background daemon is some shell script, say, that uses 'cat' to
generate the i/o to the other spindle, then he probably wants to be
marking that daemon job "don't let this entire job eat my pagecache",
not rebuilding a hacked up cat command with added POSIX_FADV_DONTNEED
calls every megabyte.

CKRM to the rescue ... ??

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-30 03:01:03

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Rik van Riel wrote:
> On Thu, 29 Apr 2004, Nick Piggin wrote:
>
>
>>I'm not very impressed with the pagecache use-once logic, and I
>>have a patch to remove it completely and treat non-mapped touches
>>(IMO) more sanely.
>
>
> The basic idea of use-once isn't bad (search for LIRS and
> ARC page replacement), however the Linux implementation
> doesn't have any of the checks and balances that the
> researched replacement algorithms have...
>
> However, adding the checks and balances required for LIRS,
> ARC and CAR(S) isn't easy since it requires keeping track of
> a number of recently evicted pages. That could be quite a
> bit of infrastructure, though it might be well worth it.
>

No, use once logic is good in theory I think. Unfortunately
our implementation is quite fragile IMO (although it seems
to have been "good enough").

This is what I'm currently doing (on top of a couple of other
patches, but you get the idea). I should be able to transform
it into a proper use-once logic if I pick up Nikita's inactive
list second chance bit.


Attachments:
vm-dropbehind.patch (3.54 kB)

2004-04-30 03:20:29

by Tim Connors

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Adam Kropelin <[email protected]> said on Thu, 29 Apr 2004 14:14:13 -0400:
> On Wed, Apr 28, 2004 at 09:47:45PM -0400, Rik van Riel wrote:
> > On Wed, 28 Apr 2004, Andrew Morton wrote:
> >
> > > OK, so it takes four seconds to swap mozilla back in, and you noticed it.
> > >
> > > Did you notice that those three kernel builds you just did ran in twenty
> > > seconds less time because they had more cache available? Nope.
> >
> > That's exactly why desktops should be optimised to give
> > the best performance where the user notices it most...
...
> The 'swappiness' tunable may well give enough control over the situation
> to suit all sorts of users. If nothing else, this thread has raised
> awareness that such a tunable exists and can be played with to influence
> the kernel's decision-making. Distros, too, should give consideration to
> appropriate default settings to serve their intended users.

Actually, I decided to investigate how 2.4 compares (we're still stuck
on 2.4)

According to this:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0210.1/0011.html

2.6 with swappiness of 0% is the same as 2.4.19 - I assume 2.4.19's VM is
the same as 2.4.26 (given feature freeze).

I have always been completely unimpressed with the 2.4 VM - before and
after the big change in ~2.4.10. It has *always* preferred cache
over a recently used application.

So will this still apply to 2.6 with swappiness of 0%?

I might try to get my sysadmin to put on 2.6, because 2.4 is quite
unusable for some of the work I do (if I need mozilla at the same time
as my visualisation software, which allocates a good 3/4 of RAM, after
reading a file that is about that size, leaving still enough for
mozilla and X combined, mozilla and parts of X still get swapped out -
and the cache is wasted, since I only ever read the file once, and it
is written on another host)

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
"32-bit patch for a 16-bit GUI shell running on top of an
8-bit operating system written for a 4-bit processor by a
2-bit company who cannot stand 1 bit of competition."

2004-04-30 03:41:03

by Tim Connors

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Timothy Miller <[email protected]> said on Thu, 29 Apr 2004 18:18:06 -0400:
> Paul Jackson wrote:
> > Timothy wrote:
> >
> >>Perhaps nice level could influence how much a process is allowed to
> >>affect page cache.
> >
> >
> > I'm from the school that says 'nice' applies to scheduling priority,
> > not memory usage.
> >
> > I'd expect a different knob, a per-task inherited value as is 'nice',
> > to control memory usage.
>
> Linux kernel developers seem to be of the mind that you cannot trust
> what applications tell you about themselves, so it's better to use
> heuristics to GUESS how to schedule something, rather than to add YET
> ANOTHER property to it.

Why is that?

On the desktop system/workstation, which is what we are talking about
here -- we want the desktop system in particular to be responsive --
the user wouldn't try to do anything malicious, so why not trust the
applications? openoffice and mozilla and my visualisation software are
going to know what they want out of the kernel (possibly with
safeguards such that they only tell the kernel what they want if the
kernel happens to be in some tested range, perhaps), the kernel sure
as hell won't know what my custom built application wants via
heuristics, because I am doing something that no-one else is, and so
my exact workloads haven't been experienced or designed for.

On a server, you can have a /proc file to tell the kernel to ignore
everything an application tells you, or ignore/believe applications
with uids in ranges xx--yy.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Beware of Programmers who carry screwdrivers.

2004-04-30 04:08:11

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Marc Singer wrote:
> On Thu, Apr 29, 2004 at 02:26:17PM +1000, Nick Piggin wrote:
>
>>Yes it includes something which should help that. Along with
>>the "split active lists" that I mentioned might help your
>>problem when WLI first came up with the change to the
>>swappiness calculation for your problem.
>>
>>It would be great if you had time to give my patch a run.
>>It hasn't been widely stress tested yet though, so no
>>production systems, of course!
>
>
> As I said, I'm game to have a go. The trouble was that it doesn't
> apply. My development kernel has an RMK patch applied that seems to
> conflict with the MM patch on which you depend.
>

You would probably be better off trying a simpler change
first actually:

in mm/vmscan.c, shrink_list(), change:

	if (res == WRITEPAGE_ACTIVATE) {
		ClearPageReclaim(page);
		goto activate_locked;
	}

to

	if (res == WRITEPAGE_ACTIVATE) {
		ClearPageReclaim(page);
		goto keep_locked;
	}

I think it is not the correct solution, but should narrow
down your problem. Let us know how it goes.

2004-04-30 05:15:55

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Horst von Brand wrote:
> Nick Piggin <[email protected]> said:
>
> [...]
>
>
>>I don't know. What if you have some huge application that only
>>runs once per day for 10 minutes? Do you want it to be consuming
>>100MB of your memory for the other 23 hours and 50 minutes for
>>no good reason?
>
>
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?


It doesn't know that.

But if you restrict this guy's working set to a tiny amount
and just allow it to thrash away, then if nothing else, all
that wasted disk IO will slow all your other stuff down too.

However that is something we can allow you to tune, via RSS
limits. I am maintaining Rik's patch for that and will send
it on when rmap optimisation work is more finalised.

2004-04-30 05:38:39

by Andy Isaacson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Thu, Apr 29, 2004 at 05:54:42PM -0700, Paul Jackson wrote:
> Andrew wrote:
> > fadvise(POSIX_FADV_DONTNEED) is ideal for this.
>
> Perhaps ... perhaps not.
>
> Just as the knobs "only reclaim pagecache" and "reclaim vfs caches
> harder" had too big a scope (system-wide), using fadvise might have too
> small a scope (currently cached pages of current task only).
>
> If his background daemon is some shell script, say, that uses 'cat' to
> generate the i/o to the other spindle, then he probably wants to be
> marking that daemon job "don't let this entire job eat my pagecache",
> not rebuilding a hacked up cat command with added POSIX_FADV_DONTNEED
> calls every megabyte.

Well, in this case it's bespoke C code so adding the fadvise isn't
terribly difficult. (The structure of the code doesn't lend itself to
"do this every 10 MB" but I'm sure I can hack something up.)

It would be nicer if the kernel would do the right thing without needing
to have its hand held, but the fadvise will solve my immediate need.
(Assuming it works on 2.4.)

-andy

2004-04-30 06:00:56

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andy Isaacson wrote:
> On Thu, Apr 29, 2004 at 05:54:42PM -0700, Paul Jackson wrote:
>
>>Andrew wrote:
>>
>>>fadvise(POSIX_FADV_DONTNEED) is ideal for this.
>>
>>Perhaps ... perhaps not.
>>
>>Just as the knobs "only reclaim pagecache" and "reclaim vfs caches
>>harder" had too big a scope (system-wide), using fadvise might have too
>>small a scope (currently cached pages of current task only).
>>
>>If his background daemon is some shell script, say, that uses 'cat' to
>>generate the i/o to the other spindle, then he probably wants to be
>>marking that daemon job "don't let this entire job eat my pagecache",
>>not rebuilding a hacked up cat command with added POSIX_FADV_DONTNEED
>>calls every megabyte.
>
>
> Well, in this case it's bespoke C code so adding the fadvise isn't
> terribly difficult. (The structure of the code doesn't lend itself to
> "do this every 10 MB" but I'm sure I can hack something up.)
>
> It would be nicer if the kernel would do the right thing without needing
> to have its hand held, but the fadvise will solve my immediate need.
> (Assuming it works on 2.4.)

Right for one thing will always be wrong for another.
If you want some specific behaviour then you might
have to hold hands. That is just the way it goes.

2004-04-30 06:21:39

by Tim Connors

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Horst von Brand <[email protected]> said on Thu, 29 Apr 2004 16:01:11 -0400:
> Nick Piggin <[email protected]> said:
>
> [...]
>
> > I don't know. What if you have some huge application that only
> > runs once per day for 10 minutes? Do you want it to be consuming
> > 100MB of your memory for the other 23 hours and 50 minutes for
> > no good reason?
>
> How on earth is the kernel supposed to know that for this one particular
> job you don't care if it takes 3 hours instead of 10 minutes, just because
> you don't want to spare enough preciousss RAM?

Note that we are not talking about having insufficient memory. In my
case (2.4 kernel - ie, 2.6 with swappiness 0%) there is more than
enough memory to contain all my working set - it's only because cache
is too eager to claim memory that is otherwise in use that
non-optimalities occur.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
If I'd known computer science was going to be like this, I'd never have
given up being a rock 'n' roll star. -- G. Hirst

2004-04-30 06:34:59

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Tim Connors wrote:
> Horst von Brand <[email protected]> said on Thu, 29 Apr 2004 16:01:11 -0400:
>
>>Nick Piggin <[email protected]> said:
>>
>>[...]
>>
>>
>>>I don't know. What if you have some huge application that only
>>>runs once per day for 10 minutes? Do you want it to be consuming
>>>100MB of your memory for the other 23 hours and 50 minutes for
>>>no good reason?
>>
>>How on earth is the kernel supposed to know that for this one particular
>>job you don't care if it takes 3 hours instead of 10 minutes, just because
>>you don't want to spare enough preciousss RAM?
>
>
> Note that we are not talking about having insufficient memory. In my
>>case (2.4 kernel - ie, 2.6 with swappiness 0%) there is more than
> enough memory to contain all my working set - it's only because cache
> is too eager to claim memory that is otherwise in use that
> non-optimalities occur.
>

Well, it depends on what you mean by working set.

In our memory manager, there is a point where often used
"file cache" (ie. unmapped cache) is considered preferable
to unused or little used "application memory" (mapped
memory).

There will be a point where even the most swap phobic desktop
users will want to start swapping.

I missed the description of your exact problem... was it in
this thread somewhere? Testing 2.6 would be appreciated if
possible too.

2004-04-30 07:06:57

by Tim Connors

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Fri, 30 Apr 2004, Nick Piggin wrote:

> Tim Connors wrote:
> > Horst von Brand <[email protected]> said on Thu, 29 Apr 2004 16:01:11 -0400:
> >
> >>Nick Piggin <[email protected]> said:
> >>
> >>[...]
> >>
> >>
> >>>I don't know. What if you have some huge application that only
> >>>runs once per day for 10 minutes? Do you want it to be consuming
> >>>100MB of your memory for the other 23 hours and 50 minutes for
> >>>no good reason?
> >>
> >>How on earth is the kernel supposed to know that for this one particular
> >>job you don't care if it takes 3 hours instead of 10 minutes, just because
> >>you don't want to spare enough preciousss RAM?
> >
> >
> > Note that we are not talking about having insufficient memory. In my
> > case (2.4 kernel - ie, 2.6 with swappiness 0%) there is more than
> > enough memory to contain all my working set - it's only because cache
> > is too eager to claim memory that is otherwise in use that
> > non-optimalities occur.
> >
>
> Well depends on what you mean by working set.
>
> In our memory manager, there is a point where often used
> "file cache" (ie. unmapped cache) is considered preferable
> to unused or little used "application memory" (mapped
> memory).

Sure - and indeed I have current swap usage (now that I am not doing
anything) of 300MB - that's good because I am not using whatever's in
there.

> I missed the description of your exact problem... was it in
> this thread somewhere? Testing 2.6 would be appreciated if
> possible too.

http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1033.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1394.html

In short: I have 512MB RAM. The files I am reading are read over NFS,
created remotely. I read them once, and then discard them (either delete
them, or keep them around, but don't read them again -- obviously, in the
latter case, they have an opportunity to pollute the cache for a long
time). If I do read them twice, it's only been a few times that I have
noticed any speedup the second time around, even for the smaller files (if
they're small enough not to cause a problem with swapping vital bits of
software out, then they are small enough that extra reads are hardly
noticeable anyway - given the damn fast RAID disks behind NFS we have)

For one of the file types, these can be several hundred megs, and are read
by an astronomical package - I have no idea how they are read, but because
it is FITS, maybe it has to go back and read the header several times, but
I doubt the image data is read more than once. Come display time, memory
usage in this example is roughly the size of the FITS file - several
hundred megs. I don't recall how big X is, in such a situation, but the
sum of them both, plus recently used apps like mozilla, is below RAM size.
Watching top, I can see, during the read, mozilla rsize memory usage go
down rapidly - about as rapidly as cache usage goes up.

Parts of X and the window manager also get swapped out, so when I move to
another virtual page, I get to watch fvwm redraw the screen - this is not
too painful though, because only a few megs need be swapped back in
(although the HD is seeking all over the place as things thrash about, so
it does still take non-negligible amount of time). Mozilla takes about 30
seconds to swap back in (~50-100MB - again, lots of thrashing from the
HD). I don't recall once mozilla is swapped back in, whether cache usage
has dropped again, or whether the visualisation software loses its pages
instead.

I'll try to test 2.6 here (half the battle is convincing the sysadmins
this is a worthwhile pursuit) - I use that at home quite successfully, but
I don't do big files or visualisation there.

--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
'It's amazing I won. I was running against peace, prosperity and incumbency.'
-- George W. Bush. June 14, 2001, to Swedish PM Goran Perrson,
unaware that a live television camera was still rolling.

2004-04-30 07:52:50

by Jeff Garzik

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton wrote:
> Andy Isaacson <[email protected]> wrote:
>
>> But in a related case, I have a background daemon that does a lot of IO
>> (mostly sequential, one page at a time read/modify/write of a multi-GB
>> file) to a filesystem on a separate spindle from my main filesystems.
>> I'd like to use a similar mechanism to say "don't let this program eat
>> my pagecache" that will let the daemon crunch away without severely
>> impacting my desktop work.
>
>
> fadvise(POSIX_FADV_DONTNEED) is ideal for this. Run it once per megabyte
> or so.


Sweet. I'm so happy you added posix_fadvise (way back when), and even
happier to hear this.

Does our fadvise support len==0 ("I mean the whole file")? That's
defined in POSIX, and would allow a compliant app to simply
POSIX_FADV_DONTNEED once at the beginning.

Jeff



2004-04-30 08:03:07

by Andrew Morton

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Jeff Garzik <[email protected]> wrote:
>
> > fadvise(POSIX_FADV_DONTNEED) is ideal for this. Run it once per megabyte
> > or so.
>
>
> Sweet. I'm so happy you added posix_fadvise (way back when), and even
> happier to hear this.

There are a number of other goodies we could add to it, as linux
extensions.

> Does our fadvise support len==0 ("I mean the whole file")? That's
> defined in POSIX, and would allow a compliant app to simply
> POSIX_FADV_DONTNEED once at the beginning.

Well I'll be darned.

--- 25/mm/fadvise.c~fadvise-len-fix 2004-04-30 00:58:00.437598504 -0700
+++ 25-akpm/mm/fadvise.c 2004-04-30 00:59:03.237051536 -0700
@@ -38,6 +38,9 @@ asmlinkage long sys_fadvise64_64(int fd,
 		goto out;
 	}
 
+	if (len == 0)	/* 0 == "all data following offset" */
+		len = -1;
+
 	bdi = mapping->backing_dev_info;
 
 	switch (advice) {

_
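
[Editorial note, not part of the original mail: with the fix above applied,
a compliant application can issue the whole-file hint Jeff describes in a
single call; the helper name below is made up.]

#include <fcntl.h>

/* len == 0 means "all data following offset" per POSIX, so this drops
 * the file's entire cached range.  It is a one-shot hint acting on
 * pages cached at the time of the call, not a persistent policy. */
int drop_whole_file(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}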

2004-04-30 08:10:10

by Jeff Garzik

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Andrew Morton wrote:
> Jeff Garzik <[email protected]> wrote:
>> Does our fadvise support len==0 ("I mean the whole file")? That's
>> defined in POSIX, and would allow a compliant app to simply
>> POSIX_FADV_DONTNEED once at the beginning.
>
>
> Well I'll be darned.


FWIW the specific language is "If len is zero, all data following offset
is specified."

(for others, you probably already have this somewhere)
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

top level SuSv3:
http://www.opengroup.org/onlinepubs/007904975/toc.htm

Jeff



2004-04-30 08:35:47

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Tim Connors wrote:
> On Fri, 30 Apr 2004, Nick Piggin wrote:
>
>>In our memory manager, there is a point where often used
>>"file cache" (ie. unmapped cache) is considered preferable
>>to unused or little used "application memory" (mapped
>>memory).
>
>
> Sure - and indeed I have current swap usage (now that I am not doing
> anything) of 300MB - that's good because I am not using whatever's in
> there.
>
>
>>I missed the description of your exact problem... was it in
>>this thread somewhere? Testing 2.6 would be appreciated if
>>possible too.
>
>
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1033.html
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0404.3/1394.html
>
> In short: I have 512MB RAM. The files I am reading are read over NFS,

Ah, thanks for the description.

2.6 has a problem with NFS filesystems that would cause symptoms
like yours. I'm not sure whether 2.4 has something similar or not.
You can probably expect a fix for 2.6.6 but I'm not sure if there
is a patch that has been agreed upon yet.

In short, there probably isn't much point testing 2.6 right now.

2004-04-30 09:19:20

by Denis Vlasenko

[permalink] [raw]
Subject: Re[2]: ~500 megs cached yet 2.6.5 goes into swap hell

Hello Tim,

Friday, April 30, 2004, 10:05:19 AM, you wrote:
TC> Parts of X and the window manager also get swapped out, so when I move to
TC> another virtual page, I get to watch fvwm redraw the screen - this is not
TC> too painful though, because only a few megs need be swapped back in
TC> (although the HD is seeking all over the place as things thrash about, so
TC> it does still take non-negligible amount of time). Mozilla takes about 30
TC> seconds to swap back in (~50-100MB - again, lots of thrashing from the

I don't want to say that you're seeing optimal behavior,
just a different angle of view: why in hell should a browser
have such a ridiculously large RSS? Why does it try to keep
so much stuff in RAM?

Multimedia content (jpegs etc) is typically cached in
filesystem, so Mozilla polluted pagecache with it when
it saved JPEGs to the cache *and* then it keeps 'em in RAM
too, which doubles RAM usage. Most probably more; there
are severe internal fragmentation problems after you use
such a large application for several hours straight.
Why not reread a JPEG whenever you need it? If you are
using Mozilla right now, it will be in the pagecache.
When you are away, the cache will be discarded, with no need
to page out Mozilla pages with JPEG content - because
there aren't Mozilla pages with JPEG content!
RSS is smaller, less internal fragmentation, everyone's
happy.

(I don't specifically target Mozilla, it has shown some
improvement recently. Replace with your favorite
monstrosity)

Kernel folks can probably improve kernel behavior.
The next version of $BloatyApp will happily "use" the
gained performance and improved RAM management
as an excuse for even less optimal code.

It's a vicious circle.
--
vda


2004-04-30 09:34:25

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Re[2]: ~500 megs cached yet 2.6.5 goes into swap hell


> Multimedia content (jpegs etc) is typically cached in
> filesystem, so Mozilla polluted pagecache with it when
> it saved JPEGs to the cache *and* then it keeps 'em in RAM
> too, which doubles RAM usage.

well if mozilla just mmap's the jpegs there is no double caching .....


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2004-04-30 11:34:13

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Friday 30 April 2004 12:33, Arjan van de Ven wrote:
> > Multimedia content (jpegs etc) is typically cached in
> > filesystem, so Mozilla polluted pagecache with it when
> > it saved JPEGs to the cache *and* then it keeps 'em in RAM
> > too, which doubles RAM usage.
>
> well if mozilla just mmap's the jpegs there is no double caching .....

I may be wrong, but Mozilla keeps the unpacked bitmap in malloc() space.
The point is, $BloatyApp will keep bloating up while you
are working on improving the kernel. I guess it's very clear which
process is easier. You cannot win that race.

This is OpenOffice on an idle 128MB RAM, 1000MHz Duron machine with KDE,
Mozilla and KMail running:

# time swriter;time swriter

real 0m33.906s
user 0m10.163s
sys 0m0.705s

real 0m24.025s
user 0m10.069s
sys 0m0.546s

I closed the window as soon as it appeared.

Freshly started swriter in top:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
2081 root 15 0 93980 41M 80300 S 1,3 34,0 0:09 0 soffice.bin

93 megs. 10 seconds of 1GHz CPU time taken...
--
vda

2004-04-30 12:32:39

by Bart Samwel

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Fri, 2004-04-30 at 01:08, Timothy Miller wrote:
> Agreed. And this is why I suggested not adding another knob but rather
> going with the existing nice value.
>
> Mind you, this shouldn't necessarily be done without some kind of
> experimentation. Put two knobs in the kernel and try varying them
> relative to each other to see what sorts of jobs, if any, would benefit
> from a disparity between cpu-nice and io-nice. If there IS a significant
> difference, then add the extra knob. If there isn't, then don't.

Thought experiment: what would happen when you set the hypothetical
cpu-nice and io-nice knobs very differently?

* cpu-nice 20, io-nice -20: Read I/O will finish immediately, but then
the process will have to wait for ages to get a CPU slice to process the
data, so why would you want to read it so quickly? The process can do as
much write I/O as it wants, but why is it not okay to take ages to write
the data if it's okay to take ages to produce it?

* cpu-nice -20, io-nice 20: Read I/O will take ages, but once the data
gets there, the processor is immediately taken to process the data as
fast as possible. If it was okay to take ages to read the data, why
would you want to process it as soon as you can? It makes some sense for
write I/O though: produce data as fast as the other processes will allow
you to write it. But if you're going to hog the CPU completely, why give
other processes the chance to do a lot of I/O while they don't get the
CPU time to submit any I/O? Going for a smaller difference makes more
sense.

As far as I can tell, giving the knobs very different values doesn't
make much sense. The same arguments go for medium-sized differences. And
if we're going to give the knobs only *slightly* different values, we
might as well make them the same. If we really need cpu-nice = 0 and
io-nice = 3 somewhere, then I think that's a sign of a kernel problem,
where the kernel's various nice-knobs aren't calibrated correctly to
result in the same amount of "niceness" when they're set to the same
value. And cpu-nice = io-nice = 3 would probably have about the same
effect.


BTW, if there *are* going to be more knobs, I suggest adding
"memory-nice" as well. :) If you set memory-nice to 20, then the process
will not kick out much memory from other processes (it will require more
I/O -- but that can be throttled using io-nice). If you set memory-nice
to -20, then the process will kick out the memory of all other processes
if it needs to.

--Bart

2004-04-30 12:53:04

by Rik van Riel

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Fri, 30 Apr 2004, Nick Piggin wrote:
> Rik van Riel wrote:

> > The basic idea of use-once isn't bad (search for LIRS and
> > ARC page replacement), however the Linux implementation
> > doesn't have any of the checks and balances that the
> > researched replacement algorithms have...

> No, use once logic is good in theory I think. Unfortunately
> our implementation is quite fragile IMO (although it seems
> to have been "good enough").

Hey, that's what I said ;))))

> This is what I'm currently doing (on top of a couple of other
> patches, but you get the idea). I should be able to transform
> it into a proper use-once logic if I pick up Nikita's inactive
> list second chance bit.

Ummm nope, there just isn't enough info to keep things
as balanced as ARC/LIRS/CAR(T) can do. No good way to
auto-tune the sizes of the active and inactive lists.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-30 13:18:45

by Nikita Danilov

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Rik van Riel writes:
> On Fri, 30 Apr 2004, Nick Piggin wrote:
> > Rik van Riel wrote:
>
> > > The basic idea of use-once isn't bad (search for LIRS and
> > > ARC page replacement), however the Linux implementation
> > > doesn't have any of the checks and balances that the
> > > researched replacement algorithms have...
>
> > No, use once logic is good in theory I think. Unfortunately
> > our implementation is quite fragile IMO (although it seems
> > to have been "good enough").
>
> Hey, that's what I said ;))))
>
> > This is what I'm currently doing (on top of a couple of other
> > patches, but you get the idea). I should be able to transform
> > it into a proper use-once logic if I pick up Nikita's inactive
> > list second chance bit.
>
> Ummm nope, there just isn't enough info to keep things
> as balanced as ARC/LIRS/CAR(T) can do. No good way to
> auto-tune the sizes of the active and inactive lists.

While keeping "history" for non-resident pages is very good from many
points of view (it provides infrastructure for local replacement and
working-set tuning, for example) and in the long term, the current
scanner can still be improved somewhat in the meantime.

Here are results that I obtained some time ago. Test is to concurrently
clone (bk) and build (make -jN) kernel source in M directories.

For N = M = 11, TIMEFORMAT='%3R %3S %3U'

                                           REAL      SYS      USER
"stock"                                3818.320  568.999  4358.460
transfer-dirty-on-refill               3368.690  569.066  4377.845
check-PageSwapCache-after-add-to-swap  3237.632  576.208  4381.248
dont-unmap-on-pageout                  3207.522  566.539  4374.504
async-writepage                        3115.338  562.702  4325.212

(check-PageSwapCache-after-add-to-swap was added to mainline since then.)

These patches weren't updated for some time. Last version is at
ftp://ftp.namesys.com/pub/misc-patches/unsupported/extra/2004.03.25-2.6.5-rc2

[from Nick Piggin's patch]
>
> Changes mark_page_accessed to only set the PageAccessed bit, and
> not move pages around the LRUs. This means we don't have to take
> the lru_lock, and it also makes page ageing and scanning consistent
> and all handled in mm/vmscan.c

By the way, the batch-mark_page_accessed patch at the URL above also tries
to reduce lock contention in mark_page_accessed(), but through the more
standard approach of batching target pages in a per-cpu pvec.

Nikita.

2004-04-30 13:39:35

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Nikita Danilov wrote:

> Here are results that I obtained some time ago. Test is to concurrently
> clone (bk) and build (make -jN) kernel source in M directories.
>
> For N = M = 11, TIMEFORMAT='%3R %3S %3U'
>
>                                            REAL      SYS      USER
> "stock"                                3818.320  568.999  4358.460
> transfer-dirty-on-refill               3368.690  569.066  4377.845
> check-PageSwapCache-after-add-to-swap  3237.632  576.208  4381.248
> dont-unmap-on-pageout                  3207.522  566.539  4374.504
> async-writepage                        3115.338  562.702  4325.212
>

I like your transfer-dirty-on-refill change. It is definitely
worthwhile to mark a page as dirty when it drops off the active
list in order to hopefully get it written before it reaches the
tail of the inactive list.

> (check-PageSwapCache-after-add-to-swap was added to mainline since then.)
>
> These patches weren't updated for some time. Last version is at
> ftp://ftp.namesys.com/pub/misc-patches/unsupported/extra/2004.03.25-2.6.5-rc2
>
> [from Nick Piggin's patch]
> >
> > Changes mark_page_accessed to only set the PageAccessed bit, and
> > not move pages around the LRUs. This means we don't have to take
> > the lru_lock, and it also makes page ageing and scanning consistent
> > and all handled in mm/vmscan.c
>
> By the way, the batch-mark_page_accessed patch at the URL above also tries
> to reduce lock contention in mark_page_accessed(), but through the more
> standard approach of batching target pages in a per-cpu pvec.
>

This is a good patch too if mark_page_accessed is required to
take the lock (which it currently is, of course).

2004-04-30 13:59:34

by Nick Piggin

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

Rik van Riel wrote:
> On Fri, 30 Apr 2004, Nick Piggin wrote:
>
>> Rik van Riel wrote:
>
>
>> > The basic idea of use-once isn't bad (search for LIRS and
>> > ARC page replacement), however the Linux implementation
>> > doesn't have any of the checks and balances that the
>> > researched replacement algorithms have...
>
>
>> No, use once logic is good in theory I think. Unfortunately
>> our implementation is quite fragile IMO (although it seems
>> to have been "good enough").
>
>
> Hey, that's what I said ;))))
>

Yes. I just thought you might have misunderstood me to
think use once is no good at all.

>> This is what I'm currently doing (on top of a couple of other
>> patches, but you get the idea). I should be able to transform
>> it into a proper use-once logic if I pick up Nikita's inactive
>> list second chance bit.
>
>
> Ummm nope, there just isn't enough info to keep things
> as balanced as ARC/LIRS/CAR(T) can do. No good way to
> auto-tune the sizes of the active and inactive lists.
>

I think perhaps it might be possible. I don't want to
discourage you from looking into more interesting replacement
schemes though. I don't doubt that our basic replacement
can often be suboptimal ;)

2004-04-30 15:35:44

by Clay Haapala

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Fri, 30 Apr 2004, Bart Samwel uttered the following:
>
> Thought experiment: what would happen when you set the hypothetical
> cpu-nice and io-nice knobs very differently?
>
Dunno why, but this talk of knobs makes me think of the "effects-mix"
knob on my bass amp that controls how much effects-loop signal is
mixed with the "dry" guitar signal.

Getting back to kernel talk, we have a "swappiness" knob, right?
Should there be, or is there already, a way to dynamically vary the
effect of swappiness [within a range], based on some monitored system
characteristics such as keyboard/mouse (HID) input or some other
identifiable profile? Perhaps this is similar to nice/fairness logic
in the process schedulers.

Using HID as a profile, if I'm up late working on a paper in OO and
using a browser like Mozilla when updatedb fires up, the fact that
recent keyboard/mouse input has been seen would modify
swappiness down.

However, if I've fallen asleep in my chair for an hour when updatedb
fires up, no recent input events will have been detected, and updatedb
gets the high range of swappiness effect. If I happen to wake up in
the middle of it, I just have to accept it'll take time to wake my
apps up, but at least they will get progressively more responsive as I
use 'em.

I use the term "profile" because I wouldn't want to have just HID
events be the trigger. If a machine's main use is database or
web-serving, perhaps the appropriate events to monitor would be, say,
traffic on specified TCP ports or network interfaces.

The amount of extra work should be no more than what goes on with
entropy generation, I would think.
--
Clay Haapala ([email protected]) Cisco Systems SRBU +1 763-398-1056
6450 Wedgwood Rd, Suite 130 Maple Grove MN 55311 PGP: C89240AD
"Oh, *that* Physics Prize. Well, I just substituted 'stupidity' for
'dark matter' in the equations, and it all came together."

2004-04-30 15:46:26

by Bart Samwel

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Fri, 2004-04-30 at 17:35, Clay Haapala wrote:
> On Fri, 30 Apr 2004, Bart Samwel uttered the following:
> >
> > Thought experiment: what would happen when you set the hypothetical
> > cpu-nice and io-nice knobs very differently?
> >
> Dunno why, but this talk of knobs makes me think of the "effects-mix"
> knob on my bass amp that controls how much effects-loop signal is
> mixed with the "dry" guitar signal.
>
> Getting back to kernel talk, we have a "swappiness" knob, right?
> Should there be, or is there already, a way to dynamically vary the
> effect of swappiness [within a range], based on some monitored system
> characteristics such as keyboard/mouse (HID) input or some other
> identifiable profile? Perhaps this is similar to nice/fairness logic
> in the process schedulers.

This kind of thing is exactly what interactivity boosts have been used
to avoid, and taking interactivity into account in an "io-nice" value
as well should solve that case. Other profiles might be interesting,
though.

Interactive tasks have a tendency to be interactive for a short while
and then noninteractive for a long time. I'm thinking it might be
worthwhile to do something with that, i.e. to keep a "past
interactivity" bonus on some pages, based on the fact that they were
originally loaded by a still-existing process that was once, or still
is, marked as interactive.

--Bart
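
To make the "past interactivity" bonus concrete, here is a purely
hypothetical sketch. No such flag exists in 2.6; the threshold, the
names, and the policy are all invented:

/* Hypothetical: tag pages faulted in by an interactive task, and
 * spend that tag as one free rotation at reclaim time. */
#include <stdio.h>
#include <stdbool.h>

struct toy_task {
    int sleep_avg;             /* stand-in for the scheduler's
                                  interactivity metric */
};

struct toy_page {
    bool referenced;
    bool interactive_bonus;    /* invented "loaded by an
                                  interactive task" mark */
};

#define INTERACTIVE_THRESHOLD 700   /* arbitrary */

/* Fault path: remember who brought the page in. */
static void on_fault(struct toy_page *page, struct toy_task *task)
{
    if (task->sleep_avg > INTERACTIVE_THRESHOLD)
        page->interactive_bonus = true;
}

/* Reclaim: an interactive page survives one extra scan, so a
 * sleeping user's desktop outlives one background streaming pass
 * (e.g. updatedb) without being referenced. */
static bool should_reclaim(struct toy_page *page)
{
    if (page->referenced) {
        page->referenced = false;
        return false;
    }
    if (page->interactive_bonus) {
        page->interactive_bonus = false;  /* bonus is used up */
        return false;
    }
    return true;
}

int main(void)
{
    struct toy_task editor = { 900 };
    struct toy_page page = { false, false };

    on_fault(&page, &editor);
    printf("first pass, reclaim? %d\n", should_reclaim(&page));  /* 0 */
    printf("second pass, reclaim? %d\n", should_reclaim(&page)); /* 1 */
    return 0;
}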

2004-04-30 16:14:31

by Timothy Miller

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell



Arjan van de Ven wrote:
>>Multimedia content (jpegs etc) is typically cached in
>>filesystem, so Mozilla polluted pagecache with it when
>>it saved JPEGs to the cache *and* then it keeps 'em in RAM
>>too, which doubles RAM usage.
>
>
> well if mozilla just mmap's the jpegs there is no double caching .....
>


What is cached in memory? The original JPEG or the decoded raw image?
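
To make Arjan's point concrete: when the file is mapped rather than
read into a private buffer, the page cache pages are the in-memory
copy, so the compressed bytes exist once in RAM rather than twice. A
minimal sketch (error handling abbreviated; the decoder call is a
hypothetical stand-in):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    struct stat st;
    const unsigned char *jpeg;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file.jpg\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    /* PROT_READ + MAP_PRIVATE: these pages are shared with the
     * page cache, no second copy of the compressed data. */
    jpeg = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (jpeg == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* decode_jpeg(jpeg, st.st_size);  -- hypothetical decoder */
    printf("mapped %ld bytes, first byte 0x%02x\n",
           (long)st.st_size, jpeg[0]);

    munmap((void *)jpeg, st.st_size);
    close(fd);
    return 0;
}

The decoded raster is separate anonymous memory either way, which is
what the question above is really about.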

2004-04-30 22:19:41

by Paul Jackson

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

> Thought experiment: what would happen when you set the hypothetical
> cpu-nice and io-nice knobs very differently?

If there were one single implementation hook in the kernel where a
user setting could cleanly adapt both i/o and cpu priority, then yes,
your thought experiment would suggest that this one hook was
sufficient, and would lead to a single user knob to control it.

But in this case, there are obviously two implementation hooks - the
classic one in the scheduler that affects cpu usage, and another off in
some i/o code that affects i/o usage.

So then the question comes: do we have one knob over this that is
ganged to both hooks, or do we have two knobs, one per hook?

Ganging these two hooks together, to control them in synchrony to a
single user setting, is a policy choice. It's saying that we don't
think you will ever want to run them out of sync, so as a matter of
policy, we are ganging them together.

I prefer to avoid nonessential policy in the kernel. Best to simply
expose each independent kernel facility, 1-to-1, to the user. Let
them decide when and if these two settings should be ganged.

I find gratuitous (not needed for system reliability) policy in the
kernel to be a greater negative than another system call.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-30 22:31:14

by Marc Singer

[permalink] [raw]
Subject: Re: ~500 megs cached yet 2.6.5 goes into swap hell

On Fri, Apr 30, 2004 at 02:08:03PM +1000, Nick Piggin wrote:
> You would probably be better off trying a simpler change
> first actually:
>
> in mm/vmscan.c, shrink_list(), change:
>
> if (res == WRITEPAGE_ACTIVATE) {
> ClearPageReclaim(page);
> goto activate_locked;
> }
>
> to
>
> if (res == WRITEPAGE_ACTIVATE) {
> ClearPageReclaim(page);
> goto keep_locked;
> }
>
> I think it is not the correct solution, but should narrow
> down your problem. Let us know how it goes.

OK, thanks. It might be a few days before I can get to this.
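
For anyone trying this without a tree handy, the difference between
the two gotos can be sketched as follows. This is a compressed
paraphrase with stubbed types and a made-up stub value for
WRITEPAGE_ACTIVATE; mm/vmscan.c in the 2.6 sources is authoritative:

#include <stdio.h>
#include <stdbool.h>

#define WRITEPAGE_ACTIVATE 1   /* stub value, not the kernel's */

struct page { bool active; bool locked; };

static void SetPageActive(struct page *p)    { p->active = true; }
static void ClearPageReclaim(struct page *p) { (void)p; }
static void unlock_page(struct page *p)      { p->locked = false; }

/* Stub: pretend ->writepage refused and asked for reactivation,
 * as some filesystems could in 2.6. */
static int pageout(struct page *p) { (void)p; return WRITEPAGE_ACTIVATE; }

static void shrink_one(struct page *page, bool suggested)
{
    int res = pageout(page);

    if (res == WRITEPAGE_ACTIVATE) {
        ClearPageReclaim(page);
        if (!suggested)
            goto activate_locked;   /* stock 2.6 behaviour   */
        else
            goto keep_locked;       /* Nick's suggested test */
    }
    return;

activate_locked:
    /* PG_active set: the page moves back to the ACTIVE list and
     * is not reconsidered until the active list is scanned again. */
    SetPageActive(page);
    /* fall through */
keep_locked:
    /* No PG_active: the page stays on the INACTIVE list and gets
     * retried on the next scan -- less artificial promotion of
     * hard-to-write pages, at the cost of more rescanning. */
    unlock_page(page);
}

int main(void)
{
    struct page a = { false, true }, b = { false, true };

    shrink_one(&a, false);
    shrink_one(&b, true);
    printf("stock: active=%d  suggested: active=%d\n",
           a.active, b.active);
    return 0;
}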