2001-11-14 12:43:58

by Sebastian Droege

Subject: [VM] 2.4.14/15-pre4 too "swap-happy"?

Hi,
Are there any parameters (for example in /proc/sys/vm) which make the VM
less swap-happy?
My problems are the following:

I burn a CD-R on system 1:
...
---Swap: 0 KB
mkisofs blablabla
---swap begins to rise ;)
mkisofs finished
---swap: 3402 KB
cdrecord speed=12 blablabla (FIFO is 4 MB)
---heavy swapping
cdrecord finished
---swap: 27421 KB

The system has 256 MB RAM and nothing RAM-eating in the background, yet I
got many buffer-underruns just because of swapping. When I turn swap off,
everything works fine. I think it has something to do with the cache.

Leaving system 2 alone, just play mp3s over nfs:

After two or three days the used swap-space is around 3 MB. I just played
MP3s; no X and no other "big" applications were running. This isn't really
a problem, but it doesn't look good. Swap fills up just because of the cache :(

I think this must be fixed before 2.5 opens.
It isn't good when something gets swapped out just because of the cache.
It would be better if the cache got lower priority.

system 1:
Kernel 2.4.15-pre4
Intel Pentium II @ 350 MHz
256 MB RAM
512 MB Swap

system 2:
Kernel 2.4.14
AMD K6-2 @ 350 MHz
128 MB RAM
256 MB Swap

If you need any more system information, contact me and I'll post it.
I'll be happy to test all your patches/suggestions :)

Thanks in advance


2001-11-14 15:01:30

by Rik van Riel

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On Wed, 14 Nov 2001, Sebastian Dröge wrote:

> After two or three days the used swap-space is around 3 MB. I just
> played MP3s and no X and no other "big" applications were running.
> This isn't really a problem but it doesn't look good. Just because of
> cache swap gets full :(

"This isn't really a problem" is a good analysis of the
situation, since 2.4.14/15 often have data duplicated
in both swap and RAM, so being 3MB into swap doesn't mean
3MB of your programs isn't in RAM.

If you take a look at /proc/meminfo, you'll find a field
called "SwapCached:", which tells you exactly how much
of your memory is duplicated in both swap and RAM.
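
To make that concrete, a rough sketch (illustrative, not from the original
mail) that reads the 2.4-era /proc/meminfo fields named above and splits
swap usage into duplicated and swap-only parts:

    /* swapdup.c: split swap usage using the SwapTotal/SwapFree/SwapCached
     * fields of a 2.4-era /proc/meminfo (values in kB). Sketch only. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long swtotal = 0, swfree = 0, swcached = 0;

        if (!f) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            sscanf(line, "SwapTotal: %ld kB", &swtotal);
            sscanf(line, "SwapFree: %ld kB", &swfree);
            sscanf(line, "SwapCached: %ld kB", &swcached);
        }
        fclose(f);
        printf("swap in use:        %ld kB\n", swtotal - swfree);
        printf("duplicated in RAM:  %ld kB\n", swcached);
        printf("swap-only:          %ld kB\n", swtotal - swfree - swcached);
        return 0;
    }

On Sebastian's "3 MB in swap" box, the swap-only number is what would
actually have to be paged back in.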

regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-11-14 16:38:54

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?


On Wed, 14 Nov 2001, Sebastian Dröge wrote:
>
> The system has 256 MB RAM, nothing RAM-eating in the background I got many
> buffer-underuns just because of swapping. When I turn swap off everything
> works fine. I think it's something with the cache.

Can you do some statistics for me:

cat /proc/meminfo
cat /proc/slabinfo
vmstat 1

while the worst swapping is going on..
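
A minimal sampler for capturing those numbers during the burn (a sketch,
not something from the thread; redirect stdout to a log file and stop it
with ^C -- the same loop works for /proc/slabinfo):

    /* vmsample.c: append a /proc/meminfo snapshot once per second.
     * Sketch for logging stats while the worst swapping is going on. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        size_t n;

        for (;;) {
            FILE *in = fopen("/proc/meminfo", "r");

            if (!in) {
                perror("/proc/meminfo");
                return 1;
            }
            printf("--- t=%ld ---\n", (long)time(NULL));
            while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
                fwrite(buf, 1, n, stdout);
            fclose(in);
            fflush(stdout);
            sleep(1);
        }
    }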

> Leaving system 2 alone, just play mp3s over nfs:
>
> After two or three days the used swap-space is around 3 MB. I just played
> MP3s and no X and no other "big" applications were running. This isn't really
> a problem but it doesn't look good. Just because of cache swap gets full :(

That's normal and usually good. It's supposed to swap stuff out if it
really isn't needed, and that improves performance. Cache _is_ more
important than swap if the cache is active.

HOWEVER, there's probably something in your system that triggers this too
easily. Heavy NFS usage will do that, for example - as mentioned in
another thread on linux-kernel, the VM doesn't really understand
writebacks and asynchronous reads from filesystems that don't use buffers,
and so sometimes the heuristics get confused simply because NFS activity
can _look_ like page mapping to the VM.

Your MP3-over-NFS doesn't sound bad, though. 3MB of swap is perfectly
normal: that tends to be just the idle daemons etc. which really _should_
go to swapspace.

The cdrecord thing is something else, though. I'd love to see the
statistics from that. Although it can, of course, just be another SCSI
allocation strangeness.

Linus

2001-11-15 08:38:52

by janne

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

i've noticed some weirdness too.
i'm using 2.4.15-pre1 on a 1.4ghz athlon kt266a system with 1GB ram.

i first noticed this as i was copying a large amount of data (10+ gigs)
across two reiserfs partitions; during that copy i un-iconified one
mozilla window, and it took about 30 seconds before it finished redrawing
the window and became responsive again.

to me it seems that the latest 2.4 kernels are too eager to swap things
out to make room for cache on machines with _lots_ of memory.

this time i tried the same thing on a freshly rebooted machine.
i un-iconified mozilla when i was 35 megs into swap, and this time it took
under 10 secs to become responsive, but on the other hand mozilla had
just been running a few minutes so its memory footprint was still
quite small.

imho we should not be swapping so eagerly when there is lots of memory
and already 800+ megs used for cache...

attached vmstat log from start of copying to uniconifying mozilla.

slabinfo and meminfo before copying:

slabinfo - version: 1.1
kmem_cache 58 68 112 2 2 1
tcp_tw_bucket 3 30 128 1 1 1
tcp_bind_bucket 16 112 32 1 1 1
tcp_open_request 0 59 64 0 1 1
inet_peer_cache 0 0 64 0 0 1
ip_fib_hash 9 112 32 1 1 1
ip_dst_cache 71 80 192 4 4 1
arp_cache 2 30 128 1 1 1
urb_priv 1 59 64 1 1 1
blkdev_requests 512 540 128 18 18 1
nfs_read_data 0 0 384 0 0 1
nfs_write_data 0 0 384 0 0 1
nfs_page 0 0 128 0 0 1
dnotify cache 0 0 20 0 0 1
file lock cache 2 42 92 1 1 1
fasync cache 1 202 16 1 1 1
uid_cache 2 112 32 1 1 1
skbuff_head_cache 129 140 192 7 7 1
sock 131 144 832 16 16 2
sigqueue 0 29 132 0 1 1
cdev_cache 286 295 64 5 5 1
bdev_cache 4 59 64 1 1 1
mnt_cache 12 59 64 1 1 1
inode_cache 2853 2870 512 409 410 1
dentry_cache 3957 3990 128 132 133 1
filp 1168 1170 128 39 39 1
names_cache 0 7 4096 0 7 1
buffer_head 19989 20010 128 667 667 1
mm_struct 51 60 192 3 3 1
vm_area_struct 2154 2340 128 73 78 1
fs_cache 50 59 64 1 1 1
files_cache 50 54 448 6 6 1
signal_act 54 57 1344 18 19 1
size-131072(DMA) 0 0 131072 0 0 32
size-131072 0 0 131072 0 0 32
size-65536(DMA) 0 0 65536 0 0 16
size-65536 0 0 65536 0 0 16
size-32768(DMA) 0 0 32768 0 0 8
size-32768 2 2 32768 2 2 8
size-16384(DMA) 0 0 16384 0 0 4
size-16384 2 6 16384 2 6 4
size-8192(DMA) 0 0 8192 0 0 2
size-8192 3 4 8192 3 4 2
size-4096(DMA) 0 0 4096 0 0 1
size-4096 55 60 4096 55 60 1
size-2048(DMA) 0 0 2048 0 0 1
size-2048 8 50 2048 4 25 1
size-1024(DMA) 0 0 1024 0 0 1
size-1024 36 44 1024 9 11 1
size-512(DMA) 0 0 512 0 0 1
size-512 37 48 512 5 6 1
size-256(DMA) 0 0 256 0 0 1
size-256 41 60 256 3 4 1
size-128(DMA) 3 30 128 1 1 1
size-128 568 720 128 19 24 1
size-64(DMA) 0 0 64 0 0 1
size-64 331 354 64 6 6 1
size-32(DMA) 52 59 64 1 1 1
size-32 1044 1062 64 18 18 1

total: used: free: shared: buffers: cached:
Mem: 1054724096 191864832 862859264 0 5967872 77918208
Swap: 1077501952 0 1077501952
MemTotal: 1030004 kB
MemFree: 842636 kB
MemShared: 0 kB
Buffers: 5828 kB
Cached: 76092 kB
SwapCached: 0 kB
Active: 22164 kB
Inactive: 147532 kB
HighTotal: 131008 kB
HighFree: 2044 kB
LowTotal: 898996 kB
LowFree: 840592 kB
SwapTotal: 1052248 kB
SwapFree: 1052248 kB


slabinfo and meminfo after copying:

slabinfo - version: 1.1
kmem_cache 58 68 112 2 2 1
tcp_tw_bucket 3 30 128 1 1 1
tcp_bind_bucket 16 112 32 1 1 1
tcp_open_request 0 59 64 0 1 1
inet_peer_cache 0 0 64 0 0 1
ip_fib_hash 9 112 32 1 1 1
ip_dst_cache 8 60 192 3 3 1
arp_cache 2 30 128 1 1 1
urb_priv 1 59 64 1 1 1
blkdev_requests 512 540 128 18 18 1
nfs_read_data 0 0 384 0 0 1
nfs_write_data 0 0 384 0 0 1
nfs_page 0 0 128 0 0 1
dnotify cache 0 0 20 0 0 1
file lock cache 3 42 92 1 1 1
fasync cache 1 202 16 1 1 1
uid_cache 2 112 32 1 1 1
skbuff_head_cache 129 140 192 7 7 1
sock 133 144 832 16 16 2
sigqueue 0 29 132 0 1 1
cdev_cache 21 118 64 2 2 1
bdev_cache 5 59 64 1 1 1
mnt_cache 13 59 64 1 1 1
inode_cache 1501 2408 512 344 344 1
dentry_cache 767 3210 128 107 107 1
filp 1262 1290 128 43 43 1
names_cache 0 2 4096 0 2 1
buffer_head 226567 227160 128 7569 7572 1
mm_struct 54 60 192 3 3 1
vm_area_struct 2292 2430 128 77 81 1
fs_cache 53 59 64 1 1 1
files_cache 53 63 448 6 7 1
signal_act 57 63 1344 20 21 1
size-131072(DMA) 0 0 131072 0 0 32
size-131072 0 0 131072 0 0 32
size-65536(DMA) 0 0 65536 0 0 16
size-65536 0 0 65536 0 0 16
size-32768(DMA) 0 0 32768 0 0 8
size-32768 2 2 32768 2 2 8
size-16384(DMA) 0 0 16384 0 0 4
size-16384 2 3 16384 2 3 4
size-8192(DMA) 0 0 8192 0 0 2
size-8192 3 4 8192 3 4 2
size-4096(DMA) 0 0 4096 0 0 1
size-4096 66 67 4096 66 67 1
size-2048(DMA) 0 0 2048 0 0 1
size-2048 8 10 2048 4 5 1
size-1024(DMA) 0 0 1024 0 0 1
size-1024 39 40 1024 10 10 1
size-512(DMA) 0 0 512 0 0 1
size-512 38 48 512 5 6 1
size-256(DMA) 0 0 256 0 0 1
size-256 41 45 256 3 3 1
size-128(DMA) 3 30 128 1 1 1
size-128 568 600 128 19 20 1
size-64(DMA) 0 0 64 0 0 1
size-64 305 354 64 6 6 1
size-32(DMA) 52 59 64 1 1 1
size-32 400 1121 64 19 19 1

total: used: free: shared: buffers: cached:
Mem: 1054724096 1048174592 6549504 0 5664768 929890304
Swap: 1077501952 36356096 1041145856
MemTotal: 1030004 kB
MemFree: 6396 kB
MemShared: 0 kB
Buffers: 5532 kB
Cached: 892792 kB
SwapCached: 15304 kB
Active: 57556 kB
Inactive: 920424 kB
HighTotal: 131008 kB
HighFree: 1064 kB
LowTotal: 898996 kB
LowFree: 5332 kB
SwapTotal: 1052248 kB
SwapFree: 1016744 kB


Attachments:
vmstat (33.92 kB)

2001-11-15 09:06:16

by janne

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

to follow up on my previous post: i don't know how the current code
works, but perhaps there should be some logic added to check the
percentage of total mem used for cache before swapping out.

like, if memory is full and there's less than 10% of total mem used for
cache, then start swapping out. not if ~90% is already used for cache.. :)
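
Taken literally, the proposal amounts to something like the following
(purely illustrative pseudologic with assumed inputs, not how the 2.4 VM
is actually structured -- and, as the next reply argues, a fixed cutoff
like this ignores how hot the cache is):

    /* Hypothetical sketch of the heuristic suggested above. The
     * cache_pages/total_pages inputs are assumed to exist. */
    static int prefer_swapout(unsigned long cache_pages,
                              unsigned long total_pages)
    {
        /* swap anonymous pages out only while the page cache is under
         * 10% of total memory; otherwise shrink the cache instead */
        return cache_pages * 10 < total_pages;
    }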

2001-11-15 17:45:16

by Mike Galbraith

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On Thu, 15 Nov 2001, janne wrote:

> to follow up on my previous post: i don't know how the current code
> works, but perhaps there should be some logic added to check the
> percentage of total mem used for cache before swapping out.

No.

> like, if memory is full and there's less than 10% of total mem used for
> cache, then start swapping out. not if ~90% is already used for cache.. :)

Numbers like this don't work. You may have a very large and very hot
cache.. you may also have a very large and hot gob of anonymous pages.

-Mike

2001-11-16 00:14:54

by janne

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

> > like, if memory is full and there's less than 10% of total mem used for
> > cache, then start swapping out. not if ~90% is already used for cache.. :)
>
> Numbers like this don't work. You may have a very large and very hot
> cache.. you may also have a very large and hot gob of anonymous pages.

yes of course, sorry if i was not clear. it wasn't meant to be an
implementation suggestion since i know there's a lot more to consider,
and even then those figures might not be feasible. i was just trying to
highlight the particular problem i was having..

2001-11-19 18:02:05

by Sebastian Droege

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

Hi,
I couldn't answer earlier because I had some problems with my ISP.
The heavy swapping problem while burning a CD is solved in pre6aa1,
but if you want I can do some statistics tomorrow.
Thanks and bye

2001-11-19 18:13:16

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?


On Mon, 19 Nov 2001, Sebastian Dröge wrote:
> Hi,
> I couldn't answer earlier because I had some problems with my ISP.
> The heavy swapping problem while burning a CD is solved in pre6aa1,
> but if you want I can do some statistics tomorrow.

Well, pre6aa1 performs really badly exactly because it by default doesn't
swap enough even on _normal_ loads because Andrea is playing with some
tuning (and see the bad results of that tuning in the VM testing by
[email protected]).

So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
due to fixing the problem, but due to bad tuning.

Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
memory area would previously cause unnecessary swapping, and maybe the CD
burning buffer is using shmlock..

Linus
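
(For reference, "using shmlock" means locking a SysV shared memory segment
in core via shmctl(SHM_LOCK). A minimal sketch of how a burner's FIFO could
end up as such a locked segment; the 4 MB size mirrors Sebastian's FIFO,
everything else is hypothetical and not cdrecord's actual code:)

    /* shmfifo.c: a 4 MB SysV shm segment locked in core, i.e. the
     * "locked shared memory area" case pre6 is said to fix.
     * Hypothetical sketch; SHM_LOCK needs root. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        int id = shmget(IPC_PRIVATE, 4 << 20, IPC_CREAT | 0600);
        void *fifo;

        if (id < 0) {
            perror("shmget");
            return 1;
        }
        if (shmctl(id, SHM_LOCK, NULL) < 0)  /* pin the pages in RAM */
            perror("shmctl(SHM_LOCK)");
        fifo = shmat(id, NULL, 0);
        if (fifo == (void *) -1) {
            perror("shmat");
            return 1;
        }
        /* ... fill and drain the FIFO during the burn ... */
        shmdt(fifo);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }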

2001-11-19 18:18:26

by Simon Kirby

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On Wed, Nov 14, 2001 at 08:34:12AM -0800, Linus Torvalds wrote:

> That's normal and usually good. It's supposed to swap stuff out if it
> really isn't needed, and that improves performance. Cache _is_ more
> important than swap if the cache is active.

We have to remember that swap can be much slower to read back in than
rereading data from files, though. I guess this is because files tend to
be more often read sequentially. A freshly-booted box loads up things it
hasn't seen before much faster than a heavily-swapped-out box swaps the
things it needs back in...window managers and X desktop backgrounds, for
example, are awfully slow. I would prefer if it never swapped them out.

This is an annoying situation, because I would like some of my unused
daemons to be swapped out; mlocking random stuff would be worse, though.
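
(The "mlocking" being ruled out here is the mlock()/mlockall() family; a
sketch of how blunt that instrument is -- hypothetical usage, root only:)

    /* pinme.c: pin all current and future pages of this process in RAM
     * so they are never swapped out. Hypothetical illustration of the
     * approach dismissed above. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
            perror("mlockall");
            return 1;
        }
        /* the real work (window manager, daemon, ...) would run here;
         * the lock lasts until exit or munlockall() */
        return 0;
    }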

> HOWEVER, there's probably something in your system that triggers this too
> easily. Heavy NFS usage will do that, for example - as mentioned in
> another thread on linux-kernel, the VM doesn't really understand
> writebacks and asynchronous reads from filesystems that don't use buffers,
> and so sometimes the heuristics get confused simply because NFS activity
> can _look_ like page mapping to the VM.

I've been copying about 40 GB of stuff back and forth over NFS over
switched 100Mbit Ethernet lately, so I can say I'm definitely seeing
this. :) It also seems to happen when I "pull" over NFS rather than
"push" (eg: I ssh to a remote machine and "cp" with the source being an
NFS mount of the local machine)...the 2.4.15pre1 local machine tends to
swap out while this happens as well.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-11-19 18:31:47

by Ken Brownfield

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

Linus, so far 2.4.15-pre4 with your patch does not reproduce the kswapd
issue with Oracle, but I do need to perform more deterministic tests
before I can fully sign off on that.

BTW, didn't your patch go into -pre5? Or is there an additional mod in
-pre6 that we should try?
--
Ken.
[email protected]


2001-11-19 19:28:47

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

In article <[email protected]>,
Ken Brownfield <[email protected]> wrote:
>Linus, so far 2.4.15-pre4 with your patch does not reproduce the kswapd
>issue with Oracle, but I do need to perform more deterministic tests
>before I can fully sign off on that.
>
>BTW, didn't your patch go into -pre5? Or is there an additional mod in
>-pre6 that we should try?

You're right, it's probably in pre5 already..

Anyway, it would be interesting to see if the patch by Andrea (I think
he called it "zone-watermarks") that changes the zone allocators to take
other zones into account makes a difference. See separate thread with
the subject line "15pre6aa1 (fixes google VM problem)".

(I think the patch is overly complex as-is, but I think the _ideas_ in
it are fine.)

Linus

2001-11-19 19:30:57

by Ken Brownfield

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

Actually, I spoke too soon. We developed a quick stress test that
causes the problem immediately:

11:18am up 3 days, 1:36, 3 users, load average: 8.72, 7.18, 3.96
91 processes: 85 sleeping, 6 running, 0 zombie, 0 stopped
CPU states: 0.1% user, 93.4% system, 0.0% nice, 6.4% idle
Mem: 3343688K av, 3340784K used, 2904K free, 0K shrd, 308K buff
Swap: 1004052K av, 567404K used, 436648K free 2994288K cached

PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
12102 oracle 13 0 16320 15M 14868 R 5584 67.2 0.4 18:58 oracle
12365 oracle 18 5 39352 38M 37796 R N 30M 66.7 1.1 4:14 oracle
12353 oracle 18 5 39956 38M 38408 R N 31M 66.5 1.1 9:14 oracle
12191 root 13 0 892 852 672 R 0 66.4 0.0 6:09 top
12366 oracle 9 0 892 892 672 S 0 60.0 0.0 3:20 top
9 root 9 0 0 0 0 SW 0 49.0 0.0 9:27 kswapd
11 root 9 0 0 0 0 SW 0 38.3 0.0 3:58 kupdated
105 root 9 0 0 0 0 SW 0 28.8 0.0 4:56 kjournald
470 root 9 0 844 828 472 S 0 28.1 0.0 1:46 gamdrvd
12351 oracle 13 5 39956 38M 38408 S N 31M 25.6 1.1 3:08 oracle
669 oracle 9 0 4780 4780 4384 S 492 24.4 0.1 1:42 oracle
1 root 14 0 476 424 408 R 0 21.6 0.0 1:19 init
2 root 14 0 0 0 0 RW 0 20.8 0.0 1:29 keventd
615 oracle 9 0 8984 8984 8460 S 4380 16.3 0.2 2:41 oracle
388 root 9 0 732 728 592 S 0 11.5 0.0 0:17 syslogd

kswapd bounces up and down from 99%.

Key points for me are the almost entirely system CPU time, the fact that
the %CPUs seem to add up to more than six CPUs' worth (6-way Xeon), and
that processes which aren't really active show up as "active".

ASAP, I'll try -pre6 and then -aa1 to compare behavior.

The Oracle stress query looks like:

select /*+ parallel(mt,5) cache(mt) */ count(*) from mtable_units ;

Thanks much,
--
Ken.



2001-11-19 19:44:08

by Slo Mo Snail

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On Monday, 19 November 2001 at 19:07, Linus Torvalds wrote:
> On Mon, 19 Nov 2001, Sebastian Dröge wrote:
> > Hi,
> > I couldn't answer earlier because I had some problems with my ISP.
> > The heavy swapping problem while burning a CD is solved in pre6aa1,
> > but if you want I can do some statistics tomorrow.
>
> Well, pre6aa1 performs really badly exactly because it by default doesn't
> swap enough even on _normal_ loads because Andrea is playing with some
> tuning (and see the bad results of that tuning in the VM testing by
> [email protected]).
>
> So the pre6aa1 numbers are kind of suspect - lack of swapping may not be
> due to fixing the problem, but due to bad tuning.
>
> Does plain pre6 solve it? Plain pre6 has a fix where a locked shared
> memory area would previously cause unnecessary swapping, and maybe the CD
> burning buffer is using shmlock..

Hi,
yes, plain pre6 seems to solve it too. I can't be sure right now because I
have recorded only 3 CDs while running pre6.
pre6 swaps more than aa1, but so far I have had no buffer-underruns, and
much of the swap appears in SwapCached.
The interactive performance seems to be much better in pre6 than in aa1, so
I'll stay with pre6 ;)
Bye

2001-11-19 19:45:48

by Marcelo Tosatti

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?



On Mon, 19 Nov 2001, Ken Brownfield wrote:

> Actually, I spoke too soon. We developed a quick stress test that
> causes the problem immediately:
>
> 11:18am up 3 days, 1:36, 3 users, load average: 8.72, 7.18, 3.96
> 91 processes: 85 sleeping, 6 running, 0 zombie, 0 stopped
> CPU states: 0.1% user, 93.4% system, 0.0% nice, 6.4% idle
> Mem: 3343688K av, 3340784K used, 2904K free, 0K shrd, 308K buff
> Swap: 1004052K av, 567404K used, 436648K free 2994288K cached
>
> PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
> 12102 oracle 13 0 16320 15M 14868 R 5584 67.2 0.4 18:58 oracle
> 12365 oracle 18 5 39352 38M 37796 R N 30M 66.7 1.1 4:14 oracle
> 12353 oracle 18 5 39956 38M 38408 R N 31M 66.5 1.1 9:14 oracle
> 12191 root 13 0 892 852 672 R 0 66.4 0.0 6:09 top
> 12366 oracle 9 0 892 892 672 S 0 60.0 0.0 3:20 top
> 9 root 9 0 0 0 0 SW 0 49.0 0.0 9:27 kswapd
> 11 root 9 0 0 0 0 SW 0 38.3 0.0 3:58 kupdated
> 105 root 9 0 0 0 0 SW 0 28.8 0.0 4:56 kjournald
> 470 root 9 0 844 828 472 S 0 28.1 0.0 1:46 gamdrvd
> 12351 oracle 13 5 39956 38M 38408 S N 31M 25.6 1.1 3:08 oracle
> 669 oracle 9 0 4780 4780 4384 S 492 24.4 0.1 1:42 oracle
> 1 root 14 0 476 424 408 R 0 21.6 0.0 1:19 init
> 2 root 14 0 0 0 0 RW 0 20.8 0.0 1:29 keventd
> 615 oracle 9 0 8984 8984 8460 S 4380 16.3 0.2 2:41 oracle
> 388 root 9 0 732 728 592 S 0 11.5 0.0 0:17 syslogd
>
> kswapd bounces up and down from 99%.

Ken,

Could you please check _where_ kswapd is spending its time?

(You can use kernel profiling and the "readprofile" tool to report to us
the functions that are wasting the most CPU cycles in the kernel.)


2001-11-19 23:40:01

by Ken Brownfield

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

I went straight to the aa patch, and it looks like it either fixes the
problem or (because of the side-effects Linus mentioned) otherwise
prevents the issue:

2:30pm up 11 min, 4 users, load average: 2.23, 2.18, 1.17
106 processes: 104 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: 14.7% user, 10.3% system, 0.0% nice, 74.9% idle
Mem: 3342304K av, 3013888K used, 328416K free, 0K shrd, 1224K buff
Swap: 1004052K av, 276824K used, 727228K free 2862112K cached

PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
722 oracle 12 0 13364 12M 11856 S 9.9M 29.5 0.3 2:24 oracle
731 oracle 17 0 13488 12M 11980 D 10M 28.7 0.3 2:27 oracle
728 oracle 12 0 13048 12M 11540 R 9816 20.8 0.3 2:22 oracle
718 oracle 12 0 154M 153M 152M S 150M 17.9 4.7 2:22 oracle
725 oracle 14 0 13472 12M 11964 S 10M 17.9 0.3 2:20 oracle
734 oracle 12 0 13936 13M 12432 S 10M 15.3 0.4 2:27 oracle
9 root 9 0 0 0 0 SW 0 4.3 0.0 0:27 kswapd

The machine went into swap immediately when the page cache stopped
growing, and swap usage hovered at 100-400MB. Also, in my experience the
page cache will grow until there's only 5ishMB of free RAM, but with the
aa patch it looks like it stops leaving 320MB, or maybe 10% of RAM, free.
Was that the aa patch, or part of -pre6?

It would be nice if that number were modifiable via /proc (writable
freepages again? 10% seems a tad high for many boxes), but I think it's
better to have a bit more purely free RAM available than 5MB.

kswapd isn't going nuts, but it seems to still be eating quite a bit of
CPU given plenty of RAM. And it seems to go pretty hard into swap -- I
would imagine that it's disadvantageous to do significant swapping
(based on age only?) in the presence of a massive page cache. I would
imagine the performance hit of a 2GB vs. 3GB page cache would be less
egregious than the time and I/O kswapd is causing without memory
pressure.

The Oracle SGA is set to ~522MB, with nothing else running except a
couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
shared memory segment fit? Is it being swapped out in deference to page
cache?

Just my USD$0.02. I'll try vanilla -pre6 with profiling soon and post
results. Thanks for the tip Marcelo.

Thanks,
--
Ken.
[email protected]



2001-11-19 23:58:41

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?


On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> I went straight to the aa patch, and it looks like it either fixes the
> problem or (because of the side-effects Linus mentioned) otherwise
> prevents the issue:

So is this pre6aa1, or pre6 + just the watermark patch?

> The machine went into swap immediately when the page cache stopped
> growing and hovered at 100-400MB. Also, in my experience the page cache
> will grow until there's only 5ishMB of free RAM, but with the aa patch
> it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
> patch, or part of -pre6?

That was the watermarking. The way Andrea did it, the page cache will
basically refuse to touch as much of the "normal" page zone, because it
would prefer to allocate more from highmem..

I think it's excessive to have 320MB free memory, though, that's just
an insane waste. I suspect that the real number should be somewhere
between the old behaviour and the new one. You can tweak the behaviour of
andrea's kernel by changing the "reserved" page numbers, but I'd like to
hear whether my simpler approach works too..

> The Oracle SGA is set to ~522MB, with nothing else running except a
> couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
> plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
> shared memory segment fit? Is it being swapped out in deference to page
> cache?

Shared memory actually uses the page cache too, so it will be accounted
for in the 2.8GB number.

Anyway, can you try plain vanilla pre6, with the appended patch? This is
my suggested simplified version of what Andrea tried to do, and it should
try to keep only a few extra megs of memory free in the low memory
regions, not 300+ MB.

(and the profiling would be interesting regardless, but I think Andrea did
find the real problem, his fix just seems a bit of an overkill ;)

Linus


Attachments:
p6p7 (1.80 kB)

2001-11-20 00:18:59

by M. Edward Borasky

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On a related note, "/usr/src/linux/Documentation/filesystems/proc.txt"
and "sysctl/vm.txt" refer to some variables I need to be able to set on a
system running 2.4.12. In particular, I need to get at the values in
"/proc/sys/vm/freepages", "/proc/sys/vm/buffermem" and
"/proc/sys/vm/pagecache". However, despite being described in the
documentation, these files don't exist on a 2.4.12 system. How can I read
and set these values?
--
[email protected] (M. Edward Borasky) http://www.aracnet.com/~znmeb
Relax! Run Your Own Brain with Neuro-Semantics!
http://www.meta-trading-coach.com

"Outside of a dog, a book is a man's best friend. Inside a dog, it's
too dark to read." -- Marx

2001-11-20 00:25:39

by Ken Brownfield

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
|
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
|
| So is this pre6aa1, or pre6 + just the watermark patch?

I'm currently using -pre6 with his separately-posted zone-watermark-1
patch. Sorry, I should have been clearer.

| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB. Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM. Was that the aa
| > patch, or part of -pre6?
|
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
|
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..

Yeah, maybe a tiered default would be best, IMHO. 5MB on a 3GB box
does, on the other hand, seem anemic.

| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc. Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit? Is it being swapped out in deference to page
| > cache?
|
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.

My bad, should have realized.

| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
|
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
|
| Linus

I'll try this patch ASAP.

Thanks a LOT to all involved,
--
Ken.
[email protected]

| diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
| --- pre6/linux/mm/page_alloc.c Sat Nov 17 19:07:43 2001
| +++ linux/mm/page_alloc.c Mon Nov 19 15:13:36 2001
| @@ -299,29 +299,26 @@
| return page;
| }
|
| -static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
| -{
| - long free = zone->free_pages - (1UL << order);
| - return free >= 0 ? free : 0;
| -}
| -
| /*
| * This is the 'heart' of the zoned buddy allocator:
| */
| struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
| {
| + unsigned long min;
| zone_t **zone, * classzone;
| struct page * page;
| int freed;
|
| zone = zonelist->zones;
| classzone = *zone;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_low) {
| + min += z->pages_low;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -334,16 +331,18 @@
| wake_up_interruptible(&kswapd_wait);
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| - unsigned long min;
| + unsigned long local_min;
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - min = z->pages_min;
| + local_min = z->pages_min;
| if (!(gfp_mask & __GFP_WAIT))
| - min >>= 2;
| - if (zone_free_pages(z, order) > min) {
| + local_min >>= 2;
| + min += local_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;
| @@ -376,12 +375,14 @@
| return page;
|
| zone = zonelist->zones;
| + min = 1UL << order;
| for (;;) {
| zone_t *z = *(zone++);
| if (!z)
| break;
|
| - if (zone_free_pages(z, order) > z->pages_min) {
| + min += z->pages_min;
| + if (z->free_pages > min) {
| page = rmqueue(z, order);
| if (page)
| return page;

2001-11-20 00:37:10

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?


On Mon, 19 Nov 2001, Ken Brownfield wrote:
> |
> | So is this pre6aa1, or pre6 + just the watermark patch?
>
> I'm currently using -pre6 with his separately-posted zone-watermark-1
> patch. Sorry, I should have been clearer.

Good. That removes the other variables from the equation, ie it's not an
effect of some of the other tweaking in the -aa patches.

> Yeah, maybe a tiered default would be best, IMHO. 5MB on a 3GB box
> does, on the other hand, seem anemic.

Yeah, the 5MB _is_ anemic. It comes from the fact that we decide to never
bother having more than zone_balance_max[] pages free, even if we have
tons of memory. And zone_balance_max[] is fairly small: it limits us to
255 free pages per zone (for page_min, with "page_low" being twice that).
So you get 3 zones, with 255*2 pages free max each, except the DMA zone
has much less just because it's smaller. Thus 5MB.

There's no real reason for having zone_balance_max[] at all - without it
we'd just always try to keep about 1/128th of memory free, which would be
about 24MB on a 3GB box. Which is probably not a bad idea.
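
Spelling out that arithmetic (assuming 4 kB pages and the numbers quoted
above, so a back-of-the-envelope check rather than kernel source):

    per zone:  pages_low = 2 * 255 pages = 510 pages * 4 kB ~ 2 MB
    3 zones (DMA, NORMAL, HIGHMEM): ~6 MB kept free at most,
    with the 16 MB DMA zone capping out far lower -- hence the ~5 MB

    without zone_balance_max[]:  3 GB / 128 = 24 MB kept free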

With my "simplified-Andrea" patch, you should see slightly more than 5MB
free, but not a lot more. A HIGHMEM allocation now wants to leave an
"extra" 510 pages in NORMAL, and even more in the DMA zone, so you should
see something like maybe 12-15 MB free instead of 300MB.

(Wild hand-waving number, I'm too lazy to actually do the math, and I
haven't even tested that the simple patch works at all - I think I forgot
to mention that small detail ;)

Linus

2001-11-20 00:49:02

by Yan, Noah

Subject: RE: [VM] 2.4.14/15-pre4 too "swap-happy"?

Hi all,

Does anyone know whether there is any research/development work going on
now in the Linux kernel for IA-64, such as Intel Itanium?

Best Regards,
Noah Yan

SC/Automation Group
Shanghai Site Manufacturing Computing/IT
Intel Technology (China) Ltd.

IDD: (86 21) 50481818 - 31579
Fax: (86 21) 50481212
Email: [email protected]


2001-11-20 03:10:16

by Ken Brownfield

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

Well, I think you'll be pleased to hear that your untested patch
compiled, booted, _and_ fixed the problem. :)

The minimum free RAM was about 9.8-11MB (matching your guesstimate), and
kswapd seemed to behave the same as with the watermark patch. The results
of top were basically the same, so I'm omitting it.

However, I do have some profiling numbers, thanks to Marcelo. Attached
are numbers from "readprofile | sort -nr +2 | head -20". I think the
pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
The other two might have significance for wizards, but statistically
don't stand out to me, except maybe statm_pgd_range.

I reset the counters just before starting Oracle and the stress test. I
think a -pre7 with a blessed patch would be good, since my testing was
very narrow.

I'll test new kernels as I hear new info.

Thanks much!
--
Ken.
[email protected]


2.4.15-pre4 with your original patch:
(shorter time period since the machine went to hell fast)
(matches vanilla behaviour)

164536 default_idle 3164.1538
101562 shrink_cache 113.8587
3683 prune_icache 13.5404
3034 file_read_actor 12.2339
914 DAC960_BA_InterruptHandler 5.5732
1128 statm_pgd_range 2.9072
40 page_cache_release 0.8333
31 add_page_to_hash_queue 0.5167
89 page_cache_read 0.4363
25 remove_inode_page 0.4167
26 unlock_page 0.3095
509 __make_request 0.3008
66 smp_call_function 0.2946
21 set_bh_page 0.2917
9 __brelse 0.2812
90 try_to_free_buffers 0.2778
13 mark_page_accessed 0.2708
8 __free_pages 0.2500
43 get_hash_table 0.2443
42 activate_page 0.2234

2.4.15-pre6 with watermark patch:

1617446 default_idle 31104.7308
27599 DAC960_BA_InterruptHandler 168.2866
38918 file_read_actor 156.9274
528 page_cache_release 11.0000
554 add_page_to_hash_queue 9.2333
15487 __make_request 9.1531
3453 statm_pgd_range 8.8995
514 remove_inode_page 8.5667
1453 blk_init_free_list 7.2650
377 set_bh_page 5.2361
898 page_cache_read 4.4020
590 add_to_page_cache_unique 4.3382
136 __brelse 4.2500
1120 kmem_cache_alloc 3.8356
628 kunmap_high 3.7381
1189 try_to_free_buffers 3.6698
625 get_hash_table 3.5511
439 lru_cache_add 3.4297
1715 rmqueue 3.0194
105 remove_wait_queue 2.9167

2.4.15-pre6 with Linus patch:

1249875 default_idle 24036.0577
65324 file_read_actor 263.4032
36979 DAC960_BA_InterruptHandler 225.4817
9809 statm_pgd_range 25.2809
1039 page_cache_release 21.6458
994 add_page_to_hash_queue 16.5667
922 remove_inode_page 15.3667
2409 blk_init_free_list 12.0450
20159 __make_request 11.9143
1198 lru_cache_add 9.3594
1628 page_cache_read 7.9804
987 add_to_page_cache_unique 7.2574
2202 try_to_free_buffers 6.7963
1038 get_unused_buffer_head 6.6538
484 unlock_page 5.7619
3182 rmqueue 5.6021
874 kunmap_high 5.2024
164 __brelse 5.1250
900 get_hash_table 5.1136
357 set_bh_page 4.9583


2001-11-20 03:32:33

by Andrea Arcangeli

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

On Mon, Nov 19, 2001 at 09:09:41PM -0600, Ken Brownfield wrote:
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)

Can you try to run an updatedb constantly in background?

Andrea

2001-11-20 03:36:23

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?


On Mon, 19 Nov 2001, Ken Brownfield wrote:
>
> Well, I think you'll be pleased to hear that your untested patch
> compiled, booted, _and_ fixed the problem. :)

Good. The patch itself was fairly simple, and the problem was
straightforward; the real credit for the fix goes to Andrea for thinking
about what was wrong with the old code..

> The minimum free RAM was about 9.8-11MB (matching your guesstimate), and
> kswapd seemed to behave the same as with the watermark patch. The results
> of top were basically the same, so I'm omitting it.

All right. I think 10MB free for a 3GB machine is good - and we can easily
tweak the zone_balance_max[] numbers if somebody comes to the conclusion
that it's better to have more free. It's about .3% of RAM, so it's small
enough that it's certainly not too much, and yet at the same time it's
probably enough to give reasonable behaviour in a temporary memory crunch.

> However, I do have some profiling numbers, thanks to Marcelo. Attached
> are numbers from "readprofile | sort -nr +2 | head -20". I think the
> pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
> The other two might have significance for wizards, but statistically
> don't stand out to me, except maybe statm_pgd_range.

I'd say that this clearly shows that yes, 2.4.14 did the wrong thing, and
wasted time in shrink_cache() without making any real progress. The two
other profiles look reasonable to me - nothing stands out that shouldn't.

(yeah, we spend _much_ too much time doing VM statistics with "top", and
the only way to get rid of that would be to add a per-vma "rss" field.
Which might not be a bad idea, but it's not a high priority for me).

> I reset the counters just before starting Oracle and the stress test. I
> think a -pre7 with a blessed patch would be good, since my testing was
> very narrow.

Sure, I'll do a pre7. This closes my last behaviour issue with the VM,
although I'm sure we'll end up spending tons of time chasing bugs still
(both VM and not).

Linus

2001-11-20 05:54:56

by Ken Brownfield

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
apparent interactivity problems. I'm keeping it in while( 1 ), but it's
been predictable so far.

3-10 is a lot better than 99, but is kswapd really going to eat that
much CPU in an essentially allocation-less state?

But certainly you found the right thing.

Thx all!
--
Ken.
[email protected]


2001-11-20 06:56:06

by Linus Torvalds

Subject: Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

In article <[email protected]>,
Ken Brownfield <[email protected]> wrote:
>kswapd goes up to 5-10% CPU (vs 3-6) but it finishes without issue or
>apparent interactivity problems. I'm keeping it in while( 1 ), but it's
>been predictable so far.
>
>3-10 is a lot better than 99, but is kswapd really going to eat that
>much CPU in an essentially allocation-less state?

Well, it's obviously not allocation-less: updatedb will really hit on
the dcache and icache (which are both in the NORMAL zone only, which is
why Andrea asked for it), and obviously your Oracle load itself seems to
be happily paging stuff around, which causes a lot of allocations for
page-ins.

It only _looks_ static, because once you find the proper "balance", the
VM numbers themselves shouldn't change under a constant load.

We could make kswapd use less CPU time, of course, simply by making the
actual working processes do more of the work to free memory. The total
work ends up being the same, though, and the advantage of kswapd is that
it tends to make the freeing slightly more asynchronous, which helps
throughput.

The _disadvantage_ of kswapd is that if it goes crazy and uses up all
CPU time, you get bad results ;)

But it doesn't sound crazy in your load. I'd be happier if the VM took
less CPU, of course, but for now we seem to be doing ok.

Linus

2001-12-01 13:15:27

by Ken Brownfield

Subject: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)

When updatedb kicked off on my 2.4.16 6-way Xeon 4GB box this morning, I
had an unfortunate flashback:

5:02am up 2 days, 1 min, 59 users, load average: 5.66, 4.86, 3.60
741 processes: 723 sleeping, 4 running, 0 zombie, 14 stopped
CPU states: 0.2% user, 77.3% system, 0.0% nice, 22.3% idle
Mem: 3351664K av, 3346504K used, 5160K free, 0K shrd, 498048K buff
Swap: 1052248K av, 282608K used, 769640K free 2531892K cached

PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
2117 root 15 5 580 580 408 R N 0 99.9 0.0 17:19 updatedb
2635 kb 12 0 1696 1556 1216 R 0 99.9 0.0 4:16 smbd
2672 root 17 10 4212 4212 492 D N 0 94.7 0.1 1:39 rsync
2609 root 2 -20 1284 1284 672 R < 0 81.2 0.0 4:02 top
9 root 9 0 0 0 0 SW 0 80.7 0.0 42:50 kswapd
22879 kb 9 0 11548 6316 1684 S 0 11.8 0.1 7:33 smbd

Under varied load I'm not seeing the kswapd issue, but it looks like
updatedb combined with one or two samba transfers does still reproduce
the problem easily, and adding rsync or NFS transfers to the mix makes
kswapd peg at 99%.

I noticed because I was trying to do kernel patches and compiles using a
partition NFS-mounted from this machine. I guess it sometimes pays to
be up at 5am...

Unfortunately it's difficult for me to reboot this machine to update the
kernel (59 users) but I will try to reproduce the problem on a separate
machine this weekend or early next week. And I don't have profiling on,
so that will have to wait as well. :-(

Andrea, do you have a patch vs. 2.4.16 of your original solution to this
problem that I could test out? I'd rather just change one thing at a
time rather than switching completely to an -aa kernel.

Grrrr!

Thanks much,
--
Ken.
[email protected]



2001-12-08 13:13:24

by Ken Brownfield

Subject: Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)

Just a quick followup to this, which is still a near show-stopper issue
for me.

This is easy to reproduce for me if I run updatedb locally, and then run
updatedb on a remote machine that's scanning an NFS-mounted filesystem
from the original local machine. Instant kswapd saturation, especially
on large filesystems.

Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
peg on the NFS-client side as well.

I recently realized that slocate (at least on RH6.2 w/ 2.4 kernels) does
not seem to properly detect NFS when provided "-f nfs"... Urgh.

Also something I noticed in slabinfo (other info below):

inode_cache 369188 1027256 480 59716 128407 1 : 124 62
dentry_cache 256380 705510 128 14946 23517 1 : 252 126
buffer_head 46961 47800 96 1195 1195 1 : 252 126

That seems like a TON of {dentry,inode}_cache on a 1GB (HIGHMEM) machine.
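
Rough math on those slab lines (assuming, as in 2.4 slabinfo, that the
second column is total allocated objects and the third the object size in
bytes):

    inode_cache:   1027256 objs * 480 bytes ~ 470 MB
    dentry_cache:   705510 objs * 128 bytes ~  86 MB

That is over half a gigabyte of unswappable low memory, most of it held by
objects that are no longer in use (only 369188 and 256380 of them are
active, respectively).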

I'd try 10_vm-19 but it doesn't apply cleanly for me.

Thanks for any input or ports of 10_vm-19 to 2.4.17-pre6. ;)
--
Ken.
[email protected]

total: used: free: shared: buffers: cached:
Mem: 1054011392 900526080 153485312 0 67829760 174866432
Swap: 2149548032 581632 2148966400
MemTotal: 1029308 kB
MemFree: 149888 kB
MemShared: 0 kB
Buffers: 66240 kB
Cached: 170376 kB
SwapCached: 392 kB
Active: 202008 kB
Inactive: 40380 kB
HighTotal: 131008 kB
HighFree: 30604 kB
LowTotal: 898300 kB
LowFree: 119284 kB
SwapTotal: 2099168 kB
SwapFree: 2098600 kB

Mem: 1029308K av, 886144K used, 143164K free, 0K shrd, 66240K buff
Swap: 2099168K av, 568K used, 2098600K free 170872K cached


2001-12-09 20:08:02

by Marcelo Tosatti

Subject: Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)



On Sat, 8 Dec 2001, Ken Brownfield wrote:

> Just a quick followup to this, which is still a near show-stopper issue
> for me.
>
> This is easy to reproduce for me if I run updatedb locally, and then run
> updatedb on a remote machine that's scanning an NFS-mounted filesystem
> from the original local machine. Instant kswapd saturation, especially
> on large filesystems.
>
> Doing updatedb on NFS-mounted filesystems also seems to cause kswapd to
> peg on the NFS-client side as well.

Can you reproduce the problem without the updatedb over NFS?

Thanks

2001-12-10 06:57:19

by Ken Brownfield

Subject: Re: Slight Return (was Re: [VM] 2.4.14/15-pre4 too "swap-happy"?)

Yes, any kind of fairly heavy, spread-out I/O combined with updatedb
will do the trick -- samba, for example. NFS isn't required; it just
seems to be a particularly good trigger.

It seems to be anything that hits the inode/dentry caches hard, actually,
and it doesn't always happen when freepages (or its 2.4.x equivalent) has
been hit. I had a little applet that malloc'ed and memcpy'ed 1GB of RAM
and then exited, which doesn't really help like it did before 2.4.15-pre[56].
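
(A sketch of what such an applet might look like; a hypothetical
reconstruction, since the original was never posted:)

    /* hog.c: allocate and touch 1 GB, then exit, to pressure the VM into
     * dropping caches. Hypothetical reconstruction of the applet
     * described above, not the original code. */
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (64UL << 20)  /* 64 MB per allocation */
    #define TOTAL (1UL << 30)   /* 1 GB in total */

    int main(void)
    {
        unsigned long done;

        for (done = 0; done < TOTAL; done += CHUNK) {
            char *p = malloc(CHUNK);

            if (!p)
                break;
            memset(p, 1, CHUNK); /* touch every page so it is really allocated */
        }
        return 0; /* exiting frees everything at once */
    }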

It also happens for me a lot more with my 4GB machines, though I have
seen it on my 1GB HIGHMEM boxes as well. If the problem is related to
scanning the cache, perhaps more RAM simply makes it worse.

I'm planning on trying Andrew Morton's patches as soon as I'm able.

Thanks,
--
Ken.
[email protected]

