2003-02-19 16:04:36

by Dejan Muhamedagic

Subject: vm issues on sap app server

Hello everybody,

We're running a couple of 4-way Intel boxes here, each with
6GB of memory and a SCSI RAID. Their sole purpose is to run
SAP applications (SAP app servers). Basically, it's 30-40
processes accepting connections from ~150 SAP users and
making queries/updates to a DB server. These processes are
long-lived and may swallow quite a bit of memory. The
standard estimate is ~30MB per user.
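
(A rough sanity check, assuming the ~30MB estimate holds: 150 users x
30MB is ~4.5GB, which should still leave ~1.5GB of the 6GB for the
kernel, buffers and cache.)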

Currently, one box is running the 2.4.20aa1 kernel and the other
2.4.20 with rmap15d and a bunch of NFS patches applied.
We're not entirely happy with either VM, though the SAP
statistics show that both machines have acceptable response
times.

Both servers swap constantly, but the 2.4.20aa1 box at a 10-fold
higher rate. OTOH, there should be enough memory for
everything. It seems like both VMs have a preference for
cache. Is it possible to reduce the amount of memory used
for cache?

The worst situation is when there's a high I/O load. For
example, a file transfer over the Gb i/f (~40MBps) leaves
almost all SAP processes stuck in the D state for some time,
even causing some SAP jobs to fail due to timeouts. It
looks like the VM wants to fill the cache and starts to swap
more at the same time. So we have to do big file transfers
when there's no SAP activity. The machine also suffers
badly during backup.
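
A rough way to watch the stall is to count the D-state processes during
a transfer with something like:

ps axo stat,pid,comm | awk '$1 ~ /^D/ {print; n++} END {print n, "in D state"}'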

Finally, there's a third SAP app server, an RS6000 running
AIX with the same amount of memory, which seems to be more
stable under various loads.

Anybody with advice on how to get Linux to behave better?

I will paste excerpts from vmstat and sysstat
(sar) below, which seem representative, as well as other
relevant data.

Cheers!

Dejan

P.S. Please CC me because I'm not subscribed to the list.

CPU and cache (hyperthreading enabled)

<6>CPU: L1 I cache: 0K, L1 D cache: 8K
<6>CPU: L2 cache: 512K
<6>CPU: L3 cache: 1024K
<4>CPU0: Intel(R) XEON(TM) MP CPU 1.90GHz stepping 02

mem output (2.4.20aa1)

total: used: free: shared: buffers: cached:
Mem: 6341193728 6131351552 209842176 0 29294592 5348958208
Swap: 12582862848 5425111040 7157751808
MemTotal: 6192572 kB
MemFree: 204924 kB
MemShared: 0 kB
Buffers: 28608 kB
Cached: 1982164 kB
SwapCached: 3241428 kB
Active: 831176 kB
Inactive: 4923396 kB
HighTotal: 5358496 kB
HighFree: 6040 kB
LowTotal: 834076 kB
LowFree: 198884 kB
SwapTotal: 12287952 kB
SwapFree: 6989992 kB
BigFree: 0 kB

mem output (2.4.20rmap)

total: used: free: shared: buffers: cached:
Mem: 6321442816 6175191040 146251776 0 20500480 5473546240
Swap: 12582862848 2317144064 10265718784
MemTotal: 6173284 kB
MemFree: 142824 kB
MemShared: 0 kB
Buffers: 20020 kB
Cached: 4310704 kB
SwapCached: 1034556 kB
Active: 4752500 kB
ActiveAnon: 4057072 kB
ActiveCache: 695428 kB
Inact_dirty: 0 kB
Inact_laundry: 898608 kB
Inact_clean: 147680 kB
Inact_target: 1159756 kB
HighTotal: 5358496 kB
HighFree: 70400 kB
LowTotal: 814788 kB
LowFree: 72424 kB
SwapTotal: 12287952 kB
SwapFree: 10025116 kB

vmstat (2.4.20aa1)

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 5 0 5109180 202528 23508 2169176 1031 1035 1031 1035 4180 4728 11 1 88
0 0 0 5108716 202384 23536 2169644 746 424 746 433 1200 1267 5 1 94
0 0 0 5107060 201988 23524 2171300 791 362 791 378 1581 1831 4 1 96
2 0 0 5107328 203968 23556 2171032 1585 1091 1585 1109 3689 4407 14 14 72
0 0 0 5104392 204792 23600 2173968 0 0 0 14 1889 2214 4 1 95

vmstat (2.4.20rmap)

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 1957348 21888 11092 4599932 3 105 3 105 386 315 1 0 99
0 0 0 1957020 21244 11092 4600304 57 154 57 154 723 866 3 1 96
2 0 1 1956784 18336 11092 4600540 27 102 27 102 480 461 1 1 98
2 0 0 1957088 19572 11156 4600308 11 154 11 184 573 598 8 7 85
0 0 0 1957080 19108 11252 4600320 2 194 2 227 474 525 5 5 90

sar averages (2.4.20aa1)

16:40:00 pgpgin/s pgpgout/s activepg inadtypg inaclnpg inatarpg
Average: 973.83 326.41 206077 0 0 0

16:40:00 pswpin/s pswpout/s
Average: 171.72 76.47

16:40:00 kbmemfree kbmemused %memused kbmemshrd kbbuffers kbcached kbswpfree kbswpused %swpused
Average: 226761 5965811 96.34 0 70942 2551248 7429293 4858659 39.54

sar averages (2.4.20rmap)

15:50:00 pgpgin/s pgpgout/s activepg inadtypg inaclnpg inatarpg
Average: 75.94 120.10 1217300 109160 38560 296729

15:50:00 pswpin/s pswpout/s
Average: 16.68 26.29

15:50:00 kbmemfree kbmemused %memused kbmemshrd kbbuffers kbcached kbswpfree kbswpused %swpused
Average: 14025 6159259 99.77 0 16786 4682172 10386017 -10299677 4962552.26


2003-02-19 17:55:37

by Andrea Arcangeli

Subject: Re: vm issues on sap app server

On Wed, Feb 19, 2003 at 05:14:32PM +0100, Dejan Muhamedagic wrote:
> Hello everybody,
>
> We're running a couple of 4-way Intel boxes here, each with
> 6GB of memory and a SCSI RAID. Their sole purpose is to run
> SAP applications (SAP app servers). Basically, it's 30-40
> processes accepting connections from ~150 SAP users and
> making queries/updates to a DB server. These processes are
> long-lived and may swallow quite a bit of memory. The
> standard estimate is ~30MB per user.
>
> Currently, one box is running the 2.4.20aa1 kernel and the other
> 2.4.20 with rmap15d and a bunch of NFS patches applied.
> We're not entirely happy with either VM, though the SAP
> statistics show that both machines have acceptable response
> times.
>
> Both servers swap constantly, but the 2.4.20aa1 box at a 10-fold
> higher rate. OTOH, there should be enough memory for
> everything. It seems like both VMs have a preference for
> cache. Is it possible to reduce the amount of memory used
> for cache?

yes:

echo 1000 >/proc/sys/vm/vm_mapped_ratio

this controls how hard the vm will try to shrink the cache before
starting swapping/unmapping activities.
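
(the default is 100; you can check the current value before and after:

cat /proc/sys/vm/vm_mapped_ratio

note the echo isn't persistent across a reboot, so put it in a boot
script if it helps)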

>
> The worst situation is when there's a high I/O load. For
> example, a file transfer over the Gb i/f (~40MBps) leaves
> almost all SAP processes stuck in the D state for some time,
> even causing some SAP jobs to fail due to timeouts. It
> looks like the VM wants to fill the cache and starts to swap
> more at the same time. So we have to do big file transfers

you can try this to see if it mitigates the "stalling" effect of the
file transfer:

echo 5 >/proc/sys/vm/bdflush

elvtune might also be useful here if the above doesn't help.
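
for example (the latency values are only a starting point, tune to
taste; running elvtune with just the device prints the current
settings):

elvtune -r 64 -w 256 /dev/sda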

You could also consider using 2.4.21pre4aa3; it has more advanced
elevator logic. The one in 2.4.20aa1 wasn't as accurate and might
generate more stalls than necessary.

> when there's no SAP activity. The machine also suffers
> badly during backup.
>
> Finally, there's a third SAP app server, an RS6000 running
> AIX with the same amount of memory, which seems to be more
> stable under various loads.
>
> Anybody with advice on how to get Linux to behave better?

2.4.21pre4aa3 also has extreme scalability optimizations that generate
three-digit percentage improvements on some hardware, though those won't
help latency directly. These optimizations also change, in part, when
the vm starts swapping, and will defer the "swap" point somewhat; this
new behaviour (besides the greatly improved scalability) is also
beneficial to very shm-userspace-cache intensive apps. You can revert to
the non-scalable behaviour (possibly more desirable on small
desktops/laptops) with echo 1 >/proc/sys/vm/vm_anon_lru. You should
also try 'echo 1 >/proc/sys/vm/vm_anon_lru' if you see the VM isn't
swapping well enough and shrinks too much cache after upgrading
to 2.4.21pre4aa3.

Unfortunately the tuning largely depends on the workload and the
hardware, so it is very hard to make it autotune optimally.

> I will paste excerpts from vmstat and sysstat
> (sar) below, which seem representative, as well as other
> relevant data.

Thanks for the interesting feedback!

Andrea

2003-02-19 22:58:24

by Rik van Riel

Subject: Re: vm issues on sap app server

On Wed, 19 Feb 2003, Dejan Muhamedagic wrote:

> Both servers swap constantly, but the 2.4.20aa1 box at a 10-fold
> higher rate. OTOH, there should be enough memory for
> everything. It seems like both VMs have a preference for
> cache. Is it possible to reduce the amount of memory used
> for cache?

Yes, you can set the cache size above which the
pageout code will only reclaim cache and not application
data. To set the percentage to 10% (from the default 5%):

echo 1 10 > /proc/sys/vm/pagecache
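
You can read the values back to check that the write took effect:

# cat /proc/sys/vm/pagecache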

> Finally, there's a third SAP app server, an RS6000 running
> AIX with the same amount of memory, which seems to be more
> stable under various loads.

In that case you're probably familiar with the cache size
tuning, since AIX has the exact same tuning knob as rmap ;)

kind regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-20 12:32:56

by Dejan Muhamedagic

Subject: Re: vm issues on sap app server

Andrea,

On Wed, Feb 19, 2003 at 07:05:24PM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 19, 2003 at 05:14:32PM +0100, Dejan Muhamedagic wrote:
> >
> > Both servers swap constantly, but the 2.4.20aa1 box at a 10-fold
> > higher rate. OTOH, there should be enough memory for
> > everything. It seems like both VMs have a preference for
> > cache. Is it possible to reduce the amount of memory used
> > for cache?
>
> yes:
>
> echo 1000 >/proc/sys/vm/vm_mapped_ratio
>
> this controls how hard the vm will try to shrink the cache before
> starting swapping/unmapping activities.

Today the swapping rate went up compared to yesterday, so much
that it made a serious impact on performance. The server has been
up for four days, and the more time passes, the less it is
capable of handling the load. I tried changing
vm_mapped_ratio as you suggested, but the cache use is still very
high:

Cached: 2292156 kB
SwapCached: 2770440 kB

I must ask for an explanation of the latter item. Is that swap
being cached? If so, why? AFAIK, if a page is swapped out and
later (soon) referenced again, then either the system is in need of
more memory or the VM didn't predict well. The latter case should
occur infrequently. In the former, no clever piece of software would
help anyway. So, why cache swap?

elvtune gives:

/dev/sda elevator ID 2
read_latency: 128
write_latency: 512
max_bomb_segments: 0

That seems fine to me. Anyway, with this much swapping (100-800Kpps)
it won't help. I'll do some testing with the file transfer later.

> 2.4.21pre4aa3 also has extreme scalability optimizations that generate
> three-digit percentage improvements on some hardware, though those won't
> help latency directly. These optimizations also change, in part, when
> the vm starts swapping, and will defer the "swap" point somewhat; this
> new behaviour (besides the greatly improved scalability) is also
> beneficial to very shm-userspace-cache intensive apps.

It is exactly the case here:

# df /dev/shm
Filesystem 1k-blocks Used Available Use% Mounted on
shmfs 16384000 5798364 10585636 36% /dev/shm

> You can revert to
> the non-scalable behaviour (possibly more desirable on small
> desktops/laptops) with echo 1 >/proc/sys/vm/vm_anon_lru. You should
> also try 'echo 1 >/proc/sys/vm/vm_anon_lru' if you see the VM isn't
> swapping well enough and shrinks too much cache after upgrading
> to 2.4.21pre4aa3.

I hope I will be able to give this one a try.

> Thanks for the interesting feedback!

Thank you for your input.

Cheers!

Dejan

2003-02-20 12:39:34

by Dejan Muhamedagic

Subject: Re: vm issues on sap app server

Rik,

On Wed, Feb 19, 2003 at 08:08:01PM -0300, Rik van Riel wrote:
> On Wed, 19 Feb 2003, Dejan Muhamedagic wrote:
>
> > cache. Is it possible to reduce the amount of memory used
> > for cache?
>
> Yes, you can set the cache size above which the
> pageout code will only reclaim cache and not application
> data. To set the percentage to 10% (from the default 5%):
>
> echo 1 10 > /proc/sys/vm/pagecache

Will that work with rmap15d? The code seems to support only min
and borrow parameters. Correct me if I'm wrong. This is what it
looks like currently:

# cat /proc/sys/vm/pagecache
1 3 20
# mem | grep Cache
Cached: 4569128 kB
SwapCached: 829668 kB
ActiveCache: 136728 kB

>
> > Finally, there's a third SAP app server, an RS6000 running
> > AIX with the same amount of memory, which seems to be more
> > stable under various loads.
>
> In that case you're probably familiar with the cache size
> tuning, since AIX has the exact same tuning knob as rmap ;)

AIX vmtune -P is equivalent to the Linux cache-max, but cache-max
is not implemented.

Cheers!

Dejan

2003-02-20 12:59:04

by Andrea Arcangeli

Subject: Re: vm issues on sap app server

On Thu, Feb 20, 2003 at 01:40:27PM +0100, Dejan Muhamedagic wrote:
> Andrea,
>
> On Wed, Feb 19, 2003 at 07:05:24PM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 19, 2003 at 05:14:32PM +0100, Dejan Muhamedagic wrote:
> > >
> > > Both servers swap constantly, but the 2.4.20aa1 box at a 10-fold
> > > higher rate. OTOH, there should be enough memory for
> > > everything. It seems like both VMs have a preference for
> > > cache. Is it possible to reduce the amount of memory used
> > > for cache?
> >
> > yes:
> >
> > echo 1000 >/proc/sys/vm/vm_mapped_ratio
> >
> > this controls how hard the vm will try to shrink the cache before
> > starting swapping/unmapping activities.
>
> Today the swapping rate went up compared to yesterday, so much
> that it made a serious impact on performance. The server has been
> up for four days, and the more time passes, the less it is
> capable of handling the load. I tried changing
> vm_mapped_ratio as you suggested, but the cache use is still very
> high:
>
> Cached: 2292156 kB
> SwapCached: 2770440 kB
>
> I must ask for an explanation of the latter item. Is that swap
> being cached? If so, why? AFAIK, if a page is swapped out and

cache is the filesystem cache: all your program images, the whole SHM
area used by sap, and the swap can be cached too, so the shm memory can
show up as cache, as swap, and as swapcache all at the same time. You
probably have plenty of swap, so there's no need to reclaim shm and
anonymous pages from the swap space; this allows, for example, zero-I/O
cost swapouts of clean swapcache pages, which is relevant in a scenario
like this.
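
(if sap allocates sysv shm you can list the segments and their sizes
with a plain:

ipcs -m

and the same pages may then be counted in Cached, SwapCached and the
swap usage at once)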

> later (soon) referenced again, then either the system is in need of
> more memory or the VM didn't predict well. The latter case should
> occur infrequently. In the former, no clever piece of software would
> help anyway. So, why cache swap?

primarily because this way, if you don't modify it, the next time you
need to swap it to disk it will be zero I/O cost. Secondly, because for
various consistency reasons (especially with direct I/O) we must be able
to mark swap cache dirty (and let the VM collect it away, like we do
with the non-swap cache); being able to mark swapcache dirty (rather
than reclaiming it from the swapcache when you write to it) also
helps avoid fragmenting the swap (so we allocate the swap space only
once and keep overwriting the same place).

After half of the swap is full, the -aa VM stops caching the swap
aggressively, because then the priority becomes not running out of
virtual memory, no longer swapping out as fast as we possibly can.

One of the reasons the performance may slow down over time is swap
fragmentation; the dirty cache will try to avoid it, but it can still
happen and we don't defragment it aggressively. If you had enough memory
for it, it would be interesting to see whether performance returns after
a swapoff -a/swapon -a (but I think you don't have enough ram and the
swapoff would lead to either killing tasks or swapoff failure). However,
you should be able to verify that the performance returns to its peak
after a restart cycle of the app server. That would almost guarantee the
kernel is doing fine.
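
If you ever try it, first check that the free ram roughly covers the
used swap, or the swapoff will fail midway:

grep -E '^(MemFree|SwapTotal|SwapFree):' /proc/meminfo
swapoff -a && swapon -a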

>
> elvtune gives:
>
> /dev/sda elevator ID 2
> read_latency: 128
> write_latency: 512
> max_bomb_segments: 0
>
> That seems fine to me. Anyway, with this much swapping (100-800Kpps)
> it won't help. I'll do some testing with the file transfer later.

the elvtune suggestion was intended only for the file transfer. Just to
give it a spin, you can try -r 10 (just to see if you notice any
difference).
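
i.e.:

elvtune -r 10 /dev/sda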

But really you need to upgrade to pre4aa3, where I improved some bits
in elevator-lowlatency, before testing the file transfer stalling
behaviour again.

> > 2.4.21pre4aa3 also has extreme scalability optimizations that generate
> > three-digit percentage improvements on some hardware, though those won't
> > help latency directly. These optimizations also change, in part, when
> > the vm starts swapping, and will defer the "swap" point somewhat; this
> > new behaviour (besides the greatly improved scalability) is also
> > beneficial to very shm-userspace-cache intensive apps.
>
> It is exactly the case here:
>
> # df /dev/shm
> Filesystem 1k-blocks Used Available Use% Mounted on
> shmfs 16384000 5798364 10585636 36% /dev/shm

Yep, I expected a scenario like this ;)

>
> > You can revert to
> > the non-scalable behaviour (possibly more desirable on small
> > desktops/laptops) with echo 1 >/proc/sys/vm/vm_anon_lru. You should
> > also try 'echo 1 >/proc/sys/vm/vm_anon_lru' if you see the VM isn't
> > swapping well enough and shrinks too much cache after upgrading
> > to 2.4.21pre4aa3.
>
> I hope I will be able to give this one a try.

btw, be careful with vm_mapped_ratio: 1000 may be too much if you
really need to swap a lot to get good performance. It is possible the
default value of 100 is optimal for your workload.

Also remember that if pushed to the maximum, the VM will be forced to
run at the speed of the disk no matter how good the VM is; there is a
hardware limit on how fast the disk can swap. However, a good VM will
run, at worst, at the speed of the disk during seeks, and never much
slower than what the disk can deliver while seeking.

>
> > Thanks for the interesting feedback!
>
> Thank you for your input.

You're very welcome!

Andrea

2003-02-20 13:11:58

by Rik van Riel

Subject: Re: vm issues on sap app server

On Thu, 20 Feb 2003, Dejan Muhamedagic wrote:

> > echo 1 10 > /proc/sys/vm/pagecache
>
> Will that work with rmap15d? The code seems to support only min
> and borrow parameters.

Indeed, only min and borrow are currently supported.

> Correct me if I'm wrong. This is what it looks like currently:
>
> # cat /proc/sys/vm/pagecache
> 1 3 20
> # mem | grep Cache
> Cached: 4569128 kB
> SwapCached: 829668 kB
> ActiveCache: 136728 kB

The "problem" here is that a lot of the memory in Cached: is
mapped into process address space, so in effect it is process
memory.

This is especially true for executables, libraries and shared
memory segments, which you REALLY want to have treated as process
memory and not as cache...

This makes the Cached statistic a bit confusing for administrators.

> > In that case you're probably familiar with the cache size
> > tuning, since AIX has the exact same tuning knob as rmap ;)
>
> AIX vmtune -P is equivalent to the Linux cache-max, but cache-max
> is not implemented.

Doesn't it also have something like the borrow percentage, above
which AIX will only reclaim from the cache, unless the repaging
rate of the cache is higher than that of process memory?

regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-20 13:57:15

by Dejan Muhamedagic

Subject: Re: vm issues on sap app server

Rik,

On Thu, Feb 20, 2003 at 10:21:50AM -0300, Rik van Riel wrote:
> On Thu, 20 Feb 2003, Dejan Muhamedagic wrote:
>
> > # mem | grep Cache
> > Cached: 4569128 kB
> > SwapCached: 829668 kB
> > ActiveCache: 136728 kB
>
> The "problem" here is that a lot of the memory in Cached: is
> mapped into process address space, so in effect it is process
> memory.
>
> This is especially true for executables, libraries and shared
> memory segments, which you REALLY want to have treated as process
> memory and not as cache...
>
> This makes the Cached statistic a bit confusing for administrators.

Is there a way to split the statistics? It also sounds confusing :)

> > > In that case you're probably familiar with the cache size
> > > tuning, since AIX has the exact same tuning knob as rmap ;)
> >
> > AIX vmtune -P is equivalent to the Linux cache-max, but cache-max
> > is not implemented.
>
> Doesn't it also have something like the borrow percentage, above
> which AIX will only reclaim from the cache, unless the repaging
> rate of the cache is higher than that of process memory?

No, not that I'm aware of. You can check the man page for vmtune
yourself:

http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/cmds/aixcmds6/vmtune.htm

BTW, there is also quite a bit of interesting documentation about
the AIX VMM.

Cheers!

Dejan

2003-02-20 15:50:29

by Rik van Riel

Subject: Re: vm issues on sap app server

On Thu, 20 Feb 2003, Dejan Muhamedagic wrote:
> On Thu, Feb 20, 2003 at 10:21:50AM -0300, Rik van Riel wrote:
> > On Thu, 20 Feb 2003, Dejan Muhamedagic wrote:
> >
> > > # mem | grep Cache
> > > Cached: 4569128 kB
> > > SwapCached: 829668 kB
> > > ActiveCache: 136728 kB
> >
> > The "problem" here is that a lot of the memory in Cached: is
> > mapped into process address space, so in effect it is process
> > memory.
>
> Is there a way to split the statistics? It also sounds confusing :)

ActiveAnon is the active memory mapped into user processes,
plus swap cache. ActiveCache is the active memory that's
only caching files and not mapped into user memory.

Note that this isn't always correct since a page can start
in one cache and become mapped by user processes, or be
unmapped. Linux moves the pages lazily.
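
You can watch how the split shifts under load with the same trick you
used above:

# mem | grep -E 'Active|Inact'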

> > > > In that case you're probably familiar with the cache size
> > > > tuning, since AIX has the exact same tuning knob as rmap ;)
> > >
> > > AIX vmtune -P is equivalent to the Linux cache-max, but cache-max
> > > is not implemented.

> http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/cmds/aixcmds6/vmtune.htm

OK, vmtune -p is equivalent to cache-min, vmtune -P is
equivalent to cache-borrow ...

It looks like AIX doesn't have a cache-max, either.

kind regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-20 23:53:23

by Dejan Muhamedagic

Subject: Re: vm issues on sap app server

Andrea,

On Thu, Feb 20, 2003 at 02:08:58PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 20, 2003 at 01:40:27PM +0100, Dejan Muhamedagic wrote:
> >
> > Today the swapping rate went up compared to yesterday, so much
> > that it made a serious impact on performance. The server has been
> > up for four days, and the more time passes, the less it is
> > capable of handling the load. I tried changing
> > vm_mapped_ratio as you suggested, but the cache use is still very
> > high:
> >
> > Cached: 2292156 kB
> > SwapCached: 2770440 kB
> >
> > I must ask for an explanation of the latter item. Is that swap
> > being cached? If so, why? AFAIK, if a page is swapped out and
>
> cache is the filesystem cache: all your program images, the whole SHM
> area used by sap, and the swap can be cached too, so the shm memory can
> show up as cache, as swap, and as swapcache all at the same time. You
> probably have plenty of swap, so there's no need to reclaim shm and
> anonymous pages from the swap space; this allows, for example, zero-I/O
> cost swapouts of clean swapcache pages, which is relevant in a scenario
> like this.
>
> > later (soon) referenced again, then either the system is in need of
> > more memory or the VM didn't predict well. The latter case should
> > occur infrequently. In the former, no clever piece of software would
> > help anyway. So, why cache swap?
>
> primarily because this way, if you don't modify it, the next time you
> need to swap it to disk it will be zero I/O cost. Secondly, because for
> various consistency reasons (especially with direct I/O) we must be able
> to mark swap cache dirty (and let the VM collect it away, like we do
> with the non-swap cache); being able to mark swapcache dirty (rather
> than reclaiming it from the swapcache when you write to it) also
> helps avoid fragmenting the swap (so we allocate the swap space only
> once and keep overwriting the same place).

I guess I don't understand the VM, so please forgive my
ignorance and bear with me. I can't see why a page would be
swapped out and then cached. In other words, why caching would be
preferred. Of course, if a page stays untouched for a "long"
time, then it is desirable to use that space for caching
purposes. However, this VM seems to like swapping a bit too much.

> After half of the swap is full, the -aa VM stops caching the swap
> aggressively, because then the priority becomes not running out of
> virtual memory, no longer swapping out as fast as we possibly can.

I'm not sure it's wise to base decisions on the swap size.
Swap size is commonly thought of as a VM fuse. People often
don't think much about how much space to reserve for swap. In many
places it is recommended to allocate three times the memory size
for swap. In that case a half-full swap would seem like trouble.
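
(To put numbers on it: we have ~12GB of swap for 6GB of RAM, so the
half-full threshold sits at ~6GB, and the mem output above shows ~5GB
of swap already used on the aa box. Had we allocated 18GB per the 3x
rule of thumb, the VM would keep caching swap aggressively up to 9GB
used.)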

> One of the reasons the performance may slow down over time is swap
> fragmentation; the dirty cache will try to avoid it, but it can still
> happen and we don't defragment it aggressively. If you had enough memory
> for it, it would be interesting to see whether performance returns after
> a swapoff -a/swapon -a (but I think you don't have enough ram and the
> swapoff would lead to either killing tasks or swapoff failure). However,
> you should be able to verify that the performance returns to its peak
> after a restart cycle of the app server. That would almost guarantee the
> kernel is doing fine.

I'd disagree here. The system _should_ be able to stay stable
without resetting. If it can't, then there's something wrong.
Unfortunately, in this case, the performance deteriorates over
time. The server goes through a cycle of heavy use during working
hours and relative peace during the evening, yet the
amount of swapping increases every day. BTW, swapin prevails by
a factor of two or three. Today it ended up swapping in around 3000
Kpps almost continuously.

[snipped part about the elevator; more on that in another post
and with the new kernel]

> Also remember that if pushed to the maximum, the VM will be forced to
> run at the speed of the disk no matter how good the VM is; there is a
> hardware limit on how fast the disk can swap. However, a good VM will
> run, at worst, at the speed of the disk during seeks, and never much
> slower than what the disk can deliver while seeking.

Yes, but I don't have the feeling that the memory is overcommitted.
I can't prove that(1), but the other app server handles the load
much better. The two should each be given a fair share of work,
though load balancing is never perfect. The other server is
swapping too, but at an order of magnitude lower rate and, more
importantly, it exhibits the same behaviour every day. Finally, the
AIX server, equipped with the same amount of memory, swaps a
couple of pages a few times a day.

(1) Summing up the RSS column of "ps aux" yields an incredible 21GB.
Could one calculate used_mem - bufs - cached + used_swap?
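
Something like this is what I had in mind, summing the fields from the
meminfo output above:

# awk '/^(MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree):/ {v[$1]=$2}
END {print v["MemTotal:"]-v["MemFree:"]-v["Buffers:"]-v["Cached:"] \
 + v["SwapTotal:"]-v["SwapFree:"], "kB"}' /proc/meminfo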

Sorry for such a long post. I hope to come back with some data on
the 2.4.21 kernel next week.

Cheers!

Dejan

2003-02-21 00:14:57

by Dejan Muhamedagic

Subject: Re: vm issues on sap app server

Rik,

On Thu, Feb 20, 2003 at 01:00:12PM -0300, Rik van Riel wrote:
> On Thu, 20 Feb 2003, Dejan Muhamedagic wrote:
> > > >
> > > > AIX vmtune -P is equivalent to the Linux cache-max, but cache-max
> > > > is not implemented.
>
> OK, vmtune -p is equivalent to cache-min, vmtune -P is
> equivalent to cache-borrow ...

Yes, I was wrong ...

> It looks like AIX doesn't have a cache-max, either.

... and now I can't think of a good reason for implementing
cache-max at all.

Cheers!

Dejan

2003-02-21 09:45:15

by Andrea Arcangeli

Subject: Re: vm issues on sap app server

On Fri, Feb 21, 2003 at 01:03:22AM +0100, Dejan Muhamedagic wrote:
> I'd disagree here. The system _should_ be able to stay stable
> without resetting. If it can't, then there's something wrong.

I really meant restarting the app server, not the kernel.

> (1) Summing up the RSS column of "ps aux" yields an incredible 21GB.
> Could one calculate used_mem - bufs - cached + used_swap?

you're counting the size of the shm once for each of the N tasks that
map it; that's perfectly normal.

As for the increased swapping, it is also possible you pay some
pagetable overhead that increases over time, once all the processes
have touched the whole 2G. I'm not sure whether each task reads the
whole shm after the mmap or shmat during startup (and it's certainly
not mlocked, so the ptes will be allocated lazily).
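
(back of the envelope: with PAE, which you need for 6G, each pte is 8
bytes, so mapping the whole 2G of shm costs 2G/4k * 8 = 4M of pagetables
per process; with 30-40 processes that's 120-160M, all of it in your
~800M of lowmem)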

Again, if the app server returns to peak performance after being
stopped and started, I don't see how this can be a kernel issue.

Andrea