hi,
we have a growing cluster of ia32 SMP machines, each with 4GB of
physical memory. the problem we observe with 2.4.5 as well as 2.4.6
is that once we start running simulations on those machines, they
become quite unusable after a short while. this is the picture of a
freshly rebooted machine after the app has run for 30 minutes
or so:
machine018:~ # top -b | head -28
3:27pm up 4:02, 2 users, load average: 2.08, 3.88, 3.05
60 processes: 55 sleeping, 3 running, 2 zombie, 0 stopped
CPU0 states: 89.0% user, 9.0% system, 89.0% nice, 0.1% idle
CPU1 states: 97.0% user, 1.0% system, 97.0% nice, 0.1% idle
Mem: 4058128K av, 4050816K used, 7312K free, 0K shrd, 3152K buff
Swap: 14337736K av, 3380176K used, 10957560K free 2876028K cached
PID PPID USER PRI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMA
3759 3684 userid 15 2105M 1.1G 928M 254M 0M R N 94.4 23.4 10:57 ceqsim
3498 3425 userid 16 2189M 1.5G 609M 205M 0M R N 91.7 15.3 22:12 ceqsim
4126 819 root 16 1044 0 1044 820 55 R 9.8 0.0 0:00 top
1 0 root 8 76 12 64 64 4 S 0.0 0.0 0:00 init
2 1 root 8 0 0 0 0 0 SW 0.0 0.0 0:00 kevent
3 1 root 9 0 0 0 0 0 SW 0.0 0.0 2:42 kswapd
4 1 root 9 0 0 0 0 0 SW 0.0 0.0 0:00 krecla
5 1 root 9 0 0 0 0 0 SW 0.0 0.0 0:00 bdflus
6 1 root 9 0 0 0 0 0 SW 0.0 0.0 0:03 kupdat
7 1 root 9 0 0 0 0 0 SW 0.0 0.0 0:00 scsi_e
8 1 root 9 0 0 0 0 0 SW 0.0 0.0 0:00 scsi_e
41 1 root 9 0 0 0 0 0 SW 0.0 0.0 0:01 kreise
machine018:~ # cat /proc/meminfo
total: used: free: shared: buffers: cached:
Mem: 4155523072 4148527104 6995968 0 3227648 2940301312
Swap: 1796939776 3461300224 2630606848
MemTotal: 4058128 kB
MemFree: 6832 kB
MemShared: 0 kB
Buffers: 3152 kB
Cached: 2871388 kB
Active: 1936040 kB
Inact_dirty: 499780 kB
Inact_clean: 438720 kB
Inact_target: 3080 kB
HighTotal: 3211200 kB
HighFree: 3988 kB
LowTotal: 846928 kB
LowFree: 2844 kB
SwapTotal: 14337736 kB
SwapFree: 10957560 kB
machine018:~ # cat /proc/swaps
Filename Type Size Used Priority
/dev/sda5 partition 2048248 2048248 -1
/dev/sdb1 partition 2048248 1331928 -2
/dev/sdc1 partition 2048248 0 -3
[..]
why does the kernel have 2.8GB of cached pages, and our applications
have to swap 1.5+1.1GB of pages out? also, i do not understand why
the amount of inactive pages is so high. i don't have good statistics
on that, but my impression is that the amount of Inact_dirty pages
increases the longer the application runs.
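for what it's worth, i could start logging /proc/meminfo over time on one of
the nodes to back that impression up with numbers. a rough sketch of what i
have in mind (the log path is only an example):

  # snapshot /proc/meminfo once a minute so the growth of Inact_dirty is visible
  while :; do date; cat /proc/meminfo; sleep 60; done >> /var/tmp/meminfo.log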
not to mention that the moment the machine is swapping out pages
on the order of gigabytes, the console doesn't even respond and
network-wise it only answers icmp packets.
i don't know how to collect more information; please let me know what
i can do in order to send more info (the .config has CONFIG_HIGHMEM4G=y).
thx,
~dirkw
PS: cc me please since i am not subscribed to the list.
Mark Hahn wrote:
> > 3759 3684 userid 15 2105M 1.1G 928M 254M 0M R N 94.4 23.4 10:57 ceqsim
> > 3498 3425 userid 16 2189M 1.5G 609M 205M 0M R N 91.7 15.3 22:12 ceqsim
>
> do you have any control over the size of these processes?
> with 4G ram, it makes more sense to have them sum to ~3.5G.
i can only allocate 3GB if i use doug lea's malloc; that's a glibc issue
which hasn't been addressed yet. with normal use of malloc() my
apps only get 2GB per process.
for 3.5GB i would have to use the patch from aa, posted a couple of months ago.
for the farm i am using a plain vanilla 2.4.[56] to keep things simple for now.
and no, the typical job has between 1GB and 3GB in memory. the jobs are
independent, just some number crunching on farm CPUs.
> > MemTotal: 4058128 kB
> > MemFree: 6832 kB
> > MemShared: 0 kB
> > Buffers: 3152 kB
> > Cached: 2871388 kB
> > Active: 1936040 kB
> > Inact_dirty: 499780 kB
> > Inact_clean: 438720 kB
> > Inact_target: 3080 kB
> > HighTotal: 3211200 kB
> > HighFree: 3988 kB
> > LowTotal: 846928 kB
> > LowFree: 2844 kB
> > SwapTotal: 14337736 kB
> > SwapFree: 10957560 kB
> >
> >
> > machine018:~ # cat /proc/swaps
> > Filename Type Size Used Priority
> > /dev/sda5 partition 2048248 2048248 -1
> > /dev/sdb1 partition 2048248 1331928 -2
> > /dev/sdc1 partition 2048248 0 -3
>
> they should all have the same priority, so swapping is distributed.
> currently sda5 fills (and judging by the 5, it's not on the fast
> part of the disk) before sdb1 is used.
well, that's the symptom, but not the disease, medically speaking.
the idea was not to swap to the data disks (which are sdb and sdc)
in the first place.
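(just for the record: if we did want the three areas used round-robin, i
understand it would only be a matter of giving them equal priorities, e.g. in
/etc/fstab - a sketch with the partitions from this box:

  /dev/sda5   none   swap   defaults,pri=1   0 0
  /dev/sdb1   none   swap   defaults,pri=1   0 0
  /dev/sdc1   none   swap   defaults,pri=1   0 0

or "swapon -p 1 <device>" at runtime. but that is beside my point here.)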
> > why does the kernel have 2.8GB of cached pages, and our applications
>
> afaik, cached doesn't exclude your apps' pages.
in this typical example i am 2.6GB into swap. where did all my precious
memory go?
> > have to swap 1.5+1.1GB of pages out? also, i do not understand why
> > the amount of inactive pages is so high.
>
> the kernel thinks that there's memory pressure: perhaps you're doing
> file IO? memory pressure causes it to pick on processes, especially
> large ones, especially the ones that are doing the allocation.
also if i do file i/o, i don't expect the kernel to take away so much memory from
my apps. file caching (or whatever "cached" in /proc/meminfo means, is there
a doc btw?) should be a courtesy, not a torture for the app's performance.
> > not to mention that the moment the machine is swapping out pages
> > on the order of gigabytes, the console doesn't even respond and
>
> from appearances, you've overloaded it. the kernel also tries to be
> "fair", which in this case means trying to steal pages from the hogs.
it looks to me that the hog is the kernel, and the kernel isn't fair to me, again:
PID PPID USER PRI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMAND
3759 3684 userid 15 2105M 1.1G 928M 254M 0M R N 94.4 23.4 10:57 ceqsim
3498 3425 userid 16 2189M 1.5G 609M 205M 0M R N 91.7 15.3 22:12 ceqsim
^^^^^^^ ^^^^
normally when i monitor the farm, i also do a "ps alx" and total up the RSS
and VSZ columns, which tells something about the memory consumption in user
space. in my initial posting i forgot to mention those numbers; just add
~100MB and ~200MB respectively. so from
the user space or application perspective the kernel still eats my memory.
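the totals come from something along these lines (a rough sketch; the column
positions are those of the procps here and the values are in kB):

  # sum VSZ (col 7) and RSS (col 8) over all processes listed by "ps alx"
  ps alx | awk 'NR > 1 { vsz += $7; rss += $8 } END { printf "%d kB VSZ, %d kB RSS\n", vsz, rss }'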
what's wrong, and what can i do to help people help me?
thx,
~dirkw
On Tue, 10 Jul 2001, Mark Hahn wrote:
> my point, perhaps too terse, was that you shouldn't run 4.3G of job on
> a 4G machine, and expect everything to necessarily work.
then i expect to have 300M+ swapped out and not 2.6GB. so what?
> > > they should all have the same priority, so swapping is distributed.
> > > currently sda5 fills (and judging by the 5, it's not on the fast
> > > part of the disk) before sdb1 is used.
> >
> > well, that's the symptom, but not the disease, medically speaking.
>
> no, it's actually orthogonal, mathematically speaking:
> your swap configuration is inefficient
i am complaining about the fact that the machines start paging
heavily without a reason and you are telling me that my swap
config is wrong?
> note also that swap listed as in use is really just allocated,
> not necessarily used. the current VM preemptively assigns idle pages
> to swapcache; whether they ever get written out is another matter.
it IS used. after submitting the jobs the nodes were dead for a while since
they were swapping like hell.
> you should clearly run "vmstat 1" to see whether there's significant
> si/so. it would be a symptom if there was actually a lot of *both*
> si and so (thrashing).
they were so dead, i couldn't even type on the console. load was up
to 30, culprit at that point was kswapd.
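for the next run i can at least start a logger before submitting the jobs, so
there is some data even when the console is gone. something like this sketch
(the log path is only an example):

  # record per-second si/so, run queue etc. to disk before the node goes under
  nohup vmstat 1 >> /var/tmp/vmstat.log 2>&1 &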
> > > the kernel thinks that there's memory pressure: perhaps you're doing
> > > file IO? memory pressure causes it to pick on processes, especially
> > > large ones, especially the ones that are doing the allocation.
> >
> > also if i do file i/o, i don't expect the kernel to take away so much memory from
> > my apps.
>
> well, then you disagree with the VM design. it's based on idleness,
> not some notion of categories.
so are you saying that if i want to run apps using 4GB of memory i should get
a machine with 4GB+60% = 6.4GB? you're not serious, are you?
> > > from appearances, you've overloaded it. the kernel also tries to be
> > > "fair", which in this case means trying to steal pages from the hogs.
> >
> > it looks to me that the hog is the kernel, and the kernel isn't fair to me, again:
> >
> > PID PPID USER PRI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMAND
> > 3759 3684 userid 15 2105M 1.1G 928M 254M 0M R N 94.4 23.4 10:57 ceqsim
> > 3498 3425 userid 16 2189M 1.5G 609M 205M 0M R N 91.7 15.3 22:12 ceqsim
> > ^^^^^^^ ^^^^
>
> what's surprising, that your app has 900/600M working set?
the arrows got misaligned in the reply i am reading now. i was complaining
about the fact that 1.1GB of 2.1GB and 1.5GB of 2.2GB are swapped out, and
there's no need for that. period.
> > the user space or application perspective the kernel still eats my memory.
>
> the current VM does not follow your thinking, which is that ram is for apps
> and the kernel gets to keep any leftovers.
Mark, you still didn't get my point. the kernel doesn't get the leftovers;
in my case it takes memory away from my apps instead. either there's a leak
to /dev/dead, or there are misreferenced pointers in a page table somewhere
which only get cleaned up by a reboot.
can anybody else tell me what's wrong?
~dirkw
On Tue, 10 Jul 2001, Dirk Wetter wrote:
>
> On Tue, 10 Jul 2001, Mark Hahn wrote:
>
>
> > my point, perhaps too terse, was that you shouldn't run 4.3G of job on
> > a 4G machine, and expect everything to necessarily work.
>
> then i expect to have 300M+ swapped out and not 2.6GB. so what?
>
> > > > they should all have the same priority, so swapping is distributed.
> > > > currently sda5 fills (and judging by the 5, it's not on the fast
> > > > part of the disk) before sdb1 is used.
> > >
> > > well, that's the symptom, but not the disease, medically speaking.
> >
> > no, it's actually orthogonal, mathematically speaking:
> > your swap configuration is inefficient
>
> i am complaining about the fact that the machines start paging
> heavily without a reason and you are telling me that my swap
> config is wrong?
>
> > note also that swap listed as in use is really just allocated,
> > not necessarily used. the current VM preemptively assigns idle pages
> > to swapcache; whether they ever get written out is another matter.
>
> it IS used. after submitting the jobs the nodes were dead for a while since
> they were swapping like hell.
>
> > you should clearly run "vmstat 1" to see whether there's significant
> > si/so. it would be a symptom if there was actually a lot of *both*
> > si and so (thrashing).
>
> they were so dead, i couldn't even type on the console. load was up
> to 30, culprit at that point was kswapd.
Dirk,
Can you boot the kernel with "profile=2" and use the "readprofile" tool to
check where the kernel is wasting its time? (take a look at the
readprofile man page)
Let the machine stay in the "unusable" state for quite some time before
reading the statistics.
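Something along these lines should do (the paths are only examples, adjust
them to your setup):

  # add profile=2 to the kernel command line (e.g. append="profile=2" in lilo.conf), reboot
  readprofile -r                                          # zero the profiling counters
  # ...reproduce the problem and let the machine sit in the bad state for a while...
  readprofile -m /boot/System.map | sort -nr | head -20   # the busiest kernel functions

Then post the output here.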