2007-08-26 14:39:21

by Fred Tyler

Subject: Slow, persistent memory leak in 2.6.20

I think I've come across a memory leak in 2.6.20. I've upgraded to the
latest 2.6.20.17, but it didn't seem to help.

A little background: I saw something exactly like this many months ago
with a 2.6.12 kernel. However, by 2.6.16.x the leak had apparently
been fixed, so I didn't pursue it. But either it remains in 2.6.20 or
else a new leak has appeared.

FWIW, this is an x86_64 machine, but I also saw nearly the same
behavior on an i386 machine running 2.6.12. (Links to graphs showing
long-term memory usage are at the bottom of this email if you want to
skip all the text stats in the middle.)

Immediately after booting the system, I shut down all services to get
a baseline for comparison. Here is the output of top, free, and vmstat
with virtually nothing running:

=========== top =============

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
  754 root      15   0 16948 2368 1732 R  0.0  0.3   0:00.01 sshd
  757 root      15   0  5620 1440 1124 S  0.0  0.2   0:00.00 bash
 1195 root      15   0  6300 1116  880 R  0.3  0.1   0:00.02 top
 1196 root      18   0  3880  628  516 S  0.0  0.1   0:00.00 agetty
    1 root      18   0  3888  516  412 S  0.0  0.1   0:00.26 init
  741 root      18   0  3880  508  412 S  0.0  0.1   0:00.00 agetty
  742 root      18   0  3876  504  412 S  0.0  0.1   0:00.00 agetty
  743 root      18   0  3876  504  412 S  0.0  0.1   0:00.00 agetty
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0
    5 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
    6 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthread
   57 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 kblockd/0
   58 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 ata/0
   59 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 ata_aux
   62 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 khubd
   64 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kseriod
  121 root      25   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  122 root      15   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  123 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kswapd0
  124 root      19  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
  220 root      13  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_0
  221 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_1
  245 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_2
  246 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 usb-storage
  256 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 reiserfs/0

=========== free ============

             total       used       free     shared    buffers     cached
Mem:        899408      96824     802584          0      12604      70064
-/+ buffers/cache:      14156     885252
Swap:        65528          0      65528


=========== vmstat ==============

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 802152  13008  70096    0    0   402   184  282   87  2  1 88  9


=========== vmstat -s ===============

899408 total memory
97248 used memory
50352 active memory
34368 inactive memory
802160 free memory
13080 buffer memory
70104 swap cache
65528 total swap
0 used swap
65528 free swap
349 non-nice user cpu ticks
0 nice user cpu ticks
172 system cpu ticks
15743 idle cpu ticks
1522 IO-wait cpu ticks
0 IRQ cpu ticks
10 softirq cpu ticks
0 stolen cpu ticks
69682 pages paged in
32228 pages paged out
0 pages swapped in
0 pages swapped out
50228 interrupts
15207 CPU context switches
1188132534 boot time
1213 forks

==================================



OK, now I start all services back up and let the system run for about
12 hours. At the end of this time, I shut down all services again so
that virtually nothing is running. Here are the stats:



=========== top ================

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
17250 root      15   0 16952 2372 1732 R  0.0  0.3   0:00.09 sshd
17253 root      15   0  5624 1448 1124 S  0.0  0.2   0:00.01 bash
23409 root      15   0  6304 1124  884 R  0.0  0.1   0:00.00 top
23410 root      18   0  3880  628  516 S  0.0  0.1   0:00.00 agetty
    1 root      18   0  3884  516  412 S  0.0  0.1   0:00.56 init
  750 root      18   0  3880  508  412 S  0.0  0.1   0:00.00 agetty
  751 root      18   0  3880  508  412 S  0.0  0.1   0:00.00 agetty
  749 root      18   0  3876  504  412 S  0.0  0.1   0:00.00 agetty
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0
    4 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0
    5 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
    6 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthread
   57 root      10  -5     0    0    0 S  0.0  0.0   0:00.31 kblockd/0
   58 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 ata/0
   59 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 ata_aux
   62 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 khubd
   64 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kseriod
  121 root      15   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  123 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kswapd0
  124 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
  220 root      13  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_0
  221 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_1
  245 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_2
  246 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 usb-storage
  256 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 reiserfs/0
17277 root      15   0     0    0    0 S  0.0  0.0   0:00.16 pdflush

============= free ===========

             total       used       free     shared    buffers     cached
Mem:        899408     747128     152280          0     166228     444540
-/+ buffers/cache:     136360     763048
Swap:        65528          0      65528


============= vmstat ==============

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 152288 166228 444564    0    0    10    30  255   29  0  0 99  0

============ vmstat -s ==============

899408 total memory
747248 used memory
338700 active memory
273736 inactive memory
152160 free memory
166228 buffer memory
444660 swap cache
65528 total swap
0 used swap
65528 free swap
7522 non-nice user cpu ticks
300 nice user cpu ticks
4120 system cpu ticks
3699397 idle cpu ticks
13963 IO-wait cpu ticks
49 IRQ cpu ticks
146 softirq cpu ticks
0 stolen cpu ticks
355378 pages paged in
1108508 pages paged out
0 pages swapped in
0 pages swapped out
9505965 interrupts
1095062 CPU context switches
1188095217 boot time
23440 forks

======================================


After 12 hours, you can see that when I shut down all of the services
there is a lot of memory being used. But where is it going?
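
For reference, the "-/+ buffers/cache" row that free prints is just "used" minus "buffers" minus "cached" from the Mem: row. A minimal sketch of that arithmetic, assuming the older procps column layout shown in this mail (the function name app_used_kb is made up for illustration):

```shell
# Re-derive the "-/+ buffers/cache" used figure from the "Mem:" row of
# old-style `free` output read on stdin.
# Columns assumed: Mem: total used free shared buffers cached
app_used_kb() {
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}

# Mem: rows copied from the two snapshots in this mail.
echo "Mem: 899408 96824 802584 0 12604 70064" | app_used_kb
echo "Mem: 899408 747128 152280 0 166228 444540" | app_used_kb
```

For the two snapshots this gives 14156 kB at boot and 136360 kB after 12 hours, matching the rows free printed.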

I have compared this to a machine running i386 2.6.16.2x and when I
stop all services down to nothing but ssh, there is only a tiny amount
of RAM in use, as expected.

I can verify that this memory loss never stops: the lost memory keeps
increasing until the machine goes into swap, and it will eventually
crash if left to its own devices. However, on machines with a lot of
RAM, this process can take a month or more.

Here are links to three cacti graphs where you can see the effect over
the long term:

This graph is from a machine running 2.6.16.27/i386, which does not
have any memory loss. You can see the long-term memory line is flat:

http://i239.photobucket.com/albums/ff117/fredty8/memory-a4.png

Now here is a graph from a machine running 2.6.12/i386, which clearly
shows a long-term memory loss. The points where the memory shoots back
up to its full level are when the machine had to be rebooted because
it was going into swap:

http://i239.photobucket.com/albums/ff117/fredty8/memory-a2.png

And finally, here is a graph from a machine running 2.6.20.15/x86_64,
which shows a memory loss very similar to that of the 2.6.12 machine.
(This machine has only been up for a few weeks, which is why the graph
is so short. But it is clear that the graph is doing the same thing as
2.6.12):

http://i239.photobucket.com/albums/ff117/fredty8/memory-b1.png


If you need any more information from me, I'll be happy to provide it.
Please CC me on replies.


2007-08-26 15:32:25

by Alexey Dobriyan

Subject: Re: Slow, persistent memory leak in 2.6.20

On Sun, Aug 26, 2007 at 10:39:11AM -0400, Fred Tyler wrote:
> I think I've come across a memory leak in 2.6.20. I've upgraded to the
> latest 2.6.20.17, but it didn't seem to help.
>
> A little background: I saw something exactly like this many months ago
> with a 2.6.12 kernel. However, by 2.6.16.x the leak had apparently
> been fixed, so I didn't pursue it. I just assumed it had been fixed.
> But either it remains in 2.6.20 or else a new leak has appeared.
>
> FWIW, this is an x86_64 machine, but I also saw nearly the same
> behavior on an i386 machine running 2.6.12. (Links to graphs showing
> long-term memory usage are at the bottom of this email if you want to
> skip all the text stats in the middle.)
>
> Immediately after booting the system, I shut down all services to get
> a baseline for comparison. Here is the output of top and vmstat with
> virtually nothing running:

You can try enabling "Kernel Hacking" => "Debug slab memory allocations" =>
"Memory leak debugging" in the kernel configuration. Once you think enough
memory has leaked, post the output of

sort -n -k2 /proc/slab_allocators
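
The sort above orders individual call sites by object count. To see which cache holds the most objects overall, the counts can also be summed per cache. A rough sketch, assuming each line of /proc/slab_allocators has the form "<cache>: <count> <caller>" (the helper name slab_totals is made up for illustration):

```shell
# Sum the per-call-site object counts for each slab cache (input on stdin).
# Assumed line format: "<cache>: <count> <caller symbol>"
slab_totals() {
    awk '{
             sub(/:$/, "", $1)      # strip the trailing colon from the cache name
             total[$1] += $2
         }
         END { for (c in total) print total[c], c }' | sort -n
}

# On a kernel built with the option above:
#   slab_totals < /proc/slab_allocators | tail
```

The caches with the largest totals end up at the bottom of the output.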

2007-08-26 15:41:14

by Fred Tyler

Subject: Re: Slow, persistent memory leak in 2.6.20

On 8/26/07, Fred Tyler wrote:
> I think I've come across a memory leak in 2.6.20. I've upgraded to the
> latest 2.6.20.17, but it didn't seem to help.

One more thing, I just found this message from July from someone
seeing a similar problem:

http://lkml.org/lkml/2007/7/27/305

I am also running reiserfs, so I wonder if that has something to do
with this. Unlike the other poster, though, I am running an unmodified
kernel and have not seen the error he saw in the system logs.

Here's my output from /proc/meminfo in case it helps:

$ cat /proc/meminfo
MemTotal: 4053564 kB
MemFree: 144344 kB
Buffers: 310824 kB
Cached: 2684244 kB
SwapCached: 64 kB
Active: 1858644 kB
Inactive: 1510808 kB
SwapTotal: 65528 kB
SwapFree: 65316 kB
Dirty: 1772 kB
Writeback: 0 kB
AnonPages: 363844 kB
Mapped: 46924 kB
Slab: 509276 kB
SReclaimable: 467220 kB
SUnreclaim: 42056 kB
PageTables: 9660 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 2092308 kB
Committed_AS: 854520 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 4936 kB
VmallocChunk: 34359733423 kB
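
In this output Slab is the sum of SReclaimable and SUnreclaim (467220 + 42056 = 509276 kB). A small sketch that pulls those three fields out and shows the reclaimable share, assuming a kernel that reports SReclaimable and SUnreclaim as this one does (the function name slab_breakdown is made up for illustration):

```shell
# Summarise slab usage from a meminfo-style file (default: /proc/meminfo).
slab_breakdown() {
    awk '/^Slab:/         { slab = $2 }
         /^SReclaimable:/ { rec = $2 }
         /^SUnreclaim:/   { unrec = $2 }
         END {
             if (slab > 0)
                 printf "Slab=%d kB reclaimable=%d kB (%d%%) unreclaimable=%d kB\n",
                        slab, rec, int(rec * 100 / slab), unrec
         }' "${1:-/proc/meminfo}"
}
```

With the values above, about 91% of the slab memory is counted as reclaimable.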

2007-08-26 15:51:55

by Fred Tyler

Subject: Re: Slow, persistent memory leak in 2.6.20

On 8/26/07, Fred Tyler wrote:
> I think I've come across a memory leak in 2.6.20. I've upgraded to the
> latest 2.6.20.17, but it didn't seem to help.

Sorry to keep replying to my own post, but further investigation
suggests that the memory losses may be occurring at times of heavy
filesystem access. The machines in question run rsyncs of hundreds of
thousands of files every few hours, and I'm starting to think that the
memory loss occurs during these times. I don't know how I'd go about
proving this though...
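
One rough way to test that idea is to snapshot a /proc/meminfo field immediately before and after a single rsync run and compare. A minimal sketch, assuming the Slab field is the one of interest; the function name slab_delta and the MEMINFO override variable are made up for illustration:

```shell
# Print the Slab value (kB) before and after running a command.
# MEMINFO is a hypothetical override so the function can read a saved copy
# instead of the live /proc/meminfo.
slab_delta() {
    src=${MEMINFO:-/proc/meminfo}
    before=$(awk '/^Slab:/ { print $2 }' "$src")
    "$@"
    after=$(awk '/^Slab:/ { print $2 }' "$src")
    echo "Slab before=${before} kB after=${after} kB delta=$((after - before)) kB"
}

# Example (paths are placeholders):
#   slab_delta rsync -a /some/source/ /some/dest/
```

Growth during a single run is not proof of a leak, since caches are expected to grow under heavy file access; what matters is whether the memory can be reclaimed afterwards.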

2007-08-26 15:52:53

by Jan Engelhardt

Subject: Re: Slow, persistent memory leak in 2.6.20


On Aug 26 2007 11:51, Fred Tyler wrote:
>On 8/26/07, Fred Tyler wrote:
>> I think I've come across a memory leak in 2.6.20. I've upgraded to the
>> latest 2.6.20.17, but it didn't seem to help.
>
>Sorry to keep replying to my own post, but further investigation
>suggests that the memory losses may be occurring at times of heavy
>filesystem access. The machines in question run rsyncs of hundreds of
>thousands of files every few hours, and I'm starting to think that the
>memory loss occurs during these times. I don't know how I'd go about
>proving this though...

Please rule out filesystem caches by issuing
sync;
echo 3 >/proc/sys/vm/drop_caches;
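
Writing 1 to drop_caches frees the page cache, 2 frees dentries and inodes, and 3 frees both. A small sketch that wraps the two commands above and records MemFree on either side; the function name drop_and_report and its two optional arguments are made up for illustration, and the real control file needs root:

```shell
# Flush dirty data, drop clean caches, and report MemFree before and after.
# $1: control file (default /proc/sys/vm/drop_caches)
# $2: meminfo file (default /proc/meminfo)
# Both arguments are hypothetical overrides, useful for a dry run.
drop_and_report() {
    ctl=${1:-/proc/sys/vm/drop_caches}
    info=${2:-/proc/meminfo}
    before=$(awk '/^MemFree:/ { print $2 }' "$info")
    sync                       # write out dirty pages first; only clean pages are dropped
    echo 3 > "$ctl"            # 1 = page cache, 2 = dentries and inodes, 3 = both
    after=$(awk '/^MemFree:/ { print $2 }' "$info")
    echo "MemFree before=${before} kB after=${after} kB"
}
```

If MemFree jumps back up after this, the memory was held by reclaimable caches rather than lost.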



Jan
--

2007-08-26 16:16:30

by Fred Tyler

Subject: Re: Slow, persistent memory leak in 2.6.20

On 8/26/07, Jan Engelhardt wrote:
>
> On Aug 26 2007 11:51, Fred Tyler wrote:
> >On 8/26/07, Fred Tyler wrote:
> >> I think I've come across a memory leak in 2.6.20. I've upgraded to the
> >> latest 2.6.20.17, but it didn't seem to help.
> >
> >Sorry to keep replying to my own post, but further investigation
> >suggests that the memory losses may be occurring at times of heavy
> >filesystem access. The machines in question run rsyncs of hundreds of
> >thousands of files every few hours, and I'm starting to think that the
> >memory loss occurs during these times. I don't know how I'd go about
> >proving this though...
>
> Please rule out filesystem caches by issuing
> sync;
> echo 3 >/proc/sys/vm/drop_caches;

(Sorry if this goes to the list twice... Mailer problems.)

Ok, I did this on a non-production machine that has only been up for a
few hours, and here's what happened:

======== Before =========

$ free -m
             total       used       free     shared    buffers     cached
Mem:           878        824         54          0        111        422
-/+ buffers/cache:        290        587
Swap:           63          0         63


======== After ========

root@b0$ free -m
             total       used       free     shared    buffers     cached
Mem:           878         47        830          0          6          4
-/+ buffers/cache:         36        841
Swap:           63          0         63

======================

So, I guess it worked? (I don't know what was supposed to happen, but
memory usage dropped significantly when I did this.)

However, I'm not sure this staging machine has been up long enough or
doing enough to exhibit the problem. I can try this on my production
servers (the ones I provided graphs for) late tonight, but how safe is
running this command? Does it permanently disable file caching? Do I
need to reset it afterwards? If I stop all services (databases,
logging, etc) first, am I protected against data loss?

2007-08-26 16:30:33

by Jan Engelhardt

Subject: Re: Slow, persistent memory leak in 2.6.20


On Aug 26 2007 12:16, Fred Tyler wrote:
>> Please rule out filesystem caches by issuing
>> sync;
>> echo 3 >/proc/sys/vm/drop_caches;
>
>(Sorry if this goes to the list twice... Mailer problems.)
alright..

>Ok, I did this on a non-production machine that has only been up for a
>few hours, and here's what happened:
>
>======== Before =========
>
>$ free -m
> total used free shared buffers cached
>Mem: 878 824 54 0 111 422
>-/+ buffers/cache: 290 587
>Swap: 63 0 63
>
>
>======== After ========
>
>root@b0$ free -m
> total used free shared buffers cached
>Mem: 878 47 830 0 6 4
>-/+ buffers/cache: 36 841
>Swap: 63 0 63
>
>======================
>
>So, I guess it worked? (I don't know what was supposed to happen, but
>memory usage dropped significantly when I did this.)

So I guess you are not seeing any memory leak at all, but just the regular
caching?

>However, I'm not sure this staging machine has been up long enough or
>doing enough to exhibit the problem. I can try this on my production
>servers (the ones I provided graphs for) late tonight, but how safe is
>running this command? Does it permanently disable file caching? Do I
>need to reset it afterwards? If I stop all services (databases,

drop_caches is a trigger, not a setting. Hence your RAM will be used for
caching again after you have written to it.

>logging, etc) first, am I protected against data loss?


Jan
--

2007-08-26 16:49:36

by Fred Tyler

Subject: Re: Slow, persistent memory leak in 2.6.20

On 8/26/07, Jan Engelhardt wrote:
>
> On Aug 26 2007 12:16, Fred Tyler wrote:
> >> Please rule out filesystem caches by issuing
> >> sync;
> >> echo 3 >/proc/sys/vm/drop_caches;
> >
>
> >Ok, I did this on a non-production machine that has only been up for a
> >few hours, and here's what happened:
> > ...
> >So, I guess it worked? (I don't know what was supposed to happen, but
> >memory usage dropped significantly when I did this.)
>
> So I guess you are not seeing any memory leak at all, but just the regular
> caching?

I certainly hope that is the case, but until I try it on the
production machine tonight I won't know for sure. If this is indeed a
leak, it's pretty slow, and it takes a week or so before you can even
start noticing it on the graphs.

I can say with absolute certainty that something very similar was
happening in 2.6.12 (compare the graphs in my original email), and in
2.6.12 it would inevitably lead to the server running entirely out of
memory, to the point where applications could no longer allocate
memory and the server would have to be rebooted.

The symptoms were almost identical in that case: I'd shut down
virtually every application on the server, but the memory would still
be almost entirely in use. I understand there's kernel caching, but if
the kernel caching occurs at the expense of any other applications
being able to access memory, then there's a real problem. (I actually
still have one 2.6.12 machine running, but drop_caches doesn't appear
to exist on it so I can't test it there. Is there an analogue?)

Anyway, I'll post the results from the 2.6.20 server as soon as I have
them. Should be late tonight.

Thanks.

2007-08-26 16:59:10

by Fred Tyler

Subject: Re: Slow, persistent memory leak in 2.6.20

On 8/26/07, Jan Engelhardt wrote:
>
> On Aug 26 2007 12:16, Fred Tyler wrote:
> >> Please rule out filesystem caches by issuing
> >> sync;
> >> echo 3 >/proc/sys/vm/drop_caches;
> >
> >
> >So, I guess it worked? (I don't know what was supposed to happen, but
> >memory usage dropped significantly when I did this.)
>
> So I guess you are not seeing any memory leak at all, but just the regular
> caching?

Also, how can you explain the differences between the graphs of
long-term memory usage? This first graph is from a server running
2.6.16 that never has memory problems:

http://i239.photobucket.com/albums/ff117/fredty8/memory-a4.png

And here's a graph of a server running 2.6.12 that has to be rebooted
every month or two because it runs out of memory:

http://i239.photobucket.com/albums/ff117/fredty8/memory-a2.png

Now, admittedly, the 2.6.20 server has not been running long enough to
know whether or not it's going to start starving applications of
memory, but the graph here looks a whole lot more like 2.6.12 than
2.6.16, wouldn't you agree?

http://i239.photobucket.com/albums/ff117/fredty8/memory-b1.png


Those 2.6.12 servers caused me a ton of stress because I let the
problem go too long before I did anything. In the event that 2.6.20 is
doing the same thing, I'm trying to fix it before things get out of
control.

2007-08-26 16:59:27

by Jan Engelhardt

Subject: Re: Slow, persistent memory leak in 2.6.20


On Aug 26 2007 12:49, Fred Tyler wrote:
>>
>> So I guess you are not seeing any memory leak at all, but just the regular
>> caching?
>
>I certainly hope that is the case, but until I try it on the
>production machine tonight I won't know for sure.

Note that not all kernels have the 'drop_caches' control file.
So there is not much you can do there.

>If this is indeed a
>leak, it's pretty slow, and it takes a week or so before you can even
>start noticing it on the graphs

Well, if it helps, you can accelerate the 'problem' by issuing, for example:

(1)
dd_rescue /dev/sda /dev/null -m $[4*1048576*1024]

to read 4 GB straight from disk and populate 'buffers'.

(2)
cat /some/big/big/big/file >/dev/null

to read X GB from disk and populate 'cache'.

and then you'll see. Also note that a kernel leak will eventually lead
to very low buffers/cached values (the ones to the far right) even when
large amounts of data are read (using either dd_rescue/cat as mentioned
above), because, of course, the leak clogs up memory.

             total       used       free     shared    buffers     cached
Mem:        775792     493724     282068          0          8     308416
-/+ buffers/cache:     185300     590492
Swap:       795136         60     795076
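
The $[...] in command (1) is the older bash spelling of arithmetic expansion; $((...)) is the portable form. A small sketch of the same size computation plus the read-and-discard step from (2), assuming a shell with 64-bit arithmetic (the function names gib_to_bytes and read_through are made up for illustration):

```shell
# Number of bytes in N GiB, using POSIX arithmetic expansion.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Read a file and discard the data, which pulls it into the page cache.
read_through() {
    cat "$1" > /dev/null
}

gib_to_bytes 4

# A plain dd equivalent of (1), reading 4096 blocks of 1 MiB from the raw device:
#   dd if=/dev/sda of=/dev/null bs=1M count=4096
```

Running free before and after either read shows which of the two columns, buffers or cached, has grown.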


>I can say with absolute certainty that something very similar was
>happening in 2.6.12 (compare the graphs in my original email), and in
>2.6.12 it would inevitably lead to the server running entirely out of
>memory, to the point where applications could no longer allocate
>memory and the server would have to be rebooted.

>The symptoms were almost identical in that case: I'd shut down
>virtually every application on the server, but the memory would still
>be almost entirely in use. I understand there's kernel caching, but if
>the kernel caching occurs at the expense of any other applications
>being able to access memory, then there's a real problem. (I actually
>still have one 2.6.12 machine running, but drop_caches doesn't appear
>to exist on it so I can't test it there. Is there an analogue?)

If you see an Out Of Memory notice in dmesg even though everything was
shut down, you'll know there is a leak.
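
A quick way to check for such notices is to search the kernel log. A minimal sketch; the search patterns are an assumption about common wording, since the exact message text varies between kernel versions, and the function name count_oom_lines is made up for illustration:

```shell
# Count kernel log lines that look like OOM-killer activity (input on stdin).
# The patterns are a guess at common wording, not an exhaustive list.
count_oom_lines() {
    grep -i -c -E 'out of memory|oom-killer'
}

# Typical use:
#   dmesg | count_oom_lines
```

A count of 0 only means nothing is left in the current log buffer; older messages may already have scrolled out of it.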

>
>Anyway, I'll post the results from the 2.6.20 server as soon as I have
>them. Should be late tonight.
>
>Thanks.
>

Jan
--

2007-08-26 17:03:29

by Denys Vlasenko

Subject: Re: Slow, persistent memory leak in 2.6.20

On Sunday 26 August 2007 17:16, Fred Tyler wrote:
> So, I guess it worked? (I don't know what was supposed to happen, but
> memory usage dropped significantly when I did this.)

If you can reclaim "leaked" memory this way, it means that
you found a bug where cached data is incorrectly kept
in RAM in preference to other data.
(I'm assuming that you do have real problems after some time
of "leaking" memory - you mention that you get swap storms
and eventually the machine is dead.)

> However, I'm not sure this staging machine has been up long enough or
> doing enough to exhibit the problem. I can try this on my production
> servers (the ones I provided graphs for) late tonight, but how safe is
> running this command? Does it permanently disable file caching? Do I

Yes, it's safe to do, anytime.

It's just a command to the kernel to drop as much of the currently
accumulated filesystem cache as it can. It is strictly
a debugging/benchmarking aid.

If you end up needing to do it once in a while to keep your machine
alive, something is definitely wrong.
--
vda

2007-08-26 17:41:52

by Fred Tyler

Subject: Re: Slow, persistent memory leak in 2.6.20

On 8/26/07, Denys Vlasenko wrote:
> On Sunday 26 August 2007 17:16, Fred Tyler wrote:
> > So, I guess it worked? (I don't know what was supposed to happen, but
> > memory usage dropped significantly when I did this.)
>
> If you can reclaim "leaked" memory this way, it means that
> you found a bug where cached data is incorrectly kept
> in RAM in preference of other data.
> (I'm assuming that you do have real problems after some time
> of "leaking" memory - you mention that you get swap storms
> and eventually machine is dead.)

This was exactly what happened with 2.6.12 -- more and more memory
used until there was a swap storm and a dead machine.

The 2.6.20 machines haven't been up long enough to know if they're
going to be hit by the same problem, but it seems peculiar to me that
the 2.6.16 machine does not do anything remotely like this. As you can
see in the graphs, the 2.6.16 memory use levels off very quickly, but
2.6.12 keeps dropping until the machine bombs.

The 2.6.20 graph looks like it's heading the same direction as 2.6.12.

I'm going to run drop_caches on the 2.6.20 machines tonight and see
what happens...

2007-08-26 17:42:48

by Jan Engelhardt

Subject: Re: Slow, persistent memory leak in 2.6.20


On Aug 26 2007 12:58, Fred Tyler wrote:
>> So I guess you are not seeing any memory leak at all, but just the regular
>> caching?
>
>Also, how can you explain the differences between the graphs of
>long-term memory usage? This first graph is from a server running
>2.6.16 that never has memory problems:
>
> http://i239.photobucket.com/albums/ff117/fredty8/memory-a4.png
>
>And here's a graph of a server running 2.6.12 that has to be rebooted
>every month or two because it runs out of memory:
>
> http://i239.photobucket.com/albums/ff117/fredty8/memory-a2.png

Indeed, that looks like a leak. But perhaps it would be helpful to graph not
only MemFree, Buffers and Cached (from /proc/meminfo) but also Slab.

>Now, admittedly, the 2.6.20 server has not been running long enough to
>know whether or not it's going to start starving applications of
>memory, but the graph here looks a whole lot more like 2.6.12 than
>2.6.16, wouldn't you agree:
>
> http://i239.photobucket.com/albums/ff117/fredty8/memory-b1.png

But it looks like.

>Those 2.6.12 servers caused me a ton of stress because I let the
>problem go too long before I did anything. In the event that 2.6.20 is
>doing the same thing, I'm trying to fix it before things get out of
>control.
>

Jan
--

2007-08-26 17:44:17

by Jan Engelhardt

Subject: Re: Slow, persistent memory leak in 2.6.20


On Aug 26 2007 13:41, Fred Tyler wrote:
>
>I'm going to run drop_caches on the 2.6.20 machines tonight and see
>what happens...

Better add "Slab" to your graphs; that looks like the amount of
non-cache kernel memory used.
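
A minimal sketch of a sampler that emits those four values on one line for a graphing tool to pick up. The key=value output format is arbitrary and the function name mem_sample is made up for illustration:

```shell
# Emit one line with the /proc/meminfo fields worth graphing.
# $1: meminfo-style file to read (default /proc/meminfo)
mem_sample() {
    awk '/^(MemFree|Buffers|Cached|Slab):/ {
             sub(/:$/, "", $1)      # strip the trailing colon from the key
             printf "%s%s=%s", sep, $1, $2
             sep = " "
         }
         END { print "" }' "${1:-/proc/meminfo}"
}

# Example: take a sample every five minutes
#   while :; do echo "$(date +%s) $(mem_sample)"; sleep 300; done
```

Because the pattern is anchored at the start of the line, SwapCached is not mistaken for Cached.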

Jan
--

2007-08-26 22:16:20

by Jesper Juhl

Subject: Re: Slow, persistent memory leak in 2.6.20

On 26/08/07, Fred Tyler wrote:
> I think I've come across a memory leak in 2.6.20. I've upgraded to the
> latest 2.6.20.17, but it didn't seem to help.
>
Have you tried the latest 2.6.22.5?
A lot of memory leaks have been fixed between 2.6.20 and the latest
stable kernel - could be that yours is amongst the ones fixed :-)


--
Jesper Juhl
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html