2005-05-09 04:00:37

by Bruce Guenter

Subject: How to diagnose a kernel memory leak

Greetings.

I am trying to diagnose a slow kernel memory leak, and am having no luck
in pinning it down.

I am currently running unpatched 2.6.12-rc3 (x86) on Gentoo; I saw the
same symptoms with gentoo-sources 2.6.11-r6 and 2.6.11-r4. Over the
course of several days, the server in question has the amount of
available memory (free minus buffers+cache) gradually decrease. If I
let it go, it does eventually thrash itself to death after about a
week (give or take). The rate is about 150MB per day (the system has
2GB of RAM total so it takes several days). The working set of
processes remains the same through the whole period at between 50-150MB
(depending on whether you count VSZ or RSS). Nothing shows up in dmesg
except for a couple of one-time lockd and nfs messages (the system uses
two remote filesystems). The local filesystems are ReiserFS on a 3Ware
7500-4 controller, and the NIC is an Intel E100.

# free
             total       used       free     shared    buffers     cached
Mem:       2076180    2024068      52112          0     166760      93200
-/+ buffers/cache:    1764108     312072
Swap:      1028152         56    1028096

# cat /proc/meminfo
MemTotal: 2076180 kB
MemFree: 63080 kB
Buffers: 158776 kB
Cached: 91664 kB
SwapCached: 4 kB
Active: 1055244 kB
Inactive: 874660 kB
HighTotal: 1179072 kB
HighFree: 640 kB
LowTotal: 897108 kB
LowFree: 62440 kB
SwapTotal: 1028152 kB
SwapFree: 1028096 kB
Dirty: 768 kB
Writeback: 0 kB
Mapped: 12648 kB
Slab: 69872 kB
CommitLimit: 2066240 kB
Committed_AS: 26316 kB
PageTables: 1492 kB
VmallocTotal: 114680 kB
VmallocUsed: 4700 kB
VmallocChunk: 109784 kB

I would be happy to provide any additional information. As it stands, I
have to reboot about once a week to clear the RAM or else it thrashes
itself to death.
--
Bruce Guenter <[email protected]> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8



2005-05-09 06:02:18

by Zwane Mwaikambo

Subject: Re: How to diagnose a kernel memory leak

On Sun, 8 May 2005, Bruce Guenter wrote:

> Greetings.
>
> I am trying to diagnose a slow kernel memory leak, and am having no luck
> in pinning it down.
>
> I am currently running unpatched 2.6.12-rc3 (x86) on Gentoo; I saw the
> same symptoms with gentoo-sources 2.6.11-r6 and 2.6.11-r4. Over the
> course of several days, the server in question has the amount of
> available memory (free minus buffers+cache) gradually decrease. If I
> let it go, it does eventually thrash itself to death after about a
> week (give or take). The rate is about 150MB per day (the system has
> 2GB of RAM total so it takes several days). The working set of
> processes remains the same through the whole period at between 50-150MB
> (depending on whether you count VSZ or RSS). Nothing shows up in dmesg
> except for a couple of one-time lockd and nfs messages (the system uses
> two remote filesystems). The local filesystems are ReiserFS on a 3Ware
> 7500-4 controller, and the NIC is an Intel E100.

Try looking at slabtop(1) output after a few days.

2005-05-09 08:30:02

by Alexander Nyberg

Subject: Re: How to diagnose a kernel memory leak

> I am trying to diagnose a slow kernel memory leak, and am having no luck
> in pinning it down.
>
> I am currently running unpatched 2.6.12-rc3 (x86) on Gentoo; I saw the
> same symptoms with gentoo-sources 2.6.11-r6 and 2.6.11-r4. Over the
> course of several days, the server in question has the amount of
> available memory (free minus buffers+cache) gradually decrease. If I
> let it go, it does eventually thrash itself to death after about a
> week (give or take). The rate is about 150MB per day (the system has
> 2GB of RAM total so it takes several days). The working set of
> processes remains the same through the whole period at between 50-150MB
> (depending on whether you count VSZ or RSS). Nothing shows up in dmesg
> except for a couple of one-time lockd and nfs messages (the system uses
> two remote filesystems). The local filesystems are ReiserFS on a 3Ware
> 7500-4 controller, and the NIC is an Intel E100.

You should keep an eye on /proc/meminfo but if there is memory that is
not accounted for then the patch below might help as it works on a lower
level. It accounts for bare pages in the system available
from /proc/page_owner. So a cat /proc/page_owner > tmpfile would be good
when the system starts to go low. There's a sorting program in
Documentation/page_owner.c used to sort the rather large output.
Also the meminfo you posted, how long had the box been alive when you
took it?

Select "Track page owner" under "Kernel hacking".


Index: linux-2.6/Documentation/page_owner.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/page_owner.c 2005-05-09 09:50:08.000000000 +0200
@@ -0,0 +1,140 @@
+/*
+ * User-space helper to sort the output of /proc/page_owner
+ *
+ * Example use:
+ * cat /proc/page_owner > page_owner.txt
+ * ./sort page_owner.txt sorted_page_owner.txt
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+
+struct block_list {
+ char *txt;
+ int len;
+ int num;
+};
+
+
+static struct block_list *list;
+static int list_size;
+static int max_size;
+
+struct block_list *block_head;
+
+int read_block(char *buf, FILE *fin)
+{
+ int ret = 0;
+ int hit = 0;
+ char *curr = buf;
+
+ for (;;) {
+ *curr = getc(fin);
+ if (*curr == EOF) return -1;
+
+ ret++;
+ if (*curr == '\n' && hit == 1)
+ return ret - 1;
+ else if (*curr == '\n')
+ hit = 1;
+ else
+ hit = 0;
+ curr++;
+ }
+}
+
+static int compare_txt(struct block_list *l1, struct block_list *l2)
+{
+ return strcmp(l1->txt, l2->txt);
+}
+
+static int compare_num(struct block_list *l1, struct block_list *l2)
+{
+ return l2->num - l1->num;
+}
+
+static void add_list(char *buf, int len)
+{
+ if (list_size != 0 &&
+ len == list[list_size-1].len &&
+ memcmp(buf, list[list_size-1].txt, len) == 0) {
+ list[list_size-1].num++;
+ return;
+ }
+ if (list_size == max_size) {
+ printf("max_size too small??\n");
+ exit(1);
+ }
+ list[list_size].txt = malloc(len+1);
+ list[list_size].len = len;
+ list[list_size].num = 1;
+ memcpy(list[list_size].txt, buf, len);
+ list[list_size].txt[len] = 0;
+ list_size++;
+ if (list_size % 1000 == 0) {
+ printf("loaded %d\r", list_size);
+ fflush(stdout);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ FILE *fin, *fout;
+ char buf[1024];
+ int ret, i, count;
+ struct block_list *list2;
+ struct stat st;
+
+ fin = fopen(argv[1], "r");
+ fout = fopen(argv[2], "w");
+ if (!fin || !fout) {
+ printf("Usage: ./program <input> <output>\n");
+ perror("open: ");
+ exit(2);
+ }
+
+ fstat(fileno(fin), &st);
+ max_size = st.st_size / 100; /* hack ... */
+
+ list = malloc(max_size * sizeof(*list));
+
+ for(;;) {
+ ret = read_block(buf, fin);
+ if (ret < 0)
+ break;
+
+ buf[ret] = '\0';
+ add_list(buf, ret);
+ }
+
+ printf("loaded %d\n", list_size);
+
+ printf("sorting ....\n");
+
+ qsort(list, list_size, sizeof(list[0]), compare_txt);
+
+ list2 = malloc(sizeof(*list) * list_size);
+
+ printf("culling\n");
+
+ for (i=count=0;i<list_size;i++) {
+ if (count == 0 ||
+ strcmp(list2[count-1].txt, list[i].txt) != 0) {
+ list2[count++] = list[i];
+ } else {
+ list2[count-1].num += list[i].num;
+ }
+ }
+
+ qsort(list2, count, sizeof(list[0]), compare_num);
+
+ for (i=0;i<count;i++) {
+ fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
+ }
+ return 0;
+}
Index: linux-2.6/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.orig/fs/proc/proc_misc.c 2005-05-09 09:50:04.000000000 +0200
+++ linux-2.6/fs/proc/proc_misc.c 2005-05-09 09:50:10.000000000 +0200
@@ -539,6 +539,67 @@
};
#endif

+#ifdef CONFIG_PAGE_OWNER
+#include <linux/bootmem.h>
+#include <linux/kallsyms.h>
+static ssize_t
+read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ unsigned long start_pfn = min_low_pfn;
+ static unsigned long pfn;
+ struct page *page;
+ char *kbuf, *modname;
+ const char *symname;
+ int ret = 0, next_idx = 1;
+ char namebuf[128];
+ unsigned long offset = 0, symsize;
+ int i;
+
+ pfn = start_pfn + *ppos;
+ page = pfn_to_page(pfn);
+ for (; pfn < max_pfn; pfn++) {
+ if (!pfn_valid(pfn))
+ continue;
+ page = pfn_to_page(pfn);
+ if (page->order >= 0)
+ break;
+ next_idx++;
+ }
+
+ if (!pfn_valid(pfn))
+ return 0;
+
+ *ppos += next_idx;
+
+ kbuf = kmalloc(count, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ ret = snprintf(kbuf, count, "Page allocated via order %d, mask 0x%x\n",
+ page->order, page->gfp_mask);
+
+ for (i = 0; i < 8; i++) {
+ if (!page->trace[i])
+ break;
+ symname = kallsyms_lookup(page->trace[i], &symsize, &offset, &modname, namebuf);
+ ret += snprintf(kbuf + ret, count - ret, "[0x%lx] %s+%lu\n",
+ page->trace[i], namebuf, offset);
+ }
+
+ ret += snprintf(kbuf + ret, count -ret, "\n");
+
+ if (copy_to_user(buf, kbuf, ret))
+ ret = -EFAULT;
+
+ kfree(kbuf);
+ return ret;
+}
+
+static struct file_operations proc_page_owner_operations = {
+ .read = read_page_owner,
+};
+#endif
+
struct proc_dir_entry *proc_root_kcore;

void create_seq_entry(char *name, mode_t mode, struct file_operations *f)
@@ -617,4 +678,11 @@
entry->proc_fops = &ppc_htab_operations;
}
#endif
+#ifdef CONFIG_PAGE_OWNER
+ entry = create_proc_entry("page_owner", S_IWUSR | S_IRUGO, NULL);
+ if (entry) {
+ entry->proc_fops = &proc_page_owner_operations;
+ entry->size = 1024;
+ }
+#endif
}
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2005-05-09 09:50:04.000000000 +0200
+++ linux-2.6/include/linux/mm.h 2005-05-09 09:50:10.000000000 +0200
@@ -257,6 +257,11 @@
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
+#ifdef CONFIG_PAGE_OWNER
+ int order;
+ unsigned int gfp_mask;
+ unsigned long trace[8];
+#endif
};

/*
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug 2005-05-09 09:50:04.000000000 +0200
+++ linux-2.6/lib/Kconfig.debug 2005-05-09 09:50:08.000000000 +0200
@@ -139,6 +139,16 @@
automatically, but we'd like to make it more efficient by not
having to do that.

+config PAGE_OWNER
+ bool "Track page owner"
+ depends on DEBUG_KERNEL && X86
+ help
+ This keeps track of what call chain is the owner of a page, may
+ help to find bare alloc_page(s) leaks. Eats a fair amount of memory.
+ See Documentation/page_owner.c for user-space helper.
+
+ If unsure, say N.
+
config DEBUG_FS
bool "Debug Filesystem"
depends on DEBUG_KERNEL
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2005-05-09 09:50:04.000000000 +0200
+++ linux-2.6/mm/page_alloc.c 2005-05-09 09:50:10.000000000 +0200
@@ -724,6 +724,43 @@
return 1;
}

+#ifdef CONFIG_PAGE_OWNER
+static inline int valid_stack_ptr(struct thread_info *tinfo, void *p)
+{
+ return p > (void *)tinfo &&
+ p < (void *)tinfo + THREAD_SIZE - 3;
+}
+
+static inline void __stack_trace(struct page *page, unsigned long *stack, unsigned long bp)
+{
+ int i = 0;
+ unsigned long addr;
+ struct thread_info *tinfo = (struct thread_info *)
+ ((unsigned long)stack & (~(THREAD_SIZE - 1)));
+
+ memset(page->trace, 0, sizeof(long) * 8);
+
+#ifdef CONFIG_FRAME_POINTER
+ while (valid_stack_ptr(tinfo, (void *)bp)) {
+ addr = *(unsigned long *)(bp + sizeof(long));
+ page->trace[i] = addr;
+ if (++i >= 8)
+ break;
+ bp = *(unsigned long *)bp;
+ }
+#else
+ while (valid_stack_ptr(tinfo, stack)) {
+ addr = *stack++;
+ if (__kernel_text_address(addr)) {
+ page->trace[i] = addr;
+ if (++i >= 8)
+ break;
+ }
+ }
+#endif
+}
+#endif /* CONFIG_PAGE_OWNER */
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -908,6 +945,20 @@
}
return NULL;
got_pg:
+
+#ifdef CONFIG_PAGE_OWNER /* huga... */
+ {
+ unsigned long address, bp;
+#ifdef X86_64
+ asm ("movq %%rbp, %0" : "=r" (bp) : );
+#else
+ asm ("movl %%ebp, %0" : "=r" (bp) : );
+#endif
+ page->order = (int) order;
+ page->gfp_mask = gfp_mask;
+ __stack_trace(page, &address, bp);
+ }
+#endif /* CONFIG_PAGE_OWNER */
zone_statistics(zonelist, z);
return page;
}
@@ -961,6 +1012,9 @@
free_hot_page(page);
else
__free_pages_ok(page, order);
+#ifdef CONFIG_PAGE_OWNER
+ page->order = -1;
+#endif
}
}

@@ -1602,6 +1656,9 @@
set_page_address(page, __va(start_pfn << PAGE_SHIFT));
#endif
start_pfn++;
+#ifdef CONFIG_PAGE_OWNER
+ page->order = -1;
+#endif
}
}



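
The user-space helper in the patch reads blank-line-separated records with getc() into a char and a fixed 1024-byte buffer, so a 0xFF byte can be mistaken for EOF and a long record can overrun the buffer. The following is a minimal sketch of a bounded reader with the same return convention; the name read_record and the cap parameter are illustrative and not part of the posted patch.

```c
#include <stdio.h>

/*
 * Read one record from fin into buf (capacity cap bytes).  A record ends
 * at an empty line, i.e. two consecutive '\n' characters.
 *
 * Returns the record length, counting the newline that ends its last line
 * but not the empty separator line.  Returns -1 at end of input or when
 * the record does not fit in buf.
 */
int read_record(char *buf, size_t cap, FILE *fin)
{
    size_t len = 0;
    int prev_was_newline = 0;
    int c; /* int rather than char, so EOF stays distinguishable */

    while ((c = getc(fin)) != EOF) {
        if (c == '\n' && prev_was_newline) {
            buf[len] = '\0';
            return (int)len;
        }
        if (len + 1 >= cap)
            return -1; /* record too long for the caller's buffer */
        buf[len++] = (char)c;
        prev_was_newline = (c == '\n');
    }
    return -1; /* EOF before a complete record */
}
```

A caller would loop on read_record(buf, sizeof(buf), fin) until it returns a negative value, as the helper's main() does with read_block().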
2005-05-09 14:21:19

by Bruce Guenter

Subject: Re: How to diagnose a kernel memory leak

On Mon, May 09, 2005 at 12:05:03AM -0600, Zwane Mwaikambo wrote:
> Try looking at slabtop(1) output after a few days.

Well, this is interesting and all, and reiser_inode_cache is taking up a
lot of memory, but according to the /proc/meminfo I had posted, slab is
only using about 70MB (Slab: 69872 kB). It doesn't appear to be the
cause of any leaks.
--
Bruce Guenter <[email protected]> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8



2005-05-09 14:24:34

by Bruce Guenter

Subject: Re: How to diagnose a kernel memory leak

On Mon, May 09, 2005 at 10:29:21AM +0200, Alexander Nyberg wrote:
> You should keep an eye on /proc/meminfo but if there is memory that is
> not accounted for then the patch below might help as it works on a
> lower level. It accounts for bare pages in the system available from
> /proc/page_owner.
> Select "Track page owner" under "Kernel hacking".

I will try that the next time I have to reboot. As this is a server, I
cannot arbitrarily take it down unfortunately.

> Also the meminfo you posted, how long had the box been alive when you
> took it?

Almost exactly 2 days.
--
Bruce Guenter <[email protected]> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8



2005-05-11 19:37:42

by Bruce Guenter

Subject: Re: How to diagnose a kernel memory leak

On Mon, May 09, 2005 at 10:29:21AM +0200, Alexander Nyberg wrote:
> the patch below might help as it works on a lower
> level. It accounts for bare pages in the system available
> from /proc/page_owner. So a cat /proc/page_owner > tmpfile would be good
> when the system starts to go low. There's a sorting program in
> Documentation/page_owner.c used to sort the rather large output.

I've been running this for a day and a half now, and a few hundred megs
of memory is now missing:

# free
             total       used       free     shared    buffers     cached
Mem:       2055648    2001884      53764          0     259024     868484
-/+ buffers/cache:     874376    1181272
Swap:      1028152         56    1028096

I've put the output from the sorting program up at
http://untroubled.org/misc/page_owner_sorted

Is this useful information yet, or is there still too much in cached
pages to really identify the source?
--
Bruce Guenter <[email protected]> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8
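
When reading the sorted output, note that the helper's "N times:" count is a number of allocations, not a number of pages. A rough way to weight each record is to multiply the count by 2^order. The sketch below assumes the record layout produced by the patch ("N times:" followed by "Page allocated via order O, mask 0x..."); the function name record_pages is hypothetical, and it assumes each allocation appears once in /proc/page_owner.

```c
#include <stdio.h>

/*
 * Given one record from the sorted output, return the number of pages it
 * represents: allocation count shifted left by the allocation order.
 * Returns -1 if the record does not match the expected layout.
 */
long record_pages(const char *record)
{
    long times;
    int order;
    unsigned int mask;

    /* Whitespace in the format also matches the newline after "times:". */
    if (sscanf(record, "%ld times: Page allocated via order %d, mask 0x%x",
               &times, &order, &mask) != 3)
        return -1;
    if (times < 0 || order < 0 || order > 20)
        return -1; /* implausible values, refuse to guess */

    return times << order;
}
```

With 4 kB pages (the x86 case here), multiplying the result by 4 gives kilobytes, which can then be compared against the amount of memory that appears to be missing.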



2005-05-13 00:19:08

by Andrew Morton

Subject: Re: How to diagnose a kernel memory leak


(Please always do reply-to-all)

Bruce Guenter <[email protected]> wrote:
>
> On Mon, May 09, 2005 at 10:29:21AM +0200, Alexander Nyberg wrote:
> > the patch below might help as it works on a lower
> > level. It accounts for bare pages in the system available
> > from /proc/page_owner. So a cat /proc/page_owner > tmpfile would be good
> > when the system starts to go low. There's a sorting program in
> > Documentation/page_owner.c used to sort the rather large output.
>
> I've been running this for a day and a half now, and a few hundred megs
> of memory is now missing:
>
> # free
>              total       used       free     shared    buffers     cached
> Mem:       2055648    2001884      53764          0     259024     868484
> -/+ buffers/cache:     874376    1181272
> Swap:      1028152         56    1028096
>
> I've put the output from the sorting program up at
> http://untroubled.org/misc/page_owner_sorted
>
> Is this useful information yet, or is there still too much in cached
> pages to really identify the source?

It all looks pretty innocent. Please send the contents of /proc/meminfo
rather than the `free' output. /proc/meminfo has much more info.
Sometimes /proc/vmstat is also useful.

If the /proc/meminfo output indicates that there are a lot of slab pages
then /proc/slabinfo should be looked at.
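
For looking at /proc/slabinfo, a per-cache size estimate is usually more useful than the raw columns. The following is a minimal sketch that assumes the 2.x slabinfo layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then tunables and slabdata); the struct and function names are made up for illustration.

```c
#include <stdio.h>

struct slab_cache {
    char name[64];
    unsigned long active_objs;
    unsigned long num_objs;
    unsigned long objsize;
    unsigned long objperslab;
    unsigned long pagesperslab;
};

/*
 * Parse the leading columns of one /proc/slabinfo line.
 * Returns 1 on success, 0 for header lines or anything else that does
 * not start with a name followed by five numbers.
 */
int parse_slabinfo_line(const char *line, struct slab_cache *out)
{
    return sscanf(line, "%63s %lu %lu %lu %lu %lu",
                  out->name, &out->active_objs, &out->num_objs,
                  &out->objsize, &out->objperslab,
                  &out->pagesperslab) == 6;
}

/* Estimate the memory held by a cache in kB, given the page size in kB. */
unsigned long slab_cache_kb(const struct slab_cache *c, unsigned long page_kb)
{
    unsigned long slabs;

    if (c->objperslab == 0)
        return 0;
    /* Round up: a partially filled slab still occupies whole pages. */
    slabs = (c->num_objs + c->objperslab - 1) / c->objperslab;
    return slabs * c->pagesperslab * page_kb;
}
```

Summing slab_cache_kb() over all lines should land near the Slab: figure in /proc/meminfo; the largest entries are the ones worth watching over time.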


2005-05-13 21:30:19

by Bruce Guenter

Subject: Re: How to diagnose a kernel memory leak

On Thu, May 12, 2005 at 05:18:25PM -0700, Andrew Morton wrote:
> It all looks pretty innocent. Please send the contents of /proc/meminfo
> rather than the `free' output. /proc/meminfo has much more info.

Here are the current meminfo numbers:

MemTotal: 2055648 kB
MemFree: 56512 kB
Buffers: 236880 kB
Cached: 869616 kB
SwapCached: 0 kB
Active: 1004124 kB
Inactive: 729732 kB
HighTotal: 1179072 kB
HighFree: 3584 kB
LowTotal: 876576 kB
LowFree: 52928 kB
SwapTotal: 1028152 kB
SwapFree: 1028096 kB
Dirty: 1036 kB
Writeback: 0 kB
Mapped: 13100 kB
Slab: 252444 kB
CommitLimit: 2055976 kB
Committed_AS: 25704 kB
PageTables: 1060 kB
VmallocTotal: 114680 kB
VmallocUsed: 4700 kB
VmallocChunk: 109836 kB

If I am counting right, free+buffers+cached+slab comes to 1415452 kB.
Of course, at this point, it is far from out of memory like it has been
in the past. I am continuing to monitor, and will post numbers when it
gets closer to what I have previously observed.

> If the /proc/meminfo output indicates that there are a lot of slab pages
> then /proc/slabinfo should be looked at.

That was my first thought, yes. However, when it has run out of memory,
even the slab totals were low (my first post showed only about 70 MB in
slab).
--
Bruce Guenter <[email protected]> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8
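
The free+buffers+cached+slab sum quoted in this message can be reproduced mechanically from saved /proc/meminfo snapshots, which makes it easier to compare readings taken days apart. This is a minimal sketch; the function names meminfo_kb and accounted_kb are illustrative, and the choice of fields simply mirrors the sum above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one field in /proc/meminfo text and return its value in kB,
 * or -1 if the field is absent.  Keys are matched at the start of a
 * line only, so "Cached" does not match the "SwapCached:" line.
 */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *p = text;

    while (p && *p) {
        if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
            long value;
            if (sscanf(p + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        p = strchr(p, '\n'); /* advance to the next line */
        if (p)
            p++;
    }
    return -1;
}

/* MemFree + Buffers + Cached + Slab in kB, or -1 if any field is missing. */
long accounted_kb(const char *text)
{
    static const char *keys[] = { "MemFree", "Buffers", "Cached", "Slab" };
    long sum = 0;
    size_t i;

    for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
        long v = meminfo_kb(text, keys[i]);
        if (v < 0)
            return -1;
        sum += v;
    }
    return sum;
}
```

For the numbers above this gives 1415452 kB, leaving 640196 kB of MemTotal outside the four fields. That remainder is not necessarily leaked: it also covers process memory, page tables and other kernel allocations, so the trend over time matters more than a single reading.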



2005-05-18 18:36:23

by Alexander Nyberg

Subject: Re: How to diagnose a kernel memory leak

If you don't do reply-to-all there's a chance people will miss out on
your mails...

> > It all looks pretty innocent. Please send the contents of /proc/meminfo
> > rather than the `free' output. /proc/meminfo has much more info.
>
> Here are the current meminfo numbers:
>

What's happening with this? It's been a week now so I'm curious.

What you can do is run the attached program. It's a simple memory eater
that will eat the amount of memory you specify, i.e. "./a.out 2000" will
simply eat 2G of memory. This forces all caches to be reaped to a
minimum level, which makes it easier to distinguish the troublemakers.

If you think the machine has lost memory at this time please do:
gcc memeat.c
./a.out 2000
wait until program is done
save /proc/meminfo
save /proc/page_owner
sort page_owner output


Attachments:
memeat.c (446.00 B)
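
The memeat.c attachment itself is not reproduced in this archive. As a rough idea of what such a memory eater does, here is a minimal sketch that allocates memory in 1 MB chunks and writes to every byte so the kernel has to back it with real pages. It is not the attached program; eat_memory, release_memory and CHUNK_BYTES are names chosen for this example.

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES (1024UL * 1024UL) /* allocate in 1 MB steps */

/*
 * Allocate up to `megabytes` chunks of 1 MB, storing the pointers in
 * chunks[] (which must have room for `megabytes` entries).  Every chunk
 * is written to, because untouched malloc() memory may never be backed
 * by physical pages.  Returns the number of megabytes obtained.
 */
size_t eat_memory(void **chunks, size_t megabytes)
{
    size_t got;

    for (got = 0; got < megabytes; got++) {
        chunks[got] = malloc(CHUNK_BYTES);
        if (chunks[got] == NULL)
            break; /* allocator refused: stop here */
        memset(chunks[got], 0xA5, CHUNK_BYTES);
    }
    return got;
}

/* Give the memory back once the snapshots have been taken. */
void release_memory(void **chunks, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        free(chunks[i]);
}
```

A main() around this would read the megabyte count from argv[1], call eat_memory(), and exit, matching the "./a.out 2000" usage above. On 32-bit x86 a single process cannot address much more than about 3 GB, so very large requests will stop early.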

2005-05-19 18:44:57

by Bruce Guenter

Subject: Re: How to diagnose a kernel memory leak

On Wed, May 18, 2005 at 08:32:53PM +0200, Alexander Nyberg wrote:
> > > It all looks pretty innocent. Please send the contents of /proc/meminfo
> > > rather than the `free' output. /proc/meminfo has much more info.
> >
> > Here are the current meminfo numbers:
>
> What's happening with this? It's been a week now so I'm curious.

It appears the memory consumption I thought I was seeing is now gone,
and only conclusively appeared with the Gentoo kernels. I will take
this back to their bug tracker. Sorry for the false alarm.
--
Bruce Guenter <[email protected]> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA 2E2A E96F B2DC 6999 80E8

