Hello,
I'm having a problem with a memory leak in the kernel. I'm running
2.6.13.3 from kernel.org on FC4 on a Dell PowerEdge 2850, dual Xeon 3GHz,
with 2GB RAM. Soon after booting up, names_cache starts growing until it
consumes all available memory on the system, at which point the OOM killer
goes nuts and starts killing all the processes on the machine. After
Googling the problem I thought it could be caused by a corrupt filesystem,
but the problem hasn't gone away after running fsck. Here's the entry from
/proc/slabinfo:
names_cache 204686 204686 4096 1 1 : tunables 24 12 8 : slabdata 204686 204686 12
(That's 204686 objects at 4096 bytes each, i.e. roughly 800MB.)
Does anyone have an idea what could cause this problem, or can someone
point me in the right direction?
Thanks,
Robert J Derr
Weatherflow, Inc.
On Thu, 6 Oct 2005, Robert Derr wrote:
>
> I'm having a problem with a memory leak in the kernel. I'm running 2.6.13.3
> from kernel.org on FC4 on a Dell PowerEdge 2850, dual Xeon 3GHz, with 2GB RAM.
Just out of interest, do you have CONFIG_AUDIT_SYSCALL enabled? Does it go
away if you disable it?
Also, what filesystems do you use? And if you run
while : ; do cat /proc/slabinfo | grep names_cache ; sleep 2; done
in one terminal, can you see if you can find any correlation to some
particular action or behaviour that would seem to be part of leaking it?
It really shouldn't grow very big at all normally. I.e. the counts are
normally something like a few tens of entries used or whatever - all the
allocations should basically be temporary, and your 200+ _thousand_
entries are way out of line.
If you can't find anything obvious, then we can try to figure out a way to
just print out the contents of your name entries, I bet that would give a
clue about who is allocating them. But there's also been various leak
debugging patches out there that may help. Manfred may have pointers.
Linus
Linus Torvalds wrote:
> On Thu, 6 Oct 2005, Robert Derr wrote:
>
>> I'm having a problem with a memory leak in the kernel. I'm running 2.6.13.3
>> from kernel.org on FC4 on a Dell PowerEdge 2850, dual Xeon 3GHz, with 2GB RAM.
>>
>
> Just out of interest, do you have CONFIG_AUDIT_SYSCALL enabled? Does it go
> away if you disable it?
>
It looks like it is enabled. CONFIG_AUDITSYSCALL=y in .config, right?
> Also, what filesystems do you use? And if you run
>
> while : ; do cat /proc/slabinfo | grep names_cache ; sleep 2; done
>
> in one terminal, can you see if you can find any correlation to some
> particular action or behaviour that would seem to be part of leaking it?
>
I'm not sure I can find the action or behavior causing the problem.
The server is the master node of a 14-computer cluster running a
mesoscale weather forecasting package, so there are a million things
going on all the time. I guess I could write a program to compare all
the running processes against the names_cache and look for any
correlation.
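Something simpler might do the trick, though: just timestamp the
names_cache line from /proc/slabinfo so the growth can be lined up
against the cluster's job logs afterwards. A minimal sketch (hypothetical
and untested, just to illustrate the idea):

	/* Log the names_cache line from /proc/slabinfo once a second,
	 * prefixed with a Unix timestamp, for later correlation against
	 * the cluster's job logs. */
	#include <stdio.h>
	#include <string.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		char line[512];
		FILE *f;

		for (;;) {
			f = fopen("/proc/slabinfo", "r");
			if (!f) {
				perror("fopen");
				return 1;
			}
			while (fgets(line, sizeof(line), f)) {
				if (strncmp(line, "names_cache", 11) == 0)
					printf("%ld %s", (long)time(NULL), line);
			}
			fclose(f);
			fflush(stdout);
			sleep(1);
		}
	}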
Here's the output of mount; the drives are all ext3:
/dev/sda2 on / type ext3 (rw)
/dev/proc on /proc type proc (rw)
/dev/sys on /sys type sysfs (rw)
/dev/devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
/dev/shm on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
> It really shouldn't grow very big at all normally. I.e. the counts are
> normally something like a few tens of entries used or whatever - all the
> allocations should basically be temporary, and your 200+ _thousand_
> entries are way out of line.
>
> If you can't find anything obvious, then we can try to figure out a way to
> just print out the contents of your name entries, I bet that would give a
> clue about who is allocating them. But there's also been various leak
> debugging patches out there that may help. Manfred may have pointers.
>
> Linus
>
I'll look for those leak debugging patches.
Thanks for your time.
Robert J Derr
Weatherflow, Inc.
On Thu, 6 Oct 2005, Robert Derr wrote:
> >
> > Just out of interest, do you have CONFIG_AUDIT_SYSCALL enabled? Does it go
> > away if you disable it?
>
> It looks like it is enabled. CONFIG_AUDITSYSCALL=y in .config, right?
Yes. My bad.
That would be the prime suspect, especially as you don't seem to have any
strange filesystems. Syscall auditing delays releasing filenames until the
system call exits, and I wouldn't be at all surprised if it might leak.
I doubt you depend on syscall auditing, so the easiest thing to try is
to just disable it and see if the behaviour goes away. If it does, we
have people we can ping to look more closely into what causes it.
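For reference, the relevant definitions are roughly as follows (this is
paraphrased from memory - check include/linux/fs.h and fs/namei.c in
your tree):

	/* The raw allocation/free pair never touches the audit code: */
	#define __getname()	kmem_cache_alloc(names_cachep, SLAB_KERNEL)
	#define __putname(name)	kmem_cache_free(names_cachep, (void *)(name))

	/* With CONFIG_AUDITSYSCALL, putname() defers the real free: if
	 * the task has an audit context, the name is handed off to
	 * audit_putname() and is only supposed to be released when the
	 * syscall exits. */
	#ifdef CONFIG_AUDITSYSCALL
	void putname(const char *name)
	{
		if (unlikely(current->audit_context))
			audit_putname(name);
		else
			__putname(name);
	}
	#else
	#define putname(name)	__putname(name)
	#endif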
> I'm not sure if I can find the action or behavior causing the problem. The
> server is the master node on a 14 computer cluster running a mesoscale weather
> forecasting package so there's a million things going on all the time. I
> guess I could write a program to compare all the processes running against the
> names_cache and look for any correlation.
Ahh, never mind. It sounds like the best thing to do is to first try the
simple audit test, and if the problem remains, just apply one of the slab
debugging patches.
There's one by Alexander Nyberg at least, which would probably show the
likely leak site immediately, since it tracks allocators. Alexander, do
you have a recent version of that to send to Robert?
Linus
Are you running with CONFIG_AUDITSYSCALL?
We ran into what sounds like the same problem, and we're testing
a patch right now for a names_cache growth which only occurs
with CONFIG_AUDITSYSCALL enabled, and then only when you
traverse a symlink. In open_namei(), in the do_link section, we call
__do_follow_link(), which will bypass the auditing whether it's enabled
or not. However, at the end of this section, we will call putname(),
which will *not* actually do a __putname() if auditing is enabled,
because it believes that will happen at syscall return. So we slowly
lose memory with each link traversal.
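If you want to watch it happen, a loop like this should make names_cache
climb steadily on an affected kernel (untested sketch; the paths are made
up - any open() whose final component is a symlink should do):

	/* Open a path through a symlink in a loop; with the bug present,
	 * each traversal should leak one 4K names_cache object.
	 * Hypothetical setup:
	 *   touch /tmp/leak-target
	 *   ln -s /tmp/leak-target /tmp/leak-link */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		int i, fd;

		for (i = 0; i < 100000; i++) {
			fd = open("/tmp/leak-link", O_RDONLY);
			if (fd < 0) {
				perror("open");
				exit(1);
			}
			close(fd);
		}
		return 0;
	}

Run it while watching the slabinfo loop from earlier in the thread and
the active-object count should grow by one per iteration.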
The code in open_namei() is a bit non-intuitive in error conditions,
but the general fix appears to be pretty straightforward. Let me know if
this patch seems to do the trick for you.
--- linux-2.6.13.3/fs/namei.c	2005-08-28 16:41:01.000000000 -0700
+++ linux-2.6.13.3-new/fs/namei.c	2005-10-06 12:45:41.996243768 -0700
@@ -1557,19 +1557,19 @@ do_link:
 	if (nd->last_type != LAST_NORM)
 		goto exit;
 	if (nd->last.name[nd->last.len]) {
-		putname(nd->last.name);
+		__putname(nd->last.name);
 		goto exit;
 	}
 	error = -ELOOP;
 	if (count++==32) {
-		putname(nd->last.name);
+		__putname(nd->last.name);
 		goto exit;
 	}
 	dir = nd->dentry;
 	down(&dir->d_inode->i_sem);
 	path.dentry = __lookup_hash(&nd->last, nd->dentry, nd);
 	path.mnt = nd->mnt;
-	putname(nd->last.name);
+	__putname(nd->last.name);
 	goto do_last;
 }
Rick Lindsley wrote:
> Are you running with CONFIG_AUDITSYSCALL?
>
> We ran into what sounds like the same problem, and we're testing
> a patch right now for a names_cache growth which only occurs
> with CONFIG_AUDITSYSCALL enabled, and then only when you
> traverse a symlink. In open_namei(), in the do_link section, we call
> __do_follow_link(), which will bypass the auditing whether it's enabled
> or not. However, at the end of this section, we will call putname(),
> which will *not* actually do a __putname() if auditing is enabled,
> because it believes that will happen at syscall return. So we slowly
> lose memory with each link traversal.
>
> The code in open_namei() is a bit non-intuitive in error conditions,
> but the general fix appears to be pretty straightforward. Let me know if
> this patch seems to do the trick for you.
>
Thanks Rick and Linus,
Rick, I put in your patch, and after running for 15 minutes the system
is holding steady at around 60-80 allocations. Before, it would already
have been up to a few thousand by now. I'll know for sure tomorrow
morning.
Thanks again for everyone's help,
Robert J Derr
Weatherflow, Inc.
On Thu, 6 Oct 2005, Rick Lindsley wrote:
>
> The code in open_namei() is a bit non-intuitive in error conditions,
> but the general fix appears to be pretty straightforward. Let me know if
> this patch seems to do the trick for you.
This patch seems to be correct.
As far as I can tell, the name in "last.name" has always been allocated
with "__getname()", and it should thus always be free'd with "__putname()"
in order to not cause trouble with the horrible AUDITSYSCALL code.
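I.e. the rule is simply that the alloc/free calls have to pair up
(illustrative kernel-style snippet, not from the tree):

	char *tmp = __getname();	/* raw names_cache allocation... */
	if (!tmp)
		return -ENOMEM;
	/* ... use as a scratch path buffer ... */
	__putname(tmp);			/* ...must be freed with __putname() */

	const char *name = getname(user_path);	/* audit-aware copy... */
	if (IS_ERR(name))
		return PTR_ERR(name);
	/* ... */
	putname(name);			/* ...must be released with putname() */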
Now, very arguably the real bug is that bug-prone code in AUDITSYSCALL,
but I suspect that for 2.6.14 I should just apply this patch.
Al? Any comments? (Full patch quoted here in case you haven't followed the
mailing list)
Linus
> --- linux-2.6.13.3/fs/namei.c	2005-08-28 16:41:01.000000000 -0700
> +++ linux-2.6.13.3-new/fs/namei.c	2005-10-06 12:45:41.996243768 -0700
> @@ -1557,19 +1557,19 @@ do_link:
>  	if (nd->last_type != LAST_NORM)
>  		goto exit;
>  	if (nd->last.name[nd->last.len]) {
> -		putname(nd->last.name);
> +		__putname(nd->last.name);
>  		goto exit;
>  	}
>  	error = -ELOOP;
>  	if (count++==32) {
> -		putname(nd->last.name);
> +		__putname(nd->last.name);
>  		goto exit;
>  	}
>  	dir = nd->dentry;
>  	down(&dir->d_inode->i_sem);
>  	path.dentry = __lookup_hash(&nd->last, nd->dentry, nd);
>  	path.mnt = nd->mnt;
> -	putname(nd->last.name);
> +	__putname(nd->last.name);
>  	goto do_last;
>  }
>
On Thu, Oct 06, 2005 at 02:31:38PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 6 Oct 2005, Rick Lindsley wrote:
> >
> > The code in open_namei() is a bit non-intuitive in error conditions,
> > but the general fix appears to be pretty straightforward. Let me know if
> > this patch seems to do the trick for you.
>
> This patch seems to be correct.
>
> As far as I can tell, the name in "last.name" has always been allocated
> with "__getname()", and it should thus always be free'd with "__putname()"
> in order to not cause trouble with the horrible AUDITSYSCALL code.
>
> Now, very arguably the real bug is that bug-prone code in AUDITSYSCALL,
> but I suspect that for 2.6.14 I should just apply this patch.
>
> Al? Any comments? (Full patch quoted here in case you haven't followed the
> mailing list)
ACK, and the only comment is that audit crap would be better off in /dev/null.
Too late for that now, unfortunately...