2009-10-02 12:27:05

by Daniel Lezcano

Subject: pidns memory leak

Hi,

I am facing a problem with the pid namespace when I launch the following
lxc commands:

lxc-execute -n foo sleep 3600 &
ls -al /proc/$(pidof lxc-init)/exe && lxc-stop -n foo

All the processes related to the container are killed, but there is
still a refcount on the pid_namespace which is never released.
That can be verified in /proc/slabinfo. Running a test suite with
thousands of tests quickly exhausts the memory and the OOM killer is
triggered.
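
For anyone who wants to script the /proc/slabinfo check, here is a minimal
sketch in C. It assumes the "slabinfo - version: 2.x" layout, where each data
line starts with the cache name followed by the active-object count. The
function names slab_active_objs() and read_slab_active() are made up for this
illustration and are not part of lxc or the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/slabinfo. If it describes the cache `name`,
 * store its active-object count in *active and return 1; otherwise
 * return 0. Assumes data lines look like
 * "<name> <active_objs> <num_objs> ...".
 */
int slab_active_objs(const char *line, const char *name,
		     unsigned long *active)
{
	char cache[64];
	unsigned long objs;

	assert(line && name && active);
	/* Header lines fail here because their second field is not a number. */
	if (sscanf(line, "%63s %lu", cache, &objs) != 2)
		return 0;
	if (strcmp(cache, name) != 0)
		return 0;
	*active = objs;
	return 1;
}

/* Scan /proc/slabinfo for `name`; return 1 and fill *active if found. */
int read_slab_active(const char *name, unsigned long *active)
{
	char line[512];
	int found = 0;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f)
		return 0;	/* reading slabinfo usually requires root */
	while (!found && fgets(line, sizeof(line), f))
		found = slab_active_objs(line, name, active);
	fclose(f);
	return found;
}
```

Calling read_slab_active("pid_namespace", &n) before and after a
start/stop cycle and comparing the two values should show whether the
count keeps growing.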

Reproduced with a 2.6.31 vanilla kernel on the i686 and x86_64 architectures.

Thanks
-- Daniel


2009-10-06 04:05:48

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Daniel Lezcano [[email protected]] wrote:
> Hi,
>
> I am facing a problem with the pid namespace when I launch the following
> lxc commands:
>
> lxc-execute -n foo sleep 3600 &
> ls -al /proc/$(pidof lxc-init)/exe && lxc-stop -n foo
>
> All the processes related to the container are killed, but there is
> still a refcount on the pid_namespace which is never released.

Thanks for the bug report.

Did you notice any leak in 'struct pids' also or just the pid_namespace ?
If the pids are not leaking, this may be slightly different from the problem
Catalin Marinas ran into:

http://lkml.org/lkml/2009/7/29/406

And the pid_namespace leak does not seem to reproduce for me without the
'ls -al /proc/...' above, or with the simpler 'ns_exec' approach to
creating a pid namespace.

I am going through the code for lxc-execute, but does it remount /proc
in the container ?

Sukadev

2009-10-06 08:19:33

by Daniel Lezcano

Subject: Re: pidns memory leak

Sukadev Bhattiprolu wrote:
> Daniel Lezcano [[email protected]] wrote:
>> Hi,
>>
>> I am facing a problem with the pid namespace when I launch the following
>> lxc commands:
>>
>> lxc-execute -n foo sleep 3600 &
>> ls -al /proc/$(pidof lxc-init)/exe && lxc-stop -n foo
>>
>> All the processes related to the container are killed, but there is
>> still a refcount on the pid_namespace which is never released.
>
> Thanks for the bug report.
>
> Did you notice any leak in 'struct pids' also or just the pid_namespace ?
> If the pids are not leaking, this may be slightly different from the problem
> Catalin Marinas ran into:
>
> http://lkml.org/lkml/2009/7/29/406

I am not sure what you mean by 'struct pids' but what I observed is:

the object counts for pid_2 and pid_namespace (as they are named in
/proc/slabinfo) never decrease.

> And the pid_namespace does not seem to reproduce for me, with out the
> 'ls -al /proc/...' above, or with the simpler 'ns_exec' approach to
> creating pid namespace.

I tried to write a simpler program but failed to reproduce the leak with it.

> I am going through the code for lxc-execute, but does it remount /proc
> in the container ?

Right, the parent does a clone(NEWMNT|NEWPID|NEWIPC|NEWUTS) and waits for
the child, while the child (pid 1) execs the lxc-init process. This
program mounts /proc and fork-execs the command passed as parameter (here
'sleep 3600').

Without this intermediate process, the leak does not *seem* to happen.

If you don't access /proc/<pid>/<file>, the leak does not happen.

So to summarize:

Leak when:
----------

lxc-execute -n foo sleep 3600 &
ls -al /proc/$(pidof sleep)/exe && lxc-stop -n foo

The stop can be done immediately after looking at the proc file or
later; the leak happens in both cases.

No leak when:
-------------
lxc-execute -n foo sleep 3600 &
lxc-stop -n foo


I tried to create a simpler program doing the same, but it did not
trigger the problem.

-- Daniel

2009-10-08 03:08:39

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Still digging through some traces, but below I have some questions that
I am still trying to answer.

>
> I am not sure what you mean by 'struct pids' but what I observed is:

Ok, I see that too. If pids leak, then pid-namespace will leak too.
Do you see any leaks in proc_inode_cache ?

>
> pid_2 and pid_namespace (as they are named in /proc/slabinfo) are never
> decremented.
>
>> And the pid_namespace does not seem to reproduce for me, with out the
>> 'ls -al /proc/...' above, or with the simpler 'ns_exec' approach to
>> creating pid namespace.
>
> I tried to write a simpler program but I failed to reproduce it.
>
>> I am going through the code for lxc-execute, but does it remount /proc
>> in the container ?
>
> Right, the parent does a clone(NEWMNT|NEWPID|NEWIPC|NEWUTS), wait for
> the child while this one (pid 1) 'execs' the lxc-init process. This
> program mounts /proc and fork-exec the command passed as parameter (here
> 'sleep 3600').
>
> Without this intermediate process, the leak *seems* not happening.
>
> If you don't access /proc/<pid>/<file>, the leak is not happening.

I could not see the code for that. Does lxc-stop unmount /proc too ?
Or is the umount expected to happen automatically after all processes
in the container are killed ?

Sukadev

2009-10-08 08:13:04

by Daniel Lezcano

Subject: Re: pidns memory leak

Sukadev Bhattiprolu wrote:
> Still digging through some traces, but below I have some questions that
> I am still trying to answer.
>
>> I am not sure what you mean by 'struct pids' but what I observed is:
>
> Ok, I see that too. If pids leak, then pid-namespace will leak too.
> Do you see any leaks in proc_inode_cache ?

Yes, right. It leaks too.

>> pid_2 and pid_namespace (as they are named in /proc/slabinfo) are never
>> decremented.
>>
>>> And the pid_namespace does not seem to reproduce for me, with out the
>>> 'ls -al /proc/...' above, or with the simpler 'ns_exec' approach to
>>> creating pid namespace.
>> I tried to write a simpler program but I failed to reproduce it.
>>
>>> I am going through the code for lxc-execute, but does it remount /proc
>>> in the container ?
>> Right, the parent does a clone(NEWMNT|NEWPID|NEWIPC|NEWUTS), wait for
>> the child while this one (pid 1) 'execs' the lxc-init process. This
>> program mounts /proc and fork-exec the command passed as parameter (here
>> 'sleep 3600').
>>
>> Without this intermediate process, the leak *seems* not happening.
>>
>> If you don't access /proc/<pid>/<file>, the leak is not happening.
>
> I could not see the code for that. Does lxc-stop unmount /proc too ?
> Or is the umount expected to happen automatically after all processes
> in the container are killed ?

The umount is expected to happen automatically. I cannot access the
container from the outside to umount /proc.

2009-10-09 03:29:28

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Daniel Lezcano [[email protected]] wrote:
> Sukadev Bhattiprolu wrote:
>> Still digging through some traces, but below I have some questions that
>> I am still trying to answer.
>>
>>> I am not sure what you mean by 'struct pids' but what I observed is:
>>
>> Ok, I see that too. If pids leak, then pid-namespace will leak too.
>> Do you see any leaks in proc_inode_cache ?
>
> Yes, right. It leaks too.

Ok, some progress...

Can you please verify these observations:

- If the container exits normally, the leak does not seem to happen.
(i.e. reduce your sleep 3600 to, say, sleep 3 and remove the lxc-stop).

- Revert the following commit and check if the leak happens:

commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
Author: Andrea Arcangeli <[email protected]>
Date: Mon Feb 4 22:29:21 2008 -0800

(this commit added the check for PF_EXITING in proc_flush_task_mnt
loosely explained below).

Incomplete analysis :-)

If the container-init is terminated (by the lxc-stop), the container-init
zaps the other processes in the container and waits for them. The leak
happens in this case.

The following sequence of events occurs:

- container-init calls do_exit and sets PF_EXITING (in exit_signals())

- container-init calls zap_pid_ns_processes() (exit_notify /
forget_original_parent() / find_new_reaper())

- In zap_pid_ns_processes() container-init sends SIGKILL to
descendants and calls sys_wait().

- The sys_wait() is expected to call release_task() which calls
proc_flush_task_mnt().

- proc_flush_task_mnt() looks up the dentry for the pid (2 in
our example) and finds the dentry.

But since container-init is itself exiting (i.e. PF_EXITING is
set) it does NOT call shrink_dcache_parent() but,
interestingly, still calls d_drop() and dput().

Now the d_drop() unhashes the dentry for the pid 2.

- proc_flush_task_mnt() then tries to find the dentry for the
tgid of the process. In our case, the tgid == pid == 2 and
we just unhashed the dentry for "2".

So, we don't find the dentry for the leader either (and hence
don't make the second shrink_dcache_parent() call in
proc_flush_task_mnt() either).

Without a call to shrink_dcache_parent(), the proc inode
for the process that was terminated by container-init is
not deleted (i.e. we don't call proc_delete_inode() or
the put_pid() inside it), causing us to leak proc_inodes,
struct pid and hence struct pid_namespace.
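
The sequence above can be illustrated with a toy userspace model. This is a
sketch only, not kernel code: struct toy_dentry and the toy_*() functions are
invented here, and they only mimic the steps described in this mail (lookup by
pid, conditional shrink, d_drop(), then lookup by tgid) for a single-threaded
task where pid == tgid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a /proc/<pid> dentry; NOT the kernel's struct dentry. */
struct toy_dentry {
	bool hashed;		/* still findable by a name lookup */
	int cached_kids;	/* child dentries (e.g. "exe") pinning proc inodes */
};

/* A lookup succeeds only while the dentry is hashed. */
struct toy_dentry *toy_lookup(struct toy_dentry *d)
{
	return d->hashed ? d : NULL;
}

/* Models the effect of shrink_dcache_parent(): drop the cached children. */
void toy_shrink(struct toy_dentry *d)
{
	d->cached_kids = 0;
}

/*
 * Models the two lookups for a task whose pid and tgid are equal, so both
 * names resolve to the same dentry. Returns how often the shrink step ran.
 */
int toy_flush_task(struct toy_dentry *d, bool caller_exiting)
{
	int shrinks = 0;
	struct toy_dentry *found;

	assert(d);
	found = toy_lookup(d);		/* lookup by pid */
	if (found) {
		if (!caller_exiting) {	/* the PF_EXITING check */
			toy_shrink(found);
			shrinks++;
		}
		found->hashed = false;	/* d_drop() unhashes the dentry */
	}

	found = toy_lookup(d);		/* lookup by tgid: same name */
	if (found) {
		toy_shrink(found);
		shrinks++;
	}
	return shrinks;
}
```

With caller_exiting set, the first shrink is skipped and the second lookup
fails because the dentry was just unhashed, so cached_kids is never cleared.
That is the leak in miniature.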

There should be a better fix, but first please confirm if reverting the
above commit fixes the leak for you also.

Sukadev

2009-10-09 13:19:29

by Daniel Lezcano

Subject: Re: pidns memory leak

Sukadev Bhattiprolu wrote:
> Daniel Lezcano [[email protected]] wrote:
>> Sukadev Bhattiprolu wrote:
>>> Still digging through some traces, but below I have some questions that
>>> I am still trying to answer.
>>>
>>>> I am not sure what you mean by 'struct pids' but what I observed is:
>>> Ok, I see that too. If pids leak, then pid-namespace will leak too.
>>> Do you see any leaks in proc_inode_cache ?
>> Yes, right. It leaks too.
>
> Ok, some progress...
>
> Can you please verify these observations:
>
> - If the container exits normally, the leak does not seem to happen.
> (i.e reduce your sleep 3600 to say sleep 3 and remove the lxc-stop).
>
> - Revert the following commit and check if the leak happens:
>
> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
> Author: Andrea Arcangeli <[email protected]>
> Date: Mon Feb 4 22:29:21 2008 -0800
>
> (this commit added the check for PF_EXITING in proc_flush_task_mnt
> loosely explained below).



> Incomplete analysis :-)
>
> If the container-init is terminated (by the lxc-stop), the container zaps
> other processes in the container and waits for them. The leak happens in
> this case.
>
> Following sequence of events occur:
>
> - container-init calls do_exit and sets PF_EXITING (in exit_signals())
>
> - container init calls zaps_pid_ns_processes() (exit_notify /
> forget_orignal_parent() / find_new_reaper())
>
> - In zap_pid_ns_processes() container-init sends SIGKILL to
> descendants and calls sys_wait().
>
> - The sys_wait() is expected to call release_task() which calls
> proc_flush_task_mnt().
>
> - proc_flush_task_mnt() looks up the dentry for the pid (2 in
> our example) and finds the dentry.
>
> But since container-init is itself exiting (i.e PF_EXITING is
> set) it does NOT call the shrink_dcache_parent(), but,
> interestingly calls d_drop() and dput().
>
> Now the d_drop() unhashes the dentry for the pid 2.
>
> - proc_flush_task_mnt() then tries to find the dentry for the
> tgid of the process. In our case, the tgid == pid == 2 and
> we just unhashed the dentry for "2".
>
> So, we don't find the dentry for the leader either (and hence
> don't make the second shrink_dcache_parent() call in
> proc_flush_task_mnt() either).
>
> Without a call to shrink_dcache_parent(), the proc inode
> for the process that was terminated by container init is
> not deleted (i.e we don't call proc_delete_inode() or
> the put_pid() inside it) causing us to leak proc_inodes,
> struct pid and hence struct pid_namespace.

Ouch !

Nice analysis :)

Following your explanation I was able to reproduce the leak with a simple
program (attached). But one thing I do not understand is why
the leak does not appear if I do the 'lstat' (cf. test program) in the
pid 2 context.


> There should be a better fix, but first please confirm if reverting the
> above commit fixes the leak for you also.

I confirm the leak no longer appears when this patch is reverted.

Thanks
-- Daniel


Attachments:
bugpidns_leak.c (1.23 kB)

2009-10-09 20:38:06

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Andrea,

We have been running into a leak in child pid namespaces and some early
debugging points to the following commit:

>> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
>> Author: Andrea Arcangeli <[email protected]>
>> Date: Mon Feb 4 22:29:21 2008 -0800
>>

Reverting the commit seems to fix the leak but we need to do some more
analysis (like the lstat() question Daniel has).

However I have a basic question regarding the commit - the log mentions:

> do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
> come back to run journal_stop)

But release_task() calls shrink_dcache_parent() for a _procfs_ dentry. Does
journal_stop() apply to procfs also ?

Thanks,

Sukadev


Daniel Lezcano [[email protected]] wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [[email protected]] wrote:
>>> Sukadev Bhattiprolu wrote:
>>>> Still digging through some traces, but below I have some questions
>>>> that I am still trying to answer.
>>>>
>>>>> I am not sure what you mean by 'struct pids' but what I observed is:
>>>> Ok, I see that too. If pids leak, then pid-namespace will leak too.
>>>> Do you see any leaks in proc_inode_cache ?
>>> Yes, right. It leaks too.
>>
>> Ok, some progress...
>>
>> Can you please verify these observations:
>>
>> - If the container exits normally, the leak does not seem to happen.
>> (i.e reduce your sleep 3600 to say sleep 3 and remove the lxc-stop).
>>
>> - Revert the following commit and check if the leak happens:
>>
>> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
>> Author: Andrea Arcangeli <[email protected]>
>> Date: Mon Feb 4 22:29:21 2008 -0800
>>
>> (this commit added the check for PF_EXITING in proc_flush_task_mnt
>> loosely explained below).
>
>
>
>> Incomplete analysis :-)
>>
>> If the container-init is terminated (by the lxc-stop), the container zaps
>> other processes in the container and waits for them. The leak happens in
>> this case.
>>
>> Following sequence of events occur:
>>
>> - container-init calls do_exit and sets PF_EXITING (in exit_signals())
>>
>> - container init calls zaps_pid_ns_processes() (exit_notify /
>> forget_orignal_parent() / find_new_reaper())
>>
>> - In zap_pid_ns_processes() container-init sends SIGKILL to
>> descendants and calls sys_wait().
>>
>> - The sys_wait() is expected to call release_task() which calls
>> proc_flush_task_mnt().
>>
>> - proc_flush_task_mnt() looks up the dentry for the pid (2 in
>> our example) and finds the dentry.
>>
>> But since container-init is itself exiting (i.e PF_EXITING is
>> set) it does NOT call the shrink_dcache_parent(), but,
>> interestingly calls d_drop() and dput().
>>
>> Now the d_drop() unhashes the dentry for the pid 2.
>>
>> - proc_flush_task_mnt() then tries to find the dentry for the
>> tgid of the process. In our case, the tgid == pid == 2 and
>> we just unhashed the dentry for "2".
>>
>> So, we don't find the dentry for the leader either (and hence
>> don't make the second shrink_dcache_parent() call in
>> proc_flush_task_mnt() either).
>>
>> Without a call to shrink_dcache_parent(), the proc inode
>> for the process that was terminated by container init is
>> not deleted (i.e we don't call proc_delete_inode() or
>> the put_pid() inside it) causing us to leak proc_inodes,
>> struct pid and hence struct pid_namespace.
>
> Ouch !
>
> Nice analysis :)
>
> Following your explanation I was able to reproduce a simple program
> added in attachment. But there is something I do not understand is why
> the leak does not appear if I do the 'lstat' (cf. test program) in the
> pid 2 context.
>
>
>> There should be a better fix, but first please confirm if reverting the
>> above commit fixes the leak for you also.
>
> I confirm the leak does no longer appear when reverting this patch.
>
> Thanks
> -- Daniel

| #define _GNU_SOURCE /* for clone() and the CLONE_NEW* flags */
| #include <stdio.h>
| #include <unistd.h>
| #include <stdlib.h>
| #include <alloca.h>
| #include <sys/prctl.h>
| #include <sys/param.h>
| #include <sys/mount.h>
| #include <sys/stat.h>
| #include <sys/poll.h>
| #include <signal.h>
| #include <sched.h>
|
| #ifndef CLONE_NEWPID
| # define CLONE_NEWPID 0x20000000
| #endif
|
| /* pid 1 of the new namespace: mounts /proc, forks pid 2, then blocks */
| int child(void *arg)
| {
| 	pid_t pid;
|
| 	if (mount("proc", "/proc", "proc", 0, NULL)) {
| 		perror("mount");
| 		return -1;
| 	}
|
| 	pid = fork();
| 	if (pid < 0) {
| 		perror("fork");
| 		return -1;
| 	}
|
| 	if (!pid) {
| 		/* pid 2: sleep until killed */
| 		poll(0, 0, -1);
| 		exit(-1);
| 	}
|
| 	/* pid 1: sleep until killed by the parent */
| 	poll(0, 0, -1);
|
| 	return 0;
| }
|
| pid_t clonens(int (*fn)(void *), void *arg, int flags)
| {
| 	long stack_size = sysconf(_SC_PAGESIZE);
| 	/* the stack grows down, so pass the top of the buffer */
| 	char *stack = (char *)alloca(stack_size) + stack_size;
| 	return clone(fn, stack, flags | SIGCHLD, arg);
| }
|
| int main(int argc, char *argv[])
| {
| 	pid_t pid;
| 	struct stat s;
| 	char path[MAXPATHLEN];
|
| 	pid = clonens(child, NULL, CLONE_NEWNS|CLONE_NEWPID);
| 	if (pid < 0) {
| 		perror("clone");
| 		return -1;
| 	}
|
| 	/* yes ugly.*/
| 	sleep(1);
|
| 	/* !! assumption : child of my child is pid + 1
| 	 * any reliable simple solution is welcome :) */
| 	snprintf(path, sizeof(path), "/proc/%d/exe", pid + 1);
|
| 	/* touch the /proc entry of pid 2 from the parent namespace */
| 	if (lstat(path, &s)) {
| 		perror("lstat");
| 		exit(-1);
| 	}
|
| 	/* kill the container init; it zaps pid 2 while exiting */
| 	if (kill(pid, SIGKILL)) {
| 		perror("kill");
| 		return -1;
| 	}
|
| 	return 0;
| }

2009-10-09 20:51:00

by Eric W. Biederman

Subject: Re: pidns memory leak

Sukadev Bhattiprolu <[email protected]> writes:

> Andrea,
>
> We have been running a leak in child pid namespaces and some early debugging
> points to the following commit:
>
>>> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
>>> Author: Andrea Arcangeli <[email protected]>
>>> Date: Mon Feb 4 22:29:21 2008 -0800
>>>
>
> Reverting the commit seems to fix the leak but we need to do some more
> analysis (like the lstat() question Daniel has).

Yes.

That entire path is an optimization. It should not be needed for correct
operation. Although it may be responsible for some false positives.

> However I have a basic question regarding the commit - the log mentions:
>
> > do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
> > come back to run journal_stop)
>
> But release_task() calls shrink_dcache_parent() for a _procfs_ dentry. Does
> journal_stop() apply to procfs also ?

The problem when that PF_EXITING check was introduced was that
shrink_dcache_parent could shrink dcache entries for other
filesystems. Last I looked that is no longer the case and we can
remove that code. As I recall proc_flush_task_mnt has a few other minor
bugs as well that could cause problems.

Ultimately what problems are you seeing?

Eric

2009-10-09 21:55:33

by Matt Helsley

Subject: Re: pidns memory leak

On Fri, Oct 09, 2009 at 03:18:23PM +0200, Daniel Lezcano wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [[email protected]] wrote:
>>> Sukadev Bhattiprolu wrote:
>>>> Still digging through some traces, but below I have some questions
>>>> that I am still trying to answer.
>>>>
>>>>> I am not sure what you mean by 'struct pids' but what I observed is:
>>>> Ok, I see that too. If pids leak, then pid-namespace will leak too.
>>>> Do you see any leaks in proc_inode_cache ?
>>> Yes, right. It leaks too.
>>
>> Ok, some progress...
>>
>> Can you please verify these observations:
>>
>> - If the container exits normally, the leak does not seem to happen.
>> (i.e reduce your sleep 3600 to say sleep 3 and remove the lxc-stop).
>>
>> - Revert the following commit and check if the leak happens:
>>
>> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
>> Author: Andrea Arcangeli <[email protected]>
>> Date: Mon Feb 4 22:29:21 2008 -0800
>>
>> (this commit added the check for PF_EXITING in proc_flush_task_mnt
>> loosely explained below).
>
>
>
>> Incomplete analysis :-)
>>
>> If the container-init is terminated (by the lxc-stop), the container zaps
>> other processes in the container and waits for them. The leak happens in
>> this case.
>>
>> Following sequence of events occur:
>>
>> - container-init calls do_exit and sets PF_EXITING (in exit_signals())
>>
>> - container init calls zaps_pid_ns_processes() (exit_notify /
>> forget_orignal_parent() / find_new_reaper())
>>
>> - In zap_pid_ns_processes() container-init sends SIGKILL to
>> descendants and calls sys_wait().
>>
>> - The sys_wait() is expected to call release_task() which calls
>> proc_flush_task_mnt().
>>
>> - proc_flush_task_mnt() looks up the dentry for the pid (2 in
>> our example) and finds the dentry.
>>
>> But since container-init is itself exiting (i.e PF_EXITING is
>> set) it does NOT call the shrink_dcache_parent(), but,
>> interestingly calls d_drop() and dput().
>>
>> Now the d_drop() unhashes the dentry for the pid 2.
>>
>> - proc_flush_task_mnt() then tries to find the dentry for the
>> tgid of the process. In our case, the tgid == pid == 2 and
>> we just unhashed the dentry for "2".
>>
>> So, we don't find the dentry for the leader either (and hence
>> don't make the second shrink_dcache_parent() call in
>> proc_flush_task_mnt() either).
>>
>> Without a call to shrink_dcache_parent(), the proc inode
>> for the process that was terminated by container init is
>> not deleted (i.e we don't call proc_delete_inode() or
>> the put_pid() inside it) causing us to leak proc_inodes,
>> struct pid and hence struct pid_namespace.
>
> Ouch !
>
> Nice analysis :)
>
> Following your explanation I was able to reproduce a simple program
> added in attachment. But there is something I do not understand is why
> the leak does not appear if I do the 'lstat' (cf. test program) in the
> pid 2 context.

I would suspect that lstat may cause the proc inode to be relinked such
that later cleanup paths can reach the struct pid and the pidns.

When it runs, lstat will look up the dentry, fail, look up the process
tgid via the tasklist, relink the proc inode to a new dentry,
get the stat info, and exit back to userspace. Since the proc inode is
properly relinked, a different path can now reach the struct pid and pidns
again and properly clean those up.

That's just my guess. And of course I hand waved the "different path" --
no idea what path that is...

Cheers,
-Matt

2009-10-10 01:32:25

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Ccing Andrea's new email id:

Daniel Lezcano [[email protected]] wrote:
> Following your explanation I was able to reproduce a simple program
> added in attachment. But there is something I do not understand is why
> the leak does not appear if I do the 'lstat' (cf. test program) in the
> pid 2 context.

Hmm, are you sure there is no leak with this test program ? If I put back
the commit (7766755a2f249e7), I do see a leak in all three data structures
(pid_2, proc_inode, pid_namespace).

2009-10-10 01:58:48

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Eric W. Biederman [[email protected]] wrote:
| Sukadev Bhattiprolu <[email protected]> writes:
|
| > Andrea,
| >
| > We have been running a leak in child pid namespaces and some early debugging
| > points to the following commit:
| >
| >>> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
| >>> Author: Andrea Arcangeli <[email protected]>
| >>> Date: Mon Feb 4 22:29:21 2008 -0800
| >>>
| >
| > Reverting the commit seems to fix the leak but we need to do some more
| > analysis (like the lstat() question Daniel has).
|
| Yes.
|
| That entire path is an optimization. It should not be needed for correct
| operation. Although it may be responsible for some false positives.
|
| > However I have a basic question regarding the commit - the log mentions:
| >
| > > do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
| > > come back to run journal_stop)
| >
| > But release_task() calls shrink_dcache_parent() for a _procfs_ dentry. Does
| > journal_stop() apply to procfs also ?
|
| The problem when the that PF_EXITING check was introduced is that
| shrink_dcache_parent could shrink dcache entries for other
| filesystems. Last I looked that is no longer the case and we can
| remove that code.

Ok.

| As I recall proc_flush_task_mnt has a few other minor bugs as well that
| could cause problems.

Can you give me some more details on those bugs ? Reverting the commit
seems to fix the problem.

|
| Ultimately what problems are you seeing?

We are leaking 'struct pid', proc_inode, and 'struct pid_namespace' when
container-init exits before its descendant processes, i.e. when
container-init zaps its descendants and waits for them, it calls
proc_flush_task_mnt(), but then misses the shrink_dcache_parent() call due
to the above commit.

So the proc_inode is never deleted and the references to struct pid and
pid_namespace never go away. Details of the leak are buried in the
previous mail...
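
To make the reference chain concrete, here is a toy userspace sketch of it.
It is not kernel code: the toy_* structures and functions are invented for
illustration, and the only thing they model is what is described above, that
the proc inode holds a reference on the struct pid and the struct pid in turn
holds one on the pid namespace.

```c
#include <assert.h>
#include <stddef.h>

/* Toy reference counts; these are NOT the kernel structures. */
struct toy_ns    { int refs; };
struct toy_pid   { int refs; struct toy_ns *ns; };
struct toy_inode { struct toy_pid *pid; };

void toy_put_ns(struct toy_ns *ns)
{
	assert(ns->refs > 0);
	ns->refs--;
}

/* Dropping the last pid reference releases the pid's hold on its namespace. */
void toy_put_pid(struct toy_pid *pid)
{
	assert(pid->refs > 0);
	if (--pid->refs == 0)
		toy_put_ns(pid->ns);
}

/* Models inode deletion, where the inode's pid reference is dropped. */
void toy_delete_inode(struct toy_inode *inode)
{
	toy_put_pid(inode->pid);
	inode->pid = NULL;
}
```

In this model, reaping the task drops one pid reference, but as long as
toy_delete_inode() is never called the pid keeps a nonzero count and the
namespace count never reaches zero.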

2009-10-10 02:09:32

by Eric W. Biederman

Subject: Re: pidns memory leak

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | Sukadev Bhattiprolu <[email protected]> writes:
> |
> | > Andrea,
> | >
> | > We have been running a leak in child pid namespaces and some early debugging
> | > points to the following commit:
> | >
> | >>> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
> | >>> Author: Andrea Arcangeli <[email protected]>
> | >>> Date: Mon Feb 4 22:29:21 2008 -0800
> | >>>
> | >
> | > Reverting the commit seems to fix the leak but we need to do some more
> | > analysis (like the lstat() question Daniel has).
> |
> | Yes.
> |
> | That entire path is an optimization. It should not be needed for correct
> | operation. Although it may be responsible for some false positives.
> |
> | > However I have a basic question regarding the commit - the log mentions:
> | >
> | > > do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
> | > > come back to run journal_stop)
> | >
> | > But release_task() calls shrink_dcache_parent() for a _procfs_ dentry. Does
> | > journal_stop() apply to procfs also ?
> |
> | The problem when the that PF_EXITING check was introduced is that
> | shrink_dcache_parent could shrink dcache entries for other
> | filesystems. Last I looked that is no longer the case and we can
> | remove that code.
>
> Ok.
>
> | As I recall proc_flush_task_mnt has a few other minor bugs as well that
> | could cause problems.
>
> Can you give me some more details on those bugs ? Reverting the commit
> seems to fix the problem.
>
> |
> | Ultimately what problems are you seeing?
>
> We are leaking 'struct pid', proc_inode, and 'struct pid_namespace', when
> container-init exits before its descendant processes. i.e when the
> container-init zaps its descendants and waits for them, it calls the
> proc_flush_task_mnt(), but then misses the shrink_dcache_parent() call due
> to the above commit.
>
> So the proc_inode is never deleted and the references to struct pid and
> pid_namespace never go away. Details of the leak are buried in the
> previous mail...

It should be the case that bloating up the dcache so that we get a general
shrink_dcache from the memory reclaim code will free the proc_inode and
the appropriate data structures. struct pid is supposed to be small and
safe to leak in rare circumstances.

It should be possible to trigger this condition by:

- creating a pid namespace,
- doing cd /proc/<pid>/ (where <pid> is some process in that pid namespace),
- terminating that pid namespace.

But in that case you are still actively using the proc_inode and the struct
pid for the process that has been killed, because a process has it as its
current working directory.

Eric

2009-10-12 08:42:29

by Daniel Lezcano

Subject: Re: pidns memory leak

Sukadev Bhattiprolu wrote:
> Ccing Andrea's new email id:
>
> Daniel Lezcano [[email protected]] wrote:
>
>> Following your explanation I was able to reproduce a simple program
>> added in attachment. But there is something I do not understand is why
>> the leak does not appear if I do the 'lstat' (cf. test program) in the
>> pid 2 context.
>>
>
> Hmm, are you sure there is no leak with this test program ? If I put back
> the commit (7766755a2f249e7), I do see a leak in all three data structures
> (pid_2, proc_inode, pid_namespace).
>

Let me clarify :)

The program leaks with the commit 7766755a2f249e7 and does not leak
without this commit.
This is the expected behaviour and this simple program exposes the problem.

I tried to modify the program and I moved the lstat to process 2 in
the child namespace. Following your analysis, I was expecting to see a
leak too, but it didn't occur. I was wondering why; maybe there is
something I didn't understand in the analysis.

Thanks
-- Daniel

2009-10-14 06:14:52

by Sukadev Bhattiprolu

Subject: Re: pidns memory leak

Daniel Lezcano [[email protected]] wrote:
> Sukadev Bhattiprolu wrote:
>> Ccing Andrea's new email id:
>>
>> Daniel Lezcano [[email protected]] wrote:
>>
>>> Following your explanation I was able to reproduce a simple program
>>> added in attachment. But there is something I do not understand is
>>> why the leak does not appear if I do the 'lstat' (cf. test program)
>>> in the pid 2 context.
>>>
>>
>> Hmm, are you sure there is no leak with this test program ? If I put back
>> the commit (7766755a2f249e7), I do see a leak in all three data structures
>> (pid_2, proc_inode, pid_namespace).
>>
>
> Let me clarify :)
>
> The program leaks with the commit 7766755a2f249e7 and does not leak
> without this commit.
> This is the expected behaviour and this simple program spots the problem.
>
> I tried to modify the program and I moved the lstat to the process 2 in
> the child namespace. Conforming your analysis, I was expecting to see a
> leak too, but this one didn't occur. I was wondering why, maybe there is
> something I didn't understood in the analysis.

Hmm, there are two separate dentries associated with the processes,
one in each mount of /proc. The proc dentries in the child container
are freed when the child container unmounts its /proc, so you don't see
the leak when the lstat() is inside the container.

When the lstat() is in the root container, it is accessing proc dentries
from the _root container_. They are supposed to be flushed when the task
exits (but the above commit prevents that flush). They should be freed
when the /proc in the root container is unmounted, and leak until then ?

Sukadev

2009-11-02 21:33:51

by Andrew Morton

Subject: Re: pidns memory leak

On Tue, 13 Oct 2009 23:15:33 -0700
Sukadev Bhattiprolu <[email protected]> wrote:

> Daniel Lezcano [[email protected]] wrote:
> > Sukadev Bhattiprolu wrote:
> >> Ccing Andrea's new email id:
> >>
> >> Daniel Lezcano [[email protected]] wrote:
> >>
> >>> Following your explanation I was able to reproduce a simple program
> >>> added in attachment. But there is something I do not understand is
> >>> why the leak does not appear if I do the 'lstat' (cf. test program)
> >>> in the pid 2 context.
> >>>
> >>
> >> Hmm, are you sure there is no leak with this test program ? If I put back
> >> the commit (7766755a2f249e7), I do see a leak in all three data structures
> >> (pid_2, proc_inode, pid_namespace).
> >>
> >
> > Let me clarify :)
> >
> > The program leaks with the commit 7766755a2f249e7 and does not leak
> > without this commit.
> > This is the expected behaviour and this simple program spots the problem.
> >
> > I tried to modify the program and I moved the lstat to the process 2 in
> > the child namespace. Conforming your analysis, I was expecting to see a
> > leak too, but this one didn't occur. I was wondering why, maybe there is
> > something I didn't understood in the analysis.
>
> Hmm, There are two separate dentries associated with the processes.
> One in each mount of /proc. The proc dentries in the child container
> are freed when the child container unmounts its /proc so you don't see
> the leak when the lstat() is inside the container.
>
> When the lstat() is in the root container, it is accessing proc-dentries
> from the _root container_ - They are supposed to be flushed when the task
> exits (but the above commit prevents that flush). They should be freed
> when the /proc in root container is unmounted - and leak until then ?
>

This bug hasn't been fixed yet, has it?

2009-11-02 22:38:21

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: pidns memory leak

Quoting Andrew Morton ([email protected]):
> On Tue, 13 Oct 2009 23:15:33 -0700
> Sukadev Bhattiprolu <[email protected]> wrote:
>
> > Daniel Lezcano [[email protected]] wrote:
> > > Sukadev Bhattiprolu wrote:
> > >> Ccing Andrea's new email id:
> > >>
> > >> Daniel Lezcano [[email protected]] wrote:
> > >>
> > >>> Following your explanation I was able to reproduce a simple program
> > >>> added in attachment. But there is something I do not understand is
> > >>> why the leak does not appear if I do the 'lstat' (cf. test program)
> > >>> in the pid 2 context.
> > >>>
> > >>
> > >> Hmm, are you sure there is no leak with this test program ? If I put back
> > >> the commit (7766755a2f249e7), I do see a leak in all three data structures
> > >> (pid_2, proc_inode, pid_namespace).
> > >>
> > >
> > > Let me clarify :)
> > >
> > > The program leaks with the commit 7766755a2f249e7 and does not leak
> > > without this commit.
> > > This is the expected behaviour and this simple program spots the problem.
> > >
> > > I tried to modify the program and I moved the lstat to the process 2 in
> > > the child namespace. Conforming your analysis, I was expecting to see a
> > > leak too, but this one didn't occur. I was wondering why, maybe there is
> > > something I didn't understood in the analysis.
> >
> > Hmm, There are two separate dentries associated with the processes.
> > One in each mount of /proc. The proc dentries in the child container
> > are freed when the child container unmounts its /proc so you don't see
> > the leak when the lstat() is inside the container.
> >
> > When the lstat() is in the root container, it is accessing proc-dentries
> > from the _root container_ - They are supposed to be flushed when the task
> > exits (but the above commit prevents that flush). They should be freed
> > when the /proc in root container is unmounted - and leak until then ?
> >
>
> This bug hasn't been fixed yet, has it?

Well Suka did trace the bug to commit 7766755a2f249e7, and posted a patch
to revert that, acked by Eric on Oct 20. Suka, were you going to repost
that patch?

-serge

2009-11-02 22:48:40

by Andrew Morton

[permalink] [raw]
Subject: Re: pidns memory leak

On Mon, 2 Nov 2009 16:38:18 -0600
"Serge E. Hallyn" <[email protected]> wrote:

> > This bug hasn't been fixed yet, has it?
>
> Well Suka did trace the bug to commit 7766755a2f249e7, and posted a patch
> to revert that, acked by Eric on Oct 20. Suka, were you going to repost
> that patch?

Ah. OK. Thanks. Found it in the backlog pile.

2009-11-03 07:24:56

by Cedric Le Goater

[permalink] [raw]
Subject: Re: pidns memory leak

On 11/02/2009 11:47 PM, Andrew Morton wrote:
> On Mon, 2 Nov 2009 16:38:18 -0600
> "Serge E. Hallyn" <[email protected]> wrote:
>
>>> This bug hasn't been fixed yet, has it?
>>
>> Well Suka did trace the bug to commit 7766755a2f249e7, and posted a patch
>> to revert that, acked by Eric on Oct 20. Suka, were you going to repost
>> that patch?
>
> Ah. OK. Thanks. Found it in the backlog pile.

We've added the patch to our patchset and we confirm that the pid_* leaks have
been reduced to 'nearly' nothing, but we still have a lot of inode and dentry
leaks. I hope to find some time to investigate and reproduce them with a small
scenario; we are running an LTP-like test suite in a container environment.

Also, Alexey had questions about it.

C.

2009-11-03 08:41:44

by Eric W. Biederman

[permalink] [raw]
Subject: Re: pidns memory leak

Cedric Le Goater <[email protected]> writes:

> On 11/02/2009 11:47 PM, Andrew Morton wrote:
>> On Mon, 2 Nov 2009 16:38:18 -0600
>> "Serge E. Hallyn" <[email protected]> wrote:
>>
>>>> This bug hasn't been fixed yet, has it?
>>>
>>> Well Suka did trace the bug to commit 7766755a2f249e7, and posted a patch
>>> to revert that, acked by Eric on Oct 20. Suka, were you going to repost
>>> that patch?
>>
>> Ah. OK. Thanks. Found it in the backlog pile.
>
> We've added the patch to our patchset and we confirm that the pid_* leaks have
> been reduced to 'nearly' nothing but we still have a lot of inodes and dentries
> leaks. I hope to find some time to investigate and reproduce with a small
> scenario, we are running a LTP like testsuite in a container environment.

Does forcing a cache flush help with the other leaks?

Eric

2009-11-03 09:24:39

by Cedric Le Goater

[permalink] [raw]
Subject: Re: pidns memory leak

On 11/03/2009 09:41 AM, Eric W. Biederman wrote:
> Cedric Le Goater <[email protected]> writes:
>
>> On 11/02/2009 11:47 PM, Andrew Morton wrote:
>>> On Mon, 2 Nov 2009 16:38:18 -0600
>>> "Serge E. Hallyn" <[email protected]> wrote:
>>>
>>>>> This bug hasn't been fixed yet, has it?
>>>>
>>>> Well Suka did trace the bug to commit 7766755a2f249e7, and posted a patch
>>>> to revert that, acked by Eric on Oct 20. Suka, were you going to repost
>>>> that patch?
>>>
>>> Ah. OK. Thanks. Found it in the backlog pile.
>>
>> We've added the patch to our patchset and we confirm that the pid_* leaks have
>> been reduced to 'nearly' nothing but we still have a lot of inodes and dentries
>> leaks. I hope to find some time to investigate and reproduce with a small
>> scenario, we are running a LTP like testsuite in a container environment.
>
> Does forcing a cache flush help with the other leaks?

Yes, it frees a few more dentries, but not enough.

I did:

$ echo 2 > /proc/sys/vm/drop_caches

before:

size-64         193243  198088    88  44  1
dentry          110584  111202   280  14  1
inode_cache     107543  107543  4096   1  1
size-128         56341   63450   152  25  1
size-4096        21107   21107  4096   1  1
vm_area_struct   11838   11960   192  20  1
size-256         11406   11424   280  14  1
size-32           9408    9916    56  67  1
size-512          7710    7710  4096   1  1
sysfs_dir_cache   5288    5328   104  37  1
pid_2              302     336   136  28  1
pid_namespace        1       1  4096   1  1
nsproxy              1      53    72  53  1

after:

size-64         193150  198044    88  44  1
dentry          110509  111202   280  14  1
inode_cache     107543  107543  4096   1  1
size-128         56326   63450   152  25  1
size-4096        21107   21107  4096   1  1
vm_area_struct   11857   11960   192  20  1
size-256         11405   11424   280  14  1
size-32           9408    9916    56  67  1
size-512          7710    7710  4096   1  1
sysfs_dir_cache   5288    5328   104  37  1
pid_2              302     336   136  28  1
pid_namespace        1       1  4096   1  1
nsproxy              1      53    72  53  1
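
To make the difference between the two snapshots easier to read, here is a
minimal sketch that diffs them. It assumes each line has the /proc/slabinfo
layout (cache name first, then the active object count), and the helper names
parse_slab and slab_delta are hypothetical, not part of any existing tool. The
sample data is a subset of the figures above.

```python
def parse_slab(text):
    """Return {cache name: active object count} from slabinfo-style lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header lines found in /proc/slabinfo.
        if len(fields) < 2 or fields[0] in ("#", "slabinfo"):
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def slab_delta(before, after):
    """Return {cache name: change in active objects} for caches that changed."""
    b, a = parse_slab(before), parse_slab(after)
    return {name: a[name] - b[name]
            for name in b
            if name in a and a[name] != b[name]}


# Subset of the snapshots taken around "echo 2 > /proc/sys/vm/drop_caches".
BEFORE = """\
size-64 193243 198088 88 44 1
dentry 110584 111202 280 14 1
inode_cache 107543 107543 4096 1 1
vm_area_struct 11838 11960 192 20 1
pid_2 302 336 136 28 1
pid_namespace 1 1 4096 1 1
"""

AFTER = """\
size-64 193150 198044 88 44 1
dentry 110509 111202 280 14 1
inode_cache 107543 107543 4096 1 1
vm_area_struct 11857 11960 192 20 1
pid_2 302 336 136 28 1
pid_namespace 1 1 4096 1 1
"""

for name, diff in slab_delta(BEFORE, AFTER).items():
    print(f"{name:16} {diff:+d}")
```

On these figures the dentry cache shrinks by only 75 objects and inode_cache
does not move at all, which is what "not enough" refers to.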


I'll come back to you (Daniel or me) when we've nailed this one with a simpler
program. It shows up when stressing the system with lxc containers.

Cheers,

C.