Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1761446AbZJIUiG (ORCPT ); Fri, 9 Oct 2009 16:38:06 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755935AbZJIUiF (ORCPT ); Fri, 9 Oct 2009 16:38:05 -0400 Received: from e39.co.us.ibm.com ([32.97.110.160]:35876 "EHLO e39.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755293AbZJIUiE (ORCPT ); Fri, 9 Oct 2009 16:38:04 -0400 Date: Fri, 9 Oct 2009 13:38:09 -0700 From: Sukadev Bhattiprolu To: Daniel Lezcano , andrea@suse.de, "Eric W. Biederman" Cc: Pavel Emelianov , Sukadev Bhattiprolu , Linux Containers , Linux Kernel Mailing List Subject: Re: pidns memory leak Message-ID: <20091009203809.GA12230@us.ibm.com> References: <4AC5F198.2070407@fr.ibm.com> <20091006040526.GA22923@us.ibm.com> <4ACAFD6A.3060008@fr.ibm.com> <20091008030828.GA18973@us.ibm.com> <4ACD9ECC.90508@fr.ibm.com> <20091009032928.GA2031@us.ibm.com> <4ACF381F.9050808@fr.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4ACF381F.9050808@fr.ibm.com> X-Operating-System: Linux 2.0.32 on an i486 User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5532 Lines: 200 Andrea, We have been running a leak in child pid namespaces and some early debugging points to the following commit: >> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821 >> Author: Andrea Arcangeli >> Date: Mon Feb 4 22:29:21 2008 -0800 >> Reverting the commit seems to fix the leak but we need to do some more analysis (like the lstat() question Daniel has). However I have a basic question regarding the commit - the log mentions: > do_exit->release_task->mark_inode_dirty_sync->schedule() (will never > come back to run journal_stop) But release_task() calls shrink_dcache_parent() for a _procfs_ dentry. Does journal_stop() apply to procfs also ? Thanks, Sukadev Daniel Lezcano [dlezcano@fr.ibm.com] wrote: > Sukadev Bhattiprolu wrote: >> Daniel Lezcano [dlezcano@fr.ibm.com] wrote: >>> Sukadev Bhattiprolu wrote: >>>> Still digging through some traces, but below I have some questions >>>> that I am still trying to answer. >>>> >>>>> I am not sure what you mean by 'struct pids' but what I observed is: >>>> Ok, I see that too. If pids leak, then pid-namespace will leak too. >>>> Do you see any leaks in proc_inode_cache ? >>> Yes, right. It leaks too. >> >> Ok, some progress... >> >> Can you please verify these observations: >> >> - If the container exits normally, the leak does not seem to happen. >> (i.e reduce your sleep 3600 to say sleep 3 and remove the lxc-stop). >> >> - Revert the following commit and check if the leak happens: >> >> commit 7766755a2f249e7e0dabc5255a0a3d151ff79821 >> Author: Andrea Arcangeli >> Date: Mon Feb 4 22:29:21 2008 -0800 >> >> (this commit added the check for PF_EXITING in proc_flush_task_mnt >> loosely explained below). > > > >> Incomplete analysis :-) >> >> If the container-init is terminated (by the lxc-stop), the container zaps >> other processes in the container and waits for them. The leak happens in >> this case. >> >> Following sequence of events occur: >> >> - container-init calls do_exit and sets PF_EXITING (in exit_signals()) >> >> - container init calls zaps_pid_ns_processes() (exit_notify / >> forget_orignal_parent() / find_new_reaper()) >> >> - In zap_pid_ns_processes() container-init sends SIGKILL to >> descendants and calls sys_wait(). >> >> - The sys_wait() is expected to call release_task() which calls >> proc_flush_task_mnt(). >> >> - proc_flush_task_mnt() looks up the dentry for the pid (2 in >> our example) and finds the dentry. >> >> But since container-init is itself exiting (i.e PF_EXITING is >> set) it does NOT call the shrink_dcache_parent(), but, >> interestingly calls d_drop() and dput(). >> >> Now the d_drop() unhashes the dentry for the pid 2. >> >> - proc_flush_task_mnt() then tries to find the dentry for the >> tgid of the process. In our case, the tgid == pid == 2 and >> we just unhashed the dentry for "2". >> >> So, we don't find the dentry for the leader either (and hence >> don't make the second shrink_dcache_parent() call in >> proc_flush_task_mnt() either). >> >> Without a call to shrink_dcache_parent(), the proc inode >> for the process that was terminated by container init is >> not deleted (i.e we don't call proc_delete_inode() or >> the put_pid() inside it) causing us to leak proc_inodes, >> struct pid and hence struct pid_namespace. > > Ouch ! > > Nice analysis :) > > Following your explanation I was able to reproduce a simple program > added in attachment. But there is something I do not understand is why > the leak does not appear if I do the 'lstat' (cf. test program) in the > pid 2 context. > > >> There should be a better fix, but first please confirm if reverting the >> above commit fixes the leak for you also. > > I confirm the leak does no longer appear when reverting this patch. > > Thanks > -- Daniel | #include | #include | #include | #include | #include | #include | #include | #include | #include | | #ifndef CLONE_NEWPID | # define CLONE_NEWPID 0x20000000 | #endif | | int child(void *arg) | { | pid_t pid; | struct stat s; | | if (mount("proc", "/proc", "proc", 0, NULL)) { | perror("mount"); | return -1; | } | | pid = fork(); | if (pid < 0) { | perror("fork"); | return -1; | } | | if (!pid) { | poll(0, 0 , -1); | exit(-1); | } | | poll(0, 0, -1); | | return 0; | } | | pid_t clonens(int (*fn)(void *), void *arg, int flags) | { | long stack_size = sysconf(_SC_PAGESIZE); | void *stack = alloca(stack_size) + stack_size; | return clone(fn, stack, flags | SIGCHLD, arg); | } | | int main(int argc, char *argv[]) | { | pid_t pid; | struct stat s; | char path[MAXPATHLEN]; | | pid = clonens(child, NULL, CLONE_NEWNS|CLONE_NEWPID); | if (pid < 0) { | perror("clone"); | return -1; | } | | /* yes ugly.*/ | sleep(1); | | /* !! assumption : child of my child is pid + 1 | * any reliable simple solution is welcome :) */ | snprintf(path, sizeof(path), "/proc/%d/exe", pid + 1); | | if (lstat(path, &s)) { | perror("lstat"); | exit(-1); | } | | if (kill(pid, SIGKILL)) { | perror("kill"); | return -1; | } | | return 0; | } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/