From: ebiederm@xmission.com (Eric W. Biederman)
To: Chen Hanxiao
Cc: Serge Hallyn, Andrew Morton, Pavel Emelyanov, David Howells,
    Vasiliy Kulikov, Mateusz Guzik, Oleg Nesterov, Richard Weinberger
Subject: Re: [resend][PATCH v9 1/3] procfs: show hierarchy of pid namespace
Date: Mon, 29 Dec 2014 23:54:25 -0600
Message-ID: <87sifxznqm.fsf@x220.int.ebiederm.org>
In-Reply-To: <1419330039-29207-2-git-send-email-chenhanxiao@cn.fujitsu.com>
    (Chen Hanxiao's message of "Tue, 23 Dec 2014 18:20:37 +0800")
References: <1419330039-29207-1-git-send-email-chenhanxiao@cn.fujitsu.com>
    <1419330039-29207-2-git-send-email-chenhanxiao@cn.fujitsu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Chen Hanxiao writes:

> We lack of pid hierarchy information, and this will lead to:
> a) we don't know pids' relationship, who is whose child:
>    /proc/PID/ns/pid only tell us whether two pids live in different ns
> b) bring trouble to nested lxc container checkpoint/restore/migration
> c) bring trouble to pid translation between containers;
>
> This patch will show the hierarchy of pid namespace
> by pidns_hierarchy like:
>
>

I am still trying to figure out if this is a good idea.

The problem is real, though I am not certain how severe it is. Is there
interesting code this would allow you to write?

It would be nice if we could use the same solution for both user
namespace and pid namespace hierarchy description. This solution
doesn't have a chance of doing that.

The patch itself, though, is currently incorrect.
What is read from a file should be determined at open time, and better
still be constant whoever reads the file. Your pidns_hierarchy file
morphs depending on who is reading it. That is at a minimum confusing,
and it will cause problems if someone decides to pass the file
descriptor.

There is also an issue that this hierarchy does not seem to be able to
deal with pid namespaces that currently have no pids in them. If the
goal is to use this for checkpoint/restart, that may make certain pid
namespace states uncheckpointable. So that seems like a significant
oversight.

Eric

> Ex:
> [root@localhost ~]#cat /proc/pidns_hierarchy
> 18060 1 1
> 18102 18060 2
> 1534 18102 3
> 1600 18102 3
> 1550 1 1
> *Note: numbers represent the pid 1 in different ns
>
> It shows the pid hierarchy below:
>
>         init_pid_ns    1
>                        │
>                  ┌────────────┐
>                  ns1          ns2
>                  │            │
>                 1550        18060
>                               │
>                               │
>                              ns3
>                               │
>                             18102
>                               │
>                          ┌──────────┐
>                          ns4        ns5
>                          │          │
>                         1534       1600
>
> Every pid printed in pidns_hierarchy
> is the init pid of that pid ns level.
>
> Acked-by: Richard Weinberer
>
> Signed-off-by: Chen Hanxiao
> ---
> v9: fix codes be included if CONFIG_PID_NS=n
> v8: use max() from kernel.h
>     fix some improper comments
> v7: change stype to be consistent with current interface like
>
>     remove EXPERT dependent in Kconfig
> v6: fix a get_pid leak and do some cleanups;
> v5: collect pid by find_ge_pid;
>     use local list inside nslist_proc_show;
>     use get_pid, remove mutex lock.
> v4: simplify pid collection and some performance optimizamtion
>     fix another race issue.
> v3: fix a race issue and memory leak issue
> v2: use a procfs text file instead of dirs under /proc
>
>  fs/proc/Kconfig           |   6 +
>  fs/proc/Makefile          |   1 +
>  fs/proc/internal.h        |   9 ++
>  fs/proc/pidns_hierarchy.c | 280 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/proc/root.c            |   1 +
>  5 files changed, 297 insertions(+)
>  create mode 100644 fs/proc/pidns_hierarchy.c
>
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 2183fcf..82dda55 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -71,3 +71,9 @@ config PROC_PAGE_MONITOR
>  	  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
>  	  /proc/kpagecount, and /proc/kpageflags. Disabling these
>  	  interfaces will reduce the size of the kernel by approximately 4kb.
> +
> +config PROC_PID_HIERARCHY
> +	bool "Enable /proc/pidns_hierarchy support"
> +	depends on PROC_FS
> +	help
> +	  Show pid namespace hierarchy information
> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
> index 7151ea4..33e384b 100644
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -30,3 +30,4 @@ proc-$(CONFIG_PROC_KCORE)	+= kcore.o
>  proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
>  proc-$(CONFIG_PRINTK)	+= kmsg.o
>  proc-$(CONFIG_PROC_PAGE_MONITOR)	+= page.o
> +proc-$(CONFIG_PROC_PID_HIERARCHY)	+= pidns_hierarchy.o
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 6fcdba5..18e0773 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -280,6 +280,15 @@ struct proc_maps_private {
>  #endif
>  };
>
> +/*
> + * pidns_hierarchy.c
> + */
> +#ifdef CONFIG_PROC_PID_HIERARCHY
> +	extern void proc_pidns_hierarchy_init(void);
> +#else
> +	static inline void proc_pidns_hierarchy_init(void) {}
> +#endif
> +
>  struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
>
>  extern const struct file_operations proc_pid_maps_operations;
> diff --git a/fs/proc/pidns_hierarchy.c b/fs/proc/pidns_hierarchy.c
> new file mode 100644
> index 0000000..ab1c665
> --- /dev/null
> +++ b/fs/proc/pidns_hierarchy.c
> @@ -0,0 +1,280 @@
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +/*
> + * /proc/pidns_hierarchy
> + *
> + * show the hierarchy of pid namespace as:
> + *
> + *
> + * init_PID: child reaper in ns
> + * parent_of_init_PID: init_PID's parent, child reaper too
> + * relative PID level: pid level relative to caller's ns
> + */
> +
> +#define NS_HIERARCHY	"pidns_hierarchy"
> +
> +/* list for host pid collection */
> +struct pidns_list {
> +	struct list_head list;
> +	struct pid *pid;
> +	unsigned int level;
> +};
> +
> +static void free_pidns_list(struct list_head *head)
> +{
> +	struct pidns_list *tmp, *pos;
> +
> +	list_for_each_entry_safe(pos, tmp, head, list) {
> +		list_del(&pos->list);
> +		put_pid(pos->pid);
> +		kfree(pos);
> +	}
> +}
> +
> +static int
> +pidns_list_add(struct pid *pid, struct list_head *list_head,
> +		int level)
> +{
> +	struct pidns_list *ent;
> +
> +	ent = kmalloc(sizeof(*ent), GFP_KERNEL);
> +	if (!ent)
> +		return -ENOMEM;
> +
> +	ent->pid = pid;
> +	ent->level = level;
> +	list_add_tail(&ent->list, list_head);
> +
> +	return 0;
> +}
> +
> +static int
> +pidns_list_filter(struct list_head *pidns_pid_list,
> +		struct list_head *pidns_pid_tree)
> +{
> +	struct pidns_list *pos, *pos_t;
> +	struct pid_namespace *ns0, *ns1;
> +	struct pid *pid0, *pid1;
> +	int rc, flag = 0;
> +
> +	/*
> +	 * screen pids with relationship
> +	 * in pidns_pid_list, we may add pids like:
> +	 * ns0   ns1   ns2
> +	 * pid1->pid2->pid3
> +	 * we should screen pid1, pid2 and keep pid3
> +	 */
> +	list_for_each_entry(pos, pidns_pid_list, list) {
> +		list_for_each_entry(pos_t, pidns_pid_list, list) {
> +			flag = 0;
> +			pid0 = pos->pid;
> +			pid1 = pos_t->pid;
> +			ns0 = pid0->numbers[pid0->level].ns;
> +			ns1 = pid1->numbers[pid1->level].ns;
> +			if (pos->pid->level < pos_t->pid->level)
> +				for (; ns1 != NULL; ns1 = ns1->parent)
> +					if (ns0 == ns1) {
> +						flag = 1;
> +						break;
> +					}
> +			/* a redundant pid found */
> +			if (flag == 1)
> +				break;
> +		}
> +
> +		if (flag == 0) {
> +			get_pid(pos->pid);
> +			rc = pidns_list_add(pos->pid, pidns_pid_tree, 0);
> +			if (rc) {
> +				put_pid(pos->pid);
> +				goto cleanup;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * Now all useful stuffs are in pidns_pid_tree,
> +	 * free pidns_pid_list
> +	 */
> +	free_pidns_list(pidns_pid_list);
> +
> +	return 0;
> +
> +cleanup:
> +	free_pidns_list(pidns_pid_tree);
> +	return rc;
> +}
> +
> +static void
> +pidns_list_set_level(struct list_head *pidns_list_in,
> +		struct pid_namespace *curr_ns)
> +{
> +	struct pidns_list *pos, *pos_t;
> +	struct pid *pid0, *pid1;
> +	int i;
> +
> +	/*
> +	 * From the pid hierarchy point of view,
> +	 * we already had a list of pids who are not
> +	 * the subsets of each other.
> +	 * But part of them may be same.
> +	 * We need to set the level of each pids:
> +	 * pid0: A->B->C  pid1: A->B->D
> +	 * level:      2              0
> +	 * We use level to identify
> +	 * the public part of each pids.
> +	 */
> +	list_for_each_entry(pos, pidns_list_in, list) {
> +		list_for_each_entry(pos_t, pidns_list_in, list) {
> +			pid0 = pos->pid;
> +			pid1 = pos_t->pid;
> +			if (pid0 == pid1)
> +				continue;
> +			if (pos_t->level > 0)
> +				continue;
> +			for (i = curr_ns->level + 1; i <= pid0->level; i++) {
> +				/* skip the public parts */
> +				if (pid0->numbers[i].ns ==
> +						pid1->numbers[i].ns)
> +					continue;
> +				else
> +					break;
> +			}
> +			pos->level = i - 1;
> +		}
> +	}
> +}
> +
> +/*
> + * Finds all init pids, places them into
> + * pidns_pid_list and then stores the hierarchy
> + * into pidns_pid_tree.
> + */
> +static int proc_pidns_list_refresh(struct pid_namespace *curr_ns,
> +		struct list_head *pidns_pid_list,
> +		struct list_head *pidns_pid_tree)
> +{
> +	struct pid *pid;
> +	int new_nr, nr = 0;
> +	int rc;
> +
> +	/* collect pids in current namespace */
> +	while (nr < PID_MAX_LIMIT) {
> +		rcu_read_lock();
> +		pid = find_ge_pid(nr, curr_ns);
> +		if (!pid) {
> +			rcu_read_unlock();
> +			break;
> +		}
> +
> +		new_nr = pid_vnr(pid);
> +		if (!is_child_reaper(pid)) {
> +			nr = new_nr + 1;
> +			rcu_read_unlock();
> +			continue;
> +		}
> +		get_pid(pid);
> +		rcu_read_unlock();
> +		rc = pidns_list_add(pid, pidns_pid_list, 0);
> +		if (rc) {
> +			put_pid(pid);
> +			goto cleanup;
> +		}
> +		nr = new_nr + 1;
> +	}
> +
> +	/*
> +	 * Only one pid found as the child reaper,
> +	 * so current pid namespace do not have sub-namespace,
> +	 * return 0 directly.
> +	 */
> +	if (list_is_singular(pidns_pid_list)) {
> +		rc = 0;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * screen duplicate pids from pidns_pid_list
> +	 * and form a new list pidns_pid_tree.
> +	 */
> +	rc = pidns_list_filter(pidns_pid_list, pidns_pid_tree);
> +	if (rc)
> +		goto cleanup;
> +
> +	return 0;
> +
> +cleanup:
> +	free_pidns_list(pidns_pid_list);
> +	return rc;
> +}
> +
> +static int nslist_proc_show(struct seq_file *m, void *v)
> +{
> +	struct pidns_list *pos;
> +	struct pid_namespace *ns, *curr_ns;
> +	struct pid *pid;
> +	char pid_buf[16], ppid_buf[16];
> +	int i, rc;
> +
> +	LIST_HEAD(pidns_pid_list);
> +	LIST_HEAD(pidns_pid_tree);
> +
> +	curr_ns = task_active_pid_ns(current);
> +
> +	rc = proc_pidns_list_refresh(curr_ns,
> +			&pidns_pid_list, &pidns_pid_tree);
> +	if (rc)
> +		return rc;
> +
> +	pidns_list_set_level(&pidns_pid_tree, curr_ns);
> +
> +	/* print pid namespace's hierarchy */
> +	list_for_each_entry(pos, &pidns_pid_tree, list) {
> +		pid = pos->pid;
> +		for (i = max(curr_ns->level, pos->level) + 1;
> +				i <= pid->level; i++) {
> +			ns = pid->numbers[i].ns;
> +			/* show PID '1' in specific pid ns */
> +			snprintf(pid_buf, 16, "%u",
> +				pid_vnr(find_pid_ns(1, ns)));
> +			ns = pid->numbers[i - 1].ns;
> +			snprintf(ppid_buf, 16, "%u",
> +				pid_vnr(find_pid_ns(1, ns)));
> +			seq_printf(m, "%s\t%s\t%d\n", pid_buf, ppid_buf,
> +				i - curr_ns->level);
> +		}
> +	}
> +
> +	free_pidns_list(&pidns_pid_tree);
> +
> +	return 0;
> +}
> +
> +static int nslist_proc_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, nslist_proc_show, NULL);
> +}
> +
> +static const struct file_operations proc_nspid_nslist_fops = {
> +	.open		= nslist_proc_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +/*
> + * Called by proc_root_init() to initialize the /proc/pidns_hierarchy
> + */
> +void __init proc_pidns_hierarchy_init(void)
> +{
> +	proc_create(NS_HIERARCHY, S_IRUGO,
> +		NULL, &proc_nspid_nslist_fops);
> +}
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index e74ac9f..bcb55c7 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -190,6 +190,7 @@ void __init proc_root_init(void)
>  	proc_tty_init();
>  	proc_mkdir("bus", NULL);
>  	proc_sys_init();
> +	proc_pidns_hierarchy_init();
>  }
>
>  static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat
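
For anyone trying to picture what userspace would have to do with this
file, here is a minimal sketch of a consumer for the three-column output
in the quoted "Ex:" listing (init pid, parent namespace's init pid,
relative level). It is not part of the patch. The names pidns_entry,
pidns_parse and pidns_count_children are made up for illustration, and
the sketch assumes the columns are whitespace-separated decimal numbers,
one entry per line.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of the pidns_hierarchy output. */
struct pidns_entry {
	unsigned int init_pid;   /* init pid of the namespace, as shown to the reader */
	unsigned int parent_pid; /* init pid of the parent namespace */
	unsigned int level;      /* level relative to the reader's namespace */
};

/*
 * Parse up to max entries from text into out.
 * Returns the number of entries parsed, or -1 on a malformed line.
 */
static int pidns_parse(const char *text, struct pidns_entry *out, int max)
{
	int n = 0;

	assert(text != NULL && out != NULL);
	while (*text != '\0' && n < max) {
		int used = 0;

		/* %n records how many characters the three fields consumed */
		if (sscanf(text, "%u %u %u%n", &out[n].init_pid,
			   &out[n].parent_pid, &out[n].level, &used) != 3)
			return -1;
		n++;
		text += used;
		/* skip the line terminator and any stray whitespace */
		while (*text == '\n' || *text == ' ' || *text == '\t')
			text++;
	}
	return n;
}

/* Count the entries whose parent namespace init pid is parent_pid. */
static int pidns_count_children(const struct pidns_entry *e, int n,
				unsigned int parent_pid)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		if (e[i].parent_pid == parent_pid)
			count++;
	return count;
}
```

Note that a parser like this only sees the pids as translated for
whichever task generated the text, which is exactly why the contents
need to be fixed at open time rather than depend on the reader.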