2006-02-06 19:22:25

by Eric W. Biederman

Subject: [RFC][PATCH 0/20] Multiple instances of the process id namespace


There have been several discussions in the past month about how
to do a good job of implementing a subset of user space that
looks like it has the system to itself, essentially making
chroot everything it could be. This is my take on what
the implementation of a pid namespace should look like.


What follows is a real patch set that is sufficiently complete
to be used and useful in its own right. There are a few areas
of the kernel that the patchset does not reach; mostly these
cause the compile to fail. In addition, a good thorough review
still needs to be done. This patchset does paint a picture
of how I think things should look.

From the kernel community at large I am asking:
Does the code look generally sane?

Does the use of clone to create a new namespace instance look
like the sane approach?


Hopefully this code is sufficiently comprehensible to allow a good
discussion to come out of this.

Thanks for your time,

Eric



p.s. My apologies for the size of the CC list. It is very hard to tell who
the interested parties are, and since there is no one list we all subscribe
to other than linux-kernel, it is hard to know how to reach everyone in a
timely manner. I am copying everyone who has chimed in on a previous thread
on the subject. If you don't want to be copied in the future, tell me and I
will take your name off of my list.


2006-02-06 19:25:30

by Eric W. Biederman

Subject: [RFC][PATCH 01/20] pid: Introduce the concept of a wid (wait id)


The wait id is the pid returned by wait. For tasks that span 2
namespaces (i.e. the process leaders of the pid namespaces), the
parent knows the task by a different PID value than the task knows
itself. Having a child with PID == 1 would be confusing.

This patch introduces the wid, walks through the kernel, and modifies
the places that observe the pid from the parent process's perspective
to use the wid instead of the pid.
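
As a rough illustration of the pid/wid split, here is a minimal user-space
sketch. The names toy_task and toy_eligible_child are hypothetical and exist
only for this example; the real change is the one made to eligible_child()
in the diff below.

```c
#include <assert.h>

/* Toy model of the two ids; this is not the kernel's task_struct. */
struct toy_task {
	int pid;	/* id the task sees for itself */
	int wid;	/* id the parent sees, as returned by wait */
};

/*
 * Mirrors the "pid > 0" branch of eligible_child(): the parent
 * names the child by its wid, never by the child's own pid.
 */
static int toy_eligible_child(int wanted, const struct toy_task *p)
{
	return wanted > 0 && p->wid == wanted;
}
```

For a namespace leader with pid 1 and wid 4711, the parent's wait matches
on 4711 and never on 1.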

Signed-off-by: Eric W. Biederman <[email protected]>


---

include/linux/sched.h | 1 +
kernel/exit.c | 18 +++++++++---------
kernel/fork.c | 4 ++--
kernel/sched.c | 2 +-
kernel/signal.c | 6 +++---
5 files changed, 16 insertions(+), 15 deletions(-)

598714d79648463ab3f2cbf6f6acd3cd6c09c87a
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f368048..e8ea561 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,7 @@ struct task_struct {
/* ??? */
unsigned long personality;
unsigned did_exec:1;
+ pid_t wid;
pid_t pid;
pid_t tgid;
/*
diff --git a/kernel/exit.c b/kernel/exit.c
index fb4c8b1..749bc8b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -941,7 +941,7 @@ asmlinkage void sys_exit_group(int error
static int eligible_child(pid_t pid, int options, task_t *p)
{
if (pid > 0) {
- if (p->pid != pid)
+ if (p->wid != pid)
return 0;
} else if (!pid) {
if (process_group(p) != process_group(current))
@@ -1018,7 +1018,7 @@ static int wait_task_zombie(task_t *p, i
int status;

if (unlikely(noreap)) {
- pid_t pid = p->pid;
+ pid_t pid = p->wid;
uid_t uid = p->uid;
int exit_code = p->exit_code;
int why, status;
@@ -1130,7 +1130,7 @@ static int wait_task_zombie(task_t *p, i
retval = put_user(status, &infop->si_status);
}
if (!retval && infop)
- retval = put_user(p->pid, &infop->si_pid);
+ retval = put_user(p->wid, &infop->si_pid);
if (!retval && infop)
retval = put_user(p->uid, &infop->si_uid);
if (retval) {
@@ -1138,7 +1138,7 @@ static int wait_task_zombie(task_t *p, i
p->exit_state = EXIT_ZOMBIE;
return retval;
}
- retval = p->pid;
+ retval = p->wid;
if (p->real_parent != p->parent) {
write_lock_irq(&tasklist_lock);
/* Double-check with lock held. */
@@ -1198,7 +1198,7 @@ static int wait_task_stopped(task_t *p,
read_unlock(&tasklist_lock);

if (unlikely(noreap)) {
- pid_t pid = p->pid;
+ pid_t pid = p->wid;
uid_t uid = p->uid;
int why = (p->ptrace & PT_PTRACED) ? CLD_TRAPPED : CLD_STOPPED;

@@ -1269,11 +1269,11 @@ bail_ref:
if (!retval && infop)
retval = put_user(exit_code, &infop->si_status);
if (!retval && infop)
- retval = put_user(p->pid, &infop->si_pid);
+ retval = put_user(p->wid, &infop->si_pid);
if (!retval && infop)
retval = put_user(p->uid, &infop->si_uid);
if (!retval)
- retval = p->pid;
+ retval = p->wid;
put_task_struct(p);

BUG_ON(!retval);
@@ -1310,7 +1310,7 @@ static int wait_task_continued(task_t *p
p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
spin_unlock_irq(&p->sighand->siglock);

- pid = p->pid;
+ pid = p->wid;
uid = p->uid;
get_task_struct(p);
read_unlock(&tasklist_lock);
@@ -1321,7 +1321,7 @@ static int wait_task_continued(task_t *p
if (!retval && stat_addr)
retval = put_user(0xffff, stat_addr);
if (!retval)
- retval = p->pid;
+ retval = pid;
} else {
retval = wait_noreap_copyout(p, pid, uid,
CLD_CONTINUED, SIGCONT,
diff --git a/kernel/fork.c b/kernel/fork.c
index f4a7281..743d46c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -931,10 +931,10 @@ static task_t *copy_process(unsigned lon

p->did_exec = 0;
copy_flags(clone_flags, p);
- p->pid = pid;
+ p->wid = p->pid = pid;
retval = -EFAULT;
if (clone_flags & CLONE_PARENT_SETTID)
- if (put_user(p->pid, parent_tidptr))
+ if (put_user(p->wid, parent_tidptr))
goto bad_fork_cleanup;

p->proc_dentry = NULL;
diff --git a/kernel/sched.c b/kernel/sched.c
index f77f23f..6579d49 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1674,7 +1674,7 @@ asmlinkage void schedule_tail(task_t *pr
preempt_enable();
#endif
if (current->set_child_tid)
- put_user(current->pid, current->set_child_tid);
+ put_user(current->wid, current->set_child_tid);
}

/*
diff --git a/kernel/signal.c b/kernel/signal.c
index 1f54ed7..70c226c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1564,7 +1564,7 @@ void do_notify_parent(struct task_struct

info.si_signo = sig;
info.si_errno = 0;
- info.si_pid = tsk->pid;
+ info.si_pid = tsk->wid;
info.si_uid = tsk->uid;

/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -1629,7 +1629,7 @@ static void do_notify_parent_cldstop(str

info.si_signo = SIGCHLD;
info.si_errno = 0;
- info.si_pid = tsk->pid;
+ info.si_pid = tsk->wid;
info.si_uid = tsk->uid;

/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -1732,7 +1732,7 @@ void ptrace_notify(int exit_code)
memset(&info, 0, sizeof info);
info.si_signo = SIGTRAP;
info.si_code = exit_code;
- info.si_pid = current->pid;
+ info.si_pid = current->wid;
info.si_uid = current->uid;

/* Let the debugger run. */
--
1.1.5.g3480

2006-02-06 19:29:39

by Eric W. Biederman

Subject: [RFC][PATCH 02/20] pspace: The parent process id of pid 1 is always 0.


Force the parent process id of pid == 1 to always be 0. This also
applies to the process with pid 1 in a nested pspace.
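
The rule can be sketched in user space as follows. The struct and the
function toy_getppid are hypothetical stand-ins for this example only; the
kernel check itself is the tgid test added to sys_getppid() below.

```c
#include <assert.h>

/* Toy stand-in for the fields that matter here. */
struct toy_task {
	int tgid;		/* thread group id as the task sees it */
	int parent_tgid;	/* parent's id as it would normally be reported */
};

/* A task that is pid 1 in its own pid space always reports parent 0. */
static int toy_getppid(const struct toy_task *t)
{
	if (t->tgid == 1)
		return 0;
	return t->parent_tgid;
}
```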

Signed-off-by: Eric W. Biederman <[email protected]>


---

arch/alpha/kernel/entry.S | 3 +++
kernel/timer.c | 2 ++
2 files changed, 5 insertions(+), 0 deletions(-)

bec203c8a3e7bec5e3bee8086361caffa71ad685
diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index 7af15bf..38996ab 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -891,6 +891,9 @@ sys_getxpid:
cmpeq $4, $5, $5
beq $5, 1b
#endif
+ cmpeq $0, 1, $5
+ cmoveq $5, 0, $1
+
stq $1, 80($sp)
ret
.end sys_getxpid
diff --git a/kernel/timer.c b/kernel/timer.c
index 4f1cb0a..bae17fb 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -987,6 +987,8 @@ asmlinkage long sys_getppid(void)
#endif
break;
}
+ if (current->tgid == 1)
+ pid = 0;
return pid;
}

--
1.1.5.g3480

2006-02-06 19:32:29

by Eric W. Biederman

Subject: [RFC][PATCH 03/20] pid: Introduce a generic helper to test for init.


There are a lot of places in the kernel where we test for init because
we give it special properties. Most significantly, init must not die.
This results in code all over the kernel testing ->pid == 1.

Introduce is_init to capture this case.

With multiple pid spaces, in all of the cases affected we are looking
for only the first process on the system, not some other process that
has pid == 1.
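
A small user-space sketch of why the helper keys on the wid rather than the
pid. The names toy_task and toy_is_init are hypothetical; the actual helper
is the is_init() added to include/linux/sched.h below.

```c
#include <assert.h>

/* Toy model: pid is what the task sees, wid is what its parent sees. */
struct toy_task {
	int pid;
	int wid;
};

/*
 * Several tasks may have pid == 1 (one per pid space), but only the
 * first user space task is known to its parent as 1.
 */
static int toy_is_init(const struct toy_task *t)
{
	return t->wid == 1;
}
```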

Signed-off-by: Eric W. Biederman <[email protected]>


---

arch/alpha/mm/fault.c | 2 +-
arch/i386/lib/usercopy.c | 2 +-
arch/i386/mm/fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m32r/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/powerpc/platforms/pseries/ras.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/sh/mm/fault.c | 2 +-
arch/sh64/mm/fault.c | 6 +++---
arch/um/kernel/trap_kern.c | 2 +-
arch/x86_64/mm/fault.c | 2 +-
arch/xtensa/mm/fault.c | 2 +-
drivers/char/snsc_event.c | 2 +-
include/linux/sched.h | 13 +++++++++++++
kernel/exit.c | 2 +-
kernel/kexec.c | 2 +-
kernel/sysctl.c | 2 +-
mm/oom_kill.c | 6 +++---
security/seclvl.c | 4 ++--
22 files changed, 39 insertions(+), 26 deletions(-)

26458ca5ad0bf86dde7bbe914e0f475f11945f44
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 64ace5a..36284e7 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -194,7 +194,7 @@ do_page_fault(unsigned long address, uns
/* We ran out of memory, or some other thing happened to us that
made us unable to handle the page fault gracefully. */
out_of_memory:
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/i386/lib/usercopy.c b/arch/i386/lib/usercopy.c
index 4cf981d..ae9b319 100644
--- a/arch/i386/lib/usercopy.c
+++ b/arch/i386/lib/usercopy.c
@@ -543,7 +543,7 @@ survive:
retval = get_user_pages(current, current->mm,
(unsigned long )to, 1, 1, 0, &pg, NULL);

- if (retval == -ENOMEM && current->pid == 1) {
+ if (retval == -ENOMEM && is_init(current)) {
up_read(&current->mm->mmap_sem);
blk_congestion_wait(WRITE, HZ/50);
goto survive;
diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
index cf572d9..adec5ef 100644
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -485,7 +485,7 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (tsk->pid == 1) {
+ if (is_init(tsk)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index af7eb08..3d27cf8 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -241,7 +241,7 @@ ia64_do_page_fault (unsigned long addres

out_of_memory:
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index bf7fb58..3f98d5a 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -300,7 +300,7 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (tsk->pid == 1) {
+ if (is_init(tsk)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index aec1527..0081729 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -181,7 +181,7 @@ good_area:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 2d9624f..c615567 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -172,7 +172,7 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (tsk->pid == 1) {
+ if (is_init(tsk)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a4815d3..a002917 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -351,7 +351,7 @@ bad_area_nosemaphore:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index b046bcf..5d6a4a6 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -314,7 +314,7 @@ static int recover_mce(struct pt_regs *r
err->disposition == RTAS_DISP_NOT_RECOVERED &&
err->target == RTAS_TARGET_MEMORY &&
err->type == RTAS_TYPE_ECC_UNCORR &&
- !(current->pid == 0 || current->pid == 1)) {
+ !(current->pid == 0 || is_init(current))) {
/* Kill off a user process with an ECC error */
printk(KERN_ERR "MCE: uncorrectable ecc error for pid %d\n",
current->pid);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 81ade40..9d9c009 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -316,7 +316,7 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (tsk->pid == 1) {
+ if (is_init(tsk)) {
yield();
goto survive;
}
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 775f86c..46e5e60 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -160,7 +160,7 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/sh64/mm/fault.c b/arch/sh64/mm/fault.c
index f08d0ea..8e2f6c2 100644
--- a/arch/sh64/mm/fault.c
+++ b/arch/sh64/mm/fault.c
@@ -277,7 +277,7 @@ bad_area:
show_regs(regs);
#endif
}
- if (tsk->pid == 1) {
+ if (is_init(tsk)) {
panic("INIT had user mode bad_area\n");
}
tsk->thread.address = address;
@@ -319,14 +319,14 @@ no_context:
* us unable to handle the page fault gracefully.
*/
out_of_memory:
- if (current->pid == 1) {
+ if (is_init(current)) {
panic("INIT out of memory\n");
yield();
goto survive;
}
printk("fault:Out of memory\n");
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/arch/um/kernel/trap_kern.c b/arch/um/kernel/trap_kern.c
index d56046c..bed3e03 100644
--- a/arch/um/kernel/trap_kern.c
+++ b/arch/um/kernel/trap_kern.c
@@ -120,7 +120,7 @@ out_nosemaphore:
* us unable to handle the page fault gracefully.
*/
out_of_memory:
- if (current->pid == 1) {
+ if (is_init(current)) {
up_read(&mm->mmap_sem);
yield();
down_read(&mm->mmap_sem);
diff --git a/arch/x86_64/mm/fault.c b/arch/x86_64/mm/fault.c
index 26eac19..7beb271 100644
--- a/arch/x86_64/mm/fault.c
+++ b/arch/x86_64/mm/fault.c
@@ -545,7 +545,7 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
goto again;
}
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index a945a33..dd0dbec 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -144,7 +144,7 @@ bad_area:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (current->pid == 1) {
+ if (is_init(current)) {
yield();
down_read(&mm->mmap_sem);
goto survive;
diff --git a/drivers/char/snsc_event.c b/drivers/char/snsc_event.c
index baaa365..60d1343 100644
--- a/drivers/char/snsc_event.c
+++ b/drivers/char/snsc_event.c
@@ -207,7 +207,7 @@ scdrv_dispatch_event(char *event, int le
/* first find init's task */
read_lock(&tasklist_lock);
for_each_process(p) {
- if (p->pid == 1)
+ if (is_init(p))
break;
}
if (p) { /* we found init's task */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e8ea561..86a92d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -894,6 +894,19 @@ static inline int pid_alive(struct task_
return p->pids[PIDTYPE_PID].nr != 0;
}

+/**
+ * is_init - check if a task structure is the first user space
+ * task the kernel created.
+ * @tsk: Task structure to be checked.
+ */
+static inline int is_init(struct task_struct *tsk)
+{
+ /* Note there is only one task whose parent knows
+ * it as pid 1.
+ */
+ return tsk->wid == 1;
+}
+
extern void free_task(struct task_struct *tsk);
extern void __put_task_struct(struct task_struct *tsk);
#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
diff --git a/kernel/exit.c b/kernel/exit.c
index 749bc8b..f1af8bb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -786,7 +786,7 @@ fastcall NORET_TYPE void do_exit(long co
panic("Aiee, killing interrupt handler!");
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");
- if (unlikely(tsk == child_reaper))
+ if (unlikely(is_init(tsk)))
panic("Attempted to kill init!");
if (tsk->io_context)
exit_io_context();
diff --git a/kernel/kexec.c b/kernel/kexec.c
index bf39d28..64d05ab 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -40,7 +40,7 @@ struct resource crashk_res = {

int kexec_should_crash(struct task_struct *p)
{
- if (in_interrupt() || !p->pid || p->pid == 1 || panic_on_oops)
+ if (in_interrupt() || !p->pid || is_init(p) || panic_on_oops)
return 1;
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 71dd6f6..8e1bdc5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1806,7 +1806,7 @@ int proc_dointvec_bset(ctl_table *table,
return -EPERM;
}

- op = (current->pid == 1) ? OP_SET : OP_AND;
+ op = is_init(current) ? OP_SET : OP_AND;
return do_proc_dointvec(table,write,filp,buffer,lenp,ppos,
do_proc_dointvec_bset_conv,&op);
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b05ab8f..b417dce 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -148,8 +148,8 @@ static struct task_struct * select_bad_p
unsigned long points;
int releasing;

- /* skip the init task with pid == 1 */
- if (p->pid == 1)
+ /* skip the init task */
+ if (is_init(p))
continue;
if (p->oomkilladj == OOM_DISABLE)
continue;
@@ -184,7 +184,7 @@ static struct task_struct * select_bad_p
*/
static void __oom_kill_task(task_t *p)
{
- if (p->pid == 1) {
+ if (is_init(p)) {
WARN_ON(1);
printk(KERN_WARNING "tried to kill init!\n");
return;
diff --git a/security/seclvl.c b/security/seclvl.c
index 8529ea6..a49f2dd 100644
--- a/security/seclvl.c
+++ b/security/seclvl.c
@@ -313,7 +313,7 @@ static int seclvl_ptrace(struct task_str
static int seclvl_capable(struct task_struct *tsk, int cap)
{
/* init can do anything it wants */
- if (tsk->pid == 1)
+ if (is_init(tsk))
return 0;

switch (seclvl) {
@@ -479,7 +479,7 @@ static void seclvl_file_free_security(st
*/
static int seclvl_umount(struct vfsmount *mnt, int flags)
{
- if (current->pid == 1)
+ if (is_init(current))
return 0;
if (seclvl == 2) {
seclvl_printk(1, KERN_WARNING, "Attempt to unmount in secure "
--
1.1.5.g3480

2006-02-06 19:37:38

by Eric W. Biederman

Subject: [RFC][PATCH 04/20] pspace: Allow multiple instances of the process id namespace


This patch modifies the fork/exit, signal handling, and pid and
process group manipulating syscalls to support multiple process
spaces, and implements the data structures that allow multiple
instances of the pid namespace.

The implementation is the best compromise I have found between a
maintainable implementation that will break users that need to be
modified and a minimal impact implementation that doesn't require all
of the kernel's process id handling to be modified.

pid, tgid, pgrp, and session all retain their current meanings. Functions
that operate in an arbitrary context take an additional pspace
parameter. The tuple (pspace, pid) is globally unique, which makes
the tuple (pspace, pid) the kernel pid and any plain pid value the
user space pid.
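
To make the tuple idea concrete, here is a minimal user-space sketch of a
lookup keyed by both values. toy_pspace, toy_task and toy_find_task are
hypothetical names for this example; in the patch the equivalent job is
done by find_pid() and find_task_by_pid_type() with their new pspace
argument.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pspace {
	const char *name;
};

struct toy_task {
	const struct toy_pspace *pspace;
	int pid;
};

/*
 * A plain pid is ambiguous across pid spaces, so the lookup
 * must match on the pair (pspace, nr).
 */
static const struct toy_task *
toy_find_task(const struct toy_task *tasks, size_t n,
	      const struct toy_pspace *pspace, int nr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tasks[i].pspace == pspace && tasks[i].pid == nr)
			return &tasks[i];
	return NULL;
}
```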

In conjunction with this patch there will need to be an audit of the
kernel to find and fix the places that compare and print pids. These
are the only operations possible without calling a helper function.

The number of comparisons performed not in conjunction with another
pid operation is small, so there should not be many pieces of the
kernel affected.

Fixing up printing of the pids in debug messages is optional but
will likely be useful for understanding what is going on. The
pspace->name field is provided for this purpose. Simply
adding a "%s/" before the existing "%d" is sufficient to add
the pspace name.
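
A user-space sketch of the suggested format change, using snprintf in
place of printk. The function toy_format_pid and the name "guest" are
made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prefix the existing "%d" with "%s/" so the pid space is visible. */
static int toy_format_pid(char *buf, size_t len,
			  const char *pspace_name, int pid)
{
	return snprintf(buf, len, "%s/%d", pspace_name, pid);
}
```

With a pid space named "guest", pid 1 would then be printed as guest/1.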

The API for manipulating PID namespaces is defined as follows:

sys_clone(CLONE_NPSPACE) creates a new process id namespace,
giving the child process 2 pids: the normal pid, which only
the parent sees, and pid 1. The child then becomes the init
process for the namespace in all measurable ways, including
reaping all of the namespace's children who die unexpectedly.

From any process in the parent's namespace with the appropriate
permissions, sys_kill(<child pid>) sends a signal to just the
leading process of the pid namespace.

From any process in the parent's namespace with the appropriate
permissions, sys_kill(-<child pid>) sends a signal to all
of the processes in the pid namespace.

Currently CAP_KILL is defined as the required capability to
create a PID namespace. I don't know of anything harmful that
could result from leaving this unrestricted, but since someone
with CAP_KILL can arbitrarily kill any process it sounded
like a good safety measure.

When the leading process of the pid namespace exits, all of
its children die, because init cannot exit :)

Waitpid on the externally visible pid of a pid namespace waits
for that namespace to exit.

pid namespaces are currently implemented so they can nest one
inside the other, as this adds no real complications.
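
The nesting can be pictured with a small user-space model of the
containment test. This is only a sketch: toy_pspace and toy_in_pspace are
hypothetical, and the root is marked with a NULL parent, whereas
in_pspace() in the patch follows child_reaper.pspace and stops at
&init_pspace.

```c
#include <assert.h>
#include <stddef.h>

/* Each pid space points at the space its leader was created from. */
struct toy_pspace {
	const struct toy_pspace *parent;	/* NULL for the root space */
};

/*
 * Walk from ps towards the root; ps is inside target if the walk
 * reaches target. Assumes target is not NULL.
 */
static int toy_in_pspace(const struct toy_pspace *target,
			 const struct toy_pspace *ps)
{
	while (ps != NULL && ps != target)
		ps = ps->parent;
	return ps == target;
}
```

A space nested two levels down is inside both of its ancestors, but an
ancestor is never inside its descendant, and siblings do not contain each
other.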

Signed-off-by: Eric W. Biederman <[email protected]>


---

fs/exec.c | 7 +-
include/linux/init_task.h | 2
include/linux/kernel.h | 3 -
include/linux/pid.h | 17 +++-
include/linux/pspace.h | 95 ++++++++++++++++++++++
include/linux/sched.h | 23 +++--
include/linux/tty.h | 2
init/main.c | 5 -
kernel/exit.c | 79 ++++++++++++-------
kernel/fork.c | 56 ++++++++-----
kernel/pid.c | 192 ++++++++++++++++++++++++++++++++++-----------
kernel/signal.c | 118 +++++++++++++++++++---------
kernel/sys.c | 14 ++-
13 files changed, 450 insertions(+), 163 deletions(-)
create mode 100644 include/linux/pspace.h

8927729234906eb9fef21ca16bb30598a97b4485
diff --git a/fs/exec.c b/fs/exec.c
index 390aafe..1c2fa2c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -49,6 +49,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/pspace.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -621,8 +622,8 @@ static int de_thread(struct task_struct
* Reparenting needs write_lock on tasklist_lock,
* so it is safe to do it under read_lock.
*/
- if (unlikely(current->group_leader == child_reaper))
- child_reaper = current;
+ if (unlikely(current->group_leader == current->pspace->child_reaper.task))
+ current->pspace->child_reaper.task = current;

zap_other_threads(current);
read_unlock(&tasklist_lock);
@@ -720,7 +721,7 @@ static int de_thread(struct task_struct
list_add_tail(&current->tasks, &init_task.tasks);

current->parent = current->real_parent = leader->real_parent;
- leader->parent = leader->real_parent = child_reaper;
+ leader->parent = leader->real_parent = leader->pspace->child_reaper.task;
current->group_leader = current;
leader->group_leader = leader;

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e182ee6..db93266 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -3,6 +3,7 @@

#include <linux/file.h>
#include <linux/rcupdate.h>
+#include <linux/pspace.h>

#define INIT_FDTABLE \
{ \
@@ -112,6 +113,7 @@ extern struct group_info init_groups;
.thread = INIT_THREAD, \
.fs = &init_fs, \
.files = &init_files, \
+ .pspace = &init_pspace, \
.signal = &init_signals, \
.sighand = &init_sighand, \
.pending = { \
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index b49affa..bd2dd1f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -49,6 +49,7 @@ extern int console_printk[];
struct completion;
struct pt_regs;
struct user;
+struct pspace;

/**
* might_sleep - annotation for functions that can sleep
@@ -123,7 +124,7 @@ extern unsigned long long memparse(char

extern int __kernel_text_address(unsigned long addr);
extern int kernel_text_address(unsigned long addr);
-extern int session_of_pgrp(int pgrp);
+extern int session_of_pgrp(struct pspace *pspace, int pgrp);

extern void dump_thread(struct pt_regs *regs, struct user *dump);

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 099e70e..ba38c13 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -10,13 +10,16 @@ enum pid_type
PIDTYPE_MAX
};

+struct pspace;
struct pid
{
/* Try to keep pid_chain in the same cacheline as nr for find_pid */
+ struct pspace *pspace;
int nr;
struct hlist_node pid_chain;
/* list of pids with the same nr, only one of them is in the hash */
struct list_head pid_list;
+ struct task_struct *task;
};

#define pid_task(elem, type) \
@@ -34,17 +37,19 @@ extern void FASTCALL(detach_pid(struct t
* look up a PID in the hash table. Must be called with the tasklist_lock
* held.
*/
-extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+extern struct pid *FASTCALL(find_pid(enum pid_type, struct pspace *, int));
+#define find_task_by_pid(pspace, nr) find_task_by_pid_type(PIDTYPE_PID, pspace, nr)
+extern struct task_struct *find_task_by_pid_type(int type, struct pspace *pspace, int pid);

-extern int alloc_pidmap(void);
-extern void FASTCALL(free_pidmap(int));
+extern int alloc_pidmap(struct pspace *pspace);
+extern void FASTCALL(free_pidmap(struct pspace *pspace, int pid));

-#define do_each_task_pid(who, type, task) \
- if ((task = find_task_by_pid_type(type, who))) { \
+#define do_each_task_pid(pspace, who, type, task) \
+ if ((task = find_task_by_pid_type(type, pspace, who))) { \
prefetch((task)->pids[type].pid_list.next); \
do {

-#define while_each_task_pid(who, type, task) \
+#define while_each_task_pid(pspace, who, type, task) \
} while (task = pid_task((task)->pids[type].pid_list.next,\
type), \
prefetch((task)->pids[type].pid_list.next), \
diff --git a/include/linux/pspace.h b/include/linux/pspace.h
new file mode 100644
index 0000000..950393a
--- /dev/null
+++ b/include/linux/pspace.h
@@ -0,0 +1,95 @@
+#ifndef _LINUX_PSPACE_H
+#define _LINUX_PSPACE_H
+
+#include <linux/sched.h>
+#include <linux/threads.h>
+#include <linux/pid.h>
+
+struct pidmap
+{
+ atomic_t nr_free;
+ void *page;
+};
+
+#define PIDMAP_ENTRIES ((PID_MAX_LIMIT + 8*PAGE_SIZE - 1)/PAGE_SIZE/8)
+
+struct pspace
+{
+ atomic_t count;
+ unsigned int flags;
+#define PSPACE_EXIT 0x00000001 /* pspace exit in progress */
+ struct pid child_reaper;
+ int nr_threads;
+ int nr_processes;
+ int last_pid;
+ int min;
+ int max;
+ struct pidmap pidmap[PIDMAP_ENTRIES];
+ char name[1]; /* For use in debugging print statements */
+};
+
+extern struct pspace init_pspace;
+
+#define INVALID_PID 0x7fffffff
+
+static inline int pspace_task_visible(struct pspace *pspace, struct task_struct *tsk)
+{
+ return (tsk->pspace == pspace) ||
+ ((tsk->pspace->child_reaper.pspace == pspace) &&
+ (tsk->pspace->child_reaper.task == tsk));
+}
+
+static inline int task_visible(struct task_struct *tsk)
+{
+ return pspace_task_visible(current->pspace, tsk);
+}
+
+static inline void get_pspace(struct pspace *pspace)
+{
+ if (!pspace)
+ return;
+ atomic_inc(&pspace->count);
+}
+
+extern void __put_pspace(struct pspace *pspace);
+
+static inline void put_pspace(struct pspace *pspace)
+{
+ if (!pspace)
+ return;
+ if (atomic_dec_and_test(&pspace->count)) {
+ __put_pspace(pspace);
+ }
+}
+
+extern int copy_pspace(int flags, struct task_struct *p);
+
+static inline void exit_pspace(struct task_struct *tsk)
+{
+ struct pspace *pspace = tsk->pspace;
+ tsk->pspace = NULL;
+ if (pspace->child_reaper.task == tsk)
+ pspace->child_reaper.task = NULL;
+ put_pspace(pspace);
+}
+
+static inline int pspace_leader(struct task_struct *tsk)
+{
+ return tsk == tsk->pspace->child_reaper.task;
+}
+
+static inline int current_pspace_leader(struct task_struct *tsk)
+{
+ return tsk == current->pspace->child_reaper.task;
+}
+
+static inline int in_pspace(struct pspace *pspace, struct task_struct *tsk)
+{
+ struct pspace *test;
+ test = tsk->pspace;
+ while((test != &init_pspace) && (test != pspace))
+ test = test->child_reaper.pspace;
+ return test == pspace;
+}
+
+#endif /* _LINUX_PSPACE_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 86a92d6..1977cb9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -61,6 +61,7 @@ struct exec_domain;
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
#define CLONE_STOPPED 0x02000000 /* Start in stopped state */
+#define CLONE_NPSPACE 0x04000000 /* New process id space */

/*
* List of flags we want to share for kernel threads,
@@ -94,9 +95,6 @@ extern unsigned long avenrun[]; /* Load

extern unsigned long total_forks;
extern int nr_threads;
-extern int last_pid;
-DECLARE_PER_CPU(unsigned long, process_counts);
-extern int nr_processes(void);
extern unsigned long nr_running(void);
extern unsigned long nr_uninterruptible(void);
extern unsigned long nr_iowait(void);
@@ -234,6 +232,7 @@ extern signed long schedule_timeout_unin
asmlinkage void schedule(void);

struct namespace;
+struct pspace;

/* Maximum number of active map areas.. This is a random (large) number */
#define DEFAULT_MAX_MAP_COUNT 65536
@@ -741,6 +740,7 @@ struct task_struct {
unsigned long personality;
unsigned did_exec:1;
pid_t wid;
+ struct pspace *pspace; /* process id namespace */
pid_t pid;
pid_t tgid;
/*
@@ -1037,8 +1037,6 @@ extern struct task_struct init_task;

extern struct mm_struct init_mm;

-#define find_task_by_pid(nr) find_task_by_pid_type(PIDTYPE_PID, nr)
-extern struct task_struct *find_task_by_pid_type(int type, int pid);
extern void set_special_pids(pid_t session, pid_t pgrp);
extern void __set_special_pids(pid_t session, pid_t pgrp);

@@ -1096,17 +1094,19 @@ extern int send_sig_info(int, struct sig
extern int send_group_sig_info(int, struct siginfo *, struct task_struct *);
extern int force_sigsegv(int, struct task_struct *);
extern int force_sig_info(int, struct siginfo *, struct task_struct *);
-extern int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp);
-extern int kill_pg_info(int, struct siginfo *, pid_t);
-extern int kill_proc_info(int, struct siginfo *, pid_t);
-extern int kill_proc_info_as_uid(int, struct siginfo *, pid_t, uid_t, uid_t);
+extern int __kill_pspace_info(int , struct siginfo *, struct pspace *);
+extern int kill_pspace_info(int , struct siginfo *, struct pspace *);
+extern int __kill_pg_info(int sig, struct siginfo *info, struct pspace *, pid_t pgrp);
+extern int kill_pg_info(int, struct siginfo *, struct pspace *, pid_t);
+extern int kill_proc_info(int, struct siginfo *, struct pspace *, pid_t);
+extern int kill_proc_info_as_uid(int, struct siginfo *, struct pspace *, pid_t, uid_t, uid_t);
extern void do_notify_parent(struct task_struct *, int);
extern void force_sig(int, struct task_struct *);
extern void force_sig_specific(int, struct task_struct *);
extern int send_sig(int, struct task_struct *, int);
extern void zap_other_threads(struct task_struct *p);
-extern int kill_pg(pid_t, int, int);
-extern int kill_proc(pid_t, int, int);
+extern int kill_pg(struct pspace *, pid_t, int, int);
+extern int kill_proc(struct pspace *, pid_t, int, int);
extern struct sigqueue *sigqueue_alloc(void);
extern void sigqueue_free(struct sigqueue *);
extern int send_sigqueue(int, struct sigqueue *, struct task_struct *);
@@ -1173,7 +1173,6 @@ extern NORET_TYPE void do_group_exit(int
extern void daemonize(const char *, ...);
extern int allow_signal(int);
extern int disallow_signal(int);
-extern task_t *child_reaper;

extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 3787102..898d593 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -292,7 +292,7 @@ extern int tty_read_raw_data(struct tty_
int buflen);
extern void tty_write_message(struct tty_struct *tty, char *msg);

-extern int is_orphaned_pgrp(int pgrp);
+extern int is_orphaned_pgrp(struct pspace *pspace, int pgrp);
extern int is_ignored(int sig);
extern int tty_signal(int sig, struct tty_struct *tty);
extern void tty_hangup(struct tty_struct * tty);
diff --git a/init/main.c b/init/main.c
index 7c79da5..42dabde 100644
--- a/init/main.c
+++ b/init/main.c
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/pspace.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -555,8 +556,6 @@ static int __init initcall_debug_setup(c
}
__setup("initcall_debug", initcall_debug_setup);

-struct task_struct *child_reaper = &init_task;
-
extern initcall_t __initcall_start[], __initcall_end[];

static void __init do_initcalls(void)
@@ -666,7 +665,7 @@ static int init(void * unused)
* assumptions about where in the task array this
* can be found.
*/
- child_reaper = current;
+ init_pspace.child_reaper.task = current;

/* Sets up cpus_possible() */
smp_prepare_cpus(max_cpus);
diff --git a/kernel/exit.c b/kernel/exit.c
index f1af8bb..a32bbe6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -31,6 +31,7 @@
#include <linux/signal.h>
#include <linux/cn_proc.h>
#include <linux/mutex.h>
+#include <linux/pspace.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -38,7 +39,6 @@
#include <asm/mmu_context.h>

extern void sem_exit (void);
-extern struct task_struct *child_reaper;

int getrusage(struct task_struct *, int, struct rusage __user *);

@@ -46,13 +46,14 @@ static void exit_mm(struct task_struct *

static void __unhash_process(struct task_struct *p)
{
+ p->pspace->nr_threads--;
nr_threads--;
detach_pid(p, PIDTYPE_PID);
detach_pid(p, PIDTYPE_TGID);
if (thread_group_leader(p)) {
detach_pid(p, PIDTYPE_PGID);
detach_pid(p, PIDTYPE_SID);
- __get_cpu_var(process_counts)--;
+ p->pspace->nr_processes--;
}

REMOVE_LINKS(p);
@@ -105,6 +106,7 @@ repeat:
write_unlock_irq(&tasklist_lock);
spin_unlock(&p->proc_lock);
proc_pid_flush(proc_dentry);
+ exit_pspace(p);
release_thread(p);
put_task_struct(p);

@@ -118,19 +120,19 @@ repeat:
* satisfactory pgrp is found. I dunno - gdb doesn't work correctly
* without this...
*/
-int session_of_pgrp(int pgrp)
+int session_of_pgrp(struct pspace *pspace, int pgrp)
{
struct task_struct *p;
int sid = -1;

read_lock(&tasklist_lock);
- do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p) {
if (p->signal->session > 0) {
sid = p->signal->session;
goto out;
}
- } while_each_task_pid(pgrp, PIDTYPE_PGID, p);
- p = find_task_by_pid(pgrp);
+ } while_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p);
+ p = find_task_by_pid(pspace, pgrp);
if (p)
sid = p->signal->session;
out:
@@ -147,12 +149,12 @@ out:
*
* "I ask you, have you ever known what it is to be an orphan?"
*/
-static int will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
+static int will_become_orphaned_pgrp(struct pspace *pspace, int pgrp, task_t *ignored_task)
{
struct task_struct *p;
int ret = 1;

- do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p) {
if (p == ignored_task
|| p->exit_state
|| p->real_parent->pid == 1)
@@ -162,27 +164,27 @@ static int will_become_orphaned_pgrp(int
ret = 0;
break;
}
- } while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+ } while_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p);
return ret; /* (sighing) "Often!" */
}

-int is_orphaned_pgrp(int pgrp)
+int is_orphaned_pgrp(struct pspace *pspace, int pgrp)
{
int retval;

read_lock(&tasklist_lock);
- retval = will_become_orphaned_pgrp(pgrp, NULL);
+ retval = will_become_orphaned_pgrp(pspace, pgrp, NULL);
read_unlock(&tasklist_lock);

return retval;
}

-static int has_stopped_jobs(int pgrp)
+static int has_stopped_jobs(struct pspace *pspace, int pgrp)
{
int retval = 0;
struct task_struct *p;

- do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p) {
if (p->state != TASK_STOPPED)
continue;

@@ -198,7 +200,7 @@ static int has_stopped_jobs(int pgrp)

retval = 1;
break;
- } while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+ } while_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p);
return retval;
}

@@ -221,8 +223,8 @@ static void reparent_to_init(void)
ptrace_unlink(current);
/* Reparent to init */
REMOVE_LINKS(current);
- current->parent = child_reaper;
- current->real_parent = child_reaper;
+ current->parent = current->pspace->child_reaper.task;
+ current->real_parent = current->pspace->child_reaper.task;
SET_LINKS(current);

/* Set the exit signal to SIGCHLD so we signal init on exit */
@@ -524,6 +526,12 @@ static inline void choose_new_parent(tas
* the parent is not a zombie.
*/
BUG_ON(p == reaper || reaper->exit_state >= EXIT_ZOMBIE);
+ /* If the pspaces of our parents differ don't become a zombie
+ * or allow ourselves to be waited on, effectively this means
+ * the process just disappears.
+ */
+ if (p->real_parent->pspace != reaper->pspace)
+ p->exit_signal = -1;
p->real_parent = reaper;
}

@@ -576,11 +584,13 @@ static void reparent_thread(task_t *p, t
*/
if ((process_group(p) != process_group(father)) &&
(p->signal->session == father->signal->session)) {
+ struct pspace *pspace = p->pspace;
int pgrp = process_group(p);

- if (will_become_orphaned_pgrp(pgrp, NULL) && has_stopped_jobs(pgrp)) {
- __kill_pg_info(SIGHUP, SEND_SIG_PRIV, pgrp);
- __kill_pg_info(SIGCONT, SEND_SIG_PRIV, pgrp);
+ if (will_become_orphaned_pgrp(pspace, pgrp, NULL) &&
+ has_stopped_jobs(pspace, pgrp)) {
+ __kill_pg_info(SIGHUP, SEND_SIG_PRIV, pspace, pgrp);
+ __kill_pg_info(SIGCONT, SEND_SIG_PRIV, pspace, pgrp);
}
}
}
@@ -600,7 +610,10 @@ static void forget_original_parent(struc
do {
reaper = next_thread(reaper);
if (reaper == father) {
- reaper = child_reaper;
+ if (!(father->pspace->flags & PSPACE_EXIT))
+ reaper = father->pspace->child_reaper.task;
+ else
+ reaper = init_pspace.child_reaper.task;
break;
}
} while (reaper->exit_state);
@@ -624,7 +637,7 @@ static void forget_original_parent(struc

if (father == p->real_parent) {
/* reparent with a reaper, real father it's us */
- choose_new_parent(p, reaper, child_reaper);
+ choose_new_parent(p, reaper, p->pspace->child_reaper.task);
reparent_thread(p, father, 0);
} else {
/* reparent ptraced task to its real parent */
@@ -645,7 +658,7 @@ static void forget_original_parent(struc
}
list_for_each_safe(_p, _n, &father->ptrace_children) {
p = list_entry(_p,struct task_struct,ptrace_list);
- choose_new_parent(p, reaper, child_reaper);
+ choose_new_parent(p, reaper, p->pspace->child_reaper.task);
reparent_thread(p, father, 1);
}
}
@@ -713,10 +726,10 @@ static void exit_notify(struct task_stru

if ((process_group(t) != process_group(tsk)) &&
(t->signal->session == tsk->signal->session) &&
- will_become_orphaned_pgrp(process_group(tsk), tsk) &&
- has_stopped_jobs(process_group(tsk))) {
- __kill_pg_info(SIGHUP, SEND_SIG_PRIV, process_group(tsk));
- __kill_pg_info(SIGCONT, SEND_SIG_PRIV, process_group(tsk));
+ will_become_orphaned_pgrp(tsk->pspace, process_group(tsk), tsk) &&
+ has_stopped_jobs(tsk->pspace, process_group(tsk))) {
+ __kill_pg_info(SIGHUP, SEND_SIG_PRIV, tsk->pspace, process_group(tsk));
+ __kill_pg_info(SIGCONT, SEND_SIG_PRIV, tsk->pspace, process_group(tsk));
}

/* Let father know we died
@@ -788,6 +801,16 @@ fastcall NORET_TYPE void do_exit(long co
panic("Attempted to kill the idle task!");
if (unlikely(is_init(tsk)))
panic("Attempted to kill init!");
+
+ /*
+ * If we are the pspace leader it is nonsense for the pspace
+ * to continue so kill everyone else in the pspace.
+ */
+ if (pspace_leader(tsk)) {
+ tsk->pspace->flags = PSPACE_EXIT;
+ kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
+ }
+
if (tsk->io_context)
exit_io_context();

@@ -943,10 +966,10 @@ static int eligible_child(pid_t pid, int
if (pid > 0) {
if (p->wid != pid)
return 0;
- } else if (!pid) {
+ } else if (!pid && (current->pspace == p->pspace)) {
if (process_group(p) != process_group(current))
return 0;
- } else if (pid != -1) {
+ } else if ((pid != -1) && (current->pspace == p->pspace)) {
if (process_group(p) != -pid)
return 0;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 743d46c..354d156 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -44,6 +44,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/pspace.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -60,23 +61,10 @@ int nr_threads; /* The idle threads do

int max_threads; /* tunable limit on nr_threads */

-DEFINE_PER_CPU(unsigned long, process_counts) = 0;
-
__cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */

EXPORT_SYMBOL(tasklist_lock);

-int nr_processes(void)
-{
- int cpu;
- int total = 0;
-
- for_each_online_cpu(cpu)
- total += per_cpu(process_counts, cpu);
-
- return total;
-}
-
#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
# define alloc_task_struct() kmem_cache_alloc(task_struct_cachep, GFP_KERNEL)
# define free_task_struct(tsk) kmem_cache_free(task_struct_cachep, (tsk))
@@ -879,6 +867,12 @@ static task_t *copy_process(unsigned lon
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);

+ /* If I'm not sharing process ids I can't share signals or pids.
+ */
+ if ((clone_flags & CLONE_NPSPACE) &&
+ (clone_flags & (CLONE_THREAD|CLONE_SIGHAND)))
+ return ERR_PTR(-EINVAL);
+
/*
* Thread groups must share signals as well, and detached threads
* can only be started up within the thread group.
@@ -894,6 +888,14 @@ static task_t *copy_process(unsigned lon
if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
return ERR_PTR(-EINVAL);

+ /*
+ * Important: If an exit-all has been started then
+ * do not create this new process - the whole pspace
+ * is supposed to exit.
+ */
+ if (current->pspace->flags & PSPACE_EXIT)
+ return ERR_PTR(-EAGAIN);
+
retval = security_task_create(clone_flags);
if (retval)
goto fork_out;
@@ -932,10 +934,13 @@ static task_t *copy_process(unsigned lon
p->did_exec = 0;
copy_flags(clone_flags, p);
p->wid = p->pid = pid;
+ if ((retval = copy_pspace(clone_flags, p)))
+ goto bad_fork_cleanup;
+
retval = -EFAULT;
if (clone_flags & CLONE_PARENT_SETTID)
if (put_user(p->wid, parent_tidptr))
- goto bad_fork_cleanup;
+ goto bad_fork_cleanup_pspace;

p->proc_dentry = NULL;

@@ -1139,13 +1144,20 @@ static task_t *copy_process(unsigned lon
attach_pid(p, PIDTYPE_PID, p->pid);
attach_pid(p, PIDTYPE_TGID, p->tgid);
if (thread_group_leader(p)) {
- p->signal->tty = current->signal->tty;
- p->signal->pgrp = process_group(current);
- p->signal->session = current->signal->session;
+ if (unlikely(clone_flags & CLONE_NPSPACE)) {
+ p->signal->tty = NULL;
+ p->signal->pgrp = 1;
+ p->signal->session = 1;
+ } else {
+ p->signal->tty = current->signal->tty;
+ p->signal->pgrp = process_group(current);
+ p->signal->session = current->signal->session;
+ }
attach_pid(p, PIDTYPE_PGID, process_group(p));
attach_pid(p, PIDTYPE_SID, p->signal->session);
- __get_cpu_var(process_counts)++;
+ p->pspace->nr_processes++;
}
+ p->pspace->nr_threads++;
nr_threads++;
}

@@ -1181,6 +1193,10 @@ bad_fork_cleanup_policy:
bad_fork_cleanup_cpuset:
#endif
cpuset_exit(p);
+bad_fork_cleanup_pspace:
+ if (p->wid != p->pid)
+ free_pidmap(p->pspace, p->pid);
+ exit_pspace(p);
bad_fork_cleanup:
if (p->binfmt)
module_put(p->binfmt->module);
@@ -1248,7 +1264,7 @@ long do_fork(unsigned long clone_flags,
{
struct task_struct *p;
int trace = 0;
- long pid = alloc_pidmap();
+ long pid = alloc_pidmap(current->pspace);

if (pid < 0)
return -EAGAIN;
@@ -1295,7 +1311,7 @@ long do_fork(unsigned long clone_flags,
ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
}
} else {
- free_pidmap(pid);
+ free_pidmap(current->pspace, pid);
pid = PTR_ERR(p);
}
return pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index fbf45bb..221585d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -26,23 +26,21 @@
#include <linux/init.h>
#include <linux/bootmem.h>
#include <linux/hash.h>
+#include <linux/pspace.h>

-#define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
+#define pid_hashfn(pspace, nr) \
+ hash_long(((unsigned long)pspace) ^ ((unsigned long)nr), pidhash_shift)
static struct hlist_head *pid_hash[PIDTYPE_MAX];
static int pidhash_shift;

-int pid_max = PID_MAX_DEFAULT;
-int last_pid;
-
#define RESERVED_PIDS 300

int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;

-#define PIDMAP_ENTRIES ((PID_MAX_LIMIT + 8*PAGE_SIZE - 1)/PAGE_SIZE/8)
#define BITS_PER_PAGE (PAGE_SIZE*8)
#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
-#define mk_pid(map, off) (((map) - pidmap_array)*BITS_PER_PAGE + (off))
+#define mk_pid(map, off) (((map) - pspace->pidmap)*BITS_PER_PAGE + (off))
#define find_next_offset(map, off) \
find_next_zero_bit((map)->page, BITS_PER_PAGE, off)

@@ -52,36 +50,45 @@ int pid_max_max = PID_MAX_LIMIT;
* value does not cause lots of bitmaps to be allocated, but
* the scheme scales to up to 4 million PIDs, runtime.
*/
-typedef struct pidmap {
- atomic_t nr_free;
- void *page;
-} pidmap_t;
-
-static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
- { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
+struct pspace init_pspace = {
+ .count = ATOMIC_INIT(1),
+ .child_reaper = {
+ .pspace = &init_pspace,
+ .task = &init_task,
+ },
+ .nr_threads = 0,
+ .nr_processes = 0,
+ .last_pid = 0,
+ .min = RESERVED_PIDS,
+ .max = PID_MAX_DEFAULT,
+ .pidmap = {
+ [ 0 ... PIDMAP_ENTRIES-1] = { ATOMIC_INIT(BITS_PER_PAGE), NULL }
+ },
+ .name = "",
+};

static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);

-fastcall void free_pidmap(int pid)
+fastcall void free_pidmap(struct pspace *pspace, int pid)
{
- pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+ struct pidmap *map = pspace->pidmap + pid / BITS_PER_PAGE;
int offset = pid & BITS_PER_PAGE_MASK;

clear_bit(offset, map->page);
atomic_inc(&map->nr_free);
}

-int alloc_pidmap(void)
+int alloc_pidmap(struct pspace *pspace)
{
- int i, offset, max_scan, pid, last = last_pid;
- pidmap_t *map;
+ int i, offset, max_scan, pid, last = pspace->last_pid;
+ struct pidmap *map;

pid = last + 1;
- if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ if (pid >= pspace->max)
+ pid = pspace->min;
offset = pid & BITS_PER_PAGE_MASK;
- map = &pidmap_array[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ map = &pspace->pidmap[pid/BITS_PER_PAGE];
+ max_scan = (pspace->max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page)) {
unsigned long page = get_zeroed_page(GFP_KERNEL);
@@ -102,7 +109,7 @@ int alloc_pidmap(void)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- last_pid = pid;
+ pspace->last_pid = pid;
return pid;
}
offset = find_next_offset(map, offset);
@@ -113,16 +120,16 @@ int alloc_pidmap(void)
* bitmap block and the final block was the same
* as the starting point, pid is before last_pid.
*/
- } while (offset < BITS_PER_PAGE && pid < pid_max &&
+ } while (offset < BITS_PER_PAGE && pid < pspace->max &&
(i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
- if (map < &pidmap_array[(pid_max-1)/BITS_PER_PAGE]) {
+ if (map < &pspace->pidmap[(pspace->max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
- map = &pidmap_array[0];
- offset = RESERVED_PIDS;
+ map = &pspace->pidmap[0];
+ offset = pspace->min;
if (unlikely(last == offset))
break;
}
@@ -131,30 +138,31 @@ int alloc_pidmap(void)
return -1;
}

-struct pid * fastcall find_pid(enum pid_type type, int nr)
+struct pid * fastcall find_pid(enum pid_type type, struct pspace *pspace, int nr)
{
struct hlist_node *elem;
struct pid *pid;

hlist_for_each_entry_rcu(pid, elem,
- &pid_hash[type][pid_hashfn(nr)], pid_chain) {
- if (pid->nr == nr)
+ &pid_hash[type][pid_hashfn(pspace, nr)], pid_chain) {
+ if ((pid->nr == nr) && (pid->pspace == pspace))
return pid;
}
return NULL;
}

-int fastcall attach_pid(task_t *task, enum pid_type type, int nr)
+static int fastcall attach_any_pid(struct pid *task_pid, task_t *task, enum pid_type type, struct pspace *pspace, int nr)
{
- struct pid *pid, *task_pid;
+ struct pid *pid;

- task_pid = &task->pids[type];
- pid = find_pid(type, nr);
+ pid = find_pid(type, pspace, nr);
task_pid->nr = nr;
+ task_pid->pspace = pspace;
+ task_pid->task = task;
if (pid == NULL) {
INIT_LIST_HEAD(&task_pid->pid_list);
hlist_add_head_rcu(&task_pid->pid_chain,
- &pid_hash[type][pid_hashfn(nr)]);
+ &pid_hash[type][pid_hashfn(pspace, nr)]);
} else {
INIT_HLIST_NODE(&task_pid->pid_chain);
list_add_tail_rcu(&task_pid->pid_list, &pid->pid_list);
@@ -163,12 +171,16 @@ int fastcall attach_pid(task_t *task, en
return 0;
}

-static fastcall int __detach_pid(task_t *task, enum pid_type type)
+int fastcall attach_pid(task_t *task, enum pid_type type, int nr)
+{
+ return attach_any_pid(&task->pids[type], task, type, task->pspace, nr);
+}
+
+static fastcall int __detach_pid(struct pid *pid, enum pid_type type)
{
- struct pid *pid, *pid_next;
+ struct pid *pid_next;
int nr = 0;

- pid = &task->pids[type];
if (!hlist_unhashed(&pid->pid_chain)) {

if (list_empty(&pid->pid_list)) {
@@ -189,34 +201,118 @@ static fastcall int __detach_pid(task_t
return nr;
}

-void fastcall detach_pid(task_t *task, enum pid_type type)
+static void fastcall detach_any_pid(struct pid *pid, enum pid_type type)
{
int tmp, nr;

- nr = __detach_pid(task, type);
+ nr = __detach_pid(pid, type);
if (!nr)
return;

for (tmp = PIDTYPE_MAX; --tmp >= 0; )
- if (tmp != type && find_pid(tmp, nr))
+ if (tmp != type && find_pid(tmp, pid->pspace, nr))
return;

- free_pidmap(nr);
+ free_pidmap(pid->pspace, nr);
}

-task_t *find_task_by_pid_type(int type, int nr)
+void fastcall detach_pid(task_t *task, enum pid_type type)
+{
+ return detach_any_pid(&task->pids[type], type);
+}
+
+task_t *find_task_by_pid_type(int type, struct pspace *pspace, int nr)
{
struct pid *pid;

- pid = find_pid(type, nr);
+ pid = find_pid(type, pspace, nr);
if (!pid)
return NULL;

- return pid_task(&pid->pid_list, type);
+ return pid->task;
}

EXPORT_SYMBOL(find_task_by_pid_type);

+static struct pspace *new_pspace(struct task_struct *leader)
+{
+ struct pspace *pspace, *parent;
+ int i;
+ size_t len;
+ parent = leader->pspace;
+ len = strlen(parent->name) + 10;
+ pspace = kzalloc(sizeof(struct pspace) + len, GFP_KERNEL);
+ if (!pspace)
+ return NULL;
+ atomic_set(&pspace->count, 1);
+ pspace->flags = 0;
+ pspace->nr_threads = 0;
+ pspace->nr_processes = 0;
+ pspace->last_pid = 0;
+ pspace->min = RESERVED_PIDS;
+ pspace->max = PID_MAX_DEFAULT;
+ for (i = 0; i < PIDMAP_ENTRIES; i++) {
+ atomic_set(&pspace->pidmap[i].nr_free, BITS_PER_PAGE);
+ pspace->pidmap[i].page = NULL;
+ }
+ attach_any_pid(&pspace->child_reaper, leader, PIDTYPE_PID,
+ parent, leader->wid);
+ leader->pspace->nr_processes++;
+ snprintf(pspace->name, len + 1, "%s/%d", parent->name, leader->wid);
+
+ return pspace;
+}
+
+int copy_pspace(int flags, struct task_struct *p)
+{
+ struct pspace *new;
+ int pid;
+ get_pspace(p->pspace);
+
+ if (!(flags & CLONE_NPSPACE))
+ return 0;
+
+ if (!capable(CAP_KILL))
+ return -EPERM;
+
+ /* Allocate the new pidspace structure */
+ new = new_pspace(p);
+ if (!new) {
+ put_pspace(p->pspace);
+ return -ENOMEM;
+ }
+
+ /* Allocate the new pid */
+ pid = alloc_pidmap(new);
+
+ /* Setup the new pspace and pid */
+ p->pspace = new;
+ p->pid = pid;
+
+ return 0;
+}
+
+void __put_pspace(struct pspace *pspace)
+{
+ struct pspace *parent;
+ struct pidmap *map;
+ int i;
+
+ BUG_ON(atomic_read(&pspace->count) != 0);
+ /* notifier? */
+
+ pspace->child_reaper.pspace->nr_processes--;
+ detach_any_pid(&pspace->child_reaper, PIDTYPE_PID);
+ parent = pspace->child_reaper.pspace;
+ map = pspace->pidmap;
+ for (i = 0; i < PIDMAP_ENTRIES; i++) {
+ BUG_ON(atomic_read(&map[i].nr_free) != BITS_PER_PAGE);
+ free_page((unsigned long)map[i].page);
+ }
+ kfree(pspace);
+ put_pspace(parent);
+}
+
/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
@@ -247,8 +343,8 @@ void __init pidhash_init(void)

void __init pidmap_init(void)
{
- pidmap_array->page = (void *)get_zeroed_page(GFP_KERNEL);
+ init_pspace.pidmap->page = (void *)get_zeroed_page(GFP_KERNEL);
/* Reserve PID 0 */
- set_bit(0, pidmap_array->page);
- atomic_dec(&pidmap_array->nr_free);
+ set_bit(0, init_pspace.pidmap->page);
+ atomic_dec(&init_pspace.pidmap->nr_free);
}
diff --git a/kernel/signal.c b/kernel/signal.c
index 70c226c..8a7fa8e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -26,6 +26,7 @@
#include <linux/signal.h>
#include <linux/audit.h>
#include <linux/capability.h>
+#include <linux/pspace.h>
#include <asm/param.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -680,6 +681,8 @@ static int check_kill_permission(int sig
if (!valid_signal(sig))
return error;
error = -EPERM;
+ if (current_pspace_leader(t) && sig_kernel_only(sig))
+ return error;
if ((info == SEND_SIG_NOINFO || (!is_si_special(info) && SI_FROMUSER(info)))
&& ((sig != SIGCONT) ||
(current->signal->session != t->signal->session))
@@ -1147,11 +1150,55 @@ retry:
}

/*
+ * kill_pspace_info() sends a signal to all processes in a process space.
+ * This is what kill(-1, sig) does.
+ */
+
+int __kill_pspace_info(int sig, struct siginfo *info, struct pspace *pspace)
+{
+ struct task_struct *p = NULL;
+ int retval = 0, count = 0;
+
+ for_each_process(p) {
+ int err;
+ /* Skip the current pspace leader */
+ if (current_pspace_leader(p))
+ continue;
+
+ /* Skip the sender of the signal */
+ if (p->signal == current->signal)
+ continue;
+
+ /* Skip processes outside the target process space */
+ if (!in_pspace(pspace, p))
+ continue;
+
+ /* Finally it is a good process, so send the signal. */
+ err = group_send_sig_info(sig, info, p);
+ ++count;
+ if (err != -EPERM)
+ retval = err;
+ }
+ return count ? retval : -ESRCH;
+}
+
+int kill_pspace_info(int sig, struct siginfo *info, struct pspace *pspace)
+{
+ int retval;
+
+ read_lock(&tasklist_lock);
+ retval = __kill_pspace_info(sig, info, pspace);
+ read_unlock(&tasklist_lock);
+
+ return retval;
+}
+
+/*
* kill_pg_info() sends a signal to a process group: this is what the tty
* control characters do (^C, ^Z etc)
*/

-int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
+int __kill_pg_info(int sig, struct siginfo *info, struct pspace *pspace, pid_t pgrp)
{
struct task_struct *p = NULL;
int retval, success;
@@ -1161,28 +1208,28 @@ int __kill_pg_info(int sig, struct sigin

success = 0;
retval = -ESRCH;
- do_each_task_pid(pgrp, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p) {
int err = group_send_sig_info(sig, info, p);
success |= !err;
retval = err;
- } while_each_task_pid(pgrp, PIDTYPE_PGID, p);
+ } while_each_task_pid(pspace, pgrp, PIDTYPE_PGID, p);
return success ? 0 : retval;
}

int
-kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
+kill_pg_info(int sig, struct siginfo *info, struct pspace *pspace, pid_t pgrp)
{
int retval;

read_lock(&tasklist_lock);
- retval = __kill_pg_info(sig, info, pgrp);
+ retval = __kill_pg_info(sig, info, pspace, pgrp);
read_unlock(&tasklist_lock);

return retval;
}

int
-kill_proc_info(int sig, struct siginfo *info, pid_t pid)
+kill_proc_info(int sig, struct siginfo *info, struct pspace *pspace, pid_t pid)
{
int error;
int acquired_tasklist_lock = 0;
@@ -1193,7 +1240,7 @@ kill_proc_info(int sig, struct siginfo *
read_lock(&tasklist_lock);
acquired_tasklist_lock = 1;
}
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(pspace, pid);
error = -ESRCH;
if (p)
error = group_send_sig_info(sig, info, p);
@@ -1204,8 +1251,9 @@ kill_proc_info(int sig, struct siginfo *
}

/* like kill_proc_info(), but doesn't use uid/euid of "current" */
-int kill_proc_info_as_uid(int sig, struct siginfo *info, pid_t pid,
- uid_t uid, uid_t euid)
+int kill_proc_info_as_uid(int sig, struct siginfo *info,
+ struct pspace *pspace, pid_t pid,
+ uid_t uid, uid_t euid)
{
int ret = -EINVAL;
struct task_struct *p;
@@ -1214,7 +1262,7 @@ int kill_proc_info_as_uid(int sig, struc
return ret;

read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(pspace, pid);
if (!p) {
ret = -ESRCH;
goto out_unlock;
@@ -1242,31 +1290,28 @@ EXPORT_SYMBOL_GPL(kill_proc_info_as_uid)
*
* POSIX specifies that kill(-1,sig) is unspecified, but what we have
* is probably wrong. Should make it like BSD or SYSV.
+ *
+ * This has been extended to treat all process spaces just like kill(-1, sig)
*/

static int kill_something_info(int sig, struct siginfo *info, int pid)
{
if (!pid) {
- return kill_pg_info(sig, info, process_group(current));
- } else if (pid == -1) {
- int retval = 0, count = 0;
- struct task_struct * p;
+ return kill_pg_info(sig, info, current->pspace, process_group(current));
+ } else if (pid < 0) {
+ struct task_struct *p;
+ int retval;

read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->pid > 1 && p->tgid != current->tgid) {
- int err = group_send_sig_info(sig, info, p);
- ++count;
- if (err != -EPERM)
- retval = err;
- }
- }
+ p = find_task_by_pid(current->pspace, -pid);
+ if (!p || !pspace_leader(p))
+ retval = __kill_pg_info(sig, info, current->pspace, -pid);
+ else
+ retval = __kill_pspace_info(sig, info, p->pspace);
read_unlock(&tasklist_lock);
- return count ? retval : -ESRCH;
- } else if (pid < 0) {
- return kill_pg_info(sig, info, -pid);
+ return retval;
} else {
- return kill_proc_info(sig, info, pid);
+ return kill_proc_info(sig, info, current->pspace, pid);
}
}

@@ -1354,15 +1399,15 @@ force_sigsegv(int sig, struct task_struc
}

int
-kill_pg(pid_t pgrp, int sig, int priv)
+kill_pg(struct pspace *pspace, pid_t pgrp, int sig, int priv)
{
- return kill_pg_info(sig, __si_special(priv), pgrp);
+ return kill_pg_info(sig, __si_special(priv), pspace, pgrp);
}

int
-kill_proc(pid_t pid, int sig, int priv)
+kill_proc(struct pspace *pspace, pid_t pid, int sig, int priv)
{
- return kill_proc_info(sig, __si_special(priv), pid);
+ return kill_proc_info(sig, __si_special(priv), pspace, pid);
}

/*
@@ -1987,8 +2032,11 @@ relock:
if (sig_kernel_ignore(signr)) /* Default is nothing. */
continue;

- /* Init gets no signals it doesn't want. */
- if (current == child_reaper)
+ /* Init gets no signals it doesn't want.
+ * Other pspace leaders can't ignore SIGKILL and SIGSTOP
+ */
+ if (pspace_leader(current) &&
+ (is_init(current) || !sig_kernel_only(signr)))
continue;

if (sig_kernel_stop(signr)) {
@@ -2007,7 +2055,7 @@ relock:

/* signals can be posted during this window */

- if (is_orphaned_pgrp(process_group(current)))
+ if (is_orphaned_pgrp(current->pspace, process_group(current)))
goto relock;

spin_lock_irq(&current->sighand->siglock);
@@ -2360,7 +2408,7 @@ static int do_tkill(int tgid, int pid, i
info.si_uid = current->uid;

read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);
if (p && (tgid <= 0 || p->tgid == tgid)) {
error = check_kill_permission(sig, &info, p);
/*
@@ -2426,7 +2474,7 @@ sys_rt_sigqueueinfo(int pid, int sig, si
info.si_signo = sig;

/* POSIX.1b doesn't mention process groups. */
- return kill_proc_info(sig, &info, pid);
+ return kill_proc_info(sig, &info, current->pspace, pid);
}

int
diff --git a/kernel/sys.c b/kernel/sys.c
index 0929c69..dc8cb58 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -30,6 +30,7 @@
#include <linux/tty.h>
#include <linux/signal.h>
#include <linux/cn_proc.h>
+#include <linux/pspace.h>

#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -1114,7 +1115,7 @@ asmlinkage long sys_setpgid(pid_t pid, p
write_lock_irq(&tasklist_lock);

err = -ESRCH;
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);
if (!p)
goto out;

@@ -1140,12 +1141,13 @@ asmlinkage long sys_setpgid(pid_t pid, p
goto out;

if (pgid != pid) {
+ struct pspace *pspace = current->pspace;
struct task_struct *p;

- do_each_task_pid(pgid, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, pgid, PIDTYPE_PGID, p) {
if (p->signal->session == group_leader->signal->session)
goto ok_pgid;
- } while_each_task_pid(pgid, PIDTYPE_PGID, p);
+ } while_each_task_pid(pspace, pgid, PIDTYPE_PGID, p);
goto out;
}

@@ -1176,7 +1178,7 @@ asmlinkage long sys_getpgid(pid_t pid)
struct task_struct *p;

read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);

retval = -ESRCH;
if (p) {
@@ -1208,7 +1210,7 @@ asmlinkage long sys_getsid(pid_t pid)
struct task_struct *p;

read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);

retval = -ESRCH;
if(p) {
@@ -1230,7 +1232,7 @@ asmlinkage long sys_setsid(void)
down(&tty_sem);
write_lock_irq(&tasklist_lock);

- pid = find_pid(PIDTYPE_PGID, group_leader->pid);
+ pid = find_pid(PIDTYPE_PGID, current->pspace, group_leader->pid);
if (pid)
goto out;

--
1.1.5.g3480

2006-02-06 19:39:12

by Eric W. Biederman

Subject: [RFC][PATCH 05/20] sched: Fixup the scheduler syscalls to deal with pspaces.


The scheduler is controlled by pid, so fix up the guts of its system
calls to pass the appropriate pspace when looking up tasks by pid.
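
To make the lookup rule concrete, here is a minimal userspace sketch, not
kernel code. The structs, the task table and the extra "current" parameter
are simplified stand-ins assumed for illustration; only the shape of
find_process_by_pid() follows the patch below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, not the real ones. */
struct pspace { const char *name; };

struct task {
	struct pspace *pspace;	/* namespace the task lives in */
	int pid;		/* pid as seen inside that namespace */
};

/* Toy task table: pid 42 exists in both namespaces. */
static struct pspace ns_a = { "a" }, ns_b = { "b" };
static struct task tasks[] = {
	{ &ns_a, 42 },
	{ &ns_b, 42 },
	{ &ns_b, 7 },
};

/* A pid only identifies a task together with its pspace. */
static struct task *find_task_by_pid(struct pspace *pspace, int pid)
{
	size_t i;

	assert(pspace != NULL);
	for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
		if (tasks[i].pspace == pspace && tasks[i].pid == pid)
			return &tasks[i];
	return NULL;
}

/* Same shape as find_process_by_pid() in the patch: pid 0 means the caller. */
static struct task *find_process_by_pid(struct task *current, int pid)
{
	return pid ? find_task_by_pid(current->pspace, pid) : current;
}
```

In this model a caller in namespace "a" asking for pid 7 gets NULL, even
though a task with pid 7 exists in namespace "b".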

Signed-off-by: Eric W. Biederman <[email protected]>


---

kernel/sched.c | 2 +-
kernel/sys.c | 14 ++++++++------
2 files changed, 9 insertions(+), 7 deletions(-)

e37379b11b63503c1146cc17cec75c358154a1ec
diff --git a/kernel/sched.c b/kernel/sched.c
index 6579d49..449cc6e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3701,7 +3701,7 @@ task_t *idle_task(int cpu)
*/
static inline task_t *find_process_by_pid(pid_t pid)
{
- return pid ? find_task_by_pid(pid) : current;
+ return pid ? find_task_by_pid(current->pspace, pid) : current;
}

/* Actually do priority change: must hold rq lock. */
diff --git a/kernel/sys.c b/kernel/sys.c
index dc8cb58..bd594c3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -264,6 +264,7 @@ out:

asmlinkage long sys_setpriority(int which, int who, int niceval)
{
+ struct pspace *pspace = current->pspace;
struct task_struct *g, *p;
struct user_struct *user;
int error = -EINVAL;
@@ -283,16 +284,16 @@ asmlinkage long sys_setpriority(int whic
case PRIO_PROCESS:
if (!who)
who = current->pid;
- p = find_task_by_pid(who);
+ p = find_task_by_pid(pspace, who);
if (p)
error = set_one_prio(p, niceval, error);
break;
case PRIO_PGRP:
if (!who)
who = process_group(current);
- do_each_task_pid(who, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, who, PIDTYPE_PGID, p) {
error = set_one_prio(p, niceval, error);
- } while_each_task_pid(who, PIDTYPE_PGID, p);
+ } while_each_task_pid(pspace, who, PIDTYPE_PGID, p);
break;
case PRIO_USER:
user = current->user;
@@ -324,6 +325,7 @@ out:
*/
asmlinkage long sys_getpriority(int which, int who)
{
+ struct pspace *pspace = current->pspace;
struct task_struct *g, *p;
struct user_struct *user;
long niceval, retval = -ESRCH;
@@ -336,7 +338,7 @@ asmlinkage long sys_getpriority(int whic
case PRIO_PROCESS:
if (!who)
who = current->pid;
- p = find_task_by_pid(who);
+ p = find_task_by_pid(pspace, who);
if (p) {
niceval = 20 - task_nice(p);
if (niceval > retval)
@@ -346,11 +348,11 @@ asmlinkage long sys_getpriority(int whic
case PRIO_PGRP:
if (!who)
who = process_group(current);
- do_each_task_pid(who, PIDTYPE_PGID, p) {
+ do_each_task_pid(pspace, who, PIDTYPE_PGID, p) {
niceval = 20 - task_nice(p);
if (niceval > retval)
retval = niceval;
- } while_each_task_pid(who, PIDTYPE_PGID, p);
+ } while_each_task_pid(pspace, who, PIDTYPE_PGID, p);
break;
case PRIO_USER:
user = current->user;
--
1.1.5.g3480

2006-02-06 19:41:36

by Eric W. Biederman

Subject: [RFC][PATCH 06/20] cad_pid: Fixup the cad_pid users to assume it is in the initial process id namespace.


Having the kernel memorize pids of running processes is ugly, especially if
those processes are allowed to exit. To avoid introducing lifetime issues
or any other weird ugliness, we simply insist that cad_pid is always
interpreted with respect to the initial process id namespace.
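
The effect of that rule can be sketched in a small userspace model. This is
an illustration under assumptions, not the kernel implementation: the
structs, the sigint_count field and the helpers kill_proc_sigint() and
ctrl_alt_del_signal() are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct pspace { int id; };

struct task {
	struct pspace *pspace;
	int pid;
	int sigint_count;	/* how many SIGINTs this toy task has received */
};

static struct pspace init_pspace = { 0 }, child_pspace = { 1 };
static struct task tasks[] = {
	{ &init_pspace, 1, 0 },		/* the real init */
	{ &child_pspace, 1, 0 },	/* a namespace leader that is also pid 1 */
};
static int cad_pid = 1;

static struct task *find_task_by_pid(struct pspace *pspace, int pid)
{
	size_t i;

	assert(pspace != NULL);
	for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
		if (tasks[i].pspace == pspace && tasks[i].pid == pid)
			return &tasks[i];
	return NULL;
}

/* Toy kill_proc(): 0 on delivery, -1 if no such task (stands in for -ESRCH). */
static int kill_proc_sigint(struct pspace *pspace, int pid)
{
	struct task *p = find_task_by_pid(pspace, pid);

	if (!p)
		return -1;
	p->sigint_count++;
	return 0;
}

/* cad_pid is always resolved against init_pspace, whoever triggers it. */
static int ctrl_alt_del_signal(void)
{
	return kill_proc_sigint(&init_pspace, cad_pid);
}
```

Because the namespace is fixed at the call site, the pid 1 of a child
namespace can never be mistaken for the process that should receive the
ctrl-alt-del signal.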

Signed-off-by: Eric W. Biederman <[email protected]>


---

drivers/parisc/power.c | 3 ++-
kernel/sys.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)

f343da178ee420b5b35a7fa43cc7995f47024004
diff --git a/drivers/parisc/power.c b/drivers/parisc/power.c
index 54b2b7f..fbf14d9 100644
--- a/drivers/parisc/power.c
+++ b/drivers/parisc/power.c
@@ -45,6 +45,7 @@
#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/workqueue.h>
+#include <linux/pspace.h>

#include <asm/pdc.h>
#include <asm/io.h>
@@ -86,7 +87,7 @@
static void deferred_poweroff(void *dummy)
{
extern int cad_pid; /* from kernel/sys.c */
- if (kill_proc(cad_pid, SIGINT, 1)) {
+ if (kill_proc(&init_pspace, cad_pid, SIGINT, 1)) {
/* just in case killing init process failed */
machine_power_off();
}
diff --git a/kernel/sys.c b/kernel/sys.c
index bd594c3..d9fcc0e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -581,7 +581,7 @@ void ctrl_alt_del(void)
if (C_A_D)
schedule_work(&cad_work);
else
- kill_proc(cad_pid, SIGINT, 1);
+ kill_proc(&init_pspace, cad_pid, SIGINT, 1);
}


--
1.1.5.g3480

2006-02-06 19:44:17

by Eric W. Biederman

Subject: [RFC][PATCH 07/20] tty: Update the tty layer to work with pspaces.


Of the kernel subsystems that work with pids, the tty layer
is probably the largest consumer. But it has the nice
virtue that the association with a session only lasts until
the session leader exits, which means that no reference
counting is required: only caching of the pspace
and a few extra comparisons.
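
The caching and the extra comparisons can be sketched in user space roughly
as follows. This is a simplified model under stated assumptions: the struct
fields and the helper names (tty_attach, tty_detach, in_foreground,
tty_get_pgrp) are invented for the example and only mirror the shape of the
changes to tty_io.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct pspace { int id; };
struct task { struct pspace *pspace; int pgrp; int session; };
struct tty { struct pspace *pspace; int pgrp; int session; };

/* Cache the session leader's pspace next to session and pgrp.
 * No reference is taken: the association ends when the leader exits. */
static void tty_attach(struct tty *tty, const struct task *leader)
{
	assert(tty != NULL && leader != NULL);
	tty->pspace = leader->pspace;
	tty->session = leader->session;
	tty->pgrp = leader->pgrp;
}

/* Drop the cached pspace together with session and pgrp. */
static void tty_detach(struct tty *tty)
{
	assert(tty != NULL);
	tty->pspace = NULL;
	tty->session = 0;
	tty->pgrp = -1;
}

/* A pgrp number only identifies the foreground group within one pspace. */
static int in_foreground(const struct task *t, const struct tty *tty)
{
	return t->pgrp == tty->pgrp && t->pspace == tty->pspace;
}

/* Callers in another pspace get -1 instead of a pgrp they cannot interpret. */
static int tty_get_pgrp(const struct tty *tty, const struct task *caller)
{
	return tty->pspace == caller->pspace ? tty->pgrp : -1;
}
```

Two tasks with the same pgrp number but different pspaces are therefore
treated as belonging to different process groups.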

Signed-off-by: Eric W. Biederman <[email protected]>


---

drivers/char/n_tty.c | 9 ++++---
drivers/char/tty_io.c | 62 ++++++++++++++++++++++++++++++-------------------
include/linux/tty.h | 1 +
3 files changed, 44 insertions(+), 28 deletions(-)

5de29104be4f33d4dca8246e13d2996c1ce745b3
diff --git a/drivers/char/n_tty.c b/drivers/char/n_tty.c
index ccad7ae..5edfd13 100644
--- a/drivers/char/n_tty.c
+++ b/drivers/char/n_tty.c
@@ -580,7 +580,7 @@ static void eraser(unsigned char c, stru
static inline void isig(int sig, struct tty_struct *tty, int flush)
{
if (tty->pgrp > 0)
- kill_pg(tty->pgrp, sig, 1);
+ kill_pg(tty->pspace, tty->pgrp, sig, 1);
if (flush || !L_NOFLSH(tty)) {
n_tty_flush_buffer(tty);
if (tty->driver->flush_buffer)
@@ -1187,11 +1187,12 @@ static int job_control(struct tty_struct
current->signal->tty == tty) {
if (tty->pgrp <= 0)
printk("read_chan: tty->pgrp <= 0!\n");
- else if (process_group(current) != tty->pgrp) {
+ else if ((process_group(current) != tty->pgrp) ||
+ (current->pspace != tty->pspace)) {
if (is_ignored(SIGTTIN) ||
- is_orphaned_pgrp(process_group(current)))
+ is_orphaned_pgrp(current->pspace, process_group(current)))
return -EIO;
- kill_pg(process_group(current), SIGTTIN, 1);
+ kill_pg(current->pspace, process_group(current), SIGTTIN, 1);
return -ERESTARTSYS;
}
}
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 1d83cd5..b6f3004 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -95,6 +95,7 @@
#include <linux/wait.h>
#include <linux/bitops.h>
#include <linux/delay.h>
+#include <linux/pspace.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -858,13 +859,14 @@ int tty_check_change(struct tty_struct *
printk(KERN_WARNING "tty_check_change: tty->pgrp <= 0!\n");
return 0;
}
- if (process_group(current) == tty->pgrp)
+ if ((process_group(current) == tty->pgrp) &&
+ (current->pspace == tty->pspace))
return 0;
if (is_ignored(SIGTTOU))
return 0;
- if (is_orphaned_pgrp(process_group(current)))
+ if (is_orphaned_pgrp(current->pspace, process_group(current)))
return -EIO;
- (void) kill_pg(process_group(current), SIGTTOU, 1);
+ (void) kill_pg(current->pspace, process_group(current), SIGTTOU, 1);
return -ERESTARTSYS;
}

@@ -1070,7 +1072,7 @@ static void do_tty_hangup(void *data)

read_lock(&tasklist_lock);
if (tty->session > 0) {
- do_each_task_pid(tty->session, PIDTYPE_SID, p) {
+ do_each_task_pid(tty->pspace, tty->session, PIDTYPE_SID, p) {
if (p->signal->tty == tty)
p->signal->tty = NULL;
if (!p->signal->leader)
@@ -1079,11 +1081,12 @@ static void do_tty_hangup(void *data)
group_send_sig_info(SIGCONT, SEND_SIG_PRIV, p);
if (tty->pgrp > 0)
p->signal->tty_old_pgrp = tty->pgrp;
- } while_each_task_pid(tty->session, PIDTYPE_SID, p);
+ } while_each_task_pid(tty->pspace, tty->session, PIDTYPE_SID, p);
}
read_unlock(&tasklist_lock);

tty->flags = 0;
+ tty->pspace = NULL;
tty->session = 0;
tty->pgrp = -1;
tty->ctrl_status = 0;
@@ -1162,6 +1165,7 @@ void disassociate_ctty(int on_exit)
{
struct tty_struct *tty;
struct task_struct *p;
+ struct pspace *tty_pspace = NULL;
int tty_pgrp = -1;

lock_kernel();
@@ -1170,35 +1174,37 @@ void disassociate_ctty(int on_exit)
tty = current->signal->tty;
if (tty) {
tty_pgrp = tty->pgrp;
+ tty_pspace = tty->pspace;
up(&tty_sem);
if (on_exit && tty->driver->type != TTY_DRIVER_TYPE_PTY)
tty_vhangup(tty);
} else {
if (current->signal->tty_old_pgrp) {
- kill_pg(current->signal->tty_old_pgrp, SIGHUP, on_exit);
- kill_pg(current->signal->tty_old_pgrp, SIGCONT, on_exit);
+ kill_pg(current->pspace, current->signal->tty_old_pgrp, SIGHUP, on_exit);
+ kill_pg(current->pspace, current->signal->tty_old_pgrp, SIGCONT, on_exit);
}
up(&tty_sem);
unlock_kernel();
return;
}
if (tty_pgrp > 0) {
- kill_pg(tty_pgrp, SIGHUP, on_exit);
+ kill_pg(tty_pspace, tty_pgrp, SIGHUP, on_exit);
if (!on_exit)
- kill_pg(tty_pgrp, SIGCONT, on_exit);
+ kill_pg(tty_pspace, tty_pgrp, SIGCONT, on_exit);
}

/* Must lock changes to tty_old_pgrp */
down(&tty_sem);
current->signal->tty_old_pgrp = 0;
+ tty->pspace = NULL;
tty->session = 0;
tty->pgrp = -1;

/* Now clear signal->tty under the lock */
read_lock(&tasklist_lock);
- do_each_task_pid(current->signal->session, PIDTYPE_SID, p) {
+ do_each_task_pid(current->pspace, current->signal->session, PIDTYPE_SID, p) {
p->signal->tty = NULL;
- } while_each_task_pid(current->signal->session, PIDTYPE_SID, p);
+ } while_each_task_pid(current->pspace, current->signal->session, PIDTYPE_SID, p);
read_unlock(&tasklist_lock);
up(&tty_sem);
unlock_kernel();
@@ -1905,13 +1911,13 @@ static void release_dev(struct file * fi
struct task_struct *p;

read_lock(&tasklist_lock);
- do_each_task_pid(tty->session, PIDTYPE_SID, p) {
+ do_each_task_pid(tty->pspace, tty->session, PIDTYPE_SID, p) {
p->signal->tty = NULL;
- } while_each_task_pid(tty->session, PIDTYPE_SID, p);
+ } while_each_task_pid(tty->pspace, tty->session, PIDTYPE_SID, p);
if (o_tty)
- do_each_task_pid(o_tty->session, PIDTYPE_SID, p) {
+ do_each_task_pid(o_tty->pspace, o_tty->session, PIDTYPE_SID, p) {
p->signal->tty = NULL;
- } while_each_task_pid(o_tty->session, PIDTYPE_SID, p);
+ } while_each_task_pid(o_tty->pspace, o_tty->session, PIDTYPE_SID, p);
read_unlock(&tasklist_lock);
}

@@ -2110,6 +2116,7 @@ got_driver:
current->signal->tty = tty;
task_unlock(current);
current->signal->tty_old_pgrp = 0;
+ tty->pspace = current->pspace;
tty->session = current->signal->session;
tty->pgrp = process_group(current);
}
@@ -2270,9 +2277,9 @@ static int tiocswinsz(struct tty_struct
}
#endif
if (tty->pgrp > 0)
- kill_pg(tty->pgrp, SIGWINCH, 1);
+ kill_pg(tty->pspace, tty->pgrp, SIGWINCH, 1);
if ((real_tty->pgrp != tty->pgrp) && (real_tty->pgrp > 0))
- kill_pg(real_tty->pgrp, SIGWINCH, 1);
+ kill_pg(real_tty->pspace, real_tty->pgrp, SIGWINCH, 1);
tty->winsize = tmp_ws;
real_tty->winsize = tmp_ws;
return 0;
@@ -2342,9 +2349,9 @@ static int tiocsctty(struct tty_struct *
*/

read_lock(&tasklist_lock);
- do_each_task_pid(tty->session, PIDTYPE_SID, p) {
+ do_each_task_pid(tty->pspace, tty->session, PIDTYPE_SID, p) {
p->signal->tty = NULL;
- } while_each_task_pid(tty->session, PIDTYPE_SID, p);
+ } while_each_task_pid(tty->pspace, tty->session, PIDTYPE_SID, p);
read_unlock(&tasklist_lock);
} else
return -EPERM;
@@ -2353,6 +2360,7 @@ static int tiocsctty(struct tty_struct *
current->signal->tty = tty;
task_unlock(current);
current->signal->tty_old_pgrp = 0;
+ tty->pspace = current->pspace;
tty->session = current->signal->session;
tty->pgrp = process_group(current);
return 0;
@@ -2360,17 +2368,22 @@ static int tiocsctty(struct tty_struct *

static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
+ pid_t pgrp;
/*
* (tty == real_tty) is a cheap way of
* testing if the tty is NOT a master pty.
*/
if (tty == real_tty && current->signal->tty != real_tty)
return -ENOTTY;
- return put_user(real_tty->pgrp, p);
+ pgrp = real_tty->pgrp;
+ if (real_tty->pspace != current->pspace)
+ pgrp = -1;
+ return put_user(pgrp, p);
}

static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
+ struct pspace *pspace = current->pspace;
pid_t pgrp;
int retval = tty_check_change(real_tty);

@@ -2380,13 +2393,14 @@ static int tiocspgrp(struct tty_struct *
return retval;
if (!current->signal->tty ||
(current->signal->tty != real_tty) ||
- (real_tty->session != current->signal->session))
+ (real_tty->session != current->signal->session) ||
+ (real_tty->pspace != pspace))
return -ENOTTY;
if (get_user(pgrp, p))
return -EFAULT;
if (pgrp < 0)
return -EINVAL;
- if (session_of_pgrp(pgrp) != current->signal->session)
+ if (session_of_pgrp(pspace, pgrp) != current->signal->session)
return -EPERM;
real_tty->pgrp = pgrp;
return 0;
@@ -2676,12 +2690,12 @@ static void __do_SAK(void *arg)

read_lock(&tasklist_lock);
/* Kill the entire session */
- do_each_task_pid(session, PIDTYPE_SID, p) {
+ do_each_task_pid(tty->pspace, session, PIDTYPE_SID, p) {
printk(KERN_NOTICE "SAK: killed process %d"
" (%s): p->signal->session==tty->session\n",
p->pid, p->comm);
send_sig(SIGKILL, p, 1);
- } while_each_task_pid(session, PIDTYPE_SID, p);
+ } while_each_task_pid(tty->pspace, session, PIDTYPE_SID, p);
/* Now kill any processes that happen to have the
* tty open.
*/
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 898d593..bc9e9c4 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -184,6 +184,7 @@ struct tty_struct {
struct semaphore termios_sem;
struct termios *termios, *termios_locked;
char name[64];
+ struct pspace *pspace;
int pgrp;
int session;
unsigned long flags;
--
1.1.5.g3480

2006-02-06 19:47:32

by Eric W. Biederman

Subject: [RFC][PATCH 08/20] vt: Update the virtual console to handle pspaces.



Currently I am lazy and disable all advanced signal sending
features of virtual consoles if you are not a member of the initial
process id space. This is not quite right, but it certainly saves the
reference counting hassle, and it is unlikely that anyone cares at
this point.


Fix this when I get pspace disappearance notifiers.
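
As a rough user-space sketch of that policy (not the actual vt_ioctl code):
the enum values and the two helper names below are assumptions for
illustration. The error codes follow the patch, which returns -EPERM for
the spawn-signal registration and -EINVAL for process-controlled mode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's pspace type. */
struct pspace { int id; };
static struct pspace init_pspace = { 0 };

/* Made-up stand-ins for the VT_AUTO / VT_PROCESS modes. */
enum vt_mode_kind { MODE_AUTO, MODE_PROCESS };

/* Process-controlled switching stores a pid to signal later, so it is
 * only allowed from the initial pspace. Automatic mode is always fine. */
static int vt_setmode_check(const struct pspace *caller, enum vt_mode_kind mode)
{
	assert(caller != NULL);
	if (mode == MODE_PROCESS && caller != &init_pspace)
		return -EINVAL;
	return 0;
}

/* Registering for the console spawn signal also records a pid, so it is
 * refused outside the initial pspace as well. */
static int vt_spawn_check(const struct pspace *caller)
{
	assert(caller != NULL);
	return caller == &init_pspace ? 0 : -EPERM;
}
```

Because every stored pid then belongs to init_pspace, the later kill_proc()
calls can pass &init_pspace unconditionally.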

Signed-off-by: Eric W. Biederman <[email protected]>


---

drivers/char/keyboard.c | 3 ++-
drivers/char/vt.c | 2 +-
drivers/char/vt_ioctl.c | 9 +++++++--
3 files changed, 10 insertions(+), 4 deletions(-)

74c362e5aab6c780cf6a27d861874e71cebc8b2e
diff --git a/drivers/char/keyboard.c b/drivers/char/keyboard.c
index 8b603b2..511dad7 100644
--- a/drivers/char/keyboard.c
+++ b/drivers/char/keyboard.c
@@ -33,6 +33,7 @@
#include <linux/string.h>
#include <linux/init.h>
#include <linux/slab.h>
+#include <linux/pspace.h>

#include <linux/kbd_kern.h>
#include <linux/kbd_diacr.h>
@@ -568,7 +569,7 @@ static void fn_compose(struct vc_data *v
static void fn_spawn_con(struct vc_data *vc, struct pt_regs *regs)
{
if (spawnpid)
- if (kill_proc(spawnpid, spawnsig, 1))
+ if (kill_proc(&init_pspace, spawnpid, spawnsig, 1))
spawnpid = 0;
}

diff --git a/drivers/char/vt.c b/drivers/char/vt.c
index 0900d1d..e354c56 100644
--- a/drivers/char/vt.c
+++ b/drivers/char/vt.c
@@ -856,7 +856,7 @@ int vc_resize(struct vc_data *vc, unsign
ws.ws_ypixel = vc->vc_scan_lines;
if ((ws.ws_row != cws->ws_row || ws.ws_col != cws->ws_col) &&
vc->vc_tty->pgrp > 0)
- kill_pg(vc->vc_tty->pgrp, SIGWINCH, 1);
+ kill_pg(vc->vc_tty->pspace, vc->vc_tty->pgrp, SIGWINCH, 1);
*cws = ws;
}

diff --git a/drivers/char/vt_ioctl.c b/drivers/char/vt_ioctl.c
index 24011e7..431c3ed 100644
--- a/drivers/char/vt_ioctl.c
+++ b/drivers/char/vt_ioctl.c
@@ -26,6 +26,7 @@
#include <linux/console.h>
#include <linux/signal.h>
#include <linux/timex.h>
+#include <linux/pspace.h>

#include <asm/io.h>
#include <asm/uaccess.h>
@@ -649,6 +650,8 @@ int vt_ioctl(struct tty_struct *tty, str
extern int spawnpid, spawnsig;
if (!perm || !capable(CAP_KILL))
return -EPERM;
+ if (current->pspace != &init_pspace)
+ return -EPERM;
if (!valid_signal(arg) || arg < 1 || arg == SIGKILL)
return -EINVAL;
spawnpid = current->pid;
@@ -666,6 +669,8 @@ int vt_ioctl(struct tty_struct *tty, str
return -EFAULT;
if (tmp.mode != VT_AUTO && tmp.mode != VT_PROCESS)
return -EINVAL;
+ if ((tmp.mode == VT_PROCESS) && (current->pspace != &init_pspace))
+ return -EINVAL;
acquire_console_sem();
vc->vt_mode = tmp;
/* the frsig is ignored, so we set it to 0 */
@@ -1113,7 +1118,7 @@ static void complete_change_console(stru
* tell us if the process has gone or something else
* is awry
*/
- if (kill_proc(vc->vt_pid, vc->vt_mode.acqsig, 1) != 0) {
+ if (kill_proc(&init_pspace, vc->vt_pid, vc->vt_mode.acqsig, 1) != 0) {
/*
* The controlling process has died, so we revert back to
* normal operation. In this case, we'll also change back
@@ -1173,7 +1178,7 @@ void change_console(struct vc_data *new_
* tell us if the process has gone or something else
* is awry
*/
- if (kill_proc(vc->vt_pid, vc->vt_mode.relsig, 1) == 0) {
+ if (kill_proc(&init_pspace, vc->vt_pid, vc->vt_mode.relsig, 1) == 0) {
/*
* It worked. Mark the vt to switch to and
* return. The process needs to send us a
--
1.1.5.g3480

2006-02-06 19:49:21

by Eric W. Biederman

Subject: [RFC][PATCH 09/20] ptrace: Update ptrace to handle pspaces.


Signed-off-by: Eric W. Biederman <[email protected]>


---

kernel/ptrace.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

38464ccf04964acc267370758e89a9b4813b6191
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 5f33cdb..73dc283 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -455,7 +455,7 @@ struct task_struct *ptrace_get_task_stru
return ERR_PTR(-EPERM);

read_lock(&tasklist_lock);
- child = find_task_by_pid(pid);
+ child = find_task_by_pid(current->pspace, pid);
if (child)
get_task_struct(child);
read_unlock(&tasklist_lock);
--
1.1.5.g3480

2006-02-06 19:51:52

by Eric W. Biederman

Subject: [RFC][PATCH 10/20] capabilities: Update the capabilities code to handle pspaces.


Signed-off-by: Eric W. Biederman <[email protected]>


---

kernel/capability.c | 56 +++++++++++++++++++++++++++++++--------------------
1 files changed, 34 insertions(+), 22 deletions(-)

d84edcf08e16ef0af7170b494b371493d1829ee7
diff --git a/kernel/capability.c b/kernel/capability.c
index bfa3c92..80a618b 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -12,6 +12,7 @@
#include <linux/module.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/pspace.h>
#include <asm/uaccess.h>

unsigned securebits = SECUREBITS_DEFAULT; /* systemwide security settings */
@@ -68,7 +69,7 @@ asmlinkage long sys_capget(cap_user_head
read_lock(&tasklist_lock);

if (pid && pid != current->pid) {
- target = find_task_by_pid(pid);
+ target = find_task_by_pid(current->pspace, pid);
if (!target) {
ret = -ESRCH;
goto out;
@@ -96,11 +97,12 @@ static inline int cap_set_pg(int pgrp, k
kernel_cap_t *inheritable,
kernel_cap_t *permitted)
{
+ struct pspace *pspace = current->pspace;
task_t *g, *target;
int ret = -EPERM;
int found = 0;

- do_each_task_pid(pgrp, PIDTYPE_PGID, g) {
+ do_each_task_pid(pspace, pgrp, PIDTYPE_PGID, g) {
target = g;
while_each_thread(g, target) {
if (!security_capset_check(target, effective,
@@ -113,7 +115,7 @@ static inline int cap_set_pg(int pgrp, k
}
found = 1;
}
- } while_each_task_pid(pgrp, PIDTYPE_PGID, g);
+ } while_each_task_pid(pspace, pgrp, PIDTYPE_PGID, g);

if (!found)
ret = 0;
@@ -121,20 +123,26 @@ static inline int cap_set_pg(int pgrp, k
}

/*
- * cap_set_all - set capabilities for all processes other than init
- * and self. We call this holding task_capability_lock and tasklist_lock.
- */
-static inline int cap_set_all(kernel_cap_t *effective,
- kernel_cap_t *inheritable,
- kernel_cap_t *permitted)
+ * cap_set_pspace - set capabilities for all processes in pspace
+ * other than init and self. We call this holding
+ * task_capability_lock and tasklist_lock.
+ */
+static inline int cap_set_pspace(struct pspace *pspace,
+ kernel_cap_t *effective,
+ kernel_cap_t *inheritable,
+ kernel_cap_t *permitted)
{
task_t *g, *target;
int ret = -EPERM;
int found = 0;

do_each_thread(g, target) {
- if (target == current || target->pid == 1)
- continue;
+ if (target == current)
+ continue;
+ if (current_pspace_leader(target))
+ continue;
+ if (!in_pspace(pspace, target))
+ continue;
found = 1;
if (security_capset_check(target, effective, inheritable,
permitted))
@@ -200,7 +208,7 @@ asmlinkage long sys_capset(cap_user_head
read_lock(&tasklist_lock);

if (pid > 0 && pid != current->pid) {
- target = find_task_by_pid(pid);
+ target = find_task_by_pid(current->pspace, pid);
if (!target) {
ret = -ESRCH;
goto out;
@@ -212,20 +220,24 @@ asmlinkage long sys_capset(cap_user_head

/* having verified that the proposed changes are legal,
we now put them into effect. */
- if (pid < 0) {
- if (pid == -1) /* all procs other than current and init */
- ret = cap_set_all(&effective, &inheritable, &permitted);
+ if (pid < 0) {
+ struct task_struct *p;

- else /* all procs in process group */
- ret = cap_set_pg(-pid, &effective, &inheritable,
+ p = find_task_by_pid(current->pspace, -pid);
+ if (p && pspace_leader(p))
+ /* all procs other than current and init */
+ ret = cap_set_pspace(p->pspace, &effective,
+ &inheritable, &permitted);
+ else /* all procs in process group */
+ ret = cap_set_pg(-pid, &effective, &inheritable,
&permitted);
- } else {
- ret = security_capset_check(target, &effective, &inheritable,
+ } else {
+ ret = security_capset_check(target, &effective, &inheritable,
&permitted);
- if (!ret)
- security_capset_set(target, &effective, &inheritable,
+ if (!ret)
+ security_capset_set(target, &effective, &inheritable,
&permitted);
- }
+ }

out:
read_unlock(&tasklist_lock);
--
1.1.5.g3480

2006-02-06 19:54:33

by Serge E. Hallyn

Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

Quoting Eric W. Biederman ([email protected]):
>
> The wait id is the pid returned by wait. For tasks that span 2
> namespaces (i.e. the process leaders of the pid namespaces) their
> parent knows the task by a different PID value than the task knows
> itself. Having a child with PID == 1 would be confusing.

Is it possible here to have wid conflicts?

Does that matter?

Looking at sysvinit, it seems that it does. If the wid happens
to conflict with the pid of one of the children init knows about,
it could confuse init.

-serge

2006-02-06 19:55:12

by Eric W. Biederman

Subject: [RFC][PATCH 11/20] ioprio: Update ioprio to handle pspaces.


Signed-off-by: Eric W. Biederman <[email protected]>


---

fs/ioprio.c | 13 +++++++------
1 files changed, 7 insertions(+), 6 deletions(-)

42c8b99a7c1417f8fc587f86c218a0be10ec27d0
diff --git a/fs/ioprio.c b/fs/ioprio.c
index ca77008..008301b 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -24,6 +24,7 @@
#include <linux/blkdev.h>
#include <linux/capability.h>
#include <linux/syscalls.h>
+#include <linux/pspace.h>

static int set_task_ioprio(struct task_struct *task, int ioprio)
{
@@ -78,18 +79,18 @@ asmlinkage long sys_ioprio_set(int which
if (!who)
p = current;
else
- p = find_task_by_pid(who);
+ p = find_task_by_pid(current->pspace, who);
if (p)
ret = set_task_ioprio(p, ioprio);
break;
case IOPRIO_WHO_PGRP:
if (!who)
who = process_group(current);
- do_each_task_pid(who, PIDTYPE_PGID, p) {
+ do_each_task_pid(current->pspace, who, PIDTYPE_PGID, p) {
ret = set_task_ioprio(p, ioprio);
if (ret)
break;
- } while_each_task_pid(who, PIDTYPE_PGID, p);
+ } while_each_task_pid(current->pspace, who, PIDTYPE_PGID, p);
break;
case IOPRIO_WHO_USER:
if (!who)
@@ -131,19 +132,19 @@ asmlinkage long sys_ioprio_get(int which
if (!who)
p = current;
else
- p = find_task_by_pid(who);
+ p = find_task_by_pid(current->pspace, who);
if (p)
ret = p->ioprio;
break;
case IOPRIO_WHO_PGRP:
if (!who)
who = process_group(current);
- do_each_task_pid(who, PIDTYPE_PGID, p) {
+ do_each_task_pid(current->pspace, who, PIDTYPE_PGID, p) {
if (ret == -ESRCH)
ret = p->ioprio;
else
ret = ioprio_best(ret, p->ioprio);
- } while_each_task_pid(who, PIDTYPE_PGID, p);
+ } while_each_task_pid(current->pspace, who, PIDTYPE_PGID, p);
break;
case IOPRIO_WHO_USER:
if (!who)
--
1.1.5.g3480

2006-02-06 19:56:33

by Dave Hansen

Subject: Re: [RFC][PATCH 03/20] pid: Introduce a generic helper to test for init.

On Mon, 2006-02-06 at 12:29 -0700, Eric W. Biederman wrote:
> Introduce is_init to capture this case.
>
> With multiple pid spaces for all of the cases affected we are looking
> for only the first process on the system, not some other process that
> has pid == 1.

If we have cases where each container has its own init (like vserver,
right?), will this naming get confusing? Will we have pseudo-init tasks
as well?
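
To make the distinction in question concrete, here is a small user-space
sketch. It assumes, purely for illustration, that is_init() compares against
the one system-wide init task and that a container's own init simply has
pid 1 inside its pspace. The has_pid_one() helper is a made-up name, and the
real is_init from patch 03 may be implemented differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct pspace { int id; };
struct task { struct pspace *pspace; int pid; };

static struct pspace init_pspace = { 0 };
static struct task system_init = { &init_pspace, 1 };

/* True only for the first process on the system. */
static int is_init(const struct task *t)
{
	assert(t != NULL);
	return t == &system_init;
}

/* True for pid 1 of any pspace, including a container's own init. */
static int has_pid_one(const struct task *t)
{
	assert(t != NULL);
	return t->pid == 1;
}
```

A per-container init would satisfy has_pid_one() but not is_init(), which
is the naming overlap being asked about.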

-- Dave

2006-02-06 19:57:20

by Eric W. Biederman

Subject: [RFC][PATCH 12/20] fcntl: Update fcntl to work with pspaces.



Signed-off-by: Eric W. Biederman <[email protected]>


---

fs/fcntl.c | 25 ++++++++++++++++---------
include/linux/fs.h | 1 +
2 files changed, 17 insertions(+), 9 deletions(-)

34f261442c7ac3c160830b86aefd5b15df68de9d
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5f96786..0a41df8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -18,6 +18,7 @@
#include <linux/ptrace.h>
#include <linux/signal.h>
#include <linux/rcupdate.h>
+#include <linux/pspace.h>

#include <asm/poll.h>
#include <asm/siginfo.h>
@@ -248,11 +249,14 @@ static int setfl(int fd, struct file * f
return error;
}

-static void f_modown(struct file *filp, unsigned long pid,
+static void f_modown(struct file *filp, struct pspace *pspace, unsigned long pid,
uid_t uid, uid_t euid, int force)
{
write_lock_irq(&filp->f_owner.lock);
if (force || !filp->f_owner.pid) {
+ put_pspace(filp->f_owner.pspace);
+ get_pspace(pspace);
+ filp->f_owner.pspace = pspace;
filp->f_owner.pid = pid;
filp->f_owner.uid = uid;
filp->f_owner.euid = euid;
@@ -268,7 +272,7 @@ int f_setown(struct file *filp, unsigned
if (err)
return err;

- f_modown(filp, arg, current->uid, current->euid, force);
+ f_modown(filp, current->pspace, arg, current->uid, current->euid, force);
return 0;
}

@@ -276,7 +280,7 @@ EXPORT_SYMBOL(f_setown);

void f_delown(struct file *filp)
{
- f_modown(filp, 0, 0, 0, 1);
+ f_modown(filp, NULL, 0, 0, 0, 1);
}

static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
@@ -318,6 +322,9 @@ static long do_fcntl(int fd, unsigned in
* to fix this will be in libc.
*/
err = filp->f_owner.pid;
+ /* Don't return an owner in a different pspace */
+ if (filp->f_owner.pspace != current->pspace)
+ err = 0;
force_successful_syscall_return();
break;
case F_SETOWN:
@@ -478,14 +485,14 @@ void send_sigio(struct fown_struct *fown

read_lock(&tasklist_lock);
if (pid > 0) {
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(fown->pspace, pid);
if (p) {
send_sigio_to_task(p, fown, fd, band);
}
} else {
- do_each_task_pid(-pid, PIDTYPE_PGID, p) {
+ do_each_task_pid(fown->pspace, -pid, PIDTYPE_PGID, p) {
send_sigio_to_task(p, fown, fd, band);
- } while_each_task_pid(-pid, PIDTYPE_PGID, p);
+ } while_each_task_pid(fown->pspace, -pid, PIDTYPE_PGID, p);
}
read_unlock(&tasklist_lock);
out_unlock_fown:
@@ -513,14 +520,14 @@ int send_sigurg(struct fown_struct *fown

read_lock(&tasklist_lock);
if (pid > 0) {
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(fown->pspace, pid);
if (p) {
send_sigurg_to_task(p, fown);
}
} else {
- do_each_task_pid(-pid, PIDTYPE_PGID, p) {
+ do_each_task_pid(fown->pspace, -pid, PIDTYPE_PGID, p) {
send_sigurg_to_task(p, fown);
- } while_each_task_pid(-pid, PIDTYPE_PGID, p);
+ } while_each_task_pid(fown->pspace, -pid, PIDTYPE_PGID, p);
}
read_unlock(&tasklist_lock);
out_unlock_fown:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e059da9..cabb1b6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -589,6 +589,7 @@ extern struct block_device *I_BDEV(struc

struct fown_struct {
rwlock_t lock; /* protects pid, uid, euid fields */
+ struct pspace *pspace; /* process space of owner pid */
int pid; /* pid or -pgrp where SIGIO should be sent */
uid_t uid, euid; /* uid/euid of process setting the owner */
void *security;
--
1.1.5.g3480

2006-02-06 20:01:13

by Eric W. Biederman

Subject: [RFC][PATCH 13/20] kthread: Update kthread to work with pspaces.



Signed-off-by: Eric W. Biederman <[email protected]>


---

kernel/kthread.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

4dbaead60e8ba1ef38aea6059d1abb6f6a10fd1b
diff --git a/kernel/kthread.c b/kernel/kthread.c
index e75950a..0b4fa7b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -12,6 +12,7 @@
#include <linux/unistd.h>
#include <linux/file.h>
#include <linux/module.h>
+#include <linux/pspace.h>
#include <asm/semaphore.h>

/*
@@ -114,7 +115,7 @@ static void keventd_create_kthread(void
create->result = ERR_PTR(pid);
} else {
wait_for_completion(&create->started);
- create->result = find_task_by_pid(pid);
+ create->result = find_task_by_pid(current->pspace, pid);
}
complete(&create->done);
}
--
1.1.5.g3480

2006-02-06 20:01:52

by Eric W. Biederman

Subject: [RFC][PATCH 14/20] mm: Update vmscan to work with pspaces.


Signed-off-by: Eric W. Biederman <[email protected]>


---

mm/vmscan.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

c38033f8c948f7fccb92c290f66d3a757c5cec04
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a61080..ca5fceb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1802,7 +1802,8 @@ static int __init kswapd_init(void)
swap_setup();
for_each_pgdat(pgdat)
pgdat->kswapd
- = find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
+ = find_task_by_pid(current->pspace,
+ kernel_thread(kswapd, pgdat, CLONE_KERNEL));
total_memory = nr_free_pagecache_pages();
hotcpu_notifier(cpu_callback, 0);
return 0;
--
1.1.5.g3480

2006-02-06 20:03:25

by Eric W. Biederman

Subject: [RFC][PATCH 15/20] posix-timers: Update posix timers to work with pspaces.


Signed-off-by: Eric W. Biederman <[email protected]>


---

kernel/posix-cpu-timers.c | 9 +++++----
kernel/posix-timers.c | 4 +++-
2 files changed, 8 insertions(+), 5 deletions(-)

90b04f2eeb0fd42309ed68dad92e1b7bb4d2fba6
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 520f6c5..9eda08a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -3,6 +3,7 @@
*/

#include <linux/sched.h>
+#include <linux/pspace.h>
#include <linux/posix-timers.h>
#include <asm/uaccess.h>
#include <linux/errno.h>
@@ -20,7 +21,7 @@ static int check_clock(const clockid_t w
return 0;

read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);
if (!p || (CPUCLOCK_PERTHREAD(which_clock) ?
p->tgid != current->tgid : p->tgid != pid)) {
error = -EINVAL;
@@ -292,7 +293,7 @@ int posix_cpu_clock_get(const clockid_t
*/
struct task_struct *p;
read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);
if (p) {
if (CPUCLOCK_PERTHREAD(which_clock)) {
if (p->tgid == current->tgid) {
@@ -336,7 +337,7 @@ int posix_cpu_timer_create(struct k_itim
if (pid == 0) {
p = current;
} else {
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);
if (p && p->tgid != current->tgid)
p = NULL;
}
@@ -344,7 +345,7 @@ int posix_cpu_timer_create(struct k_itim
if (pid == 0) {
p = current->group_leader;
} else {
- p = find_task_by_pid(pid);
+ p = find_task_by_pid(current->pspace, pid);
if (p && p->tgid != pid)
p = NULL;
}
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 216f574..5f6775c 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -35,6 +35,7 @@
#include <linux/interrupt.h>
#include <linux/slab.h>
#include <linux/time.h>
+#include <linux/pspace.h>

#include <asm/uaccess.h>
#include <asm/semaphore.h>
@@ -365,7 +366,8 @@ static struct task_struct * good_sigeven
struct task_struct *rtn = current->group_leader;

if ((event->sigev_notify & SIGEV_THREAD_ID ) &&
- (!(rtn = find_task_by_pid(event->sigev_notify_thread_id)) ||
+ (!(rtn = find_task_by_pid(current->pspace,
+ event->sigev_notify_thread_id)) ||
rtn->tgid != current->tgid ||
(event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL))
return NULL;
--
1.1.5.g3480

2006-02-06 20:05:11

by Eric W. Biederman

Subject: [PATCH 16/20] nfs: Don't use pids to track the lockd server process.



Signed-off-by: Eric W. Biederman <[email protected]>


---

fs/lockd/svc.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)

8879803b195822f0a4e6f1b05297f5ffe3f62ad1
diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 71a30b4..e5c1d5c 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL(nlmsvc_ops);

static DECLARE_MUTEX(nlmsvc_sema);
static unsigned int nlmsvc_users;
-static pid_t nlmsvc_pid;
+static struct task_struct * nlmsvc_task;
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;

@@ -111,7 +111,8 @@ lockd(struct svc_rqst *rqstp)
/*
* Let our maker know we're running.
*/
- nlmsvc_pid = current->pid;
+ get_task_struct(current);
+ nlmsvc_task = current;
up(&lockd_start);

daemonize("lockd");
@@ -135,7 +136,7 @@ lockd(struct svc_rqst *rqstp)
* NFS mount or NFS daemon has gone away, and we've been sent a
* signal, or else another process has taken over our job.
*/
- while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
+ while ((nlmsvc_users || !signalled()) && nlmsvc_task == current) {
long timeout = MAX_SCHEDULE_TIMEOUT;

if (signalled()) {
@@ -184,11 +185,12 @@ lockd(struct svc_rqst *rqstp)
* Check whether there's a new lockd process before
* shutting down the hosts and clearing the slot.
*/
- if (!nlmsvc_pid || current->pid == nlmsvc_pid) {
+ if (!nlmsvc_task || current == nlmsvc_task) {
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
- nlmsvc_pid = 0;
+ put_task_struct(nlmsvc_task);
+ nlmsvc_task = NULL;
} else
printk(KERN_DEBUG
"lockd: new process, skipping host shutdown\n");
@@ -224,7 +226,7 @@ lockd_up(void)
/*
* Check whether we're already up and running.
*/
- if (nlmsvc_pid)
+ if (nlmsvc_task)
goto out;

/*
@@ -290,26 +292,27 @@ lockd_down(void)
if (--nlmsvc_users)
goto out;
} else
+ printk(KERN_WARNING "lockd_down: no users! pid=%d\n", nlmsvc_task ? nlmsvc_task->pid : 0);
+ printk(KERN_WARNING "lockd_down: no users! pid=%d\n", nlmsvc_task->pid);

- if (!nlmsvc_pid) {
+ if (!nlmsvc_task) {
if (warned++ == 0)
printk(KERN_WARNING "lockd_down: no lockd running.\n");
goto out;
}
warned = 0;

- kill_proc(nlmsvc_pid, SIGKILL, 1);
+ send_group_sig_info(SIGKILL, SEND_SIG_PRIV, nlmsvc_task);
/*
* Wait for the lockd process to exit, but since we're holding
* the lockd semaphore, we can't wait around forever ...
*/
clear_thread_flag(TIF_SIGPENDING);
interruptible_sleep_on_timeout(&lockd_exit, HZ);
- if (nlmsvc_pid) {
+ if (nlmsvc_task) {
printk(KERN_WARNING
"lockd_down: lockd failed to exit, clearing pid\n");
- nlmsvc_pid = 0;
+ put_task_struct(nlmsvc_task);
+ nlmsvc_task = NULL;
}
spin_lock_irq(&current->sighand->siglock);
recalc_sigpending();
--
1.1.5.g3480

2006-02-06 20:06:19

by Eric W. Biederman

Subject: [RFC][PATCH 17/20] usb: Fixup usb so it works with pspaces.


Signed-off-by: Eric W. Biederman <[email protected]>


---

drivers/usb/core/devio.c | 12 ++++++++++--
drivers/usb/core/inode.c | 4 +++-
drivers/usb/core/usb.h | 1 +
3 files changed, 14 insertions(+), 3 deletions(-)

94873f6924179bc2ee782f6fb72d5e855c780a1b
diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 2b68998..7bccb9a 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -47,6 +47,7 @@
#include <linux/usbdevice_fs.h>
#include <linux/cdev.h>
#include <linux/notifier.h>
+#include <linux/pspace.h>
#include <asm/uaccess.h>
#include <asm/byteorder.h>
#include <linux/moduleparam.h>
@@ -61,6 +62,7 @@ static struct class *usb_device_class;
struct async {
struct list_head asynclist;
struct dev_state *ps;
+ struct pspace *pspace;
pid_t pid;
uid_t uid, euid;
unsigned int signr;
@@ -224,6 +226,7 @@ static struct async *alloc_async(unsigne

static void free_async(struct async *as)
{
+ put_pspace(as->pspace);
kfree(as->urb->transfer_buffer);
kfree(as->urb->setup_packet);
usb_free_urb(as->urb);
@@ -316,8 +319,8 @@ static void async_completed(struct urb *
sinfo.si_errno = as->urb->status;
sinfo.si_code = SI_ASYNCIO;
sinfo.si_addr = as->userurb;
- kill_proc_info_as_uid(as->signr, &sinfo, as->pid, as->uid,
- as->euid);
+ kill_proc_info_as_uid(as->signr, &sinfo, as->pspace, as->pid,
+ as->uid, as->euid);
}
snoop(&urb->dev->dev, "urb complete\n");
snoop_urb(urb, as->userurb);
@@ -571,6 +574,8 @@ static int usbdev_open(struct inode *ino
INIT_LIST_HEAD(&ps->async_completed);
init_waitqueue_head(&ps->wait);
ps->discsignr = 0;
+ ps->disc_pspace = current->pspace;
+ get_pspace(ps->disc_pspace);
ps->disc_pid = current->pid;
ps->disc_uid = current->uid;
ps->disc_euid = current->euid;
@@ -601,6 +606,7 @@ static int usbdev_release(struct inode *
destroy_all_async(ps);
usb_unlock_device(dev);
usb_put_dev(dev);
+ put_pspace(ps->disc_pspace);
ps->dev = NULL;
kfree(ps);
return 0;
@@ -1054,6 +1060,8 @@ static int proc_do_submiturb(struct dev_
as->userbuffer = NULL;
as->signr = uurb->signr;
as->ifnum = ifnum;
+ as->pspace = current->pspace;
+ get_pspace(as->pspace);
as->pid = current->pid;
as->uid = current->uid;
as->euid = current->euid;
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index 3cf945c..94c91b3 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -700,7 +700,9 @@ static void usbfs_remove_device(struct u
sinfo.si_errno = EPIPE;
sinfo.si_code = SI_ASYNCIO;
sinfo.si_addr = ds->disccontext;
- kill_proc_info_as_uid(ds->discsignr, &sinfo, ds->disc_pid, ds->disc_uid, ds->disc_euid);
+ kill_proc_info_as_uid(ds->discsignr, &sinfo,
+ ds->disc_pspace, ds->disc_pid,
+ ds->disc_uid, ds->disc_euid);
}
}
}
diff --git a/drivers/usb/core/usb.h b/drivers/usb/core/usb.h
index 4647e1e..512c5c7 100644
--- a/drivers/usb/core/usb.h
+++ b/drivers/usb/core/usb.h
@@ -73,6 +73,7 @@ struct dev_state {
struct list_head async_completed;
wait_queue_head_t wait; /* wake up if a request completed */
unsigned int discsignr;
+ struct pspace *disc_pspace;
pid_t disc_pid;
uid_t disc_uid, disc_euid;
void __user *disccontext;
--
1.1.5.g3480

2006-02-06 20:08:23

by Eric W. Biederman

Subject: [RFC][PATCH 18/20] posix-mqueue: Make mqueues work with pspaces


Signed-off-by: Eric W. Biederman <[email protected]>


---

ipc/mqueue.c | 21 ++++++++++++++++++---
1 files changed, 18 insertions(+), 3 deletions(-)

04e0f90a315df40b07d316c884a28c58eed4de33
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 59302fc..5c71ae0 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -25,6 +25,7 @@
#include <linux/netlink.h>
#include <linux/syscalls.h>
#include <linux/signal.h>
+#include <linux/pspace.h>
#include <net/sock.h>
#include "util.h"

@@ -69,6 +70,7 @@ struct mqueue_inode_info {
struct mq_attr attr;

struct sigevent notify;
+ struct pspace *notify_pspace;
pid_t notify_owner;
struct user_struct *user; /* user who created, for accounting */
struct sock *notify_sock;
@@ -131,6 +133,7 @@ static struct inode *mqueue_get_inode(st
INIT_LIST_HEAD(&info->e_wait_q[0].list);
INIT_LIST_HEAD(&info->e_wait_q[1].list);
info->messages = NULL;
+ info->notify_pspace = NULL;
info->notify_owner = 0;
info->qsize = 0;
info->user = NULL; /* set when all is ok */
@@ -250,6 +253,9 @@ static void mqueue_delete_inode(struct i
kfree(info->messages);
spin_unlock(&info->lock);

+ put_pspace(info->notify_pspace);
+ info->notify_pspace = NULL;
+
clear_inode(inode);

mq_bytes = (info->attr.mq_maxmsg * sizeof(struct msg_msg *) +
@@ -360,7 +366,8 @@ static int mqueue_flush_file(struct file
struct mqueue_inode_info *info = MQUEUE_I(filp->f_dentry->d_inode);

spin_lock(&info->lock);
- if (current->tgid == info->notify_owner)
+ if ((current->tgid == info->notify_owner) &&
+ (current->pspace == info->notify_pspace))
remove_notification(info);

spin_unlock(&info->lock);
@@ -516,7 +523,8 @@ static void __do_notify(struct mqueue_in
sig_i.si_uid = current->uid;

kill_proc_info(info->notify.sigev_signo,
- &sig_i, info->notify_owner);
+ &sig_i,
+ info->notify_pspace, info->notify_owner);
break;
case SIGEV_THREAD:
set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
@@ -525,7 +533,9 @@ static void __do_notify(struct mqueue_in
break;
}
/* after notification unregisters process */
+ put_pspace(info->notify_pspace);
info->notify_owner = 0;
+ info->notify_pspace = NULL;
}
wake_up(&info->wait_q);
}
@@ -568,6 +578,8 @@ static void remove_notification(struct m
set_cookie(info->notify_cookie, NOTIFY_REMOVED);
netlink_sendskb(info->notify_sock, info->notify_cookie, 0);
}
+ put_pspace(info->notify_pspace);
+ info->notify_pspace = NULL;
info->notify_owner = 0;
}

@@ -1042,7 +1054,8 @@ retry:
ret = 0;
spin_lock(&info->lock);
if (u_notification == NULL) {
- if (info->notify_owner == current->tgid) {
+ if ((info->notify_owner == current->tgid) &&
+ (info->notify_pspace == current->pspace)) {
remove_notification(info);
inode->i_atime = inode->i_ctime = CURRENT_TIME;
}
@@ -1066,7 +1079,9 @@ retry:
info->notify.sigev_notify = SIGEV_SIGNAL;
break;
}
+ info->notify_pspace = current->pspace;
info->notify_owner = current->tgid;
+ get_pspace(info->notify_pspace);
inode->i_atime = inode->i_ctime = CURRENT_TIME;
}
spin_unlock(&info->lock);
--
1.1.5.g3480

2006-02-06 20:10:31

by Eric W. Biederman

Subject: [RFC][PATCH 19/20] pspace: Update the pid_max sysctl to work in a per pspace fashion


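It is a pain to export anything through sysctl that is not a global
variable, but it is possible. So, for backwards compatibility, keep
the pid_max sysctl and have it read and write the limit of the
caller's pspace (current->pspace->max).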
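
As a rough user-space model of the idea (not the kernel code), the
sketch below keeps the limit in each namespace object and range-checks
writes, as the handlers in this patch do against pid_max_min and
pid_max_max. The struct pspace_model type, the function name and the
two bound values are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the patch's struct pspace; only the
 * per-namespace limit is modelled here. */
struct pspace_model {
	int max;	/* mirrors current->pspace->max in the patch */
};

/* Assumed bounds for illustration; the kernel takes them from
 * pid_max_min and pid_max_max. */
static const int model_pid_max_min = 301;
static const int model_pid_max_max = 32768;

/* Store a new limit in one namespace, rejecting out-of-range values
 * with -EINVAL the way do_pidmax_conv() and sysctl_pidmax() do. */
static int pspace_model_set_max(struct pspace_model *ps, int val)
{
	if (val < model_pid_max_min || val > model_pid_max_max)
		return -EINVAL;
	ps->max = val;
	return 0;
}
```

Because the value lives in the namespace object rather than in a
global, a write made from inside one pspace leaves every other pspace
untouched.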

Signed-off-by: Eric W. Biederman <[email protected]>


---

kernel/sysctl.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 75 insertions(+), 4 deletions(-)

dc9bb041416aeaa92add46c7fe7689099768d8fa
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8e1bdc5..89476f6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -44,6 +44,7 @@
#include <linux/limits.h>
#include <linux/dcache.h>
#include <linux/syscalls.h>
+#include <linux/pspace.h>

#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -64,7 +65,6 @@ extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
extern int cad_pid;
-extern int pid_max;
extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
@@ -132,6 +132,11 @@ static int parse_table(int __user *, int
ctl_table *, void **);
static int proc_doutsstring(ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+static int proc_pidmax(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+static int sysctl_pidmax(ctl_table *table, int __user *name, int nlen,
+ void __user *oldval, size_t __user *oldlenp,
+ void __user *newval, size_t newlen, void **context);

static ctl_table root_table[];
static struct ctl_table_header root_table_header =
@@ -579,11 +584,11 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_PIDMAX,
.procname = "pid_max",
- .data = &pid_max,
+ .data = (void *)1,
.maxlen = sizeof (int),
.mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
- .strategy = sysctl_intvec,
+ .proc_handler = &proc_pidmax,
+ .strategy = sysctl_pidmax,
.extra1 = &pid_max_min,
.extra2 = &pid_max_max,
},
@@ -2157,6 +2162,29 @@ int proc_dointvec_ms_jiffies(ctl_table *
do_proc_dointvec_ms_jiffies_conv, NULL);
}

+static int do_pidmax_conv(int *negp, unsigned long *lvalp,
+ int *valp, int write, void *data)
+{
+ if (write) {
+ int val = *negp ? -*lvalp : *lvalp;
+ if ((pid_max_min > val) || (pid_max_max < val))
+ return -EINVAL;
+ current->pspace->max = val;
+ } else {
+ *negp = 0;
+ *lvalp = (unsigned long)(current->pspace->max);
+ }
+ return 0;
+}
+
+static int proc_pidmax(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return do_proc_dointvec(table, write, filp, buffer, lenp, ppos,
+ do_pidmax_conv, NULL);
+}
+
+
#else /* CONFIG_PROC_FS */

int proc_dostring(ctl_table *table, int write, struct file *filp,
@@ -2222,6 +2250,12 @@ int proc_doulongvec_ms_jiffies_minmax(ct
}


+static int proc_pidmax(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+
#endif /* CONFIG_PROC_FS */


@@ -2367,6 +2401,43 @@ int sysctl_ms_jiffies(ctl_table *table,
return 1;
}

+static int sysctl_pidmax(ctl_table *table, int __user *name, int nlen,
+ void __user *oldval, size_t __user *oldlenp,
+ void __user *newval, size_t newlen, void **context)
+{
+ int result = 0;
+ if (oldval && oldlenp) {
+ size_t len;
+ if (get_user(len, oldlenp))
+ return -EFAULT;
+ if (len < sizeof(int))
+ return -EINVAL;
+ if (put_user(current->pspace->max, oldval))
+ return -EFAULT;
+ if (put_user(sizeof(int), oldlenp))
+ return -EFAULT;
+ result = 1;
+ }
+ if (newval && newlen) {
+ int __user *vec = (int __user *)newval;
+ int value;
+
+ if (newlen != sizeof(int))
+ return -EINVAL;
+
+ if (get_user(value, vec))
+ return -EFAULT;
+ if (value < pid_max_min)
+ return -EINVAL;
+ if (value > pid_max_max)
+ return -EINVAL;
+
+ current->pspace->max = value;
+ result = 1;
+ }
+ return result;
+}
+
#else /* CONFIG_SYSCTL */


--
1.1.5.g3480

2006-02-06 20:16:32

by Eric W. Biederman

Subject: [RFC][PATCH 20/20] proc: Update /proc to support multiple pid spaces.


This patch does a couple of things.
- It splits proc into proc and proc_sysinfo
- It adds pspace support to proc
- It adds getattr methods to ensure proc has the proper hard link count.
- It increases the size of a couple of buffers by one to avoid buffer overflow
- It moves /proc/mounts and /proc/loadavg into the proc filesystem from proc_sysinfo
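
To illustrate the buffer change, here is a small user-space sketch of
the digit rendering done in proc_readfd: digits are written backwards
from the end of the buffer, so a NUL terminator needs one byte beyond
the digit capacity. MODEL_NUMBUF and render_decimal are names invented
for this sketch, and the capacity of 10 assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <string.h>

#define MODEL_NUMBUF 10	/* assumed digit capacity, named after NUMBUF */

/* Render n in decimal at the tail of buf and return a pointer to the
 * first digit.  buf must hold MODEL_NUMBUF + 1 bytes: the extra byte
 * is for the terminating NUL. */
static const char *render_decimal(unsigned int n, char buf[MODEL_NUMBUF + 1])
{
	int j = MODEL_NUMBUF;

	buf[j] = '\0';
	do {
		j--;
		buf[j] = '0' + (n % 10);
		n /= 10;
	} while (n);
	return buf + j;
}
```

With a buffer of only MODEL_NUMBUF bytes the store to buf[j] at the
top would land one byte past the end, which is the overflow the extra
byte avoids.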

Sorry for the big patch. When I start feeding these changes in
seriously I will split this patch.

The split of /proc into multiple filesystems works well; however, it
comes with one downside. There are now some directories where
cd -P <subdir>/.. is not a no-op. Basically it is doing the equivalent
of following symlinks into an internal kernel mount. It is well-defined
and safe behaviour, but I'm not certain whether it is desirable.
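
One place the nesting shows up is /proc/self. When a /proc mounted for
an outer pspace is read from a task in a nested pspace, the link target
in this patch passes through one <wid>/pspace/ component per level
before ending in the task's tgid. The sketch below is a user-space
model of that walk only; struct ns_model, child_toward and self_path
are hypothetical names, and the real code goes through vfs_readlink
and vfs_follow_link instead of building a single string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a nested pid namespace. */
struct ns_model {
	const struct ns_model *parent;	/* enclosing namespace, NULL at the top */
	int leader_wid;			/* leader's pid as seen from the parent */
};

/* Find the namespace directly below 'outer' on the way to 'inner',
 * or NULL if 'inner' is not nested inside 'outer'. */
static const struct ns_model *child_toward(const struct ns_model *outer,
					   const struct ns_model *inner)
{
	const struct ns_model *child = inner;

	while (child && child->parent != outer)
		child = child->parent;
	return child;
}

/* Build the /proc/self target as seen from a /proc that belongs to
 * 'view', for a task with 'tgid' living in 'cur'.  Returns the length
 * written, or -1 if it does not fit or 'cur' is not inside 'view'. */
static int self_path(const struct ns_model *view, const struct ns_model *cur,
		     int tgid, char *buf, size_t len)
{
	size_t used = 0;
	int n;

	while (view != cur) {
		view = child_toward(view, cur);
		if (!view)
			return -1;
		n = snprintf(buf + used, len - used, "%d/pspace/",
			     view->leader_wid);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	n = snprintf(buf + used, len - used, "%d", tgid);
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	return (int)(used + (size_t)n);
}
```

For a task with tgid 42 two levels down, below leaders known to their
parents as 1234 and 7, the model yields 1234/pspace/7/pspace/42 from
the outermost view and plain 42 from inside the task's own pspace.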

Signed-off-by: Eric W. Biederman <[email protected]>


---

fs/Kconfig | 6
fs/proc/Makefile | 22 +-
fs/proc/array.c | 203 ++++++++++++++++
fs/proc/base.c | 487 +++++++++++++++++++++++++++++++++------
fs/proc/generic.c | 28 ++
fs/proc/inode.c | 92 ++-----
fs/proc/internal.h | 38 +--
fs/proc/kcore.c | 2
fs/proc/mmu.c | 4
fs/proc/nommu.c | 4
fs/proc/proc.c | 168 +++++++++++++
fs/proc/proc_devtree.c | 2
fs/proc/proc_misc.c | 27 --
fs/proc/proc_tty.c | 2
fs/proc/root.c | 101 +-------
fs/proc/sysinfo.h | 35 +++
fs/proc/vmcore.c | 2
include/linux/proc_fs.h | 216 -----------------
include/linux/proc_sysinfo_fs.h | 228 ++++++++++++++++++
include/linux/pspace.h | 1
kernel/pid.c | 4
security/selinux/hooks.c | 10 -
22 files changed, 1156 insertions(+), 526 deletions(-)
create mode 100644 fs/proc/proc.c
create mode 100644 fs/proc/sysinfo.h
create mode 100644 include/linux/proc_sysinfo_fs.h

85533ea023d4002528fe0325406160e69c4ca7dd
diff --git a/fs/Kconfig b/fs/Kconfig
index 93b5dc4..f953b29 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -765,6 +765,7 @@ menu "Pseudo filesystems"

config PROC_FS
bool "/proc file system support"
+ select PROC_SYSINFO_FS
help
This is a virtual file system providing information about the status
of the system. "Virtual" means that it doesn't take up any space on
@@ -792,6 +793,11 @@ config PROC_FS
This option will enlarge your kernel by about 67 KB. Several
programs depend on this, so everyone should say Y here.

+config PROC_SYSINFO_FS
+ bool
+ depends on PROC_FS
+ default n
+
config PROC_KCORE
bool "/proc/kcore support" if !ARM
depends on PROC_FS && MMU
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 7431d7b..1ff44cf 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -2,14 +2,20 @@
# Makefile for the Linux proc filesystem routines.
#

-obj-$(CONFIG_PROC_FS) += proc.o
+obj-$(CONFIG_PROC_FS) += procfs.o
+obj-$(CONFIG_PROC_SYSINFO_FS) += proc-sysinfo.o

-proc-y := nommu.o task_nommu.o
-proc-$(CONFIG_MMU) := mmu.o task_mmu.o
+procfs-y := task_nommu.o
+procfs-$(CONFIG_MMU) := task_mmu.o
+procfs-y += proc.o base.o array.o

-proc-y += inode.o root.o base.o generic.o array.o \
- kmsg.o proc_tty.o proc_misc.o

-proc-$(CONFIG_PROC_KCORE) += kcore.o
-proc-$(CONFIG_PROC_VMCORE) += vmcore.o
-proc-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o
+proc-sysinfo-y := nommu.o
+proc-sysinfo-$(CONFIG_MMU) := mmu.o
+
+proc-sysinfo-y += inode.o root.o generic.o kmsg.o proc_tty.o proc_misc.o
+
+proc-sysinfo-$(CONFIG_PROC_KCORE) += kcore.o
+proc-sysinfo-$(CONFIG_PROC_VMCORE) += vmcore.o
+proc-sysinfo-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o
+
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 7eb1bd7..baa7d5e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -75,6 +75,7 @@
#include <linux/times.h>
#include <linux/cpuset.h>
#include <linux/rcupdate.h>
+#include <linux/pspace.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -137,6 +138,8 @@ static const char *task_state_array[] =
"Z (zombie)", /* 16 */
"X (dead)" /* 32 */
};
+static const char task_state_pspace[] =
+ "P (pspace)";

static inline const char * get_task_state(struct task_struct *tsk)
{
@@ -175,8 +178,11 @@ static inline char * task_state(struct t
get_task_state(p),
(p->sleep_avg/1024)*100/(1020000000/1024),
p->tgid,
- p->pid, pid_alive(p) ? p->group_leader->real_parent->tgid : 0,
- pid_alive(p) && p->ptrace ? p->parent->pid : 0,
+ p->pid,
+ pid_alive(p) && p->pspace == p->group_leader->real_parent->pspace ?
+ p->group_leader->real_parent->tgid : 0,
+ pid_alive(p) && p->ptrace && p->pspace == p->parent->pspace ?
+ p->parent->pid : 0,
p->uid, p->euid, p->suid, p->fsuid,
p->gid, p->egid, p->sgid, p->fsgid);
read_unlock(&tasklist_lock);
@@ -202,6 +208,47 @@ static inline char * task_state(struct t
return buffer;
}

+
+static inline char * pspace_task_state(struct task_struct *p, char *buffer)
+{
+ struct group_info *group_info;
+ int g;
+
+ read_lock(&tasklist_lock);
+ buffer += sprintf(buffer,
+ "State:\t%s\n"
+ "Tgid:\t%d\n"
+ "Pid:\t%d\n"
+ "PPid:\t%d\n"
+ "TracerPid:\t%d\n"
+ "Uid:\t%d\t%d\t%d\t%d\n"
+ "Gid:\t%d\t%d\t%d\t%d\n",
+ task_state_pspace,
+ p->wid,
+ p->wid,
+ pid_alive(p)? p->group_leader->real_parent->tgid : 0,
+ pid_alive(p) && p->ptrace ? p->parent->pid : 0,
+ p->uid, p->euid, p->suid, p->fsuid,
+ p->gid, p->egid, p->sgid, p->fsgid);
+ read_unlock(&tasklist_lock);
+ task_lock(p);
+ rcu_read_lock();
+ buffer += sprintf(buffer,
+ "Groups:\t");
+ rcu_read_unlock();
+
+ group_info = p->group_info;
+ get_group_info(group_info);
+ task_unlock(p);
+
+ for (g = 0; g < min(group_info->ngroups,NGROUPS_SMALL); g++)
+ buffer += sprintf(buffer, "%d ", GROUP_AT(group_info,g));
+ put_group_info(group_info);
+
+ buffer += sprintf(buffer, "\n");
+ return buffer;
+}
+
static char * render_sigset_t(const char *header, sigset_t *set, char *buffer)
{
int i, len;
@@ -314,6 +361,15 @@ int proc_pid_status(struct task_struct *
return buffer - orig;
}

+int proc_pspace_status(struct task_struct *task, char * buffer)
+{
+ char *orig = buffer;
+ buffer = task_name(task, buffer);
+ buffer = pspace_task_state(task, buffer);
+ buffer += sprintf(buffer, "Threads:\t%d\n", task->pspace->nr_threads);
+ return buffer - orig;
+}
+
static int do_task_stat(struct task_struct *task, char * buffer, int whole)
{
unsigned long vsize, eip, esp, wchan = ~0UL;
@@ -388,7 +444,9 @@ static int do_task_stat(struct task_stru
}
it_real_value = task->signal->real_timer.expires;
}
- ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
+ ppid = pid_alive(task) &&
+ task->pspace == task->group_leader->real_parent->pspace ?
+ task->group_leader->real_parent->tgid : 0;
read_unlock(&tasklist_lock);

if (!whole || num_threads<2)
@@ -475,6 +533,137 @@ int proc_tgid_stat(struct task_struct *t
return do_task_stat(task, buffer, 1);
}

+int proc_pspace_stat(struct task_struct *task, char * buffer)
+{
+ struct pspace *pspace = task->pspace;
+ long priority, nice;
+ int tty_pgrp = -1, tty_nr = 0;
+ char state;
+ int res;
+ pid_t ppid, pgid = -1, sid = -1;
+ int num_threads = 0;
+ unsigned long long start_time;
+ unsigned long cmin_flt = 0, cmaj_flt = 0;
+ unsigned long min_flt = 0, maj_flt = 0;
+ cputime_t cutime, cstime, utime, stime;
+ struct task_struct *p;
+ char tcomm[sizeof(task->comm)];
+
+ state = *task_state_pspace;
+ get_task_comm(tcomm, task);
+
+ cutime = cstime = utime = stime = cputime_zero;
+
+ /* scale priority and nice values from timeslices to -20..20 */
+ /* to make it look like a "normal" Unix priority/nice value */
+ priority = task_prio(task);
+ nice = task_nice(task);
+
+ /* add up live thread stats at the pspace level */
+ read_lock(&tasklist_lock);
+ for_each_process(p) {
+ long t_priority, t_nice;
+ /* Skip processes outside the target process space */
+ if (!in_pspace(pspace, p))
+ continue;
+
+ t_priority = task_prio(p);
+ t_nice = task_nice(p);
+
+ if (priority > t_priority)
+ priority = t_priority;
+ if (nice > t_nice)
+ nice = t_nice;
+
+ if (p->sighand) {
+ spin_lock_irq(&p->sighand->siglock);
+ min_flt += p->min_flt;
+ maj_flt += p->maj_flt;
+ utime = cputime_add(utime, p->utime);
+ stime = cputime_add(stime, p->stime);
+ spin_unlock_irq(&p->sighand->siglock);
+ }
+ if (thread_group_leader(p) && p->signal) {
+ min_flt += p->signal->min_flt;
+ maj_flt += p->signal->maj_flt;
+ utime = cputime_add(utime, p->signal->utime);
+ stime = cputime_add(stime, p->signal->stime);
+ cmin_flt += p->signal->cmin_flt;
+ cmaj_flt += p->signal->cmaj_flt;
+ cutime += p->signal->cutime;
+ cstime += p->signal->cstime;
+ }
+ if (pspace_leader(p)) {
+ num_threads += p->pspace->nr_threads;
+ }
+ }
+ if (task->signal) {
+ if (task->signal->tty) {
+ tty_pgrp = task->wid;
+ tty_nr = new_encode_dev(tty_devnum(task->signal->tty));
+ }
+ }
+ ppid = pid_alive(task)? task->group_leader->real_parent->tgid : 0;
+ read_unlock(&tasklist_lock);
+
+ /* Temporary variable needed for gcc-2.96 */
+ /* convert timespec -> nsec*/
+ start_time = (unsigned long long)task->start_time.tv_sec * NSEC_PER_SEC
+ + task->start_time.tv_nsec;
+ /* convert nsec -> ticks */
+ start_time = nsec_to_clock_t(start_time);
+
+ res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
+%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+ task->wid,
+ tcomm,
+ state,
+ ppid,
+ pgid,
+ sid,
+ tty_nr,
+ tty_pgrp,
+ 0UL, /* task->flags */
+ min_flt,
+ cmin_flt,
+ maj_flt,
+ cmaj_flt,
+ cputime_to_clock_t(utime),
+ cputime_to_clock_t(stime),
+ cputime_to_clock_t(cutime),
+ cputime_to_clock_t(cstime),
+ priority,
+ nice,
+ num_threads,
+ 0UL /* it_real_value (Not implemented) */,
+ start_time,
+ 0UL /* vsize */,
+ 0L /* mm_counter */,
+ 0UL /* rsslim */,
+ 0UL /* start_code */,
+ 0UL /* end_code */,
+ 0UL /* start_stack */,
+ 0UL /* esp */,
+ 0UL /* eip */,
+ /* The signal information here is obsolete.
+ * It must be decimal for Linux 2.0 compatibility.
+ * Use /proc/#/status for real-time signals.
+ */
+ 0UL /* task->pending.signal.sig[0] & 0x7fffffffUL */,
+ 0UL /* task->blocked.sig[0] & 0x7fffffffUL */,
+ 0UL /* sigign .sig[0] & 0x7fffffffUL */,
+ 0UL /* sigcatch .sig[0] & 0x7fffffffUL */,
+ 0UL /* wchan */,
+ 0UL,
+ 0UL,
+ task->exit_signal,
+ 0 /* task_cpu(task) */,
+ 0UL /* task->rt_priority*/,
+ 0UL /* task->policy*/);
+ return res;
+}
+
int proc_pid_statm(struct task_struct *task, char *buffer)
{
int size = 0, resident = 0, shared = 0, text = 0, lib = 0, data = 0;
@@ -488,3 +677,11 @@ int proc_pid_statm(struct task_struct *t
return sprintf(buffer,"%d %d %d %d %d %d %d\n",
size, resident, shared, text, lib, data, 0);
}
+
+int proc_pspace_statm(struct task_struct *task, char *buffer)
+{
+ int size = 0, resident = 0, shared = 0, text = 0, lib = 0, data = 0;
+
+ return sprintf(buffer,"%d %d %d %d %d %d %d\n",
+ size, resident, shared, text, lib, data, 0);
+}
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 20feb75..66f2ccf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -72,8 +72,12 @@
#include <linux/cpuset.h>
#include <linux/audit.h>
#include <linux/poll.h>
+#include <linux/pspace.h>
+
#include "internal.h"

+static int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir, unsigned int nr);
+
/*
* For hysterical raisins we keep the same inumbers as in the old procfs.
* Feel free to change the macro below - just keep the range distinct from
@@ -85,7 +89,11 @@
#define fake_ino(pid,ino) (((pid)<<16)|(ino))

enum pid_directory_inos {
- PROC_TGID_INO = 2,
+ PROC_SELF = 2,
+ PROC_MOUNTS,
+ PROC_LOADAVG, /* Must come after all of the symlinks */
+
+ PROC_TGID_INO,
PROC_TGID_TASK,
PROC_TGID_STATUS,
PROC_TGID_MEM,
@@ -126,6 +134,7 @@ enum pid_directory_inos {
#endif
PROC_TGID_OOM_SCORE,
PROC_TGID_OOM_ADJUST,
+
PROC_TID_INO,
PROC_TID_STATUS,
PROC_TID_MEM,
@@ -167,6 +176,11 @@ enum pid_directory_inos {
PROC_TID_OOM_SCORE,
PROC_TID_OOM_ADJUST,

+ PROC_PSPACE_STAT,
+ PROC_PSPACE_STATM,
+ PROC_PSPACE_STATUS,
+ PROC_PSPACE_CMDLINE,
+
/* Add new entries before this */
PROC_TID_FD_DIR = 0x8000, /* 0x8000-0xffff */
};
@@ -180,6 +194,13 @@ struct pid_entry {

#define E(type,name,mode) {(type),sizeof(name)-1,(name),(mode)}

+static struct pid_entry proc_base_stuff[] = {
+ E(PROC_SELF, "self", S_IFLNK|S_IRWXUGO),
+ E(PROC_MOUNTS, "mounts", S_IFLNK|S_IRWXUGO),
+ E(PROC_LOADAVG, "loadavg", S_IFREG|S_IRUGO),
+ {0,0,NULL,0}
+};
+
static struct pid_entry tgid_base_stuff[] = {
E(PROC_TGID_TASK, "task", S_IFDIR|S_IRUGO|S_IXUGO),
E(PROC_TGID_FD, "fd", S_IFDIR|S_IRUSR|S_IXUSR),
@@ -266,6 +287,15 @@ static struct pid_entry tid_base_stuff[]
{0,0,NULL,0}
};

+static struct pid_entry pspace_base_stuff[] = {
+ E(PROC_ROOT_INO, "pspace", S_IFDIR|S_IRUGO|S_IXUGO),
+ E(PROC_PSPACE_STAT, "stat", S_IFREG|S_IRUGO),
+ E(PROC_PSPACE_STATM, "statm", S_IFREG|S_IRUGO),
+ E(PROC_PSPACE_STATUS, "status", S_IFREG|S_IRUGO),
+ E(PROC_PSPACE_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
+ {0,0,NULL,0}
+};
+
#ifdef CONFIG_SECURITY
static struct pid_entry tgid_attr_stuff[] = {
E(PROC_TGID_ATTR_CURRENT, "current", S_IFREG|S_IRUGO|S_IWUGO),
@@ -1067,6 +1097,118 @@ static struct file_operations proc_secco
};
#endif /* CONFIG_SECCOMP */

+/*
+ * /proc/self:
+ */
+struct pspace *child_pspace(struct pspace *pspace, struct task_struct *tsk)
+{
+ struct pspace *child;
+ child = tsk->pspace;
+ while(child && (child->child_reaper.pspace != pspace)) {
+ child = child->child_reaper.pspace;
+ }
+ return child;
+}
+
+static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
+ int buflen)
+{
+ struct pspace *pspace = proc_pspace(dentry->d_inode);
+ char tmp[30];
+ int result, len = 0;
+ while(buflen && pspace && (pspace != current->pspace)) {
+ pspace = child_pspace(pspace, current);
+ sprintf(tmp, "%d/pspace/", pspace->child_reaper.task->wid);
+ result = vfs_readlink(dentry, buffer, buflen, tmp);
+ if (result < 0)
+ goto out;
+ len += result;
+ buffer += result;
+ buflen -= result;
+ }
+ sprintf(tmp, "%d", current->tgid);
+ result = vfs_readlink(dentry, buffer, buflen, tmp);
+ if (result < 0)
+ goto out;
+ len += result;
+ result = len;
+ out:
+ return result;
+}
+
+static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ struct pspace *pspace = proc_pspace(dentry->d_inode);
+ char tmp[30];
+ int result;
+ while(pspace && (pspace != current->pspace)) {
+ pspace = child_pspace(pspace, current);
+ sprintf(tmp, "%d/pspace/", pspace->child_reaper.task->wid);
+ result = vfs_follow_link(nd, tmp);
+ if (result < 0)
+ goto out;
+ }
+ sprintf(tmp, "%d", current->tgid);
+ result = vfs_follow_link(nd,tmp);
+ out:
+ return ERR_PTR(result);
+}
+
+static struct inode_operations proc_self_inode_operations = {
+ .readlink = proc_self_readlink,
+ .follow_link = proc_self_follow_link,
+};
+
+static int self_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+ d_drop(dentry);
+ return 0;
+}
+
+static int self_delete_dentry(struct dentry * dentry)
+{
+ return 1;
+}
+
+static struct dentry_operations self_dentry_operations =
+{
+ .d_revalidate = self_revalidate,
+ .d_delete = self_delete_dentry,
+};
+
+static void *proc_mounts_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ static const char *mounts = "self/mounts";
+ nd_set_link(nd, (char *)mounts);
+ return NULL;
+}
+
+static struct inode_operations proc_mounts_inode_operations = {
+ .readlink = generic_readlink,
+ .follow_link = proc_mounts_follow_link,
+};
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
+static int proc_loadavg(struct task_struct *task, char * buffer)
+{
+ struct pspace *pspace = task->pspace;
+ int a, b, c;
+ int len;
+
+ a = avenrun[0] + (FIXED_1/200);
+ b = avenrun[1] + (FIXED_1/200);
+ c = avenrun[2] + (FIXED_1/200);
+ len = sprintf(buffer,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n",
+ LOAD_INT(a), LOAD_FRAC(a),
+ LOAD_INT(b), LOAD_FRAC(b),
+ LOAD_INT(c), LOAD_FRAC(c),
+ nr_running(), nr_threads, pspace? pspace->last_pid : 0);
+
+ return len;
+}
+
static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
{
struct inode *inode = dentry->d_inode;
@@ -1149,11 +1291,12 @@ static struct inode_operations proc_pid_

static int proc_readfd(struct file * filp, void * dirent, filldir_t filldir)
{
- struct inode *inode = filp->f_dentry->d_inode;
+ struct dentry *dentry = filp->f_dentry;
+ struct inode *inode = dentry->d_inode;
struct task_struct *p = proc_task(inode);
unsigned int fd, tid, ino;
int retval;
- char buf[NUMBUF];
+ char buf[NUMBUF + 1];
struct files_struct * files;
struct fdtable *fdt;

@@ -1170,7 +1313,7 @@ static int proc_readfd(struct file * fil
goto out;
filp->f_pos++;
case 1:
- ino = fake_ino(tid, PROC_TID_INO);
+ ino = parent_ino(dentry);
if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
goto out;
filp->f_pos++;
@@ -1189,8 +1332,9 @@ static int proc_readfd(struct file * fil
continue;
rcu_read_unlock();

- j = NUMBUF;
i = fd;
+ j = NUMBUF;
+ buf[j] = '\0';
do {
j--;
buf[j] = '0' + (i % 10);
@@ -1219,16 +1363,17 @@ static int proc_pident_readdir(struct fi
int pid;
struct dentry *dentry = filp->f_dentry;
struct inode *inode = dentry->d_inode;
+ struct task_struct *task = proc_task(inode);
struct pid_entry *p;
ino_t ino;
int ret;

ret = -ENOENT;
- if (!pid_alive(proc_task(inode)))
+ if (!pid_alive(task))
goto out;

ret = 0;
- pid = proc_task(inode)->pid;
+ pid = task->pid;
i = filp->f_pos;
switch (i) {
case 0:
@@ -1280,6 +1425,111 @@ static int proc_tid_base_readdir(struct
tid_base_stuff,ARRAY_SIZE(tid_base_stuff));
}

+static int proc_pspace_base_readdir(struct file * filp,
+ void * dirent, filldir_t filldir)
+{
+ return proc_pident_readdir(filp,dirent,filldir,
+ pspace_base_stuff,ARRAY_SIZE(pspace_base_stuff));
+}
+
+/* pspace root operations */
+static int proc_pspace_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+ struct inode *inode = dentry->d_inode;
+ struct task_struct *p = proc_task(inode);
+ generic_fillattr(inode, stat);
+ /*
+ * nr_processes is actually protected by the tasklist_lock;
+ * however, it's conventional to do reads, especially for
+ * reporting, without any locking whatsoever.
+ */
+ stat->nlink = proc_root.nlink;
+ if (pid_alive(p))
+ stat->nlink += p->pspace->nr_processes;
+ return 0;
+}
+
+static struct dentry *proc_pspace_lookup(struct inode * dir, struct dentry * dentry, struct nameidata *nd)
+{
+ struct nameidata sysinfo_nd;
+ int err;
+ memset(&sysinfo_nd, 0, sizeof(sysinfo_nd));
+ sysinfo_nd.mnt = mntget(proc_sysinfo_mnt);
+ sysinfo_nd.dentry = dget(proc_sysinfo_mnt->mnt_root);
+ err = link_path_walk(dentry->d_name.name, &sysinfo_nd);
+ if (err == 0) {
+ mntput(sysinfo_nd.mnt);
+ return sysinfo_nd.dentry;
+ }
+ return proc_pid_lookup(dir, dentry, nd);
+}
+
+static int proc_pspace_open_dir(struct inode *inode, struct file *f)
+{
+ struct vfsmount *mnt = proc_sysinfo_mnt;
+ struct file *filp;
+ int err;
+
+ err = 0;
+ mntget(mnt);
+ dget(mnt->mnt_root);
+ filp = dentry_open(mnt->mnt_root, mnt, f->f_flags);
+ if (IS_ERR(filp))
+ err = PTR_ERR(filp);
+ else
+ f->private_data = filp;
+ return err;
+}
+
+static int proc_pspace_release_dir(struct inode *inode, struct file *f)
+{
+ struct file *filp;
+ filp = f->private_data;
+ f->private_data = NULL;
+ return filp_close(filp, NULL);
+}
+
+static int proc_pspace_readdir(struct file * f,
+ void * dirent, filldir_t filldir)
+{
+ struct file *filp;
+
+ filp = f->private_data;
+ if (f->f_pos < FIRST_PROCESS_ENTRY) {
+ int res;
+ filp->f_pos = f->f_pos;
+ /* Don't pick up . or .. */
+ if (filp->f_pos < 2)
+ filp->f_pos = 2;
+ res = vfs_readdir(filp, filldir, dirent);
+ f->f_pos = filp->f_pos;
+ if (res <= 0)
+ return res;
+ f->f_pos = FIRST_PROCESS_ENTRY;
+ }
+ return proc_pid_readdir(f, dirent, filldir, f->f_pos - FIRST_PROCESS_ENTRY);
+}
+
+/*
+ * The root /proc directory is special, as it has the
+ * <pid> directories. Thus we don't use the generic
+ * directory handling functions for that..
+ */
+static struct file_operations proc_pspace_operations = {
+ .read = generic_read_dir,
+ .readdir = proc_pspace_readdir,
+ .open = proc_pspace_open_dir,
+ .release = proc_pspace_release_dir,
+};
+
+/*
+ * proc root can do almost nothing..
+ */
+static struct inode_operations proc_pspace_inode_operations = {
+ .lookup = proc_pspace_lookup,
+ .getattr = proc_pspace_getattr,
+};
+
/* building an inode */

static int task_dumpable(struct task_struct *task)
@@ -1336,11 +1586,23 @@ out:
return inode;

out_unlock:
- ei->pde = NULL;
iput(inode);
return NULL;
}

+struct inode *proc_pspace_make_inode(struct super_block *sb, struct pspace *pspace)
+{
+ struct inode *inode;
+ inode = proc_pid_make_inode(sb, pspace->child_reaper.task, PROC_ROOT_INO);
+ if (inode) {
+ inode->i_op = &proc_pspace_inode_operations;
+ inode->i_fop = &proc_pspace_operations;
+ inode->i_nlink = proc_root.nlink;
+ inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
+ }
+ return inode;
+}
+
/* dentry stuff */

/*
@@ -1527,6 +1789,18 @@ static struct file_operations proc_task_
/*
* proc directories can do almost nothing..
*/
+static int proc_task_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+ struct inode *inode = dentry->d_inode;
+ struct task_struct *p = proc_task(inode);
+ generic_fillattr(inode, stat);
+
+ stat->nlink = inode->i_nlink;
+ if (pid_alive(p) && p->signal)
+ stat->nlink += atomic_read(&p->signal->count);
+ return 0;
+}
+
static struct inode_operations proc_fd_inode_operations = {
.lookup = proc_lookupfd,
.permission = proc_permission,
@@ -1535,6 +1809,7 @@ static struct inode_operations proc_fd_i
static struct inode_operations proc_task_inode_operations = {
.lookup = proc_task_lookup,
.permission = proc_task_permission,
+ .getattr = proc_task_getattr,
};

#ifdef CONFIG_SECURITY
@@ -1600,7 +1875,7 @@ static struct file_operations proc_tgid_
static struct inode_operations proc_tgid_attr_inode_operations;
#endif

-static int get_tid_list(int index, unsigned int *tids, struct inode *dir);
+static int get_tid_list(struct pspace *pspace, int index, unsigned int *tids, struct inode *dir);

/* SMP-safe */
static struct dentry *proc_pident_lookup(struct inode *dir,
@@ -1609,6 +1884,7 @@ static struct dentry *proc_pident_lookup
{
struct inode *inode;
int error;
+ struct pspace *pspace = proc_pspace(dir);
struct task_struct *task = proc_task(dir);
struct pid_entry *p;
struct proc_inode *ei;
@@ -1628,6 +1904,10 @@ static struct dentry *proc_pident_lookup
if (!p->name)
goto out;

+ /* Don't set up /proc/self symlinks if this isn't the reader's pspace. */
+ if ((p->type < PROC_LOADAVG) && !in_pspace(pspace, current))
+ goto out;
+
error = -EINVAL;
inode = proc_pid_make_inode(dir->i_sb, task, p->type);
if (!inode)
@@ -1635,13 +1915,26 @@ static struct dentry *proc_pident_lookup

ei = PROC_I(inode);
inode->i_mode = p->mode;
+ dentry->d_op = &pid_dentry_operations;
/*
* Yes, it does not scale. And it should not. Don't add
* new entries into /proc/<tgid>/ without very good reasons.
*/
switch(p->type) {
+ case PROC_SELF:
+ inode->i_op = &proc_self_inode_operations;
+ dentry->d_op = &self_dentry_operations;
+ break;
+ case PROC_MOUNTS:
+ inode->i_op = &proc_mounts_inode_operations;
+ dentry->d_op = &self_dentry_operations;
+ break;
+ case PROC_LOADAVG:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_loadavg;
+ break;
case PROC_TGID_TASK:
- inode->i_nlink = 2 + get_tid_list(2, NULL, dir);
+ inode->i_nlink = 2;
inode->i_op = &proc_task_inode_operations;
inode->i_fop = &proc_task_operations;
break;
@@ -1681,6 +1974,10 @@ static struct dentry *proc_pident_lookup
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_status;
break;
+ case PROC_PSPACE_STATUS:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pspace_status;
+ break;
case PROC_TID_STAT:
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_tid_stat;
@@ -1689,8 +1986,13 @@ static struct dentry *proc_pident_lookup
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_tgid_stat;
break;
+ case PROC_PSPACE_STAT:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pspace_stat;
+ break;
case PROC_TID_CMDLINE:
case PROC_TGID_CMDLINE:
+ case PROC_PSPACE_CMDLINE:
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_cmdline;
break;
@@ -1699,6 +2001,10 @@ static struct dentry *proc_pident_lookup
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_statm;
break;
+ case PROC_PSPACE_STATM:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pspace_statm;
+ break;
case PROC_TID_MAPS:
case PROC_TGID_MAPS:
inode->i_fop = &proc_maps_operations;
@@ -1792,7 +2098,6 @@ static struct dentry *proc_pident_lookup
iput(inode);
return ERR_PTR(-EINVAL);
}
- dentry->d_op = &pid_dentry_operations;
d_add(dentry, inode);
return NULL;

@@ -1808,6 +2113,19 @@ static struct dentry *proc_tid_base_look
return proc_pident_lookup(dir, dentry, tid_base_stuff);
}

+static struct dentry *proc_pspace_base_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
+{
+ int error;
+ error = -ENOENT;
+ if (dentry->d_name.len == 6 && memcmp(dentry->d_name.name, "pspace", 6) == 0) {
+ struct vfsmount *mnt;
+ mnt = proc_pspace_get_mnt(proc_pspace(dir));
+ if (!IS_ERR(mnt))
+ return dget(mnt->mnt_root);
+ }
+ return proc_pident_lookup(dir, dentry, pspace_base_stuff);
+}
+
static struct file_operations proc_tgid_base_operations = {
.read = generic_read_dir,
.readdir = proc_tgid_base_readdir,
@@ -1818,6 +2136,11 @@ static struct file_operations proc_tid_b
.readdir = proc_tid_base_readdir,
};

+static struct file_operations proc_pspace_base_operations = {
+ .read = generic_read_dir,
+ .readdir = proc_pspace_base_readdir,
+};
+
static struct inode_operations proc_tgid_base_inode_operations = {
.lookup = proc_tgid_base_lookup,
};
@@ -1826,6 +2149,10 @@ static struct inode_operations proc_tid_
.lookup = proc_tid_base_lookup,
};

+static struct inode_operations proc_pspace_base_inode_operations = {
+ .lookup = proc_pspace_base_lookup,
+};
+
#ifdef CONFIG_SECURITY
static int proc_tgid_attr_readdir(struct file * filp,
void * dirent, filldir_t filldir)
@@ -1872,29 +2199,6 @@ static struct inode_operations proc_tid_
};
#endif

-/*
- * /proc/self:
- */
-static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
- int buflen)
-{
- char tmp[30];
- sprintf(tmp, "%d", current->tgid);
- return vfs_readlink(dentry,buffer,buflen,tmp);
-}
-
-static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
-{
- char tmp[30];
- sprintf(tmp, "%d", current->tgid);
- return ERR_PTR(vfs_follow_link(nd,tmp));
-}
-
-static struct inode_operations proc_self_inode_operations = {
- .readlink = proc_self_readlink,
- .follow_link = proc_self_follow_link,
-};
-
/**
* proc_pid_unhash - Unhash /proc/@pid entry from the dcache.
* @p: task that should be flushed.
@@ -1952,33 +2256,23 @@ void proc_pid_flush(struct dentry *proc_
/* SMP-safe */
struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
{
+ struct pspace *pspace = proc_pspace(nd->dentry->d_inode);
+ struct dentry *result;
struct task_struct *task;
struct inode *inode;
- struct proc_inode *ei;
unsigned tgid;
int died;

- if (dentry->d_name.len == 4 && !memcmp(dentry->d_name.name,"self",4)) {
- inode = new_inode(dir->i_sb);
- if (!inode)
- return ERR_PTR(-ENOMEM);
- ei = PROC_I(inode);
- inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
- inode->i_ino = fake_ino(0, PROC_TGID_INO);
- ei->pde = NULL;
- inode->i_mode = S_IFLNK|S_IRWXUGO;
- inode->i_uid = inode->i_gid = 0;
- inode->i_size = 64;
- inode->i_op = &proc_self_inode_operations;
- d_add(dentry, inode);
- return NULL;
- }
+ result = proc_pident_lookup(dir, dentry, proc_base_stuff);
+ if (!IS_ERR(result) || (PTR_ERR(result) != -ENOENT))
+ return result;
+
tgid = name_to_int(dentry);
if (tgid == ~0U)
goto out;

read_lock(&tasklist_lock);
- task = find_task_by_pid(tgid);
+ task = find_task_by_pid(pspace, tgid);
if (task)
get_task_struct(task);
read_unlock(&tasklist_lock);
@@ -1993,14 +2287,20 @@ struct dentry *proc_pid_lookup(struct in
goto out;
}
inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
- inode->i_op = &proc_tgid_base_inode_operations;
- inode->i_fop = &proc_tgid_base_operations;
inode->i_flags|=S_IMMUTABLE;
+ if (unlikely(pspace_leader(task) && (task->pspace != pspace))) {
+ inode->i_op = &proc_pspace_base_inode_operations;
+ inode->i_fop = &proc_pspace_base_operations;
+ inode->i_nlink = 3;
+ } else {
+ inode->i_op = &proc_tgid_base_inode_operations;
+ inode->i_fop = &proc_tgid_base_operations;
#ifdef CONFIG_SECURITY
- inode->i_nlink = 5;
+ inode->i_nlink = 5;
#else
- inode->i_nlink = 4;
+ inode->i_nlink = 4;
#endif
+ }

dentry->d_op = &pid_base_dentry_operations;

@@ -2027,6 +2327,7 @@ out:
/* SMP-safe */
static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
{
+ struct pspace *pspace = proc_pspace(dir);
struct task_struct *task;
struct task_struct *leader = proc_task(dir);
struct inode *inode;
@@ -2037,7 +2338,7 @@ static struct dentry *proc_task_lookup(s
goto out;

read_lock(&tasklist_lock);
- task = find_task_by_pid(tid);
+ task = find_task_by_pid(pspace, tid);
if (task)
get_task_struct(task);
read_unlock(&tasklist_lock);
@@ -2081,16 +2382,15 @@ out:
* tasklist lock while doing this, and we must release it before
* we actually do the filldir itself, so we use a temp buffer..
*/
-static int get_tgid_list(int index, unsigned long version, unsigned int *tgids)
+static int get_tgid_list(struct pspace *pspace, int index, unsigned long version, unsigned int *tgids)
{
struct task_struct *p;
int nr_tgids = 0;

- index--;
read_lock(&tasklist_lock);
p = NULL;
if (version) {
- p = find_task_by_pid(version);
+ p = find_task_by_pid(pspace, version);
if (p && !thread_group_leader(p))
p = NULL;
}
@@ -2101,12 +2401,13 @@ static int get_tgid_list(int index, unsi
p = next_task(&init_task);

for ( ; p != &init_task; p = next_task(p)) {
- int tgid = p->pid;
if (!pid_alive(p))
continue;
+ if (!pspace_task_visible(pspace, p))
+ continue;
if (--index >= 0)
continue;
- tgids[nr_tgids] = tgid;
+ tgids[nr_tgids] = (p->pspace == pspace) ? p->pid : p->wid;
nr_tgids++;
if (nr_tgids >= PROC_MAXPIDS)
break;
@@ -2120,7 +2421,7 @@ static int get_tgid_list(int index, unsi
* tasklist lock while doing this, and we must release it before
* we actually do the filldir itself, so we use a temp buffer..
*/
-static int get_tid_list(int index, unsigned int *tids, struct inode *dir)
+static int get_tid_list(struct pspace *pspace, int index, unsigned int *tids, struct inode *dir)
{
struct task_struct *leader_task = proc_task(dir);
struct task_struct *task = leader_task;
@@ -2133,7 +2434,7 @@ static int get_tid_list(int index, unsig
* unlinked task, which cannot be used to access the task-list
* via next_thread().
*/
- if (pid_alive(task)) do {
+ if (pid_alive(task) && pspace_task_visible(pspace, task)) do {
int tid = task->pid;

if (--index >= 0)
@@ -2149,20 +2450,47 @@ static int get_tid_list(int index, unsig
}

/* for the /proc/ directory itself, after non-process stuff has been done */
-int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
+static int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir, unsigned int nr)
{
+ struct dentry *dentry = filp->f_dentry;
+ struct inode *inode = dentry->d_inode;
+ struct task_struct *task = proc_task(inode);
+ struct pspace *pspace = task->pspace;
unsigned int tgid_array[PROC_MAXPIDS];
- char buf[PROC_NUMBUF];
- unsigned int nr = filp->f_pos - FIRST_PROCESS_ENTRY;
+ char buf[PROC_NUMBUF + 1];
unsigned int nr_tgids, i;
+ ino_t ino;
int next_tgid;

- if (!nr) {
- ino_t ino = fake_ino(0,PROC_TGID_INO);
- if (filldir(dirent, "self", 4, filp->f_pos, ino, DT_LNK) < 0)
- return 0;
+ if (!pid_alive(task))
+ goto out;
+
+ switch (nr) {
+ case 0:
+ ino = inode->i_ino;
+ if (filldir(dirent, ".", 1, filp->f_pos, ino, DT_DIR) < 0)
+ goto out;
+ filp->f_pos++;
+ nr++;
+ /* fall through */
+ case 1:
+ ino = parent_ino(dentry);
+ if (filldir(dirent, "..", 2, filp->f_pos, ino, DT_DIR) < 0)
+ goto out;
filp->f_pos++;
nr++;
+ /* fall through */
+ }
+ nr -= 2;
+ for(; nr < (ARRAY_SIZE(proc_base_stuff) - 1); filp->f_pos++, nr++) {
+ struct pid_entry *p = proc_base_stuff + nr;
+ /* Don't return /proc/self symlinks if this isn't the reader's pspace. */
+ if ((p->type < PROC_LOADAVG) && !in_pspace(pspace, current))
+ continue;
+ ino = fake_ino(1, p->type);
+ if (filldir(dirent, p->name, p->len, filp->f_pos,
+ ino, p->mode >> 12) < 0)
+ goto out;
}

/* f_version caches the tgid value that the last readdir call couldn't
@@ -2170,8 +2498,9 @@ int proc_pid_readdir(struct file * filp,
*/
next_tgid = filp->f_version;
filp->f_version = 0;
+ nr -= (ARRAY_SIZE(proc_base_stuff) - 1);
for (;;) {
- nr_tgids = get_tgid_list(nr, next_tgid, tgid_array);
+ nr_tgids = get_tgid_list(pspace, nr, next_tgid, tgid_array);
if (!nr_tgids) {
/* no more entries ! */
break;
@@ -2186,9 +2515,11 @@ int proc_pid_readdir(struct file * filp,

for (i=0;i<nr_tgids;i++) {
int tgid = tgid_array[i];
- ino_t ino = fake_ino(tgid,PROC_TGID_INO);
- unsigned long j = PROC_NUMBUF;
+ unsigned long j;

+ ino = fake_ino(tgid, PROC_TGID_INO);
+ j = PROC_NUMBUF;
+ buf[j] = '\0';
do
buf[--j] = '0' + (tgid % 10);
while ((tgid /= 10) != 0);
@@ -2211,15 +2542,17 @@ out:
static int proc_task_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
unsigned int tid_array[PROC_MAXPIDS];
- char buf[PROC_NUMBUF];
+ char buf[PROC_NUMBUF + 1];
unsigned int nr_tids, i;
struct dentry *dentry = filp->f_dentry;
struct inode *inode = dentry->d_inode;
+ struct task_struct *task = proc_task(inode);
+ struct pspace *pspace = task->pspace;
int retval = -ENOENT;
ino_t ino;
unsigned long pos = filp->f_pos; /* avoiding "long long" filp->f_pos */

- if (!pid_alive(proc_task(inode)))
+ if (!pid_alive(task))
goto out;
retval = 0;

@@ -2238,8 +2571,7 @@ static int proc_task_readdir(struct file
/* fall through */
}

- nr_tids = get_tid_list(pos, tid_array, inode);
- inode->i_nlink = pos + nr_tids;
+ nr_tids = get_tid_list(pspace, pos, tid_array, inode);

for (i = 0; i < nr_tids; i++) {
unsigned long j = PROC_NUMBUF;
@@ -2247,6 +2579,7 @@ static int proc_task_readdir(struct file

ino = fake_ino(tid,PROC_TID_INO);

+ buf[j] = '\0';
do
buf[--j] = '0' + (tid % 10);
while ((tid /= 10) != 0);
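
The two readdir hunks above grow buf to PROC_NUMBUF + 1 bytes and store a '\0' in the last slot before the digits are written, presumably so the result can also be handled as an ordinary C string. A minimal userspace sketch of that pattern follows. NUMBUF and format_id() are hypothetical names used only for illustration, and the value 10 is an assumption chosen so that a 32-bit unsigned value fits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUMBUF 10	/* hypothetical stand-in for PROC_NUMBUF */

/*
 * Write the decimal form of id into the tail of buf, which must hold
 * NUMBUF + 1 bytes, and return a pointer to the first digit. Digits are
 * produced least-significant first, so the buffer is filled from the end.
 */
static const char *format_id(unsigned int id, char buf[NUMBUF + 1])
{
	size_t j = NUMBUF;

	assert(buf != NULL);
	buf[j] = '\0';		/* the extra byte holds the terminator */
	do
		buf[--j] = '0' + (id % 10);
	while ((id /= 10) != 0);

	return &buf[j];
}
```

The returned pointer plays the role of buf + j in the patch, and NUMBUF - j is the name length that would be handed to filldir.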
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 20e5c45..9befe4b 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -10,7 +10,7 @@

#include <linux/errno.h>
#include <linux/time.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/mount.h>
@@ -21,7 +21,7 @@
#include <linux/bitops.h>
#include <asm/uaccess.h>

-#include "internal.h"
+#include "sysinfo.h"

static ssize_t proc_file_read(struct file *file, char __user *buf,
size_t nbytes, loff_t *ppos);
@@ -254,7 +254,7 @@ static int proc_getattr(struct vfsmount
struct kstat *stat)
{
struct inode *inode = dentry->d_inode;
- struct proc_dir_entry *de = PROC_I(inode)->pde;
+ struct proc_dir_entry *de = PDE(inode);
if (de && de->nlink)
inode->i_nlink = de->nlink;

@@ -373,7 +373,7 @@ static struct dentry_operations proc_den
* Don't create negative dentries here, return -ENOENT by hand
* instead.
*/
-struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd)
+static struct dentry *proc_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd)
{
struct inode *inode = NULL;
struct proc_dir_entry * de;
@@ -389,7 +389,7 @@ struct dentry *proc_lookup(struct inode
unsigned int ino = de->low_ino;

error = -EINVAL;
- inode = proc_get_inode(dir->i_sb, ino, de);
+ inode = proc_sysinfo_get_inode(dir->i_sb, ino, de);
break;
}
}
@@ -413,7 +413,7 @@ struct dentry *proc_lookup(struct inode
* value of the readdir() call, as long as it's non-negative
* for success..
*/
-int proc_readdir(struct file * filp,
+static int proc_readdir(struct file * filp,
void * dirent, filldir_t filldir)
{
struct proc_dir_entry * de;
@@ -492,6 +492,20 @@ static struct inode_operations proc_dir_
.setattr = proc_notify_change,
};

+/*
+ * This is the root "inode" in the /proc tree..
+ */
+struct proc_dir_entry proc_root = {
+ .low_ino = PROC_SYSINFO_ROOT_INO,
+ .namelen = 5,
+ .name = "/proc",
+ .mode = S_IFDIR | S_IRUGO | S_IXUGO,
+ .nlink = 2,
+ .proc_iops = &proc_dir_inode_operations,
+ .proc_fops = &proc_dir_operations,
+ .parent = &proc_root,
+};
+
static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp)
{
unsigned int i;
@@ -527,7 +541,7 @@ static int proc_register(struct proc_dir
static void proc_kill_inodes(struct proc_dir_entry *de)
{
struct list_head *p;
- struct super_block *sb = proc_mnt->mnt_sb;
+ struct super_block *sb = proc_sysinfo_mnt->mnt_sb;

/*
* Actually it's a partial revoke().
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 6573f31..cf15c39 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -5,7 +5,7 @@
*/

#include <linux/time.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/string.h>
@@ -15,11 +15,12 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp_lock.h>
+#include <linux/pspace.h>

#include <asm/system.h>
#include <asm/uaccess.h>

-#include "internal.h"
+#include "sysinfo.h"

static inline struct proc_dir_entry * de_get(struct proc_dir_entry *de)
{
@@ -55,20 +56,14 @@ static void de_put(struct proc_dir_entry
/*
* Decrement the use count of the proc_dir_entry.
*/
-static void proc_delete_inode(struct inode *inode)
+static void proc_sysinfo_delete_inode(struct inode *inode)
{
struct proc_dir_entry *de;
- struct task_struct *tsk;

truncate_inode_pages(&inode->i_data, 0);

- /* Let go of any associated process */
- tsk = PROC_I(inode)->task;
- if (tsk)
- put_task_struct(tsk);
-
/* Let go of any associated proc directory entry */
- de = PROC_I(inode)->pde;
+ de = PDE(inode);
if (de) {
if (de->owner)
module_put(de->owner);
@@ -77,74 +72,29 @@ static void proc_delete_inode(struct ino
clear_inode(inode);
}

-struct vfsmount *proc_mnt;
-
-static void proc_read_inode(struct inode * inode)
-{
- inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
-}
-
-static kmem_cache_t * proc_inode_cachep;
+struct vfsmount *proc_sysinfo_mnt;

-static struct inode *proc_alloc_inode(struct super_block *sb)
+static void proc_sysinfo_read_inode(struct inode * inode)
{
- struct proc_inode *ei;
- struct inode *inode;
-
- ei = (struct proc_inode *)kmem_cache_alloc(proc_inode_cachep, SLAB_KERNEL);
- if (!ei)
- return NULL;
- ei->task = NULL;
- ei->type = 0;
- ei->op.proc_get_link = NULL;
- ei->pde = NULL;
- inode = &ei->vfs_inode;
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
- return inode;
}

-static void proc_destroy_inode(struct inode *inode)
-{
- kmem_cache_free(proc_inode_cachep, PROC_I(inode));
-}

-static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
-{
- struct proc_inode *ei = (struct proc_inode *) foo;
-
- if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
- SLAB_CTOR_CONSTRUCTOR)
- inode_init_once(&ei->vfs_inode);
-}
-
-int __init proc_init_inodecache(void)
-{
- proc_inode_cachep = kmem_cache_create("proc_inode_cache",
- sizeof(struct proc_inode),
- 0, SLAB_RECLAIM_ACCOUNT,
- init_once, NULL);
- if (proc_inode_cachep == NULL)
- return -ENOMEM;
- return 0;
-}
-
-static int proc_remount(struct super_block *sb, int *flags, char *data)
+static int proc_sysinfo_remount(struct super_block *sb, int *flags, char *data)
{
*flags |= MS_NODIRATIME;
return 0;
}

-static struct super_operations proc_sops = {
- .alloc_inode = proc_alloc_inode,
- .destroy_inode = proc_destroy_inode,
- .read_inode = proc_read_inode,
+static struct super_operations proc_sysinfo_sops = {
+ .read_inode = proc_sysinfo_read_inode,
.drop_inode = generic_delete_inode,
- .delete_inode = proc_delete_inode,
+ .delete_inode = proc_sysinfo_delete_inode,
.statfs = simple_statfs,
- .remount_fs = proc_remount,
+ .remount_fs = proc_sysinfo_remount,
};

-struct inode *proc_get_inode(struct super_block *sb, unsigned int ino,
+struct inode *proc_sysinfo_get_inode(struct super_block *sb, unsigned int ino,
struct proc_dir_entry *de)
{
struct inode * inode;
@@ -163,7 +113,7 @@ struct inode *proc_get_inode(struct supe
if (!inode)
goto out_ino;

- PROC_I(inode)->pde = de;
+ inode->u.generic_ip = de;
if (de) {
if (de->mode) {
inode->i_mode = de->mode;
@@ -190,24 +140,20 @@ out_mod:
return NULL;
}

-int proc_fill_super(struct super_block *s, void *data, int silent)
+int proc_sysinfo_fill_super(struct super_block *s, void *data, int silent)
{
struct inode * root_inode;

s->s_flags |= MS_NODIRATIME;
s->s_blocksize = 1024;
s->s_blocksize_bits = 10;
- s->s_magic = PROC_SUPER_MAGIC;
- s->s_op = &proc_sops;
+ s->s_magic = PROC_SYSINFO_SUPER_MAGIC;
+ s->s_op = &proc_sysinfo_sops;
s->s_time_gran = 1;

- root_inode = proc_get_inode(s, PROC_ROOT_INO, &proc_root);
+ root_inode = proc_sysinfo_get_inode(s, PROC_SYSINFO_ROOT_INO, &proc_root);
if (!root_inode)
goto out_no_root;
- /*
- * Fixup the root inode's nlink value
- */
- root_inode->i_nlink += nr_processes();
root_inode->i_uid = 0;
root_inode->i_gid = 0;
s->s_root = d_alloc_root(root_inode);
@@ -216,7 +162,7 @@ int proc_fill_super(struct super_block *
return 0;

out_no_root:
- printk("proc_read_super: get root inode failed\n");
+ printk("%s: get root inode failed\n", __func__);
iput(root_inode);
return -ENOMEM;
}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 95a1cf3..0a30e23 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -11,35 +11,22 @@

#include <linux/proc_fs.h>

-struct vmalloc_info {
- unsigned long used;
- unsigned long largest_chunk;
-};
-
-#ifdef CONFIG_MMU
-#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
-extern void get_vmalloc_info(struct vmalloc_info *vmi);
-#else
-
-#define VMALLOC_TOTAL 0UL
-#define get_vmalloc_info(vmi) \
-do { \
- (vmi)->used = 0; \
- (vmi)->largest_chunk = 0; \
-} while(0)
+/*
+ * Offset of the first process in the /proc root directory..
+ */
+#define FIRST_PROCESS_ENTRY 256

-#endif
-
-extern void create_seq_entry(char *name, mode_t mode, struct file_operations *f);
extern int proc_exe_link(struct inode *, struct dentry **, struct vfsmount **);
extern int proc_tid_stat(struct task_struct *, char *);
extern int proc_tgid_stat(struct task_struct *, char *);
+extern int proc_pspace_stat(struct task_struct *, char *);
extern int proc_pid_status(struct task_struct *, char *);
+extern int proc_pspace_status(struct task_struct *, char *);
extern int proc_pid_statm(struct task_struct *, char *);
+extern int proc_pspace_statm(struct task_struct *, char *);
+extern struct inode *proc_pspace_make_inode(struct super_block *sb, struct pspace *pspace);
+extern struct vfsmount *proc_pspace_get_mnt(struct pspace *pspace);

-void free_proc_entry(struct proc_dir_entry *de);
-
-int proc_init_inodecache(void);

static inline struct task_struct *proc_task(struct inode *inode)
{
@@ -50,3 +37,10 @@ static inline int proc_type(struct inode
{
return PROC_I(inode)->type;
}
+
+static inline struct pspace *proc_pspace(struct inode *inode)
+{
+ struct task_struct *task = proc_task(inode);
+ return task ? task->pspace : NULL;
+}
+
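
For readers following the base.c changes, the id that get_tgid_list() reports depends on who is looking: a task seen from its own pspace is listed by pid, while a task seen from another pspace is listed by wid. The sketch below models only that selection in userspace. The struct and function names are hypothetical and are not the kernel's types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; only pointer identity matters here. */
struct model_pspace {
	int unused;
};

struct model_task {
	const struct model_pspace *pspace;	/* namespace the task lives in */
	int pid;	/* id inside the task's own namespace */
	int wid;	/* id the parent namespace knows the task by */
};

/* Pick the id a reader in the viewer namespace should see for task t. */
static int visible_id(const struct model_pspace *viewer,
		      const struct model_task *t)
{
	return (t->pspace == viewer) ? t->pid : t->wid;
}
```

With a namespace leader whose pid is 1 and whose wid is 4321, a reader inside the child namespace gets 1 and a reader in the parent namespace gets 4321.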
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index adc2cd9..fa24c71 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -11,7 +11,7 @@

#include <linux/config.h>
#include <linux/mm.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/user.h>
#include <linux/a.out.h>
#include <linux/capability.h>
diff --git a/fs/proc/mmu.c b/fs/proc/mmu.c
index 25d2d9c..887eb5e 100644
--- a/fs/proc/mmu.c
+++ b/fs/proc/mmu.c
@@ -15,7 +15,7 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/mman.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/pagemap.h>
@@ -29,7 +29,7 @@
#include <asm/pgtable.h>
#include <asm/tlb.h>
#include <asm/div64.h>
-#include "internal.h"
+#include "sysinfo.h"

void get_vmalloc_info(struct vmalloc_info *vmi)
{
diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c
index cff10ab..65c66ef 100644
--- a/fs/proc/nommu.c
+++ b/fs/proc/nommu.c
@@ -16,7 +16,7 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/mman.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/pagemap.h>
@@ -30,7 +30,7 @@
#include <asm/pgtable.h>
#include <asm/tlb.h>
#include <asm/div64.h>
-#include "internal.h"
+#include "sysinfo.h"

/*
* display a list of all the VMAs the kernel knows about
diff --git a/fs/proc/proc.c b/fs/proc/proc.c
new file mode 100644
index 0000000..0501975
--- /dev/null
+++ b/fs/proc/proc.c
@@ -0,0 +1,168 @@
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pspace.h>
+
+#include <linux/proc_fs.h>
+#include "internal.h"
+/*
+ * Release the task.
+ */
+static void proc_delete_inode(struct inode *inode)
+{
+ struct task_struct *tsk;
+
+ truncate_inode_pages(&inode->i_data, 0);
+
+ /* Let go of any associated process */
+ tsk = PROC_I(inode)->task;
+ if (tsk)
+ put_task_struct(tsk);
+
+ clear_inode(inode);
+}
+
+static void proc_read_inode(struct inode * inode)
+{
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+}
+
+static kmem_cache_t * proc_inode_cachep;
+
+static struct inode *proc_alloc_inode(struct super_block *sb)
+{
+ struct proc_inode *ei;
+ struct inode *inode;
+
+ ei = (struct proc_inode *)kmem_cache_alloc(proc_inode_cachep, SLAB_KERNEL);
+ if (!ei)
+ return NULL;
+ ei->task = NULL;
+ ei->type = 0;
+ ei->op.proc_get_link = NULL;
+ inode = &ei->vfs_inode;
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ return inode;
+}
+
+static void proc_destroy_inode(struct inode *inode)
+{
+ kmem_cache_free(proc_inode_cachep, PROC_I(inode));
+}
+
+static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
+{
+ struct proc_inode *ei = (struct proc_inode *) foo;
+
+ if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
+ SLAB_CTOR_CONSTRUCTOR)
+ inode_init_once(&ei->vfs_inode);
+}
+
+static int __init proc_init_inodecache(void)
+{
+ proc_inode_cachep = kmem_cache_create("proc_inode_cache",
+ sizeof(struct proc_inode),
+ 0, SLAB_RECLAIM_ACCOUNT,
+ init_once, NULL);
+ if (proc_inode_cachep == NULL)
+ return -ENOMEM;
+ return 0;
+}
+
+static int proc_remount(struct super_block *sb, int *flags, char *data)
+{
+ *flags |= MS_NODIRATIME;
+ return 0;
+}
+
+static struct super_operations proc_sops = {
+ .alloc_inode = proc_alloc_inode,
+ .destroy_inode = proc_destroy_inode,
+ .read_inode = proc_read_inode,
+ .drop_inode = generic_delete_inode,
+ .delete_inode = proc_delete_inode,
+ .statfs = simple_statfs,
+ .remount_fs = proc_remount,
+};
+
+static int proc_fill_super(struct super_block *s, void *data, int silent)
+{
+ struct inode * root_inode;
+ struct pspace *pspace = data;
+
+ s->s_flags |= MS_NODIRATIME;
+ s->s_blocksize = 1024;
+ s->s_blocksize_bits = 10;
+ s->s_magic = PROC_SUPER_MAGIC;
+ s->s_op = &proc_sops;
+ s->s_time_gran = 1;
+
+ root_inode = proc_pspace_make_inode(s, pspace);
+ if (!root_inode)
+ goto out_no_root;
+ s->s_root = d_alloc_root(root_inode);
+ if (!s->s_root)
+ goto out_no_root;
+ return 0;
+
+out_no_root:
+ printk("%s: get root inode failed\n", __func__);
+ iput(root_inode);
+ return -ENOMEM;
+}
+
+static int proc_test_sb(struct super_block *s, void *data)
+{
+ struct dentry *dentry = s->s_root;
+ struct inode *inode = dentry->d_inode;
+ struct task_struct *task = PROC_I(inode)->task;
+
+ return task->pspace == data;
+}
+
+static int proc_set_sb(struct super_block *s, void *data)
+{
+ int err;
+ err = set_anon_super(s, NULL);
+ if (!err)
+ err = proc_fill_super(s, data, 0);
+ return err;
+}
+
+static struct super_block *proc_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ return sget(fs_type, proc_test_sb, proc_set_sb, current->pspace);
+}
+
+static struct file_system_type proc_fs_type = {
+ .name = "proc",
+ .get_sb = proc_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+struct vfsmount *proc_pspace_get_mnt(struct pspace *pspace)
+{
+ struct vfsmount *mnt;
+ mnt = pspace->proc_mnt;
+ if (!mnt)
+ mnt = kern_mount(&proc_fs_type);
+ if (!IS_ERR(mnt))
+ pspace->proc_mnt = mnt;
+ return mnt;
+}
+
+static int __init proc_init(void)
+{
+ int err = 0;
+ err = proc_init_inodecache();
+ if (err)
+ goto out;
+ err = register_filesystem(&proc_fs_type);
+ out:
+ return err;
+}
+
+module_init(proc_init);
+MODULE_LICENSE("GPL");
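
The new fs/proc/proc.c asks sget() for a superblock whose root task belongs to the caller's pspace, so each pspace ends up with exactly one proc superblock. A rough userspace model of that find-or-create behaviour is sketched below. It is not the kernel's sget(): fake_sb, test_sb() and get_sb_for() are hypothetical names, and locking and teardown are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a superblock keyed by its pspace. */
struct fake_sb {
	const void *key;	/* models the pspace the superblock serves */
	struct fake_sb *next;
};

static struct fake_sb *sb_list;

/* Models proc_test_sb(): does this superblock belong to key? */
static int test_sb(const struct fake_sb *sb, const void *key)
{
	return sb->key == key;
}

/* Return the existing entry for key, or create and register a new one. */
static struct fake_sb *get_sb_for(const void *key)
{
	struct fake_sb *sb;

	for (sb = sb_list; sb; sb = sb->next)
		if (test_sb(sb, key))
			return sb;

	sb = malloc(sizeof(*sb));
	if (!sb)
		return NULL;
	sb->key = key;		/* models the work done in proc_set_sb() */
	sb->next = sb_list;
	sb_list = sb;
	return sb;
}
```

Two lookups with the same key return the same entry, while a different key yields a new one, which is the property the test callback is there to provide.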
diff --git a/fs/proc/proc_devtree.c b/fs/proc/proc_devtree.c
index 9bdd077..8b749d6 100644
--- a/fs/proc/proc_devtree.c
+++ b/fs/proc/proc_devtree.c
@@ -5,7 +5,7 @@
*/
#include <linux/errno.h>
#include <linux/time.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/stat.h>
#include <linux/string.h>
#include <asm/prom.h>
diff --git a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c
index 8f80142..5037ed6 100644
--- a/fs/proc/proc_misc.c
+++ b/fs/proc/proc_misc.c
@@ -24,7 +24,7 @@
#include <linux/tty.h>
#include <linux/string.h>
#include <linux/mman.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/ioport.h>
#include <linux/config.h>
#include <linux/mm.h>
@@ -46,15 +46,14 @@
#include <linux/sysrq.h>
#include <linux/vmalloc.h>
#include <linux/crash_dump.h>
+#include <linux/pspace.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/io.h>
#include <asm/tlb.h>
#include <asm/div64.h>
-#include "internal.h"
+#include "sysinfo.h"

-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
/*
* Warning: stuff below (imported functions) assumes that its output will fit
* into one page. For some of those functions it may be wrong. Moreover, we
@@ -79,23 +78,6 @@ static int proc_calc_metrics(char *page,
return len;
}

-static int loadavg_read_proc(char *page, char **start, off_t off,
- int count, int *eof, void *data)
-{
- int a, b, c;
- int len;
-
- a = avenrun[0] + (FIXED_1/200);
- b = avenrun[1] + (FIXED_1/200);
- c = avenrun[2] + (FIXED_1/200);
- len = sprintf(page,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n",
- LOAD_INT(a), LOAD_FRAC(a),
- LOAD_INT(b), LOAD_FRAC(b),
- LOAD_INT(c), LOAD_FRAC(c),
- nr_running(), nr_threads, last_pid);
- return proc_calc_metrics(page, start, off, count, eof, len);
-}
-
static int uptime_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
@@ -712,7 +694,6 @@ void __init proc_misc_init(void)
char *name;
int (*read_proc)(char*,char**,off_t,int,int*,void*);
} *p, simple_ones[] = {
- {"loadavg", loadavg_read_proc},
{"uptime", uptime_read_proc},
{"meminfo", meminfo_read_proc},
{"version", version_read_proc},
@@ -731,8 +712,6 @@ void __init proc_misc_init(void)
for (p = simple_ones; p->name; p++)
create_proc_read_entry(p->name, 0, NULL, p->read_proc, NULL);

- proc_symlink("mounts", NULL, "self/mounts");
-
/* And now for trickier ones */
entry = create_proc_entry("kmsg", S_IRUSR, &proc_root);
if (entry)
diff --git a/fs/proc/proc_tty.c b/fs/proc/proc_tty.c
index 15c4455..9715c1f 100644
--- a/fs/proc/proc_tty.c
+++ b/fs/proc/proc_tty.c
@@ -9,7 +9,7 @@
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/time.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/stat.h>
#include <linux/tty.h>
#include <linux/seq_file.h>
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6889628..6e0f451 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -10,15 +10,15 @@

#include <linux/errno.h>
#include <linux/time.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/stat.h>
#include <linux/config.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/smp_lock.h>
-
-#include "internal.h"
+#include <linux/mount.h>
+#include "sysinfo.h"

struct proc_dir_entry *proc_net, *proc_net_stat, *proc_bus, *proc_root_fs, *proc_root_driver;

@@ -26,30 +26,29 @@ struct proc_dir_entry *proc_net, *proc_n
struct proc_dir_entry *proc_sys_root;
#endif

-static struct super_block *proc_get_sb(struct file_system_type *fs_type,
+
+static struct super_block *proc_sysinfo_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
- return get_sb_single(fs_type, flags, data, proc_fill_super);
+ return get_sb_single(fs_type, flags, data, proc_sysinfo_fill_super);
}

-static struct file_system_type proc_fs_type = {
- .name = "proc",
- .get_sb = proc_get_sb,
+static struct file_system_type proc_sysinfo_fs_type = {
+ .name = "proc_sysinfo",
+ .get_sb = proc_sysinfo_get_sb,
.kill_sb = kill_anon_super,
};

void __init proc_root_init(void)
{
- int err = proc_init_inodecache();
- if (err)
- return;
- err = register_filesystem(&proc_fs_type);
+ int err;
+ err = register_filesystem(&proc_sysinfo_fs_type);
if (err)
return;
- proc_mnt = kern_mount(&proc_fs_type);
- err = PTR_ERR(proc_mnt);
- if (IS_ERR(proc_mnt)) {
- unregister_filesystem(&proc_fs_type);
+ proc_sysinfo_mnt = kern_mount(&proc_sysinfo_fs_type);
+ err = PTR_ERR(proc_sysinfo_mnt);
+ if (IS_ERR(proc_sysinfo_mnt)) {
+ unregister_filesystem(&proc_sysinfo_fs_type);
return;
}
proc_misc_init();
@@ -80,76 +79,6 @@ void __init proc_root_init(void)
proc_bus = proc_mkdir("bus", NULL);
}

-static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, struct nameidata *nd)
-{
- /*
- * nr_threads is actually protected by the tasklist_lock;
- * however, it's conventional to do reads, especially for
- * reporting, without any locking whatsoever.
- */
- if (dir->i_ino == PROC_ROOT_INO) /* check for safety... */
- dir->i_nlink = proc_root.nlink + nr_threads;
-
- if (!proc_lookup(dir, dentry, nd)) {
- return NULL;
- }
-
- return proc_pid_lookup(dir, dentry, nd);
-}
-
-static int proc_root_readdir(struct file * filp,
- void * dirent, filldir_t filldir)
-{
- unsigned int nr = filp->f_pos;
- int ret;
-
- lock_kernel();
-
- if (nr < FIRST_PROCESS_ENTRY) {
- int error = proc_readdir(filp, dirent, filldir);
- if (error <= 0) {
- unlock_kernel();
- return error;
- }
- filp->f_pos = FIRST_PROCESS_ENTRY;
- }
- unlock_kernel();
-
- ret = proc_pid_readdir(filp, dirent, filldir);
- return ret;
-}
-
-/*
- * The root /proc directory is special, as it has the
- * <pid> directories. Thus we don't use the generic
- * directory handling functions for that..
- */
-static struct file_operations proc_root_operations = {
- .read = generic_read_dir,
- .readdir = proc_root_readdir,
-};
-
-/*
- * proc root can do almost nothing..
- */
-static struct inode_operations proc_root_inode_operations = {
- .lookup = proc_root_lookup,
-};
-
-/*
- * This is the root "inode" in the /proc tree..
- */
-struct proc_dir_entry proc_root = {
- .low_ino = PROC_ROOT_INO,
- .namelen = 5,
- .name = "/proc",
- .mode = S_IFDIR | S_IRUGO | S_IXUGO,
- .nlink = 2,
- .proc_iops = &proc_root_inode_operations,
- .proc_fops = &proc_root_operations,
- .parent = &proc_root,
-};
-
EXPORT_SYMBOL(proc_symlink);
EXPORT_SYMBOL(proc_mkdir);
EXPORT_SYMBOL(create_proc_entry);
diff --git a/fs/proc/sysinfo.h b/fs/proc/sysinfo.h
new file mode 100644
index 0000000..86cec89
--- /dev/null
+++ b/fs/proc/sysinfo.h
@@ -0,0 +1,35 @@
+/* sysinfo.h: internal proc_sysinfo_fs definitions
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/proc_sysinfo_fs.h>
+
+struct vmalloc_info {
+ unsigned long used;
+ unsigned long largest_chunk;
+};
+
+#ifdef CONFIG_MMU
+#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
+extern void get_vmalloc_info(struct vmalloc_info *vmi);
+#else
+
+#define VMALLOC_TOTAL 0UL
+#define get_vmalloc_info(vmi) \
+do { \
+ (vmi)->used = 0; \
+ (vmi)->largest_chunk = 0; \
+} while(0)
+
+#endif
+
+extern void create_seq_entry(char *name, mode_t mode, struct file_operations *f);
+
+void free_proc_entry(struct proc_dir_entry *de);
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 4063fb3..fc192ae 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -9,7 +9,7 @@

#include <linux/config.h>
#include <linux/mm.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/user.h>
#include <linux/a.out.h>
#include <linux/elf.h>
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index aa6322d..185dac9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -11,12 +11,6 @@
*/

/*
- * Offset of the first process in the /proc root directory..
- */
-#define FIRST_PROCESS_ENTRY 256
-
-
-/*
* We always define these enumerators
*/

@@ -26,225 +20,24 @@ enum {

#define PROC_SUPER_MAGIC 0x9fa0

-/*
- * This is not completely implemented yet. The idea is to
- * create an in-memory tree (like the actual /proc filesystem
- * tree) of these proc_dir_entries, so that we can dynamically
- * add new files to /proc.
- *
- * The "next" pointer creates a linked list of one /proc directory,
- * while parent/subdir create the directory structure (every
- * /proc file has a parent, but "subdir" is NULL for all
- * non-directory entries).
- *
- * "get_info" is called at "read", while "owner" is used to protect module
- * from unloading while proc_dir_entry is in use
- */
-
-typedef int (read_proc_t)(char *page, char **start, off_t off,
- int count, int *eof, void *data);
-typedef int (write_proc_t)(struct file *file, const char __user *buffer,
- unsigned long count, void *data);
-typedef int (get_info_t)(char *, char **, off_t, int);
-
-struct proc_dir_entry {
- unsigned int low_ino;
- unsigned short namelen;
- const char *name;
- mode_t mode;
- nlink_t nlink;
- uid_t uid;
- gid_t gid;
- unsigned long size;
- struct inode_operations * proc_iops;
- struct file_operations * proc_fops;
- get_info_t *get_info;
- struct module *owner;
- struct proc_dir_entry *next, *parent, *subdir;
- void *data;
- read_proc_t *read_proc;
- write_proc_t *write_proc;
- atomic_t count; /* use count */
- int deleted; /* delete flag */
- void *set;
-};
-
-struct kcore_list {
- struct kcore_list *next;
- unsigned long addr;
- size_t size;
-};
-
-struct vmcore {
- struct list_head list;
- unsigned long long paddr;
- unsigned long size;
- loff_t offset;
-};
-
#ifdef CONFIG_PROC_FS

-extern struct proc_dir_entry proc_root;
-extern struct proc_dir_entry *proc_root_fs;
-extern struct proc_dir_entry *proc_net;
-extern struct proc_dir_entry *proc_net_stat;
-extern struct proc_dir_entry *proc_bus;
-extern struct proc_dir_entry *proc_root_driver;
-extern struct proc_dir_entry *proc_root_kcore;
-
-extern void proc_root_init(void);
-extern void proc_misc_init(void);
-
struct mm_struct;

struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *);
struct dentry *proc_pid_unhash(struct task_struct *p);
void proc_pid_flush(struct dentry *proc_dentry);
-int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
unsigned long task_vsize(struct mm_struct *);
int task_statm(struct mm_struct *, int *, int *, int *, int *);
char *task_mem(struct mm_struct *, char *);

-extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
- struct proc_dir_entry *parent);
-extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
-
-extern struct vfsmount *proc_mnt;
-extern int proc_fill_super(struct super_block *,void *,int);
-extern struct inode *proc_get_inode(struct super_block *, unsigned int, struct proc_dir_entry *);
-
-extern int proc_match(int, const char *,struct proc_dir_entry *);
-
-/*
- * These are generic /proc routines that use the internal
- * "struct proc_dir_entry" tree to traverse the filesystem.
- *
- * The /proc root directory has extended versions to take care
- * of the /proc/<pid> subdirectories.
- */
-extern int proc_readdir(struct file *, void *, filldir_t);
-extern struct dentry *proc_lookup(struct inode *, struct dentry *, struct nameidata *);
-
-extern struct file_operations proc_kcore_operations;
-extern struct file_operations proc_kmsg_operations;
-extern struct file_operations ppc_htab_operations;
-
-/*
- * proc_tty.c
- */
-struct tty_driver;
-extern void proc_tty_init(void);
-extern void proc_tty_register_driver(struct tty_driver *driver);
-extern void proc_tty_unregister_driver(struct tty_driver *driver);
-
-/*
- * proc_devtree.c
- */
-#ifdef CONFIG_PROC_DEVICETREE
-struct device_node;
-struct property;
-extern void proc_device_tree_init(void);
-extern void proc_device_tree_add_node(struct device_node *, struct proc_dir_entry *);
-extern void proc_device_tree_add_prop(struct proc_dir_entry *pde, struct property *prop);
-extern void proc_device_tree_remove_prop(struct proc_dir_entry *pde,
- struct property *prop);
-extern void proc_device_tree_update_prop(struct proc_dir_entry *pde,
- struct property *newprop,
- struct property *oldprop);
-#endif /* CONFIG_PROC_DEVICETREE */
-
-extern struct proc_dir_entry *proc_symlink(const char *,
- struct proc_dir_entry *, const char *);
-extern struct proc_dir_entry *proc_mkdir(const char *,struct proc_dir_entry *);
-extern struct proc_dir_entry *proc_mkdir_mode(const char *name, mode_t mode,
- struct proc_dir_entry *parent);
-
-static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
- mode_t mode, struct proc_dir_entry *base,
- read_proc_t *read_proc, void * data)
-{
- struct proc_dir_entry *res=create_proc_entry(name,mode,base);
- if (res) {
- res->read_proc=read_proc;
- res->data=data;
- }
- return res;
-}
-
-static inline struct proc_dir_entry *create_proc_info_entry(const char *name,
- mode_t mode, struct proc_dir_entry *base, get_info_t *get_info)
-{
- struct proc_dir_entry *res=create_proc_entry(name,mode,base);
- if (res) res->get_info=get_info;
- return res;
-}
-
-static inline struct proc_dir_entry *proc_net_create(const char *name,
- mode_t mode, get_info_t *get_info)
-{
- return create_proc_info_entry(name,mode,proc_net,get_info);
-}
-
-static inline struct proc_dir_entry *proc_net_fops_create(const char *name,
- mode_t mode, struct file_operations *fops)
-{
- struct proc_dir_entry *res = create_proc_entry(name, mode, proc_net);
- if (res)
- res->proc_fops = fops;
- return res;
-}
-
-static inline void proc_net_remove(const char *name)
-{
- remove_proc_entry(name,proc_net);
-}
-
-#else
-
-#define proc_root_driver NULL
-#define proc_net NULL
-#define proc_bus NULL
-
-#define proc_net_fops_create(name, mode, fops) ({ (void)(mode), NULL; })
-#define proc_net_create(name, mode, info) ({ (void)(mode), NULL; })
-static inline void proc_net_remove(const char *name) {}
+#else /* CONFIG_PROC_FS */

static inline struct dentry *proc_pid_unhash(struct task_struct *p) { return NULL; }
static inline void proc_pid_flush(struct dentry *proc_dentry) { }

-static inline struct proc_dir_entry *create_proc_entry(const char *name,
- mode_t mode, struct proc_dir_entry *parent) { return NULL; }
-
-#define remove_proc_entry(name, parent) do {} while (0)
-
-static inline struct proc_dir_entry *proc_symlink(const char *name,
- struct proc_dir_entry *parent,const char *dest) {return NULL;}
-static inline struct proc_dir_entry *proc_mkdir(const char *name,
- struct proc_dir_entry *parent) {return NULL;}
-
-static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
- mode_t mode, struct proc_dir_entry *base,
- read_proc_t *read_proc, void * data) { return NULL; }
-static inline struct proc_dir_entry *create_proc_info_entry(const char *name,
- mode_t mode, struct proc_dir_entry *base, get_info_t *get_info)
- { return NULL; }
-
-struct tty_driver;
-static inline void proc_tty_register_driver(struct tty_driver *driver) {};
-static inline void proc_tty_unregister_driver(struct tty_driver *driver) {};
-
-extern struct proc_dir_entry proc_root;
-
#endif /* CONFIG_PROC_FS */

-#if !defined(CONFIG_PROC_KCORE)
-static inline void kclist_add(struct kcore_list *new, void *addr, size_t size)
-{
-}
-#else
-extern void kclist_add(struct kcore_list *, void *, size_t);
-#endif
-
struct proc_inode {
struct task_struct *task;
int type;
@@ -252,8 +45,8 @@ struct proc_inode {
int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
int (*proc_read)(struct task_struct *task, char *page);
} op;
- struct proc_dir_entry *pde;
struct inode vfs_inode;
+
};

static inline struct proc_inode *PROC_I(const struct inode *inode)
@@ -261,9 +54,6 @@ static inline struct proc_inode *PROC_I(
return container_of(inode, struct proc_inode, vfs_inode);
}

-static inline struct proc_dir_entry *PDE(const struct inode *inode)
-{
- return PROC_I(inode)->pde;
-}
+#include <linux/proc_sysinfo_fs.h>

#endif /* _LINUX_PROC_FS_H */
diff --git a/include/linux/proc_sysinfo_fs.h b/include/linux/proc_sysinfo_fs.h
new file mode 100644
index 0000000..cbd8e1d
--- /dev/null
+++ b/include/linux/proc_sysinfo_fs.h
@@ -0,0 +1,228 @@
+#ifndef _LINUX_PROC_SYSINFO_FS_H
+#define _LINUX_PROC_SYSINFO_FS_H
+
+#include <linux/config.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <asm/atomic.h>
+
+/*
+ * We always define these enumerators
+ */
+
+enum {
+ PROC_SYSINFO_ROOT_INO = 1,
+};
+
+#define PROC_SYSINFO_SUPER_MAGIC 0x9fa1
+
+struct kcore_list {
+ struct kcore_list *next;
+ unsigned long addr;
+ size_t size;
+};
+
+struct vmcore {
+ struct list_head list;
+ unsigned long long paddr;
+ unsigned long size;
+ loff_t offset;
+};
+
+#ifdef CONFIG_PROC_SYSINFO_FS
+
+/*
+ * This is not completely implemented yet. The idea is to
+ * create an in-memory tree (like the actual /proc filesystem
+ * tree) of these proc_dir_entries, so that we can dynamically
+ * add new files to /proc.
+ *
+ * The "next" pointer creates a linked list of one /proc directory,
+ * while parent/subdir create the directory structure (every
+ * /proc file has a parent, but "subdir" is NULL for all
+ * non-directory entries).
+ *
+ * "get_info" is called at "read", while "owner" is used to protect module
+ * from unloading while proc_dir_entry is in use
+ */
+
+typedef int (read_proc_t)(char *page, char **start, off_t off,
+ int count, int *eof, void *data);
+typedef int (write_proc_t)(struct file *file, const char __user *buffer,
+ unsigned long count, void *data);
+typedef int (get_info_t)(char *, char **, off_t, int);
+
+struct proc_dir_entry {
+ unsigned int low_ino;
+ unsigned short namelen;
+ const char *name;
+ mode_t mode;
+ nlink_t nlink;
+ uid_t uid;
+ gid_t gid;
+ unsigned long size;
+ struct inode_operations * proc_iops;
+ struct file_operations * proc_fops;
+ get_info_t *get_info;
+ struct module *owner;
+ struct proc_dir_entry *next, *parent, *subdir;
+ void *data;
+ read_proc_t *read_proc;
+ write_proc_t *write_proc;
+ atomic_t count; /* use count */
+ int deleted; /* delete flag */
+ void *set;
+};
+
+extern struct proc_dir_entry proc_root;
+extern struct proc_dir_entry *proc_root_fs;
+extern struct proc_dir_entry *proc_net;
+extern struct proc_dir_entry *proc_net_stat;
+extern struct proc_dir_entry *proc_bus;
+extern struct proc_dir_entry *proc_root_driver;
+extern struct proc_dir_entry *proc_root_kcore;
+
+extern void proc_root_init(void);
+extern void proc_misc_init(void);
+
+extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
+ struct proc_dir_entry *parent);
+extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
+
+extern struct vfsmount *proc_sysinfo_mnt;
+
+extern int proc_sysinfo_fill_super(struct super_block *,void *,int);
+extern struct inode *proc_sysinfo_get_inode(struct super_block *, unsigned int, struct proc_dir_entry *);
+
+extern int proc_match(int, const char *,struct proc_dir_entry *);
+
+/*
+ * These are generic /proc routines that use the internal
+ * "struct proc_dir_entry" tree to traverse the filesystem.
+ *
+ */
+
+extern struct file_operations proc_kcore_operations;
+extern struct file_operations proc_kmsg_operations;
+extern struct file_operations ppc_htab_operations;
+
+/*
+ * proc_tty.c
+ */
+struct tty_driver;
+extern void proc_tty_init(void);
+extern void proc_tty_register_driver(struct tty_driver *driver);
+extern void proc_tty_unregister_driver(struct tty_driver *driver);
+
+/*
+ * proc_devtree.c
+ */
+#ifdef CONFIG_PROC_DEVICETREE
+struct device_node;
+struct property;
+extern void proc_device_tree_init(void);
+extern void proc_device_tree_add_node(struct device_node *, struct proc_dir_entry *);
+extern void proc_device_tree_add_prop(struct proc_dir_entry *pde, struct property *prop);
+extern void proc_device_tree_remove_prop(struct proc_dir_entry *pde,
+ struct property *prop);
+extern void proc_device_tree_update_prop(struct proc_dir_entry *pde,
+ struct property *newprop,
+ struct property *oldprop);
+#endif /* CONFIG_PROC_DEVICETREE */
+
+extern struct proc_dir_entry *proc_symlink(const char *,
+ struct proc_dir_entry *, const char *);
+extern struct proc_dir_entry *proc_mkdir(const char *,struct proc_dir_entry *);
+extern struct proc_dir_entry *proc_mkdir_mode(const char *name, mode_t mode,
+ struct proc_dir_entry *parent);
+
+static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
+ mode_t mode, struct proc_dir_entry *base,
+ read_proc_t *read_proc, void * data)
+{
+ struct proc_dir_entry *res=create_proc_entry(name,mode,base);
+ if (res) {
+ res->read_proc=read_proc;
+ res->data=data;
+ }
+ return res;
+}
+
+static inline struct proc_dir_entry *create_proc_info_entry(const char *name,
+ mode_t mode, struct proc_dir_entry *base, get_info_t *get_info)
+{
+ struct proc_dir_entry *res=create_proc_entry(name,mode,base);
+ if (res) res->get_info=get_info;
+ return res;
+}
+
+static inline struct proc_dir_entry *proc_net_create(const char *name,
+ mode_t mode, get_info_t *get_info)
+{
+ return create_proc_info_entry(name,mode,proc_net,get_info);
+}
+
+static inline struct proc_dir_entry *proc_net_fops_create(const char *name,
+ mode_t mode, struct file_operations *fops)
+{
+ struct proc_dir_entry *res = create_proc_entry(name, mode, proc_net);
+ if (res)
+ res->proc_fops = fops;
+ return res;
+}
+
+static inline void proc_net_remove(const char *name)
+{
+ remove_proc_entry(name,proc_net);
+}
+
+#else /* CONFIG_PROC_SYSINFO_FS */
+
+#define proc_root_driver NULL
+#define proc_net NULL
+#define proc_bus NULL
+
+#define proc_net_fops_create(name, mode, fops) ({ (void)(mode), NULL; })
+#define proc_net_create(name, mode, info) ({ (void)(mode), NULL; })
+static inline void proc_net_remove(const char *name) {}
+
+
+static inline struct proc_dir_entry *create_proc_entry(const char *name,
+ mode_t mode, struct proc_dir_entry *parent) { return NULL; }
+
+#define remove_proc_entry(name, parent) do {} while (0)
+
+static inline struct proc_dir_entry *proc_symlink(const char *name,
+ struct proc_dir_entry *parent,const char *dest) {return NULL;}
+static inline struct proc_dir_entry *proc_mkdir(const char *name,
+ struct proc_dir_entry *parent) {return NULL;}
+
+static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
+ mode_t mode, struct proc_dir_entry *base,
+ read_proc_t *read_proc, void * data) { return NULL; }
+static inline struct proc_dir_entry *create_proc_info_entry(const char *name,
+ mode_t mode, struct proc_dir_entry *base, get_info_t *get_info)
+ { return NULL; }
+
+struct tty_driver;
+static inline void proc_tty_register_driver(struct tty_driver *driver) {};
+static inline void proc_tty_unregister_driver(struct tty_driver *driver) {};
+
+extern struct proc_dir_entry proc_root;
+
+#endif /* CONFIG_PROC_SYSINFO_FS */
+
+#if !defined(CONFIG_PROC_KCORE)
+static inline void kclist_add(struct kcore_list *new, void *addr, size_t size)
+{
+}
+#else
+extern void kclist_add(struct kcore_list *, void *, size_t);
+#endif
+
+static inline struct proc_dir_entry *PDE(const struct inode *inode)
+{
+ return inode->u.generic_ip;
+}
+
+#endif /* _LINUX_PROC_SYSINFO_FS_H */
diff --git a/include/linux/pspace.h b/include/linux/pspace.h
index 950393a..3be247f 100644
--- a/include/linux/pspace.h
+++ b/include/linux/pspace.h
@@ -24,6 +24,7 @@ struct pspace
int last_pid;
int min;
int max;
+ struct vfsmount *proc_mnt;
struct pidmap pidmap[PIDMAP_ENTRIES];
char name[1]; /* For use in debugging print statements */
};
diff --git a/kernel/pid.c b/kernel/pid.c
index 221585d..8732176 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -22,6 +22,7 @@

#include <linux/mm.h>
#include <linux/module.h>
+#include <linux/mount.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/bootmem.h>
@@ -249,6 +250,7 @@ static struct pspace *new_pspace(struct
pspace->nr_threads = 0;
pspace->nr_processes = 0;
pspace->last_pid = 0;
+ pspace->proc_mnt = NULL;
pspace->min = RESERVED_PIDS;
pspace->max = PID_MAX_DEFAULT;
for (i = 0; i < PIDMAP_ENTRIES; i++) {
@@ -303,6 +305,8 @@ void __put_pspace(struct pspace *pspace)

pspace->child_reaper.pspace->nr_processes--;
detach_any_pid(&pspace->child_reaper, PIDTYPE_PID);
+ if (pspace->proc_mnt)
+ mntput(pspace->proc_mnt);
parent = pspace->child_reaper.pspace;
map = pspace->pidmap;
for (i = 0; i < PIDMAP_ENTRIES; i++) {
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4ae834d..f0e148b 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -41,7 +41,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/ext2_fs.h>
-#include <linux/proc_fs.h>
+#include <linux/proc_sysinfo_fs.h>
#include <linux/kd.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_ipv6.h>
@@ -530,7 +530,7 @@ static int superblock_doinit(struct supe
}
}

- if (strcmp(sb->s_type->name, "proc") == 0)
+ if (strcmp(sb->s_type->name, "proc_sysinfo") == 0)
sbsec->proc = 1;

sbsec->initialized = 1;
@@ -700,7 +700,7 @@ static int selinux_proc_get_sid(struct p
path = end;
de = de->parent;
}
- rc = security_genfs_sid("proc", path, tclass, sid);
+ rc = security_genfs_sid("proc_sysinfo", path, tclass, sid);
free_page((unsigned long)buffer);
return rc;
}
@@ -849,8 +849,8 @@ static int inode_doinit_with_dentry(stru
isec->sid = sbsec->sid;

if (sbsec->proc) {
- struct proc_inode *proci = PROC_I(inode);
- if (proci->pde) {
+ struct proc_dir_entry *pde = PDE(inode);
+ if (pde) {
isec->sclass = inode_mode_to_security_class(inode->i_mode);
- rc = selinux_proc_get_sid(proci->pde,
+ rc = selinux_proc_get_sid(pde,
isec->sclass,
--
1.1.5.g3480
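
Editorial illustration, not part of the patch: the pspace.h and pid.c hunks above give each pspace an optional /proc mount pointer that starts out NULL and is dropped with mntput() when the pspace is released. The userspace sketch below models only that lifetime rule. Every model_* name is hypothetical, the plain integer refcount stands in for the kernel's mount reference counting, and the attach step is an assumption about where the reference would be taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct model_vfsmount {
	int refcount;
};

struct model_pspace {
	struct model_vfsmount *proc_mnt;	/* NULL until a proc mount is pinned */
};

/* Mirrors the new_pspace() hunk: a fresh pspace has no proc mount. */
static void model_pspace_init(struct model_pspace *ps)
{
	ps->proc_mnt = NULL;
}

/* Assumed step: pin a mount for the lifetime of the pspace. */
static void model_pspace_attach_mnt(struct model_pspace *ps,
				    struct model_vfsmount *mnt)
{
	assert(ps->proc_mnt == NULL);
	mnt->refcount++;
	ps->proc_mnt = mnt;
}

/* Mirrors the __put_pspace() hunk: drop the reference only if one was taken. */
static void model_pspace_release(struct model_pspace *ps)
{
	if (ps->proc_mnt) {
		ps->proc_mnt->refcount--;
		ps->proc_mnt = NULL;
	}
}
```

The NULL check is what makes release safe for a pspace whose /proc was never mounted, which is the case the patch guards with "if (pspace->proc_mnt)".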

2006-02-06 20:25:20

by Eric W. Biederman

Subject: Re: [RFC][PATCH 03/20] pid: Introduce a generic helper to test for init.

Dave Hansen <[email protected]> writes:

> On Mon, 2006-02-06 at 12:29 -0700, Eric W. Biederman wrote:
>> Introduce is_init to capture this case.
>>
>> With multiple pid spaces for all of the cases affected we are looking
>> for only the first process on the system, not some other process that
>> has pid == 1.
>
> If we have cases where each container has its own init (like vserver,
> right?), will this naming get confusing? Will we have pseudo-init tasks
> as well?

Agreed. That is a potential issue. I couldn't think of a better name.
Suggestions?

Eric

2006-02-06 20:31:57

by Eric W. Biederman

Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

"Serge E. Hallyn" <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>>
>> The wait id is the pid returned by wait. For tasks that span 2
>> namespaces (i.e. the process leaders of the pid namespaces) their
>> parent knows the task by a different PID value than the task knows
>> itself. Having a child with PID == 1 would be confusing.
>
> Is it possible here to have wid conflicts?
>
> Does that matter?
>
> Looking at sysvinit, it seems that it does. If the wid happens
> to conflict with the pid of one of the children init knows about,
> it could confuse init.

No. The wid is in the pspace of the parent, and the pid is in the process's
pspace. All ids in any given pspace are unique.

Eric
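
Editorial illustration, not part of the original message: the point above is that a pspace leader carries two ids drawn from two different pspaces, so neither can collide with an id that is already in use in the same pspace. The sketch below is a userspace model under that assumption. All model_* names are hypothetical, and the simple counter stands in for the kernel's pidmap allocator.

```c
#include <assert.h>

/* Hypothetical model: each pspace hands out ids from its own counter. */
struct model_pspace {
	int last_id;
};

struct model_task {
	int pid;	/* id in the task's own pspace */
	int wid;	/* id in the parent's pspace, as returned by wait */
};

static int model_alloc_id(struct model_pspace *ps)
{
	assert(ps);
	return ++ps->last_id;
}

/* Ordinary fork: parent and child share a pspace, so wid == pid. */
static struct model_task model_fork(struct model_pspace *ps)
{
	struct model_task t;

	t.pid = model_alloc_id(ps);
	t.wid = t.pid;
	return t;
}

/* Fork into a new pspace: the child leads it as pid 1, while the
 * parent knows it by a wid allocated from the parent's pspace. */
static struct model_task model_fork_new_pspace(struct model_pspace *parent,
					       struct model_pspace *child)
{
	struct model_task t;

	child->last_id = 0;
	t.pid = model_alloc_id(child);
	t.wid = model_alloc_id(parent);
	return t;
}
```

In this model the leader sees itself as pid 1, and the parent waits on a wid that came out of the same allocator as all of its other children, so it cannot clash with any of them.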

2006-02-06 20:41:04

by Hubertus Franke

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

Eric W. Biederman wrote:
> There have been several discussions in the past month about how
> to do a good job of implementing a subset of user space that
> looks like it has the system to itself. Essentially making
> chroot everything it could be. This is my take on what
> the implementation of a pid namespace should look like.
>

Eric, this looks very good. The pspace internal implementation
is very similar to what I was working on: objectifying the pidmap,
etc. I was pursuing the same route of using find_pid()
functions as the means to distinguish pspaces rather than
actually virtualizing them.

But this code already goes so much further in many many aspects.
Kudos to you.
I am still going through some of the details, but this is an
excellent starting position.

>
> What follows is a real patch set that is sufficiently complete
> to be used and useful in it's own right. There are a few areas
> of the kernel where the patchset does not reach, mostly these
> cause the compile to fail. In addition a good thorough review
> still needs to be done. This patchset does paint a picture
> of how I think things should look.
>
> From the kernel community at large I am asking:
> Does the code look generally sane?

Yes, but I have one question for you...
Large parts of the patch are adding the pspace argument
to find_task_by_pid() and in many cases that argument is
current->pspace.
It might bring down the size of the patch if you
have

find_task_by_pid( pid ) { return find_task_pidspace_by_pid ( current->pspace, pid ); }

and then only deal with the exceptional cases using find_task_pidspace_by_pid
when the pidspace is different..

>
> Does the use of clone to create a new namespace instance look
> like the sane approach?
>

At the surface it looks OK .. how does this work in a multi-threaded
process which does clone ( CLONE_NPSPACE ) ?
We discussed at some point that exec is the right place to do it,
but what I get is that because this is the container_init task
we are OK !
A bit of clarification would help here ...

> Hopefully this code is sufficiently comprehensible to allow a good
> discussion to come out of this.
>

Yes

> Thanks for your time,
>
> Eric

-- Hubertus
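
Editorial illustration, not part of the original message: the wrapper proposed above can be modelled in userspace as follows. The two function names are taken from the message, but the task table, the model_* types, and the model_current pointer (standing in for the kernel's current) are hypothetical, and the linear search is only a placeholder for the kernel's pid hash lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pspace and task structures. */
struct model_pspace {
	int id;
};

struct model_task {
	int pid;
	struct model_pspace *pspace;
};

static struct model_pspace ps_a = { 1 }, ps_b = { 2 };

/* Two pspaces that both contain a task with pid 1. */
static struct model_task model_tasks[] = {
	{ 1, &ps_a }, { 7, &ps_a }, { 1, &ps_b },
};

/* Stand-in for the kernel's `current` task pointer. */
static struct model_task *model_current;

/* Explicit form: the caller names the pspace to search. */
static struct model_task *find_task_pidspace_by_pid(struct model_pspace *ps,
						    int pid)
{
	size_t i;

	for (i = 0; i < sizeof(model_tasks) / sizeof(model_tasks[0]); i++)
		if (model_tasks[i].pspace == ps && model_tasks[i].pid == pid)
			return &model_tasks[i];
	return NULL;
}

/* Convenience form: default to the caller's own pspace. */
static struct model_task *find_task_by_pid(int pid)
{
	assert(model_current != NULL);
	return find_task_pidspace_by_pid(model_current->pspace, pid);
}
```

The trade-off is visible even in the model: the short form keeps call sites unchanged, but the pspace being searched is then implicit in who is calling.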

2006-02-06 20:41:28

by Serge E. Hallyn

Subject: Re: [RFC][PATCH 17/20] usb: Fixup usb so it works with pspaces.

Quoting Eric W. Biederman ([email protected]):
>
> Signed-off-by: Eric W. Biederman <[email protected]>

Just how many of these patches would have been unnecessary given your
tref patchset?

thanks,
-serge

2006-02-06 20:54:04

by Eric W. Biederman

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

Hubertus Franke <[email protected]> writes:

>> From the kernel community at large I am asking:
>> Does the code look generally sane?
>
> Yes, but I have one question for you...
> Large parts of the patch are adding the pspace argument
> to find_task_by_pid() and in many cases that argument is
> current->pspace.
> It might bring down the size of the patch if you
> have
>
> find_task_by_pid( pid ) { return find_task_pidspace_by_pid ( current->pspace,
> pid ); }
>
> and then only deal with the exceptional cases using find_task_pidspace_by_pid
> when the pidspace is different..

That is a possibility. However, I want to break some eggs so that the
users are updated appropriately. It is only by a strenuous act of
will that I don't change the type of pid, tgid, pgrp, and session.

The size of the changes is much less important than being clear.
So for now I want find_task_by_pid to be an absolute interface.


>> Does the use of clone to create a new namespace instance look
>> like the sane approach?
>>
>
> At the surface it looks OK .. how does this work in a multi-threaded
> process which does clone ( CLONE_NPSPACE ) ?
> We discussed at some point that exec is the right place to do it,
> but what I get is that because this is the container_init task
> we are OK !
> A bit of clarification would help here ...

Well, the parent doesn't much matter. But the child must have a fresh
start on all the groups of processes: all other groupings known by
a pid are per pspace, so they can't cross that line.

Eric

2006-02-06 20:56:06

by Eric W. Biederman

Subject: Re: [RFC][PATCH 17/20] usb: Fixup usb so it works with pspaces.

"Serge E. Hallyn" <[email protected]> writes:

> Quoting Eric W. Biederman ([email protected]):
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>
> Just how many of these patches would have been unnecessary given your
> tref patchset?

As far as the amount of work goes, it is six of one, half a dozen of
the other. This current implementation changes the logic of the code
the least, which I really like for maintainability.

Small bite-sized chunks.

But one piece can be replaced by the other easily; the patches
are not contradictory.

Eric

2006-02-06 21:07:15

by Hubertus Franke

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

Eric W. Biederman wrote:
> Hubertus Franke <[email protected]> writes:
>
>
>>find_task_by_pid( pid ) { return find_task_pidspace_by_pid ( current->pspace,
>>pid ); }
>>
>>and then only deal with the exceptional cases using find_task_pidspace_by_pid
>>when the pidspace is different..
>
>
> That is a possibility. However I want to break some eggs so that the
> users are updated appropriately. It is only by a strenuous act of
> will that I don't change the type of pid,tgid,pgrp,session.
>
> The size of the changes is much less important than being clear.
> So for now I want find_task_by_pid to be an absolute interface.
>

Fair enough, valid answers .. I checked the patch and it would only take
19/33 instances out .. so not the end of the world.

>
>
>>> Does the use of clone to create a new namespace instance look
>>> like the sane approach?
>>>
>>
>>At the surface it looks OK .. how does this work in a multi-threaded
>>process which does clone ( CLONE_NPSPACE ) ?
>>We discussed at some point that exec is the right place to do it,
>>but what I get is that because this is the container_init task
>>we are OK !
>>A bit of clarification would help here ...
>
>
> Well the parent doesn't much matter. But the child must have a fresh
> start on all the groups of processes. As all other groupings known by
> a pid are per pspace, so they can't cross that line.
>

Now, on which kernel does this compile/work?
Do you have a "helper" program you can share that starts/execs an
app under a new container (uhmm, namespace)? No point in us
actually writing that..

-- Hubertus


2006-02-06 22:00:14

by Eric W. Biederman

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

Hubertus Franke <[email protected]> writes:

> Eric W. Biederman wrote:
>> Hubertus Franke <[email protected]> writes:
>>
>>
>>>find_task_by_pid( pid ) { return find_task_pidspace_by_pid ( current->pspace,
>>>pid ); }
>>>
>>>and then only deal with the exceptional cases using find_task_pidspace_by_pid
>>>when the pidspace is different..
>> That is a possibility. However I want to break some eggs so that the
>> users are updated appropriately. It is only by a strenuous act of
>> will that I don't change the type of pid,tgid,pgrp,session.
>> The size of the changes is much less important than being clear.
>> So for now I want find_task_by_pid to be an absolute interface.
>>
>
> Fair enough, valid answers .. I checked the patch and it would only take
> 19/33 instances out .. so not the end of the world.
>
>>
>>>> Does the use of clone to create a new namespace instance look
>>>> like the sane approach?
>>>>
>>>
>>>At the surface it looks OK .. how does this work in a multi-threaded
>>>process which does clone ( CLONE_NPSPACE ) ?
>>>We discussed at some point that exec is the right place to do it,
>>>but what I get is that because this is the container_init task
>>>we are OK !
>>>A bit of clarification would help here ...
>> Well the parent doesn't much matter. But the child must have a fresh
>> start on all the groups of processes. As all other groupings known by
>> a pid are per pspace, so they can't cross that line.
>>
>
> Now, on which kernel does this compile/work ?

2.6.latest plus a few patches I have already sent off to Andrew.

> Do you have a "helper" program you can share that starts/exec's an
> app under a new container (uhmm, namespace). No point for us to
> actually write that..

Ok here is my little helper/tester program. Not beautiful but
it should work.

Eric


/* gcc -Wall -O2 -g chpid.c -o chpid */
#define _XOPEN_SOURCE
#define _XOPEN_SOURCE_EXTENDED
#define _SVID_SOURCE
#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/mount.h>
#include <sys/vfs.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <stdarg.h>
#include <dirent.h>
#include <limits.h> /* PATH_MAX */

#ifndef MNT_FORCE
#define MNT_FORCE 0x00000001 /* Attempt to forcibly umount */
#endif /* MNT_FORCE */
#ifndef MNT_DETACH
#define MNT_DETACH 0x00000002 /* Just detach from the tree */
#endif /* MNT_DETACH */
#ifndef MNT_EXPIRE
#define MNT_EXPIRE 0x00000004 /* Mark for expiry */
#endif /* MNT_EXPIRE */

#ifndef MS_MOVE
#define MS_MOVE 8192
#endif
#ifndef MS_REC
#define MS_REC 16384
#endif

#ifndef CLONE_NPSPACE
#define CLONE_NPSPACE 0x04000000 /* New process space */
#endif


#ifndef PROC_SUPER_MAGIC
#define PROC_SUPER_MAGIC 0x9fa0
#endif /* PROC_SUPER_MAGIC */

struct user_desc;
static pid_t raw_clone(int flags, void *child_stack,
int *parent_tidptr, struct user_desc *newtls, int *child_tidptr)
{
return syscall(__NR_clone, flags, child_stack, parent_tidptr, newtls, child_tidptr);
}

static int raw_pivot_root(const char *new_root, const char *old_root)
{
return syscall(__NR_pivot_root, new_root, old_root);
}

static void (*my_exit)(int status) = exit;

static void die(char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
vfprintf(stderr, fmt, ap);
va_end(ap);
fflush(stderr);
fflush(stdout);
my_exit(1);
}

static void *xmalloc(size_t size)
{
void *ptr;
ptr = malloc(size);
if (!ptr) die("malloc of %zu bytes failed: %s\n", size, strerror(errno));
return ptr;
}

int main(int argc, char **argv, char **envp)
{
pid_t pid;
int status;
struct rlimit rlim;
int clone_flags;
char **cmd_argv, *shell_argv[2];
char *root = "/", *old = "/mnt";
int i;
int tty, tty_force;

tty = 0;
tty_force = 0;
clone_flags = CLONE_NPSPACE | SIGCHLD;

for (i = 1; (i < argc) && (argv[i][0] == '-'); i++) {
if (strcmp(argv[i], "--") == 0) {
break;
}
else if (((argc - i) >= 2) && (strcmp(argv[i], "-r") == 0)) {
clone_flags |= CLONE_NEWNS;
root = argv[i + 1];
i++;
}
else if (((argc - i) >= 2) && (strcmp(argv[i], "-o") == 0)) {
old = argv[i + 1];
i++;
}
else if (strcmp(argv[i], "-n") == 0) {
clone_flags |= CLONE_NEWNS;
}
else if (strcmp(argv[i], "--tty") == 0) {
tty = 1;
}
else if (strcmp(argv[i], "--tty-force") == 0) {
tty = 1; tty_force = 1;
}
else {
die("Bad argument %s\n", argv[i]);
}
}
cmd_argv = argv + i;
if (cmd_argv[0] == NULL) {
cmd_argv = shell_argv;
shell_argv[0] = getenv("SHELL");
shell_argv[1] = NULL;
}
if (cmd_argv[0] == NULL) {
die("No command specified\n");
}
#if 1
fprintf(stderr, "cmd_argv: %s\n", cmd_argv[0]);
#endif
if (root[0] != '/') {
die("root path: '%s' not absolute\n", root);
}
if (old[0] != '/') {
die("old path: '%s' not absolute\n", old);
}

pid = raw_clone(clone_flags, NULL, NULL, NULL, NULL);
if (pid < 0) {
fprintf(stderr, "clone_failed: pid: %d %d:%s\n",
pid, errno, strerror(errno));
exit(2);
}
if (pid == 0) {
/* In the child */
int result;
my_exit = _exit;

/* FIXME allocate a process inside for controlling the new process space */

fprintf(stderr, "pid: %d, ppid: %d pgrp: %d sid: %d\n",
getpid(), getppid(), getpgid(0), getsid(0));
/* If CLONE_NPSPACE isn't implemented exit */
if (getpid() != 1)
die("CLONE_NPSPACE not implemented\n");
if (clone_flags & CLONE_NEWNS) {
struct statfs stfs;
if (strcmp(root, "/") != 0) {
char put_old[PATH_MAX];
result = snprintf(put_old, sizeof(put_old), "%s%s", root, old);
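/* Check for failure first: a negative int compared against sizeof() is promoted to a huge unsigned value */
if (result < 0)
die("snprintf failed: %d:%s\n",
errno, strerror(errno));
if ((size_t)result >= sizeof(put_old))
die("path name too long\n");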

/* Ensure I have a mount point at the directory I want to export */
result = mount(root, root, NULL, MS_BIND | MS_REC, NULL);
if (result < 0)
die("bind of '%s' failed: %d:%s\n",
root, errno, strerror(errno));

/* Switch the mount points */
result = raw_pivot_root(root, put_old);
if (result < 0)
die("pivot_root('%s', '%s') failed: %d:%s\n",
root, put_old, errno, strerror(errno));

/* Unmount all of the old mounts */
result = umount2(old, MNT_DETACH);
if (result < 0)
die("umount2 of '%s' failed: %d:%s\n",
old, errno, strerror(errno));
}

result = statfs("/proc", &stfs);
if ((result == 0) && (stfs.f_type == PROC_SUPER_MAGIC)) {
/* Unmount and remount proc so it reflects the new pid space */
result = umount2("/proc", 0);
if (result < 0)
die("umount failed: %d:%s\n", errno, strerror(errno));

result = mount("proc", "/proc", "proc", 0, NULL);
if (result < 0)
die("mount failed: %d:%s\n",
errno, strerror(errno));
}
}
if (tty) {
pid_t sid, pgrp;
sid = setsid();
if (sid < 0)
die("setsid failed: %d:%s\n",
errno, strerror(errno));
fprintf(stderr, "pid: %d, ppid: %d pgrp: %d sid: %d\n",
getpid(), getppid(), getpgid(0), getsid(0));

result = ioctl(STDIN_FILENO, TIOCSCTTY, tty_force);
if (result < 0)
die("tiocsctty failed: %d:%s\n",
errno, strerror(errno));

pgrp = tcgetpgrp(STDIN_FILENO);

fprintf(stderr, "pgrp: %d\n", pgrp);

fprintf(stderr, "pid: %d, ppid: %d pgrp: %d sid: %d\n",
getpid(), getppid(), getpgid(0), getsid(0));

}
result = execve(cmd_argv[0], cmd_argv, envp);
die("execve of %s failed: %d:%s\n",
cmd_argv[0], errno, strerror(errno));
}
/* In the parent */
fprintf(stderr, "child pid: %d\n", pid);
pid = waitpid(pid, &status, 0);
if (pid < 0) {
fprintf(stderr, "waitpid failed: %d %s\n",
errno, strerror(errno));
exit(9);
}
if (pid == 0) {
fprintf(stderr, "waitpid returned no pid!\n");
exit(10);
}
/* status is only meaningful once waitpid has succeeded */
fprintf(stderr, "pid: %d exited status: %d\n",
pid, status);
if (WIFEXITED(status)) {
fprintf(stderr, "pid: %d exited: %d\n",
pid, WEXITSTATUS(status));
}
if (WIFSIGNALED(status)) {
fprintf(stderr, "pid: %d exited with an uncaught signal: %d %s\n",
pid, WTERMSIG(status), strsignal(WTERMSIG(status)));
}
if (WIFSTOPPED(status)) {
fprintf(stderr, "pid: %d stopped with signal: %d\n",
pid, WSTOPSIG(status));
}
return 0;
}
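
The put_old path construction in the program above could be factored into a small helper. What follows is only a sketch: join_path() is a hypothetical name that does not appear in the posted program. It tests for snprintf() failure before testing for truncation, since a negative return value compared against a size_t would otherwise look like an overlong path.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: write "<root><old>" into buf.
 * Returns 0 on success, -1 if snprintf() failed or the result
 * did not fit (buf then holds a truncated string).
 */
int join_path(char *buf, size_t size, const char *root, const char *old)
{
	int n = snprintf(buf, size, "%s%s", root, old);

	if (n < 0)
		return -1;	/* output error */
	if ((size_t)n >= size)
		return -1;	/* would not fit */
	return 0;
}
```

A caller would pass a PATH_MAX sized buffer and die() on a non-zero return, as the program does today.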

2006-02-07 00:48:35

by Dave Hansen

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

On Mon, 2006-02-06 at 12:19 -0700, Eric W. Biederman wrote:
> p.s. My apologies at the size of the CC list. It is very hard to tell who
> the interested parties are, and since there is no one list we all subscribe
> to other than linux-kernel how to reach everyone in a timely manner. I am
> copying everyone who has chimed in on a previous thread on the subject. If
> you don't want to be copied in the future tell and I will take your name off
> of my list.

Is it worth creating a vger or OSDL mailing list for these discussions?
There was some concern the cc list is getting too large. :)

If yes, would "linux-containers" be an appropriate name?

-- Dave

2006-02-07 05:17:25

by Eric W. Biederman

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

Dave Hansen <[email protected]> writes:

> On Mon, 2006-02-06 at 12:19 -0700, Eric W. Biederman wrote:
>> p.s. My apologies at the size of the CC list. It is very hard to tell who
>> the interested parties are, and since there is no one list we all subscribe
>> to other than linux-kernel how to reach everyone in a timely manner. I am
>> copying everyone who has chimed in on a previous thread on the subject. If
>> you don't want to be copied in the future tell and I will take your name off
>> of my list.
>
> Is it worth creating a vger or OSDL mailing list for these discussions?
> There was some concern the cc list is getting too large. :)
>
> If yes, would "linux-containers" be an appropriate name?

I think we can wait until the current discussion is over, as I have
not received any complaints yet.

For ongoing maintenance of the future in-kernel implementation I think
it is worth setting something up.

Eric

2006-02-07 09:41:35

by Andi Kleen

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

On Tuesday 07 February 2006 01:48, Dave Hansen wrote:
> On Mon, 2006-02-06 at 12:19 -0700, Eric W. Biederman wrote:
> > p.s. My apologies at the size of the CC list. It is very hard to tell who
> > the interested parties are, and since there is no one list we all subscribe
> > to other than linux-kernel how to reach everyone in a timely manner. I am
> > copying everyone who has chimed in on a previous thread on the subject. If
> > you don't want to be copied in the future tell and I will take your name off
> > of my list.
>
> Is it worth creating a vger or OSDL mailing list for these discussions?
> There was some concern the cc list is getting too large. :)

No, linux-kernel is fine. It will keep everybody informed, instead of
small groups doing their own things. But just keep the cc lists short please.
I'm sure all interested people can get it from l-k.

-Andi

2006-02-07 15:08:08

by Geert Uytterhoeven

Subject: Re: [RFC][PATCH 03/20] pid: Introduce a generic helper to test for init.

On Mon, 6 Feb 2006, Eric W. Biederman wrote:
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -894,6 +894,19 @@ static inline int pid_alive(struct task_
> return p->pids[PIDTYPE_PID].nr != 0;
> }
>
> +/**
> + * is_init - check if a task structure is the first user space
> + * task the kernel created.
> + * @p: Task structure to be checked.
^
> + */
> +static inline int is_init(struct task_struct *tsk)
^^^
Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2006-02-07 15:18:22

by Eric W. Biederman

Subject: Re: [RFC][PATCH 03/20] pid: Introduce a generic helper to test for init.

Geert Uytterhoeven <[email protected]> writes:

> On Mon, 6 Feb 2006, Eric W. Biederman wrote:
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -894,6 +894,19 @@ static inline int pid_alive(struct task_
>> return p->pids[PIDTYPE_PID].nr != 0;
>> }
>>
>> +/**
>> + * is_init - check if a task structure is the first user space
>> + * task the kernel created.
>> + * @p: Task structure to be checked.
> ^
>> + */
>> +static inline int is_init(struct task_struct *tsk)
> ^^^

Thanks. The perils of cut & paste documentation!
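
For reference, a sketch of the comment with the parameter name corrected so that kernel-doc matches the prototype. The struct and the function body below are assumptions for illustration only (a user-space stand-in); the quoted hunk shows neither.

```c
/* User-space stand-in for the kernel structure; only the field used here. */
struct task_struct {
	int pid;
};

/**
 * is_init - check if a task structure is the first user space
 *	     task the kernel created.
 * @tsk: Task structure to be checked.
 */
int is_init(const struct task_struct *tsk)
{
	/* Assumed body: the quoted hunk does not include it. */
	return tsk->pid == 1;
}
```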

Eric

2006-02-07 16:42:41

by William Lee Irwin III

Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Mon, Feb 06, 2006 at 12:34:08PM -0700, Eric W. Biederman wrote:
> -#define mk_pid(map, off) (((map) - pidmap_array)*BITS_PER_PAGE + (off))
> +#define mk_pid(map, off) (((map) - pspace->pidmap)*BITS_PER_PAGE + (off))
> #define find_next_offset(map, off) \

mk_pid() should be:

#define mk_pid(pspace, map, off) \
(((map) - (pspace)->pidmap)*BITS_PER_PAGE + (off))

or otherwise made an inline; pspace escaping like this shouldn't happen.
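
A minimal user-space sketch of the two forms suggested here, so the difference can be compiled and compared. The struct layouts, the array size and the BITS_PER_PAGE value (4 KiB pages assumed) are stand-ins, and mk_pid_inline() is a hypothetical name.

```c
#include <stddef.h>

#define BITS_PER_PAGE 32768	/* assumed: 4 KiB pages, 8 bits per byte */

struct pidmap { void *page; };
struct pspace { struct pidmap pidmap[4]; };

/* pspace is an explicit argument; nothing is captured from the caller's scope. */
#define mk_pid(pspace, map, off) \
	(((map) - (pspace)->pidmap) * BITS_PER_PAGE + (off))

/* Inline-function form: the same computation, with argument type checking. */
static inline long mk_pid_inline(const struct pspace *pspace,
				 const struct pidmap *map, int off)
{
	return (long)(map - pspace->pidmap) * BITS_PER_PAGE + off;
}
```

With either form a caller that has no variable named pspace in scope still compiles, which is not true of the macro as posted.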


On Mon, Feb 06, 2006 at 12:34:08PM -0700, Eric W. Biederman wrote:
> +static struct pspace *new_pspace(struct task_struct *leader)
> +{
> + struct pspace *pspace, *parent;
> + int i;
> + size_t len;
> + parent = leader->pspace;
> + len = strlen(parent->name) + 10;
> + pspace = kzalloc(sizeof(struct pspace) + len, GFP_KERNEL);
> + if (!pspace)
> + return NULL;
> + atomic_set(&pspace->count, 1);
> + pspace->flags = 0;
> + pspace->nr_threads = 0;
> + pspace->nr_processes = 0;
> + pspace->last_pid = 0;
> + pspace->min = RESERVED_PIDS;
> + pspace->max = PID_MAX_DEFAULT;
> + for (i = 0; i < PIDMAP_ENTRIES; i++) {
> + atomic_set(&pspace->pidmap[i].nr_free, BITS_PER_PAGE);
> + pspace->pidmap[i].page = NULL;
> + }
> + attach_any_pid(&pspace->child_reaper, leader, PIDTYPE_PID,
> + parent, leader->wid);
> + leader->pspace->nr_processes++;
> + snprintf(pspace->name, len + 1, "%s/%d", parent->name, leader->wid);
> +
> + return pspace;
> +}

kzalloc() followed by zeroing ->flags et al by hand is redundant.


-- wli

2006-02-07 17:08:54

by Eric W. Biederman

Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

William Lee Irwin III <[email protected]> writes:

> On Mon, Feb 06, 2006 at 12:34:08PM -0700, Eric W. Biederman wrote:
>> -#define mk_pid(map, off) (((map) - pidmap_array)*BITS_PER_PAGE + (off))
>> +#define mk_pid(map, off) (((map) - pspace->pidmap)*BITS_PER_PAGE + (off))
>> #define find_next_offset(map, off) \
>
> mk_pid() should be:
>
> #define mk_pid(pspace, map, off) \
> (((map) - (pspace)->pidmap)*BITS_PER_PAGE + (off))
>
> or otherwise made an inline; pscape escaping like this shouldn't happen.

Agreed, that is a bug.

> On Mon, Feb 06, 2006 at 12:34:08PM -0700, Eric W. Biederman wrote:
>> +static struct pspace *new_pspace(struct task_struct *leader)
>> +{
>> + struct pspace *pspace, *parent;
>> + int i;
>> + size_t len;
>> + parent = leader->pspace;
>> + len = strlen(parent->name) + 10;
>> + pspace = kzalloc(sizeof(struct pspace) + len, GFP_KERNEL);
>> + if (!pspace)
>> + return NULL;
>> + atomic_set(&pspace->count, 1);
>> + pspace->flags = 0;
>> + pspace->nr_threads = 0;
>> + pspace->nr_processes = 0;
>> + pspace->last_pid = 0;
>> + pspace->min = RESERVED_PIDS;
>> + pspace->max = PID_MAX_DEFAULT;
>> + for (i = 0; i < PIDMAP_ENTRIES; i++) {
>> + atomic_set(&pspace->pidmap[i].nr_free, BITS_PER_PAGE);
>> + pspace->pidmap[i].page = NULL;
>> + }
>> + attach_any_pid(&pspace->child_reaper, leader, PIDTYPE_PID,
>> + parent, leader->wid);
>> + leader->pspace->nr_processes++;
>> + snprintf(pspace->name, len + 1, "%s/%d", parent->name, leader->wid);
>> +
>> + return pspace;
>> +}
>
> kzalloc() followed by zeroing ->flags et al by hand is redundant.

Totally agreed. I was explicitly initializing all of the fields,
then I missed one in an update and got bitten rather badly. So I went
for the shotgun approach. I should probably just make it kmalloc
again at this point. I think kmalloc plus explicitly initializing
the fields is going to be more readable and more efficient because
it removes the redundancy.
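
The two options can be compared side by side in user space. This is a sketch under assumptions: calloc() and malloc() stand in for kzalloc() and kmalloc(), struct pspace_model is a cut-down illustrative field list, and min/max are passed in because the values of RESERVED_PIDS and PID_MAX_DEFAULT are not given in this thread.

```c
#include <stdlib.h>

/* Simplified stand-in for struct pspace; the field list is illustrative. */
struct pspace_model {
	int count;
	unsigned int flags;
	int nr_threads;
	int nr_processes;
	int last_pid;
	int min;
	int max;
};

/* Zeroing allocator: only the non-zero fields need an assignment. */
struct pspace_model *pspace_new_zeroed(int min, int max)
{
	struct pspace_model *ps = calloc(1, sizeof(*ps));

	if (!ps)
		return NULL;
	ps->count = 1;
	ps->min = min;
	ps->max = max;
	return ps;
}

/* Plain allocator: every field must be assigned or it holds garbage. */
struct pspace_model *pspace_new_explicit(int min, int max)
{
	struct pspace_model *ps = malloc(sizeof(*ps));

	if (!ps)
		return NULL;
	ps->count = 1;
	ps->flags = 0;
	ps->nr_threads = 0;
	ps->nr_processes = 0;
	ps->last_pid = 0;
	ps->min = min;
	ps->max = max;
	return ps;
}
```

Both produce the same object; the second one is the variant where a forgotten field bites, as described above.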

Eric

2006-02-07 17:40:52

by Jeff Dike

Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

First of all, for an RFC, this is very thorough.

Second, I've been thinking along these lines for UML. The motivation
is to get UML out of the system call tracing business as much as
possible, and to do so by having the host set up such that it can run
system calls itself and they do the same thing as the UML system call
would.

For example, for a UML process chrooted into a UML filesystem, the
file operations on normal files will do the same thing as they would
in UML, so they could be left to run on the host.

Similarly, something like virtualized processes could be made to do
the same thing with the process operations. Trivially, getpid() will
return the right value if left to run on the host, so UML wouldn't
need to intercept it. If there is a process tree inside a container
that mirrors the UML process tree, then lots of other system calls
also work, and don't need to be intercepted.

Ideally, I'd like namespaces on the host for all the resources under
UML control, and for a container to group those namespaces. However,
something which stops short of that is still usable - UML just gets
less benefit from it.

As far as processes go, ideally I'd like a containerized process to be
an empty shell which can be completely filled from userspace. The
motivation for this is that when you have a UP UML with 100 processes,
it's wasteful to have 100 virtualized processes on the host. What I
would want is one virtualized process which can be completely refilled
with new attributes on a context switch.

What I want to do is related to process migration, where you want to
move a process but have it not be able to tell. I'm describing
migrating a process from the UML to the host such that the host
performs as many system calls itself as it can, while those it can't
perform get intercepted and executed within the UML. For migration between
physical machines, this would be the same as redirecting a system call
from the new host back to its original home. You want to do that as
infrequently as possible, so you want the container to provide as much
context from the home host as possible.

Jeff

2006-02-07 18:35:10

by Eric W. Biederman

Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

Jeff Dike <[email protected]> writes:

> First of all, for an RFC, this is very thorough.

Thank you.

> Second, I've been thinking along these lines for UML. The motivation
> is to get UML out of the system call tracing business as much as
> possible, and to do so by having the host set up such that it can run
> system calls itself and they do the same thing as the UML system call
> would.
>
> For example, for a UML process chrooted into a UML filesystem, the
> file operations on normal files will do the same thing as they would
> in UML, so they could be left to run on the host.
>
> Similarly, something like virtualized processes could be made to do
> the same thing with the process operations. Trivially, getpid() will
> return the right value if left to run on the host, so UML wouldn't
> need to intercept it. If there is a process tree inside a container
> that mirrors the UML process tree, then lots of other system calls
> also work, and don't need to be intercepted.
>
> Ideally, I'd like namespaces on the host for all the resources under
> UML control, and for a container to group those namespaces. However,
> something which stops short of that is still usable - UML just gets
> less benefit from it.

Having all of the namespaces is certainly on my TODO list.

I'm not at all certain if there is a need for a kernel container
concept.

> As far as processes go, ideally I'd like a containerized process to be
> an empty shell which can be completely filled from userspace. The
> motivation for this is that when you have a UP UML with 100 processes,
> it's wasteful to have 100 virtualized processes on the host. What I
> would want is one virtualized process which can be completely refilled
> with new attributes on a context switch.
>
> What I want to do is related to process migration, where you want to
> move a process but have it not be able to tell. I'm describing
> migrating a process from the UML to the host such that the host
> performs as many system calls itself, but those which can't get
> intercepted and executed within the UML. For migration between
> physical machines, this would be the same as redirecting a system call
> from the new host back to its original home. You want to do that as
> infrequently as possible, so you want the container to provide as much
> context from the home host as possible.

Currently redirecting a system call from the new host back to its
original home is not something I had planned on. Most of the reasons
I want to migrate relate to avoiding the hardware I am migrating from.
Either to reduce its load or to leave before the hardware dies.

That said the idea of a user space monitor that can handle
the strange virtualization things that don't fit well into the
kernel is appealing.

Note all of the migration I am looking at is not process migration but
container migration. So I want a container per application.

Eric

2006-02-08 04:18:49

by Randy Dunlap

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

On Tue, 7 Feb 2006 10:33:21 +0100 Andi Kleen wrote:

> On Tuesday 07 February 2006 01:48, Dave Hansen wrote:
> > On Mon, 2006-02-06 at 12:19 -0700, Eric W. Biederman wrote:
> > > p.s. My apologies at the size of the CC list. It is very hard to tell who
> > > the interested parties are, and since there is no one list we all subscribe
> > > to other than linux-kernel how to reach everyone in a timely manner. I am
> > > copying everyone who has chimed in on a previous thread on the subject. If
> > > you don't want to be copied in the future tell and I will take your name off
> > > of my list.
> >
> > Is it worth creating a vger or OSDL mailing list for these discussions?
> > There was some concern the cc list is getting too large. :)
>
> No, linux-kernel is fine. It will keep everybody informed, instead of
> small groups doing their own things. But just keep the cc lists short please.
> I'm sure all interested people can get it from l-k.

OK. (Yes, I have a dream.)

I'd like to see less traffic on lkml, with various things moved off
to other mailing lists.

According to http://marc.theaimsgroup.com/?l=linux-kernel,
Jan. 2006 was a near record month and Feb. is on track to be large also.

---
~Randy

2006-02-08 04:31:08

by Eric W. Biederman

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

"Randy.Dunlap" <[email protected]> writes:

> OK. (Yes, I have a dream.)
>
> I'd like to see less traffic on lkml, with various things moved off
> to other mailing lists.
>
> According to http://marc.theaimsgroup.com/?l=linux-kernel,
> Jan. 2006 was a near record month and Feb. is on track to be large also.

A dream I can understand.

However discussion about core kernel infrastructure is very much on
topic. Or nothing is on topic on the mailing list.

Once this matures to the point where we are discussing the maintenance
of a subsystem we can consider moving off lkml.

Eric



2006-02-08 04:50:55

by Randy Dunlap

Subject: Re: [RFC][PATCH 0/20] Multiple instances of the process id namespace

On Tue, 07 Feb 2006 21:28:04 -0700 Eric W. Biederman wrote:

> "Randy.Dunlap" <[email protected]> writes:
>
> > OK. (Yes, I have a dream.)
> >
> > I'd like to see less traffic on lkml, with various things moved off
> > to other mailing lists.
> >
> > According to http://marc.theaimsgroup.com/?l=linux-kernel,
> > Jan. 2006 was a near record month and Feb. is on track to be large also.
>
> A dream I can understand.
>
> However discussion about core kernel infrastructure is very much on
> topic. Or nothing is on topic on the mailing list.
>
> Once this matures to the point where we are discussing the maintenance
> of a subsystem we can consider moving off lkml.

Yes, I wasn't referring to this particular topic for offloading.

thanks,
---
~Randy

2006-02-10 18:51:58

by Kirill Korotaev

Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

Eric,

1. I would rename wid to wpid :)
2. Maybe I'm missing something in the discussions, but why is it
required? If you provide fully isolated pid spaces, why do you care about wid?
And how can ptrace work at all if cldstop reports some pid, but the child is
not accessible via ptrace()? And if it doesn't work, what are your
changes for?

Kirill


> The wait id is the pid returned by wait. For tasks that span 2
> namespaces (i.e. the process leaders of the pid namespaces) their
> parent knows the task by a different PID value than the task knows
> itself. Having a child with PID == 1 would be confusing.
>
> This patch introduces the wid and walks through kernel and modifies
> the places that observe the pid from the parent processes perspective
> to use the wid instead of the pid.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
>
>
> ---
>
> include/linux/sched.h | 1 +
> kernel/exit.c | 18 +++++++++---------
> kernel/fork.c | 4 ++--
> kernel/sched.c | 2 +-
> kernel/signal.c | 6 +++---
> 5 files changed, 16 insertions(+), 15 deletions(-)
>
> 598714d79648463ab3f2cbf6f6acd3cd6c09c87a
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f368048..e8ea561 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -740,6 +740,7 @@ struct task_struct {
> /* ??? */
> unsigned long personality;
> unsigned did_exec:1;
> + pid_t wid;
> pid_t pid;
> pid_t tgid;
> /*
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fb4c8b1..749bc8b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -941,7 +941,7 @@ asmlinkage void sys_exit_group(int error
> static int eligible_child(pid_t pid, int options, task_t *p)
> {
> if (pid > 0) {
> - if (p->pid != pid)
> + if (p->wid != pid)
> return 0;
> } else if (!pid) {
> if (process_group(p) != process_group(current))
> @@ -1018,7 +1018,7 @@ static int wait_task_zombie(task_t *p, i
> int status;
>
> if (unlikely(noreap)) {
> - pid_t pid = p->pid;
> + pid_t pid = p->wid;
> uid_t uid = p->uid;
> int exit_code = p->exit_code;
> int why, status;
> @@ -1130,7 +1130,7 @@ static int wait_task_zombie(task_t *p, i
> retval = put_user(status, &infop->si_status);
> }
> if (!retval && infop)
> - retval = put_user(p->pid, &infop->si_pid);
> + retval = put_user(p->wid, &infop->si_pid);
> if (!retval && infop)
> retval = put_user(p->uid, &infop->si_uid);
> if (retval) {
> @@ -1138,7 +1138,7 @@ static int wait_task_zombie(task_t *p, i
> p->exit_state = EXIT_ZOMBIE;
> return retval;
> }
> - retval = p->pid;
> + retval = p->wid;
> if (p->real_parent != p->parent) {
> write_lock_irq(&tasklist_lock);
> /* Double-check with lock held. */
> @@ -1198,7 +1198,7 @@ static int wait_task_stopped(task_t *p,
> read_unlock(&tasklist_lock);
>
> if (unlikely(noreap)) {
> - pid_t pid = p->pid;
> + pid_t pid = p->wid;
> uid_t uid = p->uid;
> int why = (p->ptrace & PT_PTRACED) ? CLD_TRAPPED : CLD_STOPPED;
>
> @@ -1269,11 +1269,11 @@ bail_ref:
> if (!retval && infop)
> retval = put_user(exit_code, &infop->si_status);
> if (!retval && infop)
> - retval = put_user(p->pid, &infop->si_pid);
> + retval = put_user(p->wid, &infop->si_pid);
> if (!retval && infop)
> retval = put_user(p->uid, &infop->si_uid);
> if (!retval)
> - retval = p->pid;
> + retval = p->wid;
> put_task_struct(p);
>
> BUG_ON(!retval);
> @@ -1310,7 +1310,7 @@ static int wait_task_continued(task_t *p
> p->signal->flags &= ~SIGNAL_STOP_CONTINUED;
> spin_unlock_irq(&p->sighand->siglock);
>
> - pid = p->pid;
> + pid = p->wid;
> uid = p->uid;
> get_task_struct(p);
> read_unlock(&tasklist_lock);
> @@ -1321,7 +1321,7 @@ static int wait_task_continued(task_t *p
> if (!retval && stat_addr)
> retval = put_user(0xffff, stat_addr);
> if (!retval)
> - retval = p->pid;
> + retval = pid;
> } else {
> retval = wait_noreap_copyout(p, pid, uid,
> CLD_CONTINUED, SIGCONT,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f4a7281..743d46c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -931,10 +931,10 @@ static task_t *copy_process(unsigned lon
>
> p->did_exec = 0;
> copy_flags(clone_flags, p);
> - p->pid = pid;
> + p->wid = p->pid = pid;
> retval = -EFAULT;
> if (clone_flags & CLONE_PARENT_SETTID)
> - if (put_user(p->pid, parent_tidptr))
> + if (put_user(p->wid, parent_tidptr))
> goto bad_fork_cleanup;
>
> p->proc_dentry = NULL;
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f77f23f..6579d49 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1674,7 +1674,7 @@ asmlinkage void schedule_tail(task_t *pr
> preempt_enable();
> #endif
> if (current->set_child_tid)
> - put_user(current->pid, current->set_child_tid);
> + put_user(current->wid, current->set_child_tid);
> }
>
> /*
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1f54ed7..70c226c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1564,7 +1564,7 @@ void do_notify_parent(struct task_struct
>
> info.si_signo = sig;
> info.si_errno = 0;
> - info.si_pid = tsk->pid;
> + info.si_pid = tsk->wid;
> info.si_uid = tsk->uid;
>
> /* FIXME: find out whether or not this is supposed to be c*time. */
> @@ -1629,7 +1629,7 @@ static void do_notify_parent_cldstop(str
>
> info.si_signo = SIGCHLD;
> info.si_errno = 0;
> - info.si_pid = tsk->pid;
> + info.si_pid = tsk->wid;
> info.si_uid = tsk->uid;
>
> /* FIXME: find out whether or not this is supposed to be c*time. */
> @@ -1732,7 +1732,7 @@ void ptrace_notify(int exit_code)
> memset(&info, 0, sizeof info);
> info.si_signo = SIGTRAP;
> info.si_code = exit_code;
> - info.si_pid = current->pid;
> + info.si_pid = current->wid;
> info.si_uid = current->uid;
>
> /* Let the debugger run. */


2006-02-10 20:29:54

by Kirill Korotaev

Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Eric,

All my comments are inline below.

> This patch modifies the fork/exit, signal handling, and pid and
> process group manipulating syscalls to support multiple process
> spaces, and implements the data for allow multiple instaces of the pid
> namespace.

[ ... skipped .... ]

> +extern struct pspace init_pspace;
> +
> +#define INVALID_PID 0x7fffffff
<<<< what is it for?

> +
> +static inline int pspace_task_visible(struct pspace *pspace, struct task_struct *tsk)
> +{
> + return (tsk->pspace == pspace) ||
> + ((tsk->pspace->child_reaper.pspace == pspace) &&
> + (tsk->pspace->child_reaper.task == tsk));
<<< the logic with child_reaper which can be somehow partly inside
pspace... and this check is not that obvious.

Actually I can't say your patch is cleaner somehow.
It is very big and most of the changes are trivial, which creates an
illusion that it is straightforward and clean.
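
To make the quoted pspace_task_visible() check easier to discuss, here is a user-space model of it. The struct layouts are assumptions reduced to the fields the predicate touches; only the two-case logic is taken from the quoted hunk.

```c
#include <stdbool.h>
#include <stddef.h>

struct pspace;

struct task {
	struct pspace *pspace;	/* the pid space the task lives in */
};

struct pid_link {
	struct pspace *pspace;	/* pid space this attachment is visible in */
	struct task *task;	/* the task that is attached */
};

struct pspace {
	struct pid_link child_reaper;	/* the leader, as attached in the parent */
};

bool pspace_task_visible(const struct pspace *pspace, const struct task *tsk)
{
	/* Case 1: the task is an ordinary member of this pid space. */
	if (tsk->pspace == pspace)
		return true;
	/* Case 2: the task leads a child pid space attached to this one. */
	return tsk->pspace->child_reaper.pspace == pspace &&
	       tsk->pspace->child_reaper.task == tsk;
}
```

In this model the leader of a child pid space is the only task visible from two pid spaces at once, which is the "partly inside" behaviour being questioned.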

[ ... skipped .... ]

> @@ -788,6 +801,16 @@ fastcall NORET_TYPE void do_exit(long co
> panic("Attempted to kill the idle task!");
> if (unlikely(is_init(tsk)))
> panic("Attempted to kill init!");
> +
> + /*
> + * If we are the pspace leader it is nonsense for the pspace
> + * to continue so kill everyone else in the pspace.
> + */
> + if (pspace_leader(tsk)) {
> + tsk->pspace->flags = PSPACE_EXIT;
> + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
> + }
> +
<<<<

1.
flags are neither atomic nor protected with any lock.
So if they are used later for something else, there will be
a 100% race. You also assign them here, while everywhere else
a bit is checked.

2. due to 1) your code is buggy. In this respect do_exit() is not
serialized with copy_process().

3. due to the same 1) reason
> + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
can miss a task being forked. Bang!!!

4.
So you are effectively inserting code for cleaning up pspace here and
the same is actually required for other subsystems like networking/IPC etc.
I think you suppose that other resources are freed when the last
reference is dropped, but to tell the truth this is a way to deadlocks.
Because refs are put in too many places, where you can't make a real
cleanup due to locking etc. You can't for example call synchronize_net()
from bh, which is required for network cleanup.

Just another argument why containers are easier/better and why you
will eventually end up with them.
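
As an illustration of point 1 only, a user-space sketch with C11 atomics of setting the exit bit without overwriting the other bits. The names and the PSPACE_EXIT value are assumptions, and the kernel would use its own bit operations rather than <stdatomic.h>. It does not address points 2 and 3: serializing against a concurrent fork still needs a lock or a re-check once the new task is linked in.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PSPACE_EXIT 0x1u	/* assumed bit value */

struct pspace_flags {
	atomic_uint flags;
};

/* Set the exit bit, keeping all other bits; true if this caller set it first. */
bool pspace_mark_exiting(struct pspace_flags *ps)
{
	unsigned int old = atomic_fetch_or(&ps->flags, PSPACE_EXIT);

	return !(old & PSPACE_EXIT);
}

/* Test the exit bit, matching how the flag is checked elsewhere. */
bool pspace_is_exiting(struct pspace_flags *ps)
{
	return (atomic_load(&ps->flags) & PSPACE_EXIT) != 0;
}
```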

> if (tsk->io_context)
> exit_io_context();
>

[ ... skipped ... ]

> @@ -1147,11 +1150,55 @@ retry:
> }
>
> /*
> + * kill_pspace_info() sends a signal to all processes in a process space.
> + * This is what kill(-1, sig) does.
> + */
> +
> +int __kill_pspace_info(int sig, struct siginfo *info, struct pspace *pspace)
> +{
> + struct task_struct *p = NULL;
> + int retval = 0, count = 0;
> +
> + for_each_process(p) {
> + int err;
> + /* Skip the current pspace leader */
> + if (current_pspace_leader(p))
> + continue;
> +
> + /* Skip the sender of the signal */
> + if (p->signal == current->signal)
> + continue;
> +
> + /* Skip processes outside the target process space */
> + if (!in_pspace(pspace, p))
> + continue;
> +
> + /* Finally it is a good process send the signal. */
> + err = group_send_sig_info(sig, info, p);
> + ++count;
> + if (err != -EPERM)
> + retval = err;
<<<<
why is EPERM ok?
do you want to miss some tasks?
> + }
> + return count ? retval : -ESRCH;
> +}
> +
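
The return-value logic of the quoted loop can be modelled in user space to see exactly what -EPERM does to the result. This is a sketch: aggregate_kill_result() is a hypothetical name, and errs[] stands for the values group_send_sig_info() returned for each process that passed the three skip tests.

```c
#include <errno.h>
#include <stddef.h>

/* errs[i] is the delivery result for the i-th eligible process. */
int aggregate_kill_result(const int *errs, size_t n)
{
	int retval = 0;
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		++count;
		/* -EPERM never overwrites retval; any other result does. */
		if (errs[i] != -EPERM)
			retval = errs[i];
	}
	return count ? retval : -ESRCH;
}
```

Two consequences follow from the loop as written: the last result other than -EPERM wins, and if every delivery fails with -EPERM the caller still gets 0.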

Thanks,
Kirill

2006-02-10 20:39:44

by Kirill Korotaev

Subject: Re: [RFC][PATCH 20/20] proc: Update /proc to support multiple pid spaces.

Hello,

> This patch does a couple of things.
> - It splits proc into proc and proc_sysinfo
> - It adds pspace support to proc
> - It adds getattr methods to ensure proc has the proper hard link count.
> - It increases the size of a couple of buffers by one to avoid buffer overflow
> - It moves /proc/mounts and /proc/loadavg into the proc filesystem from proc_sysinfo
>
> Sorry for the big patch. When I start feeding this changes seriously I will
> split this patch.
>
> The split of /proc into mutliple filesystems works well however it comes
> with one downsides. There are now some directories where cd -P <subdir>/..
> is not a noop. Basically it is doing the equivalent of following symlinks
> into an internal kernel mount. It is well defined and safe behaviour but
> I'm not certain if it is desirable.
>
> Signed-off-by: Eric W. Biederman <[email protected]>

This one is really ugly.
And it also contradicts your own idea of having separate
namespaces, since it introduces a pointer to proc_mnt in pspace.

You have many namespaces to which task_struct refers.
Do you want proc to work in any configuration of namespaces?
Then you can't have pointers to proc_mnt from namespaces.
Well, I understand that proc is the most painful for you... yeah...

Kirill

2006-02-11 09:28:39

by Eric W. Biederman

Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

Kirill Korotaev <[email protected]> writes:

> Eric,
>
> 1. I would rename wid to wpid :)
:)

> 2. Maybe I'm missing something in the discussions, but why is it required? if
> you provide fully isolated pid spaces, why do you care for wid?
> And how ptrace can work at all if cldstop reports some pid, but child is not
> accessiable via ptrace()? and if it doesn't work, what are your changes for?

A good question.

My pid spaces while isolated from each other are connected together in
something that resembles the mount relationship when mounting a filesystem.

The init process of a child pspace is visible in the parent pspace,
by its wid. So you should be able to ptrace pid == 1. The other
children remain inaccessible directly.

The idea was to do my very best to preserve the unix process tree when
constructing a separate pid space.
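
A small user-space model of the two identities may help here. It is a sketch with hypothetical names: fork_ids() mirrors the "p->wid = p->pid = pid" assignment from patch 01, and enter_new_pspace() models the leader of a new pid space keeping its wid in the parent while seeing itself as pid 1.

```c
/* The two ids a task carries in this model. */
struct task_ids {
	int wid;	/* what the parent sees from wait() and SIGCHLD */
	int pid;	/* what the task sees from getpid() */
};

/* Ordinary fork: both ids are the same value. */
struct task_ids fork_ids(int pid)
{
	struct task_ids t = { .wid = pid, .pid = pid };

	return t;
}

/* Leader of a new pid space: wid is unchanged, pid becomes 1 inside. */
struct task_ids enter_new_pspace(struct task_ids t)
{
	t.pid = 1;
	return t;
}

int id_seen_by_parent(const struct task_ids *t)
{
	return t->wid;
}

int id_seen_by_self(const struct task_ids *t)
{
	return t->pid;
}
```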

I have to admit I have not yet tested the ptrace corner case, or
digested all of its ramifications. Unless I have messed up somewhere
it should just work.

This visibility of the child tree by a single pid inside the parent
tree is why I think my approach doesn't suffer from most of the
problems a completely isolated pid space has.

In fact I would even be willing to look at it as a global but
hierarchical pid space if that proved an interesting case.
So you could have something like pkill(int sig, int which, char *pid_path).
Where pid_path is something like "1537/3/58", and which specified
what kind of pid you are talking about (thread, pid, process group...)
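
To make the hierarchical naming concrete, here is a minimal sketch of
splitting a path such as "1537/3/58" into one pid per nesting level.
The names parse_pid_path and MAX_PID_DEPTH are hypothetical and appear
nowhere in the patch set; pkill() itself is only an idea at this point.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_PID_DEPTH 8	/* hypothetical nesting limit for this sketch */

/*
 * Split a path such as "1537/3/58" into one pid per nesting level.
 * Returns the number of levels parsed, or -EINVAL on malformed input.
 */
static int parse_pid_path(const char *path, int pids[MAX_PID_DEPTH])
{
	int depth = 0;

	assert(path && pids);
	while (*path) {
		char *end;
		long val;

		if (depth == MAX_PID_DEPTH)
			return -EINVAL;
		/* Each component must start with a digit (no sign, no space). */
		if (*path < '0' || *path > '9')
			return -EINVAL;
		errno = 0;
		val = strtol(path, &end, 10);
		/* Reject overflow and non-positive pids. */
		if (end == path || errno || val <= 0 || val > 0x7fffffff)
			return -EINVAL;
		pids[depth++] = (int)val;
		if (*end == '/') {
			end++;
			if (*end == '\0')	/* trailing slash */
				return -EINVAL;
		} else if (*end != '\0') {
			return -EINVAL;
		}
		path = end;
	}
	return depth ? depth : -EINVAL;
}
```

Each element would then be looked up in the pid space selected by the
previous one, which is where the per-level permission checks would go.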

From a security standpoint I want the option of a fully isolated
group of processes, so there is a possibility of keeping secrets from
the system administrator, but there is no reason that needs to be tied
to a pid space.

Eric

2006-02-11 09:44:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

>> + * kill_pspace_info() sends a signal to all processes in a process space.
>> + * This is what kill(-1, sig) does.
>> + */
>> +
>> +int __kill_pspace_info(int sig, struct siginfo *info, struct pspace *pspace)
>> +{
>> + struct task_struct *p = NULL;
>> + int retval = 0, count = 0;
>> +
>> + for_each_process(p) {
>> + int err;
>> + /* Skip the current pspace leader */
>> + if (current_pspace_leader(p))
>> + continue;
>> +
>> + /* Skip the sender of the signal */
>> + if (p->signal == current->signal)
>> + continue;
>> +
>> + /* Skip processes outside the target process space */
>> + if (!in_pspace(pspace, p))
>> + continue;
>> +
>> + /* Finally it is a good process send the signal. */
>> + err = group_send_sig_info(sig, info, p);
>> + ++count;
>> + if (err != -EPERM)
>> + retval = err;
> <<<<
> why EPERM is ok?
> do you want to miss some tasks?


A good question. This is how kill -1 is currently implemented.
It doesn't align with how signals are sent to a process group,
so it could very well be wrong.

>> + }
>> + return count ? retval : -ESRCH;
>> +}
>> +
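
The effect of the -EPERM special case is easier to see in isolation.
The following is a user-space sketch that folds a list of per-process
results the same way the quoted loop does; fold_kill_results is a
hypothetical name, and the array stands in for the return values of
group_send_sig_info().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Fold per-process results the way the quoted loop does: every delivery
 * attempt is counted, but -EPERM never overwrites retval.
 */
static int fold_kill_results(const int *errs, size_t n)
{
	int retval = 0, count = 0;
	size_t i;

	assert(errs || n == 0);
	for (i = 0; i < n; i++) {
		++count;
		if (errs[i] != -EPERM)
			retval = errs[i];
	}
	/* No process at all was a candidate: report -ESRCH. */
	return count ? retval : -ESRCH;
}
```

Note that with this folding a run in which every target refuses the
signal with -EPERM still returns 0 to the caller, which is the behaviour
being questioned above.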

2006-02-11 10:12:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 20/20] proc: Update /proc to support multiple pid spaces.

Kirill Korotaev <[email protected]> writes:

> Hello,
>
>> This patch does a couple of things.
>> - It splits proc into proc and proc_sysinfo
>> - It adds pspace support to proc
>> - It adds getattr methods to ensure proc has the proper hard link count.
>> - It increases the size of a couple of buffers by one to avoid buffer overflow
>> - It moves /proc/mounts and /proc/loadavg into the proc filesystem from
> proc_sysinfo
>> Sorry for the big patch. When I start feeding this changes seriously I will
>> split this patch.
>> The split of /proc into mutliple filesystems works well however it comes
>> with one downsides. There are now some directories where cd -P <subdir>/..
>> is not a noop. Basically it is doing the equivalent of following symlinks
>> into an internal kernel mount. It is well defined and safe behaviour but
>> I'm not certain if it is desirable.
>> Signed-off-by: Eric W. Biederman <[email protected]>
>
> This one is really ugly.

It is certainly too much at one time, and the code, while semantically
interesting, still has significant issues with the implementation.
But a lot of the ugliness is inherent in the current implementation of
/proc and not in what I am trying to do with it.

> And it is also controversial to your own idea of having separate namespaces,
> but introduces a pointer to proc_mnt in pspace.

An instance of /proc being connected to a pid space is perfectly
natural. /proc is after all the filesystem that reports what is in a
pspace.

I freely admit the way I am using the internal mount of proc is
wrong, and is something that needs to be resolved before I submit
this for kernel inclusion. I was attempting to solve the problem
of having duplicate dcache entries in my recursive structure.
Unfortunately it was one of those clever solutions that only gets
you 99% of the way to where you want to go.

> You have many namespaces to which task_struct refers.
> Do you want proc to work in any configuration of namespaces?
Yes.

> Then you can't have pointers to proc_mnt from namespaces.
> Well, I understand that proc is the most painfull for you... yeah...

proc is the painful part of any change in how pids are dealt with.
So far I have not seen a single implementation that cleanly and
correctly addresses all of the issues (including mine :)

Eric


2006-02-11 10:13:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

> Eric,
>
> All my commments are inline below.
>
>> This patch modifies the fork/exit, signal handling, and pid and
>> process group manipulating syscalls to support multiple process
>> spaces, and implements the data for allow multiple instaces of the pid
>> namespace.
>
> [ ... skipped .... ]
>
>> +extern struct pspace init_pspace;
>> +
>> +#define INVALID_PID 0x7fffffff
> <<<< what is it for?

It is a holdover from an earlier version that never got deleted. Oops.

Eric

2006-02-11 10:46:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

>> +
>> +static inline int pspace_task_visible(struct pspace *pspace, struct
> task_struct *tsk)
>> +{
>> + return (tsk->pspace == pspace) ||
>> + ((tsk->pspace->child_reaper.pspace == pspace) &&
>> + (tsk->pspace->child_reaper.task == tsk));
> <<< the logic with child_reaper which can be somehow partly inside pspace... and
> this check is not that abvious.

This is the check for what shows up in /proc.

Given that is how I have explicitly documented things to work, (the
init process straddles the boundary) I fail to see how it is not obvious.

Eric

2006-02-11 11:06:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

> Actually I can't say your patch is cleaner somehow.
> It is very big and most of the changes are trivial, which creates an illusion
> that it is straightforward and clean.

It is hard to make a comparison. Your patch posted to the mail list
was incomplete, and I could only find a giant patch for OpenVZ.

Beyond that, if your patch introduced a type change for internal pids
and used that to generate compile errors when someone did not use the
appropriate type, I would be a lot happier and the code would
be a lot more maintainable. I.e. it would not take an audit of
the kernel source to find the issues; an allyesconfig build would find
them for you.
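
As a rough illustration of what such a type change could look like,
here is a sketch that wraps the internal pid in a struct so that it can
no longer be mixed with a plain int. The names kpid_t, make_kpid,
kpid_nr, kpid_equal and kpid_is_init are hypothetical and are not taken
from either patch set.

```c
#include <assert.h>

/* Hypothetical wrapper: a kernel-internal pid that is not a plain int. */
typedef struct { int val; } kpid_t;

static inline kpid_t make_kpid(int val)
{
	assert(val >= 0);
	return (kpid_t){ .val = val };
}

static inline int kpid_nr(kpid_t kpid)
{
	return kpid.val;
}

static inline int kpid_equal(kpid_t a, kpid_t b)
{
	return a.val == b.val;
}

/*
 * A function that takes kpid_t cannot be handed a user-visible int by
 * accident: kpid_is_init(1) does not compile, the caller has to write
 * kpid_is_init(make_kpid(1)) and thereby state which kind of pid it has.
 */
static inline int kpid_is_init(kpid_t kpid)
{
	return kpid_nr(kpid) == 1;
}
```

The cost is the extra conversion at every boundary, which is the
balancing act between catching mistakes and being cumbersome.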

I don't think my current implementation actually causes enough compile
errors, but I need to think closely about it before I go much farther.

Maintainable code is a delicate balancing act between things that
trip you up when you get it wrong, and not being so cumbersome you
get in the programmer's way.

The advantages I see with my approach:
- I have hierarchical pids so nesting is possible.
- The state after migration is not suboptimal.
- I cause compiler errors which makes maintenance easier.
- Other kernel developers' gut feel is that (container, pid) is the proper
representation.
I actually flip-flop on whether I want the internal representation
to be (container, pid) or a magic kpid that combines them into one integer.
I know I don't want the kpid to be user space visible though.
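
For comparison, the "magic kpid" variant could be sketched as packing
the pair into a single 64-bit value. The layout below (container id in
the high 32 bits, pid in the low 32 bits) and all of the names are
assumptions made up for this illustration; nothing in the patches fixes
them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed form: container in bits 63..32, pid in bits 31..0. */
typedef uint64_t packed_kpid_t;

static inline packed_kpid_t kpid_pack(uint32_t container, uint32_t pid)
{
	assert(pid <= 0x7fffffffu);	/* pids are positive ints */
	return ((uint64_t)container << 32) | pid;
}

static inline uint32_t kpid_container(packed_kpid_t kpid)
{
	return (uint32_t)(kpid >> 32);
}

static inline uint32_t kpid_pid(packed_kpid_t kpid)
{
	return (uint32_t)(kpid & 0xffffffffu);
}
```

A single integer is cheap to compare and hash, but it only makes sense
inside the kernel, which is why it should never leak to user space.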

So far you have not addressed the issues of maintaining code in the
kernel tree.

Eric

2006-02-11 12:00:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

> [ ... skipped .... ]
>
>> @@ -788,6 +801,16 @@ fastcall NORET_TYPE void do_exit(long co
>> panic("Attempted to kill the idle task!");
>> if (unlikely(is_init(tsk)))
>> panic("Attempted to kill init!");
>> +
>> + /*
>> + * If we are the pspace leader it is nonsense for the pspace
>> + * to continue so kill everyone else in the pspace.
>> + */
>> + if (pspace_leader(tsk)) {
>> + tsk->pspace->flags = PSPACE_EXIT;
>> + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
>> + }
>> +
> <<<<
>
> 1.
> flags are neither atomic nor protected with any lock.

flags are atomic as they are a machine word. Setting them does not
require a read/modify/write, so they will either be written
or not written. Plus this allows write-sharing of the appropriate
cache line which is very polite (assuming the line is not shared with
something else).

I do need a big fat comment about this though.

There is serialization as there is only one process that can write
to this value from exactly one spot.

> So if it will be used later for something else in future, there will be 100%
> race. You also assigns them here, while everywhere in other places a bit is
> checked.

Agreed. I need to change the name to something like state and stop doing
bit tests, to make this clear.

> 2. due to 1) you code is buggy. in this respect do_exit() is not serialized with
> copy_process().

Yes. I may need a memory barrier in there. I need to think
about that a little more.

> 3. due to the same 1) reason
> > + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
> can miss a task being forked. Bang!!!

Well the only bad thing that can happen is that I get a process that
can run and observe pid == 1 has exited. So Bang!! is not too
painful.


For an initial implementation that is not yet ready for kernel
inclusion, and was posted to show the form I expect my patches
to take that wasn't too bad.

> 4.
> So you are effectively inserting a code for cleaning up pspace here and the same
> is actually required for other subsystems like networking/IPC etc.

So the SIGKILL is not about pspace clean up so much as making it
impossible to observe a state where pid == 1 is not running.

> I think you suppose that other resources are freed when the last reference is
> dropped, but to tell the truth this is a way to deadlocks. Because refs are put
> in too many places, where you can't make a real cleanup due to locking etc. You
> can't for example call syncronize_net() from bh, which is required for network
> cleanup.

Yes, reference counts can work, and do so regularly in the kernel.
Work queues will put you in a process context trivially, so
dropping the last ref from an irq handler and then needing process
context to do the cleanup is not a problem.
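
The pattern can be sketched in user space as follows. This is a
single-threaded toy with no locking; the pending list merely stands in
for a kernel work queue, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct obj {
	int refcount;
	int cleaned;		/* set once the deferred cleanup has run */
	struct obj *next_pending;
};

static struct obj *pending;	/* toy stand-in for a work queue */

static void obj_get(struct obj *o)
{
	assert(o && o->refcount > 0);
	o->refcount++;
}

/* May be called from a context that must not sleep: it only queues work. */
static void obj_put(struct obj *o)
{
	assert(o && o->refcount > 0);
	if (--o->refcount == 0) {
		o->next_pending = pending;
		pending = o;
	}
}

/* Runs later in "process context", where a sleeping cleanup is allowed. */
static int run_pending_cleanup(void)
{
	int n = 0;

	while (pending) {
		struct obj *o = pending;

		pending = o->next_pending;
		o->next_pending = NULL;
		o->cleaned = 1;	/* the expensive teardown would go here */
		n++;
	}
	return n;
}
```

The last put never performs the teardown itself, so it does not matter
which context happens to drop the final reference.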

> Just an another argument why containers are easier/better and why you will
> eventually end up with it.

Does not follow.

Kirill you are not making all of your reasoning explicit and making
mistakes in the gaps.

The steps you are making seem to be.
containers are simpler.
simpler means easier to get correct.
bugs will happen.
therefore you will use the simpler solution.

The above reasoning seems to assume that containers are not prone
to the same reference counting bugs as namespaces. Given that
we are solving the same problem in similar ways, this seems
unwarranted.

In addition the rejection of the basic container approach is largely
based upon lack of flexibility. No amount of additional simplicity
can make up for not meeting real world requirements.

Kirill there may be a case where your container concept performs
beautifully and simplifies the problem, but I have looked and I
have not seen it.

Eric

2006-02-13 08:43:45

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

>>2. Maybe I'm missing something in the discussions, but why is it required? if
>>you provide fully isolated pid spaces, why do you care for wid?
>>And how ptrace can work at all if cldstop reports some pid, but child is not
>>accessiable via ptrace()? and if it doesn't work, what are your changes for?
>
>
> A good question.
>
> My pid spaces while isolated from each other are connected together in
> something that resembles the mount relationship when mounting a filesystem.
why? is it really required?
doing it this way, you get all the issues with external refs we have
with a weak isolation, i.e. pspace init can be seen from outside and
inside (correct? or am I wrong?). This makes your approach essentially
the same as ours from checkpointing point of view BTW.

> The init process of a child pspace is visible in the parent pspace,
> by it's wid. So you should be able to ptrace pid == 1. The other
> children remain inaccessible directly.
>
> The idea was to do my very best to preserve the unix process tree when
> constructing a separate pid space
This looks more like you were trying to make fewer changes with it and
avoid fixing issues which can arise when pspaces are fully isolated.
Maybe I'm wrong.

> I have to admit I have not yet tested the ptrace corner case, or
> digested all of it's ramifications. Unless I have messed up somewhere
> it should just work.
>
> This visibility of the child tree by a single pid inside the parent
> tree is why I think my approach doesn't suffer from most of the
> problems a completely isolated pid space has.
which problems? Actually I can't see many problems with isolated pid
spaces. And isolated pid spaces don't require hierarchy, since they
are fully isolated.

> In fact I would even be willing to look at it as a global but
> hierarchical pid space if that proved an interesting case.
> So you could have something like pkill(int sig, int which, char *pid_path).
> Where pid_path is something like "1537/3/58", and which specified
> what kind of pid you are talking about (thread, pid, process group...)
And here you need to introduce new non-unix semantics, security checks
on lookups, etc.

> From a security standpoint I want the option of a fully isolated
> group of processes, so there is a possibility of keeping secrets from
> the system administrator, but there is no reason that needs to be tied
> to a pid space.

Kirill

2006-02-13 08:51:54

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>>+static inline int pspace_task_visible(struct pspace *pspace, struct
>>
>>task_struct *tsk)
>>
>>>+{
>>>+ return (tsk->pspace == pspace) ||
>>>+ ((tsk->pspace->child_reaper.pspace == pspace) &&
>>>+ (tsk->pspace->child_reaper.task == tsk));
>>
>><<< the logic with child_reaper which can be somehow partly inside pspace... and
>>this check is not that abvious.
>
>
> This is the check for what shows up in /proc.
>
> Given that is how I have explicitly documented things to work, (the
> init process straddles the boundary) I fail to see how it is not obvious.
I was confused by the fact that child_reaper.pspace is actually a parent
pspace.
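
A toy model may help with exactly this point. In the sketch below the
struct layouts are simplified assumptions, not the real definitions
from the patch; the only thing carried over is the shape of the
visibility test and the fact that child_reaper.pspace names the parent
pid space.

```c
#include <assert.h>
#include <stddef.h>

struct pspace;

struct task {
	struct pspace *pspace;	/* the pid space the task lives in */
};

struct pspace {
	struct {
		struct pspace *pspace;	/* the PARENT pid space */
		struct task *task;	/* the init task of this pid space */
	} child_reaper;
};

/* Same test as the quoted pspace_task_visible(), on the toy types. */
static int task_visible(const struct pspace *pspace, const struct task *tsk)
{
	assert(pspace && tsk && tsk->pspace);
	return (tsk->pspace == pspace) ||
		((tsk->pspace->child_reaper.pspace == pspace) &&
		 (tsk->pspace->child_reaper.task == tsk));
}
```

With a parent and a child pid space set up this way, the child's init
is visible from both, while an ordinary task in the child is visible
only from the child.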

Kirill

2006-02-13 09:00:51

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>Actually I can't say your patch is cleaner somehow.
>>It is very big and most of the changes are trivial, which creates an illusion
>>that it is straightforward and clean.
>
>
> It is hard to make a comparison. Your patch posted to the mail list
> was incomplete, and I could only find a giant patch for OpenVZ.
this is wrong. it lacked only alpha get_xpid().
sure, we didn't play the dirty games with /proc that you did, so maybe this is
why you have such a feeling.

> Beyond that if your patch introduced a type change for internal pids
> and used that generate compile errors when someone did not use the
> appropriate type, I would be a lot happier and the code would
> be a lot more maintainable. I.e. It would not take an audit of
> the kernel source to find the issues an allyesconfig build would find
> them for you.
it is not a problem at all and can be done.

> I don't think my current implementation actually causes enough compile
> errors, but I need think closely about it before I go much farther.
>
> Maintainable code is a delicate balancing act between things that
> trip you up when you get it wrong, and not being so cumbersome you
> get in the programmers way

> The advantages I see with my approach.
> - I have hierarchical pids so nesting is possible.
Once again, our approach doesn't prohibit hierarchical pids.
But in addition to yours it allows selecting whether weak or strong
isolation is required.

> - The state after migration is not suboptimal.
sorry?

> - I cause compiler errors which makes maintenance easier.
it is not a question about the approach itself, you see? so not an argument.

> - Other kernel developers gut feel is that (container, pid) is the proper
> representation.
> I actually flip flop on the issue of if I want the internal representation
> to be (container, pid) or a magic kpid that combines the into one integer.
> I know I don't want the kpid to be user space visible though.
>
> So far you have not addressed the issues of maintaining code in the
> kernel tree.
Uhhh... I see that you have no real arguments. Nice.

Kirill

2006-02-13 09:20:54

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>1.
>>flags are neither atomic nor protected with any lock.
>
>
> flags are atomic as they are a machine word. So they do not
> require a read/modify write so they will either be written
> or not written. Plus this allows write-sharing of the appropriate
> cache line which is very polite (assuming the line is not shared with
> something else)
Eric I'm familiar with SMP, thanks :)
Why do you write all this if you agreed below that you have problems with it?

>>2. due to 1) you code is buggy. in this respect do_exit() is not serialized with
>>copy_process().
> Yes. I may need a memory barrier in there. I need to think
> about that a little more.
memory barrier doesn't help. you really need to think about it.

>>3. due to the same 1) reason
>> > + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
>>can miss a task being forked. Bang!!!
>
> Well the only bad thing that can happen is that I get a process that
> can run and observe pid == 1 has exited. So Bang!! is not too
> painful.
And what about references to pspace->child_reaper which was freed already?

[skipped the flood]

Kirill

2006-02-13 16:40:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Sat, 2006-02-11 at 03:43 -0700, Eric W. Biederman wrote:
> Kirill Korotaev <[email protected]> writes:
> >> +static inline int pspace_task_visible(struct pspace *pspace, struct
> > task_struct *tsk)
> >> +{
> >> + return (tsk->pspace == pspace) ||
> >> + ((tsk->pspace->child_reaper.pspace == pspace) &&
> >> + (tsk->pspace->child_reaper.task == tsk));
> > <<< the logic with child_reaper which can be somehow partly inside pspace... and
> > this check is not that abvious.
>
> This is the check for what shows up in /proc.
>
> Given that is how I have explicitly documented things to work, (the
> init process straddles the boundary) I fail to see how it is not obvious.

I'd claim that the (tsk->pspace == pspace) test is pretty obvious.

However, the child_reaper one takes a little deduction. Sometimes, I
think separating out even trivial functions into even trivialler :)
functions really does make sense for these. They can be really
confusing. BTW, I _still_ don't understand exactly what this is doing,
but I haven't had any coffee.

Is something like this more clear?

static inline int pspace_child_reaper_is_task(struct pspace *pspace,
					      struct task_struct *tsk)
{
	if ((tsk->pspace->child_reaper.pspace == pspace) &&
	    (tsk->pspace->child_reaper.task == tsk))
		return 1;

	return 0;
}

static inline int pspace_task_visible(struct pspace *pspace,
				      struct task_struct *tsk)
{
	if (tsk->pspace == pspace)
		return 1;

	/*
	 * Init tasks straddle namespaces. They have the explicit
	 * pspace of their parent, but are visible from their
	 * children.
	 */
	if (pspace_child_reaper_is_task(pspace, tsk))
		return 1;

	return 0;
}

-- Dave

2006-02-13 17:02:18

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Quoting Kirill Korotaev ([email protected]):
> >>1.
> >>flags are neither atomic nor protected with any lock.
> >
> >
> >flags are atomic as they are a machine word. So they do not
> >require a read/modify write so they will either be written
> >or not written. Plus this allows write-sharing of the appropriate
> >cache line which is very polite (assuming the line is not shared with
> >something else)
> Eric I'm familiar with SMP, thanks :)
> Why do you write all this if you agreed below that have problems with it?
>
> >>2. due to 1) you code is buggy. in this respect do_exit() is not
> >>serialized with
> >>copy_process().
> >Yes. I may need a memory barrier in there. I need to think
> >about that a little more.
> memory barrier doesn't help. you really need to think about.
>
> >>3. due to the same 1) reason
> >>> + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
> >>can miss a task being forked. Bang!!!
> >
> >Well the only bad thing that can happen is that I get a process that
> >can run and observe pid == 1 has exited. So Bang!! is not too
> >painful.
> And what about references to pspace->child_reaper which was freed already?

This seems a very valid point. And even if you implement code to detect
when a process exits whether it is a child_reaper for some pspace, you
can't just make pspace->child_reaper = pspace->child_reaper->child_reaper,
as the wid may not be valid in the grandparent's namespace, right?

-serge

2006-02-13 21:31:42

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Quoting Kirill Korotaev ([email protected]):
> >So far you have not addressed the issues of maintaining code in the
> >kernel tree.

> Uhhh... I see that you have no real arguments. Nice.

Au contraire, this is a central issue which you have yet to address.
How do you suggest helping maintainers in the future know which pid they
are working with?

For our vpid patchset we had considered using sparse to make sure that
the real ('kernel') and user ('virtual') pids were never used in place
of each other without going through a conversion. That could be one
solution. Using opaque typedefs is of course another solution. But I
suspect there will be places in the code where it simply isn't clear
at first glance which pid you'd want. And if it's not clear at first
glance, then it's likely to be done wrong in at least some of those
places.
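
For reference, the sparse variant could look roughly like the sketch
below. The bitwise and force attributes are the ones sparse understands
when __CHECKER__ is defined (the kernel spells the macros __bitwise and
__force); the macro names PID_BITWISE and PID_FORCE, the two typedefs,
and the "vpid = kpid - base" mapping are assumptions invented for this
example and do not come from the vpid patch set.

```c
#include <assert.h>

#ifdef __CHECKER__
# define PID_BITWISE __attribute__((bitwise))
# define PID_FORCE   __attribute__((force))
#else
# define PID_BITWISE
# define PID_FORCE
#endif

/* Under sparse these are two distinct types; under gcc both are int. */
typedef int PID_BITWISE kpid_t;	/* real ('kernel') pid */
typedef int PID_BITWISE vpid_t;	/* user ('virtual') pid */

/* Toy mapping, an assumption for this sketch: vpid = kpid - base. */
static inline vpid_t kpid_to_vpid(kpid_t kpid, int base)
{
	assert(base >= 0);
	return (PID_FORCE vpid_t)((PID_FORCE int)kpid - base);
}

static inline kpid_t vpid_to_kpid(vpid_t vpid, int base)
{
	assert(base >= 0);
	return (PID_FORCE kpid_t)((PID_FORCE int)vpid + base);
}
```

A normal compile sees only ints, so nothing is enforced unless the
checker is actually run, whereas an opaque struct typedef fails in every
build.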

-serge

2006-02-13 23:45:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

>>>Actually I can't say your patch is cleaner somehow.
>>>It is very big and most of the changes are trivial, which creates an illusion
>>>that it is straightforward and clean.
>> It is hard to make a comparison. Your patch posted to the mail list
>> was incomplete, and I could only find a giant patch for OpenVZ.
> this is wrong. it lacked only alpha get_xpid().
> sure, we didn't do that dirty games with /proc you did, so maybe this is why you
> have such a feeling.

Not at all.
- You didn't have a system call interface.
- You didn't include your patches to proc.
- The fact that you missed several whole subsystems, (as did I)
is a minor issue.
etc.

The patches we posted were a different subset of the whole, and as such
a clear comparison could not be made.

>> Beyond that if your patch introduced a type change for internal pids
>> and used that generate compile errors when someone did not use the
>> appropriate type, I would be a lot happier and the code would
>> be a lot more maintainable. I.e. It would not take an audit of
>> the kernel source to find the issues an allyesconfig build would find
>> them for you.
> it is not a problem at all and can be done.

Then why isn't it done in OpenVZ?

I agree it can be done. But an audit of the entire kernel is painful.

>> The advantages I see with my approach.
>> - I have hierarchical pids so nesting is possible.
> Once again, our approach doesn't prohibit hierarchical pids.
> But in addition to yours it allows to select whether weak or strong isolation is
> required.

Again. I have not heard this assertion before, and I don't see it.
But let me ask a question that helps frame the issue.

Can I migrate all of the processes and (containers) on a real server
(not in a container) into a container on another machine without
any change in pids?

I believe that would require 3 on the inner most nested processes with
your approach.


>> - The state after migration is not suboptimal.
> sorry?

You have 2 approaches, in your patch. I have one approach that works
for everything. Possibly I misunderstood my skimming, but usually
optimizations are only added when needed.

>> - I cause compiler errors which makes maintenance easier.
> it is not a question to the approach itself, you see? so not an argument.

How the code will be maintained when discussing merging with the mainline
kernel is one of the most important issues to consider.

>> - Other kernel developers gut feel is that (container, pid) is the proper
>> representation. I actually flip flop on the issue of if I want the
>> internal representation
>> to be (container, pid) or a magic kpid that combines the into one integer.
>> I know I don't want the kpid to be user space visible though.
>> So far you have not addressed the issues of maintaining code in the
>> kernel tree.
> Uhhh... I see that you have no real arguments. Nice.

I wasn't trying to argue, simply to lay it out as I saw it.

Eric

2006-02-14 00:06:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 01/20] pid: Intoduce the concept of a wid (wait id)

Kirill Korotaev <[email protected]> writes:

>>>2. Maybe I'm missing something in the discussions, but why is it required? if
>>>you provide fully isolated pid spaces, why do you care for wid?
>>>And how ptrace can work at all if cldstop reports some pid, but child is not
>>>accessiable via ptrace()? and if it doesn't work, what are your changes for?
>> A good question.
>> My pid spaces while isolated from each other are connected together in
>> something that resembles the mount relationship when mounting a filesystem.
> why? is it really required?

A very good question. Supporting waitpid and the normal process management
semantics seems to be a requirement for usability. Completely separate pid
spaces would probably be easier to implement. However, we would then need
to implement another interface like waitpid to allow for notifications when
all of the processes have exited.

I have not seen any alternatives being proposed to handle that case.

> doing it this way, you get all the issues with external refs we have with a weak
> isolation, i.e. pspace init can be seen from outside and inside (correct? or am
> I wrong?).

Largely correct. Looking at it from the outside is like looking at a thread
group leader. Sometimes you see the leader sometimes you see the entire
group of processes.

> This makes your approach eseentially the same as ours from
> checkpointing point of view BTW.

Sure. Any attempt to build a general purpose mechanism will suffer from
that. But it is easy enough to drop the external references from the
init processes, and then their children don't have a chance to get them.


>> The init process of a child pspace is visible in the parent pspace,
>> by it's wid. So you should be able to ptrace pid == 1. The other
>> children remain inaccessible directly.
>> The idea was to do my very best to preserve the unix process tree when
>> constructing a separate pid space
> This more looks like you were trying to make less changes with it and avoid
> fixing issus which can arise when pspaces are fully isolated. Maybe I'm wrong.

It's not sloth. Simply trying to extend the existing API. Not changing
what doesn't need to be changed is a virtue.

But yes, fully isolated pid spaces are unmanageable unless you add an API
to deal with them. My API is using the wid to allow a parent process to manage
it essentially like normal.
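
"Like normal" here means the ordinary fork/waitpid pattern, sketched
below with plain POSIX calls. Nothing in it is specific to pid spaces;
the child merely stands in for the init of a child pspace, and
run_child_and_wait is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with exit_code, reap it, return its status. */
static int run_child_and_wait(int exit_code)
{
	int status;
	pid_t child, reaped;

	assert(exit_code >= 0 && exit_code <= 255);
	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
		_exit(exit_code);	/* child: stands in for a pspace init */

	/* The parent names the child by the pid that fork() returned. */
	reaped = waitpid(child, &status, 0);
	if (reaped != child || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With the wid, the value the parent gets back and later passes to
waitpid() is a pid in the parent's own pid space, so code like this
would keep working unchanged.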

>> I have to admit I have not yet tested the ptrace corner case, or
>> digested all of it's ramifications. Unless I have messed up somewhere
>> it should just work.
>> This visibility of the child tree by a single pid inside the parent
>> tree is why I think my approach doesn't suffer from most of the
>> problems a completely isolated pid space has.
> which problems? Actually I can't see much problems with isolated pid spaces. And
> isolated pid spaces doesn't require hierarchy, since they are fully isolated.

You also don't find them interesting enough to implement. Or else we are using
the same terms to mean different things.

>> In fact I would even be willing to look at it as a global but
>> hierarchical pid space if that proved an interesting case.
>> So you could have something like pkill(int sig, int which, char *pid_path).
>> Where pid_path is something like "1537/3/58", and which specified
>> what kind of pid you are talking about (thread, pid, process group...)
> And here you need to introduce new non-unix semantics, security checks on
> lookups, etc.

Sure, you need to introduce new semantics, etc. But nothing more
than what you already have in OpenVZ. The only necessary difference
is in how you name the process. The target process is still being
manipulated by a process outside of its scope of existence.

So if the need for doing that is as great as you say, it is something
that is doable, and there may be other methods that are easier to work
with.

Personally I only implemented as much manipulation of the init
processes as I did because it came for free. I don't see the need
to manipulate a process in another pid namespace.

Eric

2006-02-14 05:57:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

>>>1.
>>>flags are neither atomic nor protected with any lock.
>> flags are atomic as they are a machine word. So they do not
>> require a read/modify write so they will either be written
>> or not written. Plus this allows write-sharing of the appropriate
>> cache line which is very polite (assuming the line is not shared with
>> something else)
> Eric I'm familiar with SMP, thanks :)
> Why do you write all this if you agreed below that have problems with it?

To establish a baseline of understanding and because you made an assertion
that is counter to my understanding.

>>>2. due to 1) you code is buggy. in this respect do_exit() is not serialized
> with
>>>copy_process().
>> Yes. I may need a memory barrier in there. I need to think
>> about that a little more.
> memory barrier doesn't help. you really need to think about.

Except for instances where you need an atomic read/modify/write the
only primitives you have are reads, writes and barriers.

I have all of the correct reads and writes therefore the only thing
that could help are barriers if the logic is otherwise sane.

A write barrier between the write of flags and the sending of the kill
signal will ensure the write of flags is globally visible
first. Although I am having a hard time convincing myself even that
matters.
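
The ordering in question can be sketched in user space with C11
atomics, where a release store plays the role that a write barrier such
as smp_wmb() followed by the signal would play in the kernel. The struct,
the sigkill_sent field and the value of PSPACE_EXIT are assumptions for
this illustration, and the sketch only shows the ordering; it does not by
itself close the fork race raised in this thread.

```c
#include <assert.h>
#include <stdatomic.h>

#define PSPACE_EXIT 1	/* value is an assumption for this sketch */

struct toy_pspace {
	atomic_int state;		/* written only by the exiting leader */
	atomic_int sigkill_sent;	/* stands in for signal delivery */
};

/* Leader side: make the state visible before the "signal" is. */
static void leader_exit(struct toy_pspace *ps)
{
	assert(ps);
	atomic_store_explicit(&ps->state, PSPACE_EXIT, memory_order_relaxed);
	/*
	 * Release: the store to state above is visible to any thread that
	 * observes sigkill_sent == 1 with an acquire load.
	 */
	atomic_store_explicit(&ps->sigkill_sent, 1, memory_order_release);
}

/* Observer side: returns 1 if a seen signal implies a seen PSPACE_EXIT. */
static int observer_consistent(struct toy_pspace *ps)
{
	assert(ps);
	if (!atomic_load_explicit(&ps->sigkill_sent, memory_order_acquire))
		return 1;	/* nothing observed yet, nothing to check */
	return atomic_load_explicit(&ps->state, memory_order_relaxed)
		== PSPACE_EXIT;
}
```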

>>>3. due to the same 1) reason
>>> > + kill_pspace_info(SIGKILL, (void *)1, tsk->pspace);
>>>can miss a task being forked. Bang!!!
>>
>> Well the only bad thing that can happen is that I get a process that
>> can run and observe pid == 1 has exited. So Bang!! is not too
>> painful.
> And what about references to pspace->child_reaper which was freed already?

That assumes that release_task() is called synchronously with do_exit,
which is not the case. Looking at the code I do think release_task()
for the pspace leader can be called too soon. But that really
has nothing to do with whether or not all of its children got sent
SIGKILL.

That is a significant issue, that needs to be fixed before I submit
this piece of code for inclusion into the kernel.

The issue, depending on the context, is that a process actively running
in kernel space could proceed for a long time before it returns to user
space and receives a signal. In that span of time it could execute just
about any code in the kernel.

Kirill thank you for spotting this.

This exchange seems to have a hostile and not a cooperative tone so
I will finish the investigation and bug fixing elsewhere.


I expect that there might be a few more issues like this. My only
expectation was that the code was complete enough to discuss semantics
and kernel mechanisms for creating namespaces, and the code has
successfully served that purpose.

Eric








2006-02-14 06:02:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

> This seems a very valid point. And even if you implement code to detect
> when a process exits whether it is a child_reaper for some pspace, you
> can't just make pspace->child_reaper = pspace->child_reaper->child_reaper,
> as the wid may not be valid in the grandparent's namespace, right?

Right. What I have done is that, in the PSPACE_EXIT condition, all children
self-reap when they die and have their parent set to the global process
reaper. However, since exit_signal == -1, the initial process reaper
never sees them.

Even for a nested pspace I figure it is an error for anything to outlive
init. The wid not working is a valid reason for this; however, if it was
really necessary, a new wid value could be allocated, as nothing on
the inside of a pspace is directly aware of it.

The situation is very similar to being behind a NAT firewall.

Eric

2006-02-14 18:46:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Mon, 2006-02-13 at 22:54 -0700, Eric W. Biederman wrote:
> That is a significant issue, that needs to be fixed before I submit
> this piece of code for inclusion into the kernel.

Is there a chance you could re-post your current set, or throw them out
on kernel.org? The thread is a week old now, and I would imagine
there's been a lot of churn. :)

-- Dave

2006-02-15 12:06:59

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Hello Eric,

>>memory barrier doesn't help. you really need to think about.
> Except for instances where you need an atomic read/modify/write the
> only primitives you have are reads, writes and barriers.
>
> I have all of the correct reads and writes therefore the only thing
> that could help are barriers if the logic is otherwise sane.
the problem is that you use plain reads/writes to synchronize other
operations (fork vs. pspace leader death).
i.e. you try to build serialization out of them, you see? It doesn't work this way.
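
One way to read this objection, sketched in user-space C: the fork path must test the exit flag and link the new task in the same critical section in which the exit path sets the flag and walks the task list. Everything here (the struct, the spinlock, the function names) is a hypothetical model, not code from the patch or from the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TASKS 8

/* Hypothetical stand-in for a pid namespace. */
struct pspace_model {
    atomic_flag lock;         /* one lock shared by fork and exit paths */
    bool exiting;             /* plays the role of the PSPACE_EXIT flag */
    int ntasks;               /* number of linked tasks */
    bool killed[MAX_TASKS];   /* killed[i]: task i was sent the signal */
};

static void pspace_model_init(struct pspace_model *ps)
{
    atomic_flag_clear(&ps->lock);
    ps->exiting = false;
    ps->ntasks = 0;
    for (int i = 0; i < MAX_TASKS; i++)
        ps->killed[i] = false;
}

static void model_lock(struct pspace_model *ps)
{
    while (atomic_flag_test_and_set_explicit(&ps->lock, memory_order_acquire))
        ;  /* spin until the lock is free */
}

static void model_unlock(struct pspace_model *ps)
{
    atomic_flag_clear_explicit(&ps->lock, memory_order_release);
}

/* Fork side: flag test and list insertion happen under one lock.
 * Returns the new task's slot, or -1 if the namespace is exiting. */
static int model_fork(struct pspace_model *ps)
{
    int slot = -1;

    model_lock(ps);
    if (!ps->exiting && ps->ntasks < MAX_TASKS)
        slot = ps->ntasks++;
    model_unlock(ps);
    return slot;
}

/* Exit side: set the flag and signal every linked task under the
 * same lock, so no fork can land between the two steps.
 * Returns the number of tasks signalled. */
static int model_leader_exit(struct pspace_model *ps)
{
    int n;

    model_lock(ps);
    ps->exiting = true;
    for (n = 0; n < ps->ntasks; n++)
        ps->killed[n] = true;
    model_unlock(ps);
    return n;
}
```

With the lock, a fork either completes before the walk (and is signalled) or starts after the flag is set (and is refused). A barrier alone gives neither guarantee, because the flag test and the list insertion remain two separate steps.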

....

> This exchange seems to have a hostile and not a cooperative tone so
> I will finish the investigation and bug fixing elsewhere.
Eric, I think it turned this way because you started pointing to our
bugs in VPIDs, while having lots of your own non-working code.
I can point you to other _really_ disastrous bugs in your IPC and
networking virtualization. But discussing bugs is not what I want, you
see? I want us to make some solution suitable for all the parties.

And while you do not have a working solution it is hard to discuss whether it
is good or not, whether it is beautiful or not. It is incomplete and
doesn't work. So many statements that one solution is better than
another are not valid.

And we are too fixated on PIDs. Why? I think this is the most minor
feature which can easily live out of mainstream if needed, while
virtualization is the main goal.
I also don't understand why you are eager to introduce new syscalls
like pkill(path_to_process), but are trying to use waitpid() for pspace
death notifications? Maybe it is simply better to introduce container
specific syscalls which should be clean and tidy, instead of messing
things up with clone()/waitpid() and so on? The simpler the code we
have, the better it is for all of us.

If possible keep posting your patches. I would even ask you to add me to
CC always. You are doing a really good job and discussing solutions is
the only possible way to push something to mainstream I suppose.
I would also be happy to arrange a meeting with the interested parties,
since talking eye to eye can be much more productive.

> I expect that there might be a few more issues like this. My only
> expectation was that the code was complete enough to discuss semantics
> and kernel mechanisms for creating a namespaces, and the code has
> successfully served that purpose.
To some extent yes.

Kirill

2006-02-15 13:31:35

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Wed, Feb 15, 2006 at 03:07:14PM +0300, Kirill Korotaev wrote:
> Hello Eric,
>
> >>memory barrier doesn't help. you really need to think about.
> >Except for instances where you need an atomic read/modify/write the
> >only primitives you have are reads, writes and barriers.
> >
> >I have all of the correct reads and writes therefore the only thing
> >that could help are barriers if the logic is otherwise sane.
> the problem is that you do read/write for synchronization of other
> operations (fork/pspace leader die).
> i.e. you try to make a serialization, you see? It doesn't work this way.
>
> ....
>
> >This exchange seems to have a hostile and not a cooperative tone so
> >I will finish the investigation and bug fixing elsewhere.
> Eric, I think it turned this way because you started pointing to our
> bugs in VPIDs, while having lots of own and not working code.
> I can point you another _really_ disastrous bugs in your IPC and
> networking virtualization. But discussing bugs is not want I want, you
> see? I want us to make some solution suatable for all the parties.

that's good!

> And while you have not working solution it is hard to discuss whether
> it is good or not, whether it is beatutiful or not. It is incomplete
> and doesn't work. So many statements that one solution is better than
> another are not valid.

well, it's kind of moot to demand a final solution just
to discuss the implementation ...
usually software design happens before the implementation
and the implementation details can be fed back into the
cycle later ...

> And we are too cycled on PIDs. Why? I think this is the most minor
> feature which can easily live out of mainstream if needed, while
> virtualization is the main goal.

although I could be totally ignorant regarding the PID
stuff, it seems to me that it pretty well reflects the
requirements for virtualization in general, while being
simple enough to check/test several solutions ...

why? simple: it reflects the way 'containers' work in
general, it has all the issues with administration and
setup, similar to 'guests' (it requires some management
and access from outside, while providing security and
isolation inside), containers could be easily built on
top of that or as an addition to the pid space structures
at the same time it's easy to test, and issues will show
up quite early, so that they can be discussed and, if
necessary, solved without rewriting an entire framework.

> I also don't understand why you are eager to introduce new sys calls
> like pkill(path_to_process), but is trying to use waitpid() for pspace
> die notifications? Maybe it is simply better to introduce container
> specific syscalls which should be clean and tidy, instead of messing
> things up with clone()/waitpid() and so on? The more simple code we
> have, the better it is for all of us.

now that you mention it, maybe we should have a few
rounds on what those 'generic' container syscalls would
look like?

best,
Herbert

> If possible keep posting your patches. I would even ask you to add me to
> CC always. You are doing a really good job and discussion solutions is
> the only possible way to push something to mainstream I suppose.
> I would also be happy to arrange a meeting with the interested parties,
> since talking eye to eye can be much more productive.
>
> >I expect that there might be a few more issues like this. My only
> >expectation was that the code was complete enough to discuss semantics
> >and kernel mechanisms for creating a namespaces, and the code has
> >successfully served that purpose.
> To some extent yes.
>
> Kirill

2006-02-15 18:49:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Kirill Korotaev <[email protected]> writes:

> Hello Eric,
>
>>>memory barrier doesn't help. you really need to think about.
>> Except for instances where you need an atomic read/modify/write the
>> only primitives you have are reads, writes and barriers.
>> I have all of the correct reads and writes therefore the only thing
>> that could help are barriers if the logic is otherwise sane.
> the problem is that you do read/write for synchronization of other operations
> (fork/pspace leader die).
> i.e. you try to make a serialization, you see? It doesn't work this way.

I'm missing something, and unfortunately that usually seems to be the case
when discussing problems of this kind. When we pick up discussing bugs
again, please take small, simple steps. We have different backgrounds,
different baseline assumptions, and different points of view, so what
is obvious to you is not obvious to me and vice versa.

> ....
>
>> This exchange seems to have a hostile and not a cooperative tone so
>> I will finish the investigation and bug fixing elsewhere.
> Eric, I think it turned this way because you started pointing to our bugs in
> VPIDs, while having lots of own and not working code.
> I can point you another _really_ disastrous bugs in your IPC and networking
> virtualization. But discussing bugs is not want I want, you see? I want us to
> make some solution suatable for all the parties.

But even with an incomplete implementation we can discuss each other's approaches.

Basically what is of interest right now are the userspace interface,
the semantics we are trying to implement, and the maintenance issues
for the users of the code.

> And while you have not working solution it is hard to discuss whether it is good
> or not, whether it is beatutiful or not. It is incomplete and doesn't work. So
> many statements that one solution is better than another are not valid.

:)

I would draw a distinction between a mostly complete but buggy implementation
and an incomplete implementation. When you can run the code and see it doing
things, some of the subtle issues are interesting.

> And we are too cycled on PIDs. Why? I think this is the most minor feature which
> can easily live out of mainstream if needed, while virtualization is the main
> goal.

> I also don't understand why you are eager to introduce new sys calls like
> pkill(path_to_process), but is trying to use waitpid() for pspace die
> notifications?

I am not eager to introduce new syscalls. But I am willing to consider it
when we reach far enough past a certain point.

> Maybe it is simply better to introduce container specific
> syscalls which should be clean and tidy, instead of messing things up with
> clone()/waitpid() and so on? The more simple code we have, the better it is for
> all of us.

Could be. Propose it and we can discuss it. I suspect you already have something
in mind.

> If possible keep posting your patches. I would even ask you to add me to CC
> always. You are doing a really good job and discussion solutions is the only
> possible way to push something to mainstream I suppose.
> I would also be happy to arrange a meeting with the interested parties, since
> talking eye to eye can be much more productive.

Thanks. As for a meeting, given that there are at least 4 different groups
in at least the USA and Europe, arranging it could be hard. Currently I am aiming
at the Ottawa Linux Symposium in June. Hopefully we won't have major
unresolved issues by then, but if we do we really need to get together
face to face.


Eric

2006-02-16 13:45:11

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Quoting Herbert Poetzl ([email protected]):
> > I also don't understand why you are eager to introduce new sys calls
> > like pkill(path_to_process), but is trying to use waitpid() for pspace
> > die notifications? Maybe it is simply better to introduce container
> > specific syscalls which should be clean and tidy, instead of messing
> > things up with clone()/waitpid() and so on? The more simple code we
> > have, the better it is for all of us.
>
> now that you mention it, maybe we should have a few
> rounds how those 'generic' container syscalls would
> look like?

I still like the following:

clone(): extended with flags for asking for a private copy of various
namespaces. For the CLONE_NEWPIDSPACE flag, the pid which
is returned to the parent process is its handle to the
new pidspace.

sys_execpid(pid, argv, envp): a new syscall that execs with the requested
pid, if the pid is available. Else either return an error,
or pick a random pid. Error makes sense to me, but that's
probably debatable.

sys_fork_slide(pid): fork and slide into the pidspace belonging to pid.
This way we can do things like

    p = sys_fork_slide(2794);
    if (p == 0) {
        kill(2974);
    } else {
        waitpid(p, 0, 0);
    }

Ok, this last one in particular needs to be improved, but these two
syscalls and the clone flags together give us all we need. Right?

-serge

2006-02-20 09:25:27

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>>I also don't understand why you are eager to introduce new sys calls
>>>like pkill(path_to_process), but is trying to use waitpid() for pspace
>>>die notifications? Maybe it is simply better to introduce container
>>>specific syscalls which should be clean and tidy, instead of messing
>>>things up with clone()/waitpid() and so on? The more simple code we
>>>have, the better it is for all of us.
>>
>>now that you mention it, maybe we should have a few
>>rounds how those 'generic' container syscalls would
>>look like?
>
>
> I still like the following:
>
> clone(): extended with flags for asking a private copy of various
> namespaces. For the CLONE_NEWPIDSPACE flag, the pid which
> is returned to the parent process is it's handle to the
> new pidspace.
- clone has a limited number of flags.
- sooner or later you will need to add flags about what and how you want
to clone (e.g. pspace with weak or strong isolation and so on)

> sys_execpid(pid, argv, envp): exec a new syscall with the requested
> pid, if the pid is available. Else either return an error,
> or pick a random pid. Error makes sense to me, but that's
> probably debatable.
the problem is that in real-life environments, where executables can be
substituted, this is kind of a security issue. Also I really hate the
idea of using exec() for changing something.
In OpenVZ we successfully do context changes without exec()'s.

> sys_fork_slide(pid): fork and slide into the pidspace belong to pid.
> This way we can do things like
>
> p = sys_fork_slide(2794);
> if (p == 0) {
> kill(2974);
> } else {
> waitpid(p, 0, 0);
> }
> Ok, this last one in particular needs to be improved, but these two
> syscalls and the clone flags together give us all we need. Right?
Again, you concentrate on PIDspaces, forgetting about all the other
namespaces.

I would prefer:

- sys_ns_create()
creates a namespace and makes a task inherit this namespace. If
_needed_, it _can_ fork inside.

- sys_ns_inherit()
changes the active namespace.

But how should we reference a namespace? By global ID?
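
To make the question concrete, here is a small user-space model of the "global ID" option. The table, the structs and the ns_create()/ns_inherit() signatures are all assumptions for illustration; the proposal above does not specify any arguments.

```c
#define NS_MAX 16

/* Hypothetical namespace table indexed by global ID. */
struct ns_model {
    int in_use;
};

/* Hypothetical task that only records which namespace it is in. */
struct task_model {
    int ns_id;
};

/* Slot 0 models the initial (host) namespace and is always in use. */
static struct ns_model ns_table[NS_MAX] = { [0] = { 1 } };

/* Analogue of sys_ns_create(): allocate a namespace, move the calling
 * task into it, and return its global ID (or -1 if the table is full). */
static int ns_create(struct task_model *t)
{
    for (int id = 1; id < NS_MAX; id++) {
        if (!ns_table[id].in_use) {
            ns_table[id].in_use = 1;
            t->ns_id = id;
            return id;
        }
    }
    return -1;
}

/* Analogue of sys_ns_inherit(): switch the task to an existing
 * namespace named by its global ID. Returns 0 on success, -1 if the
 * ID does not name a live namespace (the task is left unchanged). */
static int ns_inherit(struct task_model *t, int id)
{
    if (id < 0 || id >= NS_MAX || !ns_table[id].in_use)
        return -1;
    t->ns_id = id;
    return 0;
}
```

A global ID makes a namespace easy to name from anywhere, but the ID space itself is then a shared global resource, which is the kind of thing the namespaces are meant to isolate. The model leaves that tension unresolved.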

Kirill

2006-02-20 11:46:21

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>And we are too cycled on PIDs. Why? I think this is the most minor
>>feature which can easily live out of mainstream if needed, while
>>virtualization is the main goal.
>
>
> although I could be totally ignorant regarding the PID
> stuff, it seems to me that it pretty well reflects the
> requirements for virtualization in general, while being
> simple enough to check/test several solutions ...
>
> why? simple: it reflects the way 'containers' work in
> general, it has all the issues with administration and
> setup, similar to 'guests' (it requires some management
> and access from outside, while providing security and
> isolation inside), containers could be easily built on
> top of that or as an addition to the pid space structures
> at the same time it's easy to test, and issues will show
> up quite early, so that they can be discussed and, if
> necessary, solved without rewriting an entire framework.
I would disagree with you. These discussions IMHO led us in the wrong
direction.

Can I ask a bunch of questions which are related to other virtualization
issues, but which are not addressed by Eric at all?

- How are you planning to make hierarchical namespaces for such
resources as IPC? Sockets? Unix sockets?
The process tree is hierarchical by its nature. But what are hierarchical
IPC and other resources?
And no one ever told me why we need hierarchy at all. No _real_ use
cases at all. But it's ok.

- Eric wants to introduce namespaces, but totally forgets how much they
are tied to each other. IPC refers to netlinks, network refers to proc
and sysctl, etc. There are only rare cases where you will be able to use
namespaces without full OpenVZ/vserver containers.
But yes, you are right, it will be quite easy for us to use containers on
top of namespaces :)

- How long do these namespaces live? And which management interface will
each of them have?
For example, can you destroy some namespace? Or is it automatically
destroyed when the last reference to it is dropped? This is not as
simple a question as it may seem, especially taking into account
that some namespaces can live longer (e.g. time wait buckets should live
some time after the container died; or shm which also can die in a foreign
context...).

So I really hate that we concentrated on discussion of VPIDs, while
there are more general and global questions on the whole virtualization
itself.

> now that you mention it, maybe we should have a few
> rounds how those 'generic' container syscalls would
> look like?

First of all, I don't think syscalls like
"do_something and exec" should be introduced.

Next, these syscall interfaces can be introduced only after we
have discussed the _general_ concepts, like: how these containers exist,
live, are created, waited on and so on. But this is impossible to discuss
on the PID example only. Because:
1. pids are directly related to process lifetime. Not many issues here.
2. pids are hierarchical by their nature.
3. there are many more approaches here than in networking, for example.

Kirill

2006-02-20 12:34:30

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Mon, Feb 20, 2006 at 02:29:14PM +0300, Kirill Korotaev wrote:
> >>And we are too cycled on PIDs. Why? I think this is the most minor
> >>feature which can easily live out of mainstream if needed, while
> >>virtualization is the main goal.
> >
> >
> >although I could be totally ignorant regarding the PID
> >stuff, it seems to me that it pretty well reflects the
> >requirements for virtualization in general, while being
> >simple enough to check/test several solutions ...
> >
> >why? simple: it reflects the way 'containers' work in
> >general, it has all the issues with administration and
> >setup, similar to 'guests' (it requires some management
> >and access from outside, while providing security and
> >isolation inside), containers could be easily built on
> >top of that or as an addition to the pid space structures
> >at the same time it's easy to test, and issues will show
> >up quite early, so that they can be discussed and, if
> >necessary, solved without rewriting an entire framework.

> I would disagree with you. These discussions IMHO led us to the wrong
> direction.
>
> Can I ask a bunch of questions which are related to other
> virtualization issues, but which are not addressed by Eric anyhow?
>
> - How are you planning to make hierarchical namespaces for such
> resources as IPC? Sockets? Unix sockets?

in the same way as for resources or filesystems -
management is within the parent, usage within the
child

> Process tree is hierarchical by it's nature. But what is heirarchical
> IPC and other resources?

for resources it's simple, you have to 'give away'
a certain share to your children, which in turn will
enable them to utilize them up to the given amount,
or (depending on the implementation) up to the total
amount of the parent.

(check out ckrm as a proof of concept, and example
how it should not be done :)

> And no one ever told me why we need heierarchy at all. No any _real_
> use cases. But it's ok.

there are many use cases here, first, the VPS within
a VPS (of course, not the most useful one, but the
most obvious one), then there are all kinds of test,
build and security scenarios which can benefit from
hierarchical structures and 'delegated' administrative
power (just think web space management)

> - Eric wants to introduce name spaces, but totally forgots how much
> they are tied with each other. IPC refers to netlinks, network refers
> to proc and sysctl, etc. It is some rare cases when you will be able
> to use namespaces without full OpenVZ/vserver containers.

well, we already visited the following:

- filesystem namespaces (work quite fine completely
independent of all others)

- pid spaces (again they are quite fine without any
other namespace)

- resource spaces (another one which makes perfect
sense to have without the others)

the fact that some things like proc or netlink are tied
to networking and ipc IMHO is just a sign that those
interfaces need revisiting and proper isolation or
virtualization ...

> But yes, you are right it will be quite easy for us to use
> containers on top of namespaces :)
>
> - How long these namespaces live? And which management interface each
> of them will have?

well, the lifetime of such a space is very likely to
be at least the time of the users, including all
created sockets, file-handles, ipc resources, etc ...

> For example, can you destroy some namespace?

not directly, well, you 'could' make some kind of
crawler which goes and kills off all structures
keeping a reference to that particular namespace

> Or it is automatically destroyed when the last reference to it is
> dropped?

yup, that's how it could/should work ...

> This is not that simple question as it may seem to be, especially
> taking into account that some namespaces can live longer (e.g. time
> wait buckets should live some time after container died; or shm which
> also can die in a foreign context...).

yes, those have to be addressed and a very specific
semantic has to be found, so that no surprises happen

> So I really hate that we concentrated on discussion of VPIDs,
> while there are more general and global questions on the whole
> virtualization itself.

well, I was not the one rolling out the 'perfect'
vpid solution ...

> >now that you mention it, maybe we should have a few
> >rounds how those 'generic' container syscalls would
> >look like?
>
> First of all, I don't think syscalls like
> "do_something and exec" should be introduced.

Linux-VServer does not have any of those syscalls
and it works quite well, so why should we need such
syscalls?

note: I do not consider clone() or unshare() to be
of this category (just to avoid confusion)

> Next, these syscalls interfaces can be introduced only after we
> discussed the _general_ concepts, like: how these containers exist,
> live, are created, waited and so on. But this is impossible to discuss
> on PID example only. Because:

> 1. pids are directly related to process lifetime. no much issues here.
> 2. pids are hierarchical by its nature.
> 3. there are much more approaches here, then in network for example.

I have no problem at all to discuss a general plan
(hey I thought we were already doing so :) or move
to some other area (like networking :)

best,
Herbert

> Kirill

2006-02-20 14:10:35

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>I would disagree with you. These discussions IMHO led us to the wrong
>>direction.
>>
>>Can I ask a bunch of questions which are related to other
>>virtualization issues, but which are not addressed by Eric anyhow?
>>
>>- How are you planning to make hierarchical namespaces for such
>> resources as IPC? Sockets? Unix sockets?
>
> in the same way as for resources or filesystems -
> management is within the parent, usage within the
> child
So taking the example of IPC, you propose the following:
- parent is able to set up limits on segments, sizes, messages etc.
- parent doesn't see child objects itself, i.e. it is unable to share
segments with a child, send messages to a child etc.
Am I correct?

Provided I got it correctly, how does this differ from the situation
when one container is granted rights to manage another container?
So where is the hierarchy?

Moreover, granting/revoking rights is more fine grained I suppose. And
it is more secure, since it uses the model "allow only things which are
safe", while hierarchy uses the model "allow everything" with a child
and leads to possible DoS.

>>Process tree is hierarchical by it's nature. But what is heirarchical
>>IPC and other resources?
> for resources it's simple, you have to 'give away'
> a certain share to your children, which in turn will
> enable them to utilize them up to the given amount,
> or (depending on the implementation) up to the total
> amount of the parent.
Again, how does this differ from the situation when one container is
granted rights to manage another one? In this case it grants some portion of
its resources to anyone it wishes.

Take a look at this from another angle:
You have a child container, which was granted 1/2 of your resources.
But since the parent consumed 3/4 of them, the child will never be able to
get its 1/2 portion. And the child will be unable to find out the reason for
the resource allocation denials.
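
The scenario above can be worked through with a small model of hierarchical accounting, in which a charge must fit under the limit of the container and of every ancestor. The struct and function names are hypothetical, and the numbers (a total of 100 units) are chosen only to match the 1/2 and 3/4 fractions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hierarchical resource counter. */
struct res_model {
    struct res_model *parent;  /* NULL for the top-level container */
    long limit;                /* maximum this container may account */
    long usage;                /* own use plus descendants' use */
};

/* Charge 'amount' to r and to all of its ancestors. The charge is
 * refused, and nothing is changed, if any level would exceed its limit. */
static bool res_charge(struct res_model *r, long amount)
{
    struct res_model *p;

    /* First pass: check every level without modifying anything. */
    for (p = r; p != NULL; p = p->parent)
        if (p->usage + amount > p->limit)
            return false;

    /* Second pass: commit the charge at every level. */
    for (p = r; p != NULL; p = p->parent)
        p->usage += amount;
    return true;
}
```

With a parent limited to 100 units that has already used 75, a child limited to 50 can charge at most 25 more. A request for 30 fails although the child is well under its own limit, and from inside the child only the failure is visible, not its cause.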

> (check out ckrm as a proof of concept, and example
> how it should not be done :)
let's be more friendly to each other :)

>>And no one ever told me why we need heierarchy at all. No any _real_
>>use cases. But it's ok.
>
> there are many use cases here, first, the VPS within
> a VPS (of course, not the most useful one, but the
> most obvious one), then there are all kind of test,
> build and security scenarios which can benefit from
> hierarchical structures and 'delegated' administrative
> power (just think web space management)
If you are talking about management, then see my prev paragraph. Rights
can be granted. Can you provide some other example of what you want
from hierarchy?

>>- Eric wants to introduce name spaces, but totally forgots how much
>> they are tied with each other. IPC refers to netlinks, network refers
>> to proc and sysctl, etc. It is some rare cases when you will be able
>> to use namespaces without full OpenVZ/vserver containers.

> well, we already visited the following:
>
> - filesystem namespaces (works quite fine completely
> independant of all other)
it is tightly interconnected with unix sockets, proc, sysfs, ipc, and
I'm sure something else :)
Herbert, Eric, I'm not against namespaces. Actually OpenVZ doesn't care
whether we have a single container or namespaces, I'm just trying to show
you that they are not all as separate as you are trying
to think of them.

> - pid spaces (again they are quite fine without any
> other namespace)
only if we remove all these pid uses from fown, netlinks etc.

> - resource spaces (another one which makes perfect
> sense to have without the others)
which one? give me an example please.

> the fact that some things like proc or netlink is tied
> to networking and ipc IMHO is just a sign that those
> interfaces need revisiting and proper isolation or
> virtualization ...
it needs virtualization, really. But even when virtualized they are still
tied to the subsystems they were tied to before.

>>- How long these namespaces live? And which management interface each
>> of them will have?
> well, the lifetime of such a space is very likely to
> be at least the time of the users, including all
> created sockets, file-handles, ipc resources, etc ...
So if you have a socket in TCP_FIN_WAIT1 state, which can live a long
time, what do you do with it?
Full example: the process dies, but the network space is still alive due to
such a socket. You won't be able to reuse the address:port until it has died.
I'm curious how you propose to handle similar issues in separate
namespaces?
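
The lifetime question can be illustrated with a plain reference count: if both tasks and sockets pin the namespace, it outlives its last task for as long as a lingering socket holds a reference. This is a hypothetical user-space model; the names are invented and nothing here is taken from the patches under discussion.

```c
#include <stdbool.h>

/* Hypothetical reference-counted network namespace. */
struct netns_model {
    int refcount;    /* tasks plus sockets currently using it */
    bool destroyed;  /* set when the last reference is dropped */
};

/* A new namespace starts with one reference, held by its first task. */
static void netns_init(struct netns_model *ns)
{
    ns->refcount = 1;
    ns->destroyed = false;
}

/* Taken by every additional user, e.g. a socket created inside. */
static void netns_get(struct netns_model *ns)
{
    ns->refcount++;
}

/* Dropped on task exit or socket teardown; the final put destroys. */
static void netns_put(struct netns_model *ns)
{
    if (--ns->refcount == 0)
        ns->destroyed = true;
}
```

In this model a socket that takes a reference and lingers after the task's netns_put() keeps the namespace alive with no task left inside it. That is exactly the state for which the message asks what management interface would exist.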

Also as a continuation of this example, if all the processes exited, how
can you manage namespaces which leaked? Where should you go to see these
sockets if /proc is tightly related to the pspace of the task, but there are
no tasks?

>>So I really hate that we concentrated on discussion of VPIDs,
>>while there are more general and global questions on the whole
>>virtualization itself.
>
>
> well, I was not the one rolling out the 'perfect'
> vpid solution ...
ha ha :) won't start flaming with you.

>>First of all, I don't think syscalls like
>>"do_something and exec" should be introduced.
> Linux-VServer does not have any of those syscalls
> and it works quite well, so why should we need such
> syscalls?
My question is the same! Why?

> I have no problem at all to discuss a general plan
> (hey I though we were already doing so :) or move
> to some other area (like networking :)
Yup. Would be nice to switch to networking, IPC or something like this.

Kirill

2006-02-20 15:08:14

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Mon, Feb 20, 2006 at 05:11:40PM +0300, Kirill Korotaev wrote:
> >>I would disagree with you. These discussions IMHO led us to the wrong
> >>direction.
> >>
> >>Can I ask a bunch of questions which are related to other
> >>virtualization issues, but which are not addressed by Eric anyhow?
> >>
> >>- How are you planning to make hierarchical namespaces for such
> >> resources as IPC? Sockets? Unix sockets?
> >
> >in the same way as for resources or filesystems -
> >management is within the parent, usage within the
> >child

> So taking example with IPC, you propose the following:
> - parent is able to setup limits on segments, sizes, messages etc.
> - parent doesn't see child objects itself, i.e. it is unable to share
> segments with a child, send messages to child etc.
> Am I correct?

that is limits, not resources, but for them, you are
right ...

> Provided I got it correctly, how does this differ from the situation,
> when one container is granted rights to manage another container?
> So where is hierarchy?

this is permissions, yet another *space we have not
really covered that well ...

> Moreover, granting/revoking rights is more fine grained I suppose. And
> it is more secure, since uses the model - allow only things which are
> safe, while heirarchy uses model "allow everything" to do with a child
> and leads to possible DoS.

DoS of what? of course a hierarchical model would allow
to control 'everything' for/within a child space. this
doesn't mean that it would escape the refinements it
has received from the parent ...

in this way it is _very_ different from one container
being able to administrate a completely different
container (as there is no inheritance), which would
easily cause the aforementioned DoS situation ...

>>> Process tree is hierarchical by its nature. But what is
>>> hierarchical IPC and other resources?

>> for resources it's simple, you have to 'give away'
>> a certain share to your children, which in turn will
>> enable them to utilize them up to the given amount,
>> or (depending on the implementation) up to the total
>> amount of the parent.

> Again, how does this differ from the situation when one container is
> granted the right to manage another one? In this case it grants some
> portion of its resources to anyone it wishes.

well, if one container can give away resources to
another one, and can maintain that context, as well
as inspect tasks and resources inside that context,
then the structure is hierarchical in my book :)

> Take a look at this from another angle:
> You have a child container, which was granted 1/2 of your resources.
> But since the parent consumed 3/4 of them, the child will never be able
> to get its 1/2 portion. And the child will be unable to find out the
> reason for the resource allocation denials.

well, this is called overbooking, and if you want
to allow that, you have to accept the consequences
(doesn't mean I'm against it :)
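
The overbooking case above can be illustrated with a minimal sketch of hierarchical charging. This is not taken from ckrm, Linux-VServer or OpenVZ; the `res_ctx` structure and the `res_charge()`/`res_uncharge()` helpers are hypothetical names, and the sketch assumes that a charge in a child is also counted against every ancestor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource context; not taken from any posted patch. */
struct res_ctx {
	struct res_ctx *parent;  /* NULL for the root context */
	unsigned long limit;     /* maximum units this context may hold */
	unsigned long usage;     /* units charged here, children included */
};

/* Charge 'units' to ctx and to every ancestor, or charge nothing at all. */
bool res_charge(struct res_ctx *ctx, unsigned long units)
{
	struct res_ctx *c;

	assert(ctx != NULL);

	/* First pass: refuse if any level would exceed its limit. */
	for (c = ctx; c != NULL; c = c->parent)
		if (units > c->limit - c->usage)
			return false;

	/* Second pass: commit the charge at every level. */
	for (c = ctx; c != NULL; c = c->parent)
		c->usage += units;
	return true;
}

/* Undo a previous successful charge at every level. */
void res_uncharge(struct res_ctx *ctx, unsigned long units)
{
	struct res_ctx *c;

	for (c = ctx; c != NULL; c = c->parent) {
		assert(c->usage >= units);
		c->usage -= units;
	}
}
```

With a parent limited to 100 units that has already used 75, and a child granted 50, a request for 50 units in the child fails at the parent level even though the child's own limit would allow it. That is the overbooked situation described above; a request for 25 units still succeeds.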

> >(check out ckrm as a proof of concept, and example
> >how it should not be done :)
> let's be more friendly to each other :)

I'm not being 'unfriendly' here, we all agree that
the basic idea of ckrm was really great, but the
implementation and complexity makes it somewhat
unusable ...

>>> And no one ever told me why we need hierarchy at all. Not any _real_
>>> use cases. But it's ok.

>> there are many use cases here, first, the VPS within
>> a VPS (of course, not the most useful one, but the
>> most obvious one), then there are all kind of test,
>> build and security scenarios which can benefit from
>> hierarchical structures and 'delegated' administrative
>> power (just think web space management)

> If you are talking about management, then see my prev paragraph.
> Rights can be granted. Can you provide some other example, what do you
> want from hierarchy?

see my reply to your paragraph :)

> >>- Eric wants to introduce name spaces, but totally forgets how much
> >>they are tied to each other. IPC refers to netlinks, network
> >>refers to proc and sysctl, etc. There are only rare cases where you
> >>will be able to use namespaces without full OpenVZ/vserver containers.
>
> >well, we already visited the following:
> >
> > - filesystem namespaces (works quite fine completely
> > independent of all others)

> it is tightly interconnected with unix sockets, proc, sysfs, ipc, and
> I'm sure something else :)

well, for procfs and sysfs that does not come as
a surprise, as they _are_ filesystems after
all, named sockets do belong there too, for ipc
I do not see the direct connection as long as
the elements are not represented by files ...

> Herbert, Eric, I'm not against namespaces. Actually OpenVZ doesn't care
> whether we have a single container or namespaces, I'm just trying to
> show you that they are not as separate from each other as you
> think they are.

I guess all of the involved parties _know_ that
they are somewhat interconnected, and as I said
several times now, I think we should try to split
them and separate 'em wherever possible, sometimes
even by changing existing semantics and structures

>> - pid spaces (again they are quite fine without any
>> other namespace)
> only if we remove all these pid uses from fown, netlinks etc.

again, some cleanup is probably required ...

>> - resource spaces (another one which makes perfect
>> sense to have without the others)
> which one? give me an example please.

- cpu resources (scheduler)

>> the fact that some things like proc or netlink are tied
>> to networking and ipc IMHO is just a sign that those
>> interfaces need revisiting and proper isolation or
>> virtualization ...
> it needs virtualization, really. But being virtualized they are still
> tied to the subsystems they were.
>
>>> - How long do these namespaces live? And which management interface
>>> will each of them have?
>> well, the lifetime of such a space is very likely to
>> be at least the time of the users, including all
>> created sockets, file-handles, ipc resources, etc ...
> So if you have a socket in TCP_FIN_WAIT1 state, which can live long
> time, what do you do with it?

IMHO you have two options there:

- zap the sockets when you destroy the context
(a machine which is rebooted will do the same :)

- keep them hanging around for some days, until the
timeout is reached and they go away quite naturally

note: this is something which does not even have to
be a design decision right now, as the zapping can
be done later quite easily ...

> Full example: the process dies, but the network space is still alive due
> to such a socket. You won't be able to reuse the address:port until it
> has died. I'm curious how you propose to handle similar issues in
> separate namespaces?

see above, btw, the network space would only be
partially alive, i.e. no new sockets could be created
within it, or it might even be persistent and you can
'handle' the sockets after a restart as if they were
on a 'normal' machine ...

(btw, this issue is quite common on a real machine, no?)
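
The "partially alive" lifetime described above can be sketched with a plain reference count. This is only an illustration under stated assumptions: the `net_space` structure and the `space_*()` helpers are hypothetical, and the sketch assumes that each task and each open socket holds one reference on the space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical network space; field and function names are illustrative. */
struct net_space {
	unsigned int refs;   /* one reference per task and per open socket */
	unsigned int tasks;  /* tasks still running inside the space */
	bool dying;          /* set when the last task has exited */
};

/* Set up a space that holds exactly one task and no sockets. */
void space_init(struct net_space *ns)
{
	assert(ns != NULL);
	ns->refs = 1;
	ns->tasks = 1;
	ns->dying = false;
}

/* Drop one reference; returns true when the space can be freed. */
bool space_put(struct net_space *ns)
{
	assert(ns != NULL && ns->refs > 0);
	return --ns->refs == 0;
}

/* A new socket pins the space, but only while the space is not dying. */
bool space_socket_open(struct net_space *ns)
{
	assert(ns != NULL);
	if (ns->dying)
		return false;
	ns->refs++;
	return true;
}

/* The socket finally goes away (closed, timed out, or zapped). */
bool space_socket_close(struct net_space *ns)
{
	return space_put(ns);
}

/* Task exit: the last task marks the space dying, then drops its reference. */
bool space_task_exit(struct net_space *ns)
{
	assert(ns != NULL && ns->tasks > 0);
	if (--ns->tasks == 0)
		ns->dying = true;
	return space_put(ns);
}
```

In this model a socket lingering in TCP_FIN_WAIT1 keeps the space allocated after the last task has exited, but no new socket can be created in it. Whether lingering sockets are zapped or left to time out only changes when `space_socket_close()` is called.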

> Also as a continuation of this example, if all the processes exited,
> how can you manage namespaces which leaked? where should you go to see
> these sockets if /proc is tightly related to pspace on the task, but
> there are no tasks?

that is where an administrative interface makes sense,
you can review and investigate those spaces and take
proper actions if you like ...

>>> So I really hate that we concentrated on discussion of VPIDs,
>>> while there are more general and global questions on the whole
>>> virtualization itself.

>> well, I was not the one rolling out the 'perfect'
>> vpid solution ...

> ha ha :) won't start flaming with you.

>>> First of all, I don't think syscalls like
>>> "do_something and exec" should be introduced.
>> Linux-VServer does not have any of those syscalls
>> and it works quite well, so why should we need such
>> syscalls?
> My question is the same! Why?

because we don't need them?!

> >I have no problem at all to discuss a general plan
> >(hey I thought we were already doing so :) or move
> >to some other area (like networking :)
> Yup. Would be nice to switch to networking, IPC or something like this.

best,
Herbert

> Kirill

2006-02-20 17:04:51

by Herbert Poetzl

Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Mon, Feb 20, 2006 at 12:27:03PM +0300, Kirill Korotaev wrote:
>>>> I also don't understand why you are eager to introduce new sys
>>>> calls like pkill(path_to_process), but are trying to use waitpid()
>>>> for pspace death notifications? Maybe it is simply better to
>>>> introduce container specific syscalls which should be clean and
>>>> tidy, instead of messing things up with clone()/waitpid() and so
>>>> on? The more simple code we have, the better it is for all of us.

>>> now that you mention it, maybe we should have a few
>>> rounds on what those 'generic' container syscalls would
>>> look like?

>> I still like the following:
>>
>> clone(): extended with flags for asking for a private copy of various
>> namespaces. For the CLONE_NEWPIDSPACE flag, the pid which
>> is returned to the parent process is its handle to the
>> new pidspace.

> - clone has limited number of flags.
> - sooner or later you will need to add flags about what and how
> you want to close (e.g. pspace with weak or strong isolation
> and so on)

I still do not see a need to do that at clone() time
but I assume you have your reasons for postulating this ...

>> sys_execpid(pid, argv, envp): exec a new program with the requested
>> pid, if the pid is available. Else either return an error,
>> or pick a random pid. Error makes sense to me, but that's
>> probably debatable.
> the problem is that in real life environments where executables can be
> substituted this is kind of a security issue. Also I really hate the
> idea of using exec() for changing something.

> In OpenVZ we successfully do context changes without exec()'s.

here I agree, changes between *spaces should be independent
from exec(), but IMHO there is no need to reuse existing
interfaces for that, a syscall will do fine here ...

>> sys_fork_slide(pid): fork and slide into the pidspace belonging to pid.
>> This way we can do things like
>>
>> p = sys_fork_slide(2794);
>> if (p == 0) {
>>         kill(2974);
>> } else {
>>         waitpid(p, 0, 0);
>> }
>> Ok, this last one in particular needs to be improved, but these two
>> syscalls and the clone flags together give us all we need. Right?
> Again, you concentrate on PIDspaces forgetting about all the other
> namespaces.
>
> I would prefer:
>
> - sys_ns_create()
> creates a namespace and makes a task inherit this namespace.
> If _needed_, it _can_ fork inside.

I don't see why not have both, the auto-created
*space on clone() and the ability to create empty
*spaces which can be populated and/or entered

> - sys_ns_inherit()
> change active namespace.

hmm, sounds like a misnomer to me ...

> But how should we reference a namespace? by global ID?

definitely by some system-unique identifier ...

best,
Herbert

> Kirill

2006-02-21 16:46:30

by Kirill Korotaev

Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

>>I would prefer:
>>- sys_ns_create()
>> creates a namespace and makes a task inherit this namespace.
>> If _needed_, it _can_ fork inside.

> I don't see why not have both, the auto-created
> *space on clone() and the ability to create empty
> *spaces which can be populated and/or entered
Can you give more details on what you mean by auto-created *space and
empty *space?
I don't see much difference...

>>- sys_ns_inherit()
>> change active namespace.
> hmm, sounds like a misnomer to me ...
sys_ns_change ? :)

>>But how should we reference a namespace? by global ID?
> definitely by some system-unique identifier ...
Should it be integer or path as Eric proposes?

Thanks,
Kirill

2006-02-21 23:23:18

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

On Tue, Feb 21, 2006 at 07:29:17PM +0300, Kirill Korotaev wrote:
>>>I would prefer:
>>>- sys_ns_create()
>>> creates a namespace and makes a task inherit this namespace.
>>> If _needed_, it _can_ fork inside.

>>I don't see why not have both, the auto-created
>>*space on clone() and the ability to create empty
>>*spaces which can be populated and/or entered
>Can you give more details on what you mean by auto-created *space and
>empty *space?
>I don't see much difference...

good, in this case we can drop the empty/standalone
*space and 'just' use the clone() one. glad that
we finally agree here ....

>>>- sys_ns_inherit()
>>> change active namespace.
>>hmm, sounds like a misnomer to me ...
>sys_ns_change ? :)

personally I prefer to see it as either enter or
move, but change is probably fine too (except for
the fact that it doesn't change the namespace)

>>>But how should we reference a namespace? by global ID?
>>definitely by some system-unique identifier ...
>Should it be integer or path as Eric proposes?
IMHO the pointer would be sufficient, of course
this can be mapped to arbitrary names/int/etc ...
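
The idea of using the pointer internally and mapping it to integers for user space can be sketched as a small lookup table. Everything here is a hypothetical illustration, not a proposed kernel interface: `ns_table`, `ns_register()` and `ns_lookup()` are made-up names, and a fixed-size array stands in for whatever structure a real implementation would use.

```c
#include <assert.h>
#include <stddef.h>

#define NS_TABLE_SIZE 16

/* One slot maps a system-unique integer id to a namespace pointer. */
struct ns_entry {
	int id;     /* identifier handed out to user space */
	void *ns;   /* opaque namespace pointer, NULL when the slot is free */
};

/* Hypothetical table; a zero-initialized table is empty. */
struct ns_table {
	struct ns_entry slots[NS_TABLE_SIZE];
	int last_id;  /* last id handed out, ids start at 1 */
};

/* Register a namespace pointer; returns its new id, or -1 if the table is full. */
int ns_register(struct ns_table *t, void *ns)
{
	size_t i;

	assert(t != NULL && ns != NULL);
	for (i = 0; i < NS_TABLE_SIZE; i++) {
		if (t->slots[i].ns == NULL) {
			t->slots[i].ns = ns;
			t->slots[i].id = ++t->last_id;
			return t->slots[i].id;
		}
	}
	return -1;
}

/* Translate an id back to the namespace pointer, or NULL if it is unknown. */
void *ns_lookup(const struct ns_table *t, int id)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < NS_TABLE_SIZE; i++)
		if (t->slots[i].ns != NULL && t->slots[i].id == id)
			return t->slots[i].ns;
	return NULL;
}
```

A path-based naming scheme could sit on top of the same pointer in the same way, so the choice between integer and path would only affect the lookup layer and not the namespace object itself.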

> Thanks,
> Kirill

2006-03-01 22:06:46

by Cédric Le Goater

Subject: Re: [RFC][PATCH 04/20] pspace: Allow multiple instaces of the process id namespace

Herbert Poetzl wrote:
>>>>First of all, I don't think syscalls like
>>>>"do_something and exec" should be introduced.
>>>
>>>Linux-VServer does not have any of those syscalls
>>>and it works quite well, so why should we need such
>>>syscalls?
>>
>>My question is the same! Why?
>
>
> because we don't need them?!

I think that the reason behind such syscalls is to make sure that the
process is "clean" when it enters a new container. "clean" means that the
process is not holding in memory an identifier, like a pid for instance,
that doesn't make sense in the new container.

This is really an edge case which supposes that processes changing
containers don't know what they are doing. This case also depends on how
identifiers will be virtualized.

C.