2007-11-17 18:31:34

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] do_task_stat: don't use task_pid_nr_ns() lockless

Without rcu/tasklist/siglock lock task_pid_nr_ns() may read the freed memory,
move the callsite under ->siglock.

Sadly, we can report pid == 0 if the task was detached.

Signed-off-by: Oleg Nesterov <[email protected]>

--- 24/fs/proc/array.c~dtst 2007-11-09 12:57:30.000000000 +0300
+++ 24/fs/proc/array.c 2007-11-17 21:26:55.000000000 +0300
@@ -392,7 +392,7 @@ static int do_task_stat(struct task_stru
sigset_t sigign, sigcatch;
char state;
int res;
- pid_t ppid = 0, pgid = -1, sid = -1;
+ pid_t pid = 0, ppid = 0, pgid = -1, sid = -1;
int num_threads = 0;
struct mm_struct *mm;
unsigned long long start_time;
@@ -403,9 +403,6 @@ static int do_task_stat(struct task_stru
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
- struct pid_namespace *ns;
-
- ns = current->nsproxy->pid_ns;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -425,6 +422,7 @@ static int do_task_stat(struct task_stru

rcu_read_lock();
if (lock_task_sighand(task, &flags)) {
+ struct pid_namespace *ns = current->nsproxy->pid_ns;
struct signal_struct *sig = task->signal;

if (sig->tty) {
@@ -461,6 +459,7 @@ static int do_task_stat(struct task_stru
gtime = cputime_add(gtime, sig->gtime);
}

+ pid = task_pid_nr_ns(task, ns);
sid = task_session_nr_ns(task, ns);
pgid = task_pgrp_nr_ns(task, ns);
ppid = task_ppid_nr_ns(task, ns);
@@ -495,7 +494,7 @@ static int do_task_stat(struct task_stru
res = sprintf(buffer, "%d (%s) %c %d %d %d %d %d %u %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
- task_pid_nr_ns(task, ns),
+ pid,
tcomm,
state,
ppid,


2007-11-17 20:12:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] do_task_stat: don't use task_pid_nr_ns() lockless

Oleg Nesterov <[email protected]> writes:

> Without rcu/tasklist/siglock lock task_pid_nr_ns() may read the freed memory,
> move the callsite under ->siglock.
>
> Sadly, we can report pid == 0 if the task was detached.

We only get detached in release_task so it is a pretty small window
where we can return pid == 0. Usually get_task_pid will fail first
and we will return -ESRCH. Still the distance from open to

There is another bug in here as well. current->nsproxy->pid_ns is wrong.
What we want is: ns = dentry->d_sb->s_fs_info;

Otherwise we will have file descriptor passing races and the like.

We could also do: proc_pid(inode) to get the pid, which is a little
more race free, and will prevent us from returning pid == 0.

In either event it looks like we need to implement some proper
file operations for these proc files, maybe even going to seq file
status.

Eric

2007-11-19 09:31:23

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH] do_task_stat: don't use task_pid_nr_ns() lockless

Eric W. Biederman wrote:
> Oleg Nesterov <[email protected]> writes:
>
>> Without rcu/tasklist/siglock lock task_pid_nr_ns() may read the freed memory,
>> move the callsite under ->siglock.
>>
>> Sadly, we can report pid == 0 if the task was detached.
>
> We only get detached in release_task so it is a pretty small window
> where we can return pid == 0. Usually get_task_pid will fail first
> and we will return -ESRCH. Still the distance from open to
>
> There is another bug in here as well. current->nsproxy->pid_ns is wrong.
> What we want is: ns = dentry->d_sb->s_fs_info;

Actually I thought about this recently - if we produce the list
of tasks based on the sb's namespace, then we should fill the
tasks' files according to the sb's namespace as well, not according
to the current namespace.

> Otherwise we will have file descriptor passing races and the like.

Can you elaborate?

> We could also do: proc_pid(inode) to get the pid, which is a little
> more race free, and will prevent us from returning pid == 0.
>
> In either event it looks like we need to implement some proper
> file operations for these proc files, maybe even going to seq file
> status.
>
> Eric
>

2007-11-19 09:32:59

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH] do_task_stat: don't use task_pid_nr_ns() lockless

Oleg Nesterov wrote:
> Without rcu/tasklist/siglock lock task_pid_nr_ns() may read the freed memory,
> move the callsite under ->siglock.
>
> Sadly, we can report pid == 0 if the task was detached.
>
> Signed-off-by: Oleg Nesterov <[email protected]>

Acked-by: Pavel Emelyanov <[email protected]>

> --- 24/fs/proc/array.c~dtst 2007-11-09 12:57:30.000000000 +0300
> +++ 24/fs/proc/array.c 2007-11-17 21:26:55.000000000 +0300
> @@ -392,7 +392,7 @@ static int do_task_stat(struct task_stru
> sigset_t sigign, sigcatch;
> char state;
> int res;
> - pid_t ppid = 0, pgid = -1, sid = -1;
> + pid_t pid = 0, ppid = 0, pgid = -1, sid = -1;
> int num_threads = 0;
> struct mm_struct *mm;
> unsigned long long start_time;
> @@ -403,9 +403,6 @@ static int do_task_stat(struct task_stru
> unsigned long rsslim = 0;
> char tcomm[sizeof(task->comm)];
> unsigned long flags;
> - struct pid_namespace *ns;
> -
> - ns = current->nsproxy->pid_ns;
>
> state = *get_task_state(task);
> vsize = eip = esp = 0;
> @@ -425,6 +422,7 @@ static int do_task_stat(struct task_stru
>
> rcu_read_lock();
> if (lock_task_sighand(task, &flags)) {
> + struct pid_namespace *ns = current->nsproxy->pid_ns;
> struct signal_struct *sig = task->signal;
>
> if (sig->tty) {
> @@ -461,6 +459,7 @@ static int do_task_stat(struct task_stru
> gtime = cputime_add(gtime, sig->gtime);
> }
>
> + pid = task_pid_nr_ns(task, ns);
> sid = task_session_nr_ns(task, ns);
> pgid = task_pgrp_nr_ns(task, ns);
> ppid = task_ppid_nr_ns(task, ns);
> @@ -495,7 +494,7 @@ static int do_task_stat(struct task_stru
> res = sprintf(buffer, "%d (%s) %c %d %d %d %d %d %u %lu \
> %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> %lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
> - task_pid_nr_ns(task, ns),
> + pid,
> tcomm,
> state,
> ppid,
>
>

2007-11-19 18:06:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] do_task_stat: don't use task_pid_nr_ns() lockless

Pavel Emelyanov <[email protected]> writes:

> Eric W. Biederman wrote:
>> We only get detached in release_task so it is a pretty small window
>> where we can return pid == 0. Usually get_task_pid will fail first
>> and we will return -ESRCH. Still the distance from open to
>>
>> There is another bug in here as well. current->nsproxy->pid_ns is wrong.
>> What we want is: ns = dentry->d_sb->s_fs_info;
>
> Actually I thought about this recently - if we produce the list
> of tasks based on the sb's namespace, then we should fill the
> tasks' files according to the sb's namespace as well, not according
> to the current namespace.

Exactly. That is the problem I was pointing out. Patch in a bit.
In do_proc_stat we are currently referencing current->nsproxy->pid_ns;

>> Otherwise we will have file descriptor passing races and the like.
>
> Can you elaborate?

Just the classic problem that if you use current at other then open
time the contents of the file descriptor may depend on who is reading
it which is problematic.

Eric

2007-11-19 22:01:17

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/4] proc: Cleanup status files.


There is a long standing ugliness with /proc/<pid>/stat,
/proc/<pid>/statm, and /proc/<pid>/status that they do not
use the seqfile API.

In addition they are currently reporting pids in the pid namespace
of the current task instead of the pid namespace with which proc
was mounted which is confusing.

This patch series first build the infrastructure to allow these
problems to be fixed and then it converts the individual proc
status files.

Eric

2007-11-19 22:06:49

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/4] proc: Implement proc_single_file_operations


Currently many /proc/pid files use a crufty precursor to
the current seq_file api, and they don't have direct access
to the pid_namespace or the pid of for which they are displaying
data.

So implement proc_single_file_operations to make the seq_file
routines easy to use, and to give access to the full state of the
pid of we are displaying data for.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/base.c | 43 +++++++++++++++++++++++++++++++++++++++++++
include/linux/proc_fs.h | 3 +++
2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a17c268..e9ff77e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -125,6 +125,10 @@ struct pid_entry {
NOD(NAME, (S_IFREG|(MODE)), \
NULL, &proc_info_file_operations, \
{ .proc_read = &proc_##OTYPE } )
+#define ONE(NAME, MODE, OTYPE) \
+ NOD(NAME, (S_IFREG|(MODE)), \
+ NULL, &proc_single_file_operations, \
+ { .proc_show = &proc_##OTYPE } )

int maps_protect;
EXPORT_SYMBOL(maps_protect);
@@ -571,6 +575,45 @@ static const struct file_operations proc_info_file_operations = {
.read = proc_info_read,
};

+static int proc_single_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct pid_namespace *ns;
+ struct pid *pid;
+ struct task_struct *task;
+ int ret;
+
+ ns = inode->i_sb->s_fs_info;
+ pid = proc_pid(inode);
+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (!task)
+ return -ESRCH;
+
+ ret = PROC_I(inode)->op.proc_show(m, ns, pid, task);
+
+ put_task_struct(task);
+ return ret;
+}
+
+static int proc_single_open(struct inode *inode, struct file *filp)
+{
+ int ret;
+ ret = single_open(filp, proc_single_show, NULL);
+ if (!ret) {
+ struct seq_file *m = filp->private_data;
+
+ m->private = inode;
+ }
+ return ret;
+}
+
+static const struct file_operations proc_single_file_operations = {
+ .open = proc_single_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static int mem_open(struct inode* inode, struct file* file)
{
file->private_data = (void*)((long)current->self_exec_id);
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 1273c6e..69307c9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -257,6 +257,9 @@ extern void kclist_add(struct kcore_list *, void *, size_t);
union proc_op {
int (*proc_get_link)(struct inode *, struct dentry **, struct vfsmount **);
int (*proc_read)(struct task_struct *task, char *page);
+ int (*proc_show)(struct seq_file *m,
+ struct pid_namespace *ns, struct pid *pid,
+ struct task_struct *task);
};

struct proc_inode {
--
1.5.3.rc6.17.g1911

2007-11-19 22:10:52

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/4] proc: Rewrite do_task_stat to correctly handle pid namespaces.


Currently (as pointed out by Oleg) do_task_stat has a race
when calling task_pid_nr_ns with the task exiting. In addition
do_task_stat is not currently displaying information in the
context of the pid namespace that mounted the /proc filesystem.
So "cut -d' ' -f 1 /proc/<pid>/stat" may not equal <pid>.

This patch fixes the problem by converting to a single_open
seq_file show method. Getting the pid namespace from the
filesystem superblock instead of current, and simply using
the the struct pid from the inode instead of attempting to get
that same pid from the task.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/array.c | 24 ++++++++++++------------
fs/proc/base.c | 4 ++--
fs/proc/internal.h | 4 ++--
3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index eba339e..e7d290f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -77,6 +77,7 @@
#include <linux/cpuset.h>
#include <linux/rcupdate.h>
#include <linux/delayacct.h>
+#include <linux/seq_file.h>
#include <linux/pid_namespace.h>

#include <asm/pgtable.h>
@@ -384,14 +385,14 @@ static cputime_t task_gtime(struct task_struct *p)
return p->gtime;
}

-static int do_task_stat(struct task_struct *task, char *buffer, int whole)
+static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task, int whole)
{
unsigned long vsize, eip, esp, wchan = ~0UL;
long priority, nice;
int tty_pgrp = -1, tty_nr = 0;
sigset_t sigign, sigcatch;
char state;
- int res;
pid_t ppid = 0, pgid = -1, sid = -1;
int num_threads = 0;
struct mm_struct *mm;
@@ -403,9 +404,6 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
- struct pid_namespace *ns;
-
- ns = current->nsproxy->pid_ns;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -492,10 +490,10 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
/* convert nsec -> ticks */
start_time = nsec_to_clock_t(start_time);

- res = sprintf(buffer, "%d (%s) %c %d %d %d %d %d %u %lu \
+ seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
- task_pid_nr_ns(task, ns),
+ pid_nr_ns(pid, ns),
tcomm,
state,
ppid,
@@ -544,17 +542,19 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
cputime_to_clock_t(cgtime));
if (mm)
mmput(mm);
- return res;
+ return 0;
}

-int proc_tid_stat(struct task_struct *task, char *buffer)
+int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
{
- return do_task_stat(task, buffer, 0);
+ return do_task_stat(m, ns, pid, task, 0);
}

-int proc_tgid_stat(struct task_struct *task, char *buffer)
+int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
{
- return do_task_stat(task, buffer, 1);
+ return do_task_stat(m, ns, pid, task, 1);
}

int proc_pid_statm(struct task_struct *task, char *buffer)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index e9ff77e..51f0e7a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2228,7 +2228,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("sched", S_IRUGO|S_IWUSR, pid_sched),
#endif
INF("cmdline", S_IRUGO, pid_cmdline),
- INF("stat", S_IRUGO, tgid_stat),
+ ONE("stat", S_IRUGO, tgid_stat),
INF("statm", S_IRUGO, pid_statm),
REG("maps", S_IRUGO, maps),
#ifdef CONFIG_NUMA
@@ -2549,7 +2549,7 @@ static const struct pid_entry tid_base_stuff[] = {
REG("sched", S_IRUGO|S_IWUSR, pid_sched),
#endif
INF("cmdline", S_IRUGO, pid_cmdline),
- INF("stat", S_IRUGO, tid_stat),
+ ONE("stat", S_IRUGO, tid_stat),
INF("statm", S_IRUGO, pid_statm),
REG("maps", S_IRUGO, maps),
#ifdef CONFIG_NUMA
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1b2b6c6..74bc772 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -46,8 +46,8 @@ extern int maps_protect;

extern void create_seq_entry(char *name, mode_t mode, const struct file_operations *f);
extern int proc_exe_link(struct inode *, struct dentry **, struct vfsmount **);
-extern int proc_tid_stat(struct task_struct *, char *);
-extern int proc_tgid_stat(struct task_struct *, char *);
+extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
+extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
extern int proc_pid_status(struct task_struct *, char *);
extern int proc_pid_statm(struct task_struct *, char *);

--
1.5.3.rc6.17.g1911

2007-11-19 22:12:38

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 3/4] proc: seqfile convert proc_pid_status to properly handle pid namespaces


Currently we possibly lookup the pid in the wrong pid namespace. So
seq_file convert proc_pid_status which ensures the proper pid
namespaces is passed in, and ensures that we don't overflow any
internal buffers when generating /proc/<pid>/status

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/s390/kernel/traps.c | 23 ++++----
fs/proc/array.c | 122 ++++++++++++++++++++----------------------
fs/proc/base.c | 4 +-
fs/proc/internal.h | 2 +-
fs/proc/task_mmu.c | 6 +-
fs/proc/task_nommu.c | 2 +-
include/asm-s390/processor.h | 2 +-
include/linux/cpuset.h | 8 ++--
include/linux/proc_fs.h | 2 +-
kernel/cpuset.c | 18 ++++---
10 files changed, 93 insertions(+), 96 deletions(-)

diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 8ec9def..fd75106 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -210,41 +210,40 @@ void show_registers(struct pt_regs *regs)
}

/* This is called from fs/proc/array.c */
-char *task_show_regs(struct task_struct *task, char *buffer)
+void task_show_regs(struct seq_file *m, struct task_struct *task)
{
struct pt_regs *regs;

regs = task_pt_regs(task);
- buffer += sprintf(buffer, "task: %p, ksp: %p\n",
+ seq_printf(m, "task: %p, ksp: %p\n",
task, (void *)task->thread.ksp);
- buffer += sprintf(buffer, "User PSW : %p %p\n",
+ seq_printf(m, "User PSW : %p %p\n",
(void *) regs->psw.mask, (void *)regs->psw.addr);

- buffer += sprintf(buffer, "User GPRS: " FOURLONG,
+ seq_printf(m, "User GPRS: " FOURLONG,
regs->gprs[0], regs->gprs[1],
regs->gprs[2], regs->gprs[3]);
- buffer += sprintf(buffer, " " FOURLONG,
+ seq_printf(m, " " FOURLONG,
regs->gprs[4], regs->gprs[5],
regs->gprs[6], regs->gprs[7]);
- buffer += sprintf(buffer, " " FOURLONG,
+ seq_printf(m, " " FOURLONG,
regs->gprs[8], regs->gprs[9],
regs->gprs[10], regs->gprs[11]);
- buffer += sprintf(buffer, " " FOURLONG,
+ seq_printf(m, " " FOURLONG,
regs->gprs[12], regs->gprs[13],
regs->gprs[14], regs->gprs[15]);
- buffer += sprintf(buffer, "User ACRS: %08x %08x %08x %08x\n",
+ seq_printf(m, "User ACRS: %08x %08x %08x %08x\n",
task->thread.acrs[0], task->thread.acrs[1],
task->thread.acrs[2], task->thread.acrs[3]);
- buffer += sprintf(buffer, " %08x %08x %08x %08x\n",
+ seq_printf(m, " %08x %08x %08x %08x\n",
task->thread.acrs[4], task->thread.acrs[5],
task->thread.acrs[6], task->thread.acrs[7]);
- buffer += sprintf(buffer, " %08x %08x %08x %08x\n",
+ seq_printf(m, " %08x %08x %08x %08x\n",
task->thread.acrs[8], task->thread.acrs[9],
task->thread.acrs[10], task->thread.acrs[11]);
- buffer += sprintf(buffer, " %08x %08x %08x %08x\n",
+ seq_printf(m, " %08x %08x %08x %08x\n",
task->thread.acrs[12], task->thread.acrs[13],
task->thread.acrs[14], task->thread.acrs[15]);
- return buffer;
}

static DEFINE_SPINLOCK(die_lock);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index e7d290f..d23d91d 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -89,18 +89,21 @@
do { memcpy(buffer, string, strlen(string)); \
buffer += strlen(string); } while (0)

-static inline char *task_name(struct task_struct *p, char *buf)
+static inline void task_name(struct seq_file *m, struct task_struct *p)
{
int i;
+ char *buf, *end;
char *name;
char tcomm[sizeof(p->comm)];

get_task_comm(tcomm, p);

- ADDBUF(buf, "Name:\t");
+ end = m->buf + m->size;
+ buf = m->buf + m->count;
+ seq_printf(m, "Name:\n");
name = tcomm;
i = sizeof(tcomm);
- do {
+ while (i && (buf < end)) {
unsigned char c = *name;
name++;
i--;
@@ -108,20 +111,21 @@ static inline char *task_name(struct task_struct *p, char *buf)
if (!c)
break;
if (c == '\\') {
- buf[1] = c;
- buf += 2;
+ buf++;
+ if (buf < end)
+ *buf++ = c;
continue;
}
if (c == '\n') {
- buf[0] = '\\';
- buf[1] = 'n';
- buf += 2;
+ *buf++ = '\\';
+ if (buf < end)
+ *buf++ = 'n';
continue;
}
buf++;
- } while (i);
- *buf = '\n';
- return buf+1;
+ }
+ m->count = buf - m->buf;
+ seq_printf(m, "\n");
}

/*
@@ -157,21 +161,20 @@ static inline const char *get_task_state(struct task_struct *tsk)
return *p;
}

-static inline char *task_state(struct task_struct *p, char *buffer)
+static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *p)
{
struct group_info *group_info;
int g;
struct fdtable *fdt = NULL;
- struct pid_namespace *ns;
pid_t ppid, tpid;

- ns = current->nsproxy->pid_ns;
rcu_read_lock();
ppid = pid_alive(p) ?
task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
tpid = pid_alive(p) && p->ptrace ?
task_ppid_nr_ns(rcu_dereference(p->parent), ns) : 0;
- buffer += sprintf(buffer,
+ seq_printf(m,
"State:\t%s\n"
"Tgid:\t%d\n"
"Pid:\t%d\n"
@@ -181,7 +184,7 @@ static inline char *task_state(struct task_struct *p, char *buffer)
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
task_tgid_nr_ns(p, ns),
- task_pid_nr_ns(p, ns),
+ pid_nr_ns(pid, ns),
ppid, tpid,
p->uid, p->euid, p->suid, p->fsuid,
p->gid, p->egid, p->sgid, p->fsgid);
@@ -189,7 +192,7 @@ static inline char *task_state(struct task_struct *p, char *buffer)
task_lock(p);
if (p->files)
fdt = files_fdtable(p->files);
- buffer += sprintf(buffer,
+ seq_printf(m,
"FDSize:\t%d\n"
"Groups:\t",
fdt ? fdt->max_fds : 0);
@@ -200,20 +203,17 @@ static inline char *task_state(struct task_struct *p, char *buffer)
task_unlock(p);

for (g = 0; g < min(group_info->ngroups, NGROUPS_SMALL); g++)
- buffer += sprintf(buffer, "%d ", GROUP_AT(group_info, g));
+ seq_printf(m, "%d ", GROUP_AT(group_info, g));
put_group_info(group_info);

- buffer += sprintf(buffer, "\n");
- return buffer;
+ seq_printf(m, "\n");
}

-static char *render_sigset_t(const char *header, sigset_t *set, char *buffer)
+static void render_sigset_t(struct seq_file *m, const char *header, sigset_t *set)
{
- int i, len;
+ int i;

- len = strlen(header);
- memcpy(buffer, header, len);
- buffer += len;
+ seq_printf(m, "%s", header);

i = _NSIG;
do {
@@ -224,12 +224,10 @@ static char *render_sigset_t(const char *header, sigset_t *set, char *buffer)
if (sigismember(set, i+2)) x |= 2;
if (sigismember(set, i+3)) x |= 4;
if (sigismember(set, i+4)) x |= 8;
- *buffer++ = (x < 10 ? '0' : 'a' - 10) + x;
+ seq_printf(m, "%x", x);
} while (i >= 4);

- *buffer++ = '\n';
- *buffer = 0;
- return buffer;
+ seq_printf(m, "\n");
}

static void collect_sigign_sigcatch(struct task_struct *p, sigset_t *ign,
@@ -247,7 +245,7 @@ static void collect_sigign_sigcatch(struct task_struct *p, sigset_t *ign,
}
}

-static inline char *task_sig(struct task_struct *p, char *buffer)
+static inline void task_sig(struct seq_file *m, struct task_struct *p)
{
unsigned long flags;
sigset_t pending, shpending, blocked, ignored, caught;
@@ -274,58 +272,56 @@ static inline char *task_sig(struct task_struct *p, char *buffer)
}
rcu_read_unlock();

- buffer += sprintf(buffer, "Threads:\t%d\n", num_threads);
- buffer += sprintf(buffer, "SigQ:\t%lu/%lu\n", qsize, qlim);
+ seq_printf(m, "Threads:\t%d\n", num_threads);
+ seq_printf(m, "SigQ:\t%lu/%lu\n", qsize, qlim);

/* render them all */
- buffer = render_sigset_t("SigPnd:\t", &pending, buffer);
- buffer = render_sigset_t("ShdPnd:\t", &shpending, buffer);
- buffer = render_sigset_t("SigBlk:\t", &blocked, buffer);
- buffer = render_sigset_t("SigIgn:\t", &ignored, buffer);
- buffer = render_sigset_t("SigCgt:\t", &caught, buffer);
-
- return buffer;
+ render_sigset_t(m, "SigPnd:\t", &pending);
+ render_sigset_t(m, "ShdPnd:\t", &shpending);
+ render_sigset_t(m, "SigBlk:\t", &blocked);
+ render_sigset_t(m, "SigIgn:\t", &ignored);
+ render_sigset_t(m, "SigCgt:\t", &caught);
}

-static inline char *task_cap(struct task_struct *p, char *buffer)
+static inline void task_cap(struct seq_file *m, struct task_struct *p)
{
- return buffer + sprintf(buffer, "CapInh:\t%016x\n"
- "CapPrm:\t%016x\n"
- "CapEff:\t%016x\n",
- cap_t(p->cap_inheritable),
- cap_t(p->cap_permitted),
- cap_t(p->cap_effective));
+ seq_printf(m, "CapInh:\t%016x\n"
+ "CapPrm:\t%016x\n"
+ "CapEff:\t%016x\n",
+ cap_t(p->cap_inheritable),
+ cap_t(p->cap_permitted),
+ cap_t(p->cap_effective));
}

-static inline char *task_context_switch_counts(struct task_struct *p,
- char *buffer)
+static inline void task_context_switch_counts(struct seq_file *m,
+ struct task_struct *p)
{
- return buffer + sprintf(buffer, "voluntary_ctxt_switches:\t%lu\n"
- "nonvoluntary_ctxt_switches:\t%lu\n",
- p->nvcsw,
- p->nivcsw);
+ seq_printf(m, "voluntary_ctxt_switches:\t%lu\n"
+ "nonvoluntary_ctxt_switches:\t%lu\n",
+ p->nvcsw,
+ p->nivcsw);
}

-int proc_pid_status(struct task_struct *task, char *buffer)
+int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid,
+ struct task_struct *task)
{
- char *orig = buffer;
struct mm_struct *mm = get_task_mm(task);

- buffer = task_name(task, buffer);
- buffer = task_state(task, buffer);
+ task_name(m, task);
+ task_state(m, ns, pid, task);

if (mm) {
- buffer = task_mem(mm, buffer);
+ task_mem(m, mm);
mmput(mm);
}
- buffer = task_sig(task, buffer);
- buffer = task_cap(task, buffer);
- buffer = cpuset_task_status_allowed(task, buffer);
+ task_sig(m, task);
+ task_cap(m, task);
+ cpuset_task_status_allowed(m, task);
#if defined(CONFIG_S390)
- buffer = task_show_regs(task, buffer);
+ task_show_regs(m, task);
#endif
- buffer = task_context_switch_counts(task, buffer);
- return buffer - orig;
+ task_context_switch_counts(m, task);
+ return 0;
}

/*
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 51f0e7a..086a262 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2222,7 +2222,7 @@ static const struct pid_entry tgid_base_stuff[] = {
DIR("fdinfo", S_IRUSR|S_IXUSR, fdinfo),
REG("environ", S_IRUSR, environ),
INF("auxv", S_IRUSR, pid_auxv),
- INF("status", S_IRUGO, pid_status),
+ ONE("status", S_IRUGO, pid_status),
INF("limits", S_IRUSR, pid_limits),
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, pid_sched),
@@ -2543,7 +2543,7 @@ static const struct pid_entry tid_base_stuff[] = {
DIR("fdinfo", S_IRUSR|S_IXUSR, fdinfo),
REG("environ", S_IRUSR, environ),
INF("auxv", S_IRUSR, pid_auxv),
- INF("status", S_IRUGO, pid_status),
+ ONE("status", S_IRUGO, pid_status),
INF("limits", S_IRUSR, pid_limits),
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, pid_sched),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 74bc772..0033645 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -48,7 +48,7 @@ extern void create_seq_entry(char *name, mode_t mode, const struct file_operatio
extern int proc_exe_link(struct inode *, struct dentry **, struct vfsmount **);
extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
-extern int proc_pid_status(struct task_struct *, char *);
+extern int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
extern int proc_pid_statm(struct task_struct *, char *);

extern const struct file_operations proc_maps_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c24d81a..d5cb8fb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -6,13 +6,14 @@
#include <linux/ptrace.h>
#include <linux/pagemap.h>
#include <linux/mempolicy.h>
+#include <linux/seq_file.h>

#include <asm/elf.h>
#include <asm/uaccess.h>
#include <asm/tlbflush.h>
#include "internal.h"

-char *task_mem(struct mm_struct *mm, char *buffer)
+void task_mem(struct seq_file *m, struct mm_struct *mm)
{
unsigned long data, text, lib;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
@@ -34,7 +35,7 @@ char *task_mem(struct mm_struct *mm, char *buffer)
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
- buffer += sprintf(buffer,
+ seq_printf(m,
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
@@ -53,7 +54,6 @@ char *task_mem(struct mm_struct *mm, char *buffer)
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
- return buffer;
}

unsigned long task_vsize(struct mm_struct *mm)
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index d8b8c71..d4e9495 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -12,7 +12,7 @@
* each process that owns it. Non-shared memory is counted
* accurately.
*/
-char *task_mem(struct mm_struct *mm, char *buffer)
+void task_mem(struct seq_file *m, struct mm_struct *mm)
{
struct vm_list_struct *vml;
unsigned long bytes = 0, sbytes = 0, slack = 0;
diff --git a/include/asm-s390/processor.h b/include/asm-s390/processor.h
index 21d40a1..3d4ff06 100644
--- a/include/asm-s390/processor.h
+++ b/include/asm-s390/processor.h
@@ -180,7 +180,7 @@ extern unsigned long thread_saved_pc(struct task_struct *t);
/*
* Print register of task into buffer. Used in fs/proc/array.c.
*/
-extern char *task_show_regs(struct task_struct *task, char *buffer);
+extern void task_show_regs(struct seq_file *m, struct task_struct *task);

extern void show_registers(struct pt_regs *regs);
extern void show_code(struct pt_regs *regs);
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index ecae585..3eb2e81 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -57,7 +57,8 @@ extern int cpuset_memory_pressure_enabled;
extern void __cpuset_memory_pressure_bump(void);

extern const struct file_operations proc_cpuset_operations;
-extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
+struct seq_file;
+extern void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task);

extern void cpuset_lock(void);
extern void cpuset_unlock(void);
@@ -126,10 +127,9 @@ static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,

static inline void cpuset_memory_pressure_bump(void) {}

-static inline char *cpuset_task_status_allowed(struct task_struct *task,
- char *buffer)
+static inline void cpuset_task_status_allowed(struct seq_file *m,
+ struct task_struct *task)
{
- return buffer;
}

static inline void cpuset_lock(void) {}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 69307c9..b070b3b 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -114,7 +114,7 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, struct
int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir);
unsigned long task_vsize(struct mm_struct *);
int task_statm(struct mm_struct *, int *, int *, int *, int *);
-char *task_mem(struct mm_struct *, char *);
+void task_mem(struct seq_file *, struct mm_struct *);
void clear_refs_smap(struct mm_struct *mm);

struct proc_dir_entry *de_get(struct proc_dir_entry *de);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 50f5dc4..9a0ebcf 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2219,13 +2219,15 @@ const struct file_operations proc_cpuset_operations = {
#endif /* CONFIG_PROC_PID_CPUSET */

/* Display task cpus_allowed, mems_allowed in /proc/<pid>/status file. */
-char *cpuset_task_status_allowed(struct task_struct *task, char *buffer)
+void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
{
- buffer += sprintf(buffer, "Cpus_allowed:\t");
- buffer += cpumask_scnprintf(buffer, PAGE_SIZE, task->cpus_allowed);
- buffer += sprintf(buffer, "\n");
- buffer += sprintf(buffer, "Mems_allowed:\t");
- buffer += nodemask_scnprintf(buffer, PAGE_SIZE, task->mems_allowed);
- buffer += sprintf(buffer, "\n");
- return buffer;
+ int len;
+ seq_printf(m, "Cpus_allowed:\t");
+ m->count += cpumask_scnprintf(m->buf + m->count, m->size - m->count,
+ task->cpus_allowed);
+ seq_printf(m, "\n");
+ seq_printf(m, "Mems_allowed:\t");
+ m->count += nodemask_scnprintf(m->buf + m->count, m->size - m->count,
+ task->mems_allowed);
+ seq_printf(buffer, "\n");
}
--
1.5.3.rc6.17.g1911

2007-11-19 22:14:25

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] proc: seqfile convert proc_pid_statm


This conversion is just for code cleanliness, uniformity, and general safety.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/array.c | 8 +++++---
fs/proc/base.c | 4 ++--
fs/proc/internal.h | 2 +-
3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index d23d91d..b1a90c0 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -553,7 +553,8 @@ int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns,
return do_task_stat(m, ns, pid, task, 1);
}

-int proc_pid_statm(struct task_struct *task, char *buffer)
+int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
{
int size = 0, resident = 0, shared = 0, text = 0, lib = 0, data = 0;
struct mm_struct *mm = get_task_mm(task);
@@ -562,7 +563,8 @@ int proc_pid_statm(struct task_struct *task, char *buffer)
size = task_statm(mm, &shared, &text, &data, &resident);
mmput(mm);
}
+ seq_printf(m, "%d %d %d %d %d %d %d\n",
+ size, resident, shared, text, lib, data, 0);

- return sprintf(buffer, "%d %d %d %d %d %d %d\n",
- size, resident, shared, text, lib, data, 0);
+ return 0;
}
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 086a262..5b92b79 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2229,7 +2229,7 @@ static const struct pid_entry tgid_base_stuff[] = {
#endif
INF("cmdline", S_IRUGO, pid_cmdline),
ONE("stat", S_IRUGO, tgid_stat),
- INF("statm", S_IRUGO, pid_statm),
+ ONE("statm", S_IRUGO, pid_statm),
REG("maps", S_IRUGO, maps),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, numa_maps),
@@ -2550,7 +2550,7 @@ static const struct pid_entry tid_base_stuff[] = {
#endif
INF("cmdline", S_IRUGO, pid_cmdline),
ONE("stat", S_IRUGO, tid_stat),
- INF("statm", S_IRUGO, pid_statm),
+ ONE("statm", S_IRUGO, pid_statm),
REG("maps", S_IRUGO, maps),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, numa_maps),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 0033645..4c17360 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -49,7 +49,7 @@ extern int proc_exe_link(struct inode *, struct dentry **, struct vfsmount **);
extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
extern int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);
-extern int proc_pid_statm(struct task_struct *, char *);
+extern int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task);

extern const struct file_operations proc_maps_operations;
extern const struct file_operations proc_numa_maps_operations;
--
1.5.3.rc6.17.g1911

2007-11-19 22:31:47

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] proc: Proper pidns handling for /proc/self


Currently if you access a /proc that is not mounted with your
processes current pid namespace /proc/self will point at a completely
different process. Ouch!.

This patch fixes /proc/self to point to the current process if
it is available in the particular mount of /proc or to return
-ENOENT if the current process is not visible.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/proc/base.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 7e8eca4..34a1821 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2049,15 +2049,23 @@ static const struct file_operations proc_coredump_filter_operations = {
static int proc_self_readlink(struct dentry *dentry, char __user *buffer,
int buflen)
{
+ struct pid_namespace *ns = dentry->d_sb->s_fs_info;
+ pid_t tgid = task_tgid_nr_ns(current, ns);
char tmp[PROC_NUMBUF];
- sprintf(tmp, "%d", task_tgid_vnr(current));
+ if (!tgid)
+ return -ENOENT;
+ sprintf(tmp, "%d", tgid);
return vfs_readlink(dentry,buffer,buflen,tmp);
}

static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
{
+ struct pid_namespace *ns = dentry->d_sb->s_fs_info;
+ pid_t tgid = task_tgid_nr_ns(current, ns);
char tmp[PROC_NUMBUF];
- sprintf(tmp, "%d", task_tgid_vnr(current));
+ if (!tgid)
+ return ERR_PTR(-ENOENT);
+ sprintf(tmp, "%d", task_tgid_nr_ns(current, ns));
return ERR_PTR(vfs_follow_link(nd,tmp));
}

--
1.5.3.rc6.17.g1911

2007-11-20 10:35:28

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 0/4] proc: Cleanup status files.

Eric W. Biederman wrote:
> There is a long standing ugliness with /proc/<pid>/stat,
> /proc/<pid>/statm, and /proc/<pid>/status that they do not
> use the seqfile API.
>
> In addition they are currently reporting pids in the pid namespace
> of the current task instead of the pid namespace with which proc
> was mounted which is confusing.

100% agree with that. Thanks for fixing this issue :)

Acked-by: Pavel Emelyanov <[email protected]>

> This patch series first build the infrastructure to allow these
> problems to be fixed and then it converts the individual proc
> status files.
>
> Eric
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>