2016-04-11 23:36:09

by Andrei Vagin

Subject: [PATCH 0/15] task_diag: add a new interface to get information about processes (v3)

The current interface is a set of files in /proc/PID. While this appears to be
simple, there are a number of problems with it.

* Lots of syscalls

At least three syscalls are required per PID: open(), read(), and
close().

* Variety of formats

The files in the /proc/PID/ hierarchy use many different formats.
Therefore, a separate parser has to be written for each format.

* Non-extendable formats

Some formats in /proc/PID are non-extendable. For example, the last column
of /proc/PID/maps (the file name) is optional, so there is no way to add more
columns without breaking the format.

* Slow read due to extra info

Sometimes getting information is slow due to extra attributes that are not
always needed. For example, /proc/PID/smaps contains the VmFlags field (which
can't be added to /proc/PID/maps, see the previous item), but it also contains
page stats that take a long time to generate.

$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s


$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
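
The per-PID syscall cost from the first item above can be sketched as
follows. This is only an illustration, not code from this series: the names
read_pid_file() and nr_syscalls are made up for the example, and the counter
just makes the open/read/close cost per file visible.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Syscalls issued so far by read_pid_file(); for illustration only. */
static unsigned long nr_syscalls;

/*
 * Read /proc/<pid>/<name> into buf as a NUL-terminated string.
 * Every successful call costs three syscalls (open, read, close),
 * no matter how little of the file the caller actually needs.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_pid_file(long pid, const char *name, char *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	assert(name != NULL && buf != NULL);
	if (len == 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/%ld/%s", pid, name);

	fd = open(path, O_RDONLY);
	nr_syscalls++;
	if (fd < 0)
		return -1;

	n = read(fd, buf, len - 1);
	nr_syscalls++;
	if (n >= 0)
		buf[n] = '\0';

	close(fd);
	nr_syscalls++;

	return n;
}
```

Calling read_pid_file(pid, "stat", buf, sizeof(buf)) for each of N tasks
therefore costs about 3 * N syscalls, before any parsing is done.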

Proposed solution
-----------------

The proposed solution is the /proc/task_diag file, which operates based on the
following principles:

* Transactional: write request, read response
* Netlink message format (same as used by sock_diag; binary and extendable)
* Ability to specify a set of processes to get info about
* Optimal grouping of attributes
  (no attribute in a group should significantly affect the response time)

The user-kernel interface is encapsulated in include/uapi/linux/task_diag.h

A request is described by the task_diag_pid structure:

struct task_diag_pid {
	__u64	show_flags;	/* specify which information is required */
	__u64	dump_strategy;	/* specify a group of processes */

	__u32	pid;
};

dump_strategy specifies a group of processes:
/* system-wide strategies (the pid field is ignored) */
	TASK_DIAG_DUMP_ALL        - all processes
	TASK_DIAG_DUMP_ALL_THREAD - all threads
/* per-process strategies */
	TASK_DIAG_DUMP_CHILDREN   - all children
	TASK_DIAG_DUMP_THREAD     - all threads
	TASK_DIAG_DUMP_ONE        - one process

show_flags specifies which information is required. If we set the
TASK_DIAG_SHOW_BASE flag, the response message will contain the TASK_DIAG_BASE
attribute, which is described by the task_diag_base structure.

struct task_diag_base {
	__u32	tgid;
	__u32	pid;
	__u32	ppid;
	__u32	tpid;
	__u32	sid;
	__u32	pgid;
	__u8	state;
	char	comm[TASK_DIAG_COMM_LEN];
};

In the future, it can be extended with optional attributes. The request
describes which task properties are required and for which processes.
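
A request could be filled in roughly as shown below. This is a sketch under
assumptions, not the selftest code: the constants are local copies of the
values quoted in this series (the header is not in mainline), the field is
spelled dump_strategy as in the later patches, and struct diag_request and
diag_request_init() are hypothetical names. It also assumes that a request is
framed as a netlink header followed by task_diag_pid; nlmsg_type is left at
zero because this cover letter does not define it.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <linux/types.h>
#include <string.h>

/* Local copies of values quoted in this series; not in mainline headers. */
#define TASK_DIAG_SHOW_BASE		(1ULL << 0)
#define TASK_DIAG_DUMP_ALL		0
#define TASK_DIAG_DUMP_ONE		1
#define TASK_DIAG_DUMP_ALL_THREAD	2
#define TASK_DIAG_DUMP_CHILDREN		3
#define TASK_DIAG_DUMP_THREAD		4

struct task_diag_pid {
	__u64	show_flags;
	__u64	dump_strategy;
	__u32	pid;
};

/* Assumption: a request is a netlink header followed by task_diag_pid. */
struct diag_request {
	struct nlmsghdr		nlh;
	struct task_diag_pid	pid;
};

static void diag_request_init(struct diag_request *req, __u64 show_flags,
			      __u64 strategy, __u32 pid)
{
	assert(req != NULL);

	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->pid));
	req->nlh.nlmsg_flags = NLM_F_REQUEST;

	req->pid.show_flags = show_flags;
	req->pid.dump_strategy = strategy;

	/* The pid field is ignored by the system-wide strategies. */
	if (strategy != TASK_DIAG_DUMP_ALL &&
	    strategy != TASK_DIAG_DUMP_ALL_THREAD)
		req->pid.pid = pid;
}
```

For example, diag_request_init(&req, TASK_DIAG_SHOW_BASE,
TASK_DIAG_DUMP_CHILDREN, 1) asks for the base attributes of all children
of pid 1.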

A response can be divided into a few netlink packets. Each task is described
by a netlink message. If all information about a process doesn't fit into a
message, the TASK_DIAG_FLAG_CONT flag will be set and the next message will
continue describing the same process.
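
On the receiving side, the continuation rule could be handled as in the
sketch below. The netlink walking macros are the standard ones from
<linux/netlink.h>, but struct demo_msg and DEMO_FLAG_CONT are hypothetical
stand-ins: the real message layout and the value of TASK_DIAG_FLAG_CONT are
defined in task_diag.h and are not reproduced in this cover letter. put_msg()
exists only to build sample input.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <linux/types.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the real message header and flag value. */
#define DEMO_FLAG_CONT	0x1

struct demo_msg {
	__u32	pid;
	__u32	flags;
};

/*
 * Count distinct tasks in a buffer of back-to-back netlink messages.
 * A message with the CONT flag set means that the next message still
 * describes the same task.
 */
static int count_tasks(void *buf, int len)
{
	struct nlmsghdr *nlh;
	int ntasks = 0, continued = 0;

	assert(buf != NULL);

	for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		struct demo_msg *msg = NLMSG_DATA(nlh);

		if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*msg)))
			break;		/* truncated payload */
		if (!continued)
			ntasks++;	/* first message for this task */
		continued = msg->flags & DEMO_FLAG_CONT;
	}

	return ntasks;
}

/* Append one demo message at offset off; returns the new offset. */
static size_t put_msg(char *buf, size_t off, __u32 pid, __u32 flags)
{
	struct nlmsghdr *nlh = (struct nlmsghdr *)(buf + off);
	struct demo_msg *msg;

	memset(nlh, 0, NLMSG_SPACE(sizeof(*msg)));
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*msg));
	msg = NLMSG_DATA(nlh);
	msg->pid = pid;
	msg->flags = flags;

	return off + NLMSG_SPACE(sizeof(*msg));
}
```

With this rule, two messages for pid 1 (the first one flagged as continued)
followed by one message for pid 2 are counted as two tasks.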

task_diag is much faster than the proc file system. We don't need to create
a new file descriptor for each task; we only need to send a request and get a
response. This makes it possible to get information about several tasks in one
request-response iteration.
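
The write-then-read transaction on a single descriptor might look like the
sketch below. It is not taken from the selftests: diag_transact() is a
hypothetical helper, the path is passed in by the caller (for example
"/proc/task_diag" on a kernel carrying this series), and it assumes that the
end of a response is signalled by read() returning 0.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * One transaction: write the request, then read the response until
 * EOF or until buf is full. A single descriptor serves all tasks
 * covered by the request.
 * Returns the number of response bytes stored in buf, or -1 on error.
 */
static ssize_t diag_transact(const char *path, const void *req, size_t req_len,
			     char *buf, size_t buf_len)
{
	size_t total = 0;
	ssize_t n;
	int fd;

	assert(path != NULL && req != NULL && buf != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	if (write(fd, req, req_len) != (ssize_t)req_len) {
		close(fd);
		return -1;
	}

	while (total < buf_len) {
		n = read(fd, buf + total, buf_len - total);
		if (n < 0) {
			close(fd);
			return -1;
		}
		if (n == 0)
			break;	/* assumed end of the response */
		total += n;
	}

	close(fd);
	return (ssize_t)total;
}
```

Compared with the /proc/PID approach, the number of open() and close() calls
no longer grows with the number of tasks.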

As for security, task_diag always behaves like procfs mounted with hidepid=2
(the highest level of security).

I have compared performance of procfs and task-diag for the
"ps ax -o pid,ppid" command.

ps uses /proc/PID/* files:
$ time ./ps/pscommand ax | wc -l
50089

real 0m1.596s
user 0m0.475s
sys 0m1.126s

ps uses the task_diag interface:
$ time ./ps/pscommand ax | wc -l
50089

real 0m0.148s
user 0m0.069s
sys 0m0.086s

Read /proc/PID/stat for 30K tasks:
$ time ./task_proc_all > /dev/null

real 0m0.258s
user 0m0.019s
sys 0m0.232s

Get the same information via task_diag:
$ time ./task_diag_all > /dev/null

real 0m0.052s
user 0m0.013s
sys 0m0.036s

And here are statistics on syscalls which were called by each
command.

$ perf trace -s -o log -- ./task_proc_all > /dev/null

Summary of events:

task_proc_all (30781), 180785 events, 100.0%, 0.000 msec

syscall            calls       min       avg       max     stddev
                            (msec)    (msec)    (msec)        (%)
--------------- -------- --------- --------- ---------     ------
read               30111     0.000     0.013     0.107      0.21%
write                  1     0.008     0.008     0.008      0.00%
open               30111     0.007     0.012     0.145      0.24%
close              30112     0.004     0.011     0.110      0.20%
fstat                  3     0.009     0.013     0.016     16.15%
mmap                   8     0.011     0.020     0.027     11.24%
mprotect               4     0.019     0.023     0.028      8.33%
munmap                 1     0.026     0.026     0.026      0.00%
brk                    8     0.007     0.015     0.024     11.94%
ioctl                  1     0.007     0.007     0.007      0.00%
access                 1     0.019     0.019     0.019      0.00%
execve                 1     0.000     0.000     0.000      0.00%
getdents              29     0.008     1.010     2.215      8.88%
arch_prctl             1     0.016     0.016     0.016      0.00%
openat                 1     0.021     0.021     0.021      0.00%


$ perf trace -s -o log -- ./task_diag_all > /dev/null
Summary of events:

task_diag_all (30762), 717 events, 98.9%, 0.000 msec

syscall            calls       min       avg       max     stddev
                            (msec)    (msec)    (msec)        (%)
--------------- -------- --------- --------- ---------     ------
read                   2     0.000     0.008     0.016    100.00%
write                197     0.008     0.019     0.041      3.00%
open                   2     0.023     0.029     0.036     22.45%
close                  3     0.010     0.012     0.014     11.34%
fstat                  3     0.012     0.044     0.106     70.52%
mmap                   8     0.014     0.031     0.054     18.88%
mprotect               4     0.016     0.023     0.027     10.93%
munmap                 1     0.022     0.022     0.022      0.00%
brk                    1     0.040     0.040     0.040      0.00%
ioctl                  1     0.011     0.011     0.011      0.00%
access                 1     0.032     0.032     0.032      0.00%
getpid                 1     0.012     0.012     0.012      0.00%
socket                 1     0.032     0.032     0.032      0.00%
sendto                 2     0.032     0.095     0.157     65.77%
recvfrom             129     0.009     0.235     0.418      2.45%
bind                   1     0.018     0.018     0.018      0.00%
execve                 1     0.000     0.000     0.000      0.00%
arch_prctl             1     0.012     0.012     0.012      0.00%

You can find the test programs from this experiment in
tools/testing/selftests/task_diag.

The idea of this functionality was suggested by Pavel Emelyanov (xemul@),
when he found that operations on /proc form a significant part
of checkpointing time.

Ten years ago there was an attempt to add a netlink interface for accessing
/proc information:
http://lwn.net/Articles/99600/

Links
-----

kernel: https://github.com/avagin/linux-task-diag
procps: https://github.com/avagin/procps-task-diag
wiki: https://criu.org/Task-diag

Changes from the first version:
-------------------------------

David Ahern implemented all required functionality to use task_diag in
perf.

Below are his results showing how it affects performance.
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of specjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.

Many thanks to David Ahern for the help with improving task_diag.

Changes from the second version:
--------------------------------

Use a proc transaction file instead of the netlink interface.
Andy Lutomirski pointed out security problems related to netlink sockets:

> Slightly off-topic, but this netlink is really rather bad as an
> example of how fds can be used as capabilities (in the real capability
> sense, not the Linux capabilities sense). You call socket and get a
> socket. That socket captures f_cred. Then you drop privs, and you
> assume that the socket you're holding on to retains the right to do
> certain things.
>
> This breaks pretty badly when, through things such as this patch set,
> existing code that creates netlink sockets suddenly starts capturing
> brand-new rights that didn't exist as part of a netlink socket before.

Cc: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Roger Luethi <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Pavel Odintsov <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
--
2.1.0


2016-04-11 23:36:11

by Andrei Vagin

Subject: [PATCH 01/15] proc: pick out a function to iterate task children

This function will be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/array.c | 53 +++++++++++++++++++++++++++++++++--------------------
fs/proc/internal.h | 3 +++
2 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..3eceab1 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -593,31 +593,25 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
}

#ifdef CONFIG_PROC_CHILDREN
-static struct pid *
-get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
+struct task_struct *
+task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos)
{
- struct task_struct *start, *task;
- struct pid *pid = NULL;
-
- read_lock(&tasklist_lock);
-
- start = pid_task(proc_pid(inode), PIDTYPE_PID);
- if (!start)
- goto out;
+ struct task_struct *task;

/*
* Lets try to continue searching first, this gives
* us significant speedup on children-rich processes.
*/
- if (pid_prev) {
- task = pid_task(pid_prev, PIDTYPE_PID);
- if (task && task->real_parent == start &&
+ if (prev) {
+ task = prev;
+ if (task && task->real_parent == parent &&
!(list_empty(&task->sibling))) {
- if (list_is_last(&task->sibling, &start->children))
+ if (list_is_last(&task->sibling, &parent->children)) {
+ task = NULL;
goto out;
+ }
task = list_first_entry(&task->sibling,
struct task_struct, sibling);
- pid = get_pid(task_pid(task));
goto out;
}
}
@@ -637,12 +631,31 @@ get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
* So one need to stop or freeze the leader and all
* its children to get a precise result.
*/
- list_for_each_entry(task, &start->children, sibling) {
- if (pos-- == 0) {
- pid = get_pid(task_pid(task));
- break;
- }
+ list_for_each_entry(task, &parent->children, sibling) {
+ if (pos-- == 0)
+ goto out;
}
+ task = NULL;
+out:
+ return task;
+}
+
+static struct pid *
+get_children_pid(struct inode *inode, struct pid *prev_pid, loff_t pos)
+{
+ struct task_struct *start, *task, *prev;
+ struct pid *pid = NULL;
+
+ read_lock(&tasklist_lock);
+ start = pid_task(proc_pid(inode), PIDTYPE_PID);
+ if (!start)
+ goto out;
+
+ prev = prev_pid ? pid_task(prev_pid, PIDTYPE_PID) : NULL;
+
+ task = task_next_child(start, prev, pos);
+ if (task)
+ pid = get_pid(task_pid(task));

out:
read_unlock(&tasklist_lock);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa27810..969e05b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -303,3 +303,6 @@ extern unsigned long task_statm(struct mm_struct *,
unsigned long *, unsigned long *,
unsigned long *, unsigned long *);
extern void task_mem(struct seq_file *, struct mm_struct *);
+
+struct task_struct *
+task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos);
--
2.5.5

2016-04-11 23:36:13

by Andrei Vagin

Subject: [PATCH 02/15] proc: export task_first_tid() and task_next_tid()

Export these functions so that they can be used in task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/base.c | 8 ++++----
fs/proc/internal.h | 4 ++++
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b1755b2..614f1d0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3341,7 +3341,7 @@ out_no_task:
* In the case of a seek we start with the leader and walk nr
* threads past it.
*/
-static struct task_struct *first_tid(struct pid *pid, int tid, loff_t f_pos,
+struct task_struct *task_first_tid(struct pid *pid, int tid, loff_t f_pos,
struct pid_namespace *ns)
{
struct task_struct *pos, *task;
@@ -3390,7 +3390,7 @@ out:
*
* The reference to the input task_struct is released.
*/
-static struct task_struct *next_tid(struct task_struct *start)
+struct task_struct *task_next_tid(struct task_struct *start)
{
struct task_struct *pos = NULL;
rcu_read_lock();
@@ -3426,9 +3426,9 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
ns = inode->i_sb->s_fs_info;
tid = (int)file->f_version;
file->f_version = 0;
- for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
+ for (task = task_first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
task;
- task = next_tid(task), ctx->pos++) {
+ task = task_next_tid(task), ctx->pos++) {
char name[PROC_NUMBUF];
int len;
tid = task_pid_nr_ns(task, ns);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 969e05b..49145e2 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -306,3 +306,7 @@ extern void task_mem(struct seq_file *, struct mm_struct *);

struct task_struct *
task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos);
+
+struct task_struct *task_first_tid(struct pid *pid, int tid, loff_t f_pos,
+ struct pid_namespace *ns);
+struct task_struct *task_next_tid(struct task_struct *start);
--
2.5.5

2016-04-11 23:36:31

by Andrei Vagin

Subject: [PATCH 07/15] task_diag: add ability to dump children and threads

Now we can dump all tasks, or the children or threads of a specified task.
This is an example of how the interface can be extended for different
use cases.

v2: Fixes from David Ahern:
    * Add missing break in iter_stop
    * Fix 8-byte alignment issues

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_diag.c | 93 ++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/task_diag.h | 7 +++-
2 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index 9c1ed45..e0f0b03 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -479,6 +479,12 @@ static void iter_stop(struct task_iter *iter)
case TASK_DIAG_DUMP_ALL:
task = iter->tgid.task;
break;
+ case TASK_DIAG_DUMP_ALL_THREAD:
+ /* release both tgid task and thread task */
+ if (iter->task)
+ put_task_struct(iter->task);
+ task = iter->tgid.task;
+ break;
default:
task = iter->task;
}
@@ -486,6 +492,23 @@ static void iter_stop(struct task_iter *iter)
put_task_struct(task);
}

+static struct task_struct *
+task_diag_next_child(struct task_struct *parent,
+ struct task_struct *prev, unsigned int pos)
+{
+ struct task_struct *task;
+
+ read_lock(&tasklist_lock);
+ task = task_next_child(parent, prev, pos);
+ if (prev)
+ put_task_struct(prev);
+ if (task)
+ get_task_struct(task);
+ read_unlock(&tasklist_lock);
+
+ return task;
+}
+
static struct task_struct *iter_start(struct task_iter *iter)
{
if (iter->req.pid > 0) {
@@ -508,11 +531,48 @@ static struct task_struct *iter_start(struct task_iter *iter)
iter->task = NULL;
return iter->task;

+ case TASK_DIAG_DUMP_THREAD:
+ if (iter->parent == NULL)
+ return ERR_PTR(-ESRCH);
+
+ iter->pos = iter->cb->pos;
+ iter->task = task_first_tid(task_pid(iter->parent),
+ iter->cb->pid,iter->pos, iter->ns);
+ return iter->task;
+
+ case TASK_DIAG_DUMP_CHILDREN:
+ if (iter->parent == NULL)
+ return ERR_PTR(-ESRCH);
+
+ iter->pos = iter->cb->pos;
+ iter->task = task_diag_next_child(iter->parent, NULL, iter->pos);
+ return iter->task;
+
case TASK_DIAG_DUMP_ALL:
iter->tgid.tgid = iter->cb->pid;
iter->tgid.task = NULL;
iter->tgid = next_tgid(iter->ns, iter->tgid);
return iter->tgid.task;
+
+ case TASK_DIAG_DUMP_ALL_THREAD:
+ iter->pos = iter->cb->pos;
+ iter->tgid.tgid = iter->cb->pid;
+ iter->tgid.task = NULL;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ if (!iter->tgid.task)
+ return NULL;
+
+ iter->task = task_first_tid(task_pid(iter->tgid.task),
+ 0, iter->pos, iter->ns);
+ if (!iter->task) {
+ iter->pos = 0;
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->task = iter->tgid.task;
+ if (iter->task)
+ get_task_struct(iter->task);
+ }
+ return iter->task;
}

return ERR_PTR(-EINVAL);
@@ -529,11 +589,44 @@ static struct task_struct *iter_next(struct task_iter *iter)
iter->task = NULL;
return NULL;

+ case TASK_DIAG_DUMP_THREAD:
+ iter->pos++;
+ iter->task = task_next_tid(iter->task);
+ iter->cb->pos = iter->pos;
+ if (iter->task)
+ iter->cb->pid = task_pid_nr_ns(iter->task, iter->ns);
+ else
+ iter->cb->pid = -1;
+ return iter->task;
+ case TASK_DIAG_DUMP_CHILDREN:
+ iter->pos++;
+ iter->task = task_diag_next_child(iter->parent, iter->task, iter->pos);
+ iter->cb->pos = iter->pos;
+ return iter->task;
+
case TASK_DIAG_DUMP_ALL:
iter->tgid.tgid += 1;
iter->tgid = next_tgid(iter->ns, iter->tgid);
iter->cb->pid = iter->tgid.tgid;
return iter->tgid.task;
+
+ case TASK_DIAG_DUMP_ALL_THREAD:
+ iter->pos++;
+ iter->task = task_next_tid(iter->task);
+ if (!iter->task) {
+ iter->pos = 0;
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->task = iter->tgid.task;
+ if (iter->task)
+ get_task_struct(iter->task);
+ }
+
+ /* save current position */
+ iter->cb->pid = iter->tgid.tgid;
+ iter->cb->pos = iter->pos;
+
+ return iter->task;
}

return NULL;
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 3486f2f..8bccd02 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -151,8 +151,11 @@ struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma *vma)
(void *) vma < nla_data(attr) + nla_len(attr); \
vma = (void *) vma + vma->vma_len)

-#define TASK_DIAG_DUMP_ALL 0
-#define TASK_DIAG_DUMP_ONE 1
+#define TASK_DIAG_DUMP_ALL 0
+#define TASK_DIAG_DUMP_ONE 1
+#define TASK_DIAG_DUMP_ALL_THREAD 2
+#define TASK_DIAG_DUMP_CHILDREN 3
+#define TASK_DIAG_DUMP_THREAD 4

struct task_diag_pid {
__u64 show_flags;
--
2.5.5

2016-04-11 23:36:17

by Andrei Vagin

Subject: [PATCH 05/15] task_diag: add a new group to get process credentials

A response is represented by the task_diag_creds structure:

struct task_diag_creds {
	struct task_diag_caps cap_inheritable;
	struct task_diag_caps cap_permitted;
	struct task_diag_caps cap_effective;
	struct task_diag_caps cap_bset;

	__u32 uid;
	__u32 euid;
	__u32 suid;
	__u32 fsuid;
	__u32 gid;
	__u32 egid;
	__u32 sgid;
	__u32 fsgid;
};

This group is optional and it's filled only if show_flags contains
TASK_DIAG_SHOW_CRED.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_diag.c | 50 ++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/task_diag.h | 21 ++++++++++++++++++
2 files changed, 71 insertions(+)

diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index 3c2127e..fc31771 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -80,6 +80,48 @@ static int fill_task_base(struct task_struct *p,
return 0;
}

+static inline void caps2diag(struct task_diag_caps *diag, const kernel_cap_t *cap)
+{
+ int i;
+
+ for (i = 0; i < _LINUX_CAPABILITY_U32S_3; i++)
+ diag->cap[i] = cap->cap[i];
+}
+
+static int fill_creds(struct task_struct *p, struct sk_buff *skb,
+ struct user_namespace *user_ns)
+{
+ struct task_diag_creds *diag_cred;
+ const struct cred *cred;
+ struct nlattr *attr;
+
+ attr = nla_reserve(skb, TASK_DIAG_CRED, sizeof(struct task_diag_creds));
+ if (!attr)
+ return -EMSGSIZE;
+
+ diag_cred = nla_data(attr);
+
+ cred = get_task_cred(p);
+
+ caps2diag(&diag_cred->cap_inheritable, &cred->cap_inheritable);
+ caps2diag(&diag_cred->cap_permitted, &cred->cap_permitted);
+ caps2diag(&diag_cred->cap_effective, &cred->cap_effective);
+ caps2diag(&diag_cred->cap_bset, &cred->cap_bset);
+
+ diag_cred->uid = from_kuid_munged(user_ns, cred->uid);
+ diag_cred->euid = from_kuid_munged(user_ns, cred->euid);
+ diag_cred->suid = from_kuid_munged(user_ns, cred->suid);
+ diag_cred->fsuid = from_kuid_munged(user_ns, cred->fsuid);
+ diag_cred->gid = from_kgid_munged(user_ns, cred->gid);
+ diag_cred->egid = from_kgid_munged(user_ns, cred->egid);
+ diag_cred->sgid = from_kgid_munged(user_ns, cred->sgid);
+ diag_cred->fsgid = from_kgid_munged(user_ns, cred->fsgid);
+
+ put_cred(cred);
+
+ return 0;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
struct task_diag_pid *req,
struct task_diag_cb *cb, struct pid_namespace *pidns,
@@ -113,6 +155,14 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_CRED) {
+ if (i >= n)
+ err = fill_creds(tsk, skb, userns);
+ if (err)
+ goto err;
+ i++;
+ }
+
nlmsg_end(skb, nlh);
if (cb)
cb->attr = 0;
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index ba0f71a..ea500c6 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -15,12 +15,14 @@ struct task_diag_msg {

enum {
TASK_DIAG_BASE = 0,
+ TASK_DIAG_CRED,

__TASK_DIAG_ATTR_MAX
#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
};

#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
+#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)

enum {
TASK_DIAG_RUNNING,
@@ -45,6 +47,25 @@ struct task_diag_base {
char comm[TASK_DIAG_COMM_LEN];
};

+struct task_diag_caps {
+ __u32 cap[_LINUX_CAPABILITY_U32S_3];
+};
+
+struct task_diag_creds {
+ struct task_diag_caps cap_inheritable;
+ struct task_diag_caps cap_permitted;
+ struct task_diag_caps cap_effective;
+ struct task_diag_caps cap_bset;
+
+ __u32 uid;
+ __u32 euid;
+ __u32 suid;
+ __u32 fsuid;
+ __u32 gid;
+ __u32 egid;
+ __u32 sgid;
+ __u32 fsgid;
+};
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_ONE 1

--
2.5.5

2016-04-11 23:36:37

by Andrei Vagin

Subject: [PATCH 15/15] test: check that task_diag can dump all threads of one process

Signed-off-by: Andrey Vagin <[email protected]>
---
tools/testing/selftests/task_diag/_run.sh | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/task_diag/_run.sh b/tools/testing/selftests/task_diag/_run.sh
index 559f02a..d2e8544 100755
--- a/tools/testing/selftests/task_diag/_run.sh
+++ b/tools/testing/selftests/task_diag/_run.sh
@@ -10,11 +10,15 @@ nchildren=`./task_diag_all children --pid 1 | grep 'pid.*tgid.*ppid.*comm fork$'

./task_diag_all one --pid 1 --cred

+( exec -a fork_thread ./fork 1 1234 )
+pid=`pidof fork_thread`
+ntaskthreads=`./task_diag_all thread --maps --cred --smaps --pid $pid | grep 'pid.*tgid.*ppid.*comm' | wc -l`
killall -9 fork

[ "$nthreads" -eq 10000 ] &&
[ "$nprocesses" -eq 1000 ] &&
[ "$nchildren" -eq 1000 ] &&
+[ "$ntaskthreads" -eq 1234 ] &&
true || {
echo "Unexpected number of tasks $nthreads:$nprocesses" 1>&2
exit 1
--
2.5.5

2016-04-11 23:36:29

by Andrei Vagin

Subject: [PATCH 08/15] task_diag: Only add VMAs for thread_group leader

From: David Ahern <[email protected]>

Threads of a process share the same VMAs, so when dumping all threads
for all processes, only push VMA data for the group leader.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_diag.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index e0f0b03..00db32d 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -433,7 +433,17 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
}

if (show_flags & TASK_DIAG_SHOW_VMA) {
- if (i >= n)
+ bool dump_vma = true;
+
+ /* if the request is to dump all threads of all processes
+ * only show VMAs for group leader.
+ */
+ if ((req->dump_strategy == TASK_DIAG_DUMP_ALL_THREAD ||
+ req->dump_strategy == TASK_DIAG_DUMP_THREAD) &&
+ !thread_group_leader(tsk))
+ dump_vma = false;
+
+ if (dump_vma && i >= n)
err = fill_vma(tsk, skb, cb, &progress, show_flags);
if (err)
goto err;
--
2.5.5

2016-04-11 23:37:06

by Andrei Vagin

Subject: [PATCH 14/15] task_diag: Enhance fork tool to spawn threads

From: David Ahern <[email protected]>

Add option to fork threads as well as processes.
Make the sleep time configurable too so that spawned
tasks exit on their own.

Signed-off-by: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
tools/testing/selftests/task_diag/Makefile | 2 ++
tools/testing/selftests/task_diag/_run.sh | 4 ++--
tools/testing/selftests/task_diag/fork.c | 36 ++++++++++++++++++++++++++----
3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/task_diag/Makefile b/tools/testing/selftests/task_diag/Makefile
index 69b7934..c9977231 100644
--- a/tools/testing/selftests/task_diag/Makefile
+++ b/tools/testing/selftests/task_diag/Makefile
@@ -10,6 +10,8 @@ task_diag_comm.o: task_diag_comm.c task_diag_comm.h

task_diag_all: task_diag_all.o task_diag_comm.o
fork: fork.c
+ $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $^ -lpthread
+
task_proc_all: task_proc_all.c

clean:
diff --git a/tools/testing/selftests/task_diag/_run.sh b/tools/testing/selftests/task_diag/_run.sh
index 3f541fe..559f02a 100755
--- a/tools/testing/selftests/task_diag/_run.sh
+++ b/tools/testing/selftests/task_diag/_run.sh
@@ -2,7 +2,7 @@
set -o pipefail
set -e -x

-./fork 1000
+./fork 1000 10

nprocesses=`./task_diag_all all --maps | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
nthreads=`./task_diag_all All --smaps --cred | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
@@ -12,7 +12,7 @@ nchildren=`./task_diag_all children --pid 1 | grep 'pid.*tgid.*ppid.*comm fork$'

killall -9 fork

-[ "$nthreads" -eq 1000 ] &&
+[ "$nthreads" -eq 10000 ] &&
[ "$nprocesses" -eq 1000 ] &&
[ "$nchildren" -eq 1000 ] &&
true || {
diff --git a/tools/testing/selftests/task_diag/fork.c b/tools/testing/selftests/task_diag/fork.c
index c6e17d1..ebddedd2 100644
--- a/tools/testing/selftests/task_diag/fork.c
+++ b/tools/testing/selftests/task_diag/fork.c
@@ -2,15 +2,39 @@
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
+#include <pthread.h>

+void *f(void *arg)
+{
+ unsigned long t = (unsigned long) arg;
+
+ sleep(t);
+ return NULL;
+}
+
+/* usage: fork nproc [mthreads [sleep]] */
int main(int argc, char **argv)
{
- int i, n;
+ int i, j, n, m = 0;
+ unsigned long t_sleep = 1000;
+ pthread_attr_t attr;
+ pthread_t id;

- if (argc < 2)
+ if (argc < 2) {
+ fprintf(stderr, "usage: fork nproc [mthreads [sleep]]\n");
return 1;
+ }

n = atoi(argv[1]);
+
+ if (argc > 2)
+ m = atoi(argv[2]);
+
+ if (argc > 3)
+ t_sleep = atoi(argv[3]);
+
+ pthread_attr_init(&attr);
+
for (i = 0; i < n; i++) {
pid_t pid;

@@ -20,8 +44,12 @@ int main(int argc, char **argv)
return 1;
}
if (pid == 0) {
- while (1)
- sleep(1000);
+ if (m) {
+ for (j = 0; j < m-1; ++j)
+ pthread_create(&id, &attr, f, (void *)t_sleep);
+ }
+
+ sleep(t_sleep);
return 0;
}
}
--
2.5.5

2016-04-11 23:36:26

by Andrei Vagin

Subject: [PATCH 11/15] task_diag: add a new group to get memory usage

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_diag.c | 79 ++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/task_diag.h | 21 +++++++++++
2 files changed, 100 insertions(+)

diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index c8499f2..3dc3617 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -468,6 +468,77 @@ static int fill_task_stat(struct task_struct *task, struct sk_buff *skb, int who
return 0;
}

+static int fill_task_statm(struct task_struct *task, struct sk_buff *skb, int whole)
+{
+ struct task_diag_statm *st;
+ struct nlattr *attr;
+
+ unsigned long text, lib, swap, ptes, pmds, anon, file, shmem;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+ unsigned long stack_vm, data_vm, locked_vm, pinned_vm;
+ struct mm_struct *mm;
+
+ mm = get_task_mm(task);
+ if (!mm)
+ return 0;
+
+ anon = get_mm_counter(mm, MM_ANONPAGES);
+ file = get_mm_counter(mm, MM_FILEPAGES);
+ shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+
+ /*
+ * Note: to minimize their overhead, mm maintains hiwater_vm and
+ * hiwater_rss only when about to *lower* total_vm or rss. Any
+ * collector of these hiwater stats must therefore get total_vm
+ * and rss too, which will usually be the higher. Barriers? not
+ * worth the effort, such snapshots can always be inconsistent.
+ */
+ hiwater_vm = total_vm = mm->total_vm;
+ if (hiwater_vm < mm->hiwater_vm)
+ hiwater_vm = mm->hiwater_vm;
+ hiwater_rss = total_rss = anon + file + shmem;
+ if (hiwater_rss < mm->hiwater_rss)
+ hiwater_rss = mm->hiwater_rss;
+
+ text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> PAGE_SHIFT;
+ lib = mm->exec_vm - text;
+ swap = get_mm_counter(mm, MM_SWAPENTS);
+ ptes = PTRS_PER_PTE * sizeof(pte_t) * atomic_long_read(&mm->nr_ptes);
+ pmds = PTRS_PER_PMD * sizeof(pmd_t) * mm_nr_pmds(mm);
+
+ data_vm = mm->data_vm;
+ stack_vm = mm->stack_vm;
+ locked_vm = mm->locked_vm;
+ pinned_vm = mm->pinned_vm;
+
+ mmput(mm);
+
+ attr = nla_reserve(skb, TASK_DIAG_STATM, sizeof(*st));
+ if (!attr)
+ return -EMSGSIZE;
+
+ st = nla_data(attr);
+
+ st->anon = anon;
+ st->file = file;
+ st->shmem = shmem;
+ st->hiwater_vm = hiwater_vm;
+ st->hiwater_rss = hiwater_rss;
+ st->text = text;
+ st->lib = lib;
+ st->swap = swap;
+ st->ptes = ptes;
+ st->pmds = pmds;
+ st->total_rss = total_rss;
+ st->total_vm = total_vm;
+ st->data_vm = data_vm;
+ st->stack_vm = stack_vm;
+ st->locked_vm = locked_vm;
+ st->pinned_vm = pinned_vm;
+
+ return 0;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
struct task_diag_pid *req,
struct task_diag_cb *cb, struct pid_namespace *pidns,
@@ -543,6 +614,14 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_STATM) {
+ if (i >= n)
+ err = fill_task_statm(tsk, skb, 1);
+ if (err)
+ goto err;
+ i++;
+ }
+
msg->flags &= ~TASK_DIAG_FLAG_CONT;

nlmsg_end(skb, nlh);
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 551d4fa..9ab96f1 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -21,6 +21,7 @@ enum {
TASK_DIAG_VMA,
TASK_DIAG_VMA_STAT,
TASK_DIAG_STAT,
+ TASK_DIAG_STATM,

__TASK_DIAG_ATTR_MAX
#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
@@ -31,6 +32,7 @@ enum {
#define TASK_DIAG_SHOW_VMA (1ULL << TASK_DIAG_VMA)
#define TASK_DIAG_SHOW_VMA_STAT (1ULL << TASK_DIAG_VMA_STAT)
#define TASK_DIAG_SHOW_STAT (1ULL << TASK_DIAG_STAT)
+#define TASK_DIAG_SHOW_STATM (1ULL << TASK_DIAG_STATM)

enum {
TASK_DIAG_RUNNING,
@@ -168,6 +170,25 @@ struct task_diag_stat {
__u32 threads;
};

+struct task_diag_statm {
+ __u64 anon;
+ __u64 file;
+ __u64 shmem;
+ __u64 total_vm;
+ __u64 total_rss;
+ __u64 hiwater_vm;
+ __u64 hiwater_rss;
+ __u64 text;
+ __u64 lib;
+ __u64 swap;
+ __u64 ptes;
+ __u64 pmds;
+ __u64 locked_vm;
+ __u64 pinned_vm;
+ __u64 data_vm;
+ __u64 stack_vm;
+};
+
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_ONE 1
#define TASK_DIAG_DUMP_ALL_THREAD 2
--
2.5.5

2016-04-11 23:37:34

by Andrei Vagin

Subject: [PATCH 13/15] selftest: check the task_diag functionality

Here are two test (example) programs:

task_diag     - request information for two processes
task_diag_all - request information about all processes

v2: Fixes from David Ahern:
* task_diag: Fix 8-byte alignment for vma and vma_stats

Signed-off-by: Andrey Vagin <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/task_diag/.gitignore | 4 +
tools/testing/selftests/task_diag/Makefile | 16 ++
tools/testing/selftests/task_diag/_run.sh | 21 +++
tools/testing/selftests/task_diag/fork.c | 30 ++++
tools/testing/selftests/task_diag/run.sh | 1 +
tools/testing/selftests/task_diag/task_diag.h | 1 +
tools/testing/selftests/task_diag/task_diag_all.c | 150 ++++++++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.c | 172 +++++++++++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.h | 34 ++++
tools/testing/selftests/task_diag/task_proc_all.c | 35 +++++
11 files changed, 465 insertions(+)
create mode 100644 tools/testing/selftests/task_diag/.gitignore
create mode 100644 tools/testing/selftests/task_diag/Makefile
create mode 100755 tools/testing/selftests/task_diag/_run.sh
create mode 100644 tools/testing/selftests/task_diag/fork.c
create mode 100755 tools/testing/selftests/task_diag/run.sh
create mode 120000 tools/testing/selftests/task_diag/task_diag.h
create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
create mode 100644 tools/testing/selftests/task_diag/task_proc_all.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b04afc3..b399c8b 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -29,6 +29,7 @@ TARGETS += user
TARGETS += vm
TARGETS += x86
TARGETS += zram
+TARGETS += task_diag
#Please keep the TARGETS list alphabetically sorted
# Run "make quicktest=1 run_tests" or
# "make quicktest=1 kselftest from top level Makefile
diff --git a/tools/testing/selftests/task_diag/.gitignore b/tools/testing/selftests/task_diag/.gitignore
new file mode 100644
index 0000000..f963a1f
--- /dev/null
+++ b/tools/testing/selftests/task_diag/.gitignore
@@ -0,0 +1,4 @@
+task_diag
+task_diag_all
+task_proc_all
+fork
diff --git a/tools/testing/selftests/task_diag/Makefile b/tools/testing/selftests/task_diag/Makefile
new file mode 100644
index 0000000..69b7934
--- /dev/null
+++ b/tools/testing/selftests/task_diag/Makefile
@@ -0,0 +1,16 @@
+all: task_diag_all fork task_proc_all
+
+CFLAGS += -g -Wall -O2 -I/usr/include/libnl3
+LDFLAGS += -lnl-3
+TEST_PROGS := run.sh
+include ../lib.mk
+
+task_diag_all.o: task_diag_all.c task_diag_comm.h
+task_diag_comm.o: task_diag_comm.c task_diag_comm.h
+
+task_diag_all: task_diag_all.o task_diag_comm.o
+fork: fork.c
+task_proc_all: task_proc_all.c
+
+clean:
+ rm -rf task_diag task_diag_all task_diag_comm.o task_diag_all.o task_diag.o fork task_proc_all
diff --git a/tools/testing/selftests/task_diag/_run.sh b/tools/testing/selftests/task_diag/_run.sh
new file mode 100755
index 0000000..3f541fe
--- /dev/null
+++ b/tools/testing/selftests/task_diag/_run.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+set -o pipefail
+set -e -x
+
+./fork 1000
+
+nprocesses=`./task_diag_all all --maps | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
+nthreads=`./task_diag_all All --smaps --cred | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
+nchildren=`./task_diag_all children --pid 1 | grep 'pid.*tgid.*ppid.*comm fork$' | wc -l`
+
+./task_diag_all one --pid 1 --cred
+
+killall -9 fork
+
+[ "$nthreads" -eq 1000 ] &&
+[ "$nprocesses" -eq 1000 ] &&
+[ "$nchildren" -eq 1000 ] &&
+true || {
+ echo "Unexpected number of tasks $nthreads:$nprocesses:$nchildren" 1>&2
+ exit 1
+}
diff --git a/tools/testing/selftests/task_diag/fork.c b/tools/testing/selftests/task_diag/fork.c
new file mode 100644
index 0000000..c6e17d1
--- /dev/null
+++ b/tools/testing/selftests/task_diag/fork.c
@@ -0,0 +1,30 @@
+#include <unistd.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv)
+{
+ int i, n;
+
+ if (argc < 2)
+ return 1;
+
+ n = atoi(argv[1]);
+ for (i = 0; i < n; i++) {
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ printf("Unable to fork: %m\n");
+ return 1;
+ }
+ if (pid == 0) {
+ while (1)
+ sleep(1000);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/tools/testing/selftests/task_diag/run.sh b/tools/testing/selftests/task_diag/run.sh
new file mode 100755
index 0000000..28a8550
--- /dev/null
+++ b/tools/testing/selftests/task_diag/run.sh
@@ -0,0 +1 @@
+unshare -p -f -m --mount-proc ./_run.sh && { echo PASS; exit 0; } || { echo FAIL; exit 1; }
diff --git a/tools/testing/selftests/task_diag/task_diag.h b/tools/testing/selftests/task_diag/task_diag.h
new file mode 120000
index 0000000..d20a38c
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag.h
@@ -0,0 +1 @@
+../../../../include/uapi/linux/task_diag.h
\ No newline at end of file
diff --git a/tools/testing/selftests/task_diag/task_diag_all.c b/tools/testing/selftests/task_diag/task_diag_all.c
new file mode 100644
index 0000000..d865207
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_all.c
@@ -0,0 +1,150 @@
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <getopt.h>
+
+#include <linux/netlink.h>
+#include <netlink/msg.h>
+
+#include "task_diag.h"
+#include "task_diag_comm.h"
+
+#ifndef SOL_NETLINK
+#define SOL_NETLINK 270
+#endif
+
+#ifndef NETLINK_SCM_PID
+#define NETLINK_SCM_PID 11
+#endif
+
+static void usage(char *name)
+{
+ pr_err("Usage: %s command [options]", name);
+ pr_err(
+"Commands:\n"
+"\tall - dump all processes\n"
+"\tAll - dump all threads\n"
+"\tthreads - dump all threads of the specified process\n"
+"\tchildren - dump all children of the specified process\n"
+"\tone - dump the specified process\n"
+"Options:\n"
+"\t-p|--pid - PID of the required process\n"
+"\t-m|--maps - dump memory regions\n"
+"\t-s|--smaps - dump statistics for memory regions\n"
+"\t-c|--cred - dump credentials"
+);
+}
+int main(int argc, char *argv[])
+{
+ int exit_status = 1, fd;
+ struct task_diag_pid *req;
+ char nl_req[4096];
+ struct nlmsghdr *hdr = (void *)nl_req;
+ int last_pid = 0;
+ int opt, idx;
+ int err, size = 0;
+ static const char short_opts[] = "p:cms";
+ static struct option long_opts[] = {
+ { "pid", required_argument, 0, 'p' },
+ { "maps", no_argument, 0, 'm' },
+ { "smaps", no_argument, 0, 's' },
+ { "cred", no_argument, 0, 'c' },
+ {},
+ };
+
+ hdr->nlmsg_len = nlmsg_total_size(0);
+
+ req = nlmsg_data(hdr);
+ size += nla_total_size(sizeof(*req));
+
+ hdr->nlmsg_len += size;
+
+
+ req->show_flags = TASK_DIAG_SHOW_BASE;
+
+ if (argc < 2) {
+ usage(argv[0]);
+ return 1;
+ }
+
+ req->pid = 0; /* dump all tasks by default */
+
+ switch (argv[1][0]) {
+ case 'c':
+ req->dump_strategy = TASK_DIAG_DUMP_CHILDREN;
+ break;
+ case 't':
+ req->dump_strategy = TASK_DIAG_DUMP_THREAD;
+ break;
+ case 'o':
+ req->dump_strategy = TASK_DIAG_DUMP_ONE;
+ break;
+ case 'a':
+ req->dump_strategy = TASK_DIAG_DUMP_ALL;
+ req->pid = 0;
+ break;
+ case 'A':
+ req->dump_strategy = TASK_DIAG_DUMP_ALL_THREAD;
+ req->pid = 0;
+ break;
+ default:
+ usage(argv[0]);
+ return 1;
+ }
+
+ while (1) {
+ idx = -1;
+ opt = getopt_long(argc, argv, short_opts, long_opts, &idx);
+ if (opt == -1)
+ break;
+ switch (opt) {
+ case 'p':
+ req->pid = atoi(optarg);
+ break;
+ case 'c':
+ req->show_flags |= TASK_DIAG_SHOW_CRED;
+ break;
+ case 'm':
+ req->show_flags |= TASK_DIAG_SHOW_VMA;
+ break;
+ case 's':
+ req->show_flags |= TASK_DIAG_SHOW_VMA_STAT | TASK_DIAG_SHOW_VMA;
+ break;
+ default:
+ usage(argv[0]);
+ return 1;
+ }
+ }
+
+ fd = open("/proc/task-diag", O_RDWR);
+ if (fd < 0)
+ return -1;
+
+ if (write(fd, hdr, hdr->nlmsg_len) != hdr->nlmsg_len)
+ return -1;
+
+ while (1) {
+ char buf[163840];
+ size = read(fd, buf, sizeof(buf));
+
+ if (size < 0)
+ goto err;
+
+ if (size == 0)
+ break;
+
+ err = nlmsg_receive(buf, size, &show_task, &last_pid);
+ if (err < 0)
+ goto err;
+
+ if (err == 0)
+ break;
+ }
+
+ exit_status = 0;
+err:
+ return exit_status;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.c b/tools/testing/selftests/task_diag/task_diag_comm.c
new file mode 100644
index 0000000..65f2536
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.c
@@ -0,0 +1,172 @@
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/types.h>
+#include <linux/netlink.h>
+#include <netlink/cli/utils.h>
+
+#include "task_diag.h"
+#include "task_diag_comm.h"
+
+int quiet;
+
+#define PSS_SHIFT 12
+
+int nlmsg_receive(void *buf, int len, int (*cb)(struct nlmsghdr *, void *), void *args)
+{
+ struct nlmsghdr *hdr;
+
+ for (hdr = (struct nlmsghdr *)buf;
+ NLMSG_OK(hdr, len); hdr = NLMSG_NEXT(hdr, len)) {
+
+ if (hdr->nlmsg_type == NLMSG_DONE) {
+ int *len = (int *)NLMSG_DATA(hdr);
+
+ if (*len < 0) {
+ pr_err("ERROR %d reported by netlink (%s)\n",
+ *len, strerror(-*len));
+ return *len;
+ }
+
+ return 0;
+ }
+
+ if (hdr->nlmsg_type == NLMSG_ERROR) {
+ struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(hdr);
+
+ if (hdr->nlmsg_len - sizeof(*hdr) < sizeof(struct nlmsgerr)) {
+ pr_err("ERROR truncated\n");
+ return -1;
+ }
+
+ if (err->error == 0)
+ return 0;
+
+ return -1;
+ }
+ if (cb && cb(hdr, args))
+ return -1;
+ }
+
+ return 1;
+}
+
+int show_task(struct nlmsghdr *hdr, void *arg)
+{
+ int msg_len;
+ struct msgtemplate *msg;
+ struct task_diag_msg *diag_msg;
+ struct nlattr *na;
+ int *last_pid = arg;
+ int len;
+
+ msg_len = NLMSG_PAYLOAD(hdr, 0);
+
+ msg = (struct msgtemplate *)hdr;
+ diag_msg = NLMSG_DATA(msg);
+
+#if 1
+ if (diag_msg->pid != *last_pid)
+ pr_info("Start getting information about %d\n", diag_msg->pid);
+ else
+ pr_info("Continue getting information about %d\n", diag_msg->pid);
+#endif
+ *last_pid = diag_msg->pid;
+
+ na = ((void *) diag_msg) + NLMSG_ALIGN(sizeof(*diag_msg));
+ len = NLMSG_ALIGN(sizeof(*diag_msg));
+ while (len < msg_len) {
+ len += NLA_ALIGN(na->nla_len);
+ switch (na->nla_type) {
+ case TASK_DIAG_BASE:
+ {
+ struct task_diag_base *msg;
+
+ /* For nested attributes, na follows */
+ msg = NLA_DATA(na);
+ pr_info("pid %5d tgid %5d ppid %5d sid %5d pgid %5d comm %s\n",
+ msg->pid, msg->tgid, msg->ppid, msg->sid, msg->pgid, msg->comm);
+ }
+ break;
+
+ case TASK_DIAG_CRED:
+ {
+ struct task_diag_creds *creds;
+
+ creds = NLA_DATA(na);
+ pr_info("uid: %d %d %d %d\n", creds->uid,
+ creds->euid, creds->suid, creds->fsuid);
+ pr_info("gid: %d %d %d %d\n", creds->gid,
+ creds->egid, creds->sgid, creds->fsgid);
+ pr_info("CapInh: %08x%08x\n",
+ creds->cap_inheritable.cap[1],
+ creds->cap_inheritable.cap[0]);
+ pr_info("CapPrm: %08x%08x\n",
+ creds->cap_permitted.cap[1],
+ creds->cap_permitted.cap[0]);
+ pr_info("CapEff: %08x%08x\n",
+ creds->cap_effective.cap[1],
+ creds->cap_effective.cap[0]);
+ pr_info("CapBnd: %08x%08x\n", creds->cap_bset.cap[1],
+ creds->cap_bset.cap[0]);
+ }
+ break;
+
+ case TASK_DIAG_VMA:
+ {
+ struct task_diag_vma *vma_tmp, vma;
+
+ task_diag_for_each_vma(vma_tmp, na) {
+ char *name;
+ struct task_diag_vma_stat *stat_tmp, stat;
+
+ name = task_diag_vma_name(vma_tmp);
+ if (name == NULL)
+ name = "";
+
+ memcpy(&vma, vma_tmp, sizeof(vma));
+ pr_info("%016llx-%016llx %016llx %s\n",
+ vma.start, vma.end, vma.vm_flags, name);
+
+ stat_tmp = task_diag_vma_stat(vma_tmp);
+ if (stat_tmp)
+ memcpy(&stat, stat_tmp, sizeof(stat));
+ else
+ memset(&stat, 0, sizeof(stat));
+
+ pr_info(
+ "Size: %8llu kB\n"
+ "Rss: %8llu kB\n"
+ "Pss: %8llu kB\n"
+ "Shared_Clean: %8llu kB\n"
+ "Shared_Dirty: %8llu kB\n"
+ "Private_Clean: %8llu kB\n"
+ "Private_Dirty: %8llu kB\n"
+ "Referenced: %8llu kB\n"
+ "Anonymous: %8llu kB\n"
+ "AnonHugePages: %8llu kB\n"
+ "Swap: %8llu kB\n",
+ (vma.end - vma.start) >> 10,
+ stat.resident >> 10,
+ (stat.pss >> (10 + PSS_SHIFT)),
+ stat.shared_clean >> 10,
+ stat.shared_dirty >> 10,
+ stat.private_clean >> 10,
+ stat.private_dirty >> 10,
+ stat.referenced >> 10,
+ stat.anonymous >> 10,
+ stat.anonymous_thp >> 10,
+ stat.swap >> 10);
+ }
+ }
+ break;
+ default:
+ pr_info("Unknown nla_type %d\n",
+ na->nla_type);
+ }
+ na = ((void *) diag_msg) + len;
+ }
+
+ return 0;
+}
diff --git a/tools/testing/selftests/task_diag/task_diag_comm.h b/tools/testing/selftests/task_diag/task_diag_comm.h
new file mode 100644
index 0000000..40e83b7
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_diag_comm.h
@@ -0,0 +1,34 @@
+#ifndef __TASK_DIAG_COMM__
+#define __TASK_DIAG_COMM__
+
+#include <stdio.h>
+
+#include "task_diag.h"
+
+/*
+ * Generic macros for dealing with netlink sockets. Might be duplicated
+ * elsewhere. It is recommended that commercial grade applications use
+ * libnl or libnetlink and use the interfaces provided by the library
+ */
+#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+#define NLA_DATA(na) ((void *)((char *)(na) + NLA_HDRLEN))
+#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
+
+#define pr_err(fmt, ...) \
+ fprintf(stderr, "%s:%d: " fmt "\n", __func__, __LINE__, ##__VA_ARGS__)
+
+#define pr_perror(fmt, ...) \
+ fprintf(stderr, fmt " : %m\n", ##__VA_ARGS__)
+
+extern int quiet;
+#define pr_info(fmt, arg...) \
+ do { \
+ if (!quiet) \
+ printf(fmt, ##arg); \
+ } while (0)
+
+int nlmsg_receive(void *buf, int len, int (*cb)(struct nlmsghdr *, void *), void *args);
+extern int show_task(struct nlmsghdr *hdr, void *arg);
+
+#endif /* __TASK_DIAG_COMM__ */
diff --git a/tools/testing/selftests/task_diag/task_proc_all.c b/tools/testing/selftests/task_diag/task_proc_all.c
new file mode 100644
index 0000000..07ee80c
--- /dev/null
+++ b/tools/testing/selftests/task_diag/task_proc_all.c
@@ -0,0 +1,35 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+
+int main(int argc, char **argv)
+{
+ DIR *d;
+ int fd, tasks = 0;
+ struct dirent *de;
+ char buf[4096];
+
+ d = opendir("/proc");
+ if (d == NULL)
+ return 1;
+
+ while ((de = readdir(d))) {
+ if (de->d_name[0] < '0' || de->d_name[0] > '9')
+ continue;
+ snprintf(buf, sizeof(buf), "/proc/%s/stat", de->d_name);
+ fd = open(buf, O_RDONLY);
+ read(fd, buf, sizeof(buf));
+ close(fd);
+ tasks++;
+ }
+
+ closedir(d);
+
+ printf("tasks: %d\n", tasks);
+
+ return 0;
+}
--
2.5.5
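
The attribute loop in show_task() above steps through the payload with NLA_ALIGN(na->nla_len). A minimal sketch of that stepping, with bounds checks added, might look like the following. count_attrs() is a hypothetical helper written for illustration and is not part of the patch; it relies only on struct nlattr, NLA_HDRLEN and NLA_ALIGN from <linux/netlink.h>.

```c
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>

/*
 * Walk a buffer of back-to-back netlink attributes and count them.
 * Returns the number of attributes, or -1 if one is malformed.
 * Hypothetical helper, not taken from the patch.
 */
int count_attrs(const void *buf, size_t len)
{
	const size_t hdrlen = NLA_HDRLEN;
	const char *p = buf;
	size_t off = 0;
	int n = 0;

	while (len - off >= hdrlen) {
		struct nlattr na;

		/* copy the header out: the buffer may not be aligned */
		memcpy(&na, p + off, sizeof(na));
		if (na.nla_len < hdrlen || na.nla_len > len - off)
			return -1;

		n++;
		/* each attribute is padded to NLA_ALIGNTO bytes */
		off += NLA_ALIGN(na.nla_len);
		if (off > len)
			off = len;
	}

	return n;
}
```

A real reader would switch on na.nla_type inside the loop, as show_task() does, rather than only counting.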

2016-04-11 23:37:32

by Andrei Vagin

Subject: [PATCH 10/15] task_diag: add a new group to get resource usage

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_diag.c | 92 ++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/task_diag.h | 15 +++++++
2 files changed, 107 insertions(+)

diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index cd10374..c8499f2 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -390,6 +390,84 @@ err:
return rc;
}

+static int fill_task_stat(struct task_struct *task, struct sk_buff *skb, int whole)
+{
+ struct task_diag_stat *st;
+ struct nlattr *attr;
+
+ int priority, nice;
+ int num_threads = 0;
+ unsigned long cmin_flt = 0, cmaj_flt = 0;
+ unsigned long min_flt = 0, maj_flt = 0;
+ cputime_t cutime, cstime, utime, stime;
+ cputime_t cgtime, gtime;
+ unsigned long flags;
+
+ attr = nla_reserve(skb, TASK_DIAG_STAT, sizeof(struct task_diag_stat));
+ if (!attr)
+ return -EMSGSIZE;
+
+ st = nla_data(attr);
+
+ cutime = cstime = utime = stime = 0;
+ cgtime = gtime = 0;
+ if (lock_task_sighand(task, &flags)) {
+ struct signal_struct *sig = task->signal;
+
+ num_threads = get_nr_threads(task);
+
+ cmin_flt = sig->cmin_flt;
+ cmaj_flt = sig->cmaj_flt;
+ cutime = sig->cutime;
+ cstime = sig->cstime;
+ cgtime = sig->cgtime;
+
+ /* add up live thread stats at the group level */
+ if (whole) {
+ struct task_struct *t = task;
+
+ do {
+ min_flt += t->min_flt;
+ maj_flt += t->maj_flt;
+ gtime += task_gtime(t);
+ } while_each_thread(task, t);
+
+ min_flt += sig->min_flt;
+ maj_flt += sig->maj_flt;
+ thread_group_cputime_adjusted(task, &utime, &stime);
+ gtime += sig->gtime;
+ }
+
+ unlock_task_sighand(task, &flags);
+ }
+
+ if (!whole) {
+ min_flt = task->min_flt;
+ maj_flt = task->maj_flt;
+ task_cputime_adjusted(task, &utime, &stime);
+ gtime = task_gtime(task);
+ }
+
+ /* scale priority and nice values from timeslices to -20..20 */
+ /* to make it look like a "normal" Unix priority/nice value */
+ priority = task_prio(task);
+ nice = task_nice(task);
+
+
+ st->minflt = min_flt;
+ st->cminflt = cmin_flt;
+ st->majflt = maj_flt;
+ st->cmajflt = cmaj_flt;
+ st->utime = cputime_to_clock_t(utime);
+ st->stime = cputime_to_clock_t(stime);
+ st->cutime = cputime_to_clock_t(cutime);
+ st->cstime = cputime_to_clock_t(cstime);
+
+ st->threads = num_threads;
+
+ return 0;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
struct task_diag_pid *req,
struct task_diag_cb *cb, struct pid_namespace *pidns,
@@ -451,6 +529,20 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_STAT) {
+ int whole = 1;
+
+ if (req->dump_strategy == TASK_DIAG_DUMP_ALL_THREAD ||
+ req->dump_strategy == TASK_DIAG_DUMP_THREAD)
+ whole = 0;
+
+ if (i >= n)
+ err = fill_task_stat(tsk, skb, whole);
+ if (err)
+ goto err;
+ i++;
+ }
+
msg->flags &= ~TASK_DIAG_FLAG_CONT;

nlmsg_end(skb, nlh);
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index e967c5b..551d4fa 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -20,6 +20,7 @@ enum {
TASK_DIAG_CRED,
TASK_DIAG_VMA,
TASK_DIAG_VMA_STAT,
+ TASK_DIAG_STAT,

__TASK_DIAG_ATTR_MAX
#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
@@ -29,6 +30,7 @@ enum {
#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
#define TASK_DIAG_SHOW_VMA (1ULL << TASK_DIAG_VMA)
#define TASK_DIAG_SHOW_VMA_STAT (1ULL << TASK_DIAG_VMA_STAT)
+#define TASK_DIAG_SHOW_STAT (1ULL << TASK_DIAG_STAT)

enum {
TASK_DIAG_RUNNING,
@@ -153,6 +155,19 @@ struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma *vma)
(void *) vma < nla_data(attr) + nla_len(attr); \
vma = (void *) vma + vma->vma_len)

+struct task_diag_stat {
+ __u64 minflt;
+ __u64 cminflt;
+ __u64 majflt;
+ __u64 cmajflt;
+ __u64 utime;
+ __u64 stime;
+ __u64 cutime;
+ __u64 cstime;
+
+ __u32 threads;
+};
+
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_ONE 1
#define TASK_DIAG_DUMP_ALL_THREAD 2
--
2.5.5
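
fill_task_stat() above stores utime, stime, cutime and cstime through cputime_to_clock_t(), so a reader receives them as clock ticks rather than seconds. A small sketch of the conversion on the user side, assuming the tick rate is the one reported by sysconf(_SC_CLK_TCK); ticks_to_seconds() and cpu_seconds() are hypothetical names used only for illustration:

```c
#include <stdint.h>
#include <unistd.h>

/* Convert a tick count to seconds for a given tick rate. */
double ticks_to_seconds(uint64_t ticks, long ticks_per_sec)
{
	if (ticks_per_sec <= 0)
		return 0.0;
	return (double)ticks / (double)ticks_per_sec;
}

/* Total CPU time in seconds, using the tick rate reported by the system. */
double cpu_seconds(uint64_t utime, uint64_t stime)
{
	return ticks_to_seconds(utime + stime, sysconf(_SC_CLK_TCK));
}
```

Passing the rate as a parameter keeps the arithmetic testable without depending on the local system configuration.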

2016-04-11 23:38:16

by Andrei Vagin

Subject: [PATCH 12/15] Documentation: add documentation for task_diag

Signed-off-by: Andrey Vagin <[email protected]>
---
Documentation/accounting/task_diag.txt | 57 ++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
create mode 100644 Documentation/accounting/task_diag.txt

diff --git a/Documentation/accounting/task_diag.txt b/Documentation/accounting/task_diag.txt
new file mode 100644
index 0000000..ff486b9
--- /dev/null
+++ b/Documentation/accounting/task_diag.txt
@@ -0,0 +1,57 @@
+The task-diag interface provides information about running processes
+(roughly the same info that is now available from /proc/PID/* files). Compared
+to /proc/PID/* files, it is faster, more flexible and provides data in a binary
+format. Task-diag is based on the same idea as sock_diag.
+
+Interface
+---------
+
+The interface is the /proc/task-diag file, which operates based on the
+following principles:
+
+* Transactional: write request, read response
+* Netlink message format (same as used by sock_diag; binary and extendable)
+
+The user-kernel interface is encapsulated in include/uapi/linux/task_diag.h
+
+Request
+-------
+
+A request is described by the task_diag_pid structure.
+
+struct task_diag_pid {
+ __u64 show_flags; /* TASK_DIAG_SHOW_* */
+ __u64 dump_strategy; /* TASK_DIAG_DUMP_* */
+
+ __u32 pid;
+};
+
+dump_strategy specifies a group of processes:
+/* per-process strategies */
+TASK_DIAG_DUMP_CHILDREN - all children
+TASK_DIAG_DUMP_THREAD - all threads
+TASK_DIAG_DUMP_ONE - one process
+/* system wide strategies (the pid field is ignored) */
+TASK_DIAG_DUMP_ALL - all processes
+TASK_DIAG_DUMP_ALL_THREAD - all threads
+
+show_flags specifies which information is required. If we set the
+TASK_DIAG_SHOW_BASE flag, the response message will contain the TASK_DIAG_BASE
+attribute which is described by the task_diag_base structure.
+
+In the future, it can be extended with optional attributes. The request
+describes which task properties are required and for which processes they
+are required.
+
+Response
+--------
+
+A response can be divided into a few packets. Each task is described by a
+netlink message. If all information about a process doesn't fit into a message,
+the TASK_DIAG_FLAG_CONT flag will be set and the next message will continue
+describing the same process.
+
+Examples
+--------
+
+A few examples can be found in tools/testing/selftests/task_diag/
--
2.5.5
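
The request format described in task_diag.txt above can be sketched as a netlink header followed by the request structure. This is only an illustration under stated assumptions: struct diag_req is a local mirror of the documented task_diag_pid layout, build_request() is a hypothetical helper, and the message length is assumed to be the header plus the bare structure. The selftest in this series sizes the request with nla_total_size(sizeof(*req)) instead, so the length the kernel actually expects should be taken from the patch, not from this sketch.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Local mirror of the request layout shown in task_diag.txt (assumed). */
struct diag_req {
	uint64_t show_flags;	/* TASK_DIAG_SHOW_* */
	uint64_t dump_strategy;	/* TASK_DIAG_DUMP_* */
	uint32_t pid;
};

/*
 * Build "netlink header + request" in buf.
 * Returns the message length, or 0 if buf is too small.
 */
size_t build_request(void *buf, size_t buflen, uint64_t show_flags,
		     uint64_t dump_strategy, uint32_t pid)
{
	struct nlmsghdr hdr;
	struct diag_req req;
	size_t len = NLMSG_LENGTH(sizeof(req));

	if (buflen < NLMSG_ALIGN(len))
		return 0;

	memset(buf, 0, NLMSG_ALIGN(len));
	memset(&hdr, 0, sizeof(hdr));
	memset(&req, 0, sizeof(req));

	hdr.nlmsg_len = len;
	req.show_flags = show_flags;
	req.dump_strategy = dump_strategy;
	req.pid = pid;

	/* copy both parts so the caller's buffer needs no special alignment */
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + NLMSG_HDRLEN, &req, sizeof(req));

	return len;
}
```

The resulting buffer would then be written to /proc/task-diag in a single write(), followed by read() calls until the response is exhausted, as task_diag_all.c does.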

2016-04-11 23:38:13

by Andrei Vagin

Subject: [PATCH 09/15] task_diag: add a flag to mark incomplete messages

If all information about a process doesn't fit in a message, it's marked
by the TASK_DIAG_FLAG_CONT flag and the next message will describe the
same process.
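
On the reader side this means a task is fully described only once a message for it arrives with the flag cleared. A minimal sketch of that rule, assuming a per-message header with pid, tgid and flags fields; struct diag_hdr and count_complete_tasks() are hypothetical names used for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define DIAG_FLAG_CONT 0x00000001	/* same value as TASK_DIAG_FLAG_CONT */

/* Illustrative mirror of the per-message header (layout assumed). */
struct diag_hdr {
	uint32_t pid;
	uint32_t tgid;
	uint32_t flags;
};

/* Count the tasks whose description is complete in msgs[0..n). */
size_t count_complete_tasks(const struct diag_hdr *msgs, size_t n)
{
	size_t i, done = 0;

	for (i = 0; i < n; i++) {
		/* flag set: the next message continues the same task */
		if (msgs[i].flags & DIAG_FLAG_CONT)
			continue;
		done++;
	}

	return done;
}
```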

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/task_diag.c | 3 +++
include/uapi/linux/task_diag.h | 2 ++
2 files changed, 5 insertions(+)

diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index 00db32d..cd10374 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -415,6 +415,7 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
msg = nlmsg_data(nlh);
msg->pid = task_pid_nr_ns(tsk, pidns);
msg->tgid = task_tgid_nr_ns(tsk, pidns);
+ msg->flags |= TASK_DIAG_FLAG_CONT;

if (show_flags & TASK_DIAG_SHOW_BASE) {
if (i >= n)
@@ -450,6 +451,8 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ msg->flags &= ~TASK_DIAG_FLAG_CONT;
+
nlmsg_end(skb, nlh);
if (cb)
cb->attr = 0;
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index 8bccd02..e967c5b 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -13,6 +13,8 @@ struct task_diag_msg {
__u32 flags;
};

+#define TASK_DIAG_FLAG_CONT 0x00000001
+
enum {
TASK_DIAG_BASE = 0,
TASK_DIAG_CRED,
--
2.5.5

2016-04-11 23:38:45

by Andrei Vagin

Subject: [PATCH 06/15] task_diag: add a new group to get tasks memory mappings (v2)

v2: Fixes from David Ahern
* Fix 8-byte alignment
* Change implementation of DIAG_VMA attribute:

This patch puts the filename into the task_diag_vma struct and
converts TASK_DIAG_VMA attribute into a series of task_diag_vma.
Now there is a single TASK_DIAG_VMA attribute that is parsed
as:

| struct task_diag_vma | filename | ...
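
A reader can step through such a series by advancing by each record's own length, which is what the task_diag_for_each_vma macro in this series does with vma_len. The sketch below illustrates the idea on a deliberately simplified record: struct vma_rec and walk_vmas() are hypothetical and carry only the fields needed for the walk, while the real struct task_diag_vma has more fields and its own layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record; not the real struct task_diag_vma. */
struct vma_rec {
	uint64_t start, end;
	uint32_t vma_len;	/* bytes from this record to the next one */
	uint32_t name_off;	/* offset of the name from the record start */
	uint32_t name_len;	/* name length including NUL, 0 if no name */
	uint32_t pad;
};

/*
 * Visit each record in buf[0..len). Returns the number of records,
 * or -1 if a length or offset points outside the buffer.
 */
int walk_vmas(const void *buf, size_t len,
	      void (*cb)(const struct vma_rec *, const char *name, void *arg),
	      void *arg)
{
	const char *p = buf;
	size_t off = 0;
	int n = 0;

	while (off < len) {
		struct vma_rec rec;
		const char *name = NULL;

		if (len - off < sizeof(rec))
			return -1;
		/* copy to a local to avoid unaligned 64-bit loads */
		memcpy(&rec, p + off, sizeof(rec));
		if (rec.vma_len < sizeof(rec) || rec.vma_len > len - off)
			return -1;
		if (rec.name_len) {
			if (rec.name_off > rec.vma_len ||
			    rec.name_len > rec.vma_len - rec.name_off)
				return -1;
			name = p + off + rec.name_off;
		}
		if (cb)
			cb(&rec, name, arg);
		n++;
		off += rec.vma_len;
	}

	return n;
}
```

Copying each record into a local before reading it mirrors the memcpy() used in the selftest to avoid alignment problems.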

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/internal.h | 21 ++++
fs/proc/task_diag.c | 279 ++++++++++++++++++++++++++++++++++++++++-
fs/proc/task_mmu.c | 18 +--
include/uapi/linux/task_diag.h | 85 +++++++++++++
4 files changed, 385 insertions(+), 18 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2a2b1e6..75b57a3 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -316,3 +316,24 @@ task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned i
struct task_struct *task_first_tid(struct pid *pid, int tid, loff_t f_pos,
struct pid_namespace *ns);
struct task_struct *task_next_tid(struct task_struct *start);
+
+struct mem_size_stats {
+ unsigned long resident;
+ unsigned long shared_clean;
+ unsigned long shared_dirty;
+ unsigned long private_clean;
+ unsigned long private_dirty;
+ unsigned long referenced;
+ unsigned long anonymous;
+ unsigned long anonymous_thp;
+ unsigned long swap;
+ unsigned long shared_hugetlb;
+ unsigned long private_hugetlb;
+ u64 pss;
+ u64 swap_pss;
+ bool check_shmem_swap;
+};
+
+struct mm_walk;
+int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk);
diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
index fc31771..9c1ed45 100644
--- a/fs/proc/task_diag.c
+++ b/fs/proc/task_diag.c
@@ -7,6 +7,8 @@
#include <linux/taskstats.h>
#include <net/sock.h>

+#include "internal.h"
+
struct task_diag_cb {
struct sk_buff *req;
struct sk_buff *resp;
@@ -14,6 +16,11 @@ struct task_diag_cb {
pid_t pid;
int pos;
int attr;
+ union { /* per-attribute */
+ struct {
+ unsigned long mark;
+ } vma;
+ };
};

/*
@@ -122,6 +129,267 @@ static int fill_creds(struct task_struct *p, struct sk_buff *skb,
return 0;
}

+static u64 get_vma_flags(struct vm_area_struct *vma)
+{
+ u64 flags = 0;
+
+ static const u64 mnemonics[BITS_PER_LONG] = {
+ /*
+ * In case if we meet a flag we don't know about.
+ */
+ [0 ... (BITS_PER_LONG-1)] = 0,
+
+ [ilog2(VM_READ)] = TASK_DIAG_VMA_F_READ,
+ [ilog2(VM_WRITE)] = TASK_DIAG_VMA_F_WRITE,
+ [ilog2(VM_EXEC)] = TASK_DIAG_VMA_F_EXEC,
+ [ilog2(VM_SHARED)] = TASK_DIAG_VMA_F_SHARED,
+ [ilog2(VM_MAYREAD)] = TASK_DIAG_VMA_F_MAYREAD,
+ [ilog2(VM_MAYWRITE)] = TASK_DIAG_VMA_F_MAYWRITE,
+ [ilog2(VM_MAYEXEC)] = TASK_DIAG_VMA_F_MAYEXEC,
+ [ilog2(VM_MAYSHARE)] = TASK_DIAG_VMA_F_MAYSHARE,
+ [ilog2(VM_GROWSDOWN)] = TASK_DIAG_VMA_F_GROWSDOWN,
+ [ilog2(VM_PFNMAP)] = TASK_DIAG_VMA_F_PFNMAP,
+ [ilog2(VM_DENYWRITE)] = TASK_DIAG_VMA_F_DENYWRITE,
+#ifdef CONFIG_X86_INTEL_MPX
+ [ilog2(VM_MPX)] = TASK_DIAG_VMA_F_MPX,
+#endif
+ [ilog2(VM_LOCKED)] = TASK_DIAG_VMA_F_LOCKED,
+ [ilog2(VM_IO)] = TASK_DIAG_VMA_F_IO,
+ [ilog2(VM_SEQ_READ)] = TASK_DIAG_VMA_F_SEQ_READ,
+ [ilog2(VM_RAND_READ)] = TASK_DIAG_VMA_F_RAND_READ,
+ [ilog2(VM_DONTCOPY)] = TASK_DIAG_VMA_F_DONTCOPY,
+ [ilog2(VM_DONTEXPAND)] = TASK_DIAG_VMA_F_DONTEXPAND,
+ [ilog2(VM_ACCOUNT)] = TASK_DIAG_VMA_F_ACCOUNT,
+ [ilog2(VM_NORESERVE)] = TASK_DIAG_VMA_F_NORESERVE,
+ [ilog2(VM_HUGETLB)] = TASK_DIAG_VMA_F_HUGETLB,
+ [ilog2(VM_ARCH_1)] = TASK_DIAG_VMA_F_ARCH_1,
+ [ilog2(VM_DONTDUMP)] = TASK_DIAG_VMA_F_DONTDUMP,
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ [ilog2(VM_SOFTDIRTY)] = TASK_DIAG_VMA_F_SOFTDIRTY,
+#endif
+ [ilog2(VM_MIXEDMAP)] = TASK_DIAG_VMA_F_MIXEDMAP,
+ [ilog2(VM_HUGEPAGE)] = TASK_DIAG_VMA_F_HUGEPAGE,
+ [ilog2(VM_NOHUGEPAGE)] = TASK_DIAG_VMA_F_NOHUGEPAGE,
+ [ilog2(VM_MERGEABLE)] = TASK_DIAG_VMA_F_MERGEABLE,
+ };
+ size_t i;
+
+ for (i = 0; i < BITS_PER_LONG; i++) {
+ if (vma->vm_flags & (1UL << i))
+ flags |= mnemonics[i];
+ }
+
+ return flags;
+}
+
+/*
+ * use a tmp variable and copy to input arg to deal with
+ * alignment issues. diag_vma contains u64 elements which
+ * means extended load operations can be used and those can
+ * require 8-byte alignment (e.g., sparc)
+ */
+static void fill_diag_vma(struct vm_area_struct *vma,
+ struct task_diag_vma *diag_vma)
+{
+ struct task_diag_vma tmp;
+
+ /* We don't show the stack guard page in /proc/PID/maps */
+ tmp.start = vma->vm_start;
+ if (stack_guard_page_start(vma, tmp.start))
+ tmp.start += PAGE_SIZE;
+
+ tmp.end = vma->vm_end;
+ if (stack_guard_page_end(vma, tmp.end))
+ tmp.end -= PAGE_SIZE;
+ tmp.vm_flags = get_vma_flags(vma);
+
+ if (vma->vm_file) {
+ struct inode *inode = file_inode(vma->vm_file);
+ dev_t dev;
+
+ dev = inode->i_sb->s_dev;
+ tmp.major = MAJOR(dev);
+ tmp.minor = MINOR(dev);
+ tmp.inode = inode->i_ino;
+ tmp.generation = inode->i_generation;
+ tmp.pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+ } else {
+ tmp.major = 0;
+ tmp.minor = 0;
+ tmp.inode = 0;
+ tmp.generation = 0;
+ tmp.pgoff = 0;
+ }
+
+ memcpy(diag_vma, &tmp, sizeof(*diag_vma));
+}
+
+static const char *get_vma_name(struct vm_area_struct *vma, char *page)
+{
+ const char *name = NULL;
+
+ if (vma->vm_file) {
+ name = d_path(&vma->vm_file->f_path, page, PAGE_SIZE);
+ goto out;
+ }
+
+ if (vma->vm_ops && vma->vm_ops->name) {
+ name = vma->vm_ops->name(vma);
+ if (name)
+ goto out;
+ }
+
+ name = arch_vma_name(vma);
+
+out:
+ return name;
+}
+
+static void fill_diag_vma_stat(struct vm_area_struct *vma,
+ struct task_diag_vma_stat *stat)
+{
+ struct task_diag_vma_stat tmp;
+ struct mem_size_stats mss;
+ struct mm_walk smaps_walk = {
+ .pmd_entry = smaps_pte_range,
+ .mm = vma->vm_mm,
+ .private = &mss,
+ };
+
+ memset(&mss, 0, sizeof(mss));
+ memset(&tmp, 0, sizeof(tmp));
+
+ /* mmap_sem is held in m_start */
+ walk_page_vma(vma, &smaps_walk);
+
+ tmp.resident = mss.resident;
+ tmp.pss = mss.pss;
+ tmp.shared_clean = mss.shared_clean;
+ tmp.private_clean = mss.private_clean;
+ tmp.private_dirty = mss.private_dirty;
+ tmp.referenced = mss.referenced;
+ tmp.anonymous = mss.anonymous;
+ tmp.anonymous_thp = mss.anonymous_thp;
+ tmp.swap = mss.swap;
+
+ memcpy(stat, &tmp, sizeof(*stat));
+}
+
+static int fill_vma(struct task_struct *p, struct sk_buff *skb,
+ struct task_diag_cb *cb, bool *progress, u64 show_flags)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct nlattr *attr = NULL;
+ struct task_diag_vma *diag_vma;
+ unsigned long mark = 0;
+ char *page;
+ int i = 0, rc = -EMSGSIZE, size;
+
+ if (cb)
+ mark = cb->vma.mark;
+
+ mm = p->mm;
+ if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ return 0;
+
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!page) {
+ mmput(mm);
+ return -ENOMEM;
+ }
+
+ size = NLA_ALIGN(sizeof(struct task_diag_vma));
+ if (show_flags & TASK_DIAG_SHOW_VMA_STAT)
+ size += NLA_ALIGN(sizeof(struct task_diag_vma_stat));
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next, i++) {
+ unsigned char *b = skb_tail_pointer(skb);
+ const char *name;
+ void *pfile;
+
+
+ if (mark >= vma->vm_start)
+ continue;
+
+ /* setup pointer for next map */
+ if (attr == NULL) {
+ attr = nla_reserve(skb, TASK_DIAG_VMA, size);
+ if (!attr)
+ goto err;
+
+ diag_vma = nla_data(attr);
+ } else {
+ diag_vma = nla_reserve_nohdr(skb, size);
+
+ if (diag_vma == NULL) {
+ nlmsg_trim(skb, b);
+ goto out;
+ }
+ }
+
+ fill_diag_vma(vma, diag_vma);
+
+ if (show_flags & TASK_DIAG_SHOW_VMA_STAT) {
+ struct task_diag_vma_stat *stat;
+
+ stat = (void *) diag_vma + NLA_ALIGN(sizeof(*diag_vma));
+
+ fill_diag_vma_stat(vma, stat);
+ diag_vma->stat_len = sizeof(struct task_diag_vma_stat);
+ diag_vma->stat_off = (void *) stat - (void *)diag_vma;
+ } else {
+ diag_vma->stat_len = 0;
+ diag_vma->stat_off = 0;
+ }
+
+ name = get_vma_name(vma, page);
+ if (IS_ERR(name)) {
+ nlmsg_trim(skb, b);
+ rc = PTR_ERR(name);
+ goto out;
+ }
+
+ if (name) {
+ diag_vma->name_len = strlen(name) + 1;
+
+ /* reserves NLA_ALIGN(len) */
+ pfile = nla_reserve_nohdr(skb, diag_vma->name_len);
+ if (pfile == NULL) {
+ nlmsg_trim(skb, b);
+ goto out;
+ }
+ diag_vma->name_off = pfile - (void *) diag_vma;
+ memcpy(pfile, name, diag_vma->name_len);
+ } else {
+ diag_vma->name_len = 0;
+ diag_vma->name_off = 0;
+ }
+
+ mark = vma->vm_start;
+
+ diag_vma->vma_len = skb_tail_pointer(skb) - (unsigned char *) diag_vma;
+
+ *progress = true;
+ }
+
+ rc = 0;
+ mark = 0;
+out:
+ if (*progress)
+ attr->nla_len = skb_tail_pointer(skb) - (unsigned char *) attr;
+
+err:
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ free_page((unsigned long) page);
+ if (cb)
+ cb->vma.mark = mark;
+
+ return rc;
+}
+
static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
struct task_diag_pid *req,
struct task_diag_cb *cb, struct pid_namespace *pidns,
@@ -131,6 +399,7 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
struct nlmsghdr *nlh;
struct task_diag_msg *msg;
int err = 0, i = 0, n = 0;
+ bool progress = false;
int flags = 0;

if (cb) {
@@ -163,13 +432,21 @@ static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
i++;
}

+ if (show_flags & TASK_DIAG_SHOW_VMA) {
+ if (i >= n)
+ err = fill_vma(tsk, skb, cb, &progress, show_flags);
+ if (err)
+ goto err;
+ i++;
+ }
+
nlmsg_end(skb, nlh);
if (cb)
cb->attr = 0;

return 0;
err:
- if (err == -EMSGSIZE && (i > n)) {
+ if (err == -EMSGSIZE && (i > n || progress)) {
if (cb)
cb->attr = i;
nlmsg_end(skb, nlh);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 229cb54..211147e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -439,22 +439,6 @@ const struct file_operations proc_tid_maps_operations = {
#define PSS_SHIFT 12

#ifdef CONFIG_PROC_PAGE_MONITOR
-struct mem_size_stats {
- unsigned long resident;
- unsigned long shared_clean;
- unsigned long shared_dirty;
- unsigned long private_clean;
- unsigned long private_dirty;
- unsigned long referenced;
- unsigned long anonymous;
- unsigned long anonymous_thp;
- unsigned long swap;
- unsigned long shared_hugetlb;
- unsigned long private_hugetlb;
- u64 pss;
- u64 swap_pss;
- bool check_shmem_swap;
-};

static void smaps_account(struct mem_size_stats *mss, struct page *page,
bool compound, bool young, bool dirty)
@@ -586,7 +570,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
}
#endif

-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
index ea500c6..3486f2f 100644
--- a/include/uapi/linux/task_diag.h
+++ b/include/uapi/linux/task_diag.h
@@ -16,6 +16,8 @@ struct task_diag_msg {
enum {
TASK_DIAG_BASE = 0,
TASK_DIAG_CRED,
+ TASK_DIAG_VMA,
+ TASK_DIAG_VMA_STAT,

__TASK_DIAG_ATTR_MAX
#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
@@ -23,6 +25,8 @@ enum {

#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
#define TASK_DIAG_SHOW_CRED (1ULL << TASK_DIAG_CRED)
+#define TASK_DIAG_SHOW_VMA (1ULL << TASK_DIAG_VMA)
+#define TASK_DIAG_SHOW_VMA_STAT (1ULL << TASK_DIAG_VMA_STAT)

enum {
TASK_DIAG_RUNNING,
@@ -66,6 +70,87 @@ struct task_diag_creds {
__u32 sgid;
__u32 fsgid;
};
+
+#define TASK_DIAG_VMA_F_READ (1ULL << 0)
+#define TASK_DIAG_VMA_F_WRITE (1ULL << 1)
+#define TASK_DIAG_VMA_F_EXEC (1ULL << 2)
+#define TASK_DIAG_VMA_F_SHARED (1ULL << 3)
+#define TASK_DIAG_VMA_F_MAYREAD (1ULL << 4)
+#define TASK_DIAG_VMA_F_MAYWRITE (1ULL << 5)
+#define TASK_DIAG_VMA_F_MAYEXEC (1ULL << 6)
+#define TASK_DIAG_VMA_F_MAYSHARE (1ULL << 7)
+#define TASK_DIAG_VMA_F_GROWSDOWN (1ULL << 8)
+#define TASK_DIAG_VMA_F_PFNMAP (1ULL << 9)
+#define TASK_DIAG_VMA_F_DENYWRITE (1ULL << 10)
+#define TASK_DIAG_VMA_F_MPX (1ULL << 11)
+#define TASK_DIAG_VMA_F_LOCKED (1ULL << 12)
+#define TASK_DIAG_VMA_F_IO (1ULL << 13)
+#define TASK_DIAG_VMA_F_SEQ_READ (1ULL << 14)
+#define TASK_DIAG_VMA_F_RAND_READ (1ULL << 15)
+#define TASK_DIAG_VMA_F_DONTCOPY (1ULL << 16)
+#define TASK_DIAG_VMA_F_DONTEXPAND (1ULL << 17)
+#define TASK_DIAG_VMA_F_ACCOUNT (1ULL << 18)
+#define TASK_DIAG_VMA_F_NORESERVE (1ULL << 19)
+#define TASK_DIAG_VMA_F_HUGETLB (1ULL << 20)
+#define TASK_DIAG_VMA_F_ARCH_1 (1ULL << 21)
+#define TASK_DIAG_VMA_F_DONTDUMP (1ULL << 22)
+#define TASK_DIAG_VMA_F_SOFTDIRTY (1ULL << 23)
+#define TASK_DIAG_VMA_F_MIXEDMAP (1ULL << 24)
+#define TASK_DIAG_VMA_F_HUGEPAGE (1ULL << 25)
+#define TASK_DIAG_VMA_F_NOHUGEPAGE (1ULL << 26)
+#define TASK_DIAG_VMA_F_MERGEABLE (1ULL << 27)
+
+struct task_diag_vma_stat {
+ __u64 resident;
+ __u64 shared_clean;
+ __u64 shared_dirty;
+ __u64 private_clean;
+ __u64 private_dirty;
+ __u64 referenced;
+ __u64 anonymous;
+ __u64 anonymous_thp;
+ __u64 swap;
+ __u64 pss;
+} __attribute__((__aligned__(NLA_ALIGNTO)));
+
+/* task_diag_vma must be NLA_ALIGN'ed */
+struct task_diag_vma {
+ __u64 start, end;
+ __u64 vm_flags;
+ __u64 pgoff;
+ __u32 major;
+ __u32 minor;
+ __u64 inode;
+ __u32 generation;
+ __u16 vma_len;
+ __u16 name_off;
+ __u16 name_len;
+ __u16 stat_off;
+ __u16 stat_len;
+} __attribute__((__aligned__(NLA_ALIGNTO)));
+
+static inline char *task_diag_vma_name(struct task_diag_vma *vma)
+{
+ if (!vma->name_len)
+ return NULL;
+
+ return ((char *)vma) + vma->name_off;
+}
+
+static inline
+struct task_diag_vma_stat *task_diag_vma_stat(struct task_diag_vma *vma)
+{
+ if (!vma->stat_len)
+ return NULL;
+
+ return ((void *)vma) + vma->stat_off;
+}
+
+#define task_diag_for_each_vma(vma, attr) \
+ for (vma = nla_data(attr); \
+ (void *) vma < nla_data(attr) + nla_len(attr); \
+ vma = (void *) vma + vma->vma_len)
+
#define TASK_DIAG_DUMP_ALL 0
#define TASK_DIAG_DUMP_ONE 1

--
2.5.5
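
In user space the nla_data()/nla_len() helpers used by task_diag_for_each_vma are not available, so a consumer of a TASK_DIAG_VMA attribute would have to walk the payload by hand. The sketch below assumes the payload is a packed sequence of task_diag_vma records, each vma_len bytes long, as the macro in the patch implies. The struct is a local copy without the alignment attribute, and walk_vmas(), vma_name() and their bounds checks are illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the record layout proposed in the patch. */
struct task_diag_vma {
	uint64_t start, end;
	uint64_t vm_flags;
	uint64_t pgoff;
	uint32_t major;
	uint32_t minor;
	uint64_t inode;
	uint32_t generation;
	uint16_t vma_len;
	uint16_t name_off;
	uint16_t name_len;
	uint16_t stat_off;
	uint16_t stat_len;
};

/* Return the name stored behind the record, or NULL if absent or out of range. */
const char *vma_name(const struct task_diag_vma *vma)
{
	if (!vma->name_len || vma->name_off + vma->name_len > vma->vma_len)
		return NULL;
	return (const char *)vma + vma->name_off;
}

/*
 * Call fn for every record in [data, data + len). The walk stops at a record
 * whose vma_len is smaller than the fixed header or runs past the buffer, so
 * a corrupt length cannot cause an endless loop. Returns the records visited.
 */
size_t walk_vmas(const void *data, size_t len,
		 void (*fn)(const struct task_diag_vma *vma, void *arg),
		 void *arg)
{
	const unsigned char *p = data;
	size_t off = 0, count = 0;

	assert(data != NULL || len == 0);
	while (len - off >= sizeof(struct task_diag_vma)) {
		const struct task_diag_vma *vma = (const void *)(p + off);

		if (vma->vma_len < sizeof(*vma) || vma->vma_len > len - off)
			break;
		if (fn)
			fn(vma, arg);
		count++;
		off += vma->vma_len;
	}
	return count;
}
```

The statistics block could be reached the same way through stat_off and stat_len, mirroring task_diag_vma_stat() in the header.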

2016-04-11 23:38:58

by Andrei Vagin

Subject: [PATCH 04/15] task_diag: add a new interface to get information about tasks (v4)

The task-diag interface provides information about running
processes (roughly the same information that is now available from
/proc/PID/* files). Compared to /proc/PID/*, it is faster, more flexible,
and provides data in a binary format.

Task-diag is based on the same idea as sock_diag.

The interface is the /proc/task-diag file, which operates on the following
principles:

* Transactional: write request, read response
* Netlink message format (same as used by sock_diag; binary and extendable)

A request message is described by the task_diag_pid structure:
struct task_diag_pid {
__u64 show_flags;
__u64 dump_strategy;

__u32 pid;
};

A response is a set of netlink messages. Each message describes one task.
All task properties are divided into groups. A message starts with a
task_diag_msg header carrying the pid and tgid, followed by the groups
requested in show_flags. For example, if show_flags contains
TASK_DIAG_SHOW_BASE, a response will contain the TASK_DIAG_BASE group, which
is described by the task_diag_base structure.

struct task_diag_base {
__u32 tgid;
__u32 pid;
__u32 ppid;
__u32 tpid;
__u32 sid;
__u32 pgid;
__u8 state;
char comm[TASK_DIAG_COMM_LEN];
};

The dump_strategy field will be used in the following patches to request
information for a group of processes.
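
From user space, the write-request/read-response transaction might look like the minimal sketch below. It assumes the request is a netlink header immediately followed by struct task_diag_pid, which is how taskdiag_dumpit() reads it in this patch. The definitions are copied locally because the header exists only in this series, setting NLM_F_REQUEST and a sequence number of 1 is a conventional choice rather than something the patch requires, and task_diag_build_req() and task_diag_transact() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Local copies of the definitions proposed in this patch. */
#define TASK_DIAG_CMD_GET	0xd101U
#define TASK_DIAG_SHOW_BASE	(1ULL << 0)
#define TASK_DIAG_DUMP_ALL	0
#define TASK_DIAG_DUMP_ONE	1

struct task_diag_pid {
	uint64_t show_flags;
	uint64_t dump_strategy;
	uint32_t pid;
};

/* One request: netlink header followed directly by task_diag_pid. */
struct task_diag_req {
	struct nlmsghdr nlh;
	struct task_diag_pid pid;
};

/* Fill in a request and return the number of bytes to write. */
size_t task_diag_build_req(struct task_diag_req *req, uint64_t show_flags,
			   uint64_t strategy, uint32_t pid)
{
	assert(req != NULL);
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->pid));
	req->nlh.nlmsg_type = TASK_DIAG_CMD_GET;
	req->nlh.nlmsg_flags = NLM_F_REQUEST;	/* assumption, not checked by the patch */
	req->nlh.nlmsg_seq = 1;
	req->pid.show_flags = show_flags;
	req->pid.dump_strategy = strategy;
	req->pid.pid = pid;
	return req->nlh.nlmsg_len;
}

/*
 * Write one request, then read until read() returns 0, which is how the
 * patch signals the end of the response. Each chunk is passed to handle().
 * Returns the total number of bytes read, or -1 on error.
 */
ssize_t task_diag_transact(int fd, const struct task_diag_req *req,
			   void *buf, size_t size,
			   void (*handle)(const void *data, size_t len))
{
	ssize_t n, total = 0;

	if (write(fd, req, req->nlh.nlmsg_len) != (ssize_t)req->nlh.nlmsg_len)
		return -1;
	while ((n = read(fd, buf, size)) > 0) {
		if (handle)
			handle(buf, (size_t)n);
		total += n;
	}
	return n < 0 ? -1 : total;
}
```

A caller would open /proc/task-diag with O_RDWR, build a request with, for example, TASK_DIAG_SHOW_BASE and TASK_DIAG_DUMP_ALL, and pass the descriptor to task_diag_transact().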

v2: A few changes from David Ahern
Use a consistent name
Add max attr enum
task diag: Send pid as u32
Change _MSG/msg references to base
Fix 8-byte alignment

v3: take the pid namespace from the scm credentials, which carry the pid of
the process that sent the request. To get information from another
namespace, set the scm pid to that of a process in that namespace.

v4: use a transaction file instead of netlink

Cc: David Ahern <[email protected]>
Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/Kconfig | 13 ++
fs/proc/Makefile | 3 +
fs/proc/task_diag.c | 424 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/task_diag.h | 66 +++++++
4 files changed, 506 insertions(+)
create mode 100644 fs/proc/task_diag.c
create mode 100644 include/uapi/linux/task_diag.h

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade120..ca223f5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -81,3 +81,16 @@ config PROC_CHILDREN

Say Y if you are running any user-space software which takes benefit from
this interface. For example, rkt is such a piece of software.
+
+config TASK_DIAG
+ bool "Task-diag support (/proc/task-diag)"
+ depends on NET
+ default n
+ help
+ Export selected properties for tasks/processes through the /proc/task-diag
+ transaction file. Unlike the proc file system, task_diag returns
+ information in a binary format (netlink) and lets the caller specify
+ which properties are required.
+
+ Say N if unsure.
+
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 7151ea4..94965b9 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -30,3 +30,6 @@ proc-$(CONFIG_PROC_KCORE) += kcore.o
proc-$(CONFIG_PROC_VMCORE) += vmcore.o
proc-$(CONFIG_PRINTK) += kmsg.o
proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o
+
+obj-$(CONFIG_TASK_DIAG) += task_diag.o
+
diff --git a/fs/proc/task_diag.c b/fs/proc/task_diag.c
new file mode 100644
index 0000000..3c2127e
--- /dev/null
+++ b/fs/proc/task_diag.c
@@ -0,0 +1,424 @@
+#include <linux/kernel.h>
+#include <linux/task_diag.h>
+#include <linux/pid_namespace.h>
+#include <linux/ptrace.h>
+#include <linux/proc_fs.h>
+#include <linux/sched.h>
+#include <linux/taskstats.h>
+#include <net/sock.h>
+
+struct task_diag_cb {
+ struct sk_buff *req;
+ struct sk_buff *resp;
+ const struct nlmsghdr *nlh;
+ pid_t pid;
+ int pos;
+ int attr;
+};
+
+/*
+ * The task state array is a strange "bitmap" of
+ * reasons to sleep. Thus "running" is zero, and
+ * you can test for combinations of others with
+ * simple bit tests.
+ */
+static const __u8 task_state_array[] = {
+ TASK_DIAG_RUNNING,
+ TASK_DIAG_INTERRUPTIBLE,
+ TASK_DIAG_UNINTERRUPTIBLE,
+ TASK_DIAG_STOPPED,
+ TASK_DIAG_TRACE_STOP,
+ TASK_DIAG_DEAD,
+ TASK_DIAG_ZOMBIE,
+};
+
+static inline const __u8 get_task_state(struct task_struct *tsk)
+{
+ unsigned int state = (tsk->state | tsk->exit_state) & TASK_REPORT;
+
+ BUILD_BUG_ON(1 + ilog2(TASK_REPORT) != ARRAY_SIZE(task_state_array)-1);
+
+ return task_state_array[fls(state)];
+}
+
+static int fill_task_base(struct task_struct *p,
+ struct sk_buff *skb, struct pid_namespace *ns)
+{
+ struct task_diag_base *base;
+ struct nlattr *attr;
+ char tcomm[sizeof(p->comm)];
+ struct task_struct *tracer;
+
+ attr = nla_reserve(skb, TASK_DIAG_BASE, sizeof(struct task_diag_base));
+ if (!attr)
+ return -EMSGSIZE;
+
+ base = nla_data(attr);
+
+ rcu_read_lock();
+ base->ppid = pid_alive(p) ?
+ task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
+
+ base->tpid = 0;
+ tracer = ptrace_parent(p);
+ if (tracer)
+ base->tpid = task_pid_nr_ns(tracer, ns);
+
+ base->tgid = task_tgid_nr_ns(p, ns);
+ base->pid = task_pid_nr_ns(p, ns);
+ base->sid = task_session_nr_ns(p, ns);
+ base->pgid = task_pgrp_nr_ns(p, ns);
+
+ rcu_read_unlock();
+
+ get_task_comm(tcomm, p);
+ memset(base->comm, 0, TASK_DIAG_COMM_LEN);
+ strncpy(base->comm, tcomm, TASK_DIAG_COMM_LEN);
+
+ base->state = get_task_state(p);
+
+ return 0;
+}
+
+static int task_diag_fill(struct task_struct *tsk, struct sk_buff *skb,
+ struct task_diag_pid *req,
+ struct task_diag_cb *cb, struct pid_namespace *pidns,
+ struct user_namespace *userns)
+{
+ u64 show_flags = req->show_flags;
+ struct nlmsghdr *nlh;
+ struct task_diag_msg *msg;
+ int err = 0, i = 0, n = 0;
+ int flags = 0;
+
+ if (cb) {
+ n = cb->attr;
+ flags |= NLM_F_MULTI;
+ }
+
+ nlh = nlmsg_put(skb, 0, cb->nlh->nlmsg_seq,
+ TASK_DIAG_CMD_GET, sizeof(*msg), flags);
+ if (nlh == NULL)
+ return -EMSGSIZE;
+
+ msg = nlmsg_data(nlh);
+ msg->pid = task_pid_nr_ns(tsk, pidns);
+ msg->tgid = task_tgid_nr_ns(tsk, pidns);
+
+ if (show_flags & TASK_DIAG_SHOW_BASE) {
+ if (i >= n)
+ err = fill_task_base(tsk, skb, pidns);
+ if (err)
+ goto err;
+ i++;
+ }
+
+ nlmsg_end(skb, nlh);
+ if (cb)
+ cb->attr = 0;
+
+ return 0;
+err:
+ if (err == -EMSGSIZE && (i > n)) {
+ if (cb)
+ cb->attr = i;
+ nlmsg_end(skb, nlh);
+ } else
+ nlmsg_cancel(skb, nlh);
+
+ return err;
+}
+
+struct task_iter {
+ struct task_diag_pid req;
+ struct pid_namespace *ns;
+ struct task_struct *parent;
+
+ struct task_diag_cb *cb;
+
+ struct tgid_iter tgid;
+ unsigned int pos;
+ struct task_struct *task;
+};
+
+static void iter_stop(struct task_iter *iter)
+{
+ struct task_struct *task;
+
+ if (iter->parent)
+ put_task_struct(iter->parent);
+
+ switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_ALL:
+ task = iter->tgid.task;
+ break;
+ default:
+ task = iter->task;
+ }
+ if (task)
+ put_task_struct(task);
+}
+
+static struct task_struct *iter_start(struct task_iter *iter)
+{
+ if (iter->req.pid > 0) {
+ rcu_read_lock();
+ iter->parent = find_task_by_pid_ns(iter->req.pid, iter->ns);
+ if (iter->parent)
+ get_task_struct(iter->parent);
+ rcu_read_unlock();
+ }
+
+ switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_ONE:
+ if (iter->parent == NULL)
+ return ERR_PTR(-ESRCH);
+ iter->pos = iter->cb->pos;
+ if (iter->pos == 0) {
+ iter->task = iter->parent;
+ iter->parent = NULL;
+ } else
+ iter->task = NULL;
+ return iter->task;
+
+ case TASK_DIAG_DUMP_ALL:
+ iter->tgid.tgid = iter->cb->pid;
+ iter->tgid.task = NULL;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ return iter->tgid.task;
+ }
+
+ return ERR_PTR(-EINVAL);
+}
+
+static struct task_struct *iter_next(struct task_iter *iter)
+{
+ switch (iter->req.dump_strategy) {
+ case TASK_DIAG_DUMP_ONE:
+ iter->pos++;
+ iter->cb->pos = iter->pos;
+ if (iter->task)
+ put_task_struct(iter->task);
+ iter->task = NULL;
+ return NULL;
+
+ case TASK_DIAG_DUMP_ALL:
+ iter->tgid.tgid += 1;
+ iter->tgid = next_tgid(iter->ns, iter->tgid);
+ iter->cb->pid = iter->tgid.tgid;
+ return iter->tgid.task;
+ }
+
+ return NULL;
+}
+
+static int __taskdiag_dumpit(struct task_iter *iter,
+ struct task_diag_cb *cb, struct task_struct **start)
+{
+ struct user_namespace *userns = current_user_ns();
+ struct task_struct *task = *start;
+ int rc;
+
+ for (; task; task = iter_next(iter)) {
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+ continue;
+
+ rc = task_diag_fill(task, cb->resp, &iter->req,
+ cb, iter->ns, userns);
+ if (rc < 0) {
+ if (rc != -EMSGSIZE)
+ return rc;
+ break;
+ }
+ }
+ *start = task;
+
+ return 0;
+}
+
+static int taskdiag_dumpit(struct task_diag_cb *cb,
+ struct pid_namespace *pidns,
+ struct msghdr *msg, size_t len)
+{
+ struct sk_buff *skb = cb->resp;
+ struct task_struct *task;
+ struct task_iter iter;
+ struct nlattr *na;
+ size_t copied;
+ int err;
+
+ if (nlmsg_len(cb->nlh) < sizeof(iter.req))
+ return -EINVAL;
+
+ na = nlmsg_data(cb->nlh);
+ if (na->nla_type < 0)
+ return -EINVAL;
+
+ memcpy(&iter.req, na, sizeof(iter.req));
+
+ iter.ns = pidns;
+ iter.cb = cb;
+ iter.parent = NULL;
+ iter.pos = 0;
+ iter.task = NULL;
+
+ task = iter_start(&iter);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+
+ copied = 0;
+ while (1) {
+ err = __taskdiag_dumpit(&iter, cb, &task);
+ if (err < 0)
+ goto err;
+ if (skb->len == 0)
+ break;
+
+ err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
+ if (err < 0)
+ goto err;
+
+ copied += skb->len;
+
+ skb_trim(skb, 0);
+ if (skb_tailroom(skb) + copied > len)
+ break;
+
+ if (signal_pending(current))
+ break;
+ }
+
+ iter_stop(&iter);
+ return copied;
+err:
+ iter_stop(&iter);
+ return err;
+}
+
+static ssize_t task_diag_write(struct file *f, const char __user *buf,
+ size_t len, loff_t *off)
+{
+ struct task_diag_cb *cb = f->private_data;
+ struct sk_buff *skb;
+ struct msghdr msg;
+ struct iovec iov;
+ int err;
+
+ if (cb->req)
+ return -EBUSY;
+ if (len < nlmsg_total_size(0))
+ return -EINVAL;
+
+ err = import_single_range(WRITE, (void __user *) buf, len,
+ &iov, &msg.msg_iter);
+ if (unlikely(err))
+ return err;
+
+ msg.msg_name = NULL;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_namelen = 0;
+ msg.msg_flags = 0;
+
+ skb = nlmsg_new(len, GFP_KERNEL);
+ if (skb == NULL)
+ return -ENOMEM;
+
+ if (memcpy_from_msg(skb_put(skb, len), &msg, len)) {
+ kfree_skb(skb);
+ return -EFAULT;
+ }
+
+ memset(cb, 0, sizeof(*cb));
+ cb->req = skb;
+ cb->nlh = nlmsg_hdr(skb);
+
+ return len;
+}
+
+static ssize_t task_diag_read(struct file *file, char __user *ubuf,
+ size_t len, loff_t *off)
+{
+ struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;
+ struct task_diag_cb *cb = file->private_data;
+ struct iovec iov;
+ struct msghdr msg;
+ int size, err;
+
+ if (cb->req == NULL)
+ return 0;
+
+ err = import_single_range(READ, ubuf, len, &iov, &msg.msg_iter);
+ if (unlikely(err))
+ goto err;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+
+ if (!cb->resp) {
+ size = min_t(size_t, len, 16384);
+ cb->resp = alloc_skb(size, GFP_KERNEL);
+ if (cb->resp == NULL) {
+ err = -ENOMEM;
+ goto err;
+ }
+ /* Trim skb to allocated size. */
+ skb_reserve(cb->resp, skb_tailroom(cb->resp) - size);
+ }
+
+ err = taskdiag_dumpit(cb, ns, &msg, len);
+
+err:
+ skb_trim(cb->resp, 0);
+ if (err <= 0) {
+ kfree_skb(cb->req);
+ cb->req = NULL;
+ }
+
+ return err;
+}
+
+static int task_diag_open (struct inode *inode, struct file *f)
+{
+ f->private_data = kzalloc(sizeof(struct task_diag_cb), GFP_KERNEL);
+ if (f->private_data == NULL)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int task_diag_release(struct inode *inode, struct file *f)
+{
+ struct task_diag_cb *cb = f->private_data;
+
+ kfree_skb(cb->req);
+ kfree_skb(cb->resp);
+
+ kfree(f->private_data);
+ return 0;
+}
+
+static const struct file_operations task_diag_fops = {
+ .owner = THIS_MODULE,
+ .open = task_diag_open,
+ .release = task_diag_release,
+ .write = task_diag_write,
+ .read = task_diag_read,
+};
+
+static __init int task_diag_init(void)
+{
+ if (!proc_create("task-diag", S_IRUGO | S_IWUGO, NULL, &task_diag_fops))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static __exit void task_diag_exit(void)
+{
+ remove_proc_entry("task-diag", NULL);
+}
+
+module_init(task_diag_init);
+module_exit(task_diag_exit);
diff --git a/include/uapi/linux/task_diag.h b/include/uapi/linux/task_diag.h
new file mode 100644
index 0000000..ba0f71a
--- /dev/null
+++ b/include/uapi/linux/task_diag.h
@@ -0,0 +1,66 @@
+#ifndef _LINUX_TASK_DIAG_H
+#define _LINUX_TASK_DIAG_H
+
+#include <linux/types.h>
+#include <linux/netlink.h>
+#include <linux/capability.h>
+
+#define TASK_DIAG_CMD_GET 0xd101U
+
+struct task_diag_msg {
+ __u32 pid;
+ __u32 tgid;
+ __u32 flags;
+};
+
+enum {
+ TASK_DIAG_BASE = 0,
+
+ __TASK_DIAG_ATTR_MAX
+#define TASK_DIAG_ATTR_MAX (__TASK_DIAG_ATTR_MAX - 1)
+};
+
+#define TASK_DIAG_SHOW_BASE (1ULL << TASK_DIAG_BASE)
+
+enum {
+ TASK_DIAG_RUNNING,
+ TASK_DIAG_INTERRUPTIBLE,
+ TASK_DIAG_UNINTERRUPTIBLE,
+ TASK_DIAG_STOPPED,
+ TASK_DIAG_TRACE_STOP,
+ TASK_DIAG_DEAD,
+ TASK_DIAG_ZOMBIE,
+};
+
+#define TASK_DIAG_COMM_LEN 16
+
+struct task_diag_base {
+ __u32 tgid;
+ __u32 pid;
+ __u32 ppid;
+ __u32 tpid;
+ __u32 sid;
+ __u32 pgid;
+ __u8 state;
+ char comm[TASK_DIAG_COMM_LEN];
+};
+
+#define TASK_DIAG_DUMP_ALL 0
+#define TASK_DIAG_DUMP_ONE 1
+
+struct task_diag_pid {
+ __u64 show_flags;
+ __u64 dump_strategy;
+
+ __u32 pid;
+};
+
+enum {
+ TASK_DIAG_CMD_ATTR_UNSPEC = 0,
+ TASK_DIAG_CMD_ATTR_GET,
+ __TASK_DIAG_CMD_ATTR_MAX,
+};
+
+#define TASK_DIAG_CMD_ATTR_MAX (__TASK_DIAG_CMD_ATTR_MAX - 1)
+
+#endif /* _LINUX_TASK_DIAG_H */
--
2.5.5
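
A reader of the response would then split each read() chunk into netlink messages and look for the requested attribute in each one. The sketch below assumes the layout produced by task_diag_fill() in this patch: a task_diag_msg header after the netlink header, followed by attributes, with TASK_DIAG_BASE carrying a task_diag_base. It uses only the NLMSG_* and NLA_* macros from <linux/netlink.h>. The function names are hypothetical, the structures are local copies, and messages continued across reads (the cb->attr case) are not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

#define TASK_DIAG_BASE		0
#define TASK_DIAG_COMM_LEN	16

struct task_diag_msg {
	uint32_t pid;
	uint32_t tgid;
	uint32_t flags;
};

struct task_diag_base {
	uint32_t tgid;
	uint32_t pid;
	uint32_t ppid;
	uint32_t tpid;
	uint32_t sid;
	uint32_t pgid;
	uint8_t state;
	char comm[TASK_DIAG_COMM_LEN];
};

/* Find the TASK_DIAG_BASE attribute in one message, or return NULL. */
const struct task_diag_base *task_diag_find_base(const struct nlmsghdr *nlh)
{
	const char *start = (const char *)nlh;
	size_t off = NLMSG_HDRLEN + NLMSG_ALIGN(sizeof(struct task_diag_msg));
	size_t end = nlh->nlmsg_len;

	while (off + NLA_HDRLEN <= end) {
		const struct nlattr *nla = (const struct nlattr *)(start + off);

		/* Reject attributes that are truncated or overrun the message. */
		if (nla->nla_len < NLA_HDRLEN || nla->nla_len > end - off)
			break;
		if (nla->nla_type == TASK_DIAG_BASE &&
		    nla->nla_len >= NLA_HDRLEN + sizeof(struct task_diag_base))
			return (const struct task_diag_base *)(start + off + NLA_HDRLEN);
		off += NLA_ALIGN(nla->nla_len);
	}
	return NULL;
}

/* Count the messages in one chunk and copy out the last comm seen, if any. */
int task_diag_count(void *buf, int len, char *last_comm)
{
	struct nlmsghdr *nlh;
	int count = 0;

	for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		const struct task_diag_base *base = task_diag_find_base(nlh);

		if (base && last_comm) {
			memcpy(last_comm, base->comm, TASK_DIAG_COMM_LEN);
			last_comm[TASK_DIAG_COMM_LEN - 1] = '\0';
		}
		count++;
	}
	return count;
}
```

Walking attributes by offset rather than by pointer keeps every bounds check in unsigned arithmetic, so a malformed nla_len ends the walk instead of stepping outside the message.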

2016-04-11 23:39:39

by Andrei Vagin

Subject: [PATCH 03/15] proc: export next_tgid()

It is going to be used by task_diag.

Signed-off-by: Andrey Vagin <[email protected]>
---
fs/proc/base.c | 6 +-----
fs/proc/internal.h | 6 ++++++
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 614f1d0..9e5fd1c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3071,11 +3071,7 @@ out:
* Find the first task with tgid >= tgid
*
*/
-struct tgid_iter {
- unsigned int tgid;
- struct task_struct *task;
-};
-static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
{
struct pid *pid;

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 49145e2..2a2b1e6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -304,6 +304,12 @@ extern unsigned long task_statm(struct mm_struct *,
unsigned long *, unsigned long *);
extern void task_mem(struct seq_file *, struct mm_struct *);

+struct tgid_iter {
+ unsigned int tgid;
+ struct task_struct *task;
+};
+struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter);
+
struct task_struct *
task_next_child(struct task_struct *parent, struct task_struct *prev, unsigned int pos);

--
2.5.5

2016-04-12 01:05:25

by kernel test robot

Subject: Re: [PATCH 04/15] task_diag: add a new interface to get information about tasks (v4)

Hi Andrey,

[auto build test ERROR on v4.6-rc3]
[also build test ERROR on next-20160411]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url: https://github.com/0day-ci/linux/commits/Andrey-Vagin/task_diag-add-a-new-interface-to-get-information-about-processes-v3/20160412-074109
config: microblaze-allmodconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=microblaze

Note: the linux-review/Andrey-Vagin/task_diag-add-a-new-interface-to-get-information-about-processes-v3/20160412-074109 HEAD 2e0f174ce7e6fddc8fdd89f3dbb8d990626358a0 builds fine.
It only hurts bisectibility.

All errors (new ones prefixed by >>):

>> fs/proc/task_diag.c:139:19: error: field 'tgid' has incomplete type
struct tgid_iter tgid;
^
fs/proc/task_diag.c: In function 'iter_start':
>> fs/proc/task_diag.c:187:3: error: implicit declaration of function 'next_tgid' [-Werror=implicit-function-declaration]
iter->tgid = next_tgid(iter->ns, iter->tgid);
^
cc1: some warnings being treated as errors

vim +/tgid +139 fs/proc/task_diag.c

133 struct task_diag_pid req;
134 struct pid_namespace *ns;
135 struct task_struct *parent;
136
137 struct task_diag_cb *cb;
138
> 139 struct tgid_iter tgid;
140 unsigned int pos;
141 struct task_struct *task;
142 };
143
144 static void iter_stop(struct task_iter *iter)
145 {
146 struct task_struct *task;
147
148 if (iter->parent)
149 put_task_struct(iter->parent);
150
151 switch (iter->req.dump_strategy) {
152 case TASK_DIAG_DUMP_ALL:
153 task = iter->tgid.task;
154 break;
155 default:
156 task = iter->task;
157 }
158 if (task)
159 put_task_struct(task);
160 }
161
162 static struct task_struct *iter_start(struct task_iter *iter)
163 {
164 if (iter->req.pid > 0) {
165 rcu_read_lock();
166 iter->parent = find_task_by_pid_ns(iter->req.pid, iter->ns);
167 if (iter->parent)
168 get_task_struct(iter->parent);
169 rcu_read_unlock();
170 }
171
172 switch (iter->req.dump_strategy) {
173 case TASK_DIAG_DUMP_ONE:
174 if (iter->parent == NULL)
175 return ERR_PTR(-ESRCH);
176 iter->pos = iter->cb->pos;
177 if (iter->pos == 0) {
178 iter->task = iter->parent;
179 iter->parent = NULL;
180 } else
181 iter->task = NULL;
182 return iter->task;
183
184 case TASK_DIAG_DUMP_ALL:
185 iter->tgid.tgid = iter->cb->pid;
186 iter->tgid.task = NULL;
> 187 iter->tgid = next_tgid(iter->ns, iter->tgid);
188 return iter->tgid.task;
189 }
190

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (3.01 kB)
.config.gz (43.65 kB)

2016-04-12 07:09:09

by Cyrill Gorcunov

Subject: Re: [PATCH 04/15] task_diag: add a new interface to get information about tasks (v4)

On Mon, Apr 11, 2016 at 04:35:44PM -0700, Andrey Vagin wrote:
...
> +static int __taskdiag_dumpit(struct task_iter *iter,
> + struct task_diag_cb *cb, struct task_struct **start)
> +{
> + struct user_namespace *userns = current_user_ns();
> + struct task_struct *task = *start;
> + int rc;
> +
> + for (; task; task = iter_next(iter)) {
> + if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
> + continue;
> +
> + rc = task_diag_fill(task, cb->resp, &iter->req,
> + cb, iter->ns, userns);
> + if (rc < 0) {
> + if (rc != -EMSGSIZE)
> + return rc;
> + break;
> + }
> + }
> + *start = task;

task = NULL always here?

> +
> + return 0;
> +}

Cyrill

2016-04-13 00:39:20

by Andrei Vagin

Subject: Re: [PATCH 04/15] task_diag: add a new interface to get information about tasks (v4)

On Tue, Apr 12, 2016 at 10:08:57AM +0300, Cyrill Gorcunov wrote:
> On Mon, Apr 11, 2016 at 04:35:44PM -0700, Andrey Vagin wrote:
> ...
> > +static int __taskdiag_dumpit(struct task_iter *iter,
> > + struct task_diag_cb *cb, struct task_struct **start)
> > +{
> > + struct user_namespace *userns = current_user_ns();
> > + struct task_struct *task = *start;
> > + int rc;
> > +
> > + for (; task; task = iter_next(iter)) {
> > + if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
> > + continue;
> > +
> > + rc = task_diag_fill(task, cb->resp, &iter->req,
> > + cb, iter->ns, userns);
> > + if (rc < 0) {
> > + if (rc != -EMSGSIZE)
> > + return rc;
> > + break;

task isn't NULL here

> > + }
> > + }
> > + *start = task;
>
> task = NULL always here?

No, it isn't if the loop is interrupted by break.

Thanks,
Andrew

2016-04-13 03:21:41

by Andrei Vagin

Subject: Re: [PATCH 04/15] task_diag: add a new interface to get information about tasks (v4)

On Tue, Apr 12, 2016 at 09:03:39AM +0800, kbuild test robot wrote:
> Hi Andrey,
>
> [auto build test ERROR on v4.6-rc3]
> [also build test ERROR on next-20160411]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
>
> url: https://github.com/0day-ci/linux/commits/Andrey-Vagin/task_diag-add-a-new-interface-to-get-information-about-processes-v3/20160412-074109
> config: microblaze-allmodconfig (attached as .config)
> reproduce:
> wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=microblaze
>
> Note: the linux-review/Andrey-Vagin/task_diag-add-a-new-interface-to-get-information-about-processes-v3/20160412-074109 HEAD 2e0f174ce7e6fddc8fdd89f3dbb8d990626358a0 builds fine.
> It only hurts bisectibility.
>
> All errors (new ones prefixed by >>):
>
> >> fs/proc/task_diag.c:139:19: error: field 'tgid' has incomplete type
> struct tgid_iter tgid;
> ^
> fs/proc/task_diag.c: In function 'iter_start':
> >> fs/proc/task_diag.c:187:3: error: implicit declaration of function 'next_tgid' [-Werror=implicit-function-declaration]
> iter->tgid = next_tgid(iter->ns, iter->tgid);
> ^
> cc1: some warnings being treated as errors
>

Unfortunately, I forgot to include fs/proc/internal.h here. It is only
included in the 6th patch. I will fix this issue in the final version.

Thanks,
Andrew

2016-04-13 05:26:29

by Cyrill Gorcunov

Subject: Re: [PATCH 04/15] task_diag: add a new interface to get information about tasks (v4)

On Tue, Apr 12, 2016 at 05:39:14PM -0700, Andrew Vagin wrote:
> >
> > task = NULL always here?
>
> No, it isn't if the loop is interrupted by break.

Yeah, managed to miss break, thanks!
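
The point settled in this exchange is that *start ends up NULL only when the iterator is exhausted, and otherwise names the task that did not fit. It can be shown with a small user-space analogue. The node type and the fill() and dump() functions below are hypothetical stand-ins for task_struct, task_diag_fill() and __taskdiag_dumpit().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct node {
	int size;
	struct node *next;
};

/* Pretend to copy one item into a buffer with *room bytes left. */
int fill(const struct node *n, int *room)
{
	if (n->size > *room)
		return -EMSGSIZE;
	*room -= n->size;
	return 0;
}

/*
 * Same shape as __taskdiag_dumpit(): on return *start is NULL if every
 * item was consumed, or points at the item to retry if the buffer filled up.
 */
int dump(struct node **start, int room)
{
	struct node *n = *start;
	int rc;

	assert(start != NULL);
	for (; n; n = n->next) {
		rc = fill(n, &room);
		if (rc < 0) {
			if (rc != -EMSGSIZE)
				return rc;
			break;	/* n still points at the item that did not fit */
		}
	}
	*start = n;
	return 0;
}
```

With three 4-byte items and 10 bytes of room, the first call stops at the third item and the second call finishes the list, leaving the position NULL.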