I am submitting nproc, a new netlink interface to process information,
for review and a possible inclusion in mainline.
The problems with /proc as far as parsers go are widely known. Parsing is
both difficult and slow (including a more detailed discussion by reference:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109361019528995). What
follows is an overview showing how nproc fares in those areas.
Roger
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Clean Interface
---------------
The main motivation was to clean up the mess that are /proc semantics
and provide a clean interface for tools to gather process information.
Nproc does not add new knowledge to the kernel (some redundancy remains
until routines are shared with /proc). Instead, it offers existing
information in a form that works for tools. In fact, a tool can pass
the buffer read from the netlink directly as a va_list to vprintf
(strings require a trivial extra operation).
A small user-space app can present a view like the one below based on
zero prior knowledge about the fields the kernel has to offer. While I
don't envision that as common for tools in the future, it demonstrates
what can be done with little effort. This is not a mock-up, by the way,
the nprocdemo tool exists (lines truncated to fit 80 chars).
MemFree |PageSize|Jiffies |nr_dirty|nr_writeback|nr_unstable|[...]
____page|____byte|__________|____page|________page|_______page|[...]
7546| 4096| 1917203| 1| 0| 0|[...]
PID |Name |VmSize |VmLock |VmRSS |VmData |VmStack |[...]
_____|_______________|_____KiB|_____KiB|_____KiB|_____KiB|_____KiB|[...]
1|init | 1340| 0| 468| 144| 4|[...]
2|ksoftirqd/0 | 0| 0| 0| 0| 0|[...]
3|events/0 | 0| 0| 0| 0| 0|[...]
4|khelper | 0| 0| 0| 0| 0|[...]
5|netlink/0 | 0| 0| 0| 0| 0|[...]
6|kacpid | 0| 0| 0| 0| 0|[...]
23|kblockd/0 | 0| 0| 0| 0| 0|[...]
24|khubd | 0| 0| 0| 0| 0|[...]
36|pdflush | 0| 0| 0| 0| 0|[...]
37|pdflush | 0| 0| 0| 0| 0|[...]
38|kswapd0 | 0| 0| 0| 0| 0|[...]
39|aio/0 | 0| 0| 0| 0| 0|[...]
671|kseriod | 0| 0| 0| 0| 0|[...]
686|reiserfs/0 | 0| 0| 0| 0| 0|[...]
851|udevd | 1320| 0| 360| 144| 4|[...]
9159|syslogd | 1516| 0| 588| 272| 16|[...]
9382|gpm | 1540| 0| 468| 152| 4|[...]
9452|klogd | 1468| 0| 432| 276| 8|[...]
9478|hddtemp | 1692| 0| 848| 472| 16|[...]
9486|login | 2152| 0| 1204| 392| 36|[...]
9487|agetty | 1340| 0| 488| 156| 4|[...]
9488|agetty | 1340| 0| 488| 156| 4|[...]
9489|agetty | 1340| 0| 488| 156| 4|[...]
9490|agetty | 1340| 0| 488| 156| 4|[...]
9491|agetty | 1340| 0| 488| 156| 4|[...]
9598|zsh | 4748| 0| 1688| 532| 20|[...]
[...]
Performance
-----------
I measured the time to write a complete process table dump for 5000
tasks to /dev/null 100 times for "ps ax" and nprocdemo.
ps ax (5 process fields):
real 1m0.472s
user 0m18.227s
sys 0m28.545s
nprocdemo (automatic field discovery, reading and printing 11 process
fields + 9 global fields):
real 0m9.064s
user 0m2.491s
sys 0m1.554s
The details of resource usage for the benchmarks show that /proc based
tools are suffering badly from the inefficiency of three(!) conversions
between data and strings (kernel produces strings from numbers, app
converts back to numbers, app converts numbers again to strings for
printing).
For nproc based tools, only one conversion remains.
# ps ax > /dev/null
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples % image name app name symbol name
6524 14.0613 vmlinux ps number
4828 10.4058 libc-2.3.3.so ps _IO_vfscanf_internal
2740 5.9056 vmlinux ps vsnprintf
2689 5.7956 vmlinux ps proc_pid_stat
1807 3.8946 vmlinux ps __d_lookup
1676 3.6123 libc-2.3.3.so ps ____strtol_l_internal
1335 2.8773 vmlinux ps link_path_walk
1133 2.4420 libproc-3.2.3.so ps status2proc
1094 2.3579 vmlinux ps render_sigset_t
1088 2.3450 libc-2.3.3.so ps _IO_vfprintf_internal
1086 2.3407 libc-2.3.3.so ps __GI_strchr
885 1.9075 libc-2.3.3.so ps ____strtoul_l_internal
800 1.7242 vmlinux ps pid_revalidate
581 1.2522 vmlinux ps proc_pid_status
551 1.1876 libc-2.3.3.so ps _IO_sputbackc_internal
529 1.1402 vmlinux ps system_call
524 1.1294 libc-2.3.3.so ps _IO_default_xsputn_internal
476 1.0259 libc-2.3.3.so ps __i686.get_pc_thunk.bx
466 1.0044 vmlinux ps get_tgid_list
442 0.9526 vmlinux ps atomic_dec_and_lock
373 0.8039 vmlinux ps dput
311 0.6703 libc-2.3.3.so ps __GI___strtol_internal
274 0.5906 vmlinux ps __copy_to_user_ll
272 0.5862 vmlinux ps path_lookup
270 0.5819 vmlinux ps strncpy_from_user
262 0.5647 libproc-3.2.3.so ps escape_str
259 0.5582 vmlinux ps page_address
249 0.5367 libc-2.3.3.so ps __GI_____strtoull_l_internal
244 0.5259 libc-2.3.3.so ps __GI_strlen
# nprocdemo > /dev/null
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples % image name app name symbol name
1142 15.9208 libc-2.3.3.so nprocdemo _IO_vfprintf_internal
1072 14.9449 vmlinux vmlinux __task_mem
611 8.5181 libc-2.3.3.so nprocdemo _IO_new_file_xsputn
445 6.2038 vmlinux vmlinux nproc_pid_fields
244 3.4016 vmlinux vmlinux get_wchan
235 3.2762 vmlinux nprocdemo __copy_to_user_ll
233 3.2483 vmlinux vmlinux find_pid
215 2.9974 vmlinux vmlinux finish_task_switch
208 2.8998 vmlinux nprocdemo netlink_recvmsg
158 2.2027 vmlinux nprocdemo __wake_up
153 2.1330 libc-2.3.3.so nprocdemo __find_specmb
149 2.0772 vmlinux nprocdemo finish_task_switch
146 2.0354 libc-2.3.3.so nprocdemo __i686.get_pc_thunk.bx
114 1.5893 vmlinux vmlinux get_task_mm
94 1.3105 vmlinux nprocdemo skb_release_data
87 1.2129 vmlinux vmlinux nproc_ps_do_pid
76 1.0595 vmlinux vmlinux alloc_skb
72 1.0038 vmlinux nprocdemo system_call
68 0.9480 libc-2.3.3.so nprocdemo _IO_padn_internal
65 0.9062 libc-2.3.3.so nprocdemo read_int
64 0.8922 libc-2.3.3.so nprocdemo __recv
63 0.8783 vmlinux vmlinux netlink_attachskb
61 0.8504 vmlinux nprocdemo kfree
56 0.7807 vmlinux vmlinux __kmalloc
55 0.7668 vmlinux vmlinux schedule
47 0.6552 vmlinux vmlinux __task_mem_cheap
42 0.5855 vmlinux nprocdemo sys_socketcall
40 0.5576 vmlinux nprocdemo fget
37 0.5158 nprocdemo nprocdemo nproc_get_reply
EOT
A few notes:
- Access control can be implemented easily. Right now it would be bloat,
though -- the vast majority of fields in /proc are world-readable
(/proc/pid/environ being the notable exception).
- Additional process selectors (e.g. select by UID) are not hard to
add, either, should there ever be a need.
- There are a few things I'm not sure about: For instance, what is a good
return value for mm_struct related fields wrt kernel threads? I picked
0, but ~(0) might be preferable because it's distinct.
Signed-off-by: Roger Luethi <[email protected]>
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/netlink.h linux-2.6.9-rc1-bk13-nproc/include/linux/netlink.h
--- linux-2.6.9-rc1-bk13/include/linux/netlink.h 2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/include/linux/netlink.h 2004-09-06 19:50:56.000000000 +0200
@@ -15,6 +15,7 @@
#define NETLINK_ARPD 8
#define NETLINK_AUDIT 9 /* auditing */
#define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */
+#define NETLINK_NPROC 12 /* /proc information */
#define NETLINK_IP6_FW 13
#define NETLINK_DNRTMSG 14 /* DECnet routing messages */
#define NETLINK_TAPBASE 16 /* 16 to 31 are ethertap */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/nproc.h linux-2.6.9-rc1-bk13-nproc/include/linux/nproc.h
--- linux-2.6.9-rc1-bk13/include/linux/nproc.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc1-bk13-nproc/include/linux/nproc.h 2004-09-08 18:56:41.763526856 +0200
@@ -0,0 +1,119 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifndef __KERNEL__
+#define CONFIG_NPROC
+#endif
+
+#ifdef CONFIG_NPROC
+
+/* Request types */
+#define NPROC_BASE 0x10
+#define NPROC_GET_FIELD_LIST (NPROC_BASE+0)
+#define NPROC_GET_LABEL (NPROC_BASE+1)
+#define NPROC_GET_GLOBAL (NPROC_BASE+2)
+#define NPROC_GET_PS (NPROC_BASE+3)
+#define NPROC_GET_PID_LIST (NPROC_BASE+4)
+
+/* Request flags */
+
+
+/* Field scopes */
+#define NPROC_SCOPE_MASK 0x70000000
+#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS 0x20000000
+#define NPROC_SCOPE_LABEL 0x30000000
+
+/* Data types */
+#define NPROC_TYPE_MASK 0x07000000
+#define NPROC_TYPE_STRING 0x01000000
+#define NPROC_TYPE_U32 0x02000000
+#define NPROC_TYPE_UL 0x03000000
+#define NPROC_TYPE_U64 0x04000000
+
+/* Access control (unused) */
+#define NPROC_PERM_MASK 0x00300000
+#define NPROC_PERM_USER 0x00100000
+#define NPROC_PERM_ROOT 0x00200000
+
+/* Selectors */
+#define NPROC_SELECT_ALL 0x00000001
+#define NPROC_SELECT_PID 0x00000002
+#define NPROC_SELECT_UID 0x00000003
+
+/* Labels */
+#define NPROC_LABEL_FIELD_NAME 0x00000001
+#define NPROC_LABEL_FIELD_FMT 0x00000002
+#define NPROC_LABEL_FIELD_UNIT 0x00000003
+#define NPROC_LABEL_WCHAN 0x00000004
+
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL)
+#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN (0x00000080 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+struct nproc_field {
+ __u32 id;
+ const char *label;
+ const char *fmt;
+ const char *unit;
+};
+
+static struct nproc_field labels[] = {
+ { NPROC_PID, "PID", "%5u", "" },
+ { NPROC_NAME, "Name", "%-15s","" },
+ { NPROC_MEMFREE, "MemFree", "%8u", "page" },
+ { NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
+ { NPROC_JIFFIES, "Jiffies", "%10u", "" },
+ { NPROC_VMSIZE, "VmSize", "%8u", "KiB" },
+ { NPROC_VMLOCK, "VmLock", "%8u", "KiB" },
+ { NPROC_VMRSS, "VmRSS", "%8u", "KiB" },
+ { NPROC_VMDATA, "VmData", "%8u", "KiB" },
+ { NPROC_VMSTACK, "VmStack", "%8u", "KiB" },
+ { NPROC_VMEXE, "VmExe", "%8u", "KiB" },
+ { NPROC_VMLIB, "VmLib", "%8u", "KiB" },
+ { NPROC_UID, "UID", "%5u", "" },
+ { NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" },
+ { NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" },
+ { NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" },
+ { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" },
+ { NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" },
+ { NPROC_NR_SLAB, "nr_slab", "%8u", "page" },
+ { NPROC_WCHAN, "wchan", "%p", "" },
+#ifdef CONFIG_KALLSYMS
+ { NPROC_WCHAN_NAME, "wchan_symbol", "%s"},
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/pid.h linux-2.6.9-rc1-bk13-nproc/include/linux/pid.h
--- linux-2.6.9-rc1-bk13/include/linux/pid.h 2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/include/linux/pid.h 2004-09-06 19:50:56.000000000 +0200
@@ -37,6 +37,7 @@ extern void FASTCALL(detach_pid(struct t
extern struct pid *FASTCALL(find_pid(enum pid_type, int));
extern int alloc_pidmap(void);
+extern void *get_pid_map(int);
extern void FASTCALL(free_pidmap(int));
extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/Makefile linux-2.6.9-rc1-bk13-nproc/kernel/Makefile
--- linux-2.6.9-rc1-bk13/kernel/Makefile 2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/kernel/Makefile 2004-09-06 19:50:56.000000000 +0200
@@ -15,6 +15,7 @@ obj-$(CONFIG_SMP) += cpu.o spinlock.o
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
obj-$(CONFIG_PM) += power/
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_COMPAT) += compat.o
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/nproc.c linux-2.6.9-rc1-bk13-nproc/kernel/nproc.c
--- linux-2.6.9-rc1-bk13/kernel/nproc.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc1-bk13-nproc/kernel/nproc.c 2004-09-08 18:34:49.000000000 +0200
@@ -0,0 +1,851 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h> /* nr_free_pages() */
+#include <linux/kallsyms.h> /* kallsyms_lookup() */
+#include <linux/pid.h> /* get_pid_map() */
+#include <linux/nproc.h>
+#include <asm/bitops.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+ u32 vmdata;
+ u32 vmstack;
+ u32 vmexe;
+ u32 vmlib;
+};
+
+struct task_mem_cheap {
+ u32 vmsize;
+ u32 vmlock;
+ u32 vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+static void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+ struct mm_struct *mm = get_task_mm(tsk);
+ if (mm) {
+ unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+ if (!vma->vm_file) {
+ data += len;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ stack += len;
+ continue;
+ }
+ if (vma->vm_flags & VM_WRITE)
+ continue;
+ if (vma->vm_flags & VM_EXEC) {
+ exec += len;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ continue;
+ lib += len;
+ }
+ }
+ res->vmdata = data - stack;
+ res->vmstack = stack;
+ res->vmexe = exec - lib;
+ res->vmlib = lib;
+ up_read(&mm->mmap_sem);
+
+ mmput(mm);
+ } else {
+ res->vmdata = 0;
+ res->vmstack = 0;
+ res->vmexe = 0;
+ res->vmlib = 0;
+ }
+}
+
+static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+ struct mm_struct *mm = get_task_mm(tsk);
+ if (mm) {
+ res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+ res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+ res->vmrss = mm->rss << (PAGE_SHIFT-10);
+ mmput(mm);
+ } else {
+ res->vmsize = 0;
+ res->vmlock = 0;
+ res->vmrss = 0;
+ }
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+static struct page_state *__vmstat(void)
+{
+ struct page_state *ps;
+ ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+ if (!ps)
+ return ERR_PTR(-ENOMEM);
+ get_full_page_state(ps);
+ ps->pgpgin /= 2; /* sectors -> kbytes */
+ ps->pgpgout /= 2;
+ return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+ __u32 seq = nlh->nlmsg_seq;
+ __u16 type = nlh->nlmsg_type;
+ __u32 pid = nlh->nlmsg_pid;
+ struct sk_buff *skb2 = 0;
+
+ skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+ if (!skb2) {
+ skb2 = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+
+ NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+out:
+ return skb2;
+
+nlmsg_failure: /* Used by NLMSG_PUT */
+ kfree_skb(skb2);
+ return NULL;
+}
+
+#define mstore(value, id, buf) \
+({ \
+ u32 _type = id & NPROC_TYPE_MASK; \
+ switch (_type) { \
+ case NPROC_TYPE_U32: { \
+ __u32 *p = (u32 *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ case NPROC_TYPE_UL: { \
+ unsigned long *p = (unsigned long *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ case NPROC_TYPE_U64: { \
+ __u64 *p = (u64 *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ default: \
+ perror("Huh? Bad type!\n"); \
+ } \
+})
+
+static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
+{
+ struct task_mem tsk_mem;
+ struct task_mem_cheap tsk_mem_cheap;
+
+ tsk_mem.vmdata = (~0);
+ tsk_mem_cheap.vmsize = (~0);
+
+ switch (id) {
+ case NPROC_PID:
+ mstore(tsk->pid, NPROC_PID, buf);
+ break;
+ case NPROC_UID:
+ mstore(tsk->uid, NPROC_UID, buf);
+ break;
+ case NPROC_VMSIZE:
+ case NPROC_VMLOCK:
+ case NPROC_VMRSS:
+ if (tsk_mem_cheap.vmsize == (~0))
+ __task_mem_cheap(tsk, &tsk_mem_cheap);
+
+ switch (id) {
+ case NPROC_VMSIZE:
+ mstore(tsk_mem_cheap.vmsize,
+ NPROC_VMSIZE, buf);
+ break;
+ case NPROC_VMLOCK:
+ mstore(tsk_mem_cheap.vmlock,
+ NPROC_VMLOCK, buf);
+ break;
+ case NPROC_VMRSS:
+ mstore(tsk_mem_cheap.vmrss,
+ NPROC_VMRSS, buf);
+ break;
+ }
+ break;
+ case NPROC_VMDATA:
+ case NPROC_VMSTACK:
+ case NPROC_VMEXE:
+ case NPROC_VMLIB:
+ if (tsk_mem.vmdata == (~0))
+ __task_mem(tsk, &tsk_mem);
+
+ switch (id) {
+ case NPROC_VMDATA:
+ mstore(tsk_mem.vmdata, NPROC_VMDATA,
+ buf);
+ break;
+ case NPROC_VMSTACK:
+ mstore(tsk_mem.vmstack, NPROC_VMSTACK,
+ buf);
+ break;
+ case NPROC_VMEXE:
+ mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+ break;
+ case NPROC_VMLIB:
+ mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+ break;
+ }
+ break;
+ case NPROC_JIFFIES:
+ mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+ break;
+ case NPROC_WCHAN:
+ mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+ break;
+ case NPROC_NAME:
+ mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+ strncpy(buf, tsk->comm, sizeof(tsk->comm));
+ buf += sizeof(tsk->comm);
+ break;
+ case NPROC_NOP_UL:
+ mstore(0, NPROC_TYPE_UL, buf);
+ break;
+ default:
+ pwarn("Unknown field ID %#x.\n", id);
+ goto err_inval;
+ }
+ return buf;
+err_inval:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+ int i;
+ int err = 0;
+ struct sk_buff *skb2;
+ char *buf;
+ struct nlmsghdr *nlh2;
+ u32 fcnt, *fields;
+
+ fcnt = fdata[0];
+ fields = &fdata[1];
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+ nlh2 = (struct nlmsghdr *)skb2->data;
+ buf = NLMSG_DATA(nlh2);
+
+ for (i = 0; i < fcnt; i++) {
+ buf = nproc_ps_field(fields[i], buf, tsk);
+ if (IS_ERR(buf)) {
+ err = PTR_ERR(buf);
+ goto out_free;
+ }
+ }
+ err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+ return err;
+out_free:
+ kfree_skb(skb2);
+out:
+ return err;
+}
+
+/*
+ * Find task for given pid, grab task lock (caller must unlock).
+ */
+static task_t *nproc_ps_get_task(int pid)
+{
+ task_t *tsk;
+
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (tsk)
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ return tsk;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+{
+ int i;
+ int err = 0;
+ u32 tcnt;
+ u32 *pids;
+
+ if (left < sizeof(tcnt))
+ goto err_inval;
+ left -= sizeof(tcnt);
+
+ tcnt = sdata[0];
+
+ if (left < (tcnt * sizeof(u32)))
+ goto err_inval;
+ left -= tcnt * sizeof(u32);
+
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ pids = &sdata[1];
+
+ for (i = 0; i < tcnt; i++) {
+ task_t *tsk;
+ tsk = nproc_ps_get_task(pids[i]);
+ if (!tsk)
+ continue;
+ err = nproc_pid_msg(nlh, fdata, len, tsk);
+ put_task_struct(tsk);
+ if (err)
+ goto out;
+ }
+
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+
+/*
+ * Iterate over all PIDs.
+ */
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+{
+ void *map;
+ int offset, i;
+ int err = 0;
+
+ for (i = 0; i < PIDMAP_ENTRIES; i++) {
+
+ map = get_pid_map(i);
+ if (!map) /* done -- there are no holes in pidmap_array */
+ break;
+ if (IS_ERR(map)) /* No PIDs used in this map */
+ continue;
+ offset = 0;
+ for ( ; ; ) {
+ int pid;
+ task_t *tsk;
+ offset = find_next_bit(map, BITS_PER_PAGE, ++offset);
+ if (offset >= BITS_PER_PAGE)
+ break;
+ pid = offset + i * BITS_PER_PAGE;
+ tsk = nproc_ps_get_task(pid);
+ if (!tsk)
+ continue;
+ err = nproc_pid_msg(nlh, fdata, len, tsk);
+ put_task_struct(tsk);
+ if (err)
+ goto out;
+ }
+ }
+
+out:
+ return err;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+ u32 len = 0;
+
+ switch (id) {
+ case NPROC_NAME:
+ len = sizeof(u32) +
+ sizeof(((struct task_struct*)0)->comm);
+ break;
+ default:
+ pwarn("Unknown field size in %#x.\n", id);
+ }
+ return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+ u32 *fields;
+ u32 fcnt;
+ int i;
+ *len = 0;
+
+ if (*left < sizeof(fcnt))
+ goto err_inval;
+ *left -= sizeof(fcnt);
+
+ fcnt = data[0];
+
+ if (*left < (fcnt * sizeof(u32)))
+ goto err_inval;
+ *left -= fcnt * sizeof(u32);
+
+ fields = &data[1];
+
+ for (i = 0; i < fcnt; i++) {
+ u32 id = fields[i];
+ u32 type = id & NPROC_TYPE_MASK;
+ pdebug(" %#8.8x.\n", fields[i]);
+ switch (type) {
+ case NPROC_TYPE_U32:
+ *len += sizeof(u32);
+ break;
+ case NPROC_TYPE_UL:
+ *len += sizeof(unsigned long);
+ break;
+ case NPROC_TYPE_U64:
+ *len += sizeof(u64);
+ break;
+ default: { /* Special cases */
+ u32 slen;
+ slen = __reply_size_special(id);
+ if (slen)
+ *len += slen;
+ else
+ goto err_inval;
+ }
+ }
+ }
+
+ return &fields[fcnt];
+
+err_inval:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Adding additional selectors
+ * (e.g. select by uid) is easy, but is there a need?
+ */
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+{
+ int err;
+ u32 len;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 *sdata;
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+ sdata = __reply_size(data, &left, &len);
+ if (IS_ERR(sdata)) {
+ err = PTR_ERR(sdata);
+ goto out;
+ }
+
+ if (left < sizeof(u32))
+ goto err_inval;
+ left -= sizeof(u32);
+
+ switch (*sdata) {
+ case NPROC_SELECT_ALL:
+ if (left)
+ pwarn("%d bytes left.\n", left);
+ err = nproc_ps_select_all(nlh, data, len);
+ break;
+ case NPROC_SELECT_PID:
+ err = nproc_ps_select_pid(nlh, data, len,
+ left, sdata + 1);
+ break;
+ default:
+ pwarn("Unknown selection method %#x.\n", *sdata);
+ goto err_inval;
+ }
+
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static char *nproc_global_field(u32 id, char *buf)
+{
+ struct page_state *ps = NULL;
+
+ switch (id) {
+ case NPROC_NR_DIRTY:
+ case NPROC_NR_WRITEBACK:
+ case NPROC_NR_UNSTABLE:
+ case NPROC_NR_PG_TABLE_PGS:
+ case NPROC_NR_MAPPED:
+ case NPROC_NR_SLAB:
+ if (!ps) {
+ ps = __vmstat();
+ if (IS_ERR(ps)) { /* Just pass it on */
+ buf = (void *)ps;
+ ps = NULL;
+ goto out;
+ }
+ }
+ switch (id) {
+ case NPROC_NR_DIRTY:
+ mstore(ps->nr_dirty, NPROC_NR_DIRTY,
+ buf);
+ break;
+ case NPROC_NR_WRITEBACK:
+ mstore(ps->nr_writeback,
+ NPROC_NR_WRITEBACK,
+ buf);
+ break;
+ case NPROC_NR_UNSTABLE:
+ mstore(ps->nr_unstable,
+ NPROC_NR_UNSTABLE,
+ buf);
+ break;
+ case NPROC_NR_PG_TABLE_PGS:
+ mstore(ps->nr_page_table_pages,
+ NPROC_NR_PG_TABLE_PGS,
+ buf);
+ break;
+ case NPROC_NR_MAPPED:
+ mstore(ps->nr_mapped, NPROC_NR_MAPPED,
+ buf);
+ break;
+ case NPROC_NR_SLAB:
+ mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+ break;
+ }
+ break;
+ case NPROC_MEMFREE:
+ mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+ break;
+ case NPROC_PAGESIZE:
+ mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+ break;
+ case NPROC_JIFFIES:
+ mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+ break;
+ default:
+ pwarn("Unknown field ID %#x.\n", id);
+ buf = ERR_PTR(-EINVAL);
+ goto out;
+ }
+ kfree(ps);
+out:
+ return buf;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+ int err, i;
+ void *errp;
+ struct sk_buff *skb2;
+ char *buf;
+ u32 fcnt, len;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 *fields;
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+ errp = __reply_size(data, &left, &len);
+ if (IS_ERR(errp)) {
+ err = PTR_ERR(errp);
+ goto out;
+ }
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ fcnt = data[0];
+ fields = &data[1];
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+ for (i = 0; i < fcnt; i++) {
+ buf = nproc_global_field(fields[i], buf);
+ if (IS_ERR(buf)) {
+ err = PTR_ERR(buf);
+ kfree_skb(skb2);
+ goto out;
+ }
+ }
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+}
+
+static int find_id(__u32 *data, __u32 *left)
+{
+ int i;
+ u32 id;
+
+ if (*left < sizeof(id))
+ goto err_inval;
+ *left -= sizeof(sizeof(id));
+
+ if (*left)
+ pwarn("%d bytes left.\n", *left);
+ id = data[1];
+
+ for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+ ; /* Do nothing */
+
+ if (labels[i].id != id) {
+ pwarn("No matching label found for %#x.\n", id);
+ goto err_inval;
+ }
+
+ return i;
+
+err_inval:
+ return -EINVAL;
+}
+
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+ int err;
+ struct sk_buff *skb2;
+ const char *label;
+ char *buf;
+ int len;
+ u32 ltype;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+ if (left < sizeof(ltype))
+ goto err_inval;
+ left -= sizeof(ltype);
+
+ ltype = data[0];
+
+ if (ltype == NPROC_LABEL_FIELD_NAME) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].label;
+ }
+ else if (ltype == NPROC_LABEL_FIELD_UNIT) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].unit;
+ }
+ else if (ltype == NPROC_LABEL_FIELD_FMT) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].fmt;
+ }
+ else if (ltype == NPROC_LABEL_WCHAN) {
+ char *modname;
+ unsigned long wchan, size, offset;
+ char namebuf[128];
+
+ if (left < sizeof(unsigned long))
+ goto err_inval;
+ left -= sizeof(unsigned long);
+
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ wchan = (unsigned long)data[1];
+ label = kallsyms_lookup(wchan, &size, &offset, &modname,
+ namebuf);
+
+ if (!label) {
+ pwarn("No ksym found for %#lx.\n", wchan);
+ goto err_inval;
+ }
+ }
+ else {
+ pwarn("Unknown label type %#x.\n", ltype);
+ goto err_inval;
+ }
+
+ len = strlen(label) + 1;
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+ strncpy(buf, label, len);
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+ int err, i, cnt, len;
+ struct sk_buff *skb2;
+ u32 *buf;
+
+ cnt = ARRAY_SIZE(labels);
+ len = (cnt + 1) * sizeof(u32);
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+ buf[0] = cnt;
+ for (i = 0; i < cnt; i++)
+ buf[i + 1] = labels[i].id;
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+ struct nlmsghdr *nlh)
+{
+ int err = 0;
+ uid_t uid;
+ kernel_cap_t caps;
+
+ if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+ goto out;
+
+ nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+ uid = NETLINK_CB(skb).creds.uid;
+ caps = NETLINK_CB(skb).eff_cap;
+
+ switch (nlh->nlmsg_type) {
+ case NPROC_GET_FIELD_LIST:
+ err = nproc_get_list(nlh);
+ break;
+ case NPROC_GET_LABEL:
+ err = nproc_get_label(nlh);
+ break;
+ case NPROC_GET_GLOBAL:
+ err = nproc_get_global(nlh);
+ break;
+ case NPROC_GET_PS:
+ err = nproc_get_ps(nlh, uid);
+ break;
+ default:
+ pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+ err = -EINVAL;
+ }
+out:
+ return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+ int err = 0;
+ struct nlmsghdr *nlh;
+
+ if (skb->len < NLMSG_LENGTH(0))
+ goto err_inval;
+
+ nlh = (struct nlmsghdr *)skb->data;
+ if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+ pwarn("Invalid packet.\n");
+ goto err_inval;
+ }
+
+ err = nproc_process_msg(skb, nlh);
+ if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+ pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err,
+ nlh->nlmsg_type, nlh->nlmsg_flags,
+ nlh->nlmsg_seq);
+ netlink_ack(skb, nlh, err);
+ }
+
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+ struct sk_buff *skb;
+
+ while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+ nproc_receive_skb(skb);
+ kfree_skb(skb);
+ }
+}
+
+static int nproc_init(void)
+{
+ nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+ if (!nproc_sock) {
+ pwarn("No netlink socket for nproc.\n");
+ return -ENODEV;
+ }
+
+ return 0;
+}
+
+module_init(nproc_init);
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/pid.c linux-2.6.9-rc1-bk13-nproc/kernel/pid.c
--- linux-2.6.9-rc1-bk13/kernel/pid.c 2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/kernel/pid.c 2004-09-06 19:52:59.000000000 +0200
@@ -146,6 +146,17 @@ failure:
return -1;
}
+void *get_pid_map(int idx)
+{
+ pidmap_t *map = pidmap_array + idx;
+ if (!map->page)
+ return NULL;
+ else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+ return ERR_PTR(-1);
+ else
+ return map->page;
+}
+
struct pid * fastcall find_pid(enum pid_type type, int nr)
{
struct hlist_node *elem;
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/init/Kconfig linux-2.6.9-rc1-bk13-nproc/init/Kconfig
--- linux-2.6.9-rc1-bk13/init/Kconfig 2004-09-06 18:53:17.000000000 +0200
+++ linux-2.6.9-rc1-bk13-nproc/init/Kconfig 2004-09-06 19:50:56.000000000 +0200
@@ -139,6 +139,13 @@ config SYSCTL
building a kernel for install/rescue disks or your system is very
limited in memory.
+config NPROC
+ bool "Netlink interface to /proc information"
+ depends on PROC_FS && EXPERIMENTAL
+ default y
+ help
+ Nproc is a netlink interface to /proc information.
+
config AUDIT
bool "Auditing support"
default y if SECURITY_SELINUX
On Wed, Sep 08, 2004 at 08:41:30PM +0200, Roger Luethi wrote:
> A few notes:
> - Access control can be implemented easily. Right now it would be bloat,
> though -- the vast majority of fields in /proc are world-readable
> (/proc/pid/environ being the notable exception).
> - Additional process selectors (e.g. select by UID) are not hard to
> add, either, should there ever be a need.
> - There are a few things I'm not sure about: For instance, what is a good
> return value for mm_struct related fields wrt kernel threads? I picked
> 0, but ~(0) might be preferable because it's distinct.
> Signed-off-by: Roger Luethi <[email protected]>
Any chance you could convert these to use the new vm statistics
accounting?
-- wli
On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
> Any chance you could convert these to use the new vm statistics
> accounting?
Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
For that you will need to #ifdef on CONFIG_MMU and use the methods
in fs/proc/task_nommu.c and so on.
-- wli
On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
>> Any chance you could convert these to use the new vm statistics
>> accounting?
On Wed, Sep 08, 2004 at 05:43:20PM -0700, William Lee Irwin III wrote:
> Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
> For that you will need to #ifdef on CONFIG_MMU and use the methods
> in fs/proc/task_nommu.c and so on.
This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
whatsoever to the underlying code were made; rather, this merely
resolves offsets so it applies cleanly.
Compiletested on ia64.
-- wli
Index: mm4-2.6.9-rc1/include/linux/netlink.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/netlink.h 2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/netlink.h 2004-09-08 17:45:27.500658296 -0700
@@ -15,6 +15,7 @@
#define NETLINK_ARPD 8
#define NETLINK_AUDIT 9 /* auditing */
#define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */
+#define NETLINK_NPROC 12 /* /proc information */
#define NETLINK_IP6_FW 13
#define NETLINK_DNRTMSG 14 /* DECnet routing messages */
#define NETLINK_KEVENT 15 /* Kernel messages to userspace */
Index: mm4-2.6.9-rc1/include/linux/nproc.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/nproc.h 2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/nproc.h 2004-09-08 17:45:27.501634858 -0700
@@ -0,0 +1,119 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifndef __KERNEL__
+#define CONFIG_NPROC
+#endif
+
+#ifdef CONFIG_NPROC
+
+/* Request types */
+#define NPROC_BASE 0x10
+#define NPROC_GET_FIELD_LIST (NPROC_BASE+0)
+#define NPROC_GET_LABEL (NPROC_BASE+1)
+#define NPROC_GET_GLOBAL (NPROC_BASE+2)
+#define NPROC_GET_PS (NPROC_BASE+3)
+#define NPROC_GET_PID_LIST (NPROC_BASE+4)
+
+/* Request flags */
+
+
+/* Field scopes */
+#define NPROC_SCOPE_MASK 0x70000000
+#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS 0x20000000
+#define NPROC_SCOPE_LABEL 0x30000000
+
+/* Data types */
+#define NPROC_TYPE_MASK 0x07000000
+#define NPROC_TYPE_STRING 0x01000000
+#define NPROC_TYPE_U32 0x02000000
+#define NPROC_TYPE_UL 0x03000000
+#define NPROC_TYPE_U64 0x04000000
+
+/* Access control (unused) */
+#define NPROC_PERM_MASK 0x00300000
+#define NPROC_PERM_USER 0x00100000
+#define NPROC_PERM_ROOT 0x00200000
+
+/* Selectors */
+#define NPROC_SELECT_ALL 0x00000001
+#define NPROC_SELECT_PID 0x00000002
+#define NPROC_SELECT_UID 0x00000003
+
+/* Labels */
+#define NPROC_LABEL_FIELD_NAME 0x00000001
+#define NPROC_LABEL_FIELD_FMT 0x00000002
+#define NPROC_LABEL_FIELD_UNIT 0x00000003
+#define NPROC_LABEL_WCHAN 0x00000004
+
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL)
+#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN (0x00000080 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+struct nproc_field {
+ __u32 id;
+ const char *label;
+ const char *fmt;
+ const char *unit;
+};
+
+static struct nproc_field labels[] = {
+ { NPROC_PID, "PID", "%5u", "" },
+ { NPROC_NAME, "Name", "%-15s","" },
+ { NPROC_MEMFREE, "MemFree", "%8u", "page" },
+ { NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
+ { NPROC_JIFFIES, "Jiffies", "%10u", "" },
+ { NPROC_VMSIZE, "VmSize", "%8u", "KiB" },
+ { NPROC_VMLOCK, "VmLock", "%8u", "KiB" },
+ { NPROC_VMRSS, "VmRSS", "%8u", "KiB" },
+ { NPROC_VMDATA, "VmData", "%8u", "KiB" },
+ { NPROC_VMSTACK, "VmStack", "%8u", "KiB" },
+ { NPROC_VMEXE, "VmExe", "%8u", "KiB" },
+ { NPROC_VMLIB, "VmLib", "%8u", "KiB" },
+ { NPROC_UID, "UID", "%5u", "" },
+ { NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" },
+ { NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" },
+ { NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" },
+ { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" },
+ { NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" },
+ { NPROC_NR_SLAB, "nr_slab", "%8u", "page" },
+ { NPROC_WCHAN, "wchan", "%p", "" },
+#ifdef CONFIG_KALLSYMS
+ { NPROC_WCHAN_NAME, "wchan_symbol", "%s"},
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
Index: mm4-2.6.9-rc1/include/linux/pid.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/pid.h 2004-09-08 06:10:36.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/pid.h 2004-09-08 17:45:27.501634858 -0700
@@ -37,6 +37,7 @@
extern struct pid *FASTCALL(find_pid(enum pid_type, int));
extern int alloc_pidmap(void);
+extern void *get_pid_map(int);
extern void FASTCALL(free_pidmap(int));
extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
Index: mm4-2.6.9-rc1/init/Kconfig
===================================================================
--- mm4-2.6.9-rc1.orig/init/Kconfig 2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/init/Kconfig 2004-09-08 17:45:27.504564546 -0700
@@ -139,6 +139,13 @@
building a kernel for install/rescue disks or your system is very
limited in memory.
+config NPROC
+ bool "Netlink interface to /proc information"
+ depends on PROC_FS && EXPERIMENTAL
+ default y
+ help
+ Nproc is a netlink interface to /proc information.
+
config AUDIT
bool "Auditing support"
default y if SECURITY_SELINUX
Index: mm4-2.6.9-rc1/kernel/Makefile
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/Makefile 2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/Makefile 2004-09-08 17:45:27.501634858 -0700
@@ -15,6 +15,7 @@
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
obj-$(CONFIG_PM) += power/
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-08 17:45:27.503587983 -0700
@@ -0,0 +1,851 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h> /* nr_free_pages() */
+#include <linux/kallsyms.h> /* kallsyms_lookup() */
+#include <linux/pid.h> /* get_pid_map() */
+#include <linux/nproc.h>
+#include <asm/bitops.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+ u32 vmdata;
+ u32 vmstack;
+ u32 vmexe;
+ u32 vmlib;
+};
+
+struct task_mem_cheap {
+ u32 vmsize;
+ u32 vmlock;
+ u32 vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+static void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+ struct mm_struct *mm = get_task_mm(tsk);
+ if (mm) {
+ unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+ if (!vma->vm_file) {
+ data += len;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ stack += len;
+ continue;
+ }
+ if (vma->vm_flags & VM_WRITE)
+ continue;
+ if (vma->vm_flags & VM_EXEC) {
+ exec += len;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ continue;
+ lib += len;
+ }
+ }
+ res->vmdata = data - stack;
+ res->vmstack = stack;
+ res->vmexe = exec - lib;
+ res->vmlib = lib;
+ up_read(&mm->mmap_sem);
+
+ mmput(mm);
+ } else {
+ res->vmdata = 0;
+ res->vmstack = 0;
+ res->vmexe = 0;
+ res->vmlib = 0;
+ }
+}
+
+static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+ struct mm_struct *mm = get_task_mm(tsk);
+ if (mm) {
+ res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+ res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+ res->vmrss = mm->rss << (PAGE_SHIFT-10);
+ mmput(mm);
+ } else {
+ res->vmsize = 0;
+ res->vmlock = 0;
+ res->vmrss = 0;
+ }
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+static struct page_state *__vmstat(void)
+{
+ struct page_state *ps;
+ ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+ if (!ps)
+ return ERR_PTR(-ENOMEM);
+ get_full_page_state(ps);
+ ps->pgpgin /= 2; /* sectors -> kbytes */
+ ps->pgpgout /= 2;
+ return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+ __u32 seq = nlh->nlmsg_seq;
+ __u16 type = nlh->nlmsg_type;
+ __u32 pid = nlh->nlmsg_pid;
+ struct sk_buff *skb2 = 0;
+
+ skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+ if (!skb2) {
+ skb2 = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+
+ NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+out:
+ return skb2;
+
+nlmsg_failure: /* Used by NLMSG_PUT */
+ kfree_skb(skb2);
+ return NULL;
+}
+
+#define mstore(value, id, buf) \
+({ \
+ u32 _type = id & NPROC_TYPE_MASK; \
+ switch (_type) { \
+ case NPROC_TYPE_U32: { \
+ __u32 *p = (u32 *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ case NPROC_TYPE_UL: { \
+ unsigned long *p = (unsigned long *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ case NPROC_TYPE_U64: { \
+ __u64 *p = (u64 *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ default: \
+ perror("Huh? Bad type!\n"); \
+ } \
+})
+
+static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
+{
+ struct task_mem tsk_mem;
+ struct task_mem_cheap tsk_mem_cheap;
+
+ tsk_mem.vmdata = (~0);
+ tsk_mem_cheap.vmsize = (~0);
+
+ switch (id) {
+ case NPROC_PID:
+ mstore(tsk->pid, NPROC_PID, buf);
+ break;
+ case NPROC_UID:
+ mstore(tsk->uid, NPROC_UID, buf);
+ break;
+ case NPROC_VMSIZE:
+ case NPROC_VMLOCK:
+ case NPROC_VMRSS:
+ if (tsk_mem_cheap.vmsize == (~0))
+ __task_mem_cheap(tsk, &tsk_mem_cheap);
+
+ switch (id) {
+ case NPROC_VMSIZE:
+ mstore(tsk_mem_cheap.vmsize,
+ NPROC_VMSIZE, buf);
+ break;
+ case NPROC_VMLOCK:
+ mstore(tsk_mem_cheap.vmlock,
+ NPROC_VMLOCK, buf);
+ break;
+ case NPROC_VMRSS:
+ mstore(tsk_mem_cheap.vmrss,
+ NPROC_VMRSS, buf);
+ break;
+ }
+ break;
+ case NPROC_VMDATA:
+ case NPROC_VMSTACK:
+ case NPROC_VMEXE:
+ case NPROC_VMLIB:
+ if (tsk_mem.vmdata == (~0))
+ __task_mem(tsk, &tsk_mem);
+
+ switch (id) {
+ case NPROC_VMDATA:
+ mstore(tsk_mem.vmdata, NPROC_VMDATA,
+ buf);
+ break;
+ case NPROC_VMSTACK:
+ mstore(tsk_mem.vmstack, NPROC_VMSTACK,
+ buf);
+ break;
+ case NPROC_VMEXE:
+ mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+ break;
+ case NPROC_VMLIB:
+ mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+ break;
+ }
+ break;
+ case NPROC_JIFFIES:
+ mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+ break;
+ case NPROC_WCHAN:
+ mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+ break;
+ case NPROC_NAME:
+ mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+ strncpy(buf, tsk->comm, sizeof(tsk->comm));
+ buf += sizeof(tsk->comm);
+ break;
+ case NPROC_NOP_UL:
+ mstore(0, NPROC_TYPE_UL, buf);
+ break;
+ default:
+ pwarn("Unknown field ID %#x.\n", id);
+ goto err_inval;
+ }
+ return buf;
+err_inval:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+ int i;
+ int err = 0;
+ struct sk_buff *skb2;
+ char *buf;
+ struct nlmsghdr *nlh2;
+ u32 fcnt, *fields;
+
+ fcnt = fdata[0];
+ fields = &fdata[1];
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+ nlh2 = (struct nlmsghdr *)skb2->data;
+ buf = NLMSG_DATA(nlh2);
+
+ for (i = 0; i < fcnt; i++) {
+ buf = nproc_ps_field(fields[i], buf, tsk);
+ if (IS_ERR(buf)) {
+ err = PTR_ERR(buf);
+ goto out_free;
+ }
+ }
+ err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+ return err;
+out_free:
+ kfree_skb(skb2);
+out:
+ return err;
+}
+
+/*
+ * Find task for given pid, grab task lock (caller must unlock).
+ */
+static task_t *nproc_ps_get_task(int pid)
+{
+ task_t *tsk;
+
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (tsk)
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ return tsk;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+{
+ int i;
+ int err = 0;
+ u32 tcnt;
+ u32 *pids;
+
+ if (left < sizeof(tcnt))
+ goto err_inval;
+ left -= sizeof(tcnt);
+
+ tcnt = sdata[0];
+
+ if (left < (tcnt * sizeof(u32)))
+ goto err_inval;
+ left -= tcnt * sizeof(u32);
+
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ pids = &sdata[1];
+
+ for (i = 0; i < tcnt; i++) {
+ task_t *tsk;
+ tsk = nproc_ps_get_task(pids[i]);
+ if (!tsk)
+ continue;
+ err = nproc_pid_msg(nlh, fdata, len, tsk);
+ put_task_struct(tsk);
+ if (err)
+ goto out;
+ }
+
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+
+/*
+ * Iterate over all PIDs.
+ */
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+{
+ void *map;
+ int offset, i;
+ int err = 0;
+
+ for (i = 0; i < PIDMAP_ENTRIES; i++) {
+
+ map = get_pid_map(i);
+ if (!map) /* done -- there are no holes in pidmap_array */
+ break;
+ if (IS_ERR(map)) /* No PIDs used in this map */
+ continue;
+ offset = 0;
+ for ( ; ; ) {
+ int pid;
+ task_t *tsk;
+ offset = find_next_bit(map, BITS_PER_PAGE, ++offset);
+ if (offset >= BITS_PER_PAGE)
+ break;
+ pid = offset + i * BITS_PER_PAGE;
+ tsk = nproc_ps_get_task(pid);
+ if (!tsk)
+ continue;
+ err = nproc_pid_msg(nlh, fdata, len, tsk);
+ put_task_struct(tsk);
+ if (err)
+ goto out;
+ }
+ }
+
+out:
+ return err;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+ u32 len = 0;
+
+ switch (id) {
+ case NPROC_NAME:
+ len = sizeof(u32) +
+ sizeof(((struct task_struct*)0)->comm);
+ break;
+ default:
+ pwarn("Unknown field size in %#x.\n", id);
+ }
+ return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+ u32 *fields;
+ u32 fcnt;
+ int i;
+ *len = 0;
+
+ if (*left < sizeof(fcnt))
+ goto err_inval;
+ *left -= sizeof(fcnt);
+
+ fcnt = data[0];
+
+ if (*left < (fcnt * sizeof(u32)))
+ goto err_inval;
+ *left -= fcnt * sizeof(u32);
+
+ fields = &data[1];
+
+ for (i = 0; i < fcnt; i++) {
+ u32 id = fields[i];
+ u32 type = id & NPROC_TYPE_MASK;
+ pdebug(" %#8.8x.\n", fields[i]);
+ switch (type) {
+ case NPROC_TYPE_U32:
+ *len += sizeof(u32);
+ break;
+ case NPROC_TYPE_UL:
+ *len += sizeof(unsigned long);
+ break;
+ case NPROC_TYPE_U64:
+ *len += sizeof(u64);
+ break;
+ default: { /* Special cases */
+ u32 slen;
+ slen = __reply_size_special(id);
+ if (slen)
+ *len += slen;
+ else
+ goto err_inval;
+ }
+ }
+ }
+
+ return &fields[fcnt];
+
+err_inval:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Adding additional selectors
+ * (e.g. select by uid) is easy, but is there a need?
+ */
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+{
+ int err;
+ u32 len;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 *sdata;
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+ sdata = __reply_size(data, &left, &len);
+ if (IS_ERR(sdata)) {
+ err = PTR_ERR(sdata);
+ goto out;
+ }
+
+ if (left < sizeof(u32))
+ goto err_inval;
+ left -= sizeof(u32);
+
+ switch (*sdata) {
+ case NPROC_SELECT_ALL:
+ if (left)
+ pwarn("%d bytes left.\n", left);
+ err = nproc_ps_select_all(nlh, data, len);
+ break;
+ case NPROC_SELECT_PID:
+ err = nproc_ps_select_pid(nlh, data, len,
+ left, sdata + 1);
+ break;
+ default:
+ pwarn("Unknown selection method %#x.\n", *sdata);
+ goto err_inval;
+ }
+
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static char *nproc_global_field(u32 id, char *buf)
+{
+ struct page_state *ps = NULL;
+
+ switch (id) {
+ case NPROC_NR_DIRTY:
+ case NPROC_NR_WRITEBACK:
+ case NPROC_NR_UNSTABLE:
+ case NPROC_NR_PG_TABLE_PGS:
+ case NPROC_NR_MAPPED:
+ case NPROC_NR_SLAB:
+ if (!ps) {
+ ps = __vmstat();
+ if (IS_ERR(ps)) { /* Just pass it on */
+ buf = (void *)ps;
+ ps = NULL;
+ goto out;
+ }
+ }
+ switch (id) {
+ case NPROC_NR_DIRTY:
+ mstore(ps->nr_dirty, NPROC_NR_DIRTY,
+ buf);
+ break;
+ case NPROC_NR_WRITEBACK:
+ mstore(ps->nr_writeback,
+ NPROC_NR_WRITEBACK,
+ buf);
+ break;
+ case NPROC_NR_UNSTABLE:
+ mstore(ps->nr_unstable,
+ NPROC_NR_UNSTABLE,
+ buf);
+ break;
+ case NPROC_NR_PG_TABLE_PGS:
+ mstore(ps->nr_page_table_pages,
+ NPROC_NR_PG_TABLE_PGS,
+ buf);
+ break;
+ case NPROC_NR_MAPPED:
+ mstore(ps->nr_mapped, NPROC_NR_MAPPED,
+ buf);
+ break;
+ case NPROC_NR_SLAB:
+ mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+ break;
+ }
+ break;
+ case NPROC_MEMFREE:
+ mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+ break;
+ case NPROC_PAGESIZE:
+ mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+ break;
+ case NPROC_JIFFIES:
+ mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+ break;
+ default:
+ pwarn("Unknown field ID %#x.\n", id);
+ buf = ERR_PTR(-EINVAL);
+ goto out;
+ }
+ kfree(ps);
+out:
+ return buf;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+ int err, i;
+ void *errp;
+ struct sk_buff *skb2;
+ char *buf;
+ u32 fcnt, len;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 *fields;
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+ errp = __reply_size(data, &left, &len);
+ if (IS_ERR(errp)) {
+ err = PTR_ERR(errp);
+ goto out;
+ }
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ fcnt = data[0];
+ fields = &data[1];
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+ for (i = 0; i < fcnt; i++) {
+ buf = nproc_global_field(fields[i], buf);
+ if (IS_ERR(buf)) {
+ err = PTR_ERR(buf);
+ kfree_skb(skb2);
+ goto out;
+ }
+ }
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+}
+
+static int find_id(__u32 *data, __u32 *left)
+{
+ int i;
+ u32 id;
+
+ if (*left < sizeof(id))
+ goto err_inval;
+ *left -= sizeof(sizeof(id));
+
+ if (*left)
+ pwarn("%d bytes left.\n", *left);
+ id = data[1];
+
+ for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+ ; /* Do nothing */
+
+ if (labels[i].id != id) {
+ pwarn("No matching label found for %#x.\n", id);
+ goto err_inval;
+ }
+
+ return i;
+
+err_inval:
+ return -EINVAL;
+}
+
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+ int err;
+ struct sk_buff *skb2;
+ const char *label;
+ char *buf;
+ int len;
+ u32 ltype;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+ if (left < sizeof(ltype))
+ goto err_inval;
+ left -= sizeof(ltype);
+
+ ltype = data[0];
+
+ if (ltype == NPROC_LABEL_FIELD_NAME) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].label;
+ }
+ else if (ltype == NPROC_LABEL_FIELD_UNIT) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].unit;
+ }
+ else if (ltype == NPROC_LABEL_FIELD_FMT) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].fmt;
+ }
+ else if (ltype == NPROC_LABEL_WCHAN) {
+ char *modname;
+ unsigned long wchan, size, offset;
+ char namebuf[128];
+
+ if (left < sizeof(unsigned long))
+ goto err_inval;
+ left -= sizeof(unsigned long);
+
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ wchan = (unsigned long)data[1];
+ label = kallsyms_lookup(wchan, &size, &offset, &modname,
+ namebuf);
+
+ if (!label) {
+ pwarn("No ksym found for %#lx.\n", wchan);
+ goto err_inval;
+ }
+ }
+ else {
+ pwarn("Unknown label type %#x.\n", ltype);
+ goto err_inval;
+ }
+
+ len = strlen(label) + 1;
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+ strncpy(buf, label, len);
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+ int err, i, cnt, len;
+ struct sk_buff *skb2;
+ u32 *buf;
+
+ cnt = ARRAY_SIZE(labels);
+ len = (cnt + 1) * sizeof(u32);
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+ buf[0] = cnt;
+ for (i = 0; i < cnt; i++)
+ buf[i + 1] = labels[i].id;
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+ struct nlmsghdr *nlh)
+{
+ int err = 0;
+ uid_t uid;
+ kernel_cap_t caps;
+
+ if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+ goto out;
+
+ nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+ uid = NETLINK_CB(skb).creds.uid;
+ caps = NETLINK_CB(skb).eff_cap;
+
+ switch (nlh->nlmsg_type) {
+ case NPROC_GET_FIELD_LIST:
+ err = nproc_get_list(nlh);
+ break;
+ case NPROC_GET_LABEL:
+ err = nproc_get_label(nlh);
+ break;
+ case NPROC_GET_GLOBAL:
+ err = nproc_get_global(nlh);
+ break;
+ case NPROC_GET_PS:
+ err = nproc_get_ps(nlh, uid);
+ break;
+ default:
+ pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+ err = -EINVAL;
+ }
+out:
+ return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+ int err = 0;
+ struct nlmsghdr *nlh;
+
+ if (skb->len < NLMSG_LENGTH(0))
+ goto err_inval;
+
+ nlh = (struct nlmsghdr *)skb->data;
+ if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+ pwarn("Invalid packet.\n");
+ goto err_inval;
+ }
+
+ err = nproc_process_msg(skb, nlh);
+ if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+ pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err,
+ nlh->nlmsg_type, nlh->nlmsg_flags,
+ nlh->nlmsg_seq);
+ netlink_ack(skb, nlh, err);
+ }
+
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+ struct sk_buff *skb;
+
+ while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+ nproc_receive_skb(skb);
+ kfree_skb(skb);
+ }
+}
+
+static int nproc_init(void)
+{
+ nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+ if (!nproc_sock) {
+ pwarn("No netlink socket for nproc.\n");
+ return -ENODEV;
+ }
+
+ return 0;
+}
+
+module_init(nproc_init);
Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 06:10:54.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-08 17:45:27.504564546 -0700
@@ -148,6 +148,17 @@
return -1;
}
+void *get_pid_map(int idx)
+{
+ pidmap_t *map = pidmap_array + idx;
+ if (!map->page)
+ return NULL;
+ else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+ return ERR_PTR(-1);
+ else
+ return map->page;
+}
+
struct pid * fastcall find_pid(enum pid_type type, int nr)
{
struct hlist_node *elem;
On Wed, Sep 08, 2004 at 06:15:49PM -0700, William Lee Irwin III wrote:
>> This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
>> whatsoever to the underlying code were made; rather, this merely
>> resolves offsets so it applies cleanly.
>> Compiletested on ia64.
On Wed, Sep 08, 2004 at 06:17:08PM -0700, William Lee Irwin III wrote:
> Repost with appropriate Subject: line.
Make __task_mem() and __task_mem_cheap() use the appropriate methods
for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
The new methods for /proc/ accounting involve using counters kept in
the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
does not involve acquiring mm->mmap_sem for any per-mm statistics. The
CONFIG_MMU=n case still needs iteration over tblocks to calculate them.
-- wli
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-09-08 17:45:27.503587983 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-08 18:11:24.826811093 -0700
@@ -44,44 +44,20 @@
* __task_mem/__task_mem_cheap basically duplicate the MMU version of
* task_mem, but they are split by cost and work on structs.
*/
-
+#ifdef CONFIG_MMU
static void __task_mem(struct task_struct *tsk, struct task_mem *res)
{
struct mm_struct *mm = get_task_mm(tsk);
- if (mm) {
- unsigned long data = 0, stack = 0, exec = 0, lib = 0;
- struct vm_area_struct *vma;
-
- down_read(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
- if (!vma->vm_file) {
- data += len;
- if (vma->vm_flags & VM_GROWSDOWN)
- stack += len;
- continue;
- }
- if (vma->vm_flags & VM_WRITE)
- continue;
- if (vma->vm_flags & VM_EXEC) {
- exec += len;
- if (vma->vm_flags & VM_EXECUTABLE)
- continue;
- lib += len;
- }
- }
- res->vmdata = data - stack;
- res->vmstack = stack;
- res->vmexe = exec - lib;
- res->vmlib = lib;
- up_read(&mm->mmap_sem);
+ if (!mm)
+ memset(res, 0, sizeof(struct task_mem));
+ else {
+ res->vmdata = (mm->total_vm - mm->shared_vm - mm->stack_vm)
+ << (PAGE_SHIFT - 10);
+ res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
+ res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
+ res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
mmput(mm);
- } else {
- res->vmdata = 0;
- res->vmstack = 0;
- res->vmexe = 0;
- res->vmlib = 0;
}
}
@@ -99,6 +75,80 @@
res->vmrss = 0;
}
}
+#else /* !CONFIG_MMU */
+static void __task_mem(task_t *task, struct task_mem *stats)
+{
+ struct mm_struct *mm = get_task_mm(task)
+
+ if (!mm)
+ memset(stats, 0, sizeof(struct task_mem));
+ else {
+ unsigned long bytes = 0, sbytes = 0, slack = 0;
+ struct mm_tblk_struct *tblk;
+
+ down_read(&mm->mmap_sem);
+ for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
+ if (!tblk->rblock)
+ continue;
+ bytes += kobjsize(tblk);
+ if (atomic_read(&mm->mm_count) > 1) ||
+ tblk->rblock->refcount > 1) {
+ sbytes += kobjsize(tblk->rblock->kblock);
+ sbytes += kobjsize(tblk->rblock);
+ } else {
+ bytes += kobjsize(tblk->rblock->kblock);
+ bytes += kobjsize(tblk->rblock);
+ slack += kobjsize(tblock->rblock->kblock);
+ }
+ }
+ if (atomic_read(&mm->mm_count) > 1)
+ sbytes += kobjsize(mm);
+ else
+ bytes += kobjsize(mm);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ if (task->fs && atomic_read(&task->fs->count) > 1)
+ sbytes += kobjsize(task->files);
+ else
+ bytes += kobjsize(task->files);
+ if (task->sighand && atomic_read(&task->sighand->count) > 1)
+ sbytes += kobjsize(task->sighand);
+ else
+ bytes += kobjsize(task->sighand);
+ bytes += kobjsize(task);
+ /* some interpretation is needed */
+ stats->vmdata = bytes;
+ stats->vmstack = sbytes;
+ stats->vmexe = stats->vmlib = 0;
+ }
+}
+
+static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
+{
+ struct mm_struct *mm = get_task_mm(task);
+ struct mm_tblock_struct *tblk;
+ int size;
+
+ memset(stats, 0, sizeof(struct task_mem_cheap));
+ stats->vmrss += kobjsize(mm);
+ down_read(&mm->mmap_sem);
+ for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
+ if (tblk->next)
+ stats->vmrss += kobjsize(tblk->next);
+ if (tblk->rblock) {
+ stats->vmsize += kobjsize(tblk->rblock);
+ stats->vmrss += kobjsize(tblk->rblock);
+ stats->vmrss += kobjsize(tblk->rblock->kblock);
+ }
+ }
+ stats->vmrss += mm->end_code - mm->start_code;
+ stats->vmrss += mm->start_stack - mm->start_data;
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ stats->vmrss >>= 10;
+ stats->vmsize >>= 10;
+}
+#endif /* !CONFIG_MMU */
/*
* page_alloc.c already has an extra function broken out to fill a
On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
>>> Any chance you could convert these to use the new vm statistics
>>> accounting?
On Wed, Sep 08, 2004 at 05:43:20PM -0700, William Lee Irwin III wrote:
>> Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
>> For that you will need to #ifdef on CONFIG_MMU and use the methods
>> in fs/proc/task_nommu.c and so on.
On Wed, Sep 08, 2004 at 06:15:49PM -0700, William Lee Irwin III wrote:
> This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
> whatsoever to the underlying code were made; rather, this merely
> resolves offsets so it applies cleanly.
> Compiletested on ia64.
Repost with appropriate Subject: line.
-- wli
Index: mm4-2.6.9-rc1/include/linux/netlink.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/netlink.h 2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/netlink.h 2004-09-08 17:45:27.500658296 -0700
@@ -15,6 +15,7 @@
#define NETLINK_ARPD 8
#define NETLINK_AUDIT 9 /* auditing */
#define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */
+#define NETLINK_NPROC 12 /* /proc information */
#define NETLINK_IP6_FW 13
#define NETLINK_DNRTMSG 14 /* DECnet routing messages */
#define NETLINK_KEVENT 15 /* Kernel messages to userspace */
Index: mm4-2.6.9-rc1/include/linux/nproc.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/nproc.h 2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/nproc.h 2004-09-08 17:45:27.501634858 -0700
@@ -0,0 +1,119 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifndef __KERNEL__
+#define CONFIG_NPROC
+#endif
+
+#ifdef CONFIG_NPROC
+
+/* Request types */
+#define NPROC_BASE 0x10
+#define NPROC_GET_FIELD_LIST (NPROC_BASE+0)
+#define NPROC_GET_LABEL (NPROC_BASE+1)
+#define NPROC_GET_GLOBAL (NPROC_BASE+2)
+#define NPROC_GET_PS (NPROC_BASE+3)
+#define NPROC_GET_PID_LIST (NPROC_BASE+4)
+
+/* Request flags */
+
+
+/* Field scopes */
+#define NPROC_SCOPE_MASK 0x70000000
+#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS 0x20000000
+#define NPROC_SCOPE_LABEL 0x30000000
+
+/* Data types */
+#define NPROC_TYPE_MASK 0x07000000
+#define NPROC_TYPE_STRING 0x01000000
+#define NPROC_TYPE_U32 0x02000000
+#define NPROC_TYPE_UL 0x03000000
+#define NPROC_TYPE_U64 0x04000000
+
+/* Access control (unused) */
+#define NPROC_PERM_MASK 0x00300000
+#define NPROC_PERM_USER 0x00100000
+#define NPROC_PERM_ROOT 0x00200000
+
+/* Selectors */
+#define NPROC_SELECT_ALL 0x00000001
+#define NPROC_SELECT_PID 0x00000002
+#define NPROC_SELECT_UID 0x00000003
+
+/* Labels */
+#define NPROC_LABEL_FIELD_NAME 0x00000001
+#define NPROC_LABEL_FIELD_FMT 0x00000002
+#define NPROC_LABEL_FIELD_UNIT 0x00000003
+#define NPROC_LABEL_WCHAN 0x00000004
+
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL)
+#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN (0x00000080 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+struct nproc_field {
+ __u32 id;
+ const char *label;
+ const char *fmt;
+ const char *unit;
+};
+
+static struct nproc_field labels[] = {
+ { NPROC_PID, "PID", "%5u", "" },
+ { NPROC_NAME, "Name", "%-15s","" },
+ { NPROC_MEMFREE, "MemFree", "%8u", "page" },
+ { NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
+ { NPROC_JIFFIES, "Jiffies", "%10u", "" },
+ { NPROC_VMSIZE, "VmSize", "%8u", "KiB" },
+ { NPROC_VMLOCK, "VmLock", "%8u", "KiB" },
+ { NPROC_VMRSS, "VmRSS", "%8u", "KiB" },
+ { NPROC_VMDATA, "VmData", "%8u", "KiB" },
+ { NPROC_VMSTACK, "VmStack", "%8u", "KiB" },
+ { NPROC_VMEXE, "VmExe", "%8u", "KiB" },
+ { NPROC_VMLIB, "VmLib", "%8u", "KiB" },
+ { NPROC_UID, "UID", "%5u", "" },
+ { NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" },
+ { NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" },
+ { NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" },
+ { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" },
+ { NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" },
+ { NPROC_NR_SLAB, "nr_slab", "%8u", "page" },
+ { NPROC_WCHAN, "wchan", "%p", "" },
+#ifdef CONFIG_KALLSYMS
+ { NPROC_WCHAN_NAME, "wchan_symbol", "%s"},
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
Index: mm4-2.6.9-rc1/include/linux/pid.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/pid.h 2004-09-08 06:10:36.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/pid.h 2004-09-08 17:45:27.501634858 -0700
@@ -37,6 +37,7 @@
extern struct pid *FASTCALL(find_pid(enum pid_type, int));
extern int alloc_pidmap(void);
+extern void *get_pid_map(int);
extern void FASTCALL(free_pidmap(int));
extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread);
Index: mm4-2.6.9-rc1/init/Kconfig
===================================================================
--- mm4-2.6.9-rc1.orig/init/Kconfig 2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/init/Kconfig 2004-09-08 17:45:27.504564546 -0700
@@ -139,6 +139,13 @@
building a kernel for install/rescue disks or your system is very
limited in memory.
+config NPROC
+ bool "Netlink interface to /proc information"
+ depends on PROC_FS && EXPERIMENTAL
+ default y
+ help
+ Nproc is a netlink interface to /proc information.
+
config AUDIT
bool "Auditing support"
default y if SECURITY_SELINUX
Index: mm4-2.6.9-rc1/kernel/Makefile
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/Makefile 2004-09-08 06:10:50.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/Makefile 2004-09-08 17:45:27.501634858 -0700
@@ -15,6 +15,7 @@
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
obj-$(CONFIG_PM) += power/
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-04-25 12:31:02.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-08 17:45:27.503587983 -0700
@@ -0,0 +1,851 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h> /* nr_free_pages() */
+#include <linux/kallsyms.h> /* kallsyms_lookup() */
+#include <linux/pid.h> /* get_pid_map() */
+#include <linux/nproc.h>
+#include <asm/bitops.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+ u32 vmdata;
+ u32 vmstack;
+ u32 vmexe;
+ u32 vmlib;
+};
+
+struct task_mem_cheap {
+ u32 vmsize;
+ u32 vmlock;
+ u32 vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+static void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+ struct mm_struct *mm = get_task_mm(tsk);
+ if (mm) {
+ unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+ if (!vma->vm_file) {
+ data += len;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ stack += len;
+ continue;
+ }
+ if (vma->vm_flags & VM_WRITE)
+ continue;
+ if (vma->vm_flags & VM_EXEC) {
+ exec += len;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ continue;
+ lib += len;
+ }
+ }
+ res->vmdata = data - stack;
+ res->vmstack = stack;
+ res->vmexe = exec - lib;
+ res->vmlib = lib;
+ up_read(&mm->mmap_sem);
+
+ mmput(mm);
+ } else {
+ res->vmdata = 0;
+ res->vmstack = 0;
+ res->vmexe = 0;
+ res->vmlib = 0;
+ }
+}
+
+static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+ struct mm_struct *mm = get_task_mm(tsk);
+ if (mm) {
+ res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+ res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+ res->vmrss = mm->rss << (PAGE_SHIFT-10);
+ mmput(mm);
+ } else {
+ res->vmsize = 0;
+ res->vmlock = 0;
+ res->vmrss = 0;
+ }
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+static struct page_state *__vmstat(void)
+{
+ struct page_state *ps;
+ ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+ if (!ps)
+ return ERR_PTR(-ENOMEM);
+ get_full_page_state(ps);
+ ps->pgpgin /= 2; /* sectors -> kbytes */
+ ps->pgpgout /= 2;
+ return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+ __u32 seq = nlh->nlmsg_seq;
+ __u16 type = nlh->nlmsg_type;
+ __u32 pid = nlh->nlmsg_pid;
+ struct sk_buff *skb2 = 0;
+
+ skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+ if (!skb2) {
+ skb2 = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+
+ NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+out:
+ return skb2;
+
+nlmsg_failure: /* Used by NLMSG_PUT */
+ kfree_skb(skb2);
+ return NULL;
+}
+
+#define mstore(value, id, buf) \
+({ \
+ u32 _type = id & NPROC_TYPE_MASK; \
+ switch (_type) { \
+ case NPROC_TYPE_U32: { \
+ __u32 *p = (u32 *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ case NPROC_TYPE_UL: { \
+ unsigned long *p = (unsigned long *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ case NPROC_TYPE_U64: { \
+ __u64 *p = (u64 *)buf; \
+ *p = value; \
+ buf = (char *)++p; \
+ break; \
+ } \
+ default: \
+ perror("Huh? Bad type!\n"); \
+ } \
+})
+
+static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
+{
+ struct task_mem tsk_mem;
+ struct task_mem_cheap tsk_mem_cheap;
+
+ tsk_mem.vmdata = (~0);
+ tsk_mem_cheap.vmsize = (~0);
+
+ switch (id) {
+ case NPROC_PID:
+ mstore(tsk->pid, NPROC_PID, buf);
+ break;
+ case NPROC_UID:
+ mstore(tsk->uid, NPROC_UID, buf);
+ break;
+ case NPROC_VMSIZE:
+ case NPROC_VMLOCK:
+ case NPROC_VMRSS:
+ if (tsk_mem_cheap.vmsize == (~0))
+ __task_mem_cheap(tsk, &tsk_mem_cheap);
+
+ switch (id) {
+ case NPROC_VMSIZE:
+ mstore(tsk_mem_cheap.vmsize,
+ NPROC_VMSIZE, buf);
+ break;
+ case NPROC_VMLOCK:
+ mstore(tsk_mem_cheap.vmlock,
+ NPROC_VMLOCK, buf);
+ break;
+ case NPROC_VMRSS:
+ mstore(tsk_mem_cheap.vmrss,
+ NPROC_VMRSS, buf);
+ break;
+ }
+ break;
+ case NPROC_VMDATA:
+ case NPROC_VMSTACK:
+ case NPROC_VMEXE:
+ case NPROC_VMLIB:
+ if (tsk_mem.vmdata == (~0))
+ __task_mem(tsk, &tsk_mem);
+
+ switch (id) {
+ case NPROC_VMDATA:
+ mstore(tsk_mem.vmdata, NPROC_VMDATA,
+ buf);
+ break;
+ case NPROC_VMSTACK:
+ mstore(tsk_mem.vmstack, NPROC_VMSTACK,
+ buf);
+ break;
+ case NPROC_VMEXE:
+ mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+ break;
+ case NPROC_VMLIB:
+ mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+ break;
+ }
+ break;
+ case NPROC_JIFFIES:
+ mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+ break;
+ case NPROC_WCHAN:
+ mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+ break;
+ case NPROC_NAME:
+ mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+ strncpy(buf, tsk->comm, sizeof(tsk->comm));
+ buf += sizeof(tsk->comm);
+ break;
+ case NPROC_NOP_UL:
+ mstore(0, NPROC_TYPE_UL, buf);
+ break;
+ default:
+ pwarn("Unknown field ID %#x.\n", id);
+ goto err_inval;
+ }
+ return buf;
+err_inval:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+ int i;
+ int err = 0;
+ struct sk_buff *skb2;
+ char *buf;
+ struct nlmsghdr *nlh2;
+ u32 fcnt, *fields;
+
+ fcnt = fdata[0];
+ fields = &fdata[1];
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+ nlh2 = (struct nlmsghdr *)skb2->data;
+ buf = NLMSG_DATA(nlh2);
+
+ for (i = 0; i < fcnt; i++) {
+ buf = nproc_ps_field(fields[i], buf, tsk);
+ if (IS_ERR(buf)) {
+ err = PTR_ERR(buf);
+ goto out_free;
+ }
+ }
+ err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+ return err;
+out_free:
+ kfree_skb(skb2);
+out:
+ return err;
+}
+
+/*
+ * Find task for given pid, grab task lock (caller must unlock).
+ */
+static task_t *nproc_ps_get_task(int pid)
+{
+ task_t *tsk;
+
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (tsk)
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ return tsk;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+{
+ int i;
+ int err = 0;
+ u32 tcnt;
+ u32 *pids;
+
+ if (left < sizeof(tcnt))
+ goto err_inval;
+ left -= sizeof(tcnt);
+
+ tcnt = sdata[0];
+
+ if (left < (tcnt * sizeof(u32)))
+ goto err_inval;
+ left -= tcnt * sizeof(u32);
+
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ pids = &sdata[1];
+
+ for (i = 0; i < tcnt; i++) {
+ task_t *tsk;
+ tsk = nproc_ps_get_task(pids[i]);
+ if (!tsk)
+ continue;
+ err = nproc_pid_msg(nlh, fdata, len, tsk);
+ put_task_struct(tsk);
+ if (err)
+ goto out;
+ }
+
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+
+/*
+ * Iterate over all PIDs.
+ */
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+{
+ void *map;
+ int offset, i;
+ int err = 0;
+
+ for (i = 0; i < PIDMAP_ENTRIES; i++) {
+
+ map = get_pid_map(i);
+ if (!map) /* done -- there are no holes in pidmap_array */
+ break;
+ if (IS_ERR(map)) /* No PIDs used in this map */
+ continue;
+ offset = 0;
+ for ( ; ; ) {
+ int pid;
+ task_t *tsk;
+ offset = find_next_bit(map, BITS_PER_PAGE, ++offset);
+ if (offset >= BITS_PER_PAGE)
+ break;
+ pid = offset + i * BITS_PER_PAGE;
+ tsk = nproc_ps_get_task(pid);
+ if (!tsk)
+ continue;
+ err = nproc_pid_msg(nlh, fdata, len, tsk);
+ put_task_struct(tsk);
+ if (err)
+ goto out;
+ }
+ }
+
+out:
+ return err;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+ u32 len = 0;
+
+ switch (id) {
+ case NPROC_NAME:
+ len = sizeof(u32) +
+ sizeof(((struct task_struct*)0)->comm);
+ break;
+ default:
+ pwarn("Unknown field size in %#x.\n", id);
+ }
+ return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+ u32 *fields;
+ u32 fcnt;
+ int i;
+ *len = 0;
+
+ if (*left < sizeof(fcnt))
+ goto err_inval;
+ *left -= sizeof(fcnt);
+
+ fcnt = data[0];
+
+ if (*left < (fcnt * sizeof(u32)))
+ goto err_inval;
+ *left -= fcnt * sizeof(u32);
+
+ fields = &data[1];
+
+ for (i = 0; i < fcnt; i++) {
+ u32 id = fields[i];
+ u32 type = id & NPROC_TYPE_MASK;
+ pdebug(" %#8.8x.\n", fields[i]);
+ switch (type) {
+ case NPROC_TYPE_U32:
+ *len += sizeof(u32);
+ break;
+ case NPROC_TYPE_UL:
+ *len += sizeof(unsigned long);
+ break;
+ case NPROC_TYPE_U64:
+ *len += sizeof(u64);
+ break;
+ default: { /* Special cases */
+ u32 slen;
+ slen = __reply_size_special(id);
+ if (slen)
+ *len += slen;
+ else
+ goto err_inval;
+ }
+ }
+ }
+
+ return &fields[fcnt];
+
+err_inval:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Adding additional selectors
+ * (e.g. select by uid) is easy, but is there a need?
+ */
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+{
+ int err;
+ u32 len;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 *sdata;
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+ sdata = __reply_size(data, &left, &len);
+ if (IS_ERR(sdata)) {
+ err = PTR_ERR(sdata);
+ goto out;
+ }
+
+ if (left < sizeof(u32))
+ goto err_inval;
+ left -= sizeof(u32);
+
+ switch (*sdata) {
+ case NPROC_SELECT_ALL:
+ if (left)
+ pwarn("%d bytes left.\n", left);
+ err = nproc_ps_select_all(nlh, data, len);
+ break;
+ case NPROC_SELECT_PID:
+ err = nproc_ps_select_pid(nlh, data, len,
+ left, sdata + 1);
+ break;
+ default:
+ pwarn("Unknown selection method %#x.\n", *sdata);
+ goto err_inval;
+ }
+
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static char *nproc_global_field(u32 id, char *buf)
+{
+ struct page_state *ps = NULL;
+
+ switch (id) {
+ case NPROC_NR_DIRTY:
+ case NPROC_NR_WRITEBACK:
+ case NPROC_NR_UNSTABLE:
+ case NPROC_NR_PG_TABLE_PGS:
+ case NPROC_NR_MAPPED:
+ case NPROC_NR_SLAB:
+ if (!ps) {
+ ps = __vmstat();
+ if (IS_ERR(ps)) { /* Just pass it on */
+ buf = (void *)ps;
+ ps = NULL;
+ goto out;
+ }
+ }
+ switch (id) {
+ case NPROC_NR_DIRTY:
+ mstore(ps->nr_dirty, NPROC_NR_DIRTY,
+ buf);
+ break;
+ case NPROC_NR_WRITEBACK:
+ mstore(ps->nr_writeback,
+ NPROC_NR_WRITEBACK,
+ buf);
+ break;
+ case NPROC_NR_UNSTABLE:
+ mstore(ps->nr_unstable,
+ NPROC_NR_UNSTABLE,
+ buf);
+ break;
+ case NPROC_NR_PG_TABLE_PGS:
+ mstore(ps->nr_page_table_pages,
+ NPROC_NR_PG_TABLE_PGS,
+ buf);
+ break;
+ case NPROC_NR_MAPPED:
+ mstore(ps->nr_mapped, NPROC_NR_MAPPED,
+ buf);
+ break;
+ case NPROC_NR_SLAB:
+ mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+ break;
+ }
+ break;
+ case NPROC_MEMFREE:
+ mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+ break;
+ case NPROC_PAGESIZE:
+ mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+ break;
+ case NPROC_JIFFIES:
+ mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+ break;
+ default:
+ pwarn("Unknown field ID %#x.\n", id);
+ buf = ERR_PTR(-EINVAL);
+ goto out;
+ }
+ kfree(ps);
+out:
+ return buf;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+ int err, i;
+ void *errp;
+ struct sk_buff *skb2;
+ char *buf;
+ u32 fcnt, len;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 *fields;
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+ errp = __reply_size(data, &left, &len);
+ if (IS_ERR(errp)) {
+ err = PTR_ERR(errp);
+ goto out;
+ }
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ fcnt = data[0];
+ fields = &data[1];
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+ for (i = 0; i < fcnt; i++) {
+ buf = nproc_global_field(fields[i], buf);
+ if (IS_ERR(buf)) {
+ err = PTR_ERR(buf);
+ kfree_skb(skb2);
+ goto out;
+ }
+ }
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+}
+
+static int find_id(__u32 *data, __u32 *left)
+{
+ int i;
+ u32 id;
+
+ if (*left < sizeof(id))
+ goto err_inval;
+ *left -= sizeof(sizeof(id));
+
+ if (*left)
+ pwarn("%d bytes left.\n", *left);
+ id = data[1];
+
+ for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+ ; /* Do nothing */
+
+ if (labels[i].id != id) {
+ pwarn("No matching label found for %#x.\n", id);
+ goto err_inval;
+ }
+
+ return i;
+
+err_inval:
+ return -EINVAL;
+}
+
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+ int err;
+ struct sk_buff *skb2;
+ const char *label;
+ char *buf;
+ int len;
+ u32 ltype;
+ u32 *data = NLMSG_DATA(nlh);
+ u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+ if (left < sizeof(ltype))
+ goto err_inval;
+ left -= sizeof(ltype);
+
+ ltype = data[0];
+
+ if (ltype == NPROC_LABEL_FIELD_NAME) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].label;
+ }
+ else if (ltype == NPROC_LABEL_FIELD_UNIT) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].unit;
+ }
+ else if (ltype == NPROC_LABEL_FIELD_FMT) {
+ int idx;
+ idx = find_id(data, &left);
+ if (idx < 0)
+ goto err_inval;
+ label = labels[idx].fmt;
+ }
+ else if (ltype == NPROC_LABEL_WCHAN) {
+ char *modname;
+ unsigned long wchan, size, offset;
+ char namebuf[128];
+
+ if (left < sizeof(unsigned long))
+ goto err_inval;
+ left -= sizeof(unsigned long);
+
+ if (left)
+ pwarn("%d bytes left.\n", left);
+
+ wchan = (unsigned long)data[1];
+ label = kallsyms_lookup(wchan, &size, &offset, &modname,
+ namebuf);
+
+ if (!label) {
+ pwarn("No ksym found for %#lx.\n", wchan);
+ goto err_inval;
+ }
+ }
+ else {
+ pwarn("Unknown label type %#x.\n", ltype);
+ goto err_inval;
+ }
+
+ len = strlen(label) + 1;
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+ strncpy(buf, label, len);
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+ int err, i, cnt, len;
+ struct sk_buff *skb2;
+ u32 *buf;
+
+ cnt = ARRAY_SIZE(labels);
+ len = (cnt + 1) * sizeof(u32);
+
+ skb2 = nproc_alloc_nlmsg(nlh, len);
+ if (IS_ERR(skb2)) {
+ err = PTR_ERR(skb2);
+ goto out;
+ }
+
+ buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+ buf[0] = cnt;
+ for (i = 0; i < cnt; i++)
+ buf[i + 1] = labels[i].id;
+
+ err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0);
+ if (err > 0)
+ err = 0;
+out:
+ return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+ struct nlmsghdr *nlh)
+{
+ int err = 0;
+ uid_t uid;
+ kernel_cap_t caps;
+
+ if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+ goto out;
+
+ nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+ uid = NETLINK_CB(skb).creds.uid;
+ caps = NETLINK_CB(skb).eff_cap;
+
+ switch (nlh->nlmsg_type) {
+ case NPROC_GET_FIELD_LIST:
+ err = nproc_get_list(nlh);
+ break;
+ case NPROC_GET_LABEL:
+ err = nproc_get_label(nlh);
+ break;
+ case NPROC_GET_GLOBAL:
+ err = nproc_get_global(nlh);
+ break;
+ case NPROC_GET_PS:
+ err = nproc_get_ps(nlh, uid);
+ break;
+ default:
+ pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+ err = -EINVAL;
+ }
+out:
+ return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+ int err = 0;
+ struct nlmsghdr *nlh;
+
+ if (skb->len < NLMSG_LENGTH(0))
+ goto err_inval;
+
+ nlh = (struct nlmsghdr *)skb->data;
+ if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+ pwarn("Invalid packet.\n");
+ goto err_inval;
+ }
+
+ err = nproc_process_msg(skb, nlh);
+ if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+ pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err,
+ nlh->nlmsg_type, nlh->nlmsg_flags,
+ nlh->nlmsg_seq);
+ netlink_ack(skb, nlh, err);
+ }
+
+ return err;
+
+err_inval:
+ return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+ struct sk_buff *skb;
+
+ while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+ nproc_receive_skb(skb);
+ kfree_skb(skb);
+ }
+}
+
+static int nproc_init(void)
+{
+ nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+ if (!nproc_sock) {
+ pwarn("No netlink socket for nproc.\n");
+ return -ENODEV;
+ }
+
+ return 0;
+}
+
+module_init(nproc_init);
Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 06:10:54.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-08 17:45:27.504564546 -0700
@@ -148,6 +148,17 @@
return -1;
}
+void *get_pid_map(int idx)
+{
+ pidmap_t *map = pidmap_array + idx;
+ if (!map->page)
+ return NULL;
+ else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+ return ERR_PTR(-1);
+ else
+ return map->page;
+}
+
struct pid * fastcall find_pid(enum pid_type type, int nr)
{
struct hlist_node *elem;
On Wed, Sep 08, 2004 at 06:21:37PM -0700, William Lee Irwin III wrote:
> Make __task_mem() and __task_mem_cheap() use the appropriate methods
> for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
> The new methods for /proc/ accounting involve using counters kept in
> the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
> does not involve acquiring mm->mmap_sem for any per-mm statistics. The
> CONFIG_MMU=n case still needs iteration over tblocks to calculate them.
Round up text memory to the nearest page to resolve potential alignment
anomalies in reported statistics. Compiletested on ia64.
-- wli
Index: mm4-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- mm4-2.6.9-rc1.orig/fs/proc/task_mmu.c 2004-09-08 06:10:35.000000000 -0700
+++ mm4-2.6.9-rc1/fs/proc/task_mmu.c 2004-09-08 18:27:39.401017905 -0700
@@ -9,7 +9,7 @@
unsigned long data, text, lib;
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
- text = (mm->end_code - mm->start_code) >> 10;
+ text = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
buffer += sprintf(buffer,
"VmSize:\t%8lu kB\n"
On Wed, Sep 08, 2004 at 06:21:37PM -0700, William Lee Irwin III wrote:
> Make __task_mem() and __task_mem_cheap() use the appropriate methods
> for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
> The new methods for /proc/ accounting involve using counters kept in
> the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
> does not involve acquiring mm->mmap_sem for any per-mm statistics. The
> CONFIG_MMU=n case still needs iteration over tblocks to calculate them.
Once again, compiletested only on ia64.
-- wli
On Wed, 2004-09-08 at 14:41, Roger Luethi wrote:
> A few notes:
> - Access control can be implemented easily. Right now it would be bloat,
> though -- the vast majority of fields in /proc are world-readable
> (/proc/pid/environ being the notable exception).
They aren't world readable when using a security module like SELinux;
they are then typically only accessible by processes in the same
security domain, aside from processes in privileged domains.
security_task_to_inode() hook sets the security attributes on the
/proc/pid inodes based on their security context, and then
security_inode_permission() hook controls access to them. So you need
at least comparable controls.
--
Stephen Smalley <[email protected]>
National Security Agency
On Wed, 2004-09-08 at 14:41, Roger Luethi wrote:
>> A few notes:
>> - Access control can be implemented easily. Right now it would be bloat,
>> though -- the vast majority of fields in /proc are world-readable
>> (/proc/pid/environ being the notable exception).
On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> They aren't world readable when using a security module like SELinux;
> they are then typically only accessible by processes in the same
> security domain, aside from processes in privileged domains.
> security_task_to_inode() hook sets the security attributes on the
> /proc/pid inodes based on their security context, and then
> security_inode_permission() hook controls access to them. So you need
> at least comparable controls.
Can you make a more specific suggestion regarding the controls to use?
It's a bit awkward for those highly unfamiliar with the subsystem to
invent new methods for the security layer independently, so it's likely
best some guidance (e.g. function prototype) be given.
-- wli
On Wed, 08 Sep 2004 17:35:29 -0700, William Lee Irwin III wrote:
> On Wed, Sep 08, 2004 at 08:41:30PM +0200, Roger Luethi wrote:
> > A few notes:
> > - Access control can be implemented easily. Right now it would be bloat,
> > though -- the vast majority of fields in /proc are world-readable
> > (/proc/pid/environ being the notable exception).
> > - Additional process selectors (e.g. select by UID) are not hard to
> > add, either, should there ever be a need.
> > - There are a few things I'm not sure about: For instance, what is a good
> > return value for mm_struct related fields wrt kernel threads? I picked
> > 0, but ~(0) might be preferable because it's distinct.
> > Signed-off-by: Roger Luethi <[email protected]>
>
> Any chance you could convert these to use the new vm statistics
> accounting?
Mea culpa. I copied the routines wholesale from 2.6.7 when I started
work on nproc. They still seemed to work with 2.6.9-rc1-bk13, I hadn't
noticed the work that had gone into field computation already. So for
CONFIG_MMU, values in both __task_mem and __task_mem_cheap are cheap
now. The routines can be merged.
!CONFIG_MMU is a different story. Presumably, it needs a change in the
fields that are offered (cp. task_mem in fs/proc/task_nommu.c).
FWIW, my prefered solution would be to have only one routine task_mem
to fill the respective struct for nproc and /proc.
There seems to be a discrepancy between current task_mem in
fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
for the nproc !CONFIG_MMU case. Can you explain?
Roger
On Wed, 08 Sep 2004 17:35:29 -0700, William Lee Irwin III wrote:
>> Any chance you could convert these to use the new vm statistics
>> accounting?
On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> Mea culpa. I copied the routines wholesale from 2.6.7 when I started
> work on nproc. They still seemed to work with 2.6.9-rc1-bk13, I hadn't
> noticed the work that had gone into field computation already. So for
> CONFIG_MMU, values in both __task_mem and __task_mem_cheap are cheap
> now. The routines can be merged.
> !CONFIG_MMU is a different story. Presumably, it needs a change in the
> fields that are offered (cp. task_mem in fs/proc/task_nommu.c).
> FWIW, my prefered solution would be to have only one routine task_mem
> to fill the respective struct for nproc and /proc.
I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
patch atop the others I sent.
On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> There seems to be a discrepancy between current task_mem in
> fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
> for the nproc !CONFIG_MMU case. Can you explain?
I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
did, however, have to mangle the things via guesswork to avoid adding
the new fields, which I really wanted you to arrange for or comment on
as they are a matter of interface. Also, could you be more specific
about these discrepancies?
-- wli
On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > They aren't world readable when using a security module like SELinux;
> > they are then typically only accessible by processes in the same
> > security domain, aside from processes in privileged domains.
> > security_task_to_inode() hook sets the security attributes on the
> > /proc/pid inodes based on their security context, and then
> > security_inode_permission() hook controls access to them. So you need
> > at least comparable controls.
>
> Can you make a more specific suggestion regarding the controls to use?
> It's a bit awkward for those highly unfamiliar with the subsystem to
For the same reason, I'm not comfortable with implementing SELinux type
access controls myself. How about:
config NPROC
depends on !SECURITY_SELINUX
Adding access control later won't be a problem for anyone who groks
SELinux.
Roger
On Thu, Sep 09, 2004 at 12:00:24PM -0700, William Lee Irwin III wrote:
> Consolidate __task_mem() and __task_mem_cheap() now that both have been
> made cheap, and also combine struct task_mem with struct task_mem_cheap.
> Also adjust various users of *_cheap to the new terminology so no trace
> of the *_cheap bits remains. Compiletested on ia64.
Repost with appropriate Subject: line.
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-09-08 18:11:24.826811093 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-09 12:00:44.649267323 -0700
@@ -32,17 +32,14 @@
u32 vmstack;
u32 vmexe;
u32 vmlib;
-};
-
-struct task_mem_cheap {
u32 vmsize;
u32 vmlock;
u32 vmrss;
};
/*
- * __task_mem/__task_mem_cheap basically duplicate the MMU version of
- * task_mem, but they are split by cost and work on structs.
+ * __task_mem() basically duplicates() the MMU and nommu versions of
+ * task_mem() from fs/proc/task_mmu.c and fs/proc/task_nommu.c
*/
#ifdef CONFIG_MMU
static void __task_mem(struct task_struct *tsk, struct task_mem *res)
@@ -57,22 +54,10 @@
res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
- mmput(mm);
- }
-}
-
-static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
-{
- struct mm_struct *mm = get_task_mm(tsk);
- if (mm) {
res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
res->vmrss = mm->rss << (PAGE_SHIFT-10);
mmput(mm);
- } else {
- res->vmsize = 0;
- res->vmlock = 0;
- res->vmrss = 0;
}
}
#else /* !CONFIG_MMU */
@@ -86,9 +71,16 @@
unsigned long bytes = 0, sbytes = 0, slack = 0;
struct mm_tblk_struct *tblk;
+ stats->vmrss += kobjsize(mm);
down_read(&mm->mmap_sem);
for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
- if (!tblk->rblock)
+ if (tblk->next)
+ stats->vmrss += kobjsize(tblk->next);
+ if (tblk->rblock) {
+ stats->vmsize += kobjsize(tblk->rblock);
+ stats->vmrss += kobjsize(tblk->rblock);
+ stats->vmrss += kobjsize(tblk->rblock->kblock);
+ } else
continue;
bytes += kobjsize(tblk);
if (atomic_read(&mm->mm_count) > 1) ||
@@ -120,34 +112,12 @@
stats->vmdata = bytes;
stats->vmstack = sbytes;
stats->vmexe = stats->vmlib = 0;
+ stats->vmrss += mm->end_code - mm->start_code;
+ stats->vmrss += mm->start_stack - mm->start_data;
+ stats->vmrss >>= 10;
+ stats->vmsize >>= 10;
}
}
-
-static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
-{
- struct mm_struct *mm = get_task_mm(task);
- struct mm_tblock_struct *tblk;
- int size;
-
- memset(stats, 0, sizeof(struct task_mem_cheap));
- stats->vmrss += kobjsize(mm);
- down_read(&mm->mmap_sem);
- for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
- if (tblk->next)
- stats->vmrss += kobjsize(tblk->next);
- if (tblk->rblock) {
- stats->vmsize += kobjsize(tblk->rblock);
- stats->vmrss += kobjsize(tblk->rblock);
- stats->vmrss += kobjsize(tblk->rblock->kblock);
- }
- }
- stats->vmrss += mm->end_code - mm->start_code;
- stats->vmrss += mm->start_stack - mm->start_data;
- up_read(&mm->mmap_sem);
- mmput(mm);
- stats->vmrss >>= 10;
- stats->vmsize >>= 10;
-}
#endif /* !CONFIG_MMU */
/*
@@ -223,10 +193,9 @@
static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
{
struct task_mem tsk_mem;
- struct task_mem_cheap tsk_mem_cheap;
tsk_mem.vmdata = (~0);
- tsk_mem_cheap.vmsize = (~0);
+ tsk_mem.vmsize = (~0);
switch (id) {
case NPROC_PID:
@@ -238,20 +207,20 @@
case NPROC_VMSIZE:
case NPROC_VMLOCK:
case NPROC_VMRSS:
- if (tsk_mem_cheap.vmsize == (~0))
- __task_mem_cheap(tsk, &tsk_mem_cheap);
+ if (tsk_mem.vmsize == (~0))
+ __task_mem(tsk, &tsk_mem);
switch (id) {
case NPROC_VMSIZE:
- mstore(tsk_mem_cheap.vmsize,
+ mstore(tsk_mem.vmsize,
NPROC_VMSIZE, buf);
break;
case NPROC_VMLOCK:
- mstore(tsk_mem_cheap.vmlock,
+ mstore(tsk_mem.vmlock,
NPROC_VMLOCK, buf);
break;
case NPROC_VMRSS:
- mstore(tsk_mem_cheap.vmrss,
+ mstore(tsk_mem.vmrss,
NPROC_VMRSS, buf);
break;
}
On Thu, Sep 09, 2004 at 11:49:33AM -0700, William Lee Irwin III wrote:
> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> patch atop the others I sent.
Consolidate __task_mem() and __task_mem_cheap() now that both have been
made cheap, and also combine struct task_mem with struct task_mem_cheap.
Also adjust various users of *_cheap to the new terminology so no trace
of the *_cheap bits remains. Compiletested on ia64.
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-09-08 18:11:24.826811093 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-09 12:00:44.649267323 -0700
@@ -32,17 +32,14 @@
u32 vmstack;
u32 vmexe;
u32 vmlib;
-};
-
-struct task_mem_cheap {
u32 vmsize;
u32 vmlock;
u32 vmrss;
};
/*
- * __task_mem/__task_mem_cheap basically duplicate the MMU version of
- * task_mem, but they are split by cost and work on structs.
+ * __task_mem() basically duplicates() the MMU and nommu versions of
+ * task_mem() from fs/proc/task_mmu.c and fs/proc/task_nommu.c
*/
#ifdef CONFIG_MMU
static void __task_mem(struct task_struct *tsk, struct task_mem *res)
@@ -57,22 +54,10 @@
res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
- mmput(mm);
- }
-}
-
-static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
-{
- struct mm_struct *mm = get_task_mm(tsk);
- if (mm) {
res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
res->vmrss = mm->rss << (PAGE_SHIFT-10);
mmput(mm);
- } else {
- res->vmsize = 0;
- res->vmlock = 0;
- res->vmrss = 0;
}
}
#else /* !CONFIG_MMU */
@@ -86,9 +71,16 @@
unsigned long bytes = 0, sbytes = 0, slack = 0;
struct mm_tblk_struct *tblk;
+ stats->vmrss += kobjsize(mm);
down_read(&mm->mmap_sem);
for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
- if (!tblk->rblock)
+ if (tblk->next)
+ stats->vmrss += kobjsize(tblk->next);
+ if (tblk->rblock) {
+ stats->vmsize += kobjsize(tblk->rblock);
+ stats->vmrss += kobjsize(tblk->rblock);
+ stats->vmrss += kobjsize(tblk->rblock->kblock);
+ } else
continue;
bytes += kobjsize(tblk);
if (atomic_read(&mm->mm_count) > 1) ||
@@ -120,34 +112,12 @@
stats->vmdata = bytes;
stats->vmstack = sbytes;
stats->vmexe = stats->vmlib = 0;
+ stats->vmrss += mm->end_code - mm->start_code;
+ stats->vmrss += mm->start_stack - mm->start_data;
+ stats->vmrss >>= 10;
+ stats->vmsize >>= 10;
}
}
-
-static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
-{
- struct mm_struct *mm = get_task_mm(task);
- struct mm_tblock_struct *tblk;
- int size;
-
- memset(stats, 0, sizeof(struct task_mem_cheap));
- stats->vmrss += kobjsize(mm);
- down_read(&mm->mmap_sem);
- for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
- if (tblk->next)
- stats->vmrss += kobjsize(tblk->next);
- if (tblk->rblock) {
- stats->vmsize += kobjsize(tblk->rblock);
- stats->vmrss += kobjsize(tblk->rblock);
- stats->vmrss += kobjsize(tblk->rblock->kblock);
- }
- }
- stats->vmrss += mm->end_code - mm->start_code;
- stats->vmrss += mm->start_stack - mm->start_data;
- up_read(&mm->mmap_sem);
- mmput(mm);
- stats->vmrss >>= 10;
- stats->vmsize >>= 10;
-}
#endif /* !CONFIG_MMU */
/*
@@ -223,10 +193,9 @@
static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
{
struct task_mem tsk_mem;
- struct task_mem_cheap tsk_mem_cheap;
tsk_mem.vmdata = (~0);
- tsk_mem_cheap.vmsize = (~0);
+ tsk_mem.vmsize = (~0);
switch (id) {
case NPROC_PID:
@@ -238,20 +207,20 @@
case NPROC_VMSIZE:
case NPROC_VMLOCK:
case NPROC_VMRSS:
- if (tsk_mem_cheap.vmsize == (~0))
- __task_mem_cheap(tsk, &tsk_mem_cheap);
+ if (tsk_mem.vmsize == (~0))
+ __task_mem(tsk, &tsk_mem);
switch (id) {
case NPROC_VMSIZE:
- mstore(tsk_mem_cheap.vmsize,
+ mstore(tsk_mem.vmsize,
NPROC_VMSIZE, buf);
break;
case NPROC_VMLOCK:
- mstore(tsk_mem_cheap.vmlock,
+ mstore(tsk_mem.vmlock,
NPROC_VMLOCK, buf);
break;
case NPROC_VMRSS:
- mstore(tsk_mem_cheap.vmrss,
+ mstore(tsk_mem.vmrss,
NPROC_VMRSS, buf);
break;
}
On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> patch atop the others I sent.
I have a few minor changes coming up as well.
One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
ifdef them out of the list of offered fields for that configuration (and
maybe in nproc_ps_field as well).
> On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> > There seems to be a discrepancy between current task_mem in
> > fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
> > for the nproc !CONFIG_MMU case. Can you explain?
>
> I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
> did, however, have to mangle the things via guesswork to avoid adding
> the new fields, which I really wanted you to arrange for or comment on
> as they are a matter of interface. Also, could you be more specific
> about these discrepancies?
task_nommu.c offers Mem, Slack, and Shared. __task_mem for !CONFIG_MMU
offers VmData, VmStack, VmRSS, VmSize.
Roger
On Thu, 09 Sep 2004 12:02:14 -0700, William Lee Irwin III wrote:
>> + stats->vmrss += mm->end_code - mm->start_code;
On Thu, Sep 09, 2004 at 09:07:57PM +0200, Roger Luethi wrote:
> s/vmrss/vmsize/ ?
This follows fs/proc/task_nommu.c:task_statm, which ->vmsize would not.
vmsize would be the sum of kobjsize(tblk->rblock->kblock) for each
tblock, which actually does need fixing in the above.
-- wli
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-09-09 12:00:44.649267323 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-09 12:18:01.876793680 -0700
@@ -77,7 +77,7 @@
if (tblk->next)
stats->vmrss += kobjsize(tblk->next);
if (tblk->rblock) {
- stats->vmsize += kobjsize(tblk->rblock);
+ stats->vmsize += kobjsize(tblk->rblock->kblock);
stats->vmrss += kobjsize(tblk->rblock);
stats->vmrss += kobjsize(tblk->rblock->kblock);
} else
On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
>> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
>> patch atop the others I sent.
On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> I have a few minor changes coming up as well.
I rest assured that nothing I've written thus far will apply to or be
included in any of it, as a matter of course (nothing specific to you).
On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> ifdef them out of the list of offered fields for that configuration (and
> maybe in nproc_ps_field as well).
This may be; I'll leave that decision to you as the interface designer.
On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
>> I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
>> did, however, have to mangle the things via guesswork to avoid adding
>> the new fields, which I really wanted you to arrange for or comment on
>> as they are a matter of interface. Also, could you be more specific
>> about these discrepancies?
On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> task_nommu.c offers Mem, Slack, and Shared. __task_mem for !CONFIG_MMU
> offers VmData, VmStack, VmRSS, VmSize.
I took the structure fields to be just an argument passing convention
giving the nommu case an identical prototype much like the helpers in
fs/proc/task_{no,}mmu.c. Using different field names and etc. is also
feasible, of course. I'll wait for your updates to follow up further.
-- wli
On Thu, 2004-09-09 at 13:53, Roger Luethi wrote:
> On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> > On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > > They aren't world readable when using a security module like SELinux;
> > > they are then typically only accessible by processes in the same
> > > security domain, aside from processes in privileged domains.
> > > security_task_to_inode() hook sets the security attributes on the
> > > /proc/pid inodes based on their security context, and then
> > > security_inode_permission() hook controls access to them. So you need
> > > at least comparable controls.
> >
> > Can you make a more specific suggestion regarding the controls to use?
> > It's a bit awkward for those highly unfamiliar with the subsystem to
>
> For the same reason, I'm not comfortable with implementing SELinux type
> access controls myself. How about:
>
> config NPROC
> depends on !SECURITY_SELINUX
>
> Adding access control later won't be a problem for anyone who groks
> SELinux.
Well, it isn't that easy, or at least I don't think it is. The problem
is that there is no way presently to convey the sender's security
credentials (beyond the existing uid, cap information), since the LSM
patches for adding security fields and hooks for managing skb security
fields were rejected. The best we can do at present is pass along the
sender pid, uid, and cap, and the security module can look up the pid if
it chooses to get the security field (but is naturally subject to races
in that situation).
Most obvious place to hook would be nproc_ps_get_task; we could then
perform a check based on the sender's credentials and the target task's
credentials, and simply return NULL if permission is not granted for
that pair, thus skipping that task as if it didn't exist. That requires
propagating the sender's credentials down to that function.
Untested patch below.
Index: linux-2.6/include/linux/security.h
===================================================================
RCS file: /nfshome/pal/CVS/linux-2.6/include/linux/security.h,v
retrieving revision 1.37
diff -u -p -r1.37 security.h
--- linux-2.6/include/linux/security.h 16 Jun 2004 14:49:42 -0000 1.37
+++ linux-2.6/include/linux/security.h 9 Sep 2004 19:38:23 -0000
@@ -632,6 +632,13 @@ struct swap_info_struct;
* security attributes, e.g. for /proc/pid inodes.
* @p contains the task_struct for the task.
* @inode contains the inode structure for the inode.
+ * @task_getstate:
+ * Check permission before getting the state of a task.
+ * @pid contains the pid of the requesting process.
+ * @p contains the task_struct for the target task.
+ * @uid contains the uid of the requesting process.
+ * @caps contains the capability set of the requesting process.
+ * Return 0 if permission is granted.
*
* Security hooks for Netlink messaging.
*
@@ -1153,6 +1160,7 @@ struct security_operations {
unsigned long arg5);
void (*task_reparent_to_init) (struct task_struct * p);
void (*task_to_inode)(struct task_struct *p, struct inode *inode);
+ int (*task_getstate)(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps);
int (*ipc_permission) (struct kern_ipc_perm * ipcp, short flag);
@@ -1756,6 +1764,11 @@ static inline void security_task_to_inod
security_ops->task_to_inode(p, inode);
}
+static inline int security_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps)
+{
+ return security_ops->task_getstate(pid, p, uid, caps);
+}
+
static inline int security_ipc_permission (struct kern_ipc_perm *ipcp,
short flag)
{
@@ -2389,6 +2402,11 @@ static inline void security_task_reparen
static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
{ }
+static inline int security_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps)
+{
+ return 0;
+}
+
static inline int security_ipc_permission (struct kern_ipc_perm *ipcp,
short flag)
{
Index: linux-2.6/security/dummy.c
===================================================================
RCS file: /nfshome/pal/CVS/linux-2.6/security/dummy.c,v
retrieving revision 1.34
diff -u -p -r1.34 dummy.c
--- linux-2.6/security/dummy.c 16 Jun 2004 14:49:42 -0000 1.34
+++ linux-2.6/security/dummy.c 9 Sep 2004 19:39:01 -0000
@@ -619,6 +619,12 @@ static void dummy_task_reparent_to_init
static void dummy_task_to_inode(struct task_struct *p, struct inode *inode)
{ }
+
+static int dummy_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps)
+{
+ return 0;
+}
+
static int dummy_ipc_permission (struct kern_ipc_perm *ipcp, short flag)
{
return 0;
@@ -979,6 +985,7 @@ void security_fixup_ops (struct security
set_to_dummy_if_null(ops, task_prctl);
set_to_dummy_if_null(ops, task_reparent_to_init);
set_to_dummy_if_null(ops, task_to_inode);
+ set_to_dummy_if_null(ops, task_getstate);
set_to_dummy_if_null(ops, ipc_permission);
set_to_dummy_if_null(ops, msg_msg_alloc_security);
set_to_dummy_if_null(ops, msg_msg_free_security);
--- linux-2.6/kernel/nproc.c.orig 2004-09-09 15:51:25.727833776 -0400
+++ linux-2.6/kernel/nproc.c 2004-09-09 15:30:19.171379624 -0400
@@ -296,7 +296,7 @@ out:
/*
* Find task for given pid, grab task lock (caller must unlock).
*/
-static task_t *nproc_ps_get_task(int pid)
+static task_t *nproc_ps_get_task(struct nlmsghdr *nlh, int pid, uid_t uid, kernel_cap_t caps)
{
task_t *tsk;
@@ -305,13 +305,17 @@ static task_t *nproc_ps_get_task(int pid
if (tsk)
get_task_struct(tsk);
read_unlock(&tasklist_lock);
+ if (tsk && security_task_getstate(nlh->nlmsg_pid, tsk, uid, caps)) {
+ put_task_struct(tsk);
+ return NULL;
+ }
return tsk;
}
/*
* Iterate over a list of PIDs.
*/
-static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata)
+static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata, uid_t uid, kernel_cap_t caps)
{
int i;
int err = 0;
@@ -335,7 +339,7 @@ static int nproc_ps_select_pid(struct nl
for (i = 0; i < tcnt; i++) {
task_t *tsk;
- tsk = nproc_ps_get_task(pids[i]);
+ tsk = nproc_ps_get_task(nlh, pids[i], uid, caps);
if (!tsk)
continue;
err = nproc_pid_msg(nlh, fdata, len, tsk);
@@ -357,7 +361,7 @@ err_inval:
/*
* Iterate over all PIDs.
*/
-static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len)
+static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len, uid_t uid, kernel_cap_t caps)
{
void *map;
int offset, i;
@@ -378,7 +382,7 @@ static int nproc_ps_select_all(struct nl
if (offset >= BITS_PER_PAGE)
break;
pid = offset + i * BITS_PER_PAGE;
- tsk = nproc_ps_get_task(pid);
+ tsk = nproc_ps_get_task(nlh, pid, uid, caps);
if (!tsk)
continue;
err = nproc_pid_msg(nlh, fdata, len, tsk);
@@ -467,7 +471,7 @@ err_inval:
* Call the chosen process selector. Adding additional selectors
* (e.g. select by uid) is easy, but is there a need?
*/
-static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid)
+static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid, kernel_cap_t caps)
{
int err;
u32 len;
@@ -490,11 +494,11 @@ static int nproc_get_ps(struct nlmsghdr
case NPROC_SELECT_ALL:
if (left)
pwarn("%d bytes left.\n", left);
- err = nproc_ps_select_all(nlh, data, len);
+ err = nproc_ps_select_all(nlh, data, len, uid, caps);
break;
case NPROC_SELECT_PID:
err = nproc_ps_select_pid(nlh, data, len,
- left, sdata + 1);
+ left, sdata + 1, uid, caps);
break;
default:
pwarn("Unknown selection method %#x.\n", *sdata);
@@ -787,7 +791,7 @@ static __inline__ int nproc_process_msg(
err = nproc_get_global(nlh);
break;
case NPROC_GET_PS:
- err = nproc_get_ps(nlh, uid);
+ err = nproc_get_ps(nlh, uid, caps);
break;
default:
pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
--
Stephen Smalley <[email protected]>
National Security Agency
On Thu, 09 Sep 2004 12:02:14 -0700, William Lee Irwin III wrote:
> + stats->vmrss += mm->end_code - mm->start_code;
s/vmrss/vmsize/ ?
* Roger Luethi ([email protected]) wrote:
> On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> > On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > > They aren't world readable when using a security module like SELinux;
> > > they are then typically only accessible by processes in the same
> > > security domain, aside from processes in privileged domains.
> > > security_task_to_inode() hook sets the security attributes on the
> > > /proc/pid inodes based on their security context, and then
> > > security_inode_permission() hook controls access to them. So you need
> > > at least comparable controls.
> >
> > Can you make a more specific suggestion regarding the controls to use?
> > It's a bit awkward for those highly unfamiliar with the subsystem to
>
> For the same reason, I'm not comfortable with implementing SELinux type
> access controls myself. How about:
>
> config NPROC
> depends on !SECURITY_SELINUX
>
It's not just SELinux, it's any security module (i.e. CONFIG_SECURITY for
starters).
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
* Stephen Smalley ([email protected]) wrote:
> Well, it isn't that easy, or at least I don't think it is. The problem
> is that there is no way presently to convey the sender's security
> credentials (beyond the existing uid, cap information), since the LSM
> patches for adding security fields and hooks for managing skb security
> fields were rejected. The best we can do at present is pass along the
> sender pid, uid, and cap, and the security module can look up the pid if
> it chooses to get the security field (but is naturally subject to races
> in that situation).
>
> Most obvious place to hook would be nproc_ps_get_task; we could then
> perform a check based on the sender's credentials and the target task's
> credentials, and simply return NULL if permission is not granted for
> that pair, thus skipping that task as if it didn't exist. That requires
> propagating the sender's credentials down to that function.
>
> Untested patch below.
>
> Index: linux-2.6/include/linux/security.h
> ===================================================================
> RCS file: /nfshome/pal/CVS/linux-2.6/include/linux/security.h,v
> retrieving revision 1.37
> diff -u -p -r1.37 security.h
> --- linux-2.6/include/linux/security.h 16 Jun 2004 14:49:42 -0000 1.37
> +++ linux-2.6/include/linux/security.h 9 Sep 2004 19:38:23 -0000
> @@ -632,6 +632,13 @@ struct swap_info_struct;
> * security attributes, e.g. for /proc/pid inodes.
> * @p contains the task_struct for the task.
> * @inode contains the inode structure for the inode.
> + * @task_getstate:
> + * Check permission before getting the state of a task.
> + * @pid contains the pid of the requesting process.
> + * @p contains the task_struct for the target task.
> + * @uid contains the uid of the requesting process.
> + * @caps contains the capability set of the requesting process.
> + * Return 0 if permission is granted.
Why caps?
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Thu, 09 Sep 2004 16:01:06 -0400, Stephen Smalley wrote:
> > For the same reason, I'm not comfortable with implementing SELinux type
> > access controls myself. How about:
> >
> > config NPROC
> > depends on !SECURITY_SELINUX
> >
> > Adding access control later won't be a problem for anyone who groks
> > SELinux.
>
[...]
> Most obvious place to hook would be nproc_ps_get_task; we could then
> perform a check based on the sender's credentials and the target task's
> credentials, and simply return NULL if permission is not granted for
> that pair, thus skipping that task as if it didn't exist. That requires
> propagating the sender's credentials down to that function.
>
> Untested patch below.
I used a somewhat different approach in my development tree (not
SELinuxy, though): Most fields were world readable, some required
credentials.
I don't have any strong feelings on access control, so I'd be happy
with any mechanism that doesn't completely botch performance. Anyway,
I do not consider lack of access controls to be a showstopper.
Roger
* Roger Luethi ([email protected]) wrote:
> On Thu, 09 Sep 2004 16:01:06 -0400, Stephen Smalley wrote:
> > > For the same reason, I'm not comfortable with implementing SELinux type
> > > access controls myself. How about:
> > >
> > > config NPROC
> > > depends on !SECURITY_SELINUX
> > >
> > > Adding access control later won't be a problem for anyone who groks
> > > SELinux.
> >
> [...]
> > Most obvious place to hook would be nproc_ps_get_task; we could then
> > perform a check based on the sender's credentials and the target task's
> > credentials, and simply return NULL if permission is not granted for
> > that pair, thus skipping that task as if it didn't exist. That requires
> > propagating the sender's credentials down to that function.
> >
> > Untested patch below.
>
> I used a somewhat different approach in my development tree (not
> SELinuxy, though): Most fields were world readable, some required
> credentials.
>
> I don't have any strong feelings on access control, so I'd be happy
> with any mechanism that doesn't completely botch performance. Anyway,
> I do not consider lack of access controls to be a showstopper.
Some of these things become quite sensitive, esp across setuid, etc.
For prototyping, I agree, not a showstopper. For merging, it should be
figured out properly.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Thu, 09 Sep 2004 12:23:13 -0700, William Lee Irwin III wrote:
> I took the structure fields to be just an argument passing convention
> giving the nommu case an identical prototype much like the helpers in
That seems rather confusing. We must special-case for !CONFIG_MMU
anyway because field IDs are tied to meaning, i.e. systems export
different sets of fields depending on this configuration setting. The
proc filesystem does the same, the difference is that a changing set
is easier to handle with nproc.
Roger
On Thu, 09 Sep 2004 22:55:31 +0200, Roger Luethi wrote:
> I used a somewhat different approach in my development tree (not
> SELinuxy, though): Most fields were world readable, some required
> credentials.
I forgot to mention that you can see the remnants of that approach in
<linux/nproc.h>: I used two bits of the field ID to define per-field
access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
Roger
On Thu, 2004-09-09 at 16:48, Chris Wright wrote:
> > + * @task_getstate:
> > + * Check permission before getting the state of a task.
> > + * @pid contains the pid of the requesting process.
> > + * @p contains the task_struct for the target task.
> > + * @uid contains the uid of the requesting process.
> > + * @caps contains the capability set of the requesting process.
> > + * Return 0 if permission is granted.
>
> Why caps?
It is readily available in the netlink skb parms, and someone might want
to use it, e.g. a security module might limit a requesting process to
only getting state of other processes with the same uid unless the
requesting process has some capability.
--
Stephen Smalley <[email protected]>
National Security Agency
On Thu, 09 Sep 2004 12:23:13 -0700, William Lee Irwin III wrote:
> feasible, of course. I'll wait for your updates to follow up further.
Incremental update below. It contains a reorganization of the field
IDs (something I expected to do based on feedback) and minor tweaks in
error handling.
I'll post a full patch once the MMU stuff is sorted out.
Roger
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/include/linux/nproc.h linux-2.6.9-rc1-mm4.02/include/linux/nproc.h
--- linux-2.6.9-rc1-mm4.01/include/linux/nproc.h 2004-09-10 17:19:34.018727960 +0200
+++ linux-2.6.9-rc1-mm4.02/include/linux/nproc.h 2004-09-10 14:43:13.000000000 +0200
@@ -49,35 +49,57 @@
#define NPROC_LABEL_FIELD_UNIT 0x00000003
#define NPROC_LABEL_WCHAN 0x00000004
-/* Field IDs (unique key in bits 0 - 15) */
-#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL)
-#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
-/* Amount of free memory (pages) */
-#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
-/* Size of a page (bytes) */
-#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* --------------------------------------------------------------------- misc */
/* There's no guarantee about anything with jiffies. Still useful for some. */
-#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
-/* Process: VM size (KiB) */
-#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-/* Process: locked memory (KiB) */
-#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-/* Process: Memory resident size (KiB) */
-#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_WCHAN (0x00000080 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
-#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING)
+#define NPROC_JIFFIES (0x00000001 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL (0x00000002 | NPROC_TYPE_UL)
+/* Size of a page */
+#define NPROC_PAGESIZE (0x00000003 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* --------------------------------------------------------- /proc/PID/status */
+#define NPROC_NAME (0x00000100 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+#define NPROC_STATE (0x00000101 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_STATE_NAME (0x00000102 | NPROC_TYPE_STRING)
+#define NPROC_SLEEP_TIME (0x00000103 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_TOTAL_TIME (0x00000104 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_PID (0x00000105 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_TGID (0x00000106 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_PPID (0x00000107 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_TRACER_PID (0x00000108 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_UID (0x00000109 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_EUID (0x00000110 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_SUID (0x00000111 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_FSUID (0x00000112 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_GID (0x00000113 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_EGID (0x00000114 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_SGID (0x00000115 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_FSGID (0x00000116 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: VM size */
+#define NPROC_VMSIZE (0x00000117 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: locked memory */
+#define NPROC_VMLOCK (0x00000118 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size */
+#define NPROC_VMRSS (0x00000119 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA (0x00000120 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK (0x00000121 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE (0x00000122 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB (0x00000123 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* ------------------------------------------------------------- /proc/vmstat */
+#define NPROC_NR_DIRTY (0x00000214 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK (0x00000215 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE (0x00000216 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS (0x00000217 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED (0x00000218 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB (0x00000219 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+/* ------------------------------------------------------------ /proc/meminfo */
+/* Amount of free memory */
+#define NPROC_MEMFREE (0x00000320 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* ---------------------------------------------------------- /proc/PID/wchan */
+#define NPROC_WCHAN (0x00000421 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME (0x00000422 | NPROC_TYPE_STRING)
+/* ----------------------------------------------------------- /proc/PID/stat */
+/* ---------------------------------------------------------- /proc/PID/statm */
+
#ifdef __KERNEL__
struct nproc_field {
@@ -88,11 +110,11 @@ struct nproc_field {
};
static struct nproc_field labels[] = {
- { NPROC_PID, "PID", "%5u", "" },
- { NPROC_NAME, "Name", "%-15s","" },
- { NPROC_MEMFREE, "MemFree", "%8u", "page" },
- { NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
{ NPROC_JIFFIES, "Jiffies", "%10u", "" },
+ { NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
+ { NPROC_NAME, "Name", "%-15s","" },
+ { NPROC_PID, "PID", "%5u", "" },
+ { NPROC_UID, "UID", "%5u", "" },
{ NPROC_VMSIZE, "VmSize", "%8u", "KiB" },
{ NPROC_VMLOCK, "VmLock", "%8u", "KiB" },
{ NPROC_VMRSS, "VmRSS", "%8u", "KiB" },
@@ -100,16 +122,16 @@ static struct nproc_field labels[] = {
{ NPROC_VMSTACK, "VmStack", "%8u", "KiB" },
{ NPROC_VMEXE, "VmExe", "%8u", "KiB" },
{ NPROC_VMLIB, "VmLib", "%8u", "KiB" },
- { NPROC_UID, "UID", "%5u", "" },
{ NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" },
{ NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" },
{ NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" },
{ NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" },
{ NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" },
{ NPROC_NR_SLAB, "nr_slab", "%8u", "page" },
+ { NPROC_MEMFREE, "MemFree", "%8u", "page" },
{ NPROC_WCHAN, "wchan", "%p", "" },
#ifdef CONFIG_KALLSYMS
- { NPROC_WCHAN_NAME, "wchan_symbol", "%s"},
+ { NPROC_WCHAN_NAME, "wchan_symbol", "%s", ""},
#endif
};
#endif /* __KERNEL__ */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/kernel/nproc.c linux-2.6.9-rc1-mm4.02/kernel/nproc.c
--- linux-2.6.9-rc1-mm4.01/kernel/nproc.c 2004-09-10 17:19:34.034725528 +0200
+++ linux-2.6.9-rc1-mm4.02/kernel/nproc.c 2004-09-10 12:04:28.000000000 +0200
@@ -17,12 +17,11 @@
/* There must be like 5 million dprintk definitions, so let's add some more */
#ifdef DEBUG
#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
-#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
#else
#define pdebug(x,args...)
-#define pwarn(x,args...)
#endif
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
static struct sock *nproc_sock = NULL;
@@ -129,18 +128,18 @@ static struct sk_buff *nproc_alloc_nlmsg
struct sk_buff *skb2 = 0;
skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
- if (!skb2) {
- skb2 = ERR_PTR(-ENOMEM);
- goto out;
- }
+ if (!skb2)
+ goto err_out;
NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
-out:
- return skb2;
+ goto out;
nlmsg_failure: /* Used by NLMSG_PUT */
kfree_skb(skb2);
- return NULL;
+err_out:
+ skb2 = ERR_PTR(-ENOMEM);
+out:
+ return skb2;
}
#define mstore(value, id, buf) \
@@ -634,18 +633,17 @@ static int find_id(__u32 *data, __u32 *l
pwarn("%d bytes left.\n", *left);
id = data[1];
- for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
- ; /* Do nothing */
-
- if (labels[i].id != id) {
- pwarn("No matching label found for %#x.\n", id);
- goto err_inval;
+ for (i = 0; i < ARRAY_SIZE(labels); i++) {
+ if (labels[i].id == id)
+ goto out;
}
- return i;
+ pwarn("No matching label found for %#x.\n", id);
err_inval:
return -EINVAL;
+out:
+ return i;
}
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/init/Kconfig linux-2.6.9-rc1-mm4.02/init/Kconfig
--- linux-2.6.9-rc1-mm4.01/init/Kconfig 2004-09-10 17:19:34.040724616 +0200
+++ linux-2.6.9-rc1-mm4.02/init/Kconfig 2004-09-10 00:32:36.000000000 +0200
@@ -141,10 +141,11 @@ config SYSCTL
config NPROC
bool "Netlink interface to /proc information"
- depends on PROC_FS && EXPERIMENTAL
+ depends on EXPERIMENTAL && !CONFIG_SECURITY
default y
help
- Nproc is a netlink interface to /proc information.
+ Nproc is a netlink interface to /proc information. Its benefits
+ are clean semantics and high performance.
config AUDIT
bool "Auditing support"
On Thu, 2004-09-09 at 15:11, Roger Luethi wrote:
> On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
> > I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> > patch atop the others I sent.
>
> I have a few minor changes coming up as well.
>
> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> ifdef them out of the list of offered fields for that configuration (and
> maybe in nproc_ps_field as well).
No. First of all, I think they can be offered. Until proven
otherwise, I'll assume that the !CONFIG_MMU case is buggy.
Second of all, removal will make the !CONFIG_MMU systems
less compatible with the rest of the world. This will
mean that fewer apps can run on !CONFIG_MMU boxes. It's
same problem as "All the world's a VAX". It's better that
the apps work; an author working on a Pentium 4 Xeon is
likely to write code that relies on the fields and might
not really understand what "no MMU" is all about.
On Thu, 2004-09-09 at 17:25, Roger Luethi wrote:
> On Thu, 09 Sep 2004 22:55:31 +0200, Roger Luethi wrote:
> > I used a somewhat different approach in my development tree (not
> > SELinuxy, though): Most fields were world readable, some required
> > credentials.
>
> I forgot to mention that you can see the remnants of that approach in
> <linux/nproc.h>: I used two bits of the field ID to define per-field
> access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
Besides the low-security and high-security choices,
I'd like to see a medium-security choice.
low: everybody sees everything
medium: everybody sees something; privileged user sees all
high: must be privileged
This might mean that asking for stuff like EIP and WCHAN
causes you to see fewer processes.
If partial info is returned for a process, I'd like to
also get a bitmap of valid fields. Special "not valid"
values are a pain to deal with.
On Thu, 2004-09-09 at 15:11, Roger Luethi wrote:
>> I have a few minor changes coming up as well.
>> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
>> ifdef them out of the list of offered fields for that configuration (and
>> maybe in nproc_ps_field as well).
On Sat, Sep 11, 2004 at 06:25:56PM -0400, Albert Cahalan wrote:
> No. First of all, I think they can be offered. Until proven
> otherwise, I'll assume that the !CONFIG_MMU case is buggy.
> Second of all, removal will make the !CONFIG_MMU systems
> less compatible with the rest of the world. This will
> mean that fewer apps can run on !CONFIG_MMU boxes. It's
> same problem as "All the world's a VAX". It's better that
> the apps work; an author working on a Pentium 4 Xeon is
> likely to write code that relies on the fields and might
> not really understand what "no MMU" is all about.
Would the nommu bits I wrote be satisfactory for you?
-- wli
On Thu, 2004-09-09 at 17:25, Roger Luethi wrote:
>> I forgot to mention that you can see the remnants of that approach in
>> <linux/nproc.h>: I used two bits of the field ID to define per-field
>> access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
On Sat, Sep 11, 2004 at 06:36:53PM -0400, Albert Cahalan wrote:
> Besides the low-security and high-security choices,
> I'd like to see a medium-security choice.
> low: everybody sees everything
> medium: everybody sees something; privileged user sees all
> high: must be privileged
> This might mean that asking for stuff like EIP and WCHAN
> causes you to see fewer processes.
> If partial info is returned for a process, I'd like to
> also get a bitmap of valid fields. Special "not valid"
> values are a pain to deal with.
That's an interesting observation. Perhaps the union of the mmu and
nommu fields should be nominally reported alongside a bitmap of the
useful fields?
-- wli
On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
> > One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> > ifdef them out of the list of offered fields for that configuration (and
> > maybe in nproc_ps_field as well).
>
> No. First of all, I think they can be offered. Until proven
> otherwise, I'll assume that the !CONFIG_MMU case is buggy.
I agree with you that those specific fields should be offered for
!CONFIG_MMU. However, if for some reason they cannot carry a value
that fits the field description, they should not be offered at all. The
ambiguity of having 0 mean either "0" or "this field is not available"
is bad. Trying to read a specific field _can_ fail, and applications
had better handle that case (it's still trivial compared to having to
parse different /proc file layouts depending on the configuration).
> mean that fewer apps can run on !CONFIG_MMU boxes. It's
> same problem as "All the world's a VAX". It's better that
> the apps work; an author working on a Pentium 4 Xeon is
> likely to write code that relies on the fields and might
> not really understand what "no MMU" is all about.
The presumed wrong assumptions underlying broken tools of the future
are not a good base for designing a new interface. My interest is in
making it easy to write correct applications (or in fixing broken apps
that won't work, say, on !CONFIG_MMU systems).
Roger
On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
>> No. First of all, I think they can be offered. Until proven
>> otherwise, I'll assume that the !CONFIG_MMU case is buggy.
On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> I agree with you that those specific fields should be offered for
> !CONFIG_MMU. However, if for some reason they cannot carry a value
> that fits the field description, they should not be offered at all. The
> ambiguity of having 0 mean either "0" or "this field is not available"
> is bad. Trying to read a specific field _can_ fail, and applications
> had better handle that case (it's still trivial compared to having to
> parse different /proc file layouts depending on the configuration).
Apart from doing something it's supposed to for !CONFIG_MMU and using
the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
not very concerned about this. I have a vague notion there should
probably be some consistency with the /proc/ precedent but am not
particularly tied to it. We should probably ask Greg Ungerer (the
maintainer of the external MMU-less patches) about what he prefers
since it's likely we can't anticipate all of the !CONFIG_MMU concerns.
On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
>> mean that fewer apps can run on !CONFIG_MMU boxes. It's
>> same problem as "All the world's a VAX". It's better that
>> the apps work; an author working on a Pentium 4 Xeon is
>> likely to write code that relies on the fields and might
>> not really understand what "no MMU" is all about.
On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> The presumed wrong assumptions underlying broken tools of the future
> are not a good base for designing a new interface. My interest is in
> making it easy to write correct applications (or in fixing broken apps
> that won't work, say, on !CONFIG_MMU systems).
I don't really know what the approach to app compatibility used by
userspace for !CONFIG_MMU is; I'll refer you to Greg Ungerer as my
knowledge of the CONFIG_MMU usage models and/or whatever userspace
is used in tandem with it outside the VM's internals is rather scant.
-- wli
Greg, could you comment on this since there are some people having
trouble figuring out what's going on with VM-related /proc/ fields for
!CONFIG_MMU. Please forgive the top-posting, it made more sense to
quote the text below in this instance.
On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
>> I agree with you that those specific fields should be offered for
>> !CONFIG_MMU. However, if for some reason they cannot carry a value
>> that fits the field description, they should not be offered at all. The
>> ambiguity of having 0 mean either "0" or "this field is not available"
>> is bad. Trying to read a specific field _can_ fail, and applications
>> had better handle that case (it's still trivial compared to having to
>> parse different /proc file layouts depending on the configuration).
On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote:
> Apart from doing something it's supposed to for !CONFIG_MMU and using
> the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
> not very concerned about this. I have a vague notion there should
> probably be some consistency with the /proc/ precedent but am not
> particularly tied to it. We should probably ask Greg Ungerer (the
> maintainer of the external MMU-less patches) about what he prefers
> since it's likely we can't anticipate all of the !CONFIG_MMU concerns.
On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
>> The presumed wrong assumptions underlying broken tools of the future
>> are not a good base for designing a new interface. My interest is in
>> making it easy to write correct applications (or in fixing broken apps
>> that won't work, say, on !CONFIG_MMU systems).
On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote:
> I don't really know what the approach to app compatibility used by
> userspace for !CONFIG_MMU is; I'll refer you to Greg Ungerer as my
> knowledge of the CONFIG_MMU usage models and/or whatever userspace
> is used in tandem with it outside the VM's internals is rather scant.
On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote:
> > I forgot to mention that you can see the remnants of that approach in
> > <linux/nproc.h>: I used two bits of the field ID to define per-field
> > access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
>
> Besides the low-security and high-security choices,
> I'd like to see a medium-security choice.
>
> low: everybody sees everything
> medium: everybody sees something; privileged user sees all
> high: must be privileged
>
> This might mean that asking for stuff like EIP and WCHAN
> causes you to see fewer processes.
I'm not sure I understand you correctly, but the combination of
NPROC_PERM_USER and NPROC_PERM_ROOT already seems to fit your
description:
- If the access control bits for a field are cleared, any process/user
can get that field information for any process.
- If the access control bits are set to NPROC_PERM_USER, only root and
the owner of a process can read the field for that process.
- For NPROC_PERM_ROOT, only root can ever read such a field.
I picked that design because it captures the essence of what /proc
does today.
> If partial info is returned for a process, I'd like to
> also get a bitmap of valid fields. Special "not valid"
> values are a pain to deal with.
If an app asks for a field it has no or partial permission for, the set
of processes returned is trimmed accordingly. Since an application will
expect this behavior based on the access control bits, no guessing is
involved here.
If an app asks for a non-existant field (not supported on this
architecture or obsolete), it will get an error back. No guessing
involved here, either. We could report the bad field ID back, but it's
easy for user-space to figure out and it's not in the fast path (for
user space).
The tricky case is if an app asks for an offered field without permission
problems, but the field is not available in that particular context. The
only instance of this that comes to mind are mm_struct related fields
and kernel threads. Neither returning an error nor skipping affected
processes seems a good solution. In this special case, the current
nproc code returns 0, but that's probably not optimal. Currently,
my preferred solution would be to return ~(0).
I'm not convinced yet that making message formats more complex (adding
bitmaps or lists of applicaple fields or something) for one special
case is a better idea.
Roger
On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote:
>> This might mean that asking for stuff like EIP and WCHAN
>> causes you to see fewer processes.
On Tue, Sep 14, 2004 at 08:44:03AM +0200, Roger Luethi wrote:
> I'm not sure I understand you correctly, but the combination of
> NPROC_PERM_USER and NPROC_PERM_ROOT already seems to fit your
> description:
> - If the access control bits for a field are cleared, any process/user
> can get that field information for any process.
> - If the access control bits are set to NPROC_PERM_USER, only root and
> the owner of a process can read the field for that process.
> - For NPROC_PERM_ROOT, only root can ever read such a field.
> I picked that design because it captures the essence of what /proc
> does today.
The concern appears to be that the tools might interpret failed
permission checks as indications of process nonexistence. I don't
regard this as particularly pressing, as properly-written apps should
check the specific value of errno (in particular to retry when EAGAIN
is received in numerous contexts).
On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote:
>> If partial info is returned for a process, I'd like to
>> also get a bitmap of valid fields. Special "not valid"
>> values are a pain to deal with.
On Tue, Sep 14, 2004 at 08:44:03AM +0200, Roger Luethi wrote:
> If an app asks for a field it has no or partial permission for, the set
> of processes returned is trimmed accordingly. Since an application will
> expect this behavior based on the access control bits, no guessing is
> involved here.
> If an app asks for a non-existant field (not supported on this
> architecture or obsolete), it will get an error back. No guessing
> involved here, either. We could report the bad field ID back, but it's
> easy for user-space to figure out and it's not in the fast path (for
> user space).
> The tricky case is if an app asks for an offered field without permission
> problems, but the field is not available in that particular context. The
> only instance of this that comes to mind are mm_struct related fields
> and kernel threads. Neither returning an error nor skipping affected
> processes seems a good solution. In this special case, the current
> nproc code returns 0, but that's probably not optimal. Currently,
> my preferred solution would be to return ~(0).
> I'm not convinced yet that making message formats more complex (adding
> bitmaps or lists of applicaple fields or something) for one special
> case is a better idea.
Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
done if the fields are measured in units such that the top bit is never
set for any feasible value, then a fully qualified error return could
simply be returned as (unsigned long)(-err). I suspect VSZ may be
problematic wrt. overflows even for 32-bit, not just for 31-bit.
-- wli
Hi William, Roger,
William Lee Irwin III wrote:
> Greg, could you comment on this since there are some people having
> trouble figuring out what's going on with VM-related /proc/ fields for
> !CONFIG_MMU. Please forgive the top-posting, it made more sense to
> quote the text below in this instance.
Yeah, the !CONFIG_MMU code behind this is probably a little stale.
The thinking has mostly been to keep things as much the same as
possible, even if the fields didn't have a sensible meaning in
non-mmu space.
> On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
>
>>>I agree with you that those specific fields should be offered for
>>>!CONFIG_MMU. However, if for some reason they cannot carry a value
>>>that fits the field description, they should not be offered at all. The
>>>ambiguity of having 0 mean either "0" or "this field is not available"
>>>is bad. Trying to read a specific field _can_ fail, and applications
>>>had better handle that case (it's still trivial compared to having to
>>>parse different /proc file layouts depending on the configuration).
In at least one case this is true now, as you mention for the
VmXxx fields. But looking at these now I think we could actually
implement most of them in a sensible way for the no-mmu case.
Size, Exe, Lib, Stk, etc all apply with their conventional
meanings.
> On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote:
>
>>Apart from doing something it's supposed to for !CONFIG_MMU and using
>>the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
>>not very concerned about this. I have a vague notion there should
>>probably be some consistency with the /proc/ precedent but am not
>>particularly tied to it. We should probably ask Greg Ungerer (the
>>maintainer of the external MMU-less patches) about what he prefers
>>since it's likely we can't anticipate all of the !CONFIG_MMU concerns.
>
>
> On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
>
>>>The presumed wrong assumptions underlying broken tools of the future
>>>are not a good base for designing a new interface. My interest is in
>>>making it easy to write correct applications (or in fixing broken apps
>>>that won't work, say, on !CONFIG_MMU systems).
Reality for non-mmu targets is that most apps just won't be fixed
for them, so we try real hard to make the world look like it is
just like any other linux architecture.
I think !CONFIG_MMU case can be cleaned up to make it almost identical
to the CONFIG_MMU case, and reporting sensible values for just about
all fields.
Regards
Greg
------------------------------------------------------------------------
Greg Ungerer -- Chief Software Dude EMAIL: [email protected]
SnapGear -- a CyberGuard Company PHONE: +61 7 3435 2888
825 Stanley St, FAX: +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com
On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
> > - If the access control bits for a field are cleared, any process/user
> > can get that field information for any process.
> > - If the access control bits are set to NPROC_PERM_USER, only root and
> > the owner of a process can read the field for that process.
> > - For NPROC_PERM_ROOT, only root can ever read such a field.
> > I picked that design because it captures the essence of what /proc
> > does today.
>
> The concern appears to be that the tools might interpret failed
> permission checks as indications of process nonexistence. I don't
> regard this as particularly pressing, as properly-written apps should
> check the specific value of errno (in particular to retry when EAGAIN
> is received in numerous contexts).
I would expect a tool to refrain from asking for fields with restricted
access if it needs a complete overview over existing processes. It can
always ask for restricted fields in a second request (the vast majority
of fields are world-readable anyway).
> > processes seems a good solution. In this special case, the current
> > nproc code returns 0, but that's probably not optimal. Currently,
> > my preferred solution would be to return ~(0).
> > I'm not convinced yet that making message formats more complex (adding
> > bitmaps or lists of applicaple fields or something) for one special
> > case is a better idea.
>
> Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
> done if the fields are measured in units such that the top bit is never
> set for any feasible value, then a fully qualified error return could
> simply be returned as (unsigned long)(-err). I suspect VSZ may be
> problematic wrt. overflows even for 32-bit, not just for 31-bit.
Yeah, that makes me nervous. There are just too many ways this can go
wrong or be misinterpreted in user space. Currently, nproc does not
indicate the type of error at all, because a properly written user-space
app will either not hit an error or be able to figure out what the
problem was based on the available information. I suppose if we wanted
to change that (which doesn't sound unreasonable), the proper way would
be to return error flags with an error message (delivered via netlink).
Roger
On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
>> The concern appears to be that the tools might interpret failed
>> permission checks as indications of process nonexistence. I don't
>> regard this as particularly pressing, as properly-written apps should
>> check the specific value of errno (in particular to retry when EAGAIN
>> is received in numerous contexts).
On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote:
> I would expect a tool to refrain from asking for fields with restricted
> access if it needs a complete overview over existing processes. It can
> always ask for restricted fields in a second request (the vast majority
> of fields are world-readable anyway).
That expectation can't be entirely relied upon, as the restrictions may
not be predictable.
On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
>> Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
>> done if the fields are measured in units such that the top bit is never
>> set for any feasible value, then a fully qualified error return could
>> simply be returned as (unsigned long)(-err). I suspect VSZ may be
>> problematic wrt. overflows even for 32-bit, not just for 31-bit.
On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote:
> Yeah, that makes me nervous. There are just too many ways this can go
> wrong or be misinterpreted in user space. Currently, nproc does not
> indicate the type of error at all, because a properly written user-space
> app will either not hit an error or be able to figure out what the
> problem was based on the available information. I suppose if we wanted
> to change that (which doesn't sound unreasonable), the proper way would
> be to return error flags with an error message (delivered via netlink).
This kind of error reporting is better still, as the fields then won't
be polluted with invalid data under any circumstance (assuming the code
can report subsets of the fields or some such, which I presume to be
the case given that avoiding reporting potentially computationally
expensive fields was one of the original motivators of the patch).
-- wli
On Tue, 14 Sep 2004 17:47:52 +1000, Greg Ungerer wrote:
> Yeah, the !CONFIG_MMU code behind this is probably a little stale.
> The thinking has mostly been to keep things as much the same as
> possible, even if the fields didn't have a sensible meaning in
> non-mmu space.
With nproc, tool authors won't need to write any special-casing code
for non-MMU. All they need to handle is the possibility that a field
they ask for does not exist. (Of course it doesn't hurt if they know
how to deal with non-MMU specific fields if any exist)
> >On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> >
> >>>I agree with you that those specific fields should be offered for
> >>>!CONFIG_MMU. However, if for some reason they cannot carry a value
> >>>that fits the field description, they should not be offered at all. The
> >>>ambiguity of having 0 mean either "0" or "this field is not available"
> >>>is bad. Trying to read a specific field _can_ fail, and applications
> >>>had better handle that case (it's still trivial compared to having to
> >>>parse different /proc file layouts depending on the configuration).
>
> In at least one case this is true now, as you mention for the
> VmXxx fields. But looking at these now I think we could actually
> implement most of them in a sensible way for the no-mmu case.
> Size, Exe, Lib, Stk, etc all apply with their conventional
> meanings.
It seems we all agree on that.
What I'd object to is offering fields like Size, Exe, etc. and filling
them with values that are wrong (e.g. returning always 0 for Exe). In
such a case, the field is simply not offered and asking for it an
error.
That's not a problem we can solve for tool authors: Allowing them to
distinguish between N/A and 0 is a property of the interface, and using
that interface means knowing how to deal with that distinction.
Roger
On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
> >> The concern appears to be that the tools might interpret failed
> >> permission checks as indications of process nonexistence. I don't
> >> regard this as particularly pressing, as properly-written apps should
> >> check the specific value of errno (in particular to retry when EAGAIN
> >> is received in numerous contexts).
>
> On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote:
> > I would expect a tool to refrain from asking for fields with restricted
> > access if it needs a complete overview over existing processes. It can
> > always ask for restricted fields in a second request (the vast majority
> > of fields are world-readable anyway).
>
> That expectation can't be entirely relied upon, as the restrictions may
> not be predictable.
They should be. For the simple design I described the access restrictions
are part of the field ID, so a tool can deduce the exact type of access
restrictions even if it doesn't know the field. There's plenty of space
left for additional access control flags in the field ID.
If it gets much more complex, the application (let alone the kernel)
has to have some knowledge of the security model anyway, so we could have
simple operations that allow a tool to discover how access restrictions
apply to the supported fields.
> > problem was based on the available information. I suppose if we wanted
> > to change that (which doesn't sound unreasonable), the proper way would
> > be to return error flags with an error message (delivered via netlink).
>
> This kind of error reporting is better still, as the fields then won't
> be polluted with invalid data under any circumstance (assuming the code
> can report subsets of the fields or some such, which I presume to be
> the case given that avoiding reporting potentially computationally
> expensive fields was one of the original motivators of the patch).
It cannot easily, and I don't think it wants to. The reason it's hard to
just reply with a subset is that the kernel does not send any description
of the reply content other than the serial number of the request --
it's up to the tool to know what it asked for. So if you remove a field,
you'd have to let user-space know which field you removed. Sending only
the allowed subset makes handling on both sides more complicated --
the kernel needs to build different kinds of messages in answer to one
request, and user-space tool need to be able to parse that.
The way the interface works now, though, is that a tool can rely on
the content of the reply to match the request. This makes the common
case both easy to write and fast.
Let me break it down once again:
- If a tool asks for a field the kernel doesn't know about, that's a
fatal error. An error message is returned, nothing else (this can be
discovered before any other reply is delivered).
- If a tool specifically asks for a process which doesn't exist,
nothing is returned. We could return an error indicating that. Might
be a good idea.
- If a tool asks for a field it doesn't have permission to read, it usally
does have permission to read that field for some tasks (e.g. same owner),
but not for others. So for some replies to one request, all requested
fields will contain meaningful values. What about the replies that
describe the tasks where the tool must not read at least some of the
requested values? I chose to simply skip those tasks.
We could also send an error message ("some tasks omitted") or send a
complete reply with the restricted fields zeroed and a special flag set
("some fields in this reply zeroed due to access control").
I'm really afraid of over-engineering something here, though. The fields
requested by tools like ps and top by default are all world readable
in /proc. I showed that solutions fit right in should we ever need
access control for real-world applications. For now, I'd rather not
extend the interface significantly unless the current semantics are
clearly insufficient.
Roger
On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
>> That expectation can't be entirely relied upon, as the restrictions may
>> not be predictable.
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> They should be. For the simple design I described the access restrictions
> are part of the field ID, so a tool can deduce the exact type of access
> restrictions even if it doesn't know the field. There's plenty of space
> left for additional access control flags in the field ID.
No, in general races of the form "permissions were altered after I
checked them" can happen.
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> If it gets much more complex, the application (let alone the kernel)
> has to have some knowledge of the security model anyway, so we could have
> simple operations that allow a tool to discover how access restrictions
> apply to the supported fields.
Checking that system calls succeeded is a minimum requirement at all
times. Misinterpreting error returns is the app's fault.
On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
>> This kind of error reporting is better still, as the fields then won't
>> be polluted with invalid data under any circumstance (assuming the code
>> can report subsets of the fields or some such, which I presume to be
>> the case given that avoiding reporting potentially computationally
>> expensive fields was one of the original motivators of the patch).
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> It cannot easily, and I don't think it wants to. The reason it's hard to
> just reply with a subset is that the kernel does not send any description
> of the reply content other than the serial number of the request --
> it's up to the tool to know what it asked for. So if you remove a field,
> you'd have to let user-space know which field you removed. Sending only
> the allowed subset makes handling on both sides more complicated --
> the kernel needs to build different kinds of messages in answer to one
> request, and user-space tool need to be able to parse that.
Irritating. That must mean you can't ask for specific fields.
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> The way the interface works now, though, is that a tool can rely on
> the content of the reply to match the request. This makes the common
> case both easy to write and fast.
> Let me break it down once again:
> - If a tool asks for a field the kernel doesn't know about, that's a
> fatal error. An error message is returned, nothing else (this can be
> discovered before any other reply is delivered).
If you can't ask for specific fields you're dead anyway.
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> - If a tool specifically asks for a process which doesn't exist,
> nothing is returned. We could return an error indicating that. Might
> be a good idea.
ESRCH and ENOENT sound good.
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> - If a tool asks for a field it doesn't have permission to read, it usally
> does have permission to read that field for some tasks (e.g. same owner),
> but not for others. So for some replies to one request, all requested
> fields will contain meaningful values. What about the replies that
> describe the tasks where the tool must not read at least some of the
> requested values? I chose to simply skip those tasks.
This is the bit about being dead already if you can't request subsets
of fields and/or one field at a time.
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> We could also send an error message ("some tasks omitted") or send a
> complete reply with the restricted fields zeroed and a special flag set
> ("some fields in this reply zeroed due to access control").
> I'm really afraid of over-engineering something here, though. The fields
> requested by tools like ps and top by default are all world readable
> in /proc. I showed that solutions fit right in should we ever need
> access control for real-world applications. For now, I'd rather not
> extend the interface significantly unless the current semantics are
> clearly insufficient.
Well, "return this set of fields" means there's only one type of
request necessary, and userspace merely iterates through the subsets
obtained by striking out fields to which accesses caused errors until
either the set is empty or the call succeeds. One field at a time at
all times also means there's only one type of request necessary. So I
don't see overengineering happening here, merely that "either all
succeed or all fail" is a semantic that creates hardships for userspace;
both the alternatives are simple.
-- wli
On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote:
> >> That expectation can't be entirely relied upon, as the restrictions may
> >> not be predictable.
>
> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > They should be. For the simple design I described the access restrictions
> > are part of the field ID, so a tool can deduce the exact type of access
> > restrictions even if it doesn't know the field. There's plenty of space
> > left for additional access control flags in the field ID.
>
> No, in general races of the form "permissions were altered after I
> checked them" can happen.
Can you make an example? Some scenario where this would be important?
> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > If it gets much more complex, the application (let alone the kernel)
> > has to have some knowledge of the security model anyway, so we could have
> > simple operations that allow a tool to discover how access restrictions
> > apply to the supported fields.
>
> Checking that system calls succeeded is a minimum requirement at all
> times. Misinterpreting error returns is the app's fault.
It's async. You can't rely on return values. They'd have to be in
netlink messages.
> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > It cannot easily, and I don't think it wants to. The reason it's hard to
> > just reply with a subset is that the kernel does not send any description
> > of the reply content other than the serial number of the request --
> > it's up to the tool to know what it asked for. So if you remove a field,
> > you'd have to let user-space know which field you removed. Sending only
> > the allowed subset makes handling on both sides more complicated --
> > the kernel needs to build different kinds of messages in answer to one
> > request, and user-space tool need to be able to parse that.
>
> Irritating. That must mean you can't ask for specific fields.
How so? For process fields, the request block is one u32 indicating the
number of field IDs to follow, then a bunch of u32 containing field IDs.
Any subset of field IDs, in any order of the tool's choosing.
The kernel replies with one message per process, each message containing
all the fields the tool requested, in the same order.
> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > We could also send an error message ("some tasks omitted") or send a
> > complete reply with the restricted fields zeroed and a special flag set
> > ("some fields in this reply zeroed due to access control").
> > I'm really afraid of over-engineering something here, though. The fields
> > requested by tools like ps and top by default are all world readable
> > in /proc. I showed that solutions fit right in should we ever need
> > access control for real-world applications. For now, I'd rather not
> > extend the interface significantly unless the current semantics are
> > clearly insufficient.
>
> Well, "return this set of fields" means there's only one type of
> request necessary, and userspace merely iterates through the subsets
> obtained by striking out fields to which accesses caused errors until
> either the set is empty or the call succeeds. One field at a time at
> all times also means there's only one type of request necessary. So I
One field at a time at all times is unnecessarily slow.
Roger
On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> No, in general races of the form "permissions were altered after I
>> checked them" can happen.
On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> Can you make an example? Some scenario where this would be important?
Not particularly. It largely means poorly-coded apps may report gibberish.
On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> Checking that system calls succeeded is a minimum requirement at all
>> times. Misinterpreting error returns is the app's fault.
On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> It's async. You can't rely on return values. They'd have to be in
> netlink messages.
That's fine. Do these error messages specify which field access(es)
caused the error?
On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> Irritating. That must mean you can't ask for specific fields.
On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> How so? For process fields, the request block is one u32 indicating the
> number of field IDs to follow, then a bunch of u32 containing field IDs.
> Any subset of field IDs, in any order of the tool's choosing.
> The kernel replies with one message per process, each message containing
> all the fields the tool requested, in the same order.
Then assuming the error messages indicate which field access(es) caused
the error(s), you're already done; userspace must merely retry the
request with the offending fields cast out. Otherwise, you're still
done: userspace can merely retry the field accesses one at a time
(though it's nicer to say which ones caused the errors).
On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
>> Well, "return this set of fields" means there's only one type of
>> request necessary, and userspace merely iterates through the subsets
>> obtained by striking out fields to which accesses caused errors until
>> either the set is empty or the call succeeds. One field at a time at
>> all times also means there's only one type of request necessary. So I
On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> One field at a time at all times is unnecessarily slow.
Yes, that was the "slower and stupider than thou" option. You've
already vectorized field access requests, of which I heartily approve.
-- wli
On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> >> No, in general races of the form "permissions were altered after I
> >> checked them" can happen.
>
> On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > Can you make an example? Some scenario where this would be important?
>
> Not particularly. It largely means poorly-coded apps may report gibberish.
If we are still talking about the same thing here, gibberish is a rather
strong word. In the design I proposed access control affects the subset
of tasks returned as a result -- the tool would still display meaningful
information for the tasks it got replies for.
Anyway, if the access restrictions are hard-coded into the field ID,
then it's only the credentials that can change, and I can't see a race
there at the moment.
> On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> >> Checking that system calls succeeded is a minimum requirement at all
> >> times. Misinterpreting error returns is the app's fault.
>
> On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > It's async. You can't rely on return values. They'd have to be in
> > netlink messages.
>
> That's fine. Do these error messages specify which field access(es)
> caused the error?
They don't, because the access control I had in my dev tree silently
skipped tasks containing fields the process had no permission to read.
IOW, access control works as an implicit task selector. And security
wise that's clean because the kernel does not reveal any information
about other processes to the querying task (not even evidence of their
existence).
> Then assuming the error messages indicate which field access(es) caused
> the error(s), you're already done; userspace must merely retry the
> request with the offending fields cast out. Otherwise, you're still
> done: userspace can merely retry the field accesses one at a time
> (though it's nicer to say which ones caused the errors).
Agreed on every point.
The question I am pondering is: Does nproc need access control right now?
It's more work in kernel and user space and adds new opportunities to
introduce bugs. The merits seem rather dubious right now, considering
that all the fields used by current process info tools (files
/proc/pid{cmdline, stat, statm, status, wchan}) are world readable.
So my preference is to wait with access control until we know where
and how it is necessary.
Roger
On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
>> Not particularly. It largely means poorly-coded apps may report gibberish.
On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> If we are still talking about the same thing here, gibberish is a rather
> strong word. In the design I proposed access control affects the subset
> of tasks returned as a result -- the tool would still display meaningful
> information for the tasks it got replies for.
That sounds bizarre. I'd expect some kind of reply, even if merely an
error. I suppose "no reply" could be interpreted as "ESRCH", though
this means distinguishing between "some field caused an error" and
"the thing is dead" means the app has to fall back to requesting fields
one at a time.
On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> Anyway, if the access restrictions are hard-coded into the field ID,
> then it's only the credentials that can change, and I can't see a race
> there at the moment.
The race is in the app, not the kernel, so there's nothing to fix in
the kernel apart from distinctions between ESRCH and EPERM in error
reporting (otherwise the app is helpless to resolve the ambiguity).
On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
>> That's fine. Do these error messages specify which field access(es)
>> caused the error?
On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> They don't, because the access control I had in my dev tree silently
> skipped tasks containing fields the process had no permission to read.
> IOW, access control works as an implicit task selector. And security
> wise that's clean because the kernel does not reveal any information
> about other processes to the querying task (not even evidence of their
> existence).
If all errors are handled with "no reply", userspace loses some
efficiency, as it's forced to retry field accesses one at a time and
wait for timeouts on each of them for a dead/inaccessible task.
On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
>> Then assuming the error messages indicate which field access(es) caused
>> the error(s), you're already done; userspace must merely retry the
>> request with the offending fields cast out. Otherwise, you're still
>> done: userspace can merely retry the field accesses one at a time
>> (though it's nicer to say which ones caused the errors).
On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> Agreed on every point.
> The question I am pondering is: Does nproc need access control right now?
> It's more work in kernel and user space and adds new opportunities to
> introduce bugs. The merits seem rather dubious right now, considering
> that all the fields used by current process info tools (files
> /proc/pid{cmdline, stat, statm, status, wchan}) are world readable.
> So my preference is to wait with access control until we know where
> and how it is necessary.
This I can't answer.
-- wli
On Tue, 14 Sep 2004 11:37:36 -0700, Chris Wright wrote:
> * William Lee Irwin III ([email protected]) wrote:
> > On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> > >> No, in general races of the form "permissions were altered after I
> > >> checked them" can happen.
> >
> > On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > > Can you make an example? Some scenario where this would be important?
> >
> > Not particularly. It largely means poorly-coded apps may report gibberish.
>
> Canonical example is access(2) followed by open(2), not really relevant
> in this case. However, exec setuid root app...when do you check, and
> when to you fill in data to send back to user? For /proc, this type of
> check happens often (see things like may_ptrace_attach and
> task_dumpable in fs/proc/base.c).
For nproc, the procedure looks like this: A tool send(2)s a request,
credentials are attached to skb. Based on said credentials, the kernel
is free to provide (netlink_unicast to originating socket) or withhold
information. In this regard, nproc works like other netlink interfaces.
Roger
On Tue, 14 Sep 2004 10:43:25 -0700, William Lee Irwin III wrote:
> On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote:
> >> Not particularly. It largely means poorly-coded apps may report gibberish.
>
> On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote:
> > If we are still talking about the same thing here, gibberish is a rather
> > strong word. In the design I proposed access control affects the subset
> > of tasks returned as a result -- the tool would still display meaningful
> > information for the tasks it got replies for.
>
> That sounds bizarre. I'd expect some kind of reply, even if merely an
> error. I suppose "no reply" could be interpreted as "ESRCH", though
> this means distinguishing between "some field caused an error" and
> "the thing is dead" means the app has to fall back to requesting fields
> one at a time.
I suppose you are thinking of a request that lists a number of PIDs along
with a number of field IDs. In that case yes, I agree that it makes sense
to provide some explicit feedback to the tool once we add access control
(before that, there is no ambiguity: a missing answer means ESRCH).
The most common request, though, won't provide a list of pids, it will
only provide a list of field IDs and select all processes in the system
(NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't
ask for any specific process to begin with, ESRCH doesn't make sense
here. And for a system that looks anything like /proc does today,
fields that are capable of triggering EPERM are few and far between,
certainly not something you are hitting unexpectedly in the fast path
of a process monitoring tool.
Thanks, by the way, for all the feedback that helped me realize that
I have so far failed to explain the design well enough. I will try to
work on that.
Roger
On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote:
> I suppose you are thinking of a request that lists a number of PIDs along
> with a number of field IDs. In that case yes, I agree that it makes sense
> to provide some explicit feedback to the tool once we add access control
> (before that, there is no ambiguity: a missing answer means ESRCH).
> The most common request, though, won't provide a list of pids, it will
> only provide a list of field IDs and select all processes in the system
> (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't
> ask for any specific process to begin with, ESRCH doesn't make sense
> here. And for a system that looks anything like /proc does today,
> fields that are capable of triggering EPERM are few and far between,
> certainly not something you are hitting unexpectedly in the fast path
> of a process monitoring tool.
Okay, so what kinds of errors are returned in this case, if any, or
(worst case) are the offending tasks completely silently dropped?
On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote:
> Thanks, by the way, for all the feedback that helped me realize that
> I have so far failed to explain the design well enough. I will try to
> work on that.
Thanks; while I could in principle expend more effort to understand the
netlink code, it's likely swifter to be given such commentary.
-- wli
* Roger Luethi ([email protected]) wrote:
> On Tue, 14 Sep 2004 11:37:36 -0700, Chris Wright wrote:
> > Canonical example is access(2) followed by open(2), not really relevant
> > in this case. However, exec setuid root app...when do you check, and
> > when to you fill in data to send back to user? For /proc, this type of
> > check happens often (see things like may_ptrace_attach and
> > task_dumpable in fs/proc/base.c).
>
> For nproc, the procedure looks like this: A tool send(2)s a request,
> credentials are attached to skb. Based on said credentials, the kernel
> is free to provide (netlink_unicast to originating socket) or withhold
> information. In this regard, nproc works like other netlink interfaces.
Understood. Question is, if the request is for data that's associated
with a task that is in the middle of an execve(setuid_root_app), does
the credential-check/skb-fill for response happen atomically w.r.t. said
execve? IOW, is it possible to pass credential check, then fill data
that's become sensitive since the check happened?
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
>> Okay, so what kinds of errors are returned in this case, if any, or
>> (worst case) are the offending tasks completely silently dropped?
On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> In published code: No access control whatsoever. In dev tree: Silently
> dropped. Possible: Any kind of error and additional information that
> makes sense (we have netlink messages as a transport, after all).
I'm not sure what to make of this.
On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> That said, I don't think dropping tasks silently is a "worst case"
> in this scenario. Whatever your error report is going to be, it will
> boil down to saying "some tasks that may or may not live by the time
> you read this have been skipped because some fields that you knew had
> access restrictions prevented providing the information in those cases,
> and I must be cautious about not revealing any sensitive information
> to you so sorry I can't be more helpful". What's a tool going to do
> with that? If it cares to get a complete snapshot, it can simply send
> two requests: One with and one without restricted fields.
> So the tool would, say, request PID/VmSize in the first message and
> environ in the second message. Since only the owner can read the
> environment, the second request would yield answers only for a subset
> of the total process table.
This sounds safe enough, though it's unclear how to predict what fields
may be restricted. I suppose one doesn't try and requests one field at
a time for all tasks in this model of interaction with userspace.
-- wli
On Tue, 14 Sep 2004 12:36:26 -0700, William Lee Irwin III wrote:
> On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> > In published code: No access control whatsoever. In dev tree: Silently
> > dropped. Possible: Any kind of error and additional information that
> > makes sense (we have netlink messages as a transport, after all).
>
> I'm not sure what to make of this.
I was just trying to say that anything is possible (there are no
limitations inherent to the design), but I prefer it the way it is now.
I don't feel strongly about it should something different turn out to
be the preferred method of tool authors.
> This sounds safe enough, though it's unclear how to predict what fields
> may be restricted. I suppose one doesn't try and requests one field at
Simple: The fact that a field is subject to access restrictions is part
of the field ID. You can check that nproc.h contains this:
/* Access control (unused) */
#define NPROC_PERM_MASK 0x00300000
#define NPROC_PERM_USER 0x00100000
#define NPROC_PERM_ROOT 0x00200000
So even if a tool were to discover a new, previously unknown field offered
by the kernel, it could immediately tell that access restrictions apply and
what type they are (in case you wonder, there's extra space in reserve to
cover additional types of restrictions, including some catch-all thing (say
NPROC_PERM_COMPLEX_WHICH_MEANS_YOU_HAD_BETTER_KNOW_WHAT_YOU'RE_DOING)). So
nproc can cover everything /proc does today and is ready to go way beyond
that -- should that ever be deemed a good thing.
Roger
On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
> On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote:
> > I suppose you are thinking of a request that lists a number of PIDs along
> > with a number of field IDs. In that case yes, I agree that it makes sense
> > to provide some explicit feedback to the tool once we add access control
> > (before that, there is no ambiguity: a missing answer means ESRCH).
> > The most common request, though, won't provide a list of pids, it will
> > only provide a list of field IDs and select all processes in the system
> > (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't
> > ask for any specific process to begin with, ESRCH doesn't make sense
> > here. And for a system that looks anything like /proc does today,
> > fields that are capable of triggering EPERM are few and far between,
> > certainly not something you are hitting unexpectedly in the fast path
> > of a process monitoring tool.
>
> Okay, so what kinds of errors are returned in this case, if any, or
> (worst case) are the offending tasks completely silently dropped?
In published code: No access control whatsoever. In dev tree: Silently
dropped. Possible: Any kind of error and additional information that
makes sense (we have netlink messages as a transport, after all).
That said, I don't think dropping tasks silently is a "worst case"
in this scenario. Whatever your error report is going to be, it will
boil down to saying "some tasks that may or may not live by the time
you read this have been skipped because some fields that you knew had
access restrictions prevented providing the information in those cases,
and I must be cautious about not revealing any sensitive information
to you so sorry I can't be more helpful". What's a tool going to do
with that? If it cares to get a complete snapshot, it can simply send
two requests: One with and one without restricted fields.
So the tool would, say, request PID/VmSize in the first message and
environ in the second message. Since only the owner can read the
environment, the second request would yield answers only for a subset
of the total process table.
Roger
On Tue, 14 Sep 2004 12:05:09 -0700, Chris Wright wrote:
> Understood. Question is, if the request is for data that's associated
> with a task that is in the middle of an execve(setuid_root_app), does
> the credential-check/skb-fill for response happen atomically w.r.t. said
> execve? IOW, is it possible to pass credential check, then fill data
> that's become sensitive since the check happened?
It shouldn't be once we implement access control. I don't pretend to know
what the best way is to prevent that. Checking several times just shrinks
the race window, so I suppose we'd have to lock the source data structures
down prior to checking credentials and copying data.
Roger
* William Lee Irwin III ([email protected]) wrote:
> On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote:
> >> No, in general races of the form "permissions were altered after I
> >> checked them" can happen.
>
> On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote:
> > Can you make an example? Some scenario where this would be important?
>
> Not particularly. It largely means poorly-coded apps may report gibberish.
Canonical example is access(2) followed by open(2), not really relevant
in this case. However, exec setuid root app...when do you check, and
when to you fill in data to send back to user? For /proc, this type of
check happens often (see things like may_ptrace_attach and
task_dumpable in fs/proc/base.c).
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
> Thanks; while I could in principle expend more effort to understand the
> netlink code, it's likely swifter to be given such commentary.
This message aims at showing how nproc works for user space. If you need
additional or a different kind of documentation, let me know.
Roger
Field ID
========
In order to extract a specific value from the proc filesystem, a tool
combines the file path and some method to determine the appropriate
offset into that file (depending on the file based on keyword,
white-space separated column, etc.). At this point, the tool applies
its knowledge of the specific field format to convert the string back
to what it stands for.
Nproc, on the other hand, uses field IDs to identify information.
Each field ID (32 bit) contains a number of sub fields:
bits
0-15 Content ID. For instance, 0x117 is the virtual memory size of
a process.
20-21 Access control ID. Type of access control restrictions that apply
to this field. Currently unused.
24-26 Data type ID. Defines the return type which is one of u32,
unsigned long, u64, or string.
28-30 Scope ID. Defines the scope for which a field is valid. Scope
can be process (e.g. VmSize) or global (e.g. MemFree).
The remaining bits are reserved for future use.
Some details on sub-fields:
Content ID (bits 0-15)
----------
Bits 8-15 are used to indicate the /proc file in which a field occurs and
0-7 to indicate the field within that file (where applicable). There's
no magic to that other than the fact that it makes easier for humans
to check nproc.h.
Content IDs are immutable and identical on all platforms. Thus,
the meaning of any content ID, once assigned, must never ever change!
Data type ID (24-26)
------------
It's no problem to define additional (even complex) data types should
the need arise. For numbers, the data type simply defines the size of
the container (32 bit, long, 64 bit).
For strings, the string itself is prepended with a u32 indicating the
length of the string.
Scope ID (28-30)
--------
The scope ID is just another piece of information for tools with
automatic field discovery (see example below).
Examples
========
A few examples of how the mechanisms are used:
Simple
------
A tool like vmstat(8) starts from a bunch of IDs for global fields it's
interested in. After opening the socket, it sends one NPROC_GET_GLOBAL
request containing said field IDs to the kernel. The kernel sends one
reply for vmstat to read: A va_list containing the result for each
requested field ID.
Unit conversion (if necessary) can typically be done in place. Format
string and buffer are directly passed to vprintf(3). Done.
Detecting obsolete fields
-------------------------
An NPROC_GET_FIELD_LIST request can be used at start-up to determine
the field IDs that are offered by the kernel. If an app requests an
obsolete field anyway (being optimistic is faster for the common case),
it will get an error message back and can determine the cause from there.
I don't expect this to happen more often than it has in the past
(disappearing fields suck), but it's a clean way to handle such an event.
Field autodiscovery
-------------------
A tool may be interested in printing all information available
about a set of processes it is monitoring. At start-up, it sends
NPROC_GET_FIELD_LIST and finds a new field it doesn't know about.
>From the field ID, the tool can deduce that the unknown field:
- is in process scope and thus interesting for its task. That's all it
takes to add the new field to the NPROC_GET_PS request sent to the kernel
(along with a list of monitored PIDs). If the reply for a PID is missing
from the result, the PID has died.
- needs 32bits to store the result
With three label calls on the new field ID, the app determines that
the kernel suggests "VmShared" as a label, "%8u" for formatting,
and that the unit is "KiB". (This may sound like bloat or overkill,
but all these strings are already available via /proc for many fields,
just in a processed form that makes it impractical to get the individual
elements back.) The tool appends the format string for the new field to
its own format string and can now proceed like the tool in the first,
trivial example.
Dealing with strings
--------------------
Most strings are really static labels (e.g. the label for a field ID
or the symbol name for wchan). In those cases, it's up to user-space
to ask for a label and cache the result as necessary. There are some
cases, though, where the label is transient. At least one of them,
the process name, is important enough to justify strings in regular
(as opposed to label) replies. Otherwise, the process and its name
may be gone by the time a tool gets around to ask for it based on
a PID it received. As there are no unique task identifiers, there
are races possible and correct caching is hard if not impossible.
But how can we still get a valid va_list back? A library function
in user space takes care of that. For a given list of field IDs, it
replaces every string type field with a NOP (reply size: unsigned long)
and appends the string type field ID to the end of the list:
u32 u32 u32
PID | NAME | VMSIZE
becomes
u32 u32 u32 u32
PID | NOP | VMSIZE | NAME
Now it's trivial to fix the replies:
u32 unsigned long u32 u32 string
1 | 0 | 1340 | 16 | init
^-- space used for this string
becomes
u32 unsigned long u32 u32 string
1 | <pointer to first string> | 1340 | 16 | init
Anticipating type changes
-------------------------
Some fields may grow in size (e.g. NPROC_PID may move from u32 to unsigned
long or u64). If a field is not available from the kernel, a smart tool can
check the list of field IDs for a field with with the same content ID but a
different data type and print that instead.
Here's another thing we haven't been able to do with /proc: Finding out
the relative cost of computing the elements we offer to user space.
I ran a test program against 2.6.9-rc2-bk1 + nproc to get:
Testing all process fields, best out of 10
FieldID CPU (s) Wall (s) Label
0x03000002 0.140000 0.202728 NOP
0x21000100 0.150000 0.210021 Name
0x22000105 0.120000 0.204886 PID
0x22000109 0.130000 0.205319 UID
0x22000117 0.140000 0.215275 VmSize
0x22000118 0.130000 0.214240 VmLock
0x22000119 0.120000 0.214870 VmRSS
0x22000120 0.160000 1.020574 VmData
0x22000121 0.140000 1.021185 VmStack
0x22000122 0.170000 1.021619 VmExe
0x22000123 0.170000 1.020045 VmLib
0x23000421 0.140000 0.220748 wchan
Ignore the absolute values (I requested each field individually for all
processes on my workstation, 1000 times). The cost of walking all vmas
for VmData & Co. is very visible.
Roger
On Wed, Sep 15, 2004 at 10:02:30PM +0200, Roger Luethi wrote:
> Here's another thing we haven't been able to do with /proc: Finding out
> the relative cost of computing the elements we offer to user space.
> I ran a test program against 2.6.9-rc2-bk1 + nproc to get:
> Testing all process fields, best out of 10
> FieldID CPU (s) Wall (s) Label
> 0x03000002 0.140000 0.202728 NOP
> 0x21000100 0.150000 0.210021 Name
> 0x22000105 0.120000 0.204886 PID
> 0x22000109 0.130000 0.205319 UID
> 0x22000117 0.140000 0.215275 VmSize
> 0x22000118 0.130000 0.214240 VmLock
> 0x22000119 0.120000 0.214870 VmRSS
> 0x22000120 0.160000 1.020574 VmData
> 0x22000121 0.140000 1.021185 VmStack
> 0x22000122 0.170000 1.021619 VmExe
> 0x22000123 0.170000 1.020045 VmLib
> 0x23000421 0.140000 0.220748 wchan
> Ignore the absolute values (I requested each field individually for all
> processes on my workstation, 1000 times). The cost of walking all vmas
> for VmData & Co. is very visible.
Try this again after applying my updates, which make it equivalent to the
algorithms used internally by fs/proc/task_mmu.c.
-- wli
On Wed, 15 Sep 2004 13:20:28 -0700, William Lee Irwin III wrote:
> Try this again after applying my updates, which make it equivalent to the
> algorithms used internally by fs/proc/task_mmu.c.
That doesn't sound very interesting. The results are predictable. The
point of my previous message was that we can easily identify expensive
fields.
Ah well, compiling patched kernel anyway.
Roger
On Wed, 15 Sep 2004 13:20:28 -0700, William Lee Irwin III wrote:
> > Ignore the absolute values (I requested each field individually for all
> > processes on my workstation, 1000 times). The cost of walking all vmas
> > for VmData & Co. is very visible.
>
> Try this again after applying my updates, which make it equivalent to the
> algorithms used internally by fs/proc/task_mmu.c.
Here you go:
Testing all process fields, best out of 10
FieldID CPU (s) Wall (s) Label
0x03000002 0.130000 0.208989 NOP
0x21000100 0.150000 0.222867 Name
0x22000105 0.140000 0.216126 PID
0x22000109 0.140000 0.218058 UID
0x22000117 0.140000 0.231467 VmSize
0x22000118 0.140000 0.227863 VmLock
0x22000119 0.140000 0.229867 VmRSS
0x22000120 0.140000 0.226822 VmData
0x22000121 0.140000 0.228589 VmStack
0x22000122 0.130000 0.229107 VmExe
0x22000123 0.140000 0.228584 VmLib
0x23000421 0.140000 0.230716 wchan
I have received some constructive criticism and suggestions, but I didn't
see any comments on the desirability of nproc in mainline. Initially meant
to be a proof-of-concept, nproc has become an interface that is much
cleaner and faster than procfs can ever hope to be (it takes some reading
of procps or libgtop code to appreciate the complexity that is /proc file
parsing today), and every change in /proc files widens the gap. I presented
source code, benchmarks, and design documentation to substantiate my
claims; I can post the user-space code somewhere if there's interest.
So I'm wondering if everybody's waiting for me to answer some important
question I overlooked, or if there is a general sentiment that this
project is not worth pursuing.
Roger
Roger Luethi writes:
> I have received some constructive criticism and suggestions,
> but I didn't see any comments on the desirability of nproc in
> mainline. Initially meant to be a proof-of-concept, nproc has
> become an interface that is much cleaner and faster than procfs
> can ever hope to be (it takes some reading of procps or libgtop
> code to appreciate the complexity that is /proc file parsing today),
You spotted the perfect hash lookup? :-)
> and every change in /proc files widens the gap. I presented
> source code, benchmarks, and design documentation to substantiate
> my claims; I can post the user-space code somewhere if there's
> interest.
>
> So I'm wondering if everybody's waiting for me to answer some
> important question I overlooked, or if there is a general
> sentiment that this project is not worth pursuing.
I'm very glad to see numerical proof that /proc is crap.
If nproc does nothing else, it's still been useful.
The funny varargs/vsprintf/whatever encoding is useless to me,
as are the labels.
The nicest think about netlink is, i think, that it might make
a practical interface for incremental update. As processes run
or get modified, monitoring apps might get notified. I did not
see mention of this being implemented, and I would take quite
some time to support it, so it's a long-term goal. (of course,
people can always submit procps patches to support this)
I doubt that it is good to break down the data into so many
different items. It seems sensible to break down the data by
locking requirements.
I could use an opaque per-process cookie for process identification.
This would protect from PID reuse, and might allow for faster
lookup. Perhaps it contains: PID, address of task_struct, and the
system-wide or per-cpu fork count from process creation.
Something like the stat() syscall would be pretty decent.
Well, whatever... In any case, I'd need to see some working code
for the libproc library. My net connection dies for hours at a
time, so don't expect speedy anything right now.
BTW, I have a 32-bit big-endian system with char being unsigned
by default. The varargs stuff is odd too.
On Fri, 17 Sep 2004 12:55:32 -0400, Albert Cahalan wrote:
> Roger Luethi writes:
> > I have received some constructive criticism and suggestions,
> > but I didn't see any comments on the desirability of nproc in
> > mainline. Initially meant to be a proof-of-concept, nproc has
> > become an interface that is much cleaner and faster than procfs
> > can ever hope to be (it takes some reading of procps or libgtop
> > code to appreciate the complexity that is /proc file parsing today),
>
> You spotted the perfect hash lookup? :-)
I never claimed nproc is perfect. Solutions with comparable performance
and simplicity are conceivable, but none of them will work anything
like procfs.
> The funny varargs/vsprintf/whatever encoding is useless to me,
Actually, that's just a by-product of the design. It is what you get when
you put all the fields back to back. The only addition I made kernel-side
to make this easy to exploit was the introduction of a NOP field.
> as are the labels.
Yup. The labels are not useful for the tools you maintain.
> The nicest think about netlink is, i think, that it might make
> a practical interface for incremental update. As processes run
> or get modified, monitoring apps might get notified. I did not
> see mention of this being implemented, and I would take quite
> some time to support it, so it's a long-term goal. (of course,
> people can always submit procps patches to support this)
Sounds like what wli and I have discussed as differential updates a few
weeks ago. I agree that would be nice, for now the goal was to suggest
something that's cleaner and faster than procfs. Extensions are easy
to add later.
> I doubt that it is good to break down the data into so many
> different items. It seems sensible to break down the data by
> locking requirements.
True if you consider a static set of fields that never changes. Problematic
otherwise, because as soon as you start grouping fields together, you need
an agreement between kernel and user-space on the contents of these groups.
With nproc, the kernel is free to group fields together for computation
(even the first release calculated all the fields that needed VMA walks
in one go).
> I could use an opaque per-process cookie for process identification.
> This would protect from PID reuse, and might allow for faster
> lookup. Perhaps it contains: PID, address of task_struct, and the
> system-wide or per-cpu fork count from process creation.
Agreed, that would be useful. And it would be easy to integrate with
nproc. Just add a field to return the cookie and a selector based on
cookies rather than PIDs.
> Something like the stat() syscall would be pretty decent.
You lost me there.
Roger
On Fri, 2004-09-17 at 13:51, Roger Luethi wrote:
> On Fri, 17 Sep 2004 12:55:32 -0400, Albert Cahalan wrote:
> > The nicest think about netlink is, i think, that it might make
> > a practical interface for incremental update. As processes run
> > or get modified, monitoring apps might get notified. I did not
> > see mention of this being implemented, and I would take quite
> > some time to support it, so it's a long-term goal. (of course,
> > people can always submit procps patches to support this)
>
> Sounds like what wli and I have discussed as differential updates
> a few weeks ago. I agree that would be nice, for now the goal was
> to suggest something that's cleaner and faster than procfs.
> Extensions are easy to add later.
To me, this looks like the killer feature. You could even
skip the regular process info. Simply return process identification
cookies that could be passed into a separate syscall to get
the information.
> > I doubt that it is good to break down the data into so many
> > different items. It seems sensible to break down the data by
> > locking requirements.
>
> True if you consider a static set of fields that never changes. Problematic
> otherwise, because as soon as you start grouping fields together, you need
> an agreement between kernel and user-space on the contents of these groups.
I suppose this is small potatoes compared to the overhead
of dealing with ASCII, but individual field handling would
be a bit slower.
For initial libproc support, I'd start by requesting info
in groups that match what /proc provides today.
> > I could use an opaque per-process cookie for process identification.
> > This would protect from PID reuse, and might allow for faster
> > lookup. Perhaps it contains: PID, address of task_struct, and the
> > system-wide or per-cpu fork count from process creation.
>
> Agreed, that would be useful. And it would be easy to integrate with
> nproc. Just add a field to return the cookie and a selector based on
> cookies rather than PIDs.
>
> > Something like the stat() syscall would be pretty decent.
>
> You lost me there.
The stat() call simply fills in a struct. Given a per-process
cookie (or a PID if you tolerate the race conditions), a syscall
similar to stat() could fill in a struct.
On Sat, 18 Sep 2004 08:40:12 -0400, Albert Cahalan wrote:
> To me, this looks like the killer feature. You could even
> skip the regular process info. Simply return process identification
> cookies that could be passed into a separate syscall to get
> the information.
Do you mean "return cookies for all existing processes"? Or "return
cookies for all processes created since X" (if so, what's X?) ?
> > True if you consider a static set of fields that never changes. Problematic
> > otherwise, because as soon as you start grouping fields together, you need
> > an agreement between kernel and user-space on the contents of these groups.
>
> I suppose this is small potatoes compared to the overhead
> of dealing with ASCII, but individual field handling would
> be a bit slower.
Correct.
> For initial libproc support, I'd start by requesting info
> in groups that match what /proc provides today.
Makes perfect sense. You can pre-assemble an array of field IDs, hand
them over to the kernel, and get the requested fields in the requested
order.
> The stat() call simply fills in a struct. Given a per-process
> cookie (or a PID if you tolerate the race conditions), a syscall
> similar to stat() could fill in a struct.
With nproc as-is you can send a request that matches your desired struct
and cast the result to a pointer to your struct.
An application can build its own cookie simply by always requesting a set
of fields that _together_ can be used to identify a process. I reckon that
PID + process creation timestamp would be a good combination (except that
the latter is not currently available). The creation of the complete reply
to a request is atomic per process, the race is gone. What is not possible
right now is selecting processes based on a cookie -- the only selectors
so far are "all of them" and "select by PID".
Roger
On Sun, 2004-09-19 at 06:39, Roger Luethi wrote:
> On Sat, 18 Sep 2004 08:40:12 -0400, Albert Cahalan wrote:
> > To me, this looks like the killer feature. You could even
> > skip the regular process info. Simply return process identification
> > cookies that could be passed into a separate syscall to get
> > the information.
>
> Do you mean "return cookies for all existing processes"? Or "return
> cookies for all processes created since X" (if so, what's X?) ?
First, queue cookies for all existing processes.
Then, as process data changes, queue cookies for
processes that need to be examined again. Suppress
queueing of cookies for processes that are already
in the queue so things don't get too backed up.
If memory usage exceeds some adjustable limit, then
switch to supplying all processes until the backlog
is gone.
I realize that the implementation may prove difficult.
> With nproc as-is you can send a request that matches your desired struct
> and cast the result to a pointer to your struct.
Either that's marketing, or I missed something. :-)
Can I force specific data sizes? Can I force a string to
be NUL-terminated or a NUL-padded fixed-length buffer?
Can I request padding bytes to be skipped over?
On Sun, 19 Sep 2004 08:29:57 -0400, Albert Cahalan wrote:
> > Do you mean "return cookies for all existing processes"? Or "return
> > cookies for all processes created since X" (if so, what's X?) ?
>
> First, queue cookies for all existing processes.
> Then, as process data changes, queue cookies for
> processes that need to be examined again. Suppress
> queueing of cookies for processes that are already
> in the queue so things don't get too backed up.
> If memory usage exceeds some adjustable limit, then
> switch to supplying all processes until the backlog
> is gone.
How is the kernel to know which changes of process data require
re-examination? In all likelihood, any tool is only going to be
interested in certain changes, not in others.
> I realize that the implementation may prove difficult.
It seems reasonable (and useful) to notify tools if new processes get
created. It is certainly possible to have additional events (like field
changes) trigger notifications, but this would probably become rather
intrusive and expensive.
> > With nproc as-is you can send a request that matches your desired struct
> > and cast the result to a pointer to your struct.
>
> Either that's marketing, or I missed something. :-)
>
> Can I force specific data sizes? Can I force a string to
> be NUL-terminated or a NUL-padded fixed-length buffer?
> Can I request padding bytes to be skipped over?
No, your data types have to match what the kernel offers. What I was
referring to was your request for "info in groups that match what /proc
provides today". What you _can_ do with nproc is, say, ask it to return
a pointer to something like this:
struct statm_extended {
__u32 pid; /*
__u32 namelen; * My simple cookie
char name[16]; */
__u32 resident; /*
__u32 shared; *
__u32 trs; * /proc/PID/statm content
__u32 lrs; *
__u32 drs; *
__u32 dt; */
};
Roger