2009-07-15 01:22:21

by Denys Vlasenko

Subject: [PATCH] add "VmUsers: N" to /proc/$PID/status

This was discussed some time ago: http://lkml.org/lkml/2007/8/27/53

This patch aims to improve memory usage info collection
from userspace. It addresses the problem that userspace
monitoring tools cannot determine when two (or more) processes
share a VM but are not threads.

In Linux, you can clone a process with CLONE_VM but without
CLONE_THREAD, and as a result it will get a new PID and its own,
visible /proc/PID entry.

This creates a problem: userspace tools will think that this is
just another, separate process. There is no way they can
figure out that /proc/PID1 and /proc/PID2
correspond to two processes which share a VM,
and if a tool sums memory usage over the whole of /proc/*,
it will count their memory usage twice.

It would be nice to know how many such CLONE_VM'ed processes
share a VM with a given /proc/PID. Then it would be possible to do
more accurate accounting of memory usage, say, by dividing
all memory usage numbers of this process by this number.

After this patch, CLONE_VM'ed processes have a new line,
"VmUsers:", in /proc/$PID/status:
...
VmUsers: 2
Threads: 1
...

The value is obtained simply by atomic_read(&mm->mm_users).

One concern is that the counter may be larger
than the real value if another CPU did get_task_mm() on the VM
while we were generating /proc/$PID/status. Better ideas?


Test program is below:

#define _GNU_SOURCE /* for clone() prototype in <sched.h> */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
/* Defeat glibc "pid caching" */
#define GETPID() ((int)syscall(SYS_getpid))
#define GETTID() ((int)syscall(SYS_gettid))
char stack[8*1024];
int f(void *arg) {
	printf("child %d (%d)\n", GETPID(), GETTID());
	sleep(1000);
	_exit(0);
}
int main() {
	int n;
	memset(malloc(1234*1024), 1, 1234*1024);
	printf("parent %d (%d)\n", GETPID(), GETTID());
	/* Create a process with shared VM, but not a thread */
	n = clone(f, stack + sizeof(stack)/2, CLONE_VM, 0);
	printf("clone returned %d\n", n);
	sleep(1000);
	_exit(0);
}

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


--- linux-2.6.31-rc2/fs/proc/task_mmu.c Wed Jun 10 05:05:27 2009
+++ linux-2.6.31-rc2.VmUsers/fs/proc/task_mmu.c Wed Jul 15 02:54:45 2009
@@ -18,6 +18,7 @@
{
unsigned long data, text, lib;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+ unsigned num_vmusers;

/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -36,6 +37,7 @@
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
+ num_vmusers = atomic_read(&mm->mm_users) - 1;
seq_printf(m,
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
@@ -46,7 +48,8 @@
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
- "VmPTE:\t%8lu kB\n",
+ "VmPTE:\t%8lu kB\n"
+ "VmUsers:\t%u\n",
hiwater_vm << (PAGE_SHIFT-10),
(total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
@@ -54,7 +57,8 @@
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
- (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
+ (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10,
+ num_vmusers);
}

unsigned long task_vsize(struct mm_struct *mm)
--- linux-2.6.31-rc2/fs/proc/task_nommu.c Wed Jun 10 05:05:27 2009
+++ linux-2.6.31-rc2.VmUsers/fs/proc/task_nommu.c Wed Jul 15 02:54:39 2009
@@ -20,7 +20,8 @@
struct vm_region *region;
struct rb_node *p;
unsigned long bytes = 0, sbytes = 0, slack = 0, size;
-
+ unsigned num_vmusers;
+
down_read(&mm->mmap_sem);
for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
vma = rb_entry(p, struct vm_area_struct, vm_rb);
@@ -67,11 +68,14 @@

bytes += kobjsize(current); /* includes kernel stack */

+ num_vmusers = atomic_read(&mm->mm_users) - 1;
+
seq_printf(m,
"Mem:\t%8lu bytes\n"
"Slack:\t%8lu bytes\n"
- "Shared:\t%8lu bytes\n",
- bytes, slack, sbytes);
+ "Shared:\t%8lu bytes\n"
+ "VmUsers:\t%u\n",
+ bytes, slack, sbytes, num_vmusers);

up_read(&mm->mmap_sem);
}


2009-07-16 19:46:42

by Valdis Klētnieks

Subject: Re: [PATCH] add "VmUsers: N" to /proc/$PID/status

On Wed, 15 Jul 2009 03:22:18 +0200, Denys Vlasenko said:

> It can be nice to know how many such CLONE_VM'ed processes
> share VM with given /proc/PID. Then it would be possible to do
> more accurate accounting of memory usage. Say, by dividing
> all memory usage numbers of this process by this number.

Process A clones a process A1. Process B clones a process B1. Now
all 4 of them have 'VmUsers: 2' on them, but there's no clean way to tell
whether A1 or B1 is the one sharing with A, or with B.

The patch is probably sufficient if all you want is some N to divide by, but
not if you care *which* processes are sharing how much.



2009-07-16 21:27:26

by Denys Vlasenko

Subject: Re: [PATCH] add "VmUsers: N" to /proc/$PID/status

On Thursday 16 July 2009 21:46, [email protected] wrote:
> On Wed, 15 Jul 2009 03:22:18 +0200, Denys Vlasenko said:
>
> > It can be nice to know how many such CLONE_VM'ed processes
> > share VM with given /proc/PID. Then it would be possible to do
> > more accurate accounting of memory usage. Say, by dividing
> > all memory usage numbers of this process by this number.
>
> Process A clones a process A1. Process B clones a process B1. Now
> all 4 of them have 'VmUsers: 2' on them, but there's no clean way to tell
> whether A1 or B1 is the one sharing with A, or with B.
>
> The patch is probably sufficient if all you want is some N to divide by, but
> not if you care *which* processes are sharing how much.

You are right. There is more: truly accurate accounting
needs to be per page, like /proc/$PID/smaps
and /proc/$PID/pagemap. (However, I am not sure you can
realize that two processes share a VM by looking
at these files either.)

I do not aim to solve _that_ problem with my patch.

I, indeed, want to have just an N I can divide RSS/VSZ/etc by,
to get, say, a top display which does not mislead the user
into thinking that he has 3 processes with 100-megabyte RSS
when in reality he has 3 processes sharing a single VM
with 100 MB RSS.

This will still not be completely accurate due to per-page
sharing and such, but it will be more accurate
than what we have now.
--
vda

2009-07-16 23:24:09

by Valdis Klētnieks

Subject: Re: [PATCH] add "VmUsers: N" to /proc/$PID/status

On Thu, 16 Jul 2009 23:27:25 +0200, Denys Vlasenko said:

> You are right. There is more: the truly accurate accounting
> needs to be per page. Like /proc/$PID/smaps
> and /proc/$PID/pagemap. (However, I am not sure you can
> realize that two processes share a VM by looking
> at these files either)

Hmm.. what if /proc/$PID/<something> reported pairs of virtual/real page
addresses? Then you could identify shared pages by virtue of them having
the same real page frame address..

Of course, with 2G of memory in my laptop, that's 256K 4K pages, and it gets
even messier on machines with macho-RAM installed and lots of processes
running. Lotta pain to get more accurate numbers on what is a very race-prone
number.

> I, indeed, want to have just an N I can divide RSS/VSZ/etc by,
> to get, say, top display which do not mislead user
> into thinking that he has 3 processes with 100 megabyte RSS
> when in reality he has 3 processes sharing a single VM
> with 100 meg RSS.

Thinking about it a bit more - it's probably *usually* possible to sort out
which processes are sharing because if you have 2 sets of shared memory,
they'll *usually* have different RSS values - so the 2 processes with a
count of 2 and an RSS of 179M are one set, and the 2 processes with a count
of 2 and an RSS of 198M are probably another.

Certainly not good enough for chargeback accounting, but good enough if you're
trying to debug a WTF? situation or other "what's going on?" question...



2009-07-16 23:44:28

by Denys Vlasenko

Subject: Re: [PATCH] add "VmUsers: N" to /proc/$PID/status

On Friday 17 July 2009 01:24, [email protected] wrote:
> > I, indeed, want to have just an N I can divide RSS/VSZ/etc by,
> > to get, say, top display which do not mislead user
> > into thinking that he has 3 processes with 100 megabyte RSS
> > when in reality he has 3 processes sharing a single VM
> > with 100 meg RSS.
>
> Thinking about it a bit more - it's probably *usually* possible to sort out
> which processes are sharing because if you have 2 sets of shared memory,
> they'll *usually* have different RSS values - so the 2 processes with a
> count of 2 and an RSS of 179M are one set, and the 2 processes with a count
> of 2 and an RSS of 198M are probably another.

Hmm. We can just expose the raw value of the task->mm pointer.
For all tasks which share a VM, it will have the same value.

It would be an "information leak", yes, but it isn't obvious
whether it can be exploited at all. We can also obscure it
a bit by XORing or summing it with a randomly selected constant
or some such.
--
vda