2005-02-11 22:32:57

by Richard F. Rebel

Subject: /proc/*/statm, exactly what does "shared" mean?


Hello,

I can't seem to find clear documentation about the 'share' column
from /proc/<pid>/statm.

Does this include pages that are shared with forked children marked as
copy-on-write?

Does this only reflect libraries that are dynamically loaded? What
about shared memory segments/mmaps (a la shmat or mmap)?

If there is a place where I might find documentation that is more clear
beyond the proc.txt in the kernel docs and then man pages for procfs,
I'd welcome a pointer.

Thanks,

--
Richard F. Rebel

cat /dev/null > `tty`


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2005-02-12 13:08:54

by Hugh Dickins

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

On Fri, 11 Feb 2005, Richard F. Rebel wrote:
>
> I can't seem to find clear documentation about the 'share' column
> from /proc/<pid>/statm.
>
> Does this include pages that are shared with forked children marked as
> copy-on-write?
>
> Does this only reflect libraries that are dynamically loaded? What
> about shared memory segments/mmaps (a la shmat or mmap)?
>
> If there is a place where I might find documentation that is more clear
> beyond the proc.txt in the kernel docs and then man pages for procfs,
> I'd welcome a pointer.

You may not be entirely happy with this answer.
It is a count of "pages of the process" which are "shared" in some sense.
But precisely what that means has changed from time to time: depending on
our perception of what we can safely afford the overhead of counting.

You may want to look at fs/proc proc_pid_statm() source for the release
of interest, and follow that back to see exactly what is being counted.

Throughout 2.4 (and 2.2 too I think) it was the count of those pages
instantiated in the process address space which currently have a page
count greater than 1. That would include private pages shared with
forked children, pages from the pagecache (including pages mapped
from executable or library or shared memory or file mmap), those
private pages which currently have swap allocated (so they're also
in the swapcache), and any pages which transitorily have page count
raised for whatever reason (they'd likely already be in one of the
above categories). A ragbag of meanings, but that's all you can
get from looking at page count.

Counting up that not very meaningful number, at frequent intervals
on large process address spaces, is a waste of valuable time.

From 2.5.37 to 2.6.9, it's the total extent of file-backed areas
(file including executable or library or shared memory) in the
process address space: a total extent (in pagesize units),
not a count of instantiated pages. Much quicker to calculate.

But there were complaints about that, and a need to revert from
total extent to count of instantiated pages.

From 2.6.10 onwards, for the foreseeable future, it is the count
of those pages instantiated in the process address space which are
shared with a file (including executable or library or shared memory)
i.e. those pages which are not anonymous, not private. That count
does not include private pages shared with forked children, nor
does it include private pages which happen to have swap allocated.

Hugh

2005-02-12 14:39:30

by Richard F. Rebel

Subject: Re: /proc/*/statm, exactly what does "shared" mean?


Hugh,

Thank you immensely for answering this so clearly. Although it leaves
me at a short end, I am grateful I no longer have to bang my head in
this direction any further. I have wasted a HUGE amount of time in
recent months trying to figure out why different kernel versions caused
my application's memory footprint to change so much.

That said, many mod_perl users are *VERY* interested in being able to
detect and observe how "shared" our forked children are. Shared meaning
private pages shared with children (copy on write). Is it even possible
to do this in 2.6 kernels? If so, any pointers would be very helpful.

Thanks again,

Richard F. Rebel

On Sat, 2005-02-12 at 13:06 +0000, Hugh Dickins wrote:
> On Fri, 11 Feb 2005, Richard F. Rebel wrote:
> >
> > I can't seem to find clear documentation about the 'share' column
> > from /proc/<pid>/statm.
> >
> > Does this include pages that are shared with forked children marked as
> > copy-on-write?
> >
> > Does this only reflect libraries that are dynamically loaded? What
> > about shared memory segments/mmaps (a la shmat or mmap)?
> >
> > If there is a place where I might find documentation that is more clear
> > beyond the proc.txt in the kernel docs and then man pages for procfs,
> > I'd welcome a pointer.
>
> You may not be entirely happy with this answer.
> It is a count of "pages of the process" which are "shared" in some sense.
> But precisely what that means has changed from time to time: depending on
> our perception of what we can safely afford the overhead of counting.
>
> You may want to look at fs/proc proc_pid_statm() source for the release
> of interest, and follow that back to see exactly what is being counted.
>
> Throughout 2.4 (and 2.2 too I think) it was the count of those pages
> instantiated in the process address space which currently have a page
> count greater than 1. That would include private pages shared with
> forked children, pages from the pagecache (including pages mapped
> from executable or library or shared memory or file mmap), those
> private pages which currently have swap allocated (so they're also
> in the swapcache), and any pages which transitorily have page count
> raised for whatever reason (they'd likely already be in one of the
> above categories). A ragbag of meanings, but that's all you can
> get from looking at page count.
>
> Counting up that not very meaningful number, at frequent intervals
> on large process address spaces, is a waste of valuable time.
>
> From 2.5.37 to 2.6.9, it's the total extent of file-backed areas
> (file including executable or library or shared memory) in the
> process address space: a total extent (in pagesize units),
> not a count of instantiated pages. Much quicker to calculate.
>
> But there were complaints about that, and a need to revert from
> total extent to count of instantiated pages.
>
> From 2.6.10 onwards, for the foreseeable future, it is the count
> of those pages instantiated in the process address space which are
> shared with a file (including executable or library or shared memory)
> i.e. those pages which are not anonymous, not private. That count
> does not include private pages shared with forked children, nor
> does it include private pages which happen to have swap allocated.
>
> Hugh
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Richard F. Rebel <[email protected]>
WhenU.com



2005-02-12 15:43:53

by Hugh Dickins

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

On Sat, 12 Feb 2005, Richard F. Rebel wrote:
>
> That said, many mod_perl users are *VERY* interested in being able to
> detect and observe how "shared" our forked children are. Shared meaning
> private pages shared with children (copy on write). Is it even possible
> to do this in 2.6 kernels? If so, any pointers would be very helpful.

Not in any of the vanilla kernels.

Mauricio has a /proc/<pid>/smaps patch, in which he returns to looking
at every pte slot of every vma of the process as /proc/<pid>/statm did
in 2.4. I suggest you ask him offline for his latest version (the last
I saw did not include support for 2.6.11's pud level; and looped in an
inefficient way, repeatedly locating, mapping and unmapping the page
table for each pte slot - needs refactoring into pgd_range, pud_range,
pmd_range, pte_range levels like 2.4's statm).

It wouldn't be hard to take his framework (before or after improvements),
or the 2.4 proc_pid_statm framework, and modify that to give you the
information you're looking for. But I doubt that the resulting patch
would be accepted into a vanilla kernel - there's no end to the kinds
of numbers that different people might want to see.

I wonder how important swapping is in your case. If it's an
exceptional case that you can ignore, then a "private shared"
entry would be identified by code something like:

	if (pte_present(pte)) {
		pfn = pte_pfn(pte);
		if (pfn_valid(pfn)) {
			page = pfn_to_page(pfn);
			if (PageAnon(page) && page_mapcount(page) > 1)
				private_shared++;
		}
	}

But if swap use is significant, then it gets rather more complicated:
a present page, even when its mapcount is 1, might still be shared with
other processes, from which the pte has been unmapped; and where
the pte is not present, there may be a swap entry shared with others.
It can be worked out, but it's awkward and more than a few lines of code.

Hugh

2005-02-16 10:41:37

by Mauricio Lin

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

Hi all,

Sorry for responding to this email so late. I was busy with my trip.

On Sat, 12 Feb 2005 15:42:15 +0000 (GMT), Hugh Dickins <[email protected]> wrote:
> On Sat, 12 Feb 2005, Richard F. Rebel wrote:
> >
> > That said, many mod_perl users are *VERY* interested in being able to
> > detect and observe how "shared" our forked children are. Shared meaning
> > private pages shared with children (copy on write). Is it even possible
> > to do this in 2.6 kernels? If so, any pointers would be very helpful.
>
> Not in any of the vanilla kernels.
>
> Mauricio has a /proc/<pid>/smaps patch, in which he returns to looking
> at every pte slot of every vma of the process as /proc/<pid>/statm did
> in 2.4. I suggest you ask him offline for his latest version (the last
> I saw did not include support for 2.6.11's pud level;
I added the pud level in the last patch I sent to the linux-kernel
list, as suggested by Marcelo Tosatti.

> and looped in an
> inefficient way, repeatedly locating, mapping and unmapping the page
> table for each pte slot - needs refactoring into pgd_range, pud_range,
> pmd_range, pte_range levels like 2.4's statm).
Well, for each vma I check how many pages are mapped, to compute the
rss. So I have to check, for each page, whether it is allocated in
physical memory. I know that this is a heavy function, but do you have
any suggestion to improve it? What do you mean by "needs refactoring
into pgd_range, pud_range, pmd_range, pte_range levels like 2.4's
statm"? Could you give more details, please?

BR,

Mauricio Lin.

2005-02-16 12:01:51

by Hugh Dickins

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

On Wed, 16 Feb 2005, Mauricio Lin wrote:
> Well, for each vma it is checked how many pages are mapped to rss. So
> I have to check per page if it is allocated in physical memory. I know
> that this is a heavy function, but do you have any suggestion to
> improve this? What do you mean "needs refactoring into pgd_range,
> pud_range, pmd_range, pte_range levels like 2.4's statm"? Could you
> give more details, please?

Just look at, say, linux-2.4.29/fs/proc/array.c proc_pid_statm:
which calls statm_pgd_range which calls statm_pmd_range which
calls statm_pte_range which scans along the array of ptes doing
the pte examination you're doing. There are plenty of examples
in 2.6.11-rc mm/memory.c of how to do it with pud level too.

Whereas your way starts at the top and descends the tree each time
for every leaf, repeatedly mapping and unmapping the page table if
that pagetable is in highmem. You took follow_page as your starting
point, which is good for a single pte, but inefficient for many.

Your function(s) will still be heavyweight, but somewhat faster.

Hugh

2005-02-16 14:52:49

by Benjamin LaHaise

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

On Sat, Feb 12, 2005 at 09:39:20AM -0500, Richard F. Rebel wrote:
> That said, many mod_perl users are *VERY* interested in being able to
> detect and observe how "shared" our forked children are. Shared meaning
> private pages shared with children (copy on write). Is it even possible
> to do this in 2.6 kernels? If so, any pointers would be very helpful.

One thing Hugh didn't mention is the background as to why the shared
statistic was changed: it comes back to the fact that it was a very
expensive statistic to calculate. People running top on systems with
lots of virtual memory in use (ie lots of processes, applications with
shared memory segments) were seeing ridiculous cpu usage (100% for seconds
at a time) by top. As a result, the statistics available from the statm
file were changed to counters making the read of statm an O(1) operation.
This dropped top's cpu usage on a busy system to a much more reasonable
<1%, making it possible to get an idea what a busy system is actually
busy with.

-ben

2005-02-16 15:02:40

by Mauricio Lin

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

Hi Hugh,

Thanks for your suggestion. I did not know that kernel 2.4.29 had
changed the statm implementation. As I can see, the statm
implementation is different between 2.4 and 2.6.

Let me see if I can use the 2.4.29 statm idea to improve the smaps for
kernel 2.6.11-rc.

BR,

Mauricio Lin.

On Wed, 16 Feb 2005 12:00:55 +0000 (GMT), Hugh Dickins <[email protected]> wrote:
> On Wed, 16 Feb 2005, Mauricio Lin wrote:
> > Well, for each vma it is checked how many pages are mapped to rss. So
> > I have to check per page if it is allocated in physical memory. I know
> > that this is a heavy function, but do you have any suggestion to
> > improve this? What do you mean "needs refactoring into pgd_range,
> > pud_range, pmd_range, pte_range levels like 2.4's statm"? Could you
> > give more details, please?
>
> Just look at, say, linux-2.4.29/fs/proc/array.c proc_pid_statm:
> which calls statm_pgd_range which calls statm_pmd_range which
> calls statm_pte_range which scans along the array of ptes doing
> the pte examination you're doing. There are plenty of examples
> in 2.6.11-rc mm/memory.c of how to do it with pud level too.
>
> Whereas your way starts at the top and descends the tree each time
> for every leaf, repeatedly mapping and unmapping the page table if
> that pagetable is in highmem. You took follow_page as your starting
> point, which is good for a single pte, but inefficient for many.
>
> Your function(s) will still be heavyweight, but somewhat faster.
>
> Hugh
>

2005-02-16 15:17:16

by Richard F. Rebel

Subject: Re: /proc/*/statm, exactly what does "shared" mean?


Hello,

I have heard that this particular information, while very important to
userland developers like me, is probably too expensive to keep track of
for most users.

Perhaps a way to enable it for developers, who are willing to spend the
cpu cycles, and disable it for regular use, would be a solution.

Would it be possible to develop a solution allowing us to enable/disable
this tracking via a sysctl call?

Richard F. Rebel

On Wed, 2005-02-16 at 11:02 -0400, Mauricio Lin wrote:
> Hi Hugh,
>
> Thanks by your suggestion. I did not know that kernel 2.4.29 has
> changed the statm implementation. As I can see the statm
> implementation is different between 2.4 and 2.6.
>
> Let me see if I can use the 2.4.29 statm idea to improve the smaps for
> kernel 2.6.11-rc.
>
> BR,
>
> Mauricio Lin.
>
> On Wed, 16 Feb 2005 12:00:55 +0000 (GMT), Hugh Dickins <[email protected]> wrote:
> > On Wed, 16 Feb 2005, Mauricio Lin wrote:
> > > Well, for each vma it is checked how many pages are mapped to rss. So
> > > I have to check per page if it is allocated in physical memory. I know
> > > that this is a heavy function, but do you have any suggestion to
> > > improve this? What do you mean "needs refactoring into pgd_range,
> > > pud_range, pmd_range, pte_range levels like 2.4's statm"? Could you
> > > give more details, please?
> >
> > Just look at, say, linux-2.4.29/fs/proc/array.c proc_pid_statm:
> > which calls statm_pgd_range which calls statm_pmd_range which
> > calls statm_pte_range which scans along the array of ptes doing
> > the pte examination you're doing. There are plenty of examples
> > in 2.6.11-rc mm/memory.c of how to do it with pud level too.
> >
> > Whereas your way starts at the top and descends the tree each time
> > for every leaf, repeatedly mapping and unmapping the page table if
> > that pagetable is in highmem. You took follow_page as your starting
> > point, which is good for a single pte, but inefficient for many.
> >
> > Your function(s) will still be heavyweight, but somewhat faster.
> >
> > Hugh
> >
--
Richard F. Rebel

cat /dev/null > `tty`



2005-02-16 15:59:55

by Hugh Dickins

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

On Wed, 16 Feb 2005, Mauricio Lin wrote:
>
> Thanks by your suggestion. I did not know that kernel 2.4.29 has
> changed the statm implementation. As I can see the statm
> implementation is different between 2.4 and 2.6.
>
> Let me see if I can use the 2.4.29 statm idea to improve the smaps for
> kernel 2.6.11-rc.

(2.4.29 made no changes there, I think it's unchanged between 2.4.0 and
2.4.29. It was 2.5.37 and later 2.5 and 2.6 developments that changed it.)

2005-02-16 16:15:04

by Hugh Dickins

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

On Wed, 16 Feb 2005, Richard F. Rebel wrote:
>
> I have heard that this particular information, while very important to
> userland developers like me, is probably too expensive to keep track of
> for most users.
>
> Perhaps a way to enable it for developers, whom are willing to spend the
> cpu cycles, and disable it for regular use would be a solution.
>
> Would it be possible develop a solution allowing us to enable/disable
> this tracking via a sysctl call?

Possible, but I don't think a sysctl would make much sense here.

The most important thing is that any heavyweight information gathering
should not be happening by default, as a side-effect of something
frequently run.

So a lot of people would oppose putting back any such heavyweight
work into any of the /proc/<pid>/statm fields. But if it goes into
a separate /proc/<pid>/whatever, not read by current tools, then
it's much less of a problem.

I'm still resistant, because I think if the information you're
interested in (how many private pages shared across forks) really
were of interest to many people, then someone would soon write a
top-like tool which kept sampling these values, and we'd be back to
the original situation. Or if it's not of interest to many people,
then isn't it better off as an out-of-tree patch?

But I don't have a dogmatic position on it,
and trust others' judgement better than my own.

Hugh

2005-02-17 16:33:49

by Richard F. Rebel

Subject: Re: /proc/*/statm, exactly what does "shared" mean?

Hello Hugh,

On Wed, 2005-02-16 at 16:10 +0000, Hugh Dickins wrote:
> On Wed, 16 Feb 2005, Richard F. Rebel wrote:
> >
> > I have heard that this particular information, while very important to
> > userland developers like me, is probably too expensive to keep track of
> > for most users.
> >
> > Perhaps a way to enable it for developers, whom are willing to spend the
> > cpu cycles, and disable it for regular use would be a solution.
> >
> > Would it be possible develop a solution allowing us to enable/disable
> > this tracking via a sysctl call?
>
> Possible, but I don't think a sysctl would make much sense here.
>
> The most important thing is that any heavyweight information gathering
> should not be happening by default, as a side-effect of something
> frequently run.

Okay and agreed.

> So a lot of people would oppose putting back any such heavyweight
> work into any of the /proc/<pid>/statm fields. But if it goes into
> a separate /proc/<pid>/whatever, not read by current tools, then
> it's much less of a problem.
>
> I'm still resistant, because I think if the information you're
> interested in (how many private pages shared across forks) really
> were of interest to many people, then someone would soon write a
> top-like tool which kept sampling these values, and we'd be back to
> the original situation. Or if it's not of interest to many people,
> then isn't it better off as an out-of-tree patch?

Well, let's put it this way. In general, would you agree that the bulk
of Linux servers on the internet run apache? It's also not hard to
assume that many of these also run apache+php or apache+mod_perl.

Apache has lots of modules and code that can easily be shared between
processes. Especially when you use mod_perl or php. This sharing
significantly affects the memory footprint and saves us all from having
gigabytes of memory that we don't really need. My apache2 processes
have a VSS of around 120MB!

The capacity of a web serving machine is some combination of CPU,
Memory, and IO bandwidth. Being able to measure the copy-on-write pages
for processes is a key variable to determining a machine capacity. This
figure changes over the life of the process as well, and can be used to
tell children to exit once they reach a certain point.

Now, about keeping it in the vanilla kernel: AFAIK, the only way to
acquire this information is from the kernel. It's useful to developers,
and available on most other platforms. Copy on write pages make sense,
why waste tons of RAM when there is no real need? It makes sense to be
able to observe this behavior, and the information can guide developers
to make their applications more efficient.

In many organizations developers do not have the ability/skill to make
custom kernels. They are given development platforms, told to write
their applications, and so on and so on. Patching the kernel to keep
track of CoW pages, and subsequently maintaining that patch, really
doesn't help them much. I could go on but this is getting a bit off
topic.

Best,

Richard Rebel

> But I don't have a dogmatic position on it,
> and trust others' judgement better than my own.
>
> Hugh
--
Richard F. Rebel

cat /dev/null > `tty`

