2002-08-30 01:53:57

by William Lee Irwin III

[permalink] [raw]
Subject: statm_pgd_range() sucks!

Okay, I have *had it* with statm_pgd_range()!

So I started up top(1) to monitor a benchmark. And the idiot thing
couldn't get out of the kernel fast enough to refresh the screen every
5s and ate an entire cpu (well, I did have 15 to spare, but still). I
pulled the plug and applied the following, which fixed the situation,
albeit with slight changes to the semantics of what's reported.

(1) shared, lib, text, & total are now reported as what's mapped
instead of what's resident. This actually fixes two bugs:
(1A) Misreporting the total size of what's mapped by a task
when third-level pagetables haven't been instantiated
for regions (i.e. incorrect VSZ).
(1B) Misreporting "shared" because elevated reference counts
do not represent sharing of pages between processes,
such things as presence on the LRU and buffers foil it.
shared is now the size of the regions mapped as MAP_SHARED,
which is different, but at least meaningful.

(2) dirty and reported as 0. It wasn't used by userspace anyway, and
frankly that information is too expensive to collect at all.

(3) The hardcoded i386 a.out format hack (0x60000000) is removed
and lib pages reported as 0. This value is not used by
userspace and has not been reliable or meaningful for some time,
as library pages have no special meaning to ELF. It can only be
faithfully reported by means of awareness of both the executable
format and architectural address space layout.

(4) The horrifically expensive pagetable walk is gone, and so
/proc/$PID/statm will disturb the system far less in general.
The reduction in cpu consumption observed was immense.
B = Before, A = After:

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
B: 2813 wli 25 0 1624 1624 744 R 99.9 0.0 0:38 top
A: 259 wli 15 0 2452 1616 0 R 6.1 0.0 3:45 top

Those cpus are 700MHz P-III Xeons. This efficiency gain is immense.

Profile of a stock kernel from a UP "desktop" system where I
continually run top(1) to catch runaway memory consumers:

Before (running for 20 days or so):
1492108515 total 1102.3391
1340420417 default_idle 27925425.3542
13822366 statm_pgd_range 29789.5819
7922620 number 11515.4360
7052090 handle_IRQ_event 88151.1250
6041850 fget 94403.9062
5077128 __generic_copy_to_user 63464.1000
4278107 csum_partial_copy_generic 16711.3555
3968287 vsnprintf 3757.8475
3336038 proc_pid_stat 4255.1505

After (running for 24 hours or so):
9921095 total 7.3620
9710998 default_idle 202312.4583
20253 number 18.6149
13130 proc_getdata 23.4464
12696 __generic_copy_to_user 158.7000
7454 vsnprintf 7.0587
6963 handle_IRQ_event 72.5312
6112 fast_clear_page 54.5714
4568 proc_pid_stat 6.3444
3931 kmem_cache_free 27.2986


statm_pgd_range() was eating buttloads of cpu for exactly zero
gain. Cutting it down to size is worthwhile even for "desktops".
This UP box is under heavy load with general network I/O,
serving up nfsroots to ancient NetBSD toasters, Mozilla/GNOME
gunk, and by some braindeath top(1) manages to dominate the
profile. It's not difficult to understand that a real-life
end-user interactive load is being dominated, and perhaps even
its progress hindered, by the cost of collecting information to
report to userspace in a grossly inefficient manner. And yes, I
do need to know what's eating cpu & mem on my box at all times.
(Before telling me to get more memory, I see enough highmem to
know I don't want highmem at home, and it's already got 768MB.)

(5) The code fits into 80x24 and is generally easier on the eyes.

(6) Less code:
$ diffstat no_statm_pgd_range-2.5.31-4
array.c | 128 ++++++++++------------------------------------------------------
1 files changed, 21 insertions(+), 107 deletions(-)


Against 2.5.31, applies cleanly to 2.5.32.


Cheers,
Bill


===== fs/proc/array.c 1.26 vs edited =====
--- 1.26/fs/proc/array.c Wed Jul 24 18:36:09 2002
+++ edited/fs/proc/array.c Tue Aug 20 09:12:37 2002
@@ -394,120 +394,34 @@
return res;
}

-static inline void statm_pte_range(pmd_t * pmd, unsigned long address, unsigned long size, int * pages, int * shared, int * dirty, int * total)
-{
- unsigned long end, pmd_end;
- pte_t *pte;
-
- if (pmd_none(*pmd))
- return;
- if (pmd_bad(*pmd)) {
- pmd_ERROR(*pmd);
- pmd_clear(pmd);
- return;
- }
- preempt_disable();
- pte = pte_offset_map(pmd, address);
- end = address + size;
- pmd_end = (address + PMD_SIZE) & PMD_MASK;
- if (end > pmd_end)
- end = pmd_end;
- do {
- pte_t page = *pte;
- struct page *ptpage;
- unsigned long pfn;
-
- address += PAGE_SIZE;
- pte++;
- if (pte_none(page))
- continue;
- ++*total;
- if (!pte_present(page))
- continue;
- pfn = pte_pfn(page);
- if (!pfn_valid(pfn))
- continue;
- ptpage = pfn_to_page(pfn);
- if (PageReserved(ptpage))
- continue;
- ++*pages;
- if (pte_dirty(page))
- ++*dirty;
- if (page_count(pte_page(page)) > 1)
- ++*shared;
- } while (address < end);
- pte_unmap(pte - 1);
- preempt_enable();
-}
-
-static inline void statm_pmd_range(pgd_t * pgd, unsigned long address, unsigned long size,
- int * pages, int * shared, int * dirty, int * total)
-{
- pmd_t * pmd;
- unsigned long end;
-
- if (pgd_none(*pgd))
- return;
- if (pgd_bad(*pgd)) {
- pgd_ERROR(*pgd);
- pgd_clear(pgd);
- return;
- }
- pmd = pmd_offset(pgd, address);
- address &= ~PGDIR_MASK;
- end = address + size;
- if (end > PGDIR_SIZE)
- end = PGDIR_SIZE;
- do {
- statm_pte_range(pmd, address, end - address, pages, shared, dirty, total);
- address = (address + PMD_SIZE) & PMD_MASK;
- pmd++;
- } while (address < end);
-}
-
-static void statm_pgd_range(pgd_t * pgd, unsigned long address, unsigned long end,
- int * pages, int * shared, int * dirty, int * total)
-{
- while (address < end) {
- statm_pmd_range(pgd, address, end - address, pages, shared, dirty, total);
- address = (address + PGDIR_SIZE) & PGDIR_MASK;
- pgd++;
- }
-}
-
int proc_pid_statm(struct task_struct *task, char * buffer)
{
- int size=0, resident=0, share=0, trs=0, lrs=0, drs=0, dt=0;
+ int size, resident, shared, text, lib, data, dirty = 0;
struct mm_struct *mm = get_task_mm(task);
+ struct vm_area_struct * vma;

- if (mm) {
- struct vm_area_struct * vma;
- down_read(&mm->mmap_sem);
- vma = mm->mmap;
- while (vma) {
- pgd_t *pgd = pgd_offset(mm, vma->vm_start);
- int pages = 0, shared = 0, dirty = 0, total = 0;
+ size = resident = shared = text = lib = data = 0;

- statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
- resident += pages;
- share += shared;
- dt += dirty;
- size += total;
- if (vma->vm_flags & VM_EXECUTABLE)
- trs += pages; /* text */
- else if (vma->vm_flags & VM_GROWSDOWN)
- drs += pages; /* stack */
- else if (vma->vm_end > 0x60000000)
- lrs += pages; /* library */
- else
- drs += pages;
- vma = vma->vm_next;
- }
- up_read(&mm->mmap_sem);
- mmput(mm);
+ if (!mm)
+ goto out;
+
+ down_read(&mm->mmap_sem);
+ resident = mm->rss;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ size += pages;
+ if (vma->vm_flags & VM_SHARED)
+ shared += pages;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ text += pages;
+ else
+ data += pages;
}
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+out:
return sprintf(buffer,"%d %d %d %d %d %d %d\n",
- size, resident, share, trs, lrs, drs, dt);
+ size, resident, shared, text, lib, data, dirty);
}

/*


2002-08-30 02:35:46

by Andrew Morton

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>
> Okay, I have *had it* with statm_pgd_range()!

It's certainly very bad. A measurement tools shouldn't be perturbing
the system so much as to invalidate the results of other measurement
tools, and this one does.

I have several times had colleagues peering at kernel code wondering
why their application was spending so long in statm_pgd_range when
it really wasn't.

> ...
> (1) shared, lib, text, & total are now reported as what's mapped
> instead of what's resident. This actually fixes two bugs:

hmm. Personally, I've never believed, or even bothered to try to
understand what those columns are measuring. Does anyone actually
find them useful for anything? If so, what are they being used for?
What info do we really, actually want to know?

Reporting the size of the vma is really inaccurate for many situations,
and the info which you're showing here can be generated from
/proc/pid/maps. And it would be nice to get something useful out of this.

Would it be hard to add an `nr_pages' occupancy counter to vm_area_struct?
Go and add all those up?

BTW, Rohit's hugetlb patch touches proc_pid_statm(), so a diff on -mm3
would be appreciated.

2002-08-30 03:07:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>> (1) shared, lib, text, & total are now reported as what's mapped
>> instead of what's resident. This actually fixes two bugs:

On Thu, Aug 29, 2002 at 07:51:44PM -0700, Andrew Morton wrote:
> hmm. Personally, I've never believed, or even bothered to try to
> understand what those columns are measuring. Does anyone actually
> find them useful for anything? If so, what are they being used for?
> What info do we really, actually want to know?

I'm basically looking for VSZ, RSS, %cpu, & pid -- after that I don't
care. top(1) examines a lot more than it feeds into the display, for
reasons unknown. In principle, there are ways of recovering the other
bits that seem too complex to be worthy of doing:

(1) update a mm->shared counter on every PG_direct break/collapse
(2) walk the pte_chain updating mm->dirty for each pte every time
set_page_dirty() or ClearPageDirty() is done
(3) binfmt helpers + arch specific helpers for the binfmt helpers for
keeping count of up mm->lib

(1) doesn't sound good because pte_chain stuff is already a big hotspot
(2) doesn't sound good for the same reason
(3) sounds like a portability nightmare

i.e. Not worth doing. esp. for stats of which only (1) is used/usable,
and the value of (1) in question (IMHO) due to the longstanding
misreporting.


On Thu, Aug 29, 2002 at 07:51:44PM -0700, Andrew Morton wrote:
> Reporting the size of the vma is really inaccurate for many situations,
> and the info which you're showing here can be generated from
> /proc/pid/maps. And it would be nice to get something useful out of this.
> Would it be hard to add an `nr_pages' occupancy counter to vm_area_struct?
> Go and add all those up?

If top(1) understood/used /proc/$PID/maps that'd be fine. It's more or
less a "vaguely compatible placeholder" aside from RSS, and there's
some kind of burden of vague compatibility for /proc/ stuff.

Per-vma RSS is trivial, just less self-contained. Everywhere the
mm->rss is touched, the vma to account that to is also known, except
for put_dirty_page(), and that can be repaired as its caller knows.



Cheers,
Bill

2002-08-30 03:38:46

by Andrew Morton

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> (1) shared, lib, text, & total are now reported as what's mapped
> >> instead of what's resident. This actually fixes two bugs:
>
> On Thu, Aug 29, 2002 at 07:51:44PM -0700, Andrew Morton wrote:
> > hmm. Personally, I've never believed, or even bothered to try to
> > understand what those columns are measuring. Does anyone actually
> > find them useful for anything? If so, what are they being used for?
> > What info do we really, actually want to know?
>
> I'm basically looking for VSZ, RSS, %cpu, & pid -- after that I don't
> care.

Well statistics coming out of the kernel can be quite vital in the tuning
of real world applications - they're not just for kernel developers. The
stats contribute to the bottom-line performance and stability of the things
for which people are actually using the kernel. That's a motherhood statement,
I know, but I think it's important.

> ...
>
> Per-vma RSS is trivial, just less self-contained. Everywhere the
> mm->rss is touched, the vma to account that to is also known, except
> for put_dirty_page(), and that can be repaired as its caller knows.

This would provide useful information at a justifiable cost, don't you
think?

(ho-hum, all right. I'll code it ;))

2002-08-30 08:21:28

by Anton Blanchard

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!


> Okay, I have *had it* with statm_pgd_range()!

On a related note, it would be nice if procps would not parse things it
doesnt need:

strace ps 2>&1 | grep open | grep '/proc'

open("/proc/12467/stat", O_RDONLY) = 7
open("/proc/12467/statm", O_RDONLY) = 7
open("/proc/12467/status", O_RDONLY) = 7
open("/proc/12467/cmdline", O_RDONLY) = 7
open("/proc/12467/environ", O_RDONLY) = 7

It always opens statm even when its not required.

Anton

2002-08-30 08:26:43

by William Lee Irwin III

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

At some point in the past, I wrote:
>> Okay, I have *had it* with statm_pgd_range()!

On Fri, Aug 30, 2002 at 06:24:56PM +1000, Anton Blanchard wrote:
> On a related note, it would be nice if procps would not parse things it
> doesnt need:
> strace ps 2>&1 | grep open | grep '/proc'
> open("/proc/12467/stat", O_RDONLY) = 7
> open("/proc/12467/statm", O_RDONLY) = 7
> open("/proc/12467/status", O_RDONLY) = 7
> open("/proc/12467/cmdline", O_RDONLY) = 7
> open("/proc/12467/environ", O_RDONLY) = 7
> It always opens statm even when its not required.
> Anton

Userspace is FITH, it only needs to do it for BSD-style stuff reporting
RSS/vsz. vsz is wrong anyway if it doesn't open /proc/$PID/maps.


Cheers,
Bill

2002-08-30 17:42:47

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

In message <[email protected]>, > : William Lee Irwin III wri
tes:
> I'm basically looking for VSZ, RSS, %cpu, & pid -- after that I don't
> care. top(1) examines a lot more than it feeds into the display, for
> reasons unknown. In principle, there are ways of recovering the other
> bits that seem too complex to be worthy of doing:

Try using "f" inside of top(1). You'll see a set of additional
bits of info that it can report which may map to the data that it
examines but you haven't seen in its display.

gerrit

2002-09-01 21:56:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

On Friday 30 August 2002 04:51, Andrew Morton wrote:
> William Lee Irwin III wrote:
> > (1) shared, lib, text, & total are now reported as what's mapped
> > instead of what's resident. This actually fixes two bugs:
>
> hmm. Personally, I've never believed, or even bothered to try to
> understand what those columns are measuring. Does anyone actually
> find them useful for anything? If so, what are they being used for?
> What info do we really, actually want to know?

I don't know what use 'shared' is, but it's clearly not very accurate
since it's just adding up all pages with count >= 1. The only remotely
correct thing to do here is check for multiple pte reverse pointers.

--
Daniel

2002-09-05 03:24:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

On Thu, Aug 29, 2002 at 07:51:44PM -0700, Andrew Morton wrote:
> BTW, Rohit's hugetlb patch touches proc_pid_statm(), so a diff on -mm3
> would be appreciated.

I lost track of what the TODO's were but this is of relatively minor
import, and I lagged long enough this is against 2.5.33-mm2:


Cheers,
Bill


--- linux-virgin/fs/proc/array.c.orig 2002-09-02 23:24:57.000000000 -0700
+++ linux-wli/fs/proc/array.c 2002-09-02 23:37:17.000000000 -0700
@@ -394,131 +394,38 @@
return res;
}

-static inline void statm_pte_range(pmd_t * pmd, unsigned long address, unsigned long size, int * pages, int * shared, int * dirty, int * total)
+int proc_pid_statm(task_t *task, char *buffer)
{
- unsigned long end, pmd_end;
- pte_t *pte;
-
- if (pmd_none(*pmd))
- return;
- if (pmd_bad(*pmd)) {
- pmd_ERROR(*pmd);
- pmd_clear(pmd);
- return;
- }
- preempt_disable();
- pte = pte_offset_map(pmd, address);
- end = address + size;
- pmd_end = (address + PMD_SIZE) & PMD_MASK;
- if (end > pmd_end)
- end = pmd_end;
- do {
- pte_t page = *pte;
- struct page *ptpage;
- unsigned long pfn;
-
- address += PAGE_SIZE;
- pte++;
- if (pte_none(page))
- continue;
- ++*total;
- if (!pte_present(page))
- continue;
- pfn = pte_pfn(page);
- if (!pfn_valid(pfn))
- continue;
- ptpage = pfn_to_page(pfn);
- if (PageReserved(ptpage))
- continue;
- ++*pages;
- if (pte_dirty(page))
- ++*dirty;
- if (page_count(pte_page(page)) > 1)
- ++*shared;
- } while (address < end);
- pte_unmap(pte - 1);
- preempt_enable();
-}
-
-static inline void statm_pmd_range(pgd_t * pgd, unsigned long address, unsigned long size,
- int * pages, int * shared, int * dirty, int * total)
-{
- pmd_t * pmd;
- unsigned long end;
-
- if (pgd_none(*pgd))
- return;
- if (pgd_bad(*pgd)) {
- pgd_ERROR(*pgd);
- pgd_clear(pgd);
- return;
- }
- pmd = pmd_offset(pgd, address);
- address &= ~PGDIR_MASK;
- end = address + size;
- if (end > PGDIR_SIZE)
- end = PGDIR_SIZE;
- do {
- statm_pte_range(pmd, address, end - address, pages, shared, dirty, total);
- address = (address + PMD_SIZE) & PMD_MASK;
- pmd++;
- } while (address < end);
-}
+ int size, resident, shared, text, lib, data, dirty;
+ struct mm_struct *mm = get_task_mm(task);
+ struct vm_area_struct * vma;

-static void statm_pgd_range(pgd_t * pgd, unsigned long address, unsigned long end,
- int * pages, int * shared, int * dirty, int * total)
-{
- while (address < end) {
- statm_pmd_range(pgd, address, end - address, pages, shared, dirty, total);
- address = (address + PGDIR_SIZE) & PGDIR_MASK;
- pgd++;
- }
-}
+ size = resident = shared = text = lib = data = dirty = 0;

-int proc_pid_statm(struct task_struct *task, char * buffer)
-{
- int size=0, resident=0, share=0, trs=0, lrs=0, drs=0, dt=0;
- struct mm_struct *mm = get_task_mm(task);
+ if (!mm)
+ goto out;

- if (mm) {
- struct vm_area_struct * vma;
- down_read(&mm->mmap_sem);
- vma = mm->mmap;
- while (vma) {
- pgd_t *pgd = pgd_offset(mm, vma->vm_start);
- int pages = 0, shared = 0, dirty = 0, total = 0;
- if (is_vm_hugetlb_page(vma)) {
- int num_pages = ((vma->vm_end - vma->vm_start)/PAGE_SIZE);
-
- resident += num_pages;
- if (!(vma->vm_flags & VM_DONTCOPY))
- share += num_pages;
- if (vma->vm_flags & VM_WRITE)
- dt += num_pages;
- drs += num_pages;
- vma = vma->vm_next;
- continue;
- }
- statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
- resident += pages;
- share += shared;
- dt += dirty;
- size += total;
- if (vma->vm_flags & VM_EXECUTABLE)
- trs += pages; /* text */
- else if (vma->vm_flags & VM_GROWSDOWN)
- drs += pages; /* stack */
- else if (vma->vm_end > 0x60000000)
- lrs += pages; /* library */
- else
- drs += pages;
- vma = vma->vm_next;
+ down_read(&mm->mmap_sem);
+ resident = mm->rss;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (is_vm_hugetlb_page(vma)) {
+ if (!(vma->vm_flags & VM_DONTCOPY))
+ shared += pages;
+ continue;
}
- up_read(&mm->mmap_sem);
- mmput(mm);
+ if (vma->vm_flags & VM_SHARED)
+ shared += pages;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ text += pages;
+ else
+ data += pages;
}
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+out:
return sprintf(buffer,"%d %d %d %d %d %d %d\n",
- size, resident, share, trs, lrs, drs, dt);
+ size, resident, shared, text, lib, data, dirty);
}

/*

2002-09-05 04:31:01

by Andrew Morton

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>
> On Thu, Aug 29, 2002 at 07:51:44PM -0700, Andrew Morton wrote:
> > BTW, Rohit's hugetlb patch touches proc_pid_statm(), so a diff on -mm3
> > would be appreciated.
>
> I lost track of what the TODO's were but this is of relatively minor
> import, and I lagged long enough this is against 2.5.33-mm2:

Well the TODO was to worry about the (very) incorrect reporting of
mapping occupancy. mmap(1G file), touch one byte of it (or none)
and the thing will report 1G?

We figured that per-vma rss accounting would be easy and would fix
it, then we remembered that vma's can be split into two, which
screwed that plan most royally.

Maybe when a VMA is split, we set the new VMA to have an rss of zero,
and keep on doing the accounting. That way, the sum-of-vmas is
still correct even though the individual ones are wildly wrong??

2002-09-05 06:09:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>> I lost track of what the TODO's were but this is of relatively minor
>> import, and I lagged long enough this is against 2.5.33-mm2:

On Wed, Sep 04, 2002 at 09:48:07PM -0700, Andrew Morton wrote:
> Well the TODO was to worry about the (very) incorrect reporting of
> mapping occupancy. mmap(1G file), touch one byte of it (or none)
> and the thing will report 1G?

I don't know of anything actually meant to report mapping occupancy
(except full RSS) before or after this patch. Or have I blundered?


On Wed, Sep 04, 2002 at 09:48:07PM -0700, Andrew Morton wrote:
> We figured that per-vma rss accounting would be easy and would fix
> it, then we remembered that vma's can be split into two, which
> screwed that plan most royally.
> Maybe when a VMA is split, we set the new VMA to have an rss of zero,
> and keep on doing the accounting. That way, the sum-of-vmas is
> still correct even though the individual ones are wildly wrong??

Hmm, that could get hairy depending on how we want them grouped. It
might be better just to maintain RSS counters for the kinds of mappings
we're interested in. Doing pagetable walks to make splitvma() do that
right could perform poorly. Otherwise we'd have to find another
instance of the same kind of thing to "donate" our RSS to on unmap.


Cheers,
Bill

2002-09-05 06:26:41

by William Lee Irwin III

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

On Thu, Aug 29, 2002 at 07:51:44PM -0700, Andrew Morton wrote:
>> BTW, Rohit's hugetlb patch touches proc_pid_statm(), so a diff on -mm3
>> would be appreciated.

On Wed, Sep 04, 2002 at 08:20:35PM -0700, William Lee Irwin III wrote:
> I lost track of what the TODO's were but this is of relatively minor
> import, and I lagged long enough this is against 2.5.33-mm2:

doh! I dropped a line merging by hand and broke VSZ

on top of the prior one:


diff -u linux-wli/fs/proc/array.c linux-wli/fs/proc/array.c
--- linux-wli/fs/proc/array.c 2002-09-02 23:37:17.000000000 -0700
+++ linux-wli/fs/proc/array.c 2002-09-02 23:37:17.000000000 -0700
@@ -409,6 +409,7 @@
resident = mm->rss;
for (vma = mm->mmap; vma; vma = vma->vm_next) {
int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ size += pages;
if (is_vm_hugetlb_page(vma)) {
if (!(vma->vm_flags & VM_DONTCOPY))
shared += pages;

2002-09-05 06:32:13

by Andrew Morton

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> I lost track of what the TODO's were but this is of relatively minor
> >> import, and I lagged long enough this is against 2.5.33-mm2:
>
> On Wed, Sep 04, 2002 at 09:48:07PM -0700, Andrew Morton wrote:
> > Well the TODO was to worry about the (very) incorrect reporting of
> > mapping occupancy. mmap(1G file), touch one byte of it (or none)
> > and the thing will report 1G?
>
> I don't know of anything actually meant to report mapping occupancy
> (except full RSS) before or after this patch. Or have I blundered?

statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
^^^^^

`pages' there is the number of actually resident pages, yes?
And it gets fed into trs, drs and lrs.

But converting it to this:

+ int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
...
+ if (vma->vm_flags & VM_SHARED)
+ shared += pages;

Will mean that `shared' can be vastly overestimated. I think??

> On Wed, Sep 04, 2002 at 09:48:07PM -0700, Andrew Morton wrote:
> > We figured that per-vma rss accounting would be easy and would fix
> > it, then we remembered that vma's can be split into two, which
> > screwed that plan most royally.
> > Maybe when a VMA is split, we set the new VMA to have an rss of zero,
> > and keep on doing the accounting. That way, the sum-of-vmas is
> > still correct even though the individual ones are wildly wrong??
>
> Hmm, that could get hairy depending on how we want them grouped. It
> might be better just to maintain RSS counters for the kinds of mappings
> we're interested in. Doing pagetable walks to make splitvma() do that
> right could perform poorly. Otherwise we'd have to find another
> instance of the same kind of thing to "donate" our RSS to on unmap.

A walk in split_vma would be unpopular.. Could we separate mm->rss
up into text, stack and library or something?

Or do we just not care? I guess it's conceivably useful to know
the residency of each mapping, but there doesn't seem to be an
existing proc interface for that anyway. And having them all
rolled up into an mm-wide number is a lot of information loss.

2002-09-05 07:11:30

by William Lee Irwin III

[permalink] [raw]
Subject: Re: statm_pgd_range() sucks!

William Lee Irwin III wrote:
>> I don't know of anything actually meant to report mapping occupancy
>> (except full RSS) before or after this patch. Or have I blundered?

On Wed, Sep 04, 2002 at 11:49:21PM -0700, Andrew Morton wrote:
> statm_pgd_range(pgd, vma->vm_start, vma->vm_end, &pages, &shared, &dirty, &total);
> ^^^^^
> `pages' there is the number of actually resident pages, yes?
> And it gets fed into trs, drs and lrs.
> But converting it to this:
+ int pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
...
+ if (vma->vm_flags & VM_SHARED)
+ shared += pages;
> Will mean that `shared' can be vastly overestimated. I think??


Shared as wildly guessing from the implementation can only be accurately
estimated in one of two ways"

(1) maintaining RSS counters in the mm's updated on PG_direct split/coalesce
(2) walking the pagetables and adding up things with PG_direct clear

... both are too computationally expensive, so I deliberately changed
the semantics to "amount of mem mapped as MAP_SHARED".

Prior to this it was pure garbage because it checked page->count > 1.


William Lee Irwin III wrote:
>> Hmm, that could get hairy depending on how we want them grouped. It
>> might be better just to maintain RSS counters for the kinds of mappings
>> we're interested in. Doing pagetable walks to make splitvma() do that
>> right could perform poorly. Otherwise we'd have to find another
>> instance of the same kind of thing to "donate" our RSS to on unmap.

On Wed, Sep 04, 2002 at 11:49:21PM -0700, Andrew Morton wrote:
> A walk in split_vma would be unpopular.. Could we separate mm->rss
> up into text, stack and library or something?
> Or do we just not care? I guess it's conceivably useful to know
> the residency of each mapping, but there doesn't seem to be an
> existing proc interface for that anyway. And having them all
> rolled up into an mm-wide number is a lot of information loss.

drs and lrs both have problems.

drs basically requires either the pagetable walk here or a pte_chain
walk in a hotpath (set_page_dirty()/ClearPageDirty(), maybe pte_mkdirty()).
So I dropped its reporting entirely.

lrs is not semantically meaningful except to a.out, the existing code
uses hardcoded i386 values for it, and requires chasing down executable
format stuff to "get right".

trs is counting up memory that's mapped as VM_EXECUTABLE, which isn't
a big deal. RSS-style counters work for that just fine.

At any rate, while this may very well benefit small systems, it looks
like this isn't likely to make system monitoring feasible for larger
boxen, since I ran with it tonight and vmstat(1) and top(1) required 2
minutes or (much) longer to refresh during the big bad tiobench run. In
all likelihood I'll have to write a chardev to monitor the system.


Cheers,
Bill