2004-10-13 23:10:46

by Andrea Arcangeli

[permalink] [raw]
Subject: per-process shared information

The 2.6 kernel dropped the shared information for each task: the third
field in /proc/<pid>/statm is a fake value largely overestimating the
number of shared pages. It even gets bigger than the rss, which is why
people noticed: the shared information was bigger than the rss of the
task, so clearly those pages couldn't be shared if they didn't exist.

Nothing important goes wrong in practice, but some statistics get
screwed (they underflow, since they're rightfully used to compute
rss - shared to find the "private", not-shared anonymous data).
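To make the underflow concrete, here is a minimal userspace sketch (a hypothetical helper, not kernel or procps code) of the rss - shared arithmetic those statistics perform:

```c
#include <assert.h>

/* Hypothetical sketch of the userspace arithmetic described above:
 * tools derive "private" memory as rss - shared from /proc/<pid>/statm.
 * With unsigned fields, a bogus shared > rss makes the result wrap. */
unsigned long private_pages(unsigned long rss, unsigned long shared)
{
    return rss - shared;        /* underflows when shared > rss */
}
```

With the overestimated 2.6 values, shared can exceed rss and the subtraction wraps to a huge number, which is how the breakage shows up.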

I guess I can simply resurrect the O(N) pagetable scanning and check for
page_mapcount > 1 out of order in task_mmu.c? I want to add a new file
"statm_phys_shared" instead of doing it in statm, to avoid slowing down
ps xav. So special apps can still get the information out of the kernel.

At first I considered providing the "shared" information in O(log(N))
but I believe it would waste quite some cpu in the fast paths (it would
require walking the prio trees or the anon-vmas in each page fault, and
it would need to keep track of each vma->mm that we already updated
during the prio-tree walking), so I'd prefer the statistics to be slow
in O(N) instead of hurting the fast paths. If I allow rescheduling and I
trap signals there should be no DoS involved even in the very huge
boxes.

Comments? (can you suggest a better name?)

Ps. if somebody like Hugh volunteers to implement it, you're very
welcome, just let me know (I'll eventually want to work on the oom
handling too, which is pretty screwed right now; I've plenty of bugs
open in that area, and the lowmem zone protection needs a rewrite too,
to be set to a sane default value no matter the pages_low settings
etc.).


2004-10-14 22:25:57

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Thu, Oct 14, 2004 at 10:49:28PM +0100, Hugh Dickins wrote:
> Sounds horrid to me! I'm not inclined to volunteer for that: plus this
> is very much Bill's territory, though he might be glad to surrender it!
> But how about the patch below? Really a mixture of four patches...
> One, support anon_rss as a subset of rss, "shared" being (rss - anon_rss).
> Yes, that's a slight change in meaning of "shared" from in 2.4, but easy
> to support and I think very reasonable. On the one hand, yes, of course
> we know an anon page may actually be shared between several mms of the
> fork group, whereas it won't be counted in "shared" with this patch. But
> the old definition of "shared" was considerably more stupid, wasn't it?
> for example, a private page in pte and swap cache got counted as shared.

This is all very reasonable.


-- wli

2004-10-14 22:38:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Thu, Oct 14, 2004 at 10:49:28PM +0100, Hugh Dickins wrote:
> Is "shared" generally expected to pair with rss? Would it make

shared is expected to work like in linux 2.4 (and apparently solaris),
which means _physical_ pages mapped in more than one task.

> Sounds horrid to me! I'm not inclined to volunteer for that: plus this

what's horrid? would you add an O(log(N)) slowdown in the fast paths to
provide the stat in O(1)? I much prefer an O(N) loop in the stats, as
long as it catches signals and reschedules as soon as need_resched is
set.

if you can suggest a not-horrid approach to avoid breaking binary
compatibility to 2.4 you're welcome ;)

> One, support anon_rss as a subset of rss, "shared" being (rss - anon_rss).
> Yes, that's a slight change in meaning of "shared" from in 2.4, but easy
> to support and I think very reasonable. On the one hand, yes, of course

that's certainly much better than what we have right now, it's much
closer to the old semantics, but I'm not sure it's enough to be
compatible with the other OSes (including 2.4). I will ask.

I also guess the app will stop breaking since rss - shared will not wrap
anymore.

> we know an anon page may actually be shared between several mms of the
> fork group, whereas it won't be counted in "shared" with this patch. But
> the old definition of "shared" was considerably more stupid, wasn't it?
> for example, a private page in pte and swap cache got counted as shared.

just checking mapcount > 1 would do it right in 2.6.

> Would this new meaning of "shared" suit your purposes well enough?

It'd be fine for me, but I'm not the one who's having troubles.

> shouldn't change that now, but add your statm_phys_shared; whatever,

the only reason to add statm_phys_shared was to keep ps xav fast; if you
don't slow down ps xav you can add another field at the end of statm.

Ideally shared should have been set to 0 and the information moved to
statm_phys_shared. It's slow but not so horrid thanks to the mapcount,
which makes it really straightforward (it's just a vma + pagetable
walk; the only tricky bit is the signal catch and need_resched).
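A rough userspace sketch of the O(N) walk being proposed (illustrative names; "mapcount" stands in for the kernel's page_mapcount(), and the resched/signal checks are only mimicked in comments):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the proposed O(N) scan: walk every mapped page
 * and count those with mapcount > 1, i.e. physical pages mapped by
 * more than one task. */
unsigned long count_shared(const int *mapcount, size_t npages)
{
    unsigned long shared = 0;

    for (size_t i = 0; i < npages; i++) {
        /* in-kernel this is where the loop would cond_resched() and
         * bail out early if a signal is pending */
        if (mapcount[i] > 1)
            shared++;
    }
    return shared;
}
```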

2004-10-14 22:14:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: per-process shared information

On Thu, 14 Oct 2004, Andrea Arcangeli wrote:

> The 2.6 kernel dropped the shared information for each task: the third
> field in /proc/<pid>/statm is a fake value largely overestimating the
> number of shared pages. It even gets bigger than the rss, which is why
> people noticed: the shared information was bigger than the rss of the
> task, so clearly those pages couldn't be shared if they didn't exist.

Yes, in 2.4 they were pte counts (seventh always 0), but in 2.6 all but
the second field (rss) are just extents (fifth and seventh now always 0).
Very wasteful to support the old, often meaningless, counts.

> Nothing important goes wrong in practice, but some statistics get
> screwed (they underflow, since they're rightfully used to compute
> rss - shared to find the "private", not-shared anonymous data).

Is "shared" generally expected to pair with rss? Would it make
sense to revert shared to a pte count, but leave size, text, and
data as extents? I don't know.

> I guess I can simply resurrect the O(N) pagetable scanning and check for
> page_mapcount > 1 out of order in task_mmu.c? I want to add a new file
> "statm_phys_shared" instead of doing it in statm, to avoid slowing down
> ps xav. So special apps can still get the information out of the kernel.
>
> At first I considered providing the "shared" information in O(log(N))
> but I believe it would waste quite some cpu in the fast paths (it would
> require walking the prio trees or the anon-vmas in each page fault, and
> it would need to keep track of each vma->mm that we already updated
> during the prio-tree walking), so I'd prefer the statistics to be slow
> in O(N) instead of hurting the fast paths. If I allow rescheduling and I
> trap signals there should be no DoS involved even in the very huge
> boxes.

Sounds horrid to me! I'm not inclined to volunteer for that: plus this
is very much Bill's territory, though he might be glad to surrender it!

But how about the patch below? Really a mixture of four patches...

One, support anon_rss as a subset of rss, "shared" being (rss - anon_rss).
Yes, that's a slight change in meaning of "shared" from in 2.4, but easy
to support and I think very reasonable. On the one hand, yes, of course
we know an anon page may actually be shared between several mms of the
fork group, whereas it won't be counted in "shared" with this patch. But
the old definition of "shared" was considerably more stupid, wasn't it?
for example, a private page in pte and swap cache got counted as shared.
Would this new meaning of "shared" suit your purposes well enough?

Two, more dubious, report this new "shared" as the third field
in /proc/pid/statm, instead of the current "shared_vm". Maybe we
shouldn't change that now, but add your statm_phys_shared; whatever,
this patch at least allows you to see if it makes sense.

Three, I see negative "data" sometimes: since "text" is (always?) a
subset of "shared_vm", we're subtracting that twice from total_vm.
Fix that, just say data is total_vm-shared_vm: same as VmData+VmStk
from /proc/pid/status.
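As an editorial sketch (simulated mm fields, not the kernel struct), the arithmetic of fixes one and three amounts to:

```c
#include <assert.h>

/* Sketch of the statm arithmetic in fixes one and three: "shared"
 * becomes rss - anon_rss, and "data" becomes total_vm - shared_vm so
 * that text is no longer subtracted twice. */
struct mm_sim {
    unsigned long rss, anon_rss, total_vm, shared_vm;
};

unsigned long statm_shared(const struct mm_sim *mm)
{
    return mm->rss - mm->anon_rss;
}

unsigned long statm_data(const struct mm_sim *mm)
{
    return mm->total_vm - mm->shared_vm;
}
```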

Four, the __vm_stat_accounting in do_mmap_pgoff didn't work quite
right in the test I happened to try, didn't balance with munmap -
shared anon starts off with NULL file, then gets given a vm_file
at the end, __vm_stat_account needs to use that. And a driver is
likely to change vm_flags too (adding VM_IO or VM_RESERVED for
example), so better pick that up. (I've also thrown in updating
pgoff for the vma_merge there: noticed that in an earlier audit,
though in fact all the drivers which adjust vm_pgoff happen to
set one of the VM_SPECIAL flags which prevent merging.)

Hugh

--- 2.6.9-rc4/fs/proc/task_mmu.c 2004-10-11 11:58:22.000000000 +0100
+++ linux/fs/proc/task_mmu.c 2004-10-14 21:33:48.744401000 +0100
@@ -35,9 +35,9 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = mm->shared_vm;
+ *shared = mm->rss - mm->anon_rss;
*text = (mm->end_code - mm->start_code) >> PAGE_SHIFT;
- *data = mm->total_vm - mm->shared_vm - *text;
+ *data = mm->total_vm - mm->shared_vm;
*resident = mm->rss;
return mm->total_vm;
}
--- 2.6.9-rc4/include/linux/sched.h 2004-10-11 11:58:32.000000000 +0100
+++ linux/include/linux/sched.h 2004-10-14 21:33:48.745400848 +0100
@@ -225,7 +225,7 @@ struct mm_struct {
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long rss, total_vm, locked_vm, shared_vm;
+ unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags;

unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
--- 2.6.9-rc4/kernel/fork.c 2004-10-11 11:58:33.000000000 +0100
+++ linux/kernel/fork.c 2004-10-14 21:33:48.747400544 +0100
@@ -297,6 +297,7 @@ static inline int dup_mmap(struct mm_str
mm->free_area_cache = oldmm->mmap_base;
mm->map_count = 0;
mm->rss = 0;
+ mm->anon_rss = 0;
cpus_clear(mm->cpu_vm_mask);
mm->mm_rb = RB_ROOT;
rb_link = &mm->mm_rb.rb_node;
--- 2.6.9-rc4/mm/memory.c 2004-10-11 11:58:34.000000000 +0100
+++ linux/mm/memory.c 2004-10-14 21:33:48.749400240 +0100
@@ -326,6 +326,8 @@ skip_copy_pte_range:
pte = pte_mkold(pte);
get_page(page);
dst->rss++;
+ if (PageAnon(page))
+ dst->anon_rss++;
set_pte(dst_pte, pte);
page_dup_rmap(page);
cont_copy_pte_range_noset:
@@ -416,7 +418,9 @@ static void zap_pte_range(struct mmu_gat
set_pte(ptep, pgoff_to_pte(page->index));
if (pte_dirty(pte))
set_page_dirty(page);
- if (pte_young(pte) && !PageAnon(page))
+ if (PageAnon(page))
+ tlb->mm->anon_rss--;
+ else if (pte_young(pte))
mark_page_accessed(page);
tlb->freed++;
page_remove_rmap(page);
@@ -1095,6 +1099,8 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
+ if (PageAnon(old_page))
+ mm->anon_rss--;
if (PageReserved(old_page))
++mm->rss;
else
--- 2.6.9-rc4/mm/mmap.c 2004-10-11 11:58:34.000000000 +0100
+++ linux/mm/mmap.c 2004-10-14 21:33:48.751399936 +0100
@@ -988,9 +988,12 @@ munmap_back:
* f_op->mmap method. -DaveM
*/
addr = vma->vm_start;
+ pgoff = vma->vm_pgoff;
+ vm_flags = vma->vm_flags;

if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
+ file = vma->vm_file;
vma_link(mm, vma, prev, rb_link, rb_parent);
if (correct_wcount)
atomic_inc(&inode->i_writecount);
@@ -1005,6 +1008,7 @@ munmap_back:
}
out:
mm->total_vm += len >> PAGE_SHIFT;
+ __vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
@@ -1015,7 +1019,6 @@ out:
pgoff, flags & MAP_NONBLOCK);
down_write(&mm->mmap_sem);
}
- __vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
return addr;

unmap_and_free_vma:
--- 2.6.9-rc4/mm/rmap.c 2004-10-11 11:58:34.000000000 +0100
+++ linux/mm/rmap.c 2004-10-14 21:33:48.752399784 +0100
@@ -431,6 +431,8 @@ void page_add_anon_rmap(struct page *pag
BUG_ON(PageReserved(page));
BUG_ON(!anon_vma);

+ vma->vm_mm->anon_rss++;
+
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
index = (address - vma->vm_start) >> PAGE_SHIFT;
index += vma->vm_pgoff;
@@ -578,6 +580,7 @@ static int try_to_unmap_one(struct page
swap_duplicate(entry);
set_pte(pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
+ mm->anon_rss--;
}

mm->rss--;

2004-10-14 23:43:32

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: per-process shared information


Hi Andrea!

I have no useful comments on the statm reporting issue.

> Ps. if somebody like Hugh volunteers implementing it, you're very
> welcome, just let me know (I'll eventually want to work on the oom
> handling too, which is pretty screwed right now,

Yes, we've got reports of bad OOM killing behaviour (is that what you're
talking about?)

One thing is that the removal of "if (nr_swap_pages > 0) goto out" from
oom_kill() causes problems (spurious oom kills).

We need to throttle more, on page reclaiming progress I think.

Take a look at

http://marc.theaimsgroup.com/?l=linux-mm&m=109587921204602&w=2

What else are you seeing?

> I've plenty of bugs
> open on that area and the lowmem zone protection needs a rewrite too to
> be set to a sane default value no matter the pages_lows etc..).

Nick has been working on that lately I think. What is the problem?



2004-10-15 00:06:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Thu, Oct 14, 2004 at 06:47:11PM -0300, Marcelo Tosatti wrote:
>
> Hi Andrea!
>
> No useful comments on the statm reporting issue.
>
> > Ps. if somebody like Hugh volunteers implementing it, you're very
> > welcome, just let me know (I'll eventually want to work on the oom
> > handling too, which is pretty screwed right now,
>
> Yes, we've got reports of bad OOM killing behaviour (is that what you're
> talking about?)
>
> One thing is the removal of "if (nr_swap_pages > 0) goto out" from oom_kill()
> causes problems (spurious oom kill).
>
> We need to throttle more, on page reclaiming progress I think.
>
> Take a look at
>
> http://marc.theaimsgroup.com/?l=linux-mm&m=109587921204602&w=2
>
> What else you're seeing?
>
> > I've plenty of bugs
> > open on that area and the lowmem zone protection needs a rewrite too to
> > be set to a sane default value no matter the pages_lows etc..).
>
> Nick has been working on that lately I think. What is the problem?

things got worse with the switch from 2.6.8 to 2.6.9-rc, so it's not
the nr_swap_pages > 0 check; likely the latest changes introduced
regressions instead of fixing them.

I'm seeing both hard deadlocks and spurious oom kills, and it all makes
sense: I can see the bugs, it's just that I need to fix them. My plan
is to forward-port some code from 2.4 which works fine, and objrmap
will make it even better.

2004-10-15 01:06:52

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 01:58:45AM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 14, 2004 at 06:47:11PM -0300, Marcelo Tosatti wrote:
> >
> > Hi Andrea!
> >
> > No useful comments on the statm reporting issue.
> >
> > > Ps. if somebody like Hugh volunteers implementing it, you're very
> > > welcome, just let me know (I'll eventually want to work on the oom
> > > handling too, which is pretty screwed right now,
> >
> > Yes, we've got reports of bad OOM killing behaviour (is that what you're
> > talking about?)
> >
> > One thing is the removal of "if (nr_swap_pages > 0) goto out" from oom_kill()
> > causes problems (spurious oom kill).
> >
> > We need to throttle more, on page reclaiming progress I think.
> >
> > Take a look at
> >
> > http://marc.theaimsgroup.com/?l=linux-mm&m=109587921204602&w=2
> >
> > What else you're seeing?
> >
> > > I've plenty of bugs
> > > open on that area and the lowmem zone protection needs a rewrite too to
> > > be set to a sane default value no matter the pages_lows etc..).
> >
> > Nick has been working on that lately I think. What is the problem?
>
> things went worse with the switch from 2.6.8 to 2.6.9-rc, so that's not
> the nr_swap_pages > 0, likely the latest changes introduced regressions
> instead of fixing them.

Just FYI - removing the "nr_swap_pages > 0" fixes the problem at
the URL I posted above.

But having it creates hard locks on Oracle workloads (wli removed
that line) due to pinned memory.

> I'm seeing both hard deadlocks and spurious oom kills, and that all
> makes sense, I can see the bugs, it's just I need to fix them, my plan
> is to forward port some code from 2.4 which works fine, objrmap will make
> it even better.

Ok, very nice!

2004-10-15 10:45:53

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Thu, Oct 14, 2004 at 06:47:11PM -0300, Marcelo Tosatti wrote:
>> Nick has been working on that lately I think. What is the problem?

On Fri, Oct 15, 2004 at 01:58:45AM +0200, Andrea Arcangeli wrote:
> things went worse with the switch from 2.6.8 to 2.6.9-rc, so that's not
> the nr_swap_pages > 0, likely the latest changes introduced regressions
> instead of fixing them.
> I'm seeing both hard deadlocks and spurious oom kills, and that all
> makes sense, I can see the bugs, it's just I need to fix them, my plan
> is to forward port some code from 2.4 which works fine, objrmap will make
> it even better.

I'm not aware of these. If you could relay some of that information to
me (SuSE bugzilla numbers and similar are fine) I'd be much obliged.


-- wli

2004-10-15 10:51:06

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Thu, Oct 14, 2004 at 10:49:28PM +0100, Hugh Dickins wrote:
>> shouldn't change that now, but add your statm_phys_shared; whatever,

On Fri, Oct 15, 2004 at 12:37:30AM +0200, Andrea Arcangeli wrote:
> the only reason to add statm_phys_shared was to keep ps xav fast, if you
> don't slow down ps xav you can add another field at the end of statm.
> Ideally shared should have been set to 0 and it should have been moved
> to statm_phys_shared. It's slow but not so horrid thanks to the mapcount
> which makes it really straightforward (it's just a vma + pgtable walk,
> the only tricky bit is the signal catch and need_resched).

I'd be rather loath to endorse reintroducing the problematic pagetable
and vma walks I spent so much effort eliminating. Will Hugh's patch
suffice for you? If not, what else is needed?


-- wli

2004-10-15 11:56:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 15 Oct 2004, Andrea Arcangeli wrote:
> On Thu, Oct 14, 2004 at 10:49:28PM +0100, Hugh Dickins wrote:
> > Is "shared" generally expected to pair with rss? Would it make
>
> shared is expected to work like in linux 2.4 (and apparently solaris),
> which means _physical_ pages mapped in more than one task.

Well, it didn't quite mean that in 2.4: since any pagecache (including
swapcache) page mapped into a single task would have page_count 2 and
so be counted as shared.

I think 2.4 was already trying to come up with a plausible simulacrum
of numbers that made sense to gather in 2.2, but the numbers had lost
their point, and it only had page_count to play with. Or maybe 2.2
was already trying to make up numbers to fit with 2.0...

> > Sounds horrid to me! I'm not inclined to volunteer for that: plus this
>
> what's horrid? would you add a O(log(N)) slowdown in the fast paths to
> provide the stat in O(1)? I much prefer an O(N) loop in the stats as far
> as it catches signals and reschedules as soon as need_resched is set.

Of course I prefer to keep significant slowdown out of the fast paths
("significant" inserted there because I don't mind the fastish path
incrementing just one count in an already dirty cacheline).

But I don't want to give myself unnecessary work, and I don't want to
give the cpu unnecessary work, particularly if the stats gathering is
in danger of dominating some profiled load. Bill had good reason to
remove even the vma walk; but I accept you're being careful to propose
that we keep the overhead out of existing /proc/pid uses - right if
we have to go that way, but I just prefer to avoid the work myself.

> if you can suggest a not-horrid approach to avoid breaking binary
> compatibility to 2.4 you're welcome ;)

I hope that's what my patch would be sufficient to achieve.

It would be unfair to say 2.4's numbers were actually a bug, but
certainly peculiar: I'm about as interested in exactly reproducing
their oddities as in building a replica of some antique furniture.

> > One, support anon_rss as a subset of rss, "shared" being (rss - anon_rss).
> > Yes, that's a slight change in meaning of "shared" from in 2.4, but easy
> > to support and I think very reasonable. On the one hand, yes, of course
>
> that's certainly much better than what we have right now, it's much
> closer to the old semantics, but I'm not sure if it's enough to be
> compliant with the other OS (including 2.4). I will ask.

Thanks, please do.

> I also guess the app will stop breaking since rss - shared will not wrap
> anymore.

Oh, if that's all we need, I can do a simpler patch ;)

> > we know an anon page may actually be shared between several mms of the
> > fork group, whereas it won't be counted in "shared" with this patch. But
> > the old definition of "shared" was considerably more stupid, wasn't it?
> > for example, a private page in pte and swap cache got counted as shared.
>
> just checking mapcount > 1 would do it right in 2.6.

Interesting idea, and now (well, 2.6.9-mm heading to 2.6.10) we have
atomic_inc_return and atomic_dec_return supported on all architectures,
it should be possible to adjust an mm->shared_rss each time mapcount
goes up from 1 or down to 1, as well as adjusting nr_mapped count
as we do when it goes up from 0 or down to 0.
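A toy model of that transition-based accounting (a plain int stands in for the kernel's atomic_inc_return/atomic_dec_return, and a single counter stands in for the per-mm bookkeeping, which is the genuinely tricky part):

```c
#include <assert.h>

/* Toy model of the shared_rss idea: adjust a counter when a page's
 * mapcount crosses the 1 <-> 2 boundary, mirroring how nr_mapped is
 * adjusted on the 0 <-> 1 transitions. */
struct page_sim {
    int mapcount;
};

void map_page(struct page_sim *p, long *shared_rss)
{
    if (++p->mapcount == 2)     /* page just became shared */
        (*shared_rss)++;
}

void unmap_page(struct page_sim *p, long *shared_rss)
{
    if (--p->mapcount == 1)     /* page just became private again */
        (*shared_rss)--;
}
```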

Though I think I prefer the anon_rss count in yesterday's patch,
which is at least well-defined. And will usually give you numbers
much closer to 2.4's than shared_rss (since, as noted above, 2.4
counted a page shared between pagetable and pagecache as shared,
which mapcount 1 would not).

> > Would this new meaning of "shared" suit your purposes well enough?
>
> It'd be fine for me, but I'm not the one who's having troubles.

Let's wait and see how (rss - anon_rss) works out for your customer.

> > shouldn't change that now, but add your statm_phys_shared; whatever,
>
> the only reason to add statm_phys_shared was to keep ps xav fast, if you
> don't slow down ps xav you can add another field at the end of statm.

We should ask Albert which he prefers: /proc/pid/statm "shared" field
revert to an rss-like count as in 2.4, subset of "resident", while size,
text and data fields remain extents; or leave that third field as in
earlier 2.6 and add a shared-rss field on the end?

Hugh

2004-10-15 13:25:53

by Albert Cahalan

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 2004-10-15 at 07:56, Hugh Dickins wrote:
> On Fri, 15 Oct 2004, Andrea Arcangeli wrote:
> > On Thu, Oct 14, 2004 at 10:49:28PM +0100, Hugh Dickins wrote:

> > if you can suggest a not-horrid approach to avoid breaking binary
> > compatibility to 2.4 you're welcome ;)
>
> I hope that's what my patch would be sufficient to achieve.
>
> It would be unfair to say 2.4's numbers were actually a bug, but
> certainly peculiar: I'm about as interested in exactly reproducing
> their oddities as in building a replica of some antique furniture.

Most other people believe that Linux should have a
stable ABI so that apps don't break left and right.
Even if people have source code, they lose it. Even
if they don't lose it, they don't want to retest.
The developers may have gone on to better things.

People actually rely on Linux to work. Didn't OSDL
just add app+ABI testing? This is why.

> > > we know an anon page may actually be shared between several mms of the
> > > fork group, whereas it won't be counted in "shared" with this patch. But
> > > the old definition of "shared" was considerably more stupid, wasn't it?
> > > for example, a private page in pte and swap cache got counted as shared.
> >
> > just checking mapcount > 1 would do it right in 2.6.
>
> Interesting idea, and now (well, 2.6.9-mm heading to 2.6.10) we have
> atomic_inc_return and atomic_dec_return supported on all architectures,
> it should be possible to adjust an mm->shared_rss each time mapcount
> goes up from 1 or down to 1, as well as adjusting nr_mapped count
> as we do when it goes up from 0 or down to 0.
>
> Though I think I prefer the anon_rss count in yesterday's patch,
> which is at least well-defined. And will usually give you numbers
> much closer to 2.4's than shared_rss (since, as noted above, 2.4
> counted a page shared between pagetable and pagecache as shared,
> which mapcount 1 would not).

I don't see why it is such trouble to provide the old data.
If new data is useful, provide that too.

> > > shouldn't change that now, but add your statm_phys_shared; whatever,
> >
> > the only reason to add statm_phys_shared was to keep ps xav fast, if you
> > don't slowdown pa xav you can add another field at the end of statm.
>
> We should ask Albert which he prefers: /proc/pid/statm "shared" field
> revert to an rss-like count as in 2.4, subset of "resident", while size,
> text and data fields remain extents; or leave that third field as in
> earlier 2.6 and add a shared-rss field on the end?

I display the data as a column in "top". Documentation is
much easier to deal with if it doesn't have lots of special
cases for different kernel versions.

I guess I'd prefer that the fields of Linux 2.4 be restored,
and that any new fields be added on the end. Note that the
text and data fields are supposed to be rss-like as well.
Except for the size, they're all supposed to be that way.
This data was created to match what BSD provides.

If adding a new file to /proc, please pick a short name
that is friendly toward tab completion. "phymem" is OK.


2004-10-15 14:28:38

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 09:19:13AM -0400, Albert Cahalan wrote:
> I display the data as a column in "top". Documentation is
> much easier to deal with if it doesn't have lots of special
> cases for different kernel versions.
> I guess I'd prefer that the fields of Linux 2.4 be restored,
> and that any new fields be added on the end. Note that the
> text and data fields are supposed to be rss-like as well.
> Except for the size, they're all supposed to be that way.
> This data was created to match what BSD provides.
> If adding a new file to /proc, please pick a short name
> that is friendly toward tab completion. "phymem" is OK.

The overhead is too catastrophic to tolerate. Please work with us
to find a sufficient approximation to whatever statistics you want,
as opposed to reverting to 2.4 algorithms or ones of similar expense.


-- wli

2004-10-15 14:47:45

by Albert Cahalan

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 2004-10-15 at 10:28, William Lee Irwin III wrote:
> On Fri, Oct 15, 2004 at 09:19:13AM -0400, Albert Cahalan wrote:
> > I display the data as a column in "top". Documentation is
> > much easier to deal with if it doesn't have lots of special
> > cases for different kernel versions.
> > I guess I'd prefer that the fields of Linux 2.4 be restored,
> > and that any new fields be added on the end. Note that the
> > text and data fields are supposed to be rss-like as well.
> > Except for the size, they're all supposed to be that way.
> > This data was created to match what BSD provides.
> > If adding a new file to /proc, please pick a short name
> > that is friendly toward tab completion. "phymem" is OK.
>
> The overhead is too catastrophic to tolerate. Please work with us
> to find a sufficient approximation to whatever statistics you want,
> as opposed to reverting to 2.4 algorithms or ones of similar expense.

Why not get rid of rss too then? It's overhead.



2004-10-15 14:53:53

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 2004-10-15 at 10:28, William Lee Irwin III wrote:
>> The overhead is too catastrophic to tolerate. Please work with us
>> to find a sufficient approximation to whatever statistics you want,
>> as opposed to reverting to 2.4 algorithms or ones of similar expense.

On Fri, Oct 15, 2004 at 10:40:58AM -0400, Albert Cahalan wrote:
> Why not get rid of rss too then? It's overhead.

If you can't distinguish catastrophic from non-catastrophic you are
beyond my ability to help you.


-- wli

2004-10-15 16:07:53

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 12:56:22PM +0100, Hugh Dickins wrote:
> Well, it didn't quite mean that in 2.4: since any pagecache (including
> swapcache) page mapped into a single task would have page_count 2 and
> so be counted as shared.

Well, doing page_mapcount + !PageAnon should do the trick. My point is
that the confusion of an anon page going into the swap cache (the one
you mentioned) is easily fixable.
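One way to read the page_mapcount + !PageAnon trick, sketched with illustrative names (this is an assumption about the intended predicate, not kernel code): treat the pagecache reference of a file-backed page as one extra mapping, so a file page mapped once still counts as shared (as 2.4 counted it), while a private anon page in swap cache does not:

```c
#include <assert.h>

/* Hypothetical reading of "page_mapcount + !PageAnon": the pagecache
 * reference of a file-backed page counts as one extra mapping, so a
 * page is "shared" when mapcount + !is_anon exceeds 1.  A private anon
 * page in swap cache (mapcount 1, anon) is then correctly not shared. */
struct vpage {
    int mapcount;
    int is_anon;
};

int counts_as_shared(const struct vpage *p)
{
    return p->mapcount + !p->is_anon > 1;
}
```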

> I think 2.4 was already trying to come up with a plausible simulacrum
> of numbers that made sense to gather in 2.2, but the numbers had lost
> their point, and it only had page_count to play with. Or maybe 2.2
> was already trying to make up numbers to fit with 2.0...

;)

> [..] and I don't want to
> give the cpu unnecessary work, [..]

this doesn't make much sense to me; by that logic you should as well
forbid the user to run main () { for(;;) }.

Of course doing a sort of for(;;) in each pid of ps xav is overkill,
but on demand, easily killable and reschedule-friendly, it would be no
different from allowing a user to execute main () { for(;;) }. All you
can do is renice it, or use CKRM to lower its cpu availability from the
scheduler, or kill -9.

> remove even the vma walk; but I accept you're being careful to propose
> that we keep the overhead out of existing /proc/pid uses - right if
> we have to go that way, but I just prefer to avoid the work myself.

correct. I'd also prefer to avoid this work myself.

> I hope that's what my patch would be sufficient to achieve.

I hope too.

> It would be unfair to say 2.4's numbers were actually a bug, but
> certainly peculiar: I'm about as interested in exactly reproducing
> their oddities as in building a replica of some antique furniture.

sure the swapcache bit would need fixing ;)

> Oh, if that's all we need, I can do a simpler patch ;)

yep ;) though such a simpler patch would return nonsensical results; it
would provide no information at all in some cases.

> Interesting idea, and now (well, 2.6.9-mm heading to 2.6.10) we have
> atomic_inc_return and atomic_dec_return supported on all architectures,
> it should be possible to adjust an mm->shared_rss each time mapcount
> goes up from 1 or down to 1, as well as adjusting nr_mapped count
> as we do when it goes up from 0 or down to 0.

I believe your approach will work fine, and it's much closer to the 2.4
physical-driven semantics. It looks like "shared" really means
not-anonymous memory, but accounted on a physical basis.

However I wouldn't mind if you want to add a new field and provide both
the "virtual" shared like 2.6 has right now, along with the "physical"
shared mimicking the semantics of 2.4 (they could both be in statm
since they're O(1) to collect).

2004-10-15 16:24:08

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 09:19:13AM -0400, Albert Cahalan wrote:
> I don't see why it is such trouble to provide the old data.

I agree with you w.r.t. binary compatibility, here it's even a "source
compatibility" matter, a recompile wouldn't fix it.

However I wasn't exactly advocating keeping it 100% backwards
compatible in this case: somebody already broke it between 2.5.x and
2.6.9-rc, and since there was a very good reason for that, we should
probably declare it broken. There was a very strong technical reason
to break statm; they didn't break binary and source compatibility
gratuitously, as some Solaris kernel developer seems to think in some
blog.

The problem is that when ps xav wants to know the RSS it reads statm,
so we just cannot hurt ps xav in order to show the "old shared"
information, which would be extremely slow to collect.

I was only unhappy about dropping the old feature completely instead
of providing it with a different new API. Now I think the solution Hugh
just proposed with the anon_rss should mimic the old behaviour well
enough, and it's probably the right way to go. It's still not literally
the same, but I doubt most people in userspace could notice the
difference, and most importantly it provides useful information: the
number of _physical_ pages mapped that aren't anonymous memory. This
is very valuable info and it's basically the same info that people were
getting from the old "shared". So I like it.
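
As a toy model of the semantics under discussion (structure and names
hypothetical, not Hugh's actual patch): anonymous faults bump both rss
and anon_rss, file faults bump only rss, and "shared" is derived in O(1)
with no pagetable walk:

```c
#include <assert.h>

/* Hypothetical model of the proposed accounting; not kernel code. */
struct toy_mm {
	long rss;	/* all resident pages */
	long anon_rss;	/* resident anonymous pages */
};

static void fault_anon(struct toy_mm *mm) { mm->rss++; mm->anon_rss++; }
static void fault_file(struct toy_mm *mm) { mm->rss++; }

/* "shared" = resident pages that are file-backed, not anonymous. */
static long statm_shared(const struct toy_mm *mm)
{
	return mm->rss - mm->anon_rss;	/* O(1), no pagetable scan */
}
```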

2004-10-15 16:41:40

by Albert Cahalan

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 2004-10-15 at 12:20, Andrea Arcangeli wrote:

> the problem is that when ps xav wants to know the RSS it reads statm,
> so we just cannot hurt ps xav to show the "old shared" information that
> would be extremely slow to collect.

Currently, ps uses /proc/*/stat for that. The /proc/*/statm
file is read to determine TRS and DRS, which are broken now.
That is, unless you count the "ps -o OL_m" format.

The top program uses /proc/*/statm for many more fields:

%MEM Memory usage (RES)
VIRT Virtual Image (kb)
SWAP Swapped size (kb)
RES Resident size (kb)
CODE Code size (kb)
DATA Data+Stack size (kb)
SHR Shared Mem size (kb)
nDRT Dirty Pages count

> I was only not happy about dropping the old feature completely instead
> of providing it with a different new API. Now I think the solution Hugh
> just proposed with the anon_rss should mimic the old behaviour well
> enough and it's probably the right way to go, it's still not literally
> the same, but I doubt most people from userspace could notice the
> difference, and most important it provides useful information, which is
> the number of _physical_ pages mapped that aren't anonymous memory, this
> is very valuable info and it's basically the same info that people were
> getting from the old "shared". So I like it.

What exactly would be the difference, and when might users see it?


2004-10-15 17:04:16

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 10:40:58AM -0400, Albert Cahalan wrote:
> Why not get rid of rss too then? It's overhead.

"rss" is O(1), old "shared" is O(N). With N being a few terabytes of
data divided by PAGE_SIZE...

On any desktop machine you couldn't notice the difference; it only
matters on the high end.

But I believe Hugh's proposal will make everyone happy. Anything that we
stuff in statm must have O(1) complexity (or at least O(log(N))),
otherwise ps v will nearly stop running on the big boxes.

2004-10-15 17:18:21

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> Currently, ps uses /proc/*/stat for that. The /proc/*/statm
> file is read to determine TRS and DRS, which are broken now.
> That is, unless you count the "ps -o OL_m" format.
> The top program uses /proc/*/statm for many more fields:

And here I refute every last field with a description of what 2.4.x
actually implemented and how its implementation renders the field
gibberish.


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> %MEM Memory usage (RES)
> RES Resident size (kb)

These would be the valid pages installed into pagetable entries,
which differs from RSS by racing with pagetable updates while counting
up pages one-by-one, where it could otherwise just use the RSS count in
the mm.


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> VIRT Virtual Image (kb)

This one isn't even close to VSZ in 2.4. This is the number of
allocated non-pte_none() ptes under vmas, which has no bearing on
any meaningful statistics, as not even pagetable space consumption
may be inferred from it due to alignment issues.


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> SWAP Swapped size (kb)

This isn't actually calculated by 2.4, and can't be inferred from it
either, as swapped pages are indistinguishable from reserved pages
according to /proc/, most prominently the zero page.


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> CODE Code size (kb)

This is the subset of the miscalculated RSS under VM_EXECUTABLE vmas,
hence it's a miscalculated subset of the RSS.


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> DATA Data+Stack size (kb)

This is the subset of the miscalculated RSS under vmas which are
VM_GROWSDOWN (incorrect for stack-grows-upward architectures) or lie
entirely below 0x60000000, which is pure gibberish.


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> SHR Shared Mem size (kb)

This is page_count(pte_page(pte)) > 1, which is gibberish on numerous
levels, as Hugh pointed out, owing to the references held by
pagecache, swapcache, and the like.
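
A toy illustration of that point (hypothetical structures, not kernel
code): a file page mapped by a single task still carries a pagecache
reference, so the page_count test reports it as shared while a
mapcount-based test does not:

```c
#include <assert.h>

/* Toy page with separate total-reference and mapping counts. */
struct toy_page {
	int count;	/* all references: mappings + pagecache/swapcache */
	int mapcount;	/* pagetable mappings only */
};

/* The 2.4-style test being criticized here. */
static int shared_24(const struct toy_page *p) { return p->count > 1; }

/* Mapcount test: shared iff mapped by more than one pagetable. */
static int shared_mapcount(const struct toy_page *p) { return p->mapcount > 1; }
```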


On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> nDRT Dirty Pages count

A count of dirty ptes, not dirty pages. This is meaningless for
most purposes except perhaps mmap() IO, and it's rather dubious
even then. These are not the dirty physical pages, or anything else
remotely meaningful to common scenarios. For instance, when userspace
uses a single buffer page for numerous write(2) calls, it actually
dirties large amounts of pagecache, but only dirties 1 page according
to this metric. When read/write sharing occurs, there is no trace of
other processes' having dirtied the page. Furthermore, ptes are dirtied
in numerous cases when the page has not been modified. For instance,
when the region has been mapped read/write but only a read fault was
taken. So our last field is meaningless.

Thus we are left with exactly zero fields which are not gibberish in 2.4,
and 2.4.x semantics have no leg left to stand upon.


-- wli

2004-10-15 17:24:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> Currently, ps uses /proc/*/stat for that. The /proc/*/statm

Maybe you mean /proc/pid/status; status has the RSS and most virtual
info too. I doubt stat has it as well (a trivial grep doesn't reveal it,
at least).

Anyway, my ps definitely reads /proc/*/statm with the 'v' option
(confirmed by strace).

open("/proc/1/stat", O_RDONLY) = 6
read(6, "1 (init) S 0 0 0 0 -1 256 207 16"..., 1023) = 199
close(6) = 0
open("/proc/1/statm", O_RDONLY) = 6
read(6, "147 17 111 111 0 36 0\n", 1023) = 22
close(6) = 0
open("/proc/1/status", O_RDONLY) = 6
read(6, "Name:\tinit\nState:\tS (sleeping)\nS"..., 1023) = 473
close(6) = 0
open("/proc/1/cmdline", O_RDONLY) = 6
read(6, "init [5]\0\0\0", 2047) = 11
close(6) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(3, 5), ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo
...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x401cd000
write(1, " PID TTY STAT TIME MAJF"..., 64) = 64
write(1, " 1 ? S 0:02 2"..., 67) = 67
stat64("/proc/2", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/2/stat", O_RDONLY) = 6
read(6, "2 (migration/0) S 1 0 0 0 -1 328"..., 1023) = 135
close(6) = 0

Perhaps we use different procps versions. I heard there is more than one
version, and I know you're maintaining one, but I never followed the
details; you surely know better than me why the above is happening on my
machine. But the point is that there are widely used apps opening statm
(top, as you mentioned), and those apps normally don't care about
"shared" (or at least they can't underflow), and those must run on the
big boxes too, and statm was basically unfixable.

> What exactly would be the difference, and when might users see it?

One difference, for example, is that it cannot take into account
shared anonymous pages generated by fork(). Other corner cases would
be possible too, but I believe it's a minor issue... so it's looking good.
Frankly I was thinking of shared as page_mapcount > 1, while all
pagecache is considered "shared", so Hugh's algorithm is a lot closer to
the old shared than what I thought originally. Perhaps Solaris also
implements it as rss - anon_rss (I mean, they must be facing similar
issues, and the report states the "shared" info is being reported with
the same semantics on Linux 2.4 and Solaris and some other unix... so
the theory that Hugh matches other unix even better than 2.4 sounds
reasonable).

2004-10-15 17:59:12

by Albert Cahalan

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 2004-10-15 at 13:13, Andrea Arcangeli wrote:
> On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> > Currently, ps uses /proc/*/stat for that. The /proc/*/statm
>
> maybe you mean /proc/pid/status, status has the RSS and most virtual
> info too, I doubt stat has it too (a trivial grep doesn't reveal it at
> least).

It's a popular value.

stat: in pages, between vsize and rss_rlim
statm: in pages, between size and share
status: in KiB as VmRSS
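
For reference, /proc/<pid>/statm at this point was seven space-separated
page counts; the field order below (size, resident, shared, text, lib,
data, dirty) is assumed from 2.6-era kernels, so treat this parser as a
sketch rather than authoritative documentation:

```c
#include <assert.h>
#include <stdio.h>

/* Seven statm fields, all counted in pages.  Field order assumed from
 * contemporary kernels; illustrative, not authoritative. */
struct statm_fields {
	long size, resident, shared, text, lib, data, dirty;
};

/* Returns 1 on a well-formed statm line, 0 otherwise. */
static int parse_statm(const char *line, struct statm_fields *f)
{
	return sscanf(line, "%ld %ld %ld %ld %ld %ld %ld",
		      &f->size, &f->resident, &f->shared,
		      &f->text, &f->lib, &f->data, &f->dirty) == 7;
}
```

Fed the init line from Andrea's strace earlier in this thread
("147 17 111 111 0 36 0"), it yields shared = 111 against resident = 17,
i.e. the overstated "shared" this thread opened with.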

> Anyways my ps definitely reads /proc/*/statm with the 'v' option
> (confirmed by strace).

Sure. That's not because of RSS. It's for TRS and DRS,
which are supposed to be RSS-like values specific to
text (code) and data.

The VM size of text is TSIZ, and of data is DSIZ.
These numbers, while useful, are not the same thing.

> Perhaps we use different procps versions. I heard there is more than one
> version, and I know you're maintaining one but I never followed the
> details, you sure know better than me why the above is happening on my
> machine. But the point is that there are widely used apps opening statm
> (top as you mentioned), and those apps normally don't care about
> "shared" (or at least they can't underflow), and those must run on the
> big boxes too, and statm was basically unfixable.

A user can configure top to display other columns if
he has a box that can't handle /proc/*/statm well.
The file will not be read if it is not needed.
Start top, then do:

f enters field modification screen
o disable VIRT
q disable RES
t disable SHR
n disable %MEM
enter exits field modification screen
W writes a ~/.toprc file

So, what is the problem again? :-)

> > What exactly would be the difference, and when might users see it?
>
> One difference, for example, is that it cannot take into account
> shared anonymous pages generated by fork(). Other corner cases would
> be possible too, but I believe it's a minor issue... so it's looking good.
> Frankly I was thinking shared as a page_mapcount > 1, while all
> pagecache is considered "shared", so Hugh's algorithm is a lot closer to
> the old shared than what I thought originally. Perhaps Solaris also
> implements it as rss - anon_rss (I mean, they must be facing similar
> issues, and the report states the "shared" info is being reported with
> the same semantics on linux 2.4 and solaris and some other unix... so
> the theory Hugh's matching other unix even better than 2.4 sounds
> reasonable).

Well, as long as it makes the users happy... I don't personally
care, except to say that I don't care to document all sorts
of kernel-specific variations. It gets hopelessly messy.


2004-10-15 18:16:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 01:51:56PM -0400, Albert Cahalan wrote:
> Sure. That's not because of RSS. It's for TRS and DRS,
> which are supposed to be RSS-like values specific to
> text (code) and data.

And they're not RSS-like right now if you pick them from statm; the only
RSS-like variable is rss itself in 2.6 ;).

*data = mm->total_vm - mm->shared_vm

All those are virtual, not physical. Dunno about 2.4, but I doubt 2.4
would be much different; rss + shared were the only physically driven
things in 2.4 IIRC. (Now only rss is left, and Hugh's patch adds
anon_rss back.)

to me TRS and DRS have always been _virtual_ when I read the ps output,
obviously since DRS tends to be orders of magnitude bigger than RSS
itself ;).
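
A minimal sketch of that point (the numbers below are invented; only the
data = total_vm - shared_vm relation comes from the line quoted above):
the 2.6 statm "data" value is purely virtual, so it can dwarf the
physical RSS:

```c
#include <assert.h>

/* Toy version of 2.6's statm "data" computation: derived entirely from
 * virtual counters, with no physical (RSS) component. */
static long statm_data(long total_vm, long shared_vm)
{
	return total_vm - shared_vm;
}
```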

> The VM size of text is TSIZ, and of data is DSIZ.
> These numbers, while useful, are not the same thing.

those should come out of statm pretty nicely.

> A user can configure top to display other columns if
> he has a box that can't handle /proc/*/statm well.
> The file will not be read if it is not needed.
> Start top, then do:
>
> f enters field modification screen
> o disable VIRT
> q disable RES
> t disable SHR
> n disable %MEM
> enter exits field modification screen
> W writes a ~/.toprc file
>
> So, what is the problem again? :-)

That you can't get those values efficiently. Even assuming you're OK
with dropping shared by disabling SHR, it wouldn't help without a kernel
API change.

> Well, as long as it makes the users happy... I don't personally
> care, except to say that I don't care to document all sorts
> of kernel-specific variations. It gets hopelessly messy.

Yep, I believe users could be happy with Hugh's rss-anon_rss variant.

2004-10-15 18:31:17

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 08:14:46PM +0200, Andrea Arcangeli wrote:
> that you can't get those values efficiently. Even assuming you're ok to
> drop shared by disabling SHR, it wouldn't help, without a kernel API
> change.

On Fri, Oct 15, 2004 at 01:51:56PM -0400, Albert Cahalan wrote:
>> Well, as long as it makes the users happy... I don't personally
>> care, except to say that I don't care to document all sorts
>> of kernel-specific variations. It gets hopelessly messy.

On Fri, Oct 15, 2004 at 08:14:46PM +0200, Andrea Arcangeli wrote:
> Yep, I believe users could be happy with Hugh's rss-anon_rss variant.

I just checked in with some Oracle people and the primary concern
is splitting up RSS into shared and private. Given either shared
or private the other is calculable.


-- wli

2004-10-15 18:43:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 11:30:25AM -0700, William Lee Irwin III wrote:
> I just checked in with some Oracle people and the primary concern
> is splitting up RSS into shared and private. Given either shared
> or private the other is calculable.

Can you define private a bit more? Is private page_count == 1 like in
2.4? Or is "private" == anonymous? That's the only question.

In Hugh's patch private == "anonymous". With 2.4 private == "page_count
== 1" (which is a subset of anonymous).

2004-10-15 18:47:46

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 11:30:25AM -0700, William Lee Irwin III wrote:
>> I just checked in with some Oracle people and the primary concern
>> is splitting up RSS into shared and private. Given either shared
>> or private the other is calculable.

On Fri, Oct 15, 2004 at 08:40:09PM +0200, Andrea Arcangeli wrote:
> can you define private a bit more? Is private the page_count == 1 like
> 2.4? Or is "private" == anonymous? that's the only question.
> In Hugh's patch private == "anonymous". With 2.4 private == "page_count
> == 1" (which is a subset of anonymous).

Private should be "anonymous" as far as I can tell. What's actually
going on is that they're trying to estimate per-process user memory
footprints so that the amount of client load that should be distributed
to a given box may be estimated from that. They at least used to
believe (I've since debunked this) that 2.4.x reported this information.
Their task (and hence our reporting) is not providing the complete
information to determine per-process memory footprints for general
workloads; rather, it's known up-front that no fork()-based COW sharing
is going on in Oracle's case, so in this case "anonymous" very happily
corresponds to "process-private". In fact, the /proc/ changes to report
threads only under the directory hierarchy of some distinguished thread
assist in this estimation effort.


-- wli

2004-10-15 19:27:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 11:47:13AM -0700, William Lee Irwin III wrote:
> workloads, rather it's known up-front that no fork()-based COW sharing
> is going on in Oracle's case, so in this case, "anonymous" very happily
> corresponds to "process-private". [..]

Ok fine.

> [..] In fact, the /proc/ changes to report
> threads only under the directory hierarchy of some distinguished thread
> assist in this estimation effort.

do you use threads now?

2004-10-15 19:38:03

by Albert Cahalan

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 2004-10-15 at 13:10, William Lee Irwin III wrote:
> On Fri, Oct 15, 2004 at 12:31:52PM -0400, Albert Cahalan wrote:
> > Currently, ps uses /proc/*/stat for that. The /proc/*/statm
> > file is read to determine TRS and DRS, which are broken now.
> > That is, unless you count the "ps -o OL_m" format.
> > The top program uses /proc/*/statm for many more fields:
>
> And here I refute every last field with a description of what 2.4.x
> actually implemented and how its implementation renders the field
> gibberish.
...
> Thus we are left with exactly zero fields which are not gibberish in 2.4,
> and 2.4.x semantics have no leg left to stand upon.

Many are reasonable.

Jim developed "top" partly with a 2.2.xx kernel. He had
avoided the statm values at first, for performance, but
went back to using them when he found that the numbers
made more sense than the status and stat numbers did.

It is only recently that Debian stopped defaulting to
the 2.2.xx kernel. This isn't ancient history anywhere
beyond the linux-kernel mailing list.

Changing the columns offered may screw people up.
Believe it or not, people actually use top in scripts.
(everybody together now: "Eeeeeeew!")

Let /proc/*/statm be as slow as it needs to be.
It'll work right on normal systems then, and the
Altix users can simply configure top to display
columns that don't involve the statm files.


2004-10-15 20:41:36

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 11:47:13AM -0700, William Lee Irwin III wrote:
>> workloads, rather it's known up-front that no fork()-based COW sharing
>> is going on in Oracle's case, so in this case, "anonymous" very happily
>> corresponds to "process-private". [..]

On Fri, Oct 15, 2004 at 09:23:13PM +0200, Andrea Arcangeli wrote:
> Ok fine.

On Fri, Oct 15, 2004 at 11:47:13AM -0700, William Lee Irwin III wrote:
>> [..] In fact, the /proc/ changes to report
>> threads only under the directory hierarchy of some distinguished thread
>> assists in this estimation effort.

On Fri, Oct 15, 2004 at 09:23:13PM +0200, Andrea Arcangeli wrote:
> do you use threads now?

I believe using threads to some extent has been an option for some time,
though not a commonly used one on Linux.


-- wli

2004-10-15 20:52:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 01:41:23PM -0700, William Lee Irwin III wrote:
> though not a commonly used one on Linux.

I guess this is also because it's a no-way on x86.

2004-10-15 21:17:06

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 11:47:13AM -0700, William Lee Irwin III wrote:
> Private should be "anonymous" as far as I can tell. What's actually
> going on is that they're trying to estimate per-process user memory
> footprints so that the amount of client load that should be distributed
> to a given box may be estimated from that. They at least used to
> believe (I've since debunked this) that 2.4.x reported this information.
> Their task (and hence our reporting) is not providing the complete
> information to determine per-process memory footprints for general
> workloads, rather it's known up-front that no fork()-based COW sharing
> is going on in Oracle's case, so in this case, "anonymous" very happily
> corresponds to "process-private". In fact, the /proc/ changes to report
> threads only under the directory hierarchy of some distinguished thread
> assist in this estimation effort.

Okay, I reached the very original source(s) of these requirements inside
Oracle, and they are more than satisfied with Hugh's patch, particularly
as I explained to them how it was actually more accurate than 2.4.x;
they're only waiting for ports to the vendor kernel(s) now.


-- wli

2004-10-15 21:28:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 02:16:26PM -0700, William Lee Irwin III wrote:
> they're only waiting for ports to the vendor kernel(s) now.

Ok fine. But first it has to be included into mainline, then of course
we'll merge it. Fixing Oracle at the expense of being incompatible with
the user-ABI of future 2.6 is a no-way.

2004-10-15 21:43:03

by William Lee Irwin III

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, Oct 15, 2004 at 02:16:26PM -0700, William Lee Irwin III wrote:
>> they're only waiting for ports to the vendor kernel(s) now.

On Fri, Oct 15, 2004 at 11:28:31PM +0200, Andrea Arcangeli wrote:
> Ok fine. But first it has to be included into mainline, then of course
> we'll merge it. Fixing Oracle at the expense of being incompatible with
> the user-ABI with future 2.6 is a no-way.

I've come to expect this as a requirement.

The thing I wanted to convey most was that I got an acknowledgment from
the original sources of Oracle's requirement (including the project
lead for the team that maintains the statistics collection kit that uses
the statistics to estimate the client capacity of a system, not just
whoever got the bug assigned to them inside Oracle) that Hugh's specific
implementation, the one we want to go with, also satisfies the user
requirements. They've even committed to runtime testing the patches to
verify the patch does everything they want it to.


-- wli

2004-10-15 22:05:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: per-process shared information

On Fri, 15 Oct 2004, William Lee Irwin III wrote:
> On Fri, Oct 15, 2004 at 11:28:31PM +0200, Andrea Arcangeli wrote:
> > Ok fine. But first it has to be included into mainline, then of course
> > we'll merge it. Fixing Oracle at the expense of being incompatible with
> > the user-ABI with future 2.6 is a no-way.
>
> The thing I wanted to convey most was that I got an acknowledgment from
> the original sources of Oracle's requirement, including the project
> lead for the team that maintains statistics collection kit that uses
> the statistics to estimate the client capacity of a system and not just
> whoever got the bug assigned to them inside Oracle, that Hugh's specific
> implementation we want to go with also satisfies the user requirements.
> They've even committed to runtime testing the patches to verify the
> patch does everything they want it to.

Andrea, Bill, great, thanks a lot for doing all the fieldwork on this.

After going through the discussions, I'm inclined to stick with the
patch as is, i.e. "correct the 2.6 bug" in the "shared" third field of
/proc/pid/statm, rather than adding this as another field on the end.

Once 2.6.9 is out and we're open for patches again, I'll break it
into the four parts originally outlined, and send to Andrew.

Hugh

2004-10-19 15:08:06

by Bill Davidsen

[permalink] [raw]
Subject: Re: per-process shared information

Andrea Arcangeli wrote:
> On Fri, Oct 15, 2004 at 09:19:13AM -0400, Albert Cahalan wrote:
>
>>I don't see why it is such trouble to provide the old data.
>
>
> I agree with you w.r.t. binary compatibility, here it's even a "source
> compatibility" matter, a recompile wouldn't fix it.
>
> However I wasn't exactly advocating to keep it 100% backwards
> compatible in this case: somebody already broke it from 2.5.x to
> 2.6.9-rc, and since there was a very good reason for that, we should
> probably declare it broken. Here there has been a very strong technical
> reason to break statm, but they didn't break binary and source
> compatibility gratuitously like some solaris kernel developer seems to
> think in some blog.
>
> the problem is that when ps xav wants to know the RSS it reads statm,
> so we just cannot hurt ps xav to show the "old shared" information that
> would be extremely slow to collect.
>
> I was only not happy about dropping the old feature completely instead
> of providing it with a different new API. Now I think the solution Hugh
> just proposed with the anon_rss should mimic the old behaviour well
> enough and it's probably the right way to go, it's still not literally
> the same, but I doubt most people from userspace could notice the
> difference, and most important it provides useful information, which is
> the number of _physical_ pages mapped that aren't anonymous memory, this
> is very valuable info and it's basically the same info that people were
> getting from the old "shared". So I like it.

I think that's clearly the right solution. Going to significant effort
to produce compatible but incorrect values and/or formats is not
desirable. I've seen this with users and applications, too, complaining
that the new output doesn't match the old, even when the old was clearly
wrong.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-10-19 15:16:34

by Bill Davidsen

[permalink] [raw]
Subject: Re: per-process shared information

Andrea Arcangeli wrote:
> On Fri, Oct 15, 2004 at 12:56:22PM +0100, Hugh Dickins wrote:
>
>>Well, it didn't quite mean that in 2.4: since any pagecache (including
>>swapcache) page mapped into a single task would have page_count 2 and
>>so be counted as shared.
>
>
> Well, doing page_mapcount + !PageAnon should do the trick. My point is
> that the confusion of an anon page going in swapcache (the one you
> mentioned) is easily fixable.
>
>
>>I think 2.4 was already trying to come up with a plausible simulacrum
>>of numbers that made sense to gather in 2.2, but the numbers had lost
>>their point, and it only had page_count to play with. Or maybe 2.2
>>was already trying to make up numbers to fit with 2.0...
>
>
> ;)
>
>
>>[..] and I don't want to
>>give the cpu unnecessary work, [..]
>
>
> this doesn't make much sense to me, then you should as well forbid
> the user to run main () { for(;;) }.
>
> Of course doing a sort of for(;;) in each pid of ps xav is overkill, but
> on demand easily killable and reschedule friendly would be no different
> than allowing a user to execute main () { for(;;) }. All you can do is
> to renice it or use CKRM to lower its cpu availability from the
> scheduler, or kill -9.
>
>
>>remove even the vma walk; but I accept you're being careful to propose
>>that we keep the overhead out of existing /proc/pid uses - right if
>>we have to go that way, but I just prefer to avoid the work myself.
>
>
> correct. I'd also prefer to avoid this work myself.
>
>
>>I hope that's what my patch would be sufficient to achieve.
>
>
> I hope too.
>
>
>>It would be unfair to say 2.4's numbers were actually a bug, but
>>certainly peculiar: I'm about as interested in exactly reproducing
>>their oddities as in building a replica of some antique furniture.
>
>
> sure the swapcache bit would need fixing ;)
>
>
>>Oh, if that's all we need, I can do a simpler patch ;)
>
>
> yep ;) though such a simpler patch would return nonsensical results; it
> would provide no information at all in some cases.
>
>
>>Interesting idea, and now (well, 2.6.9-mm heading to 2.6.10) we have
>>atomic_inc_return and atomic_dec_return supported on all architectures,
>>it should be possible to adjust an mm->shared_rss each time mapcount
>>goes up from 1 or down to 1, as well as adjusting nr_mapped count
>>as we do when it goes up from 0 or down to 0.
>
>
> I believe your approach will work fine, and it's much closer to the 2.4
> physical-driven semantics. It looks like "shared" really means
> not-anonymous memory, but accounted on a physical basis.
>
> However I wouldn't mind if you want to add a new field and to provide
> both the "virtual" shared like 2.6 right now, along the "physical"
> shared mimicking the semantics of 2.4 (they could both be in statm since
> they're O(1) to collect).

If it's cheap I'd like to see that. I don't know until I see the numbers
whether it will be useful information for system tuning, but many servers
(web, news, mail) offer fork vs. thread operation and sometimes show
non-intuitive performance results, so I'd like to have more data.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me