2005-09-21 12:19:17

by Frank van Maarseveen

[permalink] [raw]
Subject: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.

Signed-off-by: Frank van Maarseveen <[email protected]>

diff -ru a/arch/ppc64/kernel/vdso.c b/arch/ppc64/kernel/vdso.c
--- a/arch/ppc64/kernel/vdso.c 2005-09-21 11:05:11.000000000 +0200
+++ b/arch/ppc64/kernel/vdso.c 2005-09-21 11:17:06.053426000 +0200
@@ -272,6 +272,7 @@
}
mm->total_vm += (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
up_write(&mm->mmap_sem);
+ update_mem_hiwater(mm);

return 0;
}
diff -ru a/arch/x86_64/ia32/syscall32.c b/arch/x86_64/ia32/syscall32.c
--- a/arch/x86_64/ia32/syscall32.c 2005-09-21 11:05:24.000000000 +0200
+++ b/arch/x86_64/ia32/syscall32.c 2005-09-21 11:17:06.130574000 +0200
@@ -72,6 +72,7 @@
}
mm->total_vm += npages;
up_write(&mm->mmap_sem);
+ update_mem_hiwater(mm);
return 0;
}

diff -ru a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c 2005-09-21 11:06:43.000000000 +0200
+++ b/fs/exec.c 2005-09-21 11:17:06.218465000 +0200
@@ -1207,7 +1207,7 @@
/* execve success */
security_bprm_free(bprm);
acct_update_integrals(current);
- update_mem_hiwater(current);
+ update_mem_hiwater(current->mm);
kfree(bprm);
return retval;
}
diff -ru a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h 2005-09-14 14:27:39.000000000 +0200
+++ b/include/linux/mm.h 2005-09-21 11:17:06.327839000 +0200
@@ -949,7 +949,7 @@
}

/* update per process rss and vm hiwater data */
-extern void update_mem_hiwater(struct task_struct *tsk);
+extern void update_mem_hiwater(struct mm_struct *mm);

#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void
diff -ru a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c 2005-09-21 11:07:38.000000000 +0200
+++ b/kernel/exit.c 2005-09-21 11:17:06.395222000 +0200
@@ -839,7 +839,7 @@
preempt_count());

acct_update_integrals(tsk);
- update_mem_hiwater(tsk);
+ update_mem_hiwater(tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
del_timer_sync(&tsk->signal->real_timer);
diff -ru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2005-09-21 11:07:39.000000000 +0200
+++ b/kernel/sched.c 2005-09-21 11:17:06.462605000 +0200
@@ -2512,7 +2512,7 @@
/* Account for system time used */
acct_update_integrals(p);
/* Update rss highwater mark */
- update_mem_hiwater(p);
+ update_mem_hiwater(p->mm);
}

/*
diff -ru a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c 2005-09-21 11:07:40.000000000 +0200
+++ b/mm/memory.c 2005-09-21 11:17:06.602253000 +0200
@@ -2210,15 +2210,16 @@
* update_mem_hiwater
* - update per process rss and vm high water data
*/
-void update_mem_hiwater(struct task_struct *tsk)
+void update_mem_hiwater(struct mm_struct *mm)
{
- if (tsk->mm) {
- unsigned long rss = get_mm_counter(tsk->mm, rss);
+ unsigned long rss;

- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
+ if (likely(mm)) {
+ rss = get_mm_counter(mm, rss);
+ if (mm->hiwater_rss < rss)
+ mm->hiwater_rss = rss;
+ if (mm->hiwater_vm < mm->total_vm)
+ mm->hiwater_vm = mm->total_vm;
}
}

diff -ru a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c 2005-09-21 11:07:40.000000000 +0200
+++ b/mm/mmap.c 2005-09-21 11:17:06.755572000 +0200
@@ -854,6 +854,7 @@
mm->stack_vm += pages;
if (flags & (VM_RESERVED|VM_IO))
mm->reserved_vm += pages;
+ update_mem_hiwater(mm);
}
#endif /* CONFIG_PROC_FS */

@@ -1915,6 +1916,7 @@
vma_link(mm, vma, prev, rb_link, rb_parent);
out:
mm->total_vm += len >> PAGE_SHIFT;
+ update_mem_hiwater(mm);
if (flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
diff -ru a/mm/nommu.c b/mm/nommu.c
--- a/mm/nommu.c 2005-09-21 11:07:40.000000000 +0200
+++ b/mm/nommu.c 2005-09-21 12:26:02.166764000 +0200
@@ -839,6 +839,7 @@
show_process_blocks();
#endif

+ update_mem_hiwater(mm);
return (unsigned long) result;

error:
@@ -1079,16 +1080,16 @@
{
}

-void update_mem_hiwater(struct task_struct *tsk)
+void update_mem_hiwater(struct mm_struct *mm)
{
unsigned long rss;

- if (likely(tsk->mm)) {
- rss = get_mm_counter(tsk->mm, rss);
- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
+ if (likely(mm)) {
+ rss = get_mm_counter(mm, rss);
+ if (mm->hiwater_rss < rss)
+ mm->hiwater_rss = rss;
+ if (mm->hiwater_vm < mm->total_vm)
+ mm->hiwater_vm = mm->total_vm;
}
}


--
Frank


2005-09-21 14:39:26

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.

It would be a good idea to CC Christoph Lameter, who I believe was the
one who very intentionally moved most of these updates out to timer tick.
Is that significantly missing updates?

If it turns out that your patch is appropriate:

1. The change from tsk to mm is good (but not urgent 2.6.14 material).

2. You've missed the instance Dave Miller recently added in fs/compat.c.

3. If these are to be peppered back all over, then the places where
total_vm changes and the places where rss changes are almost completely
disjoint, so it's lazy to be calling one function to do both all over.

4. If you've noticed a regression, you must be one of the elite that knows
what these counters are used for: nothing in the kernel.org tree does.
Please add a comment saying what it is that uses them and how, so
developers can make better judgements about how best to maintain them.

5. Please add appropriate CONFIG, dummy macros etc., so that no time
is wasted on these updates in all the vanilla systems which have no
interest in them - but maybe Christoph already has that well in hand.

Sorry for the rifle fire, you did put your head above the parapet!

Hugh

2005-09-21 14:58:35

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 03:38:57PM +0100, Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> > This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.
>
> It would be a good idea to CC Christoph Lameter, who I believe was the
> one who very intentionally moved most of these updates out to timer tick.
> Is that significantly missing updates?

Apparently: I use a private patch (see below) to expose mm->hiwater_vm
to userland via /proc to detect misbehaving processes and for measuring
worst case memory use by programs. from 2.6.12 on I'm constantly seeing
a lot of processes with hiwater_vm < total_vm.

> 2. You've missed the instance Dave Miller recently added in fs/compat.c.

Thanks. I'll add that.

>
> 3. If these are to be peppered back all over, then the places where
> total_vm changes and the places where rss changes are almost completely
> disjoint, so it's lazy to be calling one function to do both all over.

I'll look into this.

>
> 5. Please add appropriate CONFIG, dummy macros etc., so that no time
> is wasted on these updates in all the vanilla systems which have no
> interest in them - but maybe Christoph already has that well in hand.

I'm not sure where it's being used for, other than how I use it in which
case it probably should depend on CONFIG_PROC_FS:

--- ./fs/proc/task_mmu.c.orig 2005-07-07 14:22:12.000000000 +0200
+++ ./fs/proc/task_mmu.c 2005-09-16 13:51:56.000000000 +0200
@@ -14,6 +14,7 @@
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
buffer += sprintf(buffer,
+ "VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
@@ -22,6 +23,7 @@
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
"VmPTE:\t%8lu kB\n",
+ mm->hiwater_vm << (PAGE_SHIFT-10),
(mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
get_mm_counter(mm, rss) << (PAGE_SHIFT-10),


--
Frank

2005-09-21 15:26:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Hugh Dickins wrote:

> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> > This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.
>
> It would be a good idea to CC Christoph Lameter, who I believe was the
> one who very intentionally moved most of these updates out to timer tick.

Right and we need to include Jay who introduced these in the first place.

What is the reason to move these counters back into the performance
critical vm paths? We agreed that some inaccuracy in these numbers was
acceptable.

2005-09-21 15:36:20

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> On Wed, Sep 21, 2005 at 03:38:57PM +0100, Hugh Dickins wrote:
> > On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> > > This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.
> >
> > It would be a good idea to CC Christoph Lameter, who I believe was the
> > one who very intentionally moved most of these updates out to timer tick.
> > Is that significantly missing updates?
>
> Apparently: I use a private patch (see below) to expose mm->hiwater_vm
> to userland via /proc to detect misbehaving processes and for measuring
> worst case memory use by programs. from 2.6.12 on I'm constantly seeing
> a lot of processes with hiwater_vm < total_vm.

Then just add an update_mem_hiwater() in your patch? On one level,
of course, that's just a hack which will hide the deficiency from you.
But on another level, I suspect it'd be good enough.

Let's hear what Christoph has to say.

> > 5. Please add appropriate CONFIG, dummy macros etc., so that no time
> > is wasted on these updates in all the vanilla systems which have no
> > interest in them - but maybe Christoph already has that well in hand.
>
> I'm not sure where it's being used for, other than how I use it in which
> case it probably should depend on CONFIG_PROC_FS:

CONFIG_FRANK_VM would be better.

> --- ./fs/proc/task_mmu.c.orig 2005-07-07 14:22:12.000000000 +0200
> +++ ./fs/proc/task_mmu.c 2005-09-16 13:51:56.000000000 +0200
> @@ -14,6 +14,7 @@
> text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
> lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> buffer += sprintf(buffer,
> + "VmPeak:\t%8lu kB\n"

Good naming.

> "VmSize:\t%8lu kB\n"
> "VmLck:\t%8lu kB\n"
> "VmRSS:\t%8lu kB\n"
> @@ -22,6 +23,7 @@
> "VmExe:\t%8lu kB\n"
> "VmLib:\t%8lu kB\n"
> "VmPTE:\t%8lu kB\n",
> + mm->hiwater_vm << (PAGE_SHIFT-10),
> (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
> mm->locked_vm << (PAGE_SHIFT-10),
> get_mm_counter(mm, rss) << (PAGE_SHIFT-10),

hiwater_vm would be much less contentious to update whenever total_vm
changes (perhaps in __vm_stat_account) than hiwater_rss. But tomorrow
I fear we'll be seeing a /proc patch from Frank Rss...

I do keep wondering what's so interesting about this hiwater_vm
(but would regret proposing other statistics to gather instead):
perhaps you're showing it just because it's there?

Hugh

2005-09-21 15:50:51

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 08:26:28AM -0700, Christoph Lameter wrote:
> On Wed, 21 Sep 2005, Hugh Dickins wrote:
>
> > On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> > > This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.
> >
> > It would be a good idea to CC Christoph Lameter, who I believe was the
> > one who very intentionally moved most of these updates out to timer tick.
>
> Right and we need to include Jay who introduced these in the first place.
>
> What is the reason to move these counters back into the performance
> critical vm paths? We agreed that some inaccuracy in these numbers was
> acceptable.

Try running the script below after exposing hiwater_vm via
/proc/pid/status as "VmPeak:" (see my other mail in this thread).

========
#!/bin/sh

for f in /proc/*/status
do
awk '
BEGIN{ state = 0; }
/^Name:/ { name = $2; ++state; }
/^Pid:/ { pid = $2; ++state; }
/^VmPeak:/ { x = $2; ++state; }
/^VmSize:/ { y = $2; ++state; }
END{ if (state == 4 && x < y)
printf("pid %d (%s): %d < %d\n", pid, name, x, y);
}
' <$f
done
========

On 2.6.12.2 it prints:

pid 13170 (sh): 3312 < 3336
pid 16621 (sh): 2916 < 2996
pid 1696 (getty): 1576 < 1580
pid 2338 (getty): 1576 < 1580
pid 2344 (getty): 1444 < 1580
pid 24497 (getty): 1576 < 1580
pid 27929 (knews.sh): 2984 < 2988
pid 6207 (rsh-title): 1156 < 2980
pid 6208 (rsh): 1952 < 2036
pid 16749 (awk): 1836 < 2368

On 2.6.13 it is much worse.

--
Frank

2005-09-21 16:58:32

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>
>>This fixes a post 2.6.11 regression in maintaining the mm->hiwater_* counters.
>
>
> It would be a good idea to CC Christoph Lameter, who I believe was the
> one who very intentionally moved most of these updates out to timer tick.
> Is that significantly missing updates?
>
> If it turns out that your patch is appropriate:
>
> 1. The change from tsk to mm is good (but not urgent 2.6.14 material).
>
> 2. You've missed the instance Dave Miller recently added in fs/compat.c.
>
> 3. If these are to be peppered back all over, then the places where
> total_vm changes and the places where rss changes are almost completely
> disjoint, so it's lazy to be calling one function to do both all over.
>
> 4. If you've noticed a regression, you must be one of the elite that knows
> what these counters are used for: nothing in the kernel.org tree does.
> Please add a comment saying what it is that uses them and how, so
> developers can make better judgements about how best to maintain them.
>
> 5. Please add appropriate CONFIG, dummy macros etc., so that no time
> is wasted on these updates in all the vanilla systems which have no
> interest in them - but maybe Christoph already has that well in hand.

It is used in enhanced system accounting. An obvious CONFIG would be
CONFIG_BSD_PROCESS_ACCT.

However, since the CONFIG flag is almost always Yes, people would need
to turn it off if they do not want system accounting. Would that be
OK?

Thanks,
- jay


>
> Sorry for the rifle fire, you did put your head above the parapet!
>
> Hugh
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2005-09-21 17:05:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Jay Lan wrote:

> > 5. Please add appropriate CONFIG, dummy macros etc., so that no time
> > is wasted on these updates in all the vanilla systems which have no
> > interest in them - but maybe Christoph already has that well in hand.
>
> It is used in enhanced system accounting. An obvious CONFIG would be
> CONFIG_BSD_PROCESS_ACCT.

Right. Make all the data fields and code dependent on an appropriate
CONFIG_XXX macro. We talked about that a couple of weeks ago as AFAIK.

I had a look at Frank's patch and it does not seem to touch the critical
paths. Jay: Can you verify that the changes do not affect critical paths
and that accounting is still working in the right way?

2005-09-21 17:12:25

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Jay Lan wrote:
> Hugh Dickins wrote:
> >
> > 4. If you've noticed a regression, you must be one of the elite that
> > knows
> > what these counters are used for: nothing in the kernel.org tree does.
> > Please add a comment saying what it is that uses them and how, so
> > developers can make better judgements about how best to maintain them.
> >
> > 5. Please add appropriate CONFIG, dummy macros etc., so that no time
> > is wasted on these updates in all the vanilla systems which have no
> > interest in them - but maybe Christoph already has that well in hand.
>
> It is used in enhanced system accounting. An obvious CONFIG would be
> CONFIG_BSD_PROCESS_ACCT.
>
> However, since the CONFIG flag is almost always Yes, people would need
> to turn it off if they do not want system accounting. Would that be
> OK?

Christoph will know the issue better than I. But I'd say no, that's
not OK. You have some out of tree "enhanced system accounting" which
has been granted the privilege of hooks within the mainline kernel:
they should be disabled unless a CONFIG option is switched on, which
your accounting patch can do. And the (mainline) Kconfig help entry
for that CONFIG option should point us to the source of your package.

Hugh

2005-09-21 17:32:36

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Jay Lan wrote:
>
>>Hugh Dickins wrote:
>>
>>>4. If you've noticed a regression, you must be one of the elite that
>>>knows
>>>what these counters are used for: nothing in the kernel.org tree does.
>>>Please add a comment saying what it is that uses them and how, so
>>>developers can make better judgements about how best to maintain them.
>>>
>>>5. Please add appropriate CONFIG, dummy macros etc., so that no time
>>>is wasted on these updates in all the vanilla systems which have no
>>>interest in them - but maybe Christoph already has that well in hand.
>>
>>It is used in enhanced system accounting. An obvious CONFIG would be
>>CONFIG_BSD_PROCESS_ACCT.
>>
>>However, since the CONFIG flag is almost always Yes, people would need
>>to turn it off if they do not want system accounting. Would that be
>>OK?
>
>
> Christoph will know the issue better than I. But I'd say no, that's
> not OK. You have some out of tree "enhanced system accounting" which
> has been granted the privilege of hooks within the mainline kernel:
> they should be disabled unless a CONFIG option is switched on, which
> your accounting patch can do. And the (mainline) Kconfig help entry
> for that CONFIG option should point us to the source of your package.

It is not really an issue of out-of-tree accounting package. The
system accounting is based on very old technology and needs improvement.
The issue we face is not an issue of one particular accounting package.

I think the best approach would be to wrap the mm usage accounting
in a new CONFIG_ENHANCED_SYS_ACCT and leave it OFF by default so that
people can still get the minimal accounting with
CONFIG_BSD_PROCESS_ACCT.

- jay

>
> Hugh
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2005-09-21 17:39:41

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Christoph Lameter wrote:
> On Wed, 21 Sep 2005, Jay Lan wrote:
>
>
>>>5. Please add appropriate CONFIG, dummy macros etc., so that no time
>>> is wasted on these updates in all the vanilla systems which have no
>>> interest in them - but maybe Christoph already has that well in hand.
>>
>>It is used in enhanced system accounting. An obvious CONFIG would be
>>CONFIG_BSD_PROCESS_ACCT.
>
>
> Right. Make all the data fields and code dependent on an appropriate
> CONFIG_XXX macro. We talked about that a couple of weeks ago as AFAIK.
>
> I had a look at Frank's patch and it does not seem to touch the critical
> paths. Jay: Can you verify that the changes do not affect critical paths
> and that accounting is still working in the right way?

Frank's patch looks fine to me except one place:
diff -ru a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c 2005-09-21 11:07:40.000000000 +0200
+++ b/mm/mmap.c 2005-09-21 11:17:06.755572000 +0200
@@ -854,6 +854,7 @@
mm->stack_vm += pages;
if (flags & (VM_RESERVED|VM_IO))
mm->reserved_vm += pages;
+ update_mem_hiwater(mm);
}
#endif /* CONFIG_PROC_FS */

I have a question of adding this call here. 'update_mem_hiwater'
does nothing unless mm->total_vm or rss gets updated.
I do not see total_vm get updates in __vm_stat_account()?

- jay

2005-09-21 17:45:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Jay Lan wrote:

> I think the best approach would be to wrap the mm usage accounting
> in a new CONFIG_ENHANCED_SYS_ACCT and leave it OFF by default so that
> people can still get the minimal accounting with
> CONFIG_BSD_PROCESS_ACCT.

Sounds okay.

2005-09-21 17:46:47

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Jay Lan wrote:
>
> It is not really an issue of out-of-tree accounting package. The
> system accounting is based on very old technology and needs improvement.
> The issue we face is not an issue of one particular accounting package.
>
> I think the best approach would be to wrap the mm usage accounting
> in a new CONFIG_ENHANCED_SYS_ACCT and leave it OFF by default so that
> people can still get the minimal accounting with
> CONFIG_BSD_PROCESS_ACCT.

Yes, please: that sounds right. With macros which dissolve to nothing
when it's off, to avoid #ifdef CONFIG_....s throughout the source.c.
#ifdef around the mm fields themselves? Probably best that way.

Still need a pointer in the Kconfig to some project that uses these.

Still sceptical that hiwater_vm and hiwater_rss are the magic
missing numbers which bring system accounting into the 21st century:
more to come?

Thanks,
Hugh

2005-09-21 18:06:23

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 10:39:01AM -0700, Jay Lan wrote:
>
> Frank's patch looks fine to me except one place:
> diff -ru a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c 2005-09-21 11:07:40.000000000 +0200
> +++ b/mm/mmap.c 2005-09-21 11:17:06.755572000 +0200
> @@ -854,6 +854,7 @@
> mm->stack_vm += pages;
> if (flags & (VM_RESERVED|VM_IO))
> mm->reserved_vm += pages;
> + update_mem_hiwater(mm);
> }
> #endif /* CONFIG_PROC_FS */
>
> I have a question of adding this call here. 'update_mem_hiwater'
> does nothing unless mm->total_vm or rss gets updated.
> I do not see total_vm get updates in __vm_stat_account()?

Check the callers of __vm_stat_account(). It is called at various places
after increasing (sometimes decreasing) vm_total.

BTW, __vm_stat_account() itself is surrounded by #ifdef CONFIG_PROC_FS so
I'd thought that that might be appropriate for hiwater_* too. So far they are
set but are not read except by my patch for /proc/pid/status.

in 2.6.13.2:

$ find -type f|xargs grep hiwater_rss
./kernel/fork.c: mm->hiwater_rss = get_mm_counter(mm,rss);
./include/linux/sched.h: unsigned long hiwater_rss; /* High-water RSS usage */
./mm/memory.c: if (tsk->mm->hiwater_rss < rss)
./mm/memory.c: tsk->mm->hiwater_rss = rss;
./mm/nommu.c: if (tsk->mm->hiwater_rss < rss)
./mm/nommu.c: tsk->mm->hiwater_rss = rss;

$ find -type f|xargs grep hiwater_vm
./kernel/fork.c: mm->hiwater_vm = mm->total_vm;
./include/linux/sched.h: unsigned long hiwater_vm; /* High-water virtual memory usage */
./mm/memory.c: if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
./mm/memory.c: tsk->mm->hiwater_vm = tsk->mm->total_vm;
./mm/nommu.c: if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
./mm/nommu.c: tsk->mm->hiwater_vm = tsk->mm->total_vm;

The matches in mm/memory.c and mm/nommu.c are in function update_mem_hiwater().

--
Frank

2005-09-21 18:05:56

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Jay Lan wrote:
>
>>It is not really an issue of out-of-tree accounting package. The
>>system accounting is based on very old technology and needs improvement.
>>The issue we face is not an issue of one particular accounting package.
>>
>>I think the best approach would be to wrap the mm usage accounting
>>in a new CONFIG_ENHANCED_SYS_ACCT and leave it OFF by default so that
>>people can still get the minimal accounting with
>>CONFIG_BSD_PROCESS_ACCT.
>
>
> Yes, please: that sounds right. With macros which dissolve to nothing
> when it's off, to avoid #ifdef CONFIG_....s throughout the source.c.
> #ifdef around the mm fields themselves? Probably best that way.

Yes, that was the plan.

>
> Still need a pointer in the Kconfig to some project that uses these.
>
> Still sceptical that hiwater_vm and hiwater_rss are the magic
> missing numbers which bring system accounting into the 21st century:
> more to come?

Nah, the hiwater things were part of my enhanced accounting patches.
The io part was done as a macro as we talked about now. These are
the mm part.

System accounting improvement is needed because those accouting
information were needed and used at real customer sites.

So, yes, the basic system accounting code (ie, BSD) needs
improvement, and, no, i am done with sys acct data collection.

Thanks,
- jay


>
> Thanks,
> Hugh

2005-09-21 18:26:30

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 11:05:14AM -0700, Jay Lan wrote:
>
> System accounting improvement is needed because those accouting
> information were needed and used at real customer sites.
>
> So, yes, the basic system accounting code (ie, BSD) needs
> improvement, and, no, i am done with sys acct data collection.

Ok, but how do you verify that the mm->hiwater_* counters are correct to
any degree? because it appears that at least hiwater_vm is incorrect in
post-2.6.11 and not by a small amount.


What about calling

static inline void grow_total_vm(struct mm_struct *mm, unsigned long increase)
{
mm->total_vm += increase;
if (mm->total_vm > mm->hiwater_vm)
mm->hiwater_vm = mm->total_vm;
}

whenever total_vm is increased and possibly doing something similar for rss at
different places? If it is not on the fast path then it's not necessary to
#ifdef the thing anywhere.

--
Frank

2005-09-21 18:38:42

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Frank van Maarseveen wrote:
> On Wed, Sep 21, 2005 at 10:39:01AM -0700, Jay Lan wrote:
>
>>Frank's patch looks fine to me except one place:
>>diff -ru a/mm/mmap.c b/mm/mmap.c
>>--- a/mm/mmap.c 2005-09-21 11:07:40.000000000 +0200
>>+++ b/mm/mmap.c 2005-09-21 11:17:06.755572000 +0200
>>@@ -854,6 +854,7 @@
>> mm->stack_vm += pages;
>> if (flags & (VM_RESERVED|VM_IO))
>> mm->reserved_vm += pages;
>>+ update_mem_hiwater(mm);
>> }
>> #endif /* CONFIG_PROC_FS */
>>
>>I have a question of adding this call here. 'update_mem_hiwater'
>>does nothing unless mm->total_vm or rss gets updated.
>>I do not see total_vm get updates in __vm_stat_account()?
>
>
> Check the callers of __vm_stat_account(). It is called at various places
> after increasing (sometimes decreasing) vm_total.

Sounds good.

- jay


>
> BTW, __vm_stat_account() itself is surrounded by #ifdef CONFIG_PROC_FS so
> I'd thought that that might be appropriate for hiwater_* too. So far they are
> set but are not read except by my patch for /proc/pid/status. >
> in 2.6.13.2:
>
> $ find -type f|xargs grep hiwater_rss
> ./kernel/fork.c: mm->hiwater_rss = get_mm_counter(mm,rss);
> ./include/linux/sched.h: unsigned long hiwater_rss; /* High-water RSS usage */
> ./mm/memory.c: if (tsk->mm->hiwater_rss < rss)
> ./mm/memory.c: tsk->mm->hiwater_rss = rss;
> ./mm/nommu.c: if (tsk->mm->hiwater_rss < rss)
> ./mm/nommu.c: tsk->mm->hiwater_rss = rss;
>
> $ find -type f|xargs grep hiwater_vm
> ./kernel/fork.c: mm->hiwater_vm = mm->total_vm;
> ./include/linux/sched.h: unsigned long hiwater_vm; /* High-water virtual memory usage */
> ./mm/memory.c: if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
> ./mm/memory.c: tsk->mm->hiwater_vm = tsk->mm->total_vm;
> ./mm/nommu.c: if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
> ./mm/nommu.c: tsk->mm->hiwater_vm = tsk->mm->total_vm;
>
> The matches in mm/memory.c and mm/nommu.c are in function update_mem_hiwater().
>

2005-09-21 18:42:13

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 04:35:56PM +0100, Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>
> > --- ./fs/proc/task_mmu.c.orig 2005-07-07 14:22:12.000000000 +0200
> > +++ ./fs/proc/task_mmu.c 2005-09-16 13:51:56.000000000 +0200
> > @@ -14,6 +14,7 @@
> > text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
> > lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> > buffer += sprintf(buffer,
> > + "VmPeak:\t%8lu kB\n"
>
> Good naming.
>
> > "VmSize:\t%8lu kB\n"
> > "VmLck:\t%8lu kB\n"
> > "VmRSS:\t%8lu kB\n"
> > @@ -22,6 +23,7 @@
> > "VmExe:\t%8lu kB\n"
> > "VmLib:\t%8lu kB\n"
> > "VmPTE:\t%8lu kB\n",
> > + mm->hiwater_vm << (PAGE_SHIFT-10),
> > (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
> > mm->locked_vm << (PAGE_SHIFT-10),
> > get_mm_counter(mm, rss) << (PAGE_SHIFT-10),
>
> I do keep wondering what's so interesting about this hiwater_vm
> (but would regret proposing other statistics to gather instead):
> perhaps you're showing it just because it's there?

No, hiwater_vm is exactly what we need. Before hiwater_vm I had my own patch
to get exactly the same result but the member was called "peak_vm" instead.

Most software we develop at my work uses this to test for regression
in memory usage. Just before program exit it reads /proc/pid/status:VmPeak
and reports it. Scripts do the rest.

--
Frank

2005-09-21 18:58:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>
> Most software we develop at my work uses this to test for regression
> in memory usage. Just before program exit it reads /proc/pid/status:VmPeak
> and reports it. Scripts do the rest.

That makes some sense. Thank you for explaining it.

Hugh

2005-09-21 19:19:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>
> What about calling
>
> static inline void grow_total_vm(struct mm_struct *mm, unsigned long increase)
> {
> mm->total_vm += increase;
> if (mm->total_vm > mm->hiwater_vm)
> mm->hiwater_vm = mm->total_vm;
> }
>
> whenever total_vm is increased and possibly doing something similar for rss at
> different places? If it is not on the fast path then it's not necessary to
> #ifdef the thing anywhere.

I think there's a good argument for separating hiwater_vm and hiwater_rss
completely (and you don't seem to be interested in hiwater_rss yourself).

hiwater_rss is on some fast paths: well, I don't see them as fast paths
myself (the page faults), but they are of exceptional concern to Christoph,
and the less we have to mess with struct mm at those points the happier he
is. I guess hiwater_rss should remain updated from the timer tick for now.

But I think you're right that hiwater_vm is best updated where total_vm
is: I'm not sure if it covers all cases completely (I think there's one
or two places which don't bother to call __vm_stat_account because they
believe it won't change anything), but in principle it would make lots of
sense to do it in the __vm_stat_account which typically follows adjusting
total_vm, as you did, and if possible nowhere else; rather than adding
your inline above.

Would you be satisfied with that, Christoph?

I should warn you that I'll shortly (shortly meaning in days rather
than hours) be sending Andrew a patch which will remove the "__" from
__vm_stat_account, since the old vm_stat_account is now hardly used.
I'm also rearranging the rss,anon_rss accounting. Maybe come back
to the hiwaters later on?

Hugh

2005-09-21 19:28:37

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 08:19:31PM +0100, Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
> >
> > What about calling
> >
> > static inline void grow_total_vm(struct mm_struct *mm, unsigned long increase)
> > {
> > mm->total_vm += increase;
> > if (mm->total_vm > mm->hiwater_vm)
> > mm->hiwater_vm = mm->total_vm;
> > }
> >
> > whenever total_vm is increased and possibly doing something similar for rss at
> > different places? If it is not on the fast path then it's not necessary to
> > #ifdef the thing anywhere.
>
...
> But I think you're right that hiwater_vm is best updated where total_vm
> is: I'm not sure if it covers all cases completely (I think there's one
> or two places which don't bother to call __vm_stat_account because they
> believe it won't change anything), but in principle it would make lots of
> sense to do it in the __vm_stat_account which typically follows adjusting
> total_vm, as you did, and if possible nowhere else; rather than adding
> your inline above.

But update_mem_hiwater() is called at various places too, and I guess that
covers merely the total_vm increase, not rss.

Maybe above inline should replace update_mem_hiwater()?

--
Frank

2005-09-21 19:38:50

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Frank van Maarseveen wrote:
> On Wed, Sep 21, 2005 at 08:19:31PM +0100, Hugh Dickins wrote:
>
>>On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>>
>>>What about calling
>>>
>>>static inline void grow_total_vm(struct mm_struct *mm, unsigned long increase)
>>>{
>>> mm->total_vm += increase;
>>> if (mm->total_vm > mm->hiwater_vm)
>>> mm->hiwater_vm = mm->total_vm;
>>>}
>>>
>>>whenever total_vm is increased and possibly doing something similar for rss at
>>>different places? If it is not on the fast path then it's not necessary to
>>>#ifdef the thing anywhere.
>>
> ...
>
>>But I think you're right that hiwater_vm is best updated where total_vm
>>is: I'm not sure if it covers all cases completely (I think there's one
>>or two places which don't bother to call __vm_stat_account because they
>>believe it won't change anything), but in principle it would make lots of
>>sense to do it in the __vm_stat_account which typically follows adjusting
>>total_vm, as you did, and if possible nowhere else; rather than adding
>>your inline above.
>
>
> But update_mem_hiwater() is called at various places too, and I guess that
> covers merely the total_vm increase, not rss.

That is not true. update_mem_hiwater() also updates hiwater_rss.

>
> Maybe above inline should replace update_mem_hiwater()?
>

2005-09-21 19:47:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Jay Lan wrote:
> Frank van Maarseveen wrote:
> >
> > But update_mem_hiwater() is called at various places too, and I guess
> > that covers merely the total_vm increase, not rss.
>
> That is not true. update_mem_hiwater() also updates hiwater_rss.
>
> > Maybe above inline should replace update_mem_hiwater()?

Yes, I don't understand what Frank meant there either.

Hugh

2005-09-21 19:48:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Hugh Dickins wrote:

>
> But I think you're right that hiwater_vm is best updated where total_vm
> is: I'm not sure if it covers all cases completely (I think there's one
> or two places which don't bother to call __vm_stat_account because they
> believe it won't change anything), but in principle it would make lots of
> sense to do it in the __vm_stat_account which typically follows adjusting
> total_vm, as you did, and if possible nowhere else; rather than adding
> your inline above.
>
> Would you be satisfied with that, Christoph?

Looks like this is staying off the critical code paths. So ok.

> I'm also rearranging the rss,anon_rss accounting. Maybe come back
> to the hiwaters later on?

I'd be very interested to see the rearrangement.

2005-09-21 19:48:35

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, Sep 21, 2005 at 12:38:08PM -0700, Jay Lan wrote:
> Frank van Maarseveen wrote:
> >On Wed, Sep 21, 2005 at 08:19:31PM +0100, Hugh Dickins wrote:
> >
> >>But I think you're right that hiwater_vm is best updated where total_vm
> >>is: I'm not sure if it covers all cases completely (I think there's one
> >>or two places which don't bother to call __vm_stat_account because they
> >>believe it won't change anything), but in principle it would make lots of
> >>sense to do it in the __vm_stat_account which typically follows adjusting
> >>total_vm, as you did, and if possible nowhere else; rather than adding
> >>your inline above.
> >
> >
> >But update_mem_hiwater() is called at various places too, and I guess that
> >covers merely the total_vm increase, not rss.
>
> That is not true. update_mem_hiwater() also updates hiwater_rss.

You're right.

But shouldn't hiwater_rss be updated via a totally different path? When rss
changes, total_vm doesn't and vice versa. So maybe there should be _two_
update functions.

--
Frank

2005-09-21 19:53:22

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>
> But shouldn't hiwater_rss be updated via a totally different path? When rss
> changes, total_vm doesn't and vice versa. So maybe there should be _two_
> update functions.

Absolutely, one of the points I made earlier.

Hugh

2005-09-21 20:06:59

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Christoph Lameter wrote:
> On Wed, 21 Sep 2005, Hugh Dickins wrote:
>
> Looks like this is staying off the critical code paths. So ok.

Great, thanks.

> > I'm also rearranging the rss,anon_rss accounting. Maybe come back
> > to the hiwaters later on?
>
> I'd be very interested to see the rearrangement.

Sorry, you'll be imagining something much grander than I meant.
And sorry, I have to work my way through splitting it out from
the total and then posting, can't cut it out for you right now.

But all I was meaning was, as part of the preparatory patches for
the narrowed and split page_table_lock, not included in last month's
prototype, I'm (a) switching to file_rss plus anon_rss rather than rss
including anon_rss, as we discussed earlier (you believe the SGI boxes
won't need that, but in general it seems silly to be doing two atomic
updates instead of one); and (b) moving the "freed" rss updating up
from inside the "tlb" operations (the prototype patch had the wrong
locking for that); and (c) batching the updates (i.e. one add or sub
instead of many incs or decs) per-page_table-page in copy_pte_range
as in zap_pte_range.

I don't expect you to disagree with it; but nor do I expect you
to be anything but disappointed, realizing that's all I meant.

Hugh

2005-09-21 20:13:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 21 Sep 2005, Hugh Dickins wrote:

> I don't expect you to disagree with it; but nor do I expect you
> to be anything but disappointed, realizing that's all I meant.

Oh dont worry, this sound good. Otherwise I only expect you to produce the
vm with the highest performance ever.... ;-)

2005-09-27 20:43:11

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Hugh Dickins wrote:
> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>
>>What about calling
>>
>>static inline void grow_total_vm(struct mm_struct *mm, unsigned long increase)
>>{
>> mm->total_vm += increase;
>> if (mm->total_vm > mm->hiwater_vm)
>> mm->hiwater_vm = mm->total_vm;
>>}
>>
>>whenever total_vm is increased and possibly doing something similar for rss at
>>different places? If it is not on the fast path then it's not necessary to
>>#ifdef the thing anywhere.
>
>
> I think there's a good argument for separating hiwater_vm and hiwater_rss
> completely (and you don't seem to be interested in hiwater_rss yourself).
>
> hiwater_rss is on some fast paths: well, I don't see them as fast paths
> myself (the page faults), but they are of exceptional concern to Christoph,
> and the less we have to mess with struct mm at those points the happier he
> is. I guess hiwater_rss should remain updated from the timer tick for now.
>
> But I think you're right that hiwater_vm is best updated where total_vm
> is: I'm not sure if it covers all cases completely (I think there's one
> or two places which don't bother to call __vm_stat_account because they
> believe it won't change anything), but in principle it would make lots of
> sense to do it in the __vm_stat_account which typically follows adjusting
> total_vm, as you did, and if possible nowhere else; rather than adding
> your inline above.

While in the work on separating hiwater_vm from hiwater_rss, i noticed
that __vm_stat_account() was not called in these functions where
total_vm was updated:
mm/mmap.c do_brk
mm/nommu.c do_mmap_pgoff
mm/nommu.c do_munmap
arch/ppc64/kernel/vdso.c arch_setup_additional_pages
arch/x86_64/ia32/syscall32.c syscall32_setup_pages

Frank tried to touch the latter two in his proposed patch.
Does it make sense we add __vm_stat_account() calls to the above
routines?

- jay


>
> Would you be satisfied with that, Christoph?
>
> I should warn you that I'll shortly (shortly meaning in days rather
> than hours) be sending Andrew a patch which will remove the "__" from
> __vm_stat_account, since the old vm_stat_account is now hardly used.
> I'm also rearranging the rss,anon_rss accounting. Maybe come back
> to the hiwaters later on?
>
> Hugh

2005-09-27 21:47:46

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Jay Lan wrote:
> Hugh Dickins wrote:
>
>> On Wed, 21 Sep 2005, Frank van Maarseveen wrote:
>>
>>> What about calling
>>>
>>> static inline void grow_total_vm(struct mm_struct *mm, unsigned long
>>> increase)
>>> {
>>> mm->total_vm += increase;
>>> if (mm->total_vm > mm->hiwater_vm)
>>> mm->hiwater_vm = mm->total_vm;
>>> }
>>>
>>> whenever total_vm is increased and possibly doing something similar
>>> for rss at
>>> different places? If it is not on the fast path then it's not
>>> necessary to
>>> #ifdef the thing anywhere.
>>
>>
>>
>> I think there's a good argument for separating hiwater_vm and hiwater_rss
>> completely (and you don't seem to be interested in hiwater_rss yourself).
>>
>> hiwater_rss is on some fast paths: well, I don't see them as fast paths
>> myself (the page faults), but they are of exceptional concern to
>> Christoph,
>> and the less we have to mess with struct mm at those points the
>> happier he
>> is. I guess hiwater_rss should remain updated from the timer tick for
>> now.
>>
>> But I think you're right that hiwater_vm is best updated where total_vm
>> is: I'm not sure if it covers all cases completely (I think there's one
>> or two places which don't bother to call __vm_stat_account because they
>> believe it won't change anything), but in principle it would make lots of
>> sense to do it in the __vm_stat_account which typically follows adjusting
>> total_vm, as you did, and if possible nowhere else; rather than adding
>> your inline above.

Just looked at the __vm_stat_account() code. It is enclosed inside
#ifdef CONFIG_PROC_FS.

If that is necessary, i can not put hiwater_vm update code in there. The
system accounting code should not be dependent on a config flag that has
nothing to do with system accounting.

Thanks,
- jay

>
>
> While in the work on separating hiwater_vm from hiwater_rss, i noticed
> that __vm_stat_account() was not called in these functions where
> total_vm was updated:
> mm/mmap.c do_brk
> mm/nommu.c do_mmap_pgoff
> mm/nommu.c do_munmap
> arch/ppc64/kernel/vdso.c arch_setup_additional_pages
> arch/x86_64/ia32/syscall32.c syscall32_setup_pages
>
> Frank tried to touch the latter two in his proposed patch.
> Does it make sense we add __vm_stat_account() calls to the above
> routines?
>
> - jay
>
>
>>
>> Would you be satisfied with that, Christoph?
>>
>> I should warn you that I'll shortly (shortly meaning in days rather
>> than hours) be sending Andrew a patch which will remove the "__" from
>> __vm_stat_account, since the old vm_stat_account is now hardly used.
>> I'm also rearranging the rss,anon_rss accounting. Maybe come back
>> to the hiwaters later on?
>>
>> Hugh
>
>

2005-09-27 21:50:44

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Tue, 27 Sep 2005, Jay Lan wrote:

> Just looked at the __vm_stat_account() code. It is enclosed inside
> #ifdef CONFIG_PROC_FS.
>
> If that is necessary, i can not put hiwater_vm update code in there. The
> system accounting code should not be dependent on a config flag that has
> nothing to do with system accounting.

I doubt you can do accounting without having /proc. Dont you need to
read/write files in /proc? Can we make accounting dependent on /proc?

2005-09-27 22:43:46

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Christoph Lameter wrote:
> On Tue, 27 Sep 2005, Jay Lan wrote:
>
>
>>Just looked at the __vm_stat_account() code. It is enclosed inside
>>#ifdef CONFIG_PROC_FS.
>>
>>If that is necessary, i can not put hiwater_vm update code in there. The
>>system accounting code should not be dependent on a config flag that has
>>nothing to do with system accounting.
>
>
> I doubt you can do accounting without having /proc. Dont you need to
> read/write files in /proc? Can we make accounting dependent on /proc?

Right, system accounting code (BSD_PROCESS_ACCT) does not depend on
/proc.

In my opinion, those "everyone should say Y here" config flags should
not be config flags at all. Since it is a flag, i should not put
sys accounting code in there.

2005-09-28 13:19:04

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Tue, 27 Sep 2005, Jay Lan wrote:
>
> While in the work on separating hiwater_vm from hiwater_rss, i noticed
> that __vm_stat_account() was not called in these functions where
> total_vm was updated:
> mm/mmap.c do_brk
> mm/nommu.c do_mmap_pgoff
> mm/nommu.c do_munmap
> arch/ppc64/kernel/vdso.c arch_setup_additional_pages
> arch/x86_64/ia32/syscall32.c syscall32_setup_pages
>
> Frank tried to touch the latter two in his proposed patch.
> Does it make sense we add __vm_stat_account() calls to the above
> routines?

Probably not. Partly because of the PROCFS issue you noticed after.
And partly because that's a long list, whereas the evidence is that
__vm_stat_account is nowadays being called in the places that need it.

What, you couldn't find the call to __vm_stat_account next to
updating total_vm in do_mmap_pgoff :-? And I don't see that total_vm
is updated in do_munmap, that and the vm_stat_account have to be done
at a lower, per-vma level. Haven't looked at the others yet.

I think, better leave hiwater_vm to me for now. Since I've got that
patchset in the next -mm, it's maybe easier for me to put something
together on top of that. And it seems quite possible (haven't checked
yet) that actually we should be moving the update of total_vm in to the
core from out all over, rather than splattering hiwater_vm updates about.

I'll take a look in the next couple of days.

Hugh

2005-09-28 17:31:34

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Hugh Dickins wrote:
>On Tue, 27 Sep 2005, Jay Lan wrote:
>
>>While in the work on separating hiwater_vm from hiwater_rss, i noticed
>>that __vm_stat_account() was not called in these functions where
>>total_vm was updated:
>>mm/mmap.c do_brk
>>mm/nommu.c do_mmap_pgoff
>>mm/nommu.c do_munmap
>>arch/ppc64/kernel/vdso.c arch_setup_additional_pages
>>arch/x86_64/ia32/syscall32.c syscall32_setup_pages
>>
>>Frank tried to touch the latter two in his proposed patch.
>>Does it make sense we add __vm_stat_account() calls to the above
>>routines?
>>
>
>Probably not. Partly because of the PROCFS issue you noticed after.
>And partly because that's a long list, whereas the evidence is that
>__vm_stat_account is nowadays being called in the places that need it.
>
>What, you couldn't find the call to __vm_stat_account next to
>updating total_vm in do_mmap_pgoff :-? And I don't see that total_vm
>is updated in do_munmap, that and the vm_stat_account have to be done
>at a lower, per-vma level. Haven't looked at the others yet.
>

There are two do_mmap_pgoff, one in mm/mmap.c and the other in
mm/nommu.c. It was the mm/nommu.c version that does not have
a call to __vm_stat_account.

I am not concerned about do_munmap any more from updating
hiwater_vm's concern since the total_vm gets decreased, not increased.

>I think, better leave hiwater_vm to me for now. Since I've got that
>
Sounds good to me.

I agreed with you (and Frank) that it makes sense to split hiwater_vm and
hiwater_rss update.
>patchset in the next -mm, it's maybe easier for me to put something
>together on top of that. And it seems quite possible (haven't checked
>yet) that actually we should be moving the update of total_vm in to the
>core from out all over, rather than splattering hiwater_vm updates about
>
>I'll take a look in the next couple of days.
>
>Hugh
>

In case you are going to turn hiwater_vm update into a macro and hide it
in a config flag, here is what i have that might be useful.

Thanks,
- jay


Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig 2005-09-22 14:14:54.105262867 -0700
+++ linux/init/Kconfig 2005-09-22 14:57:19.326784560 -0700
@@ -162,6 +162,19 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is
available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.

+config ENHANCED_SYS_ACCT
+ bool "Enhanced System Accounting support"
+ depends on BSD_PROCESS_ACCT
+ default n
+ help
+ If you say Y here, some extra system accounting data will be
+ collected and updated to task struct as well as task->mm struct.
+ These information are used and presented in enhanced system
+ accounting packages such as Comprehensive System Accounting (CSA).
+ This activity may force cachelines into shared from exclusive
+ on a processor. The CSA project can be viewed at
+ <http://oss.sgi.com/projects/csa/>.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2005-09-22 14:15:07.465757425 -0700
+++ linux/mm/memory.c 2005-09-27 16:55:28.438702812 -0700
@@ -2206,21 +2206,21 @@ unsigned long vmalloc_to_pfn(void * vmal

EXPORT_SYMBOL(vmalloc_to_pfn);

+#ifdef CONFIG_ENHANCED_SYS_ACCT
/*
- * update_mem_hiwater
- * - update per process rss and vm high water data
+ * update_rss_hiwater
+ * - update per process rss high water data
*/
-void update_mem_hiwater(struct task_struct *tsk)
+void update_rss_hiwater(struct mm_struct *mm)
{
- if (tsk->mm) {
- unsigned long rss = get_mm_counter(tsk->mm, rss);
+ if (likely(mm)) {
+ unsigned long rss = get_mm_counter(mm, rss);

- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
+ if (mm->hiwater_rss < rss)
+ mm->hiwater_rss = rss;
}
}
+#endif

#if !defined(__HAVE_ARCH_GATE_AREA)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2005-08-28 16:41:01.000000000 -0700
+++ linux/include/linux/mm.h 2005-09-27 17:33:49.372846334 -0700
@@ -949,7 +949,18 @@ static inline void vm_stat_unaccount(str
}

/* update per process rss and vm hiwater data */
-extern void update_mem_hiwater(struct task_struct *tsk);
+#ifdef CONFIG_ENHANCED_SYS_ACCT
+static inline void update_hiwater_vm(struct mm_struct *mm)
+{
+ if (mm->hiwater_vm < mm->total_vm)
+ mm->hiwater_vm = mm->total_vm;
+}
+
+extern void update_hiwater_rss(struct mm_struct *mm);
+#else
+#define update_hiwater_vm(x) do { } while (0)
+#define update_hiwater_rss(x) do { } while (0)
+#endif

#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void

2005-10-03 19:53:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Wed, 28 Sep 2005, Jay Lan wrote:
>
> I am not concerned about do_munmap any more from updating
> hiwater_vm's concern since the total_vm gets decreased, not increased.

Many thanks for that remark, it inspired me to do the exact opposite!

> In case you are going to turn hiwater_vm update into a macro and hide it
> in a config flag, here is what i have that might be useful.

And thanks for your patch, which I've factored in with Frank's, and come
come up with the one below, against 2.6.14-rc2-mm2. Which uses no CONFIG
flag: I think we're enough out of the fast paths that it's not needed.

See comment in fs/proc/task_mmu.c for the principle. Could maintain
hiwater_vm straightforwardly, but I think it's easier to remember if
we handle them both in the same way.

I did look into doing the total_vm increment and calling vm_stat_account
in insert_vm_struct, but concluded it solved no particular problem, and
raised some questions (where architectures, notably ia64, have special
vmas which they may have good reason to leave out of total_vm).

I haven't cross-checked the mm_struct cacheline rearrangement yet,
it looks plausible, but could easily turn out to straddle boundaries.

Christoph, Frank, Jay: does this patch look like it fits your needs?

Hugh

--- 2.6.14-rc2-mm2/fs/compat.c 2005-09-21 12:16:40.000000000 +0100
+++ linux/fs/compat.c 2005-10-03 19:42:49.000000000 +0100
@@ -1490,7 +1490,6 @@ int compat_do_execve(char * filename,
/* execve success */
security_bprm_free(bprm);
acct_update_integrals(current);
- update_mem_hiwater(current);
kfree(bprm);
return retval;
}
--- 2.6.14-rc2-mm2/fs/exec.c 2005-09-30 11:59:08.000000000 +0100
+++ linux/fs/exec.c 2005-10-03 19:42:49.000000000 +0100
@@ -1207,7 +1207,6 @@ int do_execve(char * filename,
/* execve success */
security_bprm_free(bprm);
acct_update_integrals(current);
- update_mem_hiwater(current);
kfree(bprm);
return retval;
}
--- 2.6.14-rc2-mm2/fs/proc/task_mmu.c 2005-09-30 11:59:09.000000000 +0100
+++ linux/fs/proc/task_mmu.c 2005-10-03 19:42:49.000000000 +0100
@@ -14,22 +14,41 @@
char *task_mem(struct mm_struct *mm, char *buffer)
{
unsigned long data, text, lib;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;

+ /*
+ * Note: to minimize their overhead, mm maintains hiwater_vm and
+ * hiwater_rss only when about to *lower* total_vm or rss. Any
+ * collector of these hiwater stats must therefore get total_vm
+ * and rss too, which will usually be the higher. Barriers? not
+ * worth the effort, such snapshots can always be inconsistent.
+ */
+ hiwater_vm = total_vm = mm->total_vm;
+ if (hiwater_vm < mm->hiwater_vm)
+ hiwater_vm = mm->hiwater_vm;
+ hiwater_rss = total_rss = get_mm_rss(mm);
+ if (hiwater_rss < mm->hiwater_rss)
+ hiwater_rss = mm->hiwater_rss;
+
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
buffer += sprintf(buffer,
+ "VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
+ "VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
"VmPTE:\t%8lu kB\n",
- (mm->total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
+ hiwater_vm << (PAGE_SHIFT-10),
+ (total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- get_mm_rss(mm) << (PAGE_SHIFT-10),
+ hiwater_rss << (PAGE_SHIFT-10),
+ total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
(PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
--- 2.6.14-rc2-mm2/include/linux/mm.h 2005-09-30 11:59:11.000000000 +0100
+++ linux/include/linux/mm.h 2005-10-03 19:42:49.000000000 +0100
@@ -945,9 +945,6 @@ static inline void vm_stat_account(struc
}
#endif /* CONFIG_PROC_FS */

-/* update per process rss and vm hiwater data */
-extern void update_mem_hiwater(struct task_struct *tsk);
-
#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void
kernel_map_pages(struct page *page, int numpages, int enable)
--- 2.6.14-rc2-mm2/include/linux/sched.h 2005-09-30 11:59:11.000000000 +0100
+++ linux/include/linux/sched.h 2005-10-03 19:42:49.000000000 +0100
@@ -245,6 +245,16 @@ extern void arch_unmap_area_topdown(stru
#define dec_mm_counter(mm, member) (mm)->_##member--
#define get_mm_rss(mm) ((mm)->_file_rss + (mm)->_anon_rss)

+#define update_hiwater_rss(mm) do { \
+ unsigned long _rss = get_mm_rss(mm); \
+ if ((mm)->hiwater_rss < _rss) \
+ (mm)->hiwater_rss = _rss; \
+} while (0)
+#define update_hiwater_vm(mm) do { \
+ if ((mm)->hiwater_vm < (mm)->total_vm) \
+ (mm)->hiwater_vm = (mm)->total_vm; \
+} while (0)
+
typedef unsigned long mm_counter_t;

struct mm_struct {
@@ -270,16 +280,19 @@ struct mm_struct {
* by mmlist_lock
*/

- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long total_vm, locked_vm, shared_vm;
- unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
-
/* Special counters protected by the page_table_lock */
mm_counter_t _file_rss;
mm_counter_t _anon_rss;

+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */
+
+ unsigned long total_vm, locked_vm, shared_vm, exec_vm;
+ unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;
+
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

unsigned dumpable:2;
@@ -299,11 +312,7 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
-
struct kioctx default_kioctx;
-
- unsigned long hiwater_rss; /* High-water RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
};

struct sighand_struct {
--- 2.6.14-rc2-mm2/kernel/exit.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/kernel/exit.c 2005-10-03 19:42:49.000000000 +0100
@@ -839,7 +839,10 @@ fastcall NORET_TYPE void do_exit(long co
preempt_count());

acct_update_integrals(tsk);
- update_mem_hiwater(tsk);
+ if (tsk->mm) {
+ update_hiwater_rss(tsk->mm);
+ update_hiwater_vm(tsk->mm);
+ }
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
del_timer_sync(&tsk->signal->real_timer);
--- 2.6.14-rc2-mm2/kernel/sched.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/kernel/sched.c 2005-10-03 19:42:49.000000000 +0100
@@ -2603,8 +2603,6 @@ void account_system_time(struct task_str
cpustat->idle = cputime64_add(cpustat->idle, tmp);
/* Account for system time used */
acct_update_integrals(p);
- /* Update rss highwater mark */
- update_mem_hiwater(p);
}

/*
--- 2.6.14-rc2-mm2/mm/fremap.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/fremap.c 2005-10-03 19:42:49.000000000 +0100
@@ -20,13 +20,12 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-static inline void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+static int zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
pte_t pte = *ptep;
+ int did_zap_file_pte = 0;

- if (pte_none(pte))
- return;
if (pte_present(pte)) {
unsigned long pfn = pte_pfn(pte);

@@ -39,7 +38,7 @@ static inline void zap_pte(struct mm_str
set_page_dirty(page);
page_remove_rmap(page);
page_cache_release(page);
- dec_mm_counter(mm, file_rss);
+ did_zap_file_pte = 1;
}
}
} else {
@@ -47,6 +46,8 @@ static inline void zap_pte(struct mm_str
free_swap_and_cache(pte_to_swp_entry(pte));
pte_clear(mm, addr, ptep);
}
+
+ return did_zap_file_pte;
}

/*
@@ -90,9 +91,9 @@ int install_page(struct mm_struct *mm, s
if (!page->mapping || page->index >= size)
goto err_unlock;

- zap_pte(mm, vma, addr, pte);
+ if (pte_none(*pte) || !zap_pte(mm, vma, addr, pte))
+ inc_mm_counter(mm, file_rss);

- inc_mm_counter(mm, file_rss);
flush_icache_page(vma, page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
page_add_file_rmap(page);
@@ -137,7 +138,10 @@ int install_file_pte(struct mm_struct *m
if (!pte)
goto err_unlock;

- zap_pte(mm, vma, addr, pte);
+ if (!pte_none(*pte) && zap_pte(mm, vma, addr, pte)) {
+ update_hiwater_rss(mm);
+ dec_mm_counter(mm, file_rss);
+ }

set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
pte_val = *pte;
--- 2.6.14-rc2-mm2/mm/hugetlb.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/hugetlb.c 2005-10-03 19:42:49.000000000 +0100
@@ -309,6 +309,9 @@ void unmap_hugepage_range(struct vm_area
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);

+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
if (! ptep)
--- 2.6.14-rc2-mm2/mm/memory.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/memory.c 2005-10-03 19:42:49.000000000 +0100
@@ -800,6 +800,7 @@ unsigned long zap_page_range(struct vm_a
lru_add_drain();
spin_lock(&mm->page_table_lock);
tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
tlb_finish_mmu(tlb, address, end);
spin_unlock(&mm->page_table_lock);
@@ -2205,22 +2206,6 @@ unsigned long vmalloc_to_pfn(void * vmal

EXPORT_SYMBOL(vmalloc_to_pfn);

-/*
- * update_mem_hiwater
- * - update per process rss and vm high water data
- */
-void update_mem_hiwater(struct task_struct *tsk)
-{
- if (tsk->mm) {
- unsigned long rss = get_mm_rss(tsk->mm);
-
- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
- }
-}
-
#if !defined(__HAVE_ARCH_GATE_AREA)

#if defined(AT_SYSINFO_EHDR)
--- 2.6.14-rc2-mm2/mm/mmap.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/mmap.c 2005-10-03 19:42:49.000000000 +0100
@@ -1625,6 +1625,8 @@ find_extend_vma(struct mm_struct * mm, u
*/
static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
{
+ /* Update high watermark before we lower total_vm */
+ update_hiwater_vm(mm);
do {
long nrpages = vma_pages(vma);

@@ -1653,6 +1655,7 @@ static void unmap_region(struct mm_struc
lru_add_drain();
spin_lock(&mm->page_table_lock);
tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
@@ -1938,6 +1941,7 @@ void exit_mmap(struct mm_struct *mm)

flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
+ /* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, mm, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
--- 2.6.14-rc2-mm2/mm/mremap.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/mremap.c 2005-10-03 19:42:49.000000000 +0100
@@ -167,6 +167,7 @@ static unsigned long move_vma(struct vm_
unsigned long new_pgoff;
unsigned long moved_len;
unsigned long excess = 0;
+ unsigned long hiwater_vm;
int split = 0;

/*
@@ -205,9 +206,15 @@ static unsigned long move_vma(struct vm_
}

/*
- * if we failed to move page tables we still do total_vm increment
- * since do_munmap() will decrement it by old_len == new_len
+ * If we failed to move page tables we still do total_vm increment
+ * since do_munmap() will decrement it by old_len == new_len.
+ *
+ * Since total_vm is about to be raised artificially high for a
+ * moment, we need to restore high watermark afterwards: if stats
+ * are taken meanwhile, total_vm and hiwater_vm appear too high.
+ * If this were a serious issue, we'd add a flag to do_munmap().
*/
+ hiwater_vm = mm->hiwater_vm;
mm->total_vm += new_len >> PAGE_SHIFT;
vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT);

@@ -216,6 +223,7 @@ static unsigned long move_vma(struct vm_
vm_unacct_memory(excess >> PAGE_SHIFT);
excess = 0;
}
+ mm->hiwater_vm = hiwater_vm;

/* Restore VM_ACCOUNT if one or two pieces of vma left */
if (excess) {
--- 2.6.14-rc2-mm2/mm/nommu.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/nommu.c 2005-10-03 19:42:49.000000000 +0100
@@ -239,6 +239,7 @@ void vunmap(void *addr)
asmlinkage unsigned long sys_brk(unsigned long brk)
{
struct mm_struct *mm = current->mm;
+ long grow;

if (brk < mm->start_brk || brk > mm->context.end_brk)
return mm->brk;
@@ -246,6 +247,12 @@ asmlinkage unsigned long sys_brk(unsigne
if (mm->brk == brk)
return mm->brk;

+ /* Update high watermark before we may lower total_vm */
+ update_hiwater_vm(mm);
+
+ grow = PAGE_ALIGN(brk) - PAGE_ALIGN(mm->brk);
+ mm->total_vm += grow / PAGE_SIZE;
+
/*
* Always allow shrinking brk
*/
@@ -679,6 +686,7 @@ unsigned long do_mmap_pgoff(struct file
unsigned long flags,
unsigned long pgoff)
{
+ struct mm_struct *mm = current->mm;
struct vm_list_struct *vml = NULL;
struct vm_area_struct *vma = NULL;
struct rb_node *rb;
@@ -813,7 +821,7 @@ unsigned long do_mmap_pgoff(struct file
realalloc += kobjsize(vma);
askedalloc += sizeof(*vma);

- current->mm->total_vm += len >> PAGE_SHIFT;
+ mm->total_vm += len >> PAGE_SHIFT;

add_nommu_vma(vma);

@@ -821,8 +829,8 @@ unsigned long do_mmap_pgoff(struct file
realalloc += kobjsize(vml);
askedalloc += sizeof(*vml);

- vml->next = current->mm->context.vmlist;
- current->mm->context.vmlist = vml;
+ vml->next = mm->context.vmlist;
+ mm->context.vmlist = vml;

up_write(&nommu_vma_sem);

@@ -1075,19 +1083,6 @@ void arch_unmap_area(struct mm_struct *m
{
}

-void update_mem_hiwater(struct task_struct *tsk)
-{
- unsigned long rss;
-
- if (likely(tsk->mm)) {
- rss = get_mm_rss(tsk->mm);
- if (tsk->mm->hiwater_rss < rss)
- tsk->mm->hiwater_rss = rss;
- if (tsk->mm->hiwater_vm < tsk->mm->total_vm)
- tsk->mm->hiwater_vm = tsk->mm->total_vm;
- }
-}
-
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen,
int even_cows)
--- 2.6.14-rc2-mm2/mm/rmap.c 2005-09-30 11:59:12.000000000 +0100
+++ linux/mm/rmap.c 2005-10-03 19:42:49.000000000 +0100
@@ -543,6 +543,9 @@ static int try_to_unmap_one(struct page
if (pte_dirty(pteval))
set_page_dirty(page);

+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
/*
@@ -633,6 +636,9 @@ static void try_to_unmap_cluster(unsigne
if (!pmd_present(*pmd))
goto out_unlock;

+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
for (original_pte = pte = pte_offset_map(pmd, address);
address < end; pte++, address += PAGE_SIZE) {

2005-10-04 08:59:41

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Mon, Oct 03, 2005 at 08:53:09PM +0100, Hugh Dickins wrote:
>
> Christoph, Frank, Jay: does this patch look like it fits your needs?

Yes. Thanks.


--
Frank

2005-10-04 21:48:51

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Hugh Dickins wrote:
>
> And thanks for your patch, which I've factored in with Frank's, and come
> come up with the one below, against 2.6.14-rc2-mm2. Which uses no CONFIG
> flag: I think we're enough out of the fast paths that it's not needed.
>
> See comment in fs/proc/task_mmu.c for the principle. Could maintain
> hiwater_vm straightforwardly, but I think it's easier to remember if
> we handle them both in the same way.
>
> I did look into doing the total_vm increment and calling vm_stat_account
> in insert_vm_struct, but concluded it solved no particular problem, and
> raised some questions (where architectures, notably ia64, have special
> vmas which they may have good reason to leave out of total_vm).
>
> I haven't cross-checked the mm_struct cacheline rearrangement yet,
> it looks plausible, but could easily turn out to straddle boundaries.
>
> Christoph, Frank, Jay: does this patch look like it fits your needs?

I am building a kernel with your patch and am going to run some test
to compare the statistics.

Thanks,
- jay

>

2005-10-07 02:09:40

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

Jay Lan wrote:
> Hugh Dickins wrote:
>
>>
>>
>> See comment in fs/proc/task_mmu.c for the principle. Could maintain
>> hiwater_vm straightforwardly, but I think it's easier to remember if
>> we handle them both in the same way.
>>
>> I did look into doing the total_vm increment and calling vm_stat_account
>> in insert_vm_struct, but concluded it solved no particular problem, and
>> raised some questions (where architectures, notably ia64, have special
>> vmas which they may have good reason to leave out of total_vm).
>>
>> I haven't cross-checked the mm_struct cacheline rearrangement yet,
>> it looks plausible, but could easily turn out to straddle boundaries.
>>
>> Christoph, Frank, Jay: does this patch look like it fits your needs?
>
>
> I am building a kernel with your patch and am going to run some test
> to compare the statistics.

My testing showed the same number on hiwater_vm, but hiwater_rss from
Hugh's version was consistently ~1.5% lower. Where was the loss?

The fact that i have consistent hiwater_vm and hiwater_rss
in a few hundreds of processes suggests that that test may not
be a good test for comparing hiwater_vm and hiwater_rss.

I guess it allocates same amount of memory up front in every sub-tests
processes and never get over it. However, it also showed the way Hugh
did hiwater_rss in new code missed something.

Tomorrow i will be very busy at my work. I will be back on this
next Monday.

Thanks,
- jay

>
> Thanks,
> - jay
>
>>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2005-10-07 03:04:11

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2.6.14-rc2] fix incorrect mm->hiwater_vm and mm->hiwater_rss

On Thu, 6 Oct 2005, Jay Lan wrote:
> Jay Lan wrote:
> > Hugh Dickins wrote:
> > >
> > > See comment in fs/proc/task_mmu.c for the principle. Could maintain
> > > hiwater_vm straightforwardly, but I think it's easier to remember if
> > > we handle them both in the same way.
> >
> > I am building a kernel with your patch and am going to run some test
> > to compare the statistics.
>
> My testing showed the same number on hiwater_vm, but hiwater_rss from
> Hugh's version was consistently ~1.5% lower. Where was the loss?

Thanks for trying it out. Please check if you adjusted the way in which
you collect hiwater_rss: as in the example in task_mmu.c, it's no longer
any good just reading hiwater_rss, you need the max of get_mm_rss and
hiwater_rss each time. Similarly for hiwater_vm.

Hugh