2008-12-03 18:14:37

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

Unless we are going to decrease rss/vm there is no point to call the
(racy) update_hiwater_xxx() helpers. Still do_exit() does this, and
the accounting code uses mm->hiwater_xxx directly.

This is not right. fill_pid()->xacct_add_tsk() can be called by
taskstats_user_cmd() at any time, not only when the task exits.
in that case taskstats->hiwater_xxx can be very wrong.

Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() to use instead,
and kill the "if (tsk->mm) {}" code in do_exit(). The first helper will
be also used to actually fill/report rusage->ru_maxrss.

Signed-off-by: Oleg Nesterov <[email protected]>

--- K-28/include/linux/sched.h~HIWATER 2008-12-02 17:12:40.000000000 +0100
+++ K-28/include/linux/sched.h 2008-12-03 18:17:18.000000000 +0100
@@ -388,6 +388,9 @@ extern void arch_unmap_area_topdown(stru
(mm)->hiwater_vm = (mm)->total_vm; \
} while (0)

+#define get_mm_hiwater_rss(mm) max((mm)->hiwater_rss, get_mm_rss(mm))
+#define get_mm_hiwater_vm(mm) max((mm)->hiwater_vm, (mm)->total_vm)
+
extern void set_dumpable(struct mm_struct *mm, int value);
extern int get_dumpable(struct mm_struct *mm);

--- K-28/kernel/tsacct.c~HIWATER 2008-10-10 00:13:53.000000000 +0200
+++ K-28/kernel/tsacct.c 2008-12-03 18:24:28.000000000 +0100
@@ -90,8 +90,8 @@ void xacct_add_tsk(struct taskstats *sta
mm = get_task_mm(p);
if (mm) {
/* adjust to KB unit */
- stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
- stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
+ stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
+ stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
mmput(mm);
}
stats->read_char = p->ioac.rchar;
--- K-28/kernel/exit.c~HIWATER 2008-12-02 17:12:40.000000000 +0100
+++ K-28/kernel/exit.c 2008-12-03 18:21:06.000000000 +0100
@@ -1048,10 +1048,7 @@ NORET_TYPE void do_exit(long code)
preempt_count());

acct_update_integrals(tsk);
- if (tsk->mm) {
- update_hiwater_rss(tsk->mm);
- update_hiwater_vm(tsk->mm);
- }
+
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);


2008-12-03 20:38:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

On Wed, 3 Dec 2008, Oleg Nesterov wrote:
> Unless we are going to decrease rss/vm there is no point to call the
> (racy) update_hiwater_xxx() helpers. Still do_exit() does this, and

I'm puzzled by this comment. exit() _is_ about to decrease rss/vm,
so isn't it right to be calling update_hiwater_xxx()?

There is a question of who's going to be able to see the result from
this point on: I forget whether I was doing it for my own satisfaction,
or for a real observer. Even if there isn't a real observer today,
I think I'd prefer do_exit() to continue to update_hiwater_xxx(),
in case an observer is added tomorrow - unless you feel it's
unjustifiably adding code to and slowing down process exit.

You say "(racy)": in my view, it was only as racy as whatever might
cause it to be racy. By that, I mean that if the numbers ended up
slightly wrong, you could reasonably imagine that the races happened
in a different sequence which would have ended up with the numbers
seen. Have you noticed something more serious we need to fix?

> the accounting code uses mm->hiwater_xxx directly.
>
> This is not right. fill_pid()->xacct_add_tsk() can be called by
> taskstats_user_cmd() at any time, not only when the task exits.
> in that case taskstats->hiwater_xxx can be very wrong.

Here you're very right. There was no tsacct.c when I added those
hiwaters in 2.6.15, it's quite wrong to have been using those
numbers without comparing against current values, well spotted.

>
> Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() to use instead,
> and kill the "if (tsk->mm) {}" code in do_exit().

If you're going to add special helper macros (I don't care myself),
wouldn't it be better to convert fs/proc/task_mmu.c (the original
consumer) to use them too?

And, as I say, I'd _prefer_ that block to remain in do_exit(),
but don't have strong evidence why it should.

> The first helper will
> be also used to actually fill/report rusage->ru_maxrss.

Oh, yes, I noticed a mail yesterday in which you claimed to Cc me,
but didn't (like we all claim to be attaching missing patches ;)
I then forgot it, but yes, I am glad to see Jiri putting
hiwater_rss to more use, fewer ever-0s from /usr/bin/time.

Hugh

>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- K-28/include/linux/sched.h~HIWATER 2008-12-02 17:12:40.000000000 +0100
> +++ K-28/include/linux/sched.h 2008-12-03 18:17:18.000000000 +0100
> @@ -388,6 +388,9 @@ extern void arch_unmap_area_topdown(stru
> (mm)->hiwater_vm = (mm)->total_vm; \
> } while (0)
>
> +#define get_mm_hiwater_rss(mm) max((mm)->hiwater_rss, get_mm_rss(mm))
> +#define get_mm_hiwater_vm(mm) max((mm)->hiwater_vm, (mm)->total_vm)
> +
> extern void set_dumpable(struct mm_struct *mm, int value);
> extern int get_dumpable(struct mm_struct *mm);
>
> --- K-28/kernel/tsacct.c~HIWATER 2008-10-10 00:13:53.000000000 +0200
> +++ K-28/kernel/tsacct.c 2008-12-03 18:24:28.000000000 +0100
> @@ -90,8 +90,8 @@ void xacct_add_tsk(struct taskstats *sta
> mm = get_task_mm(p);
> if (mm) {
> /* adjust to KB unit */
> - stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
> - stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
> + stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
> + stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
> mmput(mm);
> }
> stats->read_char = p->ioac.rchar;
> --- K-28/kernel/exit.c~HIWATER 2008-12-02 17:12:40.000000000 +0100
> +++ K-28/kernel/exit.c 2008-12-03 18:21:06.000000000 +0100
> @@ -1048,10 +1048,7 @@ NORET_TYPE void do_exit(long code)
> preempt_count());
>
> acct_update_integrals(tsk);
> - if (tsk->mm) {
> - update_hiwater_rss(tsk->mm);
> - update_hiwater_vm(tsk->mm);
> - }
> +
> group_dead = atomic_dec_and_test(&tsk->signal->live);
> if (group_dead) {
> hrtimer_cancel(&tsk->signal->real_timer);

2008-12-04 14:00:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

On 12/03, Hugh Dickins wrote:
>
> On Wed, 3 Dec 2008, Oleg Nesterov wrote:
>
> > Unless we are going to decrease rss/vm there is no point to call the
> > (racy) update_hiwater_xxx() helpers. Still do_exit() does this, and
>
> I'm puzzled by this comment. exit() _is_ about to decrease rss/vm,
> so isn't it right to be calling update_hiwater_xxx()?

Do you mean exit_mm()->...->exit_mmap() ? But this doesn't matter, this
->mm is going away. Nobody can read these counters when ->mm_users == 0,
no?

> There is a question of who's going to be able to see the result from
> this point on: I forget whether I was doing it for my own satisfaction,
> or for a real observer. Even if there isn't a real observer today,
> I think I'd prefer do_exit() to continue to update_hiwater_xxx(),
> in case an observer is added tomorrow - unless you feel it's
> unjustifiably adding code to and slowing down process exit.

Please see below,

> You say "(racy)": in my view, it was only as racy as whatever might
> cause it to be racy. By that, I mean that if the numbers ended up
> slightly wrong, you could reasonably imagine that the races happened
> in a different sequence which would have ended up with the numbers
> seen. Have you noticed something more serious we need to fix?

But the difference can be huge. Let's suppose the process has 2 threads
T1 and T2.

T1 exits, calls update_hiwater_vm(), notices that mm->hiwater_vm must be
updated, and preempted right before mm->hiwater_vm = new_value.

T2 does free(malloc(A_LOT)) and exits. It sets the correct value for
->hiwater_vm which takes A_LOT into account and disappears.

T1 resumes, and "reverts" ->hiwater_vm to the "new_value" which was
calculated before.

Now, since T1 is the last exiting thread, the whole thread group
exits and we report the wrong ->hiwater_vm to the userspace.

Yes, this race is unlikely. But there is another reason (perhaps
not very good) why I tried to remove update_hiwater_xxx() from
do_exit().

Imho this code looks as if: from now it is "safe" to use ->hiwater_xxx
directly because we already updated it. But it is not. Unless we are
the last thread, we can't trust mm->hiwater_xxx anyway, we should
re-check get_mm_rss/total_vm.

> > Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() to use instead,
> > and kill the "if (tsk->mm) {}" code in do_exit().
>
> If you're going to add special helper macros (I don't care myself),
> wouldn't it be better to convert fs/proc/task_mmu.c (the original
> consumer) to use them too?

Yes, I was going to convert task_mem(), but noticed that it has
to read get_mm_rss() and ->total_vm anyway. Still, perhaps it
would be more clean to use the new macros anyway, even if this
wiil (unlikely) need a couple of extra cpu ticks.

> And, as I say, I'd _prefer_ that block to remain in do_exit(),
> but don't have strong evidence why it should.

I think that update_hiwater_xxx() should be "private" for vm code which
does zap/unmap, and any observer should use get_mm_hiwater_xxx(). Nobody
should use mm->hiwater_xxx directly.

But I don't (and of course can't) have a strong opinion on that,
will wait for your verdict and re-send.

The patch need the update anyway, I just noticed that if we remove
update_hiwater_xxx() from do_exit(), we should change the comment
in exit_mmap().

> > The first helper will
> > be also used to actually fill/report rusage->ru_maxrss.
>
> Oh, yes, I noticed a mail yesterday in which you claimed to Cc me,
> but didn't (like we all claim to be attaching missing patches ;)

Yes sorry ;) The only problem with mutt is that it is not possible
to change CC while editing the text (unless edit_headers is set).

Oleg.

2008-12-05 12:54:45

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

On Thu, 4 Dec 2008, Oleg Nesterov wrote:
> On 12/03, Hugh Dickins wrote:
> >
> > On Wed, 3 Dec 2008, Oleg Nesterov wrote:
> >
> > > Unless we are going to decrease rss/vm there is no point to call the
> > > (racy) update_hiwater_xxx() helpers. Still do_exit() does this, and
> >
> > I'm puzzled by this comment. exit() _is_ about to decrease rss/vm,
> > so isn't it right to be calling update_hiwater_xxx()?
>
> Do you mean exit_mm()->...->exit_mmap() ? But this doesn't matter, this
> ->mm is going away. Nobody can read these counters when ->mm_users == 0,
> no?

I was conceding that point in my next sentence...

> > There is a question of who's going to be able to see the result from
> > this point on: I forget whether I was doing it for my own satisfaction,
> > or for a real observer.

I've looked back at it now: the hiwater fields already existed, their
updates were slowing things down, and nothing in tree was even looking
at the results; I changed the way they were updated to minimize the
slowdown.

So my principal reason for putting the updates in do_exit() was simply
that there had been an update_mem_hiwater(tsk) there in do_exit(), and
no examples of how the results would get used, so I thought I'd better
update them there too.

The /proc VmPeak & VmHWM stuff I added just to give it an intree
consumer and example of how to handle the counters: an example which
sadly was ignored when the tsacct.c use came in a few releases later.

> > Even if there isn't a real observer today,
> > I think I'd prefer do_exit() to continue to update_hiwater_xxx(),
> > in case an observer is added tomorrow - unless you feel it's
> > unjustifiably adding code to and slowing down process exit.
>
> Please see below,
>
> > You say "(racy)": in my view, it was only as racy as whatever might
> > cause it to be racy. By that, I mean that if the numbers ended up
> > slightly wrong, you could reasonably imagine that the races happened
> > in a different sequence which would have ended up with the numbers
> > seen. Have you noticed something more serious we need to fix?
>
> But the difference can be huge. Let's suppose the process has 2 threads
> T1 and T2.
>
> T1 exits, calls update_hiwater_vm(), notices that mm->hiwater_vm must be
> updated, and preempted right before mm->hiwater_vm = new_value.
>
> T2 does free(malloc(A_LOT)) and exits. It sets the correct value for
> ->hiwater_vm which takes A_LOT into account and disappears.
>
> T1 resumes, and "reverts" ->hiwater_vm to the "new_value" which was
> calculated before.
>
> Now, since T1 is the last exiting thread, the whole thread group
> exits and we report the wrong ->hiwater_vm to the userspace.

Thanks for spelling it out for me, Oleg, yes, I can see that now.
And indeed, I seem to have been aware of it originally, when I said:
<quote>
What locking? None: if the app is racy then these statistics will be
racy, it's not worth any overhead to make them exact. But whenever it
suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss
under page_table_lock (for now) or with preemption disabled (later on):
without going to any trouble, minimize the time between reading current
values and updating, to minimize those occasions when a racing thread
bumps a count up and back down in between.
<unquote>

Other uses of update_hiwater_vm() are protected by down_write of
mmap_sem, so by removing the update_hiwater_vm() from do_exit(),
you're completely fixing its raciness. update_hiwater_rss()
remains racy in other places, but it's less of an issue because
rss can't be raised by large amounts, just by one page increments.
(We may want to revisit this if we ever get around to doing the
tlb_gather_mmu stuff without disabling preemption.)

>
> Yes, this race is unlikely. But there is another reason (perhaps
> not very good) why I tried to remove update_hiwater_xxx() from
> do_exit().
>
> Imho this code looks as if: from now it is "safe" to use ->hiwater_xxx
> directly because we already updated it. But it is not. Unless we are
> the last thread, we can't trust mm->hiwater_xxx anyway, we should
> re-check get_mm_rss/total_vm.

Okay, you're completely eliminating one race, and we've found out
why I sited the code there originally: go ahead, I've no objection
to you removing it now.

But with one change: please also change the comment in exit_mmap():
- /* Don't update_hiwater_rss(mm) here, do_exit already did */
+ /* update_hiwater_rss(mm) here? but nobody should be looking */
or reword that if you prefer.

Ah, reading further down, I see you spotted that.

> > > Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() to use instead,
> > > and kill the "if (tsk->mm) {}" code in do_exit().
> >
> > If you're going to add special helper macros (I don't care myself),
> > wouldn't it be better to convert fs/proc/task_mmu.c (the original
> > consumer) to use them too?
>
> Yes, I was going to convert task_mem(), but noticed that it has
> to read get_mm_rss() and ->total_vm anyway. Still, perhaps it
> would be more clean to use the new macros anyway, even if this
> wiil (unlikely) need a couple of extra cpu ticks.

No, that is a good reason to leave fs/proc/task_mmu.c as it is.
I just wouldn't have added your accessor macros for one use myself;
but they might help someone to avoid making the same mistake again.

Go ahead, thanks!
Hugh

2008-12-05 18:51:18

by Jay Lan

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

Hugh Dickins wrote:
> On Thu, 4 Dec 2008, Oleg Nesterov wrote:
>> On 12/03, Hugh Dickins wrote:
>>> On Wed, 3 Dec 2008, Oleg Nesterov wrote:
>>>
>>>> Unless we are going to decrease rss/vm there is no point to call the
>>>> (racy) update_hiwater_xxx() helpers. Still do_exit() does this, and
>>> I'm puzzled by this comment. exit() _is_ about to decrease rss/vm,
>>> so isn't it right to be calling update_hiwater_xxx()?
>> Do you mean exit_mm()->...->exit_mmap() ? But this doesn't matter, this
>> ->mm is going away. Nobody can read these counters when ->mm_users == 0,
>> no?
>
> I was conceding that point in my next sentence...
>
>>> There is a question of who's going to be able to see the result from
>>> this point on: I forget whether I was doing it for my own satisfaction,
>>> or for a real observer.
>
> I've looked back at it now: the hiwater fields already existed, their
> updates were slowing things down, and nothing in tree was even looking
> at the results; I changed the way they were updated to minimize the
> slowdown.
>
> So my principal reason for putting the updates in do_exit() was simply
> that there had been an update_mem_hiwater(tsk) there in do_exit(), and
> no examples of how the results would get used, so I thought I'd better
> update them there too.

Add Jonathan Lim to cc. He is responsible for CSA. The applications
using those data are running at user space through the taskstats
interface. That was the whole point of creating taskstats interface:
move accounting modules to user space.

Regards,
jay

>
> The /proc VmPeak & VmHWM stuff I added just to give it an intree
> consumer and example of how to handle the counters: an example which
> sadly was ignored when the tsacct.c use came in a few releases later.
>
>>> Even if there isn't a real observer today,
>>> I think I'd prefer do_exit() to continue to update_hiwater_xxx(),
>>> in case an observer is added tomorrow - unless you feel it's
>>> unjustifiably adding code to and slowing down process exit.
>> Please see below,
>>
>>> You say "(racy)": in my view, it was only as racy as whatever might
>>> cause it to be racy. By that, I mean that if the numbers ended up
>>> slightly wrong, you could reasonably imagine that the races happened
>>> in a different sequence which would have ended up with the numbers
>>> seen. Have you noticed something more serious we need to fix?
>> But the difference can be huge. Let's suppose the process has 2 threads
>> T1 and T2.
>>
>> T1 exits, calls update_hiwater_vm(), notices that mm->hiwater_vm must be
>> updated, and preempted right before mm->hiwater_vm = new_value.
>>
>> T2 does free(malloc(A_LOT)) and exits. It sets the correct value for
>> ->hiwater_vm which takes A_LOT into account and disappears.
>>
>> T1 resumes, and "reverts" ->hiwater_vm to the "new_value" which was
>> calculated before.
>>
>> Now, since T1 is the last exiting thread, the whole thread group
>> exits and we report the wrong ->hiwater_vm to the userspace.
>
> Thanks for spelling it out for me, Oleg, yes, I can see that now.
> And indeed, I seem to have been aware of it originally, when I said:
> <quote>
> What locking? None: if the app is racy then these statistics will be
> racy, it's not worth any overhead to make them exact. But whenever it
> suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss
> under page_table_lock (for now) or with preemption disabled (later on):
> without going to any trouble, minimize the time between reading current
> values and updating, to minimize those occasions when a racing thread
> bumps a count up and back down in between.
> <unquote>
>
> Other uses of update_hiwater_vm() are protected by down_write of
> mmap_sem, so by removing the update_hiwater_vm() from do_exit(),
> you're completely fixing its raciness. update_hiwater_rss()
> remains racy in other places, but it's less of an issue because
> rss can't be raised by large amounts, just by one page increments.
> (We may want to revisit this if we ever get around to doing the
> tlb_gather_mmu stuff without disabling preemption.)
>
>> Yes, this race is unlikely. But there is another reason (perhaps
>> not very good) why I tried to remove update_hiwater_xxx() from
>> do_exit().
>>
>> Imho this code looks as if: from now it is "safe" to use ->hiwater_xxx
>> directly because we already updated it. But it is not. Unless we are
>> the last thread, we can't trust mm->hiwater_xxx anyway, we should
>> re-check get_mm_rss/total_vm.
>
> Okay, you're completely eliminating one race, and we've found out
> why I sited the code there originally: go ahead, I've no objection
> to you removing it now.
>
> But with one change: please also change the comment in exit_mmap():
> - /* Don't update_hiwater_rss(mm) here, do_exit already did */
> + /* update_hiwater_rss(mm) here? but nobody should be looking */
> or reword that if you prefer.
>
> Ah, reading further down, I see you spotted that.
>
>>>> Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() to use instead,
>>>> and kill the "if (tsk->mm) {}" code in do_exit().
>>> If you're going to add special helper macros (I don't care myself),
>>> wouldn't it be better to convert fs/proc/task_mmu.c (the original
>>> consumer) to use them too?
>> Yes, I was going to convert task_mem(), but noticed that it has
>> to read get_mm_rss() and ->total_vm anyway. Still, perhaps it
>> would be more clean to use the new macros anyway, even if this
>> wiil (unlikely) need a couple of extra cpu ticks.
>
> No, that is a good reason to leave fs/proc/task_mmu.c as it is.
> I just wouldn't have added your accessor macros for one use myself;
> but they might help someone to avoid making the same mistake again.
>
> Go ahead, thanks!
> Hugh

2008-12-06 08:48:59

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

* Jay Lan <[email protected]> [2008-12-05 10:41:02]:

> Hugh Dickins wrote:
> > On Thu, 4 Dec 2008, Oleg Nesterov wrote:
> >> On 12/03, Hugh Dickins wrote:
> >>> On Wed, 3 Dec 2008, Oleg Nesterov wrote:
> >>>
> >>>> Unless we are going to decrease rss/vm there is no point to call the
> >>>> (racy) update_hiwater_xxx() helpers. Still do_exit() does this, and
> >>> I'm puzzled by this comment. exit() _is_ about to decrease rss/vm,
> >>> so isn't it right to be calling update_hiwater_xxx()?
> >> Do you mean exit_mm()->...->exit_mmap() ? But this doesn't matter, this
> >> ->mm is going away. Nobody can read these counters when ->mm_users == 0,
> >> no?
> >
> > I was conceding that point in my next sentence...
> >
> >>> There is a question of who's going to be able to see the result from
> >>> this point on: I forget whether I was doing it for my own satisfaction,
> >>> or for a real observer.
> >
> > I've looked back at it now: the hiwater fields already existed, their
> > updates were slowing things down, and nothing in tree was even looking
> > at the results; I changed the way they were updated to minimize the
> > slowdown.
> >
> > So my principal reason for putting the updates in do_exit() was simply
> > that there had been an update_mem_hiwater(tsk) there in do_exit(), and
> > no examples of how the results would get used, so I thought I'd better
> > update them there too.
>
> Add Jonathan Lim to cc. He is responsible for CSA. The applications
> using those data are running at user space through the taskstats
> interface. That was the whole point of creating taskstats interface:
> move accounting modules to user space.

Yes, true and getdelays can display all the exported information.

The race does seem concerning, I would vote for keeping the update in
there and disabling preemption around the update, so that hiwater
cannot swing back and forth.

--
Balbir

2008-12-06 09:56:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

On Sat, 6 Dec 2008, Balbir Singh wrote:
>
> Yes, true and getdelays can display all the exported information.
>
> The race does seem concerning, I would vote for keeping the update in
> there and disabling preemption around the update, so that hiwater
> cannot swing back and forth.

?? Oleg is _fixing_ a race by removing the update from do_exit();
and he is fixing the way the hiwater info was collected in tsacct.c.

Hugh

2008-12-06 10:46:55

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

* Hugh Dickins <[email protected]> [2008-12-06 09:56:19]:

> On Sat, 6 Dec 2008, Balbir Singh wrote:
> >
> > Yes, true and getdelays can display all the exported information.
> >
> > The race does seem concerning, I would vote for keeping the update in
> > there and disabling preemption around the update, so that hiwater
> > cannot swing back and forth.
>
> ?? Oleg is _fixing_ a race by removing the update from do_exit();
> and he is fixing the way the hiwater info was collected in tsacct.c.

I see that change and the reasoning seems accurate that we can query
the task at anytime, but I worry that if taskstats is not enabled, we'll
never call update_hiwater.* on the exiting task.

I wonder if a thread came in and like Oleg said, did (without taskstats
enabled)

free(malloc(some size)), followed by exit()

whether task_mem() would show the correct results for hiwater.*. I
should try it out and look more closely at the code as well.

--
Balbir

2008-12-06 15:26:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

Hi Oleg

> But the difference can be huge. Let's suppose the process has 2 threads
> T1 and T2.
>
> T1 exits, calls update_hiwater_vm(), notices that mm->hiwater_vm must be
> updated, and preempted right before mm->hiwater_vm = new_value.
>
> T2 does free(malloc(A_LOT)) and exits. It sets the correct value for
> ->hiwater_vm which takes A_LOT into account and disappears.
>
> T1 resumes, and "reverts" ->hiwater_vm to the "new_value" which was
> calculated before.
>
> Now, since T1 is the last exiting thread, the whole thread group
> exits and we report the wrong ->hiwater_vm to the userspace.
>
> Yes, this race is unlikely. But there is another reason (perhaps
> not very good) why I tried to remove update_hiwater_xxx() from
> do_exit().

This explain is very good clarification than current patch description, IMHO.
So, if possible, I hope to add it into the patch description.

thanks.

2008-12-07 16:21:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

(sorry for delay, I am travelling till 11 Dec)

On 12/06, Balbir Singh wrote:
>
> * Hugh Dickins <[email protected]> [2008-12-06 09:56:19]:
>
> > On Sat, 6 Dec 2008, Balbir Singh wrote:
> > >
> > > Yes, true and getdelays can display all the exported information.
> > >
> > > The race does seem concerning, I would vote for keeping the update in
> > > there and disabling preemption around the update, so that hiwater
> > > cannot swing back and forth.
> >
> > ?? Oleg is _fixing_ a race by removing the update from do_exit();
> > and he is fixing the way the hiwater info was collected in tsacct.c.
>
> I see that change and the reasoning seems accurate that we can query
> the task at anytime, but I worry that if taskstats is not enabled, we'll
> never call update_hiwater.* on the exiting task.

With this patch, even if taskstats _is_ enabled, we never call update_
on do_exit() path. Because there is no point to do this.

> I wonder if a thread came in and like Oleg said, did (without taskstats
> enabled)
>
> free(malloc(some size)), followed by exit()
>
> whether task_mem() would show the correct results for hiwater.*.

unlike taskstats, task_mem() doesn't rely on update_hiwater_xxx(),
it reads the current values and calculates the maximum. And this is
the "right thing".

update_hiwater_xxx() is only needed when we are going to decrease
the current value, so we can lose the info if we don't calculate
the maximum right now.

We can disable preemption around update_ in do_exit(), but this
doesn't close the race. We can even disable irqs but this (in
theory) is not enough either. But the main point we do not need
to update.

And please note that taskstats was wrong even if update_ was not
racy. Exactly because it relies on update_ in do_exit(), but it
should not.

As for ru_maxrss accounting, we can keep these update_hiwater_xxx()
calls in do_exit() and then use mm->hiwater_xxx directly, but we
should check group_dead in that case. I don't really think this
would be cleaner/better, and then we have the similar problems with
CLONE_VM tasks.

Oleg.

2008-12-07 16:58:45

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

* Oleg Nesterov <[email protected]> [2008-12-07 17:17:50]:

> (sorry for delay, I am travelling till 11 Dec)
>
> On 12/06, Balbir Singh wrote:
> >
> > * Hugh Dickins <[email protected]> [2008-12-06 09:56:19]:
> >
> > > On Sat, 6 Dec 2008, Balbir Singh wrote:
> > > >
> > > > Yes, true and getdelays can display all the exported information.
> > > >
> > > > The race does seem concerning, I would vote for keeping the update in
> > > > there and disabling preemption around the update, so that hiwater
> > > > cannot swing back and forth.
> > >
> > > ?? Oleg is _fixing_ a race by removing the update from do_exit();
> > > and he is fixing the way the hiwater info was collected in tsacct.c.
> >
> > I see that change and the reasoning seems accurate that we can query
> > the task at anytime, but I worry that if taskstats is not enabled, we'll
> > never call update_hiwater.* on the exiting task.
>
> With this patch, even if taskstats _is_ enabled, we never call update_
> on do_exit() path. Because there is no point to do this.
>

Hmmm.. I thought the rules were to update it when RSS/total_vm is
decreasing. taskstats_exit() calls fill_pid(), which in turn calls
xacct_add_tsk().

> > I wonder if a thread came in and like Oleg said, did (without taskstats
> > enabled)
> >
> > free(malloc(some size)), followed by exit()
> >
> > whether task_mem() would show the correct results for hiwater.*.
>
> unlike taskstats, task_mem() doesn't rely on update_hiwater_xxx(),
> it reads the current values and calculates the maximum. And this is
> the "right thing".
>
> update_hiwater_xxx() is only needed when we are going to decrease
> the current value, so we can lose the info if we don't calculate
> the maximum right now.
>

This is a bit confusing, look at strerror_l.c in libc. It frees the
last strerror value on exit of the thread. If a thread did strerror()
followed by exit(). If free() and malloc() map to mmap() and munmap(),
do_exit() will affect RSS and total_vm... no?

> We can disable preemption around update_ in do_exit(), but this
> doesn't close the race. We can even disable irqs but this (in
> theory) is not enough either. But the main point we do not need
> to update.
>

See above.

> And please note that taskstats was wrong even if update_ was not
> racy. Exactly because it relies on update_ in do_exit(), but it
> should not.
>

This is because you believe we should do the comparison like
task_mem()? task_mem() does no updates of hi_water.*.

> As for ru_maxrss accounting, we can keep these update_hiwater_xxx()
> calls in do_exit() and then use mm->hiwater_xxx directly, but we
> should check group_dead in that case. I don't really think this
> would be cleaner/better, and then we have the similar problems with
> CLONE_VM tasks.

CLONE_VM without thread groups is sort of annoying and hopefully dead
:) mm_owner had a lot of complexity due to that

--
Balbir

2008-12-07 17:30:41

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

On 12/07, Balbir Singh wrote:
>
> * Oleg Nesterov <[email protected]> [2008-12-07 17:17:50]:
>
> > On 12/06, Balbir Singh wrote:
> > >
> > > * Hugh Dickins <[email protected]> [2008-12-06 09:56:19]:
> > >
> > > > On Sat, 6 Dec 2008, Balbir Singh wrote:
> > > > >
> > > > > Yes, true and getdelays can display all the exported information.
> > > > >
> > > > > The race does seem concerning, I would vote for keeping the update in
> > > > > there and disabling preemption around the update, so that hiwater
> > > > > cannot swing back and forth.
> > > >
> > > > ?? Oleg is _fixing_ a race by removing the update from do_exit();
> > > > and he is fixing the way the hiwater info was collected in tsacct.c.
> > >
> > > I see that change and the reasoning seems accurate that we can query
> > > the task at anytime, but I worry that if taskstats is not enabled, we'll
> > > never call update_hiwater.* on the exiting task.
> >
> > With this patch, even if taskstats _is_ enabled, we never call update_
> > on do_exit() path. Because there is no point to do this.
>
> Hmmm.. I thought the rules were to update it when RSS/total_vm is
> decreasing.

Yes, but we are not going to decrease rss/vm,

> taskstats_exit() calls fill_pid(), which in turn calls
> xacct_add_tsk().

Yes, but we can't rely on update_hiwater_xxx() in do_exit(), because
this path can be called before this thread/process exits.

> > > I wonder if a thread came in and like Oleg said, did (without taskstats
> > > enabled)
> > >
> > > free(malloc(some size)), followed by exit()
> > >
> > > whether task_mem() would show the correct results for hiwater.*.
> >
> > unlike taskstats, task_mem() doesn't rely on update_hiwater_xxx(),
> > it reads the current values and calculates the maximum. And this is
> > the "right thing".
> >
> > update_hiwater_xxx() is only needed when we are going to decrease
> > the current value, so we can lose the info if we don't calculate
> > the maximum right now.
> >
>
> This is a bit confusing, look at strerror_l.c in libc. It frees the
> last strerror value on exit of the thread. If a thread did strerror()
> followed by exit(). If free() and malloc() map to mmap() and munmap(),
> do_exit() will affect RSS and total_vm... no?

No. When the task does unmap, vm does update_hiwater_vm() "internally",
it does not need the help from do_exit(). And do_exit() can't help,
it is to late to calculate the maximum, ->total_vm was already decreased.

do_exit() itself does not affect rss/vm. Until we call exit_mmap(),
but at this point ->mm is dead, nobody can look at it, its ->mm_users
is zero.

> > We can disable preemption around update_ in do_exit(), but this
> > doesn't close the race. We can even disable irqs but this (in
> > theory) is not enough either. But the main point we do not need
> > to update.
> >
>
> See above.

See above ;)

> > And please note that taskstats was wrong even if update_ was not
> > racy. Exactly because it relies on update_ in do_exit(), but it
> > should not.
> >
>
> This is because you believe we should do the comparison like
> task_mem()? task_mem() does no updates of hi_water.*.

Yes. please read the patch, taskstats uses the new get_mm_hiwater_xxx()
helpers.

> > As for ru_maxrss accounting, we can keep these update_hiwater_xxx()
> > calls in do_exit() and then use mm->hiwater_xxx directly, but we
> > should check group_dead in that case. I don't really think this
> > would be cleaner/better, and then we have the similar problems with
> > CLONE_VM tasks.
>
> CLONE_VM without thread groups is sort of annoying and hopefully dead
> :) mm_owner had a lot of complexity due to that

Yes, I know. But I doubt we can stop support CLONE_VM ;)

Oleg.

2008-12-07 17:40:25

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting

* Oleg Nesterov <[email protected]> [2008-12-07 18:28:45]:

[snip]

> No. When the task does unmap, vm does update_hiwater_vm() "internally",
> it does not need the help from do_exit(). And do_exit() can't help,
> it is to late to calculate the maximum, ->total_vm was already decreased.
>
> do_exit() itself does not affect rss/vm. Until we call exit_mmap(),
> but at this point ->mm is dead, nobody can look at it, its ->mm_users
> is zero.
>

OK, I think that makes sense, since free would have cleaned up the
memory before entering the exit system call.

[snip]

--
Balbir