Ugh, looking more at the patch, I'm getting more and more convinces
that it is pure and utter garbage.
It does "sync_mm_rss(mm);" in mmput(), _after_ it has done the
possibly final mmdrop(). WTF?
This is crap, guys. Seriously. Stop playing russian rulette with this
code. I think we need to revert *all* of the crazy rss games, unless
Konstantin can show us some truly obviously correct fix.
Sadly, I merged and pushed out the crap before I had rebooted and
noticed this problem, so now it's in the wild. Can somebody please
take a look at this asap?
Linus
On Thu, Jun 7, 2012 at 5:17 PM, Linus Torvalds
<[email protected]> wrote:
> This patch actually seems to have made the
>
> ?BUG: Bad rss-counter state ..
>
> problem *much* worse. It triggers all the time for me now - I've got
> 408 of those messages on my macbook air within a minute of booting it.
>
> Not good. Especially not good when it's marked for stable too.
On 2012.06.07 at 17:25 -0700, Linus Torvalds wrote:
> Ugh, looking more at the patch, I'm getting more and more convinces
> that it is pure and utter garbage.
>
> It does "sync_mm_rss(mm);" in mmput(), _after_ it has done the
> possibly final mmdrop(). WTF?
>
> This is crap, guys. Seriously. Stop playing russian rulette with this
> code. I think we need to revert *all* of the crazy rss games, unless
> Konstantin can show us some truly obviously correct fix.
>
> Sadly, I merged and pushed out the crap before I had rebooted and
> noticed this problem, so now it's in the wild. Can somebody please
> take a look at this asap?
You've somehow merged the wrong patch.
The correct version can be found here:
http://marc.info/?l=linux-kernel&m=133848759505805
--
Markus
On Thu, 7 Jun 2012, Linus Torvalds wrote:
> Ugh, looking more at the patch, I'm getting more and more convinces
> that it is pure and utter garbage.
>
> It does "sync_mm_rss(mm);" in mmput(), _after_ it has done the
> possibly final mmdrop(). WTF?
>
> This is crap, guys. Seriously. Stop playing russian rulette with this
> code. I think we need to revert *all* of the crazy rss games, unless
> Konstantin can show us some truly obviously correct fix.
>
> Sadly, I merged and pushed out the crap before I had rebooted and
> noticed this problem, so now it's in the wild. Can somebody please
> take a look at this asap?
>
> Linus
>
> On Thu, Jun 7, 2012 at 5:17 PM, Linus Torvalds
> <[email protected]> wrote:
> > This patch actually seems to have made the
> >
> > ?BUG: Bad rss-counter state ..
> >
> > problem *much* worse. It triggers all the time for me now - I've got
> > 408 of those messages on my macbook air within a minute of booting it.
> >
> > Not good. Especially not good when it's marked for stable too.
I'm on the Cc, but I've not been following closely, I've not been able
to reproduce the issue with Konstantin's commit in just now, and I've
not even tried Oleg's version: so, don't trust me an inch.
But I was surprised that that patch went to you, I thought Konstantin
and Oleg had just reached agreement that Oleg's version (reposted this
morning in the "3.4-rc7: BUG: Bad rss-counter state" thread) was nicer.
And it looks like it does not do anything offensive in mmput().
Doing things after something called "mm_release" sounds equally
bad, but IIRC that's not so.
You probably want to revert Konstantin's, try out Oleg's on your Air,
and maybe wait for Konstantin and Oleg to confirm the below.
Offline for a few hours,
Hugh
[PATCH] correctly synchronize rss-counters at exit/exec
From: Oleg Nesterov <[email protected]>
A simplified version of Konstantin Khlebnikov's patch.
do_exit() and exec_mmap() call sync_mm_rss() before mm_release()
does put_user(clear_child_tid) which can update task->rss_stat
and thus make mm->rss_stat inconsistent. This triggers the "BUG:"
printk in check_mm().
- Move the final sync_mm_rss() from do_exit() to exit_mm(), and
change exec_mmap() to call sync_mm_rss() after mm_release() to
make check_mm() happy.
Perhaps we should simply move it into mm_release() and call it
unconditionally to catch the "task->rss_stat != 0 && !task->mm"
bugs.
- Since taskstats_exit() is called before exit_mm(), add another
sync_mm_rss() into xacct_add_tsk() who actually uses rss_stat.
Probably we should also shift acct_update_integrals().
Reported-by: Markus Trippelsdorf <[email protected]>
Tested-by: Martin Mokrejs <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
---
fs/exec.c | 2 +-
kernel/exit.c | 5 ++---
kernel/tsacct.c | 1 +
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 52c9e2f..e49e3c2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -823,10 +823,10 @@ static int exec_mmap(struct mm_struct *mm)
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
- sync_mm_rss(old_mm);
mm_release(tsk, old_mm);
if (old_mm) {
+ sync_mm_rss(old_mm);
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
diff --git a/kernel/exit.c b/kernel/exit.c
index ab972a7..b3a84b5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -655,6 +655,8 @@ static void exit_mm(struct task_struct * tsk)
mm_release(tsk, mm);
if (!mm)
return;
+
+ sync_mm_rss(mm);
/*
* Serialize with any possible pending coredump.
* We must hold mmap_sem around checking core_state
@@ -965,9 +967,6 @@ void do_exit(long code)
preempt_count());
acct_update_integrals(tsk);
- /* sync mm's RSS info before statistics gathering */
- if (tsk->mm)
- sync_mm_rss(tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 23b4d78..a64ee90 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
mm = get_task_mm(p);
if (mm) {
+ sync_mm_rss(mm);
/* adjust to KB unit */
stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
--
1.5.5.1
On Thu, Jun 7, 2012 at 6:05 PM, Markus Trippelsdorf
<[email protected]> wrote:
>
> You've somehow merged the wrong patch.
> The correct version can be found here:
> http://marc.info/?l=linux-kernel&m=133848759505805
It looks like Andrew sent me a bad version.
However, that patch you point at isn't good *either*.
It does totally insane things in xacct_add_tsk(). You can't call
"sync_mm_rss(mm)" on somebody elses mm, yet that is exactly what it
does (and you can't pass in another thread pointer either, since the
whole point of the per-thread counters is that they don't have locking
and aren't atomic, so you can't read them from any other context than
"current").
The thing is, the *only* point where it makes sense to sync the rss
pointers is when you detach the mm from the current thread. And
possibly at "fork()" time, *before* you duplicate the "struct
task_struct" and pollute the new one with stale rss counter values
from the old one.
So doing sync_mm_rss() in xacct_add_tsk() is crazy. Doing it
*anywhere* where mm is not clearly "current->mm" is wrong. If there is
a "get_task_mm()" or similar nearby, it's wrong, it's crap, and it
shouldn't be done.
Oleg, please rescue me? Your patch looks much closer to sane, but it's
not quite there..
Linus
No, this is apparently that same "almost there" patch from Oleg. I
guarantee that it's wrong.
Linus
---
[ This part, to be exact: ]
On Thu, Jun 7, 2012 at 6:16 PM, Hugh Dickins <[email protected]> wrote:
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
> ? ? ? ?stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> ? ? ? ?mm = get_task_mm(p);
> ? ? ? ?if (mm) {
> + ? ? ? ? ? ? ? sync_mm_rss(mm);
> ? ? ? ? ? ? ? ?/* adjust to KB unit */
> ? ? ? ? ? ? ? ?stats->hiwater_rss ? = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
> ? ? ? ? ? ? ? ?stats->hiwater_vm ? ?= get_mm_hiwater_vm(mm) ?* PAGE_SIZE / KB;
> --
> 1.5.5.1
On Thu, 7 Jun 2012, Linus Torvalds wrote:
> No, this is apparently that same "almost there" patch from Oleg. I
> guarantee that it's wrong.
>
> Linus
>
> ---
>
> [ This part, to be exact: ]
>
> On Thu, Jun 7, 2012 at 6:16 PM, Hugh Dickins <[email protected]> wrote:
> > --- a/kernel/tsacct.c
> > +++ b/kernel/tsacct.c
> > @@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
> > ? ? ? ?stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> > ? ? ? ?mm = get_task_mm(p);
> > ? ? ? ?if (mm) {
> > + ? ? ? ? ? ? ? sync_mm_rss(mm);
> > ? ? ? ? ? ? ? ?/* adjust to KB unit */
> > ? ? ? ? ? ? ? ?stats->hiwater_rss ? = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
> > ? ? ? ? ? ? ? ?stats->hiwater_vm ? ?= get_mm_hiwater_vm(mm) ?* PAGE_SIZE / KB;
> > --
Yup.
It does look as if Oleg's intent (last chance to update xacct stats
from dying current task) would be well served by changing that to
if (p == current)
sync_mm_rss(mm);
but I've made too many hurried decisions recently to sign off on that.
Hugh
Hugh Dickins wrote:
> On Thu, 7 Jun 2012, Linus Torvalds wrote:
>
>> Ugh, looking more at the patch, I'm getting more and more convinces
>> that it is pure and utter garbage.
>>
>> It does "sync_mm_rss(mm);" in mmput(), _after_ it has done the
>> possibly final mmdrop(). WTF?
>>
>> This is crap, guys. Seriously. Stop playing russian rulette with this
>> code. I think we need to revert *all* of the crazy rss games, unless
>> Konstantin can show us some truly obviously correct fix.
>>
>> Sadly, I merged and pushed out the crap before I had rebooted and
>> noticed this problem, so now it's in the wild. Can somebody please
>> take a look at this asap?
>>
>> Linus
>>
>> On Thu, Jun 7, 2012 at 5:17 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> This patch actually seems to have made the
>>>
>>> BUG: Bad rss-counter state ..
>>>
>>> problem *much* worse. It triggers all the time for me now - I've got
>>> 408 of those messages on my macbook air within a minute of booting it.
>>>
>>> Not good. Especially not good when it's marked for stable too.
>
> I'm on the Cc, but I've not been following closely, I've not been able
> to reproduce the issue with Konstantin's commit in just now, and I've
> not even tried Oleg's version: so, don't trust me an inch.
Easiest way to reproduce this:
#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <sys/mman.h>
int child(void *arg)
{
return 0;
}
char stack[4096];
int main(int argc, char **argv)
{
void *page;
page = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
clone(child, stack + sizeof(stack), CLONE_VFORK | CLONE_VM | CLONE_CHILD_CLEARTID, NULL, NULL, NULL, page);
return 0;
}
As result you can see "BUG: Bad rss-counter state mm:ffff88040783a680 idx:1 val:-1" in dmesg
There left only one problem: nobody calls sync_mm_rss() after put_user() in mm_release().
My patch is really messy, so please ignore it.
Oleg's http://marc.info/?l=linux-kernel&m=133848759505805 version much simpler and changes only sync_mm_rss() calls.
>
> But I was surprised that that patch went to you, I thought Konstantin
> and Oleg had just reached agreement that Oleg's version (reposted this
> morning in the "3.4-rc7: BUG: Bad rss-counter state" thread) was nicer.
>
> And it looks like it does not do anything offensive in mmput().
> Doing things after something called "mm_release" sounds equally
> bad, but IIRC that's not so.
>
> You probably want to revert Konstantin's, try out Oleg's on your Air,
> and maybe wait for Konstantin and Oleg to confirm the below.
>
> Offline for a few hours,
> Hugh
>
> [PATCH] correctly synchronize rss-counters at exit/exec
>
> From: Oleg Nesterov<[email protected]>
>
> A simplified version of Konstantin Khlebnikov's patch.
>
> do_exit() and exec_mmap() call sync_mm_rss() before mm_release()
> does put_user(clear_child_tid) which can update task->rss_stat
> and thus make mm->rss_stat inconsistent. This triggers the "BUG:"
> printk in check_mm().
>
> - Move the final sync_mm_rss() from do_exit() to exit_mm(), and
> change exec_mmap() to call sync_mm_rss() after mm_release() to
> make check_mm() happy.
>
> Perhaps we should simply move it into mm_release() and call it
> unconditionally to catch the "task->rss_stat != 0&& !task->mm"
> bugs.
>
> - Since taskstats_exit() is called before exit_mm(), add another
> sync_mm_rss() into xacct_add_tsk() who actually uses rss_stat.
>
> Probably we should also shift acct_update_integrals().
>
> Reported-by: Markus Trippelsdorf<[email protected]>
> Tested-by: Martin Mokrejs<[email protected]>
> Signed-off-by: Oleg Nesterov<[email protected]>
> Acked-by: Konstantin Khlebnikov<[email protected]>
> ---
> fs/exec.c | 2 +-
> kernel/exit.c | 5 ++---
> kernel/tsacct.c | 1 +
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 52c9e2f..e49e3c2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -823,10 +823,10 @@ static int exec_mmap(struct mm_struct *mm)
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
> - sync_mm_rss(old_mm);
> mm_release(tsk, old_mm);
>
> if (old_mm) {
> + sync_mm_rss(old_mm);
> /*
> * Make sure that if there is a core dump in progress
> * for the old mm, we get out and die instead of going
> diff --git a/kernel/exit.c b/kernel/exit.c
> index ab972a7..b3a84b5 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -655,6 +655,8 @@ static void exit_mm(struct task_struct * tsk)
> mm_release(tsk, mm);
> if (!mm)
> return;
> +
> + sync_mm_rss(mm);
> /*
> * Serialize with any possible pending coredump.
> * We must hold mmap_sem around checking core_state
> @@ -965,9 +967,6 @@ void do_exit(long code)
> preempt_count());
>
> acct_update_integrals(tsk);
> - /* sync mm's RSS info before statistics gathering */
> - if (tsk->mm)
> - sync_mm_rss(tsk->mm);
> group_dead = atomic_dec_and_test(&tsk->signal->live);
> if (group_dead) {
> hrtimer_cancel(&tsk->signal->real_timer);
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index 23b4d78..a64ee90 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
> stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> mm = get_task_mm(p);
> if (mm) {
> + sync_mm_rss(mm);
> /* adjust to KB unit */
> stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
> stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
On 06/07, Linus Torvalds wrote:
>
> It does totally insane things in xacct_add_tsk(). You can't call
> "sync_mm_rss(mm)" on somebody elses mm,
Damn, I am stupid. Yes, I forgot about fill_stats_for_pid().
And I didn't bother to look at get_task_mm() which clearly
shows that this tsk can be !current.
We can add the "p == current" check as Hugh suggested.
But,
> Doing it
> *anywhere* where mm is not clearly "current->mm" is wrong.
Agreed.
How about v2? It adds sync_mm_rss() into taskstats_exit(). Note
that it preserves the "tsk->mm != NULL" check we currently have.
I think it should be removed (see the changelog), but even if I
am right I'd prefer to do this in a separate patch.
------------------------------------------------------------------------------
Subject: [PATCH] correctly synchronize rss-counters at exit/exec
A simplified version of Konstantin Khlebnikov's patch.
do_exit() and exec_mmap() call sync_mm_rss() before mm_release()
does put_user(clear_child_tid) which can update task->rss_stat
and thus make mm->rss_stat inconsistent. This triggers the "BUG:"
printk in check_mm().
- Move the final sync_mm_rss() from do_exit() to exit_mm(), and
change exec_mmap() to call sync_mm_rss() after mm_release() to
make check_mm() happy.
Perhaps we should simply move it into mm_release() and call it
unconditionally to catch the "task->rss_stat != 0 && !task->mm"
bugs.
- Since taskstats_exit() is called before exit_mm(), add another
sync_mm_rss() into taskstats_exit() for xacct_add_tsk() who
actually uses rss_stat. As Linus pointed out, it is not sane
to move it into xacct_add_tsk().
Probably we should also shift acct_update_integrals(), and
"tsk->mm != NULL" check looks equally unneeded.
Reported-by: Markus Trippelsdorf <[email protected]>
Tested-by: Martin Mokrejs <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
---
fs/exec.c | 2 +-
kernel/exit.c | 5 ++---
kernel/taskstats.c | 2 ++
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index a79786a..da27b91 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -819,10 +819,10 @@ static int exec_mmap(struct mm_struct *mm)
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
- sync_mm_rss(old_mm);
mm_release(tsk, old_mm);
if (old_mm) {
+ sync_mm_rss(old_mm);
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e40041..38c4a91 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -655,6 +655,8 @@ static void exit_mm(struct task_struct * tsk)
mm_release(tsk, mm);
if (!mm)
return;
+
+ sync_mm_rss(mm);
/*
* Serialize with any possible pending coredump.
* We must hold mmap_sem around checking core_state
@@ -966,9 +968,6 @@ void do_exit(long code)
preempt_count());
acct_update_integrals(tsk);
- /* sync mm's RSS info before statistics gathering */
- if (tsk->mm)
- sync_mm_rss(tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index e660464..55d1103 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -630,6 +630,8 @@ void taskstats_exit(struct task_struct *tsk, int group_dead)
if (!stats)
goto err;
+ if (tsk->mm)
+ sync_mm_rss(tsk->mm);
fill_stats(tsk, stats);
/*
--
1.5.5.1
On 06/08, Konstantin Khlebnikov wrote:
>
> As result you can see "BUG: Bad rss-counter state mm:ffff88040783a680 idx:1 val:-1" in dmesg
>
> There left only one problem: nobody calls sync_mm_rss() after put_user() in mm_release().
Both callers call sync_mm_rss() to make check_mm() happy. But please
see the changelog, I think we should move it into mm_release(). See
the patch below (on top of v2 I sent). I need to recheck.
As for xacct_add_tsk(), yes it can "miss" that put_user(). But this
is what we have now, I think we do not care.
Oleg.
--- x/fs/exec.c
+++ x/fs/exec.c
@@ -822,7 +822,6 @@ static int exec_mmap(struct mm_struct *m
mm_release(tsk, old_mm);
if (old_mm) {
- sync_mm_rss(old_mm);
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -656,7 +656,6 @@ static void exit_mm(struct task_struct *
if (!mm)
return;
- sync_mm_rss(mm);
/*
* Serialize with any possible pending coredump.
* We must hold mmap_sem around checking core_state
--- x/kernel/taskstats.c
+++ x/kernel/taskstats.c
@@ -630,8 +630,7 @@ void taskstats_exit(struct task_struct *
if (!stats)
goto err;
- if (tsk->mm)
- sync_mm_rss(tsk->mm);
+ sync_mm_rss(tsk->mm);
fill_stats(tsk, stats);
/*
--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -656,7 +656,6 @@ static void exit_mm(struct task_struct *
if (!mm)
return;
- sync_mm_rss(mm);
/*
* Serialize with any possible pending coredump.
* We must hold mmap_sem around checking core_state
Oleg Nesterov wrote:
> On 06/08, Konstantin Khlebnikov wrote:
>>
>> As result you can see "BUG: Bad rss-counter state mm:ffff88040783a680 idx:1 val:-1" in dmesg
>>
>> There left only one problem: nobody calls sync_mm_rss() after put_user() in mm_release().
>
> Both callers call sync_mm_rss() to make check_mm() happy. But please
> see the changelog, I think we should move it into mm_release(). See
> the patch below (on top of v2 I sent). I need to recheck.
Patch below broken: it removes one hunk from kernel/exit.c twice.
And it does not add anything into mm_release().
>
> As for xacct_add_tsk(), yes it can "miss" that put_user(). But this
> is what we have now, I think we do not care.
>
> Oleg.
>
> --- x/fs/exec.c
> +++ x/fs/exec.c
> @@ -822,7 +822,6 @@ static int exec_mmap(struct mm_struct *m
> mm_release(tsk, old_mm);
>
> if (old_mm) {
> - sync_mm_rss(old_mm);
> /*
> * Make sure that if there is a core dump in progress
> * for the old mm, we get out and die instead of going
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -656,7 +656,6 @@ static void exit_mm(struct task_struct *
> if (!mm)
> return;
>
> - sync_mm_rss(mm);
> /*
> * Serialize with any possible pending coredump.
> * We must hold mmap_sem around checking core_state
> --- x/kernel/taskstats.c
> +++ x/kernel/taskstats.c
> @@ -630,8 +630,7 @@ void taskstats_exit(struct task_struct *
> if (!stats)
> goto err;
>
> - if (tsk->mm)
> - sync_mm_rss(tsk->mm);
> + sync_mm_rss(tsk->mm);
> fill_stats(tsk, stats);
>
> /*
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -656,7 +656,6 @@ static void exit_mm(struct task_struct *
> if (!mm)
> return;
>
> - sync_mm_rss(mm);
> /*
> * Serialize with any possible pending coredump.
> * We must hold mmap_sem around checking core_state
>
On 06/08, Konstantin Khlebnikov wrote:
>
> Oleg Nesterov wrote:
>> On 06/08, Konstantin Khlebnikov wrote:
>>>
>>> As result you can see "BUG: Bad rss-counter state mm:ffff88040783a680 idx:1 val:-1" in dmesg
>>>
>>> There left only one problem: nobody calls sync_mm_rss() after put_user() in mm_release().
>>
>> Both callers call sync_mm_rss() to make check_mm() happy. But please
>> see the changelog, I think we should move it into mm_release(). See
>> the patch below (on top of v2 I sent). I need to recheck.
>
> Patch below broken: it removes one hunk from kernel/exit.c twice.
> And it does not add anything into mm_release().
Yes, sorry. But I guess you understand the intent, mm_release() should
simply do sync_mm_rss() after put_user(clear_child_tid) unconditionally.
If task->mm == NULL but task->rss_stat, then there is something wrong
and probably OOPS makes sense.
Oleg.
do_exit() and exec_mmap() call sync_mm_rss() before mm_release()
does put_user(clear_child_tid) which can update task->rss_stat
and thus make mm->rss_stat inconsistent. This triggers the "BUG:"
printk in check_mm().
Let's fix this bug in the safest way, and optimize/cleanup this later.
Reported-by: Markus Trippelsdorf <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/exec.c | 2 +-
kernel/exit.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index a79786a..da27b91 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -819,10 +819,10 @@ static int exec_mmap(struct mm_struct *mm)
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
- sync_mm_rss(old_mm);
mm_release(tsk, old_mm);
if (old_mm) {
+ sync_mm_rss(old_mm);
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
diff --git a/kernel/exit.c b/kernel/exit.c
index 34867cc..c0277d3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -643,6 +643,7 @@ static void exit_mm(struct task_struct * tsk)
mm_release(tsk, mm);
if (!mm)
return;
+ sync_mm_rss(mm);
/*
* Serialize with any possible pending coredump.
* We must hold mmap_sem around checking core_state
(2012/06/08 21:18), Oleg Nesterov wrote:
> On 06/07, Linus Torvalds wrote:
>>
>> It does totally insane things in xacct_add_tsk(). You can't call
>> "sync_mm_rss(mm)" on somebody elses mm,
>
> Damn, I am stupid. Yes, I forgot about fill_stats_for_pid().
> And I didn't bother to look at get_task_mm() which clearly
> shows that this tsk can be !current.
>
> We can add the "p == current" check as Hugh suggested.
>
> But,
>
>> Doing it
>> *anywhere* where mm is not clearly "current->mm" is wrong.
>
> Agreed.
>
> How about v2? It adds sync_mm_rss() into taskstats_exit(). Note
> that it preserves the "tsk->mm != NULL" check we currently have.
> I think it should be removed (see the changelog), but even if I
> am right I'd prefer to do this in a separate patch.
>
I'm sorry I've been silent...one another fix I can think of is
this kind of change to sync_mm_rss(). How do you think ?
==
From be49ed6843b09ae33d758f2a51cf8357f7502512 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 11 Jun 2012 19:45:09 +0900
Subject: [PATCH] fix sync_mm_rss() leakage.
Any page fault after sync_mm_rss() in do_exit() causes problem
in check_mm(). It happens because task's rss counter is not
synchronized after the last sync_mm_rss().
This patch replaces the last sync_mm_rss() with finalize_mm_rss()
and disallow per-task rss count caching after finalization.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/exec.c | 3 ++-
include/linux/mm.h | 10 ++++++++++
kernel/exit.c | 3 +--
mm/memory.c | 21 ++++++++++++++++++---
4 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index a79786a..3e47772 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -819,7 +819,7 @@ static int exec_mmap(struct mm_struct *mm)
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
- sync_mm_rss(old_mm);
+ finalize_mm_rss();
mm_release(tsk, old_mm);
if (old_mm) {
@@ -851,6 +851,7 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
mmdrop(active_mm);
+ initialize_mm_rss();
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..995d7ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1129,10 +1129,20 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
#if defined(SPLIT_RSS_COUNTING)
void sync_mm_rss(struct mm_struct *mm);
+void finalize_mm_rss(void);
+void initialize_mm_rss(void);
#else
+static inline void finalize_mm_rss(void)
+{
+}
+
static inline void sync_mm_rss(struct mm_struct *mm)
{
}
+
+static inline void initialize_mm_rss(void)
+{
+}
#endif
int vma_wants_writenotify(struct vm_area_struct *vma);
diff --git a/kernel/exit.c b/kernel/exit.c
index 34867cc..2111879 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -961,8 +961,7 @@ void do_exit(long code)
acct_update_integrals(tsk);
/* sync mm's RSS info before statistics gathering */
- if (tsk->mm)
- sync_mm_rss(tsk->mm);
+ finalize_mm_rss();
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
diff --git a/mm/memory.c b/mm/memory.c
index 1b7dc66..07aa887d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -125,6 +125,20 @@ core_initcall(init_zero_pfn);
#if defined(SPLIT_RSS_COUNTING)
+void initialize_mm_rss(void)
+{
+ current->rss_stat.events = 0;
+}
+
+void finalize_mm_rss(void)
+{
+ current->rss_stat.events = -1;
+ if (current->mm)
+ sync_mm_rss(current->mm);
+}
+
+#define rss_count_finalized(task) ((task)->rss_stat.events < 0)
+
void sync_mm_rss(struct mm_struct *mm)
{
int i;
@@ -135,14 +149,15 @@ void sync_mm_rss(struct mm_struct *mm)
current->rss_stat.count[i] = 0;
}
}
- current->rss_stat.events = 0;
+ if (!rss_count_finalized(current))
+ current->rss_stat.events = 0;
}
static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
{
struct task_struct *task = current;
- if (likely(task->mm == mm))
+ if (likely(task->mm == mm && !rss_count_finalized(task)))
task->rss_stat.count[member] += val;
else
add_mm_counter(mm, member, val);
@@ -154,7 +169,7 @@ static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
#define TASK_RSS_EVENTS_THRESH (64)
static void check_sync_rss_stat(struct task_struct *task)
{
- if (unlikely(task != current))
+ if (unlikely(task != current || rss_count_finalized(task)))
return;
if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
sync_mm_rss(task->mm);
--
1.7.4.1