Hi Andrew,
while shutting down my laptop (Dell Vostro 3550 with 16GB RAM, core i7) with 3.4-rc7 I got:
May 23 00:07:54 vostro kernel: [352687.968267] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:1 val:-59
May 23 00:07:54 vostro kernel: [352687.968312] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:2 val:59
May 23 00:07:55 vostro acpid: exiting
May 23 00:07:55 vostro syslog-ng[2838]: syslog-ng shutting down; version='3.3.4'
Via Google I found the thread below and thought it might be related:
http://comments.gmane.org/gmane.linux.kernel.mm/76459
Please forward this to the right person and Cc: me if I should provide more details (some lines from dmesg?).
I am a plain user and it is time to sleep here.
Actually, searching /var/log/messages backwards gives me a few more hits from previous shutdowns:
May 14 18:36:11 vostro kernel: [548780.511951] xhci_hcd 0000:0b:00.0: Cached old ring, 1 ring cached
May 14 18:36:11 vostro kernel: [548780.511953] xhci_hcd 0000:0b:00.0: Cached old ring, 2 rings cached
May 14 18:36:11 vostro kernel: [548780.512034] xhci_hcd 0000:0b:00.0: // Ding dong!
May 14 18:36:11 vostro kernel: [548780.512042] xhci_hcd 0000:0b:00.0: get port status, actual port 0 status = 0x2a0
May 14 18:36:11 vostro kernel: [548780.512043] xhci_hcd 0000:0b:00.0: Get port status returned 0x2a0
May 14 18:36:11 vostro kernel: [548780.536242] BUG: Bad rss-counter state mm:ffff88040bf760c0 idx:1 val:-1
May 14 18:36:11 vostro kernel: [548780.536245] BUG: Bad rss-counter state mm:ffff88040bf760c0 idx:2 val:1
May 14 18:36:11 vostro kernel: [548780.551350] xhci_hcd 0000:0b:00.0: get port status, actual port 0 status = 0x2a0
May 14 18:36:11 vostro kernel: [548780.551363] xhci_hcd 0000:0b:00.0: Get port status returned 0x2a0
May 14 18:36:11 vostro kernel: [548780.591424] xhci_hcd 0000:0b:00.0: get port status, actual port 0 status = 0x2a0
May 14 18:36:11 vostro kernel: [548780.591427] xhci_hcd 0000:0b:00.0: Get port status returned 0x2a0
May 14 18:36:11 vostro kernel: [548780.631208] xhci_hcd 0000:0b:00.0: get port status, actual port 0 status = 0x2a0
May 14 18:36:11 vostro kernel: [548780.631217] xhci_hcd 0000:0b:00.0: Get port status returned 0x2a0
May 14 18:36:11 vostro kernel: [548780.671254] xhci_hcd 0000:0b:00.0: get port status, actual port 0 status = 0x2a0
May 14 18:36:11 vostro kernel: [548780.671256] xhci_hcd 0000:0b:00.0: Get port status returned 0x2a0
May 14 18:36:11 vostro kernel: [548780.671259] hub 4-0:1.0: debounce: port 1: total 100ms stable 100ms status 0x2a0
May 14 18:36:11 vostro kernel: [548781.093467] BUG: Bad rss-counter state mm:ffff88040954ec40 idx:1 val:-1
May 14 18:36:11 vostro kernel: [548781.093470] BUG: Bad rss-counter state mm:ffff88040954ec40 idx:2 val:1
My older logs show it first appeared in 3.4.0-rc6. Or is it because I changed my .config
at that time? I can't say at the moment. What kind of config variable should I look for?
Best regards,
Martin
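For anyone grepping their own logs for these reports, here is a minimal sketch of how one such line could be pulled apart. It assumes the counter order of the 3.4 tree (MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS) and the message format exactly as it appears in the log lines above; the helper names parse_rss_bug_line() and rss_counter_name() are made up for illustration and are not kernel or library functions.

```c
#include <stdio.h>
#include <string.h>

/* Assumed counter order (3.4-era enum); idx in the log indexes this table. */
static const char *const rss_counter_names[] = {
	"MM_FILEPAGES", "MM_ANONPAGES", "MM_SWAPENTS",
};

/*
 * Hypothetical helper: returns 1 and fills mm/idx/val if the syslog line
 * contains a "Bad rss-counter state" report, 0 otherwise.
 */
static int parse_rss_bug_line(const char *line, unsigned long long *mm,
			      int *idx, long *val)
{
	const char *p = strstr(line, "BUG: Bad rss-counter state ");

	if (!p)
		return 0;
	return sscanf(p, "BUG: Bad rss-counter state mm:%llx idx:%d val:%ld",
		      mm, idx, val) == 3;
}

/* Hypothetical helper: map idx to a counter name, "unknown" if out of range. */
static const char *rss_counter_name(int idx)
{
	int n = (int)(sizeof(rss_counter_names) / sizeof(rss_counter_names[0]));

	if (idx < 0 || idx >= n)
		return "unknown";
	return rss_counter_names[idx];
}
```

Under that assumption, idx:1 is the anonymous-page counter and idx:2 the swap-entry counter, so each pair of lines above reports the same magnitude with opposite signs on those two counters.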
On Wed, 23 May 2012 00:41:28 +0200
Martin Mokrejs <[email protected]> wrote:
> Hi Andrew,
> while shutting down my laptop (Dell Vostro 3550 with 16GB RAM, core i7) with 3.4-rc7 I got:
>
> May 23 00:07:54 vostro kernel: [352687.968267] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:1 val:-59
> May 23 00:07:54 vostro kernel: [352687.968312] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:2 val:59
> May 23 00:07:55 vostro acpid: exiting
> May 23 00:07:55 vostro syslog-ng[2838]: syslog-ng shutting down; version='3.3.4'
>
> I found by Google the below thread and thought that maybe it is related?
> http://comments.gmane.org/gmane.linux.kernel.mm/76459
>
> ...
>
Well hopefully the below will fix this?
I notice that I don't have this tagged for -stable backporting. That
seems wrong. Konstantin, do we know for how long this bug has been in
there?
From: Konstantin Khlebnikov <[email protected]>
Subject: mm: correctly synchronize rss-counters at exit/exec
mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().
do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats, audit
and other stuff. Unfortunately the kernel does this before calling
mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should be
no pagefaults.
[[email protected]: tweak comment]
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reported-by: Markus Trippelsdorf <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
 fs/exec.c     |    1 -
 kernel/exit.c |   13 ++++++++-----
 kernel/fork.c |    8 ++++++++
 3 files changed, 16 insertions(+), 6 deletions(-)

diff -puN fs/exec.c~mm-correctly-synchronize-rss-counters-at-exit-exec fs/exec.c
--- a/fs/exec.c~mm-correctly-synchronize-rss-counters-at-exit-exec
+++ a/fs/exec.c
@@ -823,7 +823,6 @@ static int exec_mmap(struct mm_struct *m
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
-	sync_mm_rss(old_mm);
 	mm_release(tsk, old_mm);
 
 	if (old_mm) {
diff -puN kernel/exit.c~mm-correctly-synchronize-rss-counters-at-exit-exec kernel/exit.c
--- a/kernel/exit.c~mm-correctly-synchronize-rss-counters-at-exit-exec
+++ a/kernel/exit.c
@@ -423,6 +423,7 @@ void daemonize(const char *name, ...)
 	 * user space pages. We don't need them, and if we didn't close them
 	 * they would be locked into memory.
 	 */
+	mm_release(current, current->mm);
 	exit_mm(current);
 	/*
 	 * We don't want to get frozen, in case system-wide hibernation
@@ -640,7 +641,6 @@ static void exit_mm(struct task_struct *
 	struct mm_struct *mm = tsk->mm;
 	struct core_state *core_state;
 
-	mm_release(tsk, mm);
 	if (!mm)
 		return;
 	/*
@@ -959,9 +959,13 @@ void do_exit(long code)
 				preempt_count());
 
 	acct_update_integrals(tsk);
-	/* sync mm's RSS info before statistics gathering */
-	if (tsk->mm)
-		sync_mm_rss(tsk->mm);
+
+	/* Set exit_code before complete_vfork_done() in mm_release() */
+	tsk->exit_code = code;
+
+	/* Release mm and sync mm's RSS info before statistics gathering */
+	mm_release(tsk, tsk->mm);
+
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
 	if (group_dead) {
 		hrtimer_cancel(&tsk->signal->real_timer);
@@ -974,7 +978,6 @@ void do_exit(long code)
 		tty_audit_exit();
 	audit_free(tsk);
 
-	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
 
 	exit_mm(tsk);
diff -puN kernel/fork.c~mm-correctly-synchronize-rss-counters-at-exit-exec kernel/fork.c
--- a/kernel/fork.c~mm-correctly-synchronize-rss-counters-at-exit-exec
+++ a/kernel/fork.c
@@ -809,6 +809,14 @@ void mm_release(struct task_struct *tsk,
 		}
 		tsk->clear_child_tid = NULL;
 	}
+
+	/*
+	 * Final rss-counter synchronization. After this point there must be
+	 * no pagefaults into this mm from the current context. Otherwise
+	 * mm->rss_stat will be inconsistent.
+	 */
+	if (mm)
+		sync_mm_rss(mm);
 }
 
 /*
_
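To make the ordering problem from the changelog above easier to follow, here is a small user-space model of it. This is only a sketch, not kernel code: struct mm_model, struct task_model and the model_*() functions are invented stand-ins for mm->rss_stat, task->rss_stat, sync_mm_rss() and the put_user() in mm_release(), and the scenario (a single page that sits in swap and gets faulted back in) is an assumption chosen to match the val:-1 / val:1 pair.

```c
/* Assumed counter order (3.4-era enum). */
enum { MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, NR_MM_COUNTERS };

struct mm_model {
	long rss_stat[NR_MM_COUNTERS];		/* stands in for mm->rss_stat */
};

struct task_model {
	struct mm_model *mm;
	long rss_delta[NR_MM_COUNTERS];		/* stands in for task->rss_stat */
};

/* Flush the per-task delta into the mm-wide counters. */
static void model_sync_mm_rss(struct task_model *t)
{
	for (int i = 0; i < NR_MM_COUNTERS; i++) {
		t->mm->rss_stat[i] += t->rss_delta[i];
		t->rss_delta[i] = 0;
	}
}

/* put_user() touches a swapped-out page: one swap entry becomes one anon page. */
static void model_release_with_swapin(struct task_model *t)
{
	t->rss_delta[MM_ANONPAGES]++;
	t->rss_delta[MM_SWAPENTS]--;
}

/*
 * Run the modelled exit path on an mm whose only page sits in swap.
 * sync_last == 0: sync first, then release (the ordering described as buggy).
 * sync_last == 1: release first, then sync (the ordering the fix establishes).
 * The leftover counters are stored in out[]; all zero means consistent.
 */
static void model_exit(int sync_last, long out[NR_MM_COUNTERS])
{
	struct mm_model mm = { .rss_stat = { [MM_SWAPENTS] = 1 } };
	struct task_model t = { .mm = &mm };

	if (!sync_last)
		model_sync_mm_rss(&t);
	model_release_with_swapin(&t);
	if (sync_last)
		model_sync_mm_rss(&t);

	/* Teardown unmaps what is really mapped: one anon page, no swap entry. */
	mm.rss_stat[MM_ANONPAGES] -= 1;

	for (int i = 0; i < NR_MM_COUNTERS; i++)
		out[i] = mm.rss_stat[i];
}
```

In this model, model_exit(0, out) leaves out[MM_ANONPAGES] at -1 and out[MM_SWAPENTS] at 1, the same shape as the report quoted in the changelog, while model_exit(1, out) leaves every counter at zero. The real teardown path is considerably more involved; the model only captures that a delta created after the last flush is never folded back into the mm.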
On Tue, 22 May 2012 16:28:35 -0700
Andrew Morton <[email protected]> wrote:
> I notice that I don't have this tagged for -stable backporting. That
> seems wrong. Konstantin, do we know for how long this bug has been in
> there?
Also, I have a note here that Oleg was unhappy with the patch. Oleg
happiness is important. Has he cheered up yet?
Andrew Morton wrote:
> On Wed, 23 May 2012 00:41:28 +0200
> Martin Mokrejs<[email protected]> wrote:
>
>> Hi Andrew,
>> while shutting down my laptop (Dell Vostro 3550 with 16GB RAM, core i7) with 3.4-rc7 I got:
>>
>> May 23 00:07:54 vostro kernel: [352687.968267] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:1 val:-59
>> May 23 00:07:54 vostro kernel: [352687.968312] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:2 val:59
>> May 23 00:07:55 vostro acpid: exiting
>> May 23 00:07:55 vostro syslog-ng[2838]: syslog-ng shutting down; version='3.3.4'
>>
>> I found by Google the below thread and thought that maybe it is related?
>> http://comments.gmane.org/gmane.linux.kernel.mm/76459
>>
>> ...
>>
>
>
> Well hopefully the below will fix this?
>
> I notice that I don't have this tagged for -stable backporting. That
> seems wrong. Konstantin, do we know for how long this bug has been in
> there?
It has been there for years; by itself it is mostly harmless.
The warning was added in c3f0327f8e9d7a503f0d64573c311eddd61f197d,
so only v3.4 prints it. I thought this fix would be merged before the release.
Hi,
I rebooted the laptop twice today after just brief use and the messages did not
appear in the logs.
Now I have applied the patch from Andrew's mail and during two more reboots they did not appear either.
Do I have to use the computer for longer to reproduce the issue? ;-)
I will keep the patch applied on top of 3.4-rc7, and should the BUG: reappear I will
let you know. But at the moment I doubt I can confirm that it really helped.
Any clues on how to reproduce it? ;)
Martin
On 05/22, Andrew Morton wrote:
>
> Also, I have a note here that Oleg was unhappy with the patch. Oleg
> happiness is important. Has he cheered up yet?
Well, yes, I do not really like this patch ;) Because I think there is
a more simple/straightforward fix, see below. In my opinion it also
makes the original code simpler.
But. Obviously this is subjective, I can't prove my patch is "better",
and I didn't try to test it.
So I won't argue with Konstantin who dislikes my patch, although I
would like to know the reason.
Oleg.
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *sta
 	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
 	mm = get_task_mm(p);
 	if (mm) {
+		sync_mm_rss(mm);
 		/* adjust to KB unit */
 		stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
 		stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -643,6 +643,8 @@ static void exit_mm(struct task_struct *
 	mm_release(tsk, mm);
 	if (!mm)
 		return;
+
+	sync_mm_rss(mm);
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold mmap_sem around checking core_state
@@ -960,9 +962,6 @@ void do_exit(long code)
 				preempt_count());
 
 	acct_update_integrals(tsk);
-	/* sync mm's RSS info before statistics gathering */
-	if (tsk->mm)
-		sync_mm_rss(tsk->mm);
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
 	if (group_dead) {
 		hrtimer_cancel(&tsk->signal->real_timer);
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -823,10 +823,10 @@ static int exec_mmap(struct mm_struct *m
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
-	sync_mm_rss(old_mm);
 	mm_release(tsk, old_mm);
 
 	if (old_mm) {
+		sync_mm_rss(old_mm);
 		/*
 		 * Make sure that if there is a core dump in progress
 		 * for the old mm, we get out and die instead of going
Martin Mokrejs wrote:
> Hi,
> I rebooted the laptop twice today after just brief uses and the messages did not
> appear in the logs.
>
> Now I just applied the below patch and during two reboots it did not appear either.
> Do I have to use the computer for some longer while to reproduce the issue? ;-)
Yes, some data must be in swap to reproduce this, so memory pressure is required here.
>
> I will stay with the patch applied over 3.4-rc7 and would the BUG: re-appear I will
> let you know. But I doubt at the moment I could confirm it really helped.
> Clues how to reproduce? ;)
> Martin
Oleg Nesterov wrote:
> On 05/22, Andrew Morton wrote:
>>
>> Also, I have a note here that Oleg was unhappy with the patch. Oleg
>> happiness is important. Has he cheered up yet?
>
> Well, yes, I do not really like this patch ;) Because I think there is
> a more simple/straightforward fix, see below. In my opinion it also
> makes the original code simpler.
>
> But. Obviously this is subjective, I can't prove my patch is "better",
> and I didn't try to test it.
>
> So I won't argue with Konstantin who dislikes my patch, although I
> would like to know the reason.
I don't remember why I dislike your patch.
For now I can only say ACK )
On Wed, 30 May 2012 00:18:31 +0400
Konstantin Khlebnikov <[email protected]> wrote:
> Oleg Nesterov wrote:
> > On 05/22, Andrew Morton wrote:
> >>
> >> Also, I have a note here that Oleg was unhappy with the patch. Oleg
> >> happiness is important. Has he cheered up yet?
> >
> > Well, yes, I do not really like this patch ;) Because I think there is
> > a more simple/straightforward fix, see below. In my opinion it also
> > makes the original code simpler.
> >
> > But. Obviously this is subjective, I can't prove my patch is "better",
> > and I didn't try to test it.
> >
> > So I won't argue with Konstantin who dislikes my patch, although I
> > would like to know the reason.
>
> I don't remember why I dislike your patch.
> For now I can only say ACK )
We'll need a changelogged signed-off patch, please Oleg. And some evidence
that it was tested would be nice ;)
Andrew Morton wrote:
> On Wed, 30 May 2012 00:18:31 +0400
> Konstantin Khlebnikov <[email protected]> wrote:
>
>> Oleg Nesterov wrote:
>>> On 05/22, Andrew Morton wrote:
>>>>
>>>> Also, I have a note here that Oleg was unhappy with the patch. Oleg
>>>> happiness is important. Has he cheered up yet?
>>>
>>> Well, yes, I do not really like this patch ;) Because I think there is
>>> a more simple/straightforward fix, see below. In my opinion it also
>>> makes the original code simpler.
>>>
>>> But. Obviously this is subjective, I can't prove my patch is "better",
>>> and I didn't try to test it.
>>>
>>> So I won't argue with Konstantin who dislikes my patch, although I
>>> would like to know the reason.
>>
>> I don't remember why I dislike your patch.
>> For now I can only say ACK )
>
> We'll need a changelogged signed-off patch, please Oleg. And some evidence
> that it was tested would be nice ;)
I will reboot in a few hours, finally, after a few days; I am still running the first
patch. I will try to test the second (alternative) patch more quickly. Sorry for
the delay.
Konstantin Khlebnikov wrote:
> Andrew Morton wrote:
>> On Wed, 23 May 2012 00:41:28 +0200
>> Martin Mokrejs<[email protected]> wrote:
>>
>>> Hi Andrew,
>>> while shutting down my laptop (Dell Vostro 3550 with 16GB RAM, core i7) with 3.4-rc7 I got:
>>>
>>> May 23 00:07:54 vostro kernel: [352687.968267] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:1 val:-59
>>> May 23 00:07:54 vostro kernel: [352687.968312] BUG: Bad rss-counter state mm:ffff88040b56f800 idx:2 val:59
>>> May 23 00:07:55 vostro acpid: exiting
>>> May 23 00:07:55 vostro syslog-ng[2838]: syslog-ng shutting down; version='3.3.4'
>>>
>>> I found by Google the below thread and thought that maybe it is related?
>>> http://comments.gmane.org/gmane.linux.kernel.mm/76459
>>>
>>> ...
>>>
>>
>>
>> Well hopefully the below will fix this?
>>
>> I notice that I don't have this tagged for -stable backporting. That
>> seems wrong. Konstantin, do we know for how long this bug has been in
>> there?
>
> It has been there for years; by itself it is mostly harmless.
> The warning was added in c3f0327f8e9d7a503f0d64573c311eddd61f197d,
> so only v3.4 prints it. I thought this fix would be merged before the release.
>
>>
>>
>>
>> From: Konstantin Khlebnikov<[email protected]>
>> Subject: mm: correctly synchronize rss-counters at exit/exec
>>
>> mm->rss_stat counters have per-task delta: task->rss_stat. Before
>> changing task->mm pointer the kernel must flush this delta with
>> sync_mm_rss().
>>
>> do_exit() already calls sync_mm_rss() to flush the rss-counters before
>> committing the rss statistics into task->signal->maxrss, taskstats, audit
>> and other stuff. Unfortunately the kernel does this before calling
>> mm_release(), which can call put_user() for processing
>> task->clear_child_tid. So at this point we can trigger page-faults and
>> task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
>> inconsistent and check_mm() will print something like this:
>>
>> | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
>> | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
>>
>> This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
>> out of do_exit() and calls it earlier. After mm_release() there should be
>> no pagefaults.
>>
>> [[email protected]: tweak comment]
>> Signed-off-by: Konstantin Khlebnikov<[email protected]>
>> Reported-by: Markus Trippelsdorf<[email protected]>
>> Cc: Hugh Dickins<[email protected]>
>> Cc: KAMEZAWA Hiroyuki<[email protected]>
>> Cc: Oleg Nesterov<[email protected]>
>> Signed-off-by: Andrew Morton<[email protected]>
>> ---
>>
>> fs/exec.c | 1 -
>> kernel/exit.c | 13 ++++++++-----
>> kernel/fork.c | 8 ++++++++
>> 3 files changed, 16 insertions(+), 6 deletions(-)
>>
>> diff -puN fs/exec.c~mm-correctly-synchronize-rss-counters-at-exit-exec fs/exec.c
>> --- a/fs/exec.c~mm-correctly-synchronize-rss-counters-at-exit-exec
>> +++ a/fs/exec.c
>> @@ -823,7 +823,6 @@ static int exec_mmap(struct mm_struct *m
>> /* Notify parent that we're no longer interested in the old VM */
>> tsk = current;
>> old_mm = current->mm;
>> - sync_mm_rss(old_mm);
>> mm_release(tsk, old_mm);
>>
>> if (old_mm) {
>> diff -puN kernel/exit.c~mm-correctly-synchronize-rss-counters-at-exit-exec kernel/exit.c
>> --- a/kernel/exit.c~mm-correctly-synchronize-rss-counters-at-exit-exec
>> +++ a/kernel/exit.c
>> @@ -423,6 +423,7 @@ void daemonize(const char *name, ...)
>> * user space pages. We don't need them, and if we didn't close them
>> * they would be locked into memory.
>> */
>> + mm_release(current, current->mm);
>> exit_mm(current);
>> /*
>> * We don't want to get frozen, in case system-wide hibernation
>> @@ -640,7 +641,6 @@ static void exit_mm(struct task_struct *
>> struct mm_struct *mm = tsk->mm;
>> struct core_state *core_state;
>>
>> - mm_release(tsk, mm);
>> if (!mm)
>> return;
>> /*
>> @@ -959,9 +959,13 @@ void do_exit(long code)
>> preempt_count());
>>
>> acct_update_integrals(tsk);
>> - /* sync mm's RSS info before statistics gathering */
>> - if (tsk->mm)
>> - sync_mm_rss(tsk->mm);
>> +
>> + /* Set exit_code before complete_vfork_done() in mm_release() */
>> + tsk->exit_code = code;
>> +
>> + /* Release mm and sync mm's RSS info before statistics gathering */
>> + mm_release(tsk, tsk->mm);
>> +
>> group_dead = atomic_dec_and_test(&tsk->signal->live);
>> if (group_dead) {
>> hrtimer_cancel(&tsk->signal->real_timer);
>> @@ -974,7 +978,6 @@ void do_exit(long code)
>> tty_audit_exit();
>> audit_free(tsk);
>>
>> - tsk->exit_code = code;
>> taskstats_exit(tsk, group_dead);
>>
>> exit_mm(tsk);
>> diff -puN kernel/fork.c~mm-correctly-synchronize-rss-counters-at-exit-exec kernel/fork.c
>> --- a/kernel/fork.c~mm-correctly-synchronize-rss-counters-at-exit-exec
>> +++ a/kernel/fork.c
>> @@ -809,6 +809,14 @@ void mm_release(struct task_struct *tsk,
>> }
>> tsk->clear_child_tid = NULL;
>> }
>> +
>> + /*
>> + * Final rss-counter synchronization. After this point there must be
>> + * no pagefaults into this mm from the current context. Otherwise
>> + * mm->rss_stat will be inconsistent.
>> + */
>> + if (mm)
>> + sync_mm_rss(mm);
>> }
>>
>> /*
>> _
>>
I made my system allocate some 3 million blocks in swap, according to vmstat(1),
and rebooted. It took the system about 6 minutes to kill 7 gimp images of 2.2GB each
(16000x8000px at 1200dpi) and a python session holding some huge lists in memory.
I have 16GB of RAM. There were no errors, warnings or Oopses logged in /var/log/messages,
so I conclude this patch from Konstantin Khlebnikov works for me on 3.4-rc7.
May 30 09:57:57 vostro syslog-ng[2534]: syslog-ng shutting down; version='3.3.4'
May 30 10:06:31 vostro syslog-ng[2519]: syslog-ng starting up; version='3.3.4'
Tested-by: Martin Mokrejs <[email protected]>
--
Will try the other patch from Oleg Nesterov now.
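
As a rough illustration of the ordering problem described in the changelog
quoted above, here is a toy user-space model. It is not kernel code: the names
model_mm, model_task, model_sync, model_fault, model_unmap and model_exit are
made up for this sketch and only mimic the roles of mm->rss_stat,
task->rss_stat and sync_mm_rss(). It assumes exactly one page is faulted in by
the put_user() in mm_release() and one page is removed at teardown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm->rss_stat and task->rss_stat. */
struct model_mm   { long anon_pages; };
struct model_task { long anon_delta; struct model_mm *mm; };

/* Plays the role of sync_mm_rss(): fold the per-task delta into the mm. */
static void model_sync(struct model_task *t)
{
    assert(t->mm != NULL);
    t->mm->anon_pages += t->anon_delta;
    t->anon_delta = 0;
}

/* Plays the role of a page fault: only the per-task delta is bumped. */
static void model_fault(struct model_task *t)
{
    t->anon_delta += 1;
}

/* Plays the role of teardown: the shared counter is decremented directly. */
static void model_unmap(struct model_mm *mm)
{
    mm->anon_pages -= 1;
}

/*
 * Returns the counter value left over after exit.
 * sync_last == 0: sync first, then fault (the order before the fix).
 * sync_last == 1: fault first, then sync (the order after the fix).
 */
static long model_exit(int sync_last)
{
    struct model_mm mm = { 0 };
    struct model_task t = { 0, &mm };

    if (!sync_last)
        model_sync(&t);
    model_fault(&t);      /* put_user(clear_child_tid) touches a new page */
    if (sync_last)
        model_sync(&t);
    model_unmap(&mm);     /* the page goes away when the mm is torn down */
    return mm.anon_pages; /* any delta still in t is simply lost */
}
```

In this model, model_exit(0) ends at -1, which has the same shape as the
"val:-1" report, and model_exit(1) ends at 0.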
Oleg Nesterov wrote:
> On 05/22, Andrew Morton wrote:
>>
>> Also, I have a note here that Oleg was unhappy with the patch. Oleg
>> happiness is important. Has he cheered up yet?
>
> Well, yes, I do not really like this patch ;) Because I think there is
> a more simple/straightforward fix, see below. In my opinion it also
> makes the original code simpler.
>
> But. Obviously this is subjective, I can't prove my patch is "better",
> and I didn't try to test it.
>
> So I won't argue with Konstantin who dislikes my patch, although I
> would like to know the reason.
>
> Oleg.
>
>
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *sta
> stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> mm = get_task_mm(p);
> if (mm) {
> + sync_mm_rss(mm);
> /* adjust to KB unit */
> stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
> stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -643,6 +643,8 @@ static void exit_mm(struct task_struct *
> mm_release(tsk, mm);
> if (!mm)
> return;
> +
> + sync_mm_rss(mm);
> /*
> * Serialize with any possible pending coredump.
> * We must hold mmap_sem around checking core_state
> @@ -960,9 +962,6 @@ void do_exit(long code)
> preempt_count());
>
> acct_update_integrals(tsk);
> - /* sync mm's RSS info before statistics gathering */
> - if (tsk->mm)
> - sync_mm_rss(tsk->mm);
> group_dead = atomic_dec_and_test(&tsk->signal->live);
> if (group_dead) {
> hrtimer_cancel(&tsk->signal->real_timer);
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -823,10 +823,10 @@ static int exec_mmap(struct mm_struct *m
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
> - sync_mm_rss(old_mm);
> mm_release(tsk, old_mm);
>
> if (old_mm) {
> + sync_mm_rss(old_mm);
> /*
> * Make sure that if there is a core dump in progress
> * for the old mm, we get out and die instead of going
>
>
Tested-by: Martin Mokrejs <[email protected]>
This patch works equally well for me as the other patch proposed earlier by Konstantin
Khlebnikov.
If both patches had some debug printk() showing that the code really did kick
in, I would be more assured they had a chance to really do their job. But
in both cases I made the system use up all RAM and start to swap, so if that was
enough to trigger the situation, as you said earlier, then they are both fine.
Finally, I went to re-test the patch from Konstantin because the several
minutes long delay in shutdown puzzled me and I did not get it with this patch
from Oleg. I conclude it was probably related to my initial attempts to also copy
/home/blah to /tmp (I thought it was an in-memory filesystem so I could easily drain
memory resources, but it seems I was wrong). Maybe this was the reason why the
shutdown took so long. I am still not sure, because the init.d/ scripts clean up /tmp
on startup on Gentoo ... but I was not able to reproduce the long delay on a second
attempt, using purely python to eat my memory by building some huge lists.
For those also wondering why the long delay on shutdown happened, here are my
mounts:
# mount
rootfs on / type rootfs (rw)
/dev/root on / type ext3 (rw,noatime,commit=0)
devtmpfs on /dev type devtmpfs (rw,relatime,size=8184896k,nr_inodes=2046224,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
rc-svcdir on /lib64/rc/init.d type tmpfs (rw,nosuid,nodev,noexec,relatime,size=1024k,mode=755)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
openrc on /sys/fs/cgroup/openrc type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib64/rc/sh/cgroup-release-agent.sh,name=openrc)
cpu on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)
#
Martin Mokrejs wrote:
> Andrew Morton wrote:
>> On Wed, 30 May 2012 00:18:31 +0400
>> Konstantin Khlebnikov<[email protected]> wrote:
>>
>>> Oleg Nesterov wrote:
>>>> On 05/22, Andrew Morton wrote:
>>>>>
>>>>> Also, I have a note here that Oleg was unhappy with the patch. Oleg
>>>>> happiness is important. Has he cheered up yet?
>>>>
>>>> Well, yes, I do not really like this patch ;) Because I think there is
>>>> a more simple/straightforward fix, see below. In my opinion it also
>>>> makes the original code simpler.
>>>>
>>>> But. Obviously this is subjective, I can't prove my patch is "better",
>>>> and I didn't try to test it.
>>>>
>>>> So I won't argue with Konstantin who dislikes my patch, although I
>>>> would like to know the reason.
>>>
>>> I don't remember why I dislike your patch.
>>> For now I can only say ACK )
>>
>> We'll need a changelogged signed-off patch, please Oleg. And some evidence
>> that it was tested would be nice ;)
>
> I will reboot in few hours, finally after few days ... I am running this first
> patch. I will try to test the second/alternative patch more quickly. Sorry for
> the delay.
>
The easiest way to trigger this bug:

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/mman.h>

static inline int sys_clone(unsigned long flags, void *stack, int *ptid, int *ctid)
{
	return syscall(SYS_clone, flags, stack, ptid, ctid);
}

int main(int argc, char **argv)
{
	void *page;

	page = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	sys_clone(CLONE_VFORK | CLONE_VM | CLONE_CHILD_CLEARTID, NULL, NULL, page);
}
Konstantin Khlebnikov wrote:
> easiest way trigger this bug:
>
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <sched.h>
> #include <sys/syscall.h>
> #include <sys/mman.h>
>
> static inline int sys_clone(unsigned long flags, void *stack, int *ptid, int *ctid)
> {
> return syscall(SYS_clone, flags, stack, ptid, ctid);
> }
>
> int main(int argc, char **argv)
> {
> void *page;
>
> page = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> sys_clone(CLONE_VFORK | CLONE_VM | CLONE_CHILD_CLEARTID, NULL, NULL, page);
> }
>
I am getting segfaults with this.
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x00007f430f70a7e0 in __elf_set___libc_subfreeres_element_free_mem__ () from /lib64/libc.so.6
#2 0x00007f430f70a7e8 in __elf_set___libc_atexit_element__IO_cleanup__ () from /lib64/libc.so.6
#3 0x0000000000000001 in ?? ()
#4 0x0000000000000000 in ?? ()
(gdb)
What number should I give it as an argument? ;-)
Martin
Martin Mokrejs wrote:
> I am getting segfaults with this.
>
> (gdb) where
> #0 0x0000000000000000 in ?? ()
> #1 0x00007f430f70a7e0 in __elf_set___libc_subfreeres_element_free_mem__ () from /lib64/libc.so.6
> #2 0x00007f430f70a7e8 in __elf_set___libc_atexit_element__IO_cleanup__ () from /lib64/libc.so.6
> #3 0x0000000000000001 in ?? ()
> #4 0x0000000000000000 in ?? ()
> (gdb)
>
> What number should I give it as an argument? ;-)
There are no arguments.
Yeah, it corrupts the stack. I'm too lazy to write it properly =)
But on a non-patched kernel it still triggers this bug:
[206732.025131] BUG: Bad rss-counter state mm:ffff88000d8a6c80 idx:1 val:-1
Konstantin Khlebnikov wrote:
> there is no arguments.
>
> yeah it corrupts stack. I'm too lazy to write it properly =)
> but on non-patched kernel it also triggers this bug:
> [206732.025131] BUG: Bad rss-counter state mm:ffff88000d8a6c80 idx:1 val:-1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email:<a href=mailto:"[email protected]"> [email protected]</a>
This version works without segfaults =)

#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <sys/mman.h>

int child(void *arg)
{
	return 0;
}

char stack[4096];

int main(int argc, char **argv)
{
	void *page;

	page = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	clone(child, stack + sizeof(stack), CLONE_VFORK | CLONE_VM | CLONE_CHILD_CLEARTID, NULL, NULL, NULL, page);
	return 0;
}
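
For anyone who wants to run the reproducer from a test harness, here is a
sketch of the same idea with return values checked. It is a variation, not the
version posted above: the function name run_reproducer(), the larger aligned
stack, the added SIGCHLD exit signal, the waitpid() and the munmap() are all
additions made for this sketch. The 4096-byte page size is carried over from
the original. The mapped page is deliberately never touched from user space, on
the assumption that the kernel's write of clear_child_tid at child exit is then
what faults it in.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

/* The child does nothing; the interesting part happens in the kernel at exit. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Separate child stack, so the child does not run on the parent's stack. */
static char child_stack[16384] __attribute__((aligned(16)));

/* Returns 0 if the child was created and reaped, -1 on any failure. */
static int run_reproducer(void)
{
    int status;
    pid_t pid;
    void *page;

    /* Untouched anonymous page that will hold the child's ctid. */
    page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    /* glibc's clone(): fn, stack top, flags, arg, ptid, tls, ctid. */
    pid = clone(child_fn, child_stack + sizeof(child_stack),
                CLONE_VFORK | CLONE_VM | CLONE_CHILD_CLEARTID | SIGCHLD,
                NULL, NULL, NULL, page);
    if (pid == -1) {
        munmap(page, 4096);
        return -1;
    }

    /* SIGCHLD as the exit signal lets a plain waitpid() reap the child. */
    if (waitpid(pid, &status, 0) != pid) {
        munmap(page, 4096);
        return -1;
    }

    munmap(page, 4096);
    return 0;
}
```

Call run_reproducer() from main() and then check dmesg for the
"Bad rss-counter state" line; whether it shows up depends on the kernel being
unpatched, as described in this thread.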
Konstantin Khlebnikov wrote:
> this version works without segfaults =)
>
> #define _GNU_SOURCE
> #include <stdlib.h>
> #include <sched.h>
> #include <sys/mman.h>
>
> int child(void *arg)
> {
> return 0;
> }
>
> char stack[4096];
>
> int main(int argc, char **argv)
> {
> void *page;
>
> page = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> clone(child, stack + sizeof(stack), CLONE_VFORK | CLONE_VM | CLONE_CHILD_CLEARTID, NULL, NULL, NULL, page);
> return 0;
> }
>
Thanks, this app does not crash anymore. Re-confirming that both patches fix the issue on my system.
Martin
On 05/30, Konstantin Khlebnikov wrote:
>
> I don't remember why I dislike your patch.
> For now I can only say ACK )
Great.
Thanks Konstantin, thanks Martin!
I'll write the changelog and send the patch tomorrow.
Oleg.
Oleg Nesterov wrote:
> On 05/30, Konstantin Khlebnikov wrote:
>>
>> I don't remember why I dislike your patch.
>> For now I can only say ACK )
>
> Great.
>
> Thanks Konstantin, thanks Martin!
>
> I'll write the changelog and send the patch tomorrow.
Ding! Week is over, or I missed something? )
>
> Oleg.
On Thu, Jun 7, 2012 at 9:59 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> Oleg Nesterov wrote:
>>
>> On 05/30, Konstantin Khlebnikov wrote:
>>>
>>>
>>> I don't remember why I dislike your patch.
>>> For now I can only say ACK )
>>
>>
>> Great.
>>
>> Thanks Konstantin, thanks Martin!
>>
>> I'll write the changelog and send the patch tomorrow.
>
>
> Ding! Week is over, or I missed something? )
FWIW, I see the same issue also on UML (3.5-rc1).
--
Thanks,
//richard
On 06/07, Konstantin Khlebnikov wrote:
>
> Oleg Nesterov wrote:
>>
>> I'll write the changelog and send the patch tomorrow.
>
> Ding! Week is over, or I missed something? )
Pong ;)
I have sent the patch on May 31, see
http://marc.info/?l=linux-kernel&m=133848759505805
Also attached below, just in case.
Initially I sent 2 patches, see
http://marc.info/?l=linux-kernel&m=133848784705941
but 2/2 (your patch) was already merged.
-------------------------------------------------------------------------------
[PATCH] correctly synchronize rss-counters at exit/exec

A simplified version of Konstantin Khlebnikov's patch.

do_exit() and exec_mmap() call sync_mm_rss() before mm_release()
does put_user(clear_child_tid) which can update task->rss_stat
and thus make mm->rss_stat inconsistent. This triggers the "BUG:"
printk in check_mm().

- Move the final sync_mm_rss() from do_exit() to exit_mm(), and
  change exec_mmap() to call sync_mm_rss() after mm_release() to
  make check_mm() happy.

  Perhaps we should simply move it into mm_release() and call it
  unconditionally to catch the "task->rss_stat != 0 && !task->mm"
  bugs.

- Since taskstats_exit() is called before exit_mm(), add another
  sync_mm_rss() into xacct_add_tsk() who actually uses rss_stat.

  Probably we should also shift acct_update_integrals().

Reported-by: Markus Trippelsdorf <[email protected]>
Tested-by: Martin Mokrejs <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
---
fs/exec.c | 2 +-
kernel/exit.c | 5 ++---
kernel/tsacct.c | 1 +
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 52c9e2f..e49e3c2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -823,10 +823,10 @@ static int exec_mmap(struct mm_struct *mm)
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
- sync_mm_rss(old_mm);
mm_release(tsk, old_mm);
if (old_mm) {
+ sync_mm_rss(old_mm);
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
diff --git a/kernel/exit.c b/kernel/exit.c
index ab972a7..b3a84b5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -655,6 +655,8 @@ static void exit_mm(struct task_struct * tsk)
mm_release(tsk, mm);
if (!mm)
return;
+
+ sync_mm_rss(mm);
/*
* Serialize with any possible pending coredump.
* We must hold mmap_sem around checking core_state
@@ -965,9 +967,6 @@ void do_exit(long code)
preempt_count());
acct_update_integrals(tsk);
- /* sync mm's RSS info before statistics gathering */
- if (tsk->mm)
- sync_mm_rss(tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 23b4d78..a64ee90 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -91,6 +91,7 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
mm = get_task_mm(p);
if (mm) {
+ sync_mm_rss(mm);
/* adjust to KB unit */
stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
--
1.5.5.1
Oleg Nesterov wrote:
> On 06/07, Konstantin Khlebnikov wrote:
>>
>> Oleg Nesterov wrote:
>>>
>>> I'll write the changelog and send the patch tomorrow.
>>
>> Ding! Week is over, or I missed something? )
>
> Pong ;)
>
> I have sent the patch on May 31, see
> http://marc.info/?l=linux-kernel&m=133848759505805
> Also attached below, just in case.
>
> Initially I sent 2 patches, see
> http://marc.info/?l=linux-kernel&m=133848784705941
> but 2/2 (your patch) was already merged.
Hmm, ok. Thanks.
I think the rss fix must go into stable-3.4.x -- that "BUG..." message can disturb users.
Plus, via this bug any application can drive its rss down to zero =)
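
To watch the reported rss of a process while experimenting with this, one
option is to read the second field of /proc/self/statm, which is the resident
set size in pages. A minimal sketch follows; the helper name read_rss_pages()
is made up here, and it assumes /proc is mounted.

```c
#include <stdio.h>

/* Returns the resident set size of the calling process in pages, or -1 on error. */
static long read_rss_pages(void)
{
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");

    if (!f)
        return -1;
    /* statm starts with: total program size, then resident set size. */
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}
```

Printing read_rss_pages() before and after a batch of reproducer runs gives a
simple way to compare the value on patched and unpatched kernels.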