2014-06-04 18:24:01

by Greg KH

Subject: Bad rss-counter is back on 3.14-stable

Hi all,

Dave, I saw you mention that you were seeing the "Bad rss-counter" line
on 3.15-rc1, but I couldn't find any follow-up on this to see if anyone
figured it out, or did it just "magically" go away?

I ask as Brandon is seeing this same message a lot on a 3.14.4 kernel,
causing system crashes and problems:

[16591492.449718] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:0 val:-1836508
[16591492.449737] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:1 val:1836508

[20783350.461716] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:0 val:-52518
[20783350.461734] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:1 val:52518

[21393387.112302] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:0 val:-1767569
[21393387.112321] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:1 val:1767569

[21430098.512837] BUG: Bad rss-counter state mm:ffff880100036680 idx:0 val:-2946
[21430098.512854] BUG: Bad rss-counter state mm:ffff880100036680 idx:1 val:2946

Anyone have any ideas of a 3.15-rc patch I should be including in
3.14-stable to resolve this?

thanks,

greg k-h
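
In each pair of log lines above, the idx:0 and idx:1 values for the same mm are equal in magnitude and opposite in sign. The following is a minimal sketch of checking that property mechanically. It assumes the exact line format shown in the report; the names rss_report, parse_rss_line and pair_cancels are hypothetical helpers, not kernel or tool APIs. If the indices follow the 3.14 counter enum, idx:0 would be file pages and idx:1 anonymous pages, but that mapping is an assumption here and is not stated in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed "Bad rss-counter" report (hypothetical helper type). */
struct rss_report {
    unsigned long long mm; /* address printed after "mm:" */
    int idx;               /* counter index printed after "idx:" */
    long val;              /* leftover counter value printed after "val:" */
};

/* Parse one log line. Returns 1 on success, 0 if the line does not match. */
static int parse_rss_line(const char *line, struct rss_report *out)
{
    const char *p;

    assert(line != NULL && out != NULL);
    /* Skip the timestamp and "BUG: " prefix by searching for the message. */
    p = strstr(line, "Bad rss-counter state mm:");
    if (p == NULL)
        return 0;
    return sscanf(p, "Bad rss-counter state mm:%llx idx:%d val:%ld",
                  &out->mm, &out->idx, &out->val) == 3;
}

/*
 * Returns 1 when both lines parse, refer to the same mm, use different
 * counter indices, and their values sum to zero. Returns 0 otherwise.
 */
static int pair_cancels(const char *line_a, const char *line_b)
{
    struct rss_report a, b;

    if (!parse_rss_line(line_a, &a) || !parse_rss_line(line_b, &b))
        return 0;
    return a.mm == b.mm && a.idx != b.idx && a.val + b.val == 0;
}
```

Passing the first two log lines of the report to pair_cancels() returns 1, since -1836508 and 1836508 sum to zero for the same mm. That pattern would be consistent with pages being added to one counter and removed from the other, though the thread itself does not confirm the cause.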


2014-06-04 18:47:55

by Dennis Mungai

Subject: Re: Bad rss-counter is back on 3.14-stable

Hello Greg,

do_exit() and exec_mmap() call sync_mm_rss() before mm_release()
does put_user(clear_child_tid) which can update task->rss_stat
and thus make mm->rss_stat inconsistent. This triggers the "BUG:"
printk in check_mm().

Let's fix this bug in the safest way, and optimize/cleanup this later.

Reported-by: Greg KH <[email protected]>
Signed-off-by: Dennis E. Mungai <[email protected]>
---
 fs/exec.c     | 2 +-
 kernel/exit.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index a79786a..da27b91 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -819,10 +819,10 @@ static int exec_mmap(struct mm_struct *mm)
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
-	sync_mm_rss(old_mm);
 	mm_release(tsk, old_mm);
 
 	if (old_mm) {
+		sync_mm_rss(old_mm);
 		/*
 		 * Make sure that if there is a core dump in progress
 		 * for the old mm, we get out and die instead of going
diff --git a/kernel/exit.c b/kernel/exit.c
index 34867cc..c0277d3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -643,6 +643,7 @@ static void exit_mm(struct task_struct * tsk)
 	mm_release(tsk, mm);
 	if (!mm)
 		return;
+	sync_mm_rss(mm);
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold mmap_sem around checking core_state

Apply that patch and see how it goes.
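
The ordering problem described in the patch description can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions, not kernel code: mm_model, task_model, fault_in_page, sync_rss, release and run_exit are hypothetical names, sync_rss stands in for sync_mm_rss(), and release stands in for the page fault that put_user(clear_child_tid) may trigger inside mm_release().

```c
#include <assert.h>

enum { NR_COUNTERS = 2 };

struct mm_model {
    long rss[NR_COUNTERS];    /* what mm->rss_stat would hold */
    long mapped[NR_COUNTERS]; /* pages really mapped (ground truth) */
};

struct task_model {
    long cached[NR_COUNTERS]; /* per-task deltas not yet flushed to the mm */
};

/* Fault path: the page is mapped, but the delta is only cached per task. */
static void fault_in_page(struct task_model *t, struct mm_model *mm, int idx)
{
    assert(idx >= 0 && idx < NR_COUNTERS);
    mm->mapped[idx]++;
    t->cached[idx]++;
}

/* Stand-in for sync_mm_rss(): flush the task's cached deltas into the mm. */
static void sync_rss(struct task_model *t, struct mm_model *mm)
{
    for (int i = 0; i < NR_COUNTERS; i++) {
        mm->rss[i] += t->cached[i];
        t->cached[i] = 0;
    }
}

/* Stand-in for mm_release(): the put_user() may fault in one more page. */
static void release(struct task_model *t, struct mm_model *mm)
{
    fault_in_page(t, mm, 1);
}

/* Unmap everything and return the leftover counter; nonzero means "Bad". */
static long teardown_leftover(struct mm_model *mm, int idx)
{
    mm->rss[idx] -= mm->mapped[idx];
    mm->mapped[idx] = 0;
    return mm->rss[idx];
}

/* Run the exit sequence with the sync either before or after release(). */
static long run_exit(int sync_after_release)
{
    struct mm_model mm = {{0}, {0}};
    struct task_model t = {{0}};

    fault_in_page(&t, &mm, 1); /* two ordinary faults during the task's life */
    fault_in_page(&t, &mm, 1);
    if (!sync_after_release)
        sync_rss(&t, &mm);     /* old order: sync, then release */
    release(&t, &mm);
    if (sync_after_release)
        sync_rss(&t, &mm);     /* patched order: release, then sync */
    return teardown_leftover(&mm, 1);
}
```

In this model run_exit(0) returns -1, because the fault taken in release() is never flushed into the mm, while run_exit(1) returns 0. The real kernel paths are more involved, so treat this only as a picture of why the patch moves the sync_mm_rss() call after mm_release().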

On 4 June 2014 21:27, Greg KH <[email protected]> wrote:
> Hi all,
>
> Dave, I saw you mention that you were seeing the "Bad rss-counter" line
> on 3.15-rc1, but I couldn't find any follow-up on this to see if anyone
> figured it out, or did it just "magically" go away?
>
> I ask as Brandon is seeing this same message a lot on a 3.14.4 kernel,
> causing system crashes and problems:
>
> [16591492.449718] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:0 val:-1836508
> [16591492.449737] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:1 val:1836508
>
> [20783350.461716] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:0 val:-52518
> [20783350.461734] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:1 val:52518
>
> [21393387.112302] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:0 val:-1767569
> [21393387.112321] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:1 val:1767569
>
> [21430098.512837] BUG: Bad rss-counter state mm:ffff880100036680 idx:0 val:-2946
> [21430098.512854] BUG: Bad rss-counter state mm:ffff880100036680 idx:1 val:2946
>
> Anyone have any ideas of a 3.15-rc patch I should be including in
> 3.14-stable to resolve this?
>
> thanks,
>
> greg k-h


2014-06-04 19:12:44

by Dave Jones

Subject: Re: Bad rss-counter is back on 3.14-stable

On Wed, Jun 04, 2014 at 11:27:39AM -0700, Greg KH wrote:
> Hi all,
>
> Dave, I saw you mention that you were seeing the "Bad rss-counter" line
> on 3.15-rc1, but I couldn't find any follow-up on this to see if anyone
> figured it out, or did it just "magically" go away?
>
> I ask as Brandon is seeing this same message a lot on a 3.14.4 kernel,
> causing system crashes and problems:
>
> [16591492.449718] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:0 val:-1836508
> [16591492.449737] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:1 val:1836508
>
> [20783350.461716] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:0 val:-52518
> [20783350.461734] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:1 val:52518
>
> [21393387.112302] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:0 val:-1767569
> [21393387.112321] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:1 val:1767569
>
> [21430098.512837] BUG: Bad rss-counter state mm:ffff880100036680 idx:0 val:-2946
> [21430098.512854] BUG: Bad rss-counter state mm:ffff880100036680 idx:1 val:2946
>
> Anyone have any ideas of a 3.15-rc patch I should be including in
> 3.14-stable to resolve this?

Hard to tell if they were the same issues I was seeing without the full
backtrace. The only bad rss bugs that I recall being fixed for sure were
the ones that Hugh nailed down right before 3.14 (887843961c4b).

I've not seen anything in a while, but that may just be because I end up
hitting other bugs before they get a chance to show.

Brandon, what kind of workload is that machine doing? I wonder if I can
add something to trinity to make it provoke it.

Dave

2014-06-04 19:35:47

by Brandon Philips

Subject: Re: Bad rss-counter is back on 3.14-stable

On Wed, Jun 4, 2014 at 12:12 PM, Dave Jones <[email protected]> wrote:
> Brandon, what kind of workload is that machine doing? I wonder if I can
> add something to trinity to make it provoke it.

A really boring database workload (fsync() ~50ms) with a sloowww block
device with btrfs. There are occasional CPU spikes due to expensive
queries.

How can I be more helpful in my workload description?

Thanks,

Brandon
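
A workload of this shape (small appends, each followed by fsync(), on a slow device) can be roughly approximated with a few lines of POSIX C. This is only a sketch for describing or measuring commit latency, not a reproducer for the rss-counter message; append_and_fsync_ms and worst_commit_ms are hypothetical names, and the 4096-byte record size is an arbitrary assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Append len bytes to fd and fsync(); returns elapsed ms, or -1.0 on error. */
static double append_and_fsync_ms(int fd, const void *buf, size_t len)
{
    struct timespec start, end;
    const char *p = buf;
    size_t left = len;

    assert(buf != NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    while (left > 0) {
        ssize_t n = write(fd, p, left); /* may write fewer bytes than asked */
        if (n < 0)
            return -1.0;
        p += n;
        left -= (size_t)n;
    }
    if (fsync(fd) != 0)                 /* force the data to stable storage */
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

/*
 * Open path for appending, run `commits` append+fsync cycles, and return
 * the worst observed latency in ms, or -1.0 on any error.
 */
static double worst_commit_ms(const char *path, int commits)
{
    static const char record[4096] = {0}; /* assumed record size */
    double worst = 0.0;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);

    if (fd < 0)
        return -1.0;
    for (int i = 0; i < commits; i++) {
        double ms = append_and_fsync_ms(fd, record, sizeof(record));
        if (ms < 0.0) {
            close(fd);
            return -1.0;
        }
        if (ms > worst)
            worst = ms;
    }
    close(fd);
    return worst;
}
```

Calling worst_commit_ms() on a file that lives on the device in question gives a single number to compare against the ~50ms figure mentioned above; the path and the number of commits are up to the caller.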

2014-06-04 22:23:05

by Dave Jones

Subject: Re: Bad rss-counter is back on 3.14-stable

On Wed, Jun 04, 2014 at 12:35:45PM -0700, Brandon Philips wrote:
> On Wed, Jun 4, 2014 at 12:12 PM, Dave Jones <[email protected]> wrote:
> > Brandon, what kind of workload is that machine doing? I wonder if I can
> > add something to trinity to make it provoke it.
>
> A really boring database workload (fsync() ~50ms) with a sloowww block
> device with btrfs. There are occasional CPU spikes due to expensive
> queries.
>
> How can I be more helpful in my workload description?

I feared it would be something like a database. Trying to replicate
things seen under those workloads always seems to be challenging,
in part due to the system-specific setups they seem to have.

I wonder if any of the benchmarking apps we have do a realistic
representation of what modern databases do. It might be a fun project
to take something like that and extend it to do random queries.

Dave

2014-06-04 23:46:49

by Andre Tomt

Subject: Re: Bad rss-counter is back on 3.14-stable

On 04. juni 2014 20:27, Greg KH wrote:
> Hi all,
>
> Dave, I saw you mention that you were seeing the "Bad rss-counter" line
> on 3.15-rc1, but I couldn't find any follow-up on this to see if anyone
> figured it out, or did it just "magically" go away?
>
> I ask as Brandon is seeing this same message a lot on a 3.14.4 kernel,
> causing system crashes and problems:
>
> [16591492.449718] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:0 val:-1836508
> [16591492.449737] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:1 val:1836508
>
> [20783350.461716] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:0 val:-52518
> [20783350.461734] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:1 val:52518
>
> [21393387.112302] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:0 val:-1767569
> [21393387.112321] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:1 val:1767569
>
> [21430098.512837] BUG: Bad rss-counter state mm:ffff880100036680 idx:0 val:-2946
> [21430098.512854] BUG: Bad rss-counter state mm:ffff880100036680 idx:1 val:2946
>
> Anyone have any ideas of a 3.15-rc patch I should be including in
> 3.14-stable to resolve this?

I saw a bunch of similar errors on 3.14.x up to and including 3.14.4,
running Java (Tomcat) and Postgres on Xen PV. Have not seen it since
"mm: use paravirt friendly ops for NUMA hinting ptes" landed in 3.14.5.

402e194dfc5b38d99f9c65b86e2666b29adebf8c in stable,
29c7787075c92ca8af353acd5301481e6f37082f upstream

As I did not follow the original discussion I have no idea if this is
the same thing, and I'm way too lazy to look for it now. ;-)

2014-06-05 00:18:05

by Greg KH

Subject: Re: Bad rss-counter is back on 3.14-stable

On Thu, Jun 05, 2014 at 01:37:42AM +0200, Andre Tomt wrote:
> On 04. juni 2014 20:27, Greg KH wrote:
> > Hi all,
> >
> > Dave, I saw you mention that you were seeing the "Bad rss-counter" line
> > on 3.15-rc1, but I couldn't find any follow-up on this to see if anyone
> > figured it out, or did it just "magically" go away?
> >
> > I ask as Brandon is seeing this same message a lot on a 3.14.4 kernel,
> > causing system crashes and problems:
> >
> > [16591492.449718] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:0 val:-1836508
> > [16591492.449737] BUG: Bad rss-counter state mm:ffff8801ced99880 idx:1 val:1836508
> >
> > [20783350.461716] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:0 val:-52518
> > [20783350.461734] BUG: Bad rss-counter state mm:ffff8801d2b1dc00 idx:1 val:52518
> >
> > [21393387.112302] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:0 val:-1767569
> > [21393387.112321] BUG: Bad rss-counter state mm:ffff8801d0104e00 idx:1 val:1767569
> >
> > [21430098.512837] BUG: Bad rss-counter state mm:ffff880100036680 idx:0 val:-2946
> > [21430098.512854] BUG: Bad rss-counter state mm:ffff880100036680 idx:1 val:2946
> >
> > Anyone have any ideas of a 3.15-rc patch I should be including in
> > 3.14-stable to resolve this?
>
> I saw a bunch of similar errors on 3.14.x up to and including 3.14.4,
> running Java (Tomcat) and Postgres on Xen PV. Have not seen it since
> "mm: use paravirt friendly ops for NUMA hinting ptes" landed in 3.14.5.
>
> 402e194dfc5b38d99f9c65b86e2666b29adebf8c in stable,
> 29c7787075c92ca8af353acd5301481e6f37082f upstream
>
> As I did not follow the original discussion I have no idea if this is
> the same thing, and I'm way too lazy to look for it now. ;-)

Ah, nice find.

Brandon, I think 3.14.5 is in the CoreOS tree, can you update to that on
these boxes to see if it solves the issue?

thanks,

greg k-h