2006-08-01 18:44:54

by Dave Jones

[permalink] [raw]
Subject: single bit flip detector.

In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <[email protected]>

--- linux-2.6.16.noarch/mm/slab.c~ 2006-03-22 18:29:27.000000000 -0500
+++ linux-2.6.16.noarch/mm/slab.c 2006-03-22 18:30:58.000000000 -0500
@@ -1516,10 +1516,33 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total=0, bad_count=0;
printk(KERN_ERR "%03x:", offset);
- for (i = 0; i < limit; i++)
+ for (i = 0; i < limit; i++) {
+ if (data[offset+i] != POISON_FREE) {
+ total += data[offset+i];
+ ++bad_count;
+ }
printk(" %02x", (unsigned char)data[offset + i]);
+ }
printk("\n");
+ if (bad_count == 1) {
+ switch (total) {
+ case POISON_FREE ^ 0x01:
+ case POISON_FREE ^ 0x02:
+ case POISON_FREE ^ 0x04:
+ case POISON_FREE ^ 0x08:
+ case POISON_FREE ^ 0x10:
+ case POISON_FREE ^ 0x20:
+ case POISON_FREE ^ 0x40:
+ case POISON_FREE ^ 0x80:
+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM.\n");
+#ifdef CONFIG_X86
+ printk (KERN_ERR "Run memtest86 or other memory test tool.\n");
+#endif
+ return;
+ }
+ }
}
#endif

--
http://www.codemonkey.org.uk


2006-08-01 20:12:17

by Bill Davidsen

[permalink] [raw]
Subject: Re: single bit flip detector.

eventuallyDave Jones wrote:
> In case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug. In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages
> in those cases, in the hope that users will try memtest before
> they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Run memtest86.

Given the probability of hardware vs. kernel, you could replace
"possible" with "probable" and not get any argument from me.

--
Bill Davidsen <[email protected]>
Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.

2006-08-01 21:55:17

by Alan

[permalink] [raw]
Subject: Re: single bit flip detector.

Ar Maw, 2006-08-01 am 14:44 -0400, ysgrifennodd Dave Jones:
> + case POISON_FREE ^ 0x01:
> + case POISON_FREE ^ 0x02:
> + case POISON_FREE ^ 0x04:
> + case POISON_FREE ^ 0x08:
> + case POISON_FREE ^ 0x10:
> + case POISON_FREE ^ 0x20:
> + case POISON_FREE ^ 0x40:
> + case POISON_FREE ^ 0x80:
> + printk (KERN_ERR "Single bit error detected. Possibly bad RAM.\n");
> +#ifdef CONFIG_X86
> + printk (KERN_ERR "Run memtest86 or other memory test tool.\n");
> +#endif
> + return;

Gack .. NAK

#1: Do we want memtest86 or memtest86+ ?
#2: The check is horrible and there is an elegant implementation for
single bit.

errors = value ^ expected;
if (errors && !(errors & (errors - 1)))
printk(KERN_ERR "Single bit error detected....");


Alan

2006-08-01 22:30:23

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Tue, Aug 01, 2006 at 11:14:27PM +0100, Alan Cox wrote:
> Ar Maw, 2006-08-01 am 14:44 -0400, ysgrifennodd Dave Jones:
> > + case POISON_FREE ^ 0x01:
> > + case POISON_FREE ^ 0x02:
> > + case POISON_FREE ^ 0x04:
> > + case POISON_FREE ^ 0x08:
> > + case POISON_FREE ^ 0x10:
> > + case POISON_FREE ^ 0x20:
> > + case POISON_FREE ^ 0x40:
> > + case POISON_FREE ^ 0x80:
> > + printk (KERN_ERR "Single bit error detected. Possibly bad RAM.\n");
> > +#ifdef CONFIG_X86
> > + printk (KERN_ERR "Run memtest86 or other memory test tool.\n");
> > +#endif
> > + return;
>
> Gack .. NAK
>
> #1: Do we want memtest86 or memtest86+ ?

I doubt it really matters.

> #2: The check is horrible and there is an elegant implementation for
> single bit.
>
> errors = value ^ expected;
> if (errors && !(errors & (errors - 1)))
> printk(KERN_ERR "Single bit error detected....");

Good call, I'll hack that up.

Dave

--
http://www.codemonkey.org.uk

2006-08-01 22:36:31

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Tue, Aug 01, 2006 at 06:30:11PM -0400, Dave Jones wrote:
> On Tue, Aug 01, 2006 at 11:14:27PM +0100, Alan Cox wrote:
> > Ar Maw, 2006-08-01 am 14:44 -0400, ysgrifennodd Dave Jones:
> > > + case POISON_FREE ^ 0x01:
> > > + case POISON_FREE ^ 0x02:
> > > + case POISON_FREE ^ 0x04:
> > > + case POISON_FREE ^ 0x08:
> > > + case POISON_FREE ^ 0x10:
> > > + case POISON_FREE ^ 0x20:
> > > + case POISON_FREE ^ 0x40:
> > > + case POISON_FREE ^ 0x80:
> > > + printk (KERN_ERR "Single bit error detected. Possibly bad RAM.\n");
> > > +#ifdef CONFIG_X86
> > > + printk (KERN_ERR "Run memtest86 or other memory test tool.\n");
> > > +#endif
> > > + return;
> >
> > Gack .. NAK
> >
> > #1: Do we want memtest86 or memtest86+ ?
>
> I doubt it really matters.
>
> > #2: The check is horrible and there is an elegant implementation for
> > single bit.
> >
> > errors = value ^ expected;
> > if (errors && !(errors & (errors - 1)))
> > printk(KERN_ERR "Single bit error detected....");
>
> Good call, I'll hack that up.

Take #2.

In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <[email protected]>


diff --git a/mm/slab.c b/mm/slab.c
index 21ba060..39f1183 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1638,10 +1638,29 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total=0, bad_count=0;
printk(KERN_ERR "%03x:", offset);
- for (i = 0; i < limit; i++)
+ for (i = 0; i < limit; i++) {
+ if (data[offset+i] != POISON_FREE) {
+ total += data[offset+i];
+ ++bad_count;
+ }
printk(" %02x", (unsigned char)data[offset + i]);
+ }
printk("\n");
+
+ if (bad_count == 1) {
+ errors = total ^ POISON_FREE;
+ if ((errors && !(errors & (errors-1))) {
+ printk (KERN_ERR "Single bit error detected. Probably bad RAM.\n");
+#ifdef CONFIG_X86
+ printk (KERN_ERR "Run memtest86+ or similar memory test tool.\n");
+#else
+ printk (KERN_ERR "Run a memory test tool.\n");
+#endif
+ return;
+ }
+ }
}
#endif

--
http://www.codemonkey.org.uk

2006-08-01 23:00:11

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: single bit flip detector.

On Tue, Aug 01, 2006 at 06:36:22PM -0400, Dave Jones wrote:
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1638,10 +1638,29 @@ static void poison_obj(struct kmem_cache
> static void dump_line(char *data, int offset, int limit)
> {
> int i;
> + unsigned char total=0, bad_count=0;

" = 0"

> printk(KERN_ERR "%03x:", offset);
> - for (i = 0; i < limit; i++)
> + for (i = 0; i < limit; i++) {
> + if (data[offset+i] != POISON_FREE) {

" + i"

> + total += data[offset+i];

ditto

> + ++bad_count;

How about post increment?

> + }
> printk(" %02x", (unsigned char)data[offset + i]);
> + }
> printk("\n");
> +
> + if (bad_count == 1) {
> + errors = total ^ POISON_FREE;

undeclared "errors"

> + if ((errors && !(errors & (errors-1))) {

^^^

Turn on CONFIG_DEBUG_SLAB before compiling. ;-)

2006-08-01 23:16:14

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Wed, Aug 02, 2006 at 03:00:03AM +0400, Alexey Dobriyan wrote:

> Turn on CONFIG_DEBUG_SLAB before compiling. ;-)

Well, that was silly. Here's a properly compile tested patch :-)

Dave


In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <[email protected]>

diff --git a/mm/slab.c b/mm/slab.c
index 21ba060..39f1183 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1638,10 +1638,29 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total = 0, bad_count = 0, errors = 0;
printk(KERN_ERR "%03x:", offset);
- for (i = 0; i < limit; i++)
+ for (i = 0; i < limit; i++) {
+ if (data[offset + i] != POISON_FREE) {
+ total += data[offset + i];
+ bad_count++;
+ }
printk(" %02x", (unsigned char)data[offset + i]);
+ }
printk("\n");
+
+ if (bad_count == 1) {
+ errors = total ^ POISON_FREE;
+ if (errors && !(errors & (errors-1))) {
+ printk (KERN_ERR "Single bit error detected. Probably bad RAM.\n");
+#ifdef CONFIG_X86
+ printk (KERN_ERR "Run memtest86+ or similar memory test tool.\n");
+#else
+ printk (KERN_ERR "Run a memory test tool.\n");
+#endif
+ return;
+ }
+ }
}
#endif


--
http://www.codemonkey.org.uk

2006-08-01 23:27:29

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Tue, Aug 01, 2006 at 07:16:03PM -0400, Dave Jones wrote:
> On Wed, Aug 02, 2006 at 03:00:03AM +0400, Alexey Dobriyan wrote:
>
> > Turn on CONFIG_DEBUG_SLAB before compiling. ;-)
>
> Well, that was silly. Here's a properly compile tested patch :-)

The grammar police found me, here's hopefully the final rendition..

Dave


In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Probably bad RAM.

Signed-off-by: Dave Jones <[email protected]>

diff --git a/mm/slab.c b/mm/slab.c
index 21ba060..39f1183 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1638,10 +1638,29 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total = 0, bad_count = 0, errors = 0;
printk(KERN_ERR "%03x:", offset);
- for (i = 0; i < limit; i++)
+ for (i = 0; i < limit; i++) {
+ if (data[offset + i] != POISON_FREE) {
+ total += data[offset + i];
+ bad_count++;
+ }
printk(" %02x", (unsigned char)data[offset + i]);
+ }
printk("\n");
+
+ if (bad_count == 1) {
+ errors = total ^ POISON_FREE;
+ if (errors && !(errors & (errors-1))) {
+ printk (KERN_ERR "Single bit error detected. Probably bad RAM.\n");
+#ifdef CONFIG_X86
+ printk (KERN_ERR "Run memtest86+ or a similar memory test tool.\n");
+#else
+ printk (KERN_ERR "Run a memory test tool.\n");
+#endif
+ return;
+ }
+ }
}
#endif


--
http://www.codemonkey.org.uk

2006-08-01 23:28:52

by Andreas Schwab

[permalink] [raw]
Subject: Re: single bit flip detector.

Dave Jones <[email protected]> writes:

> diff --git a/mm/slab.c b/mm/slab.c
> index 21ba060..39f1183 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1638,10 +1638,29 @@ static void poison_obj(struct kmem_cache
> static void dump_line(char *data, int offset, int limit)
> {
> int i;
> + unsigned char total = 0, bad_count = 0, errors = 0;

No need to initialize errors here.

> printk(KERN_ERR "%03x:", offset);
> - for (i = 0; i < limit; i++)
> + for (i = 0; i < limit; i++) {
> + if (data[offset + i] != POISON_FREE) {
> + total += data[offset + i];
> + bad_count++;
> + }
> printk(" %02x", (unsigned char)data[offset + i]);
> + }
> printk("\n");
> +
> + if (bad_count == 1) {
> + errors = total ^ POISON_FREE;
> + if (errors && !(errors & (errors-1))) {
> + printk (KERN_ERR "Single bit error detected. Probably bad RAM.\n");
> +#ifdef CONFIG_X86
> + printk (KERN_ERR "Run memtest86+ or similar memory test tool.\n");
> +#else
> + printk (KERN_ERR "Run a memory test tool.\n");
> +#endif
> + return;

Useless return.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2006-08-01 23:51:22

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Wed, Aug 02, 2006 at 01:28:49AM +0200, Andreas Schwab wrote:
> Dave Jones <[email protected]> writes:
>
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 21ba060..39f1183 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -1638,10 +1638,29 @@ static void poison_obj(struct kmem_cache
> > static void dump_line(char *data, int offset, int limit)
> > {
> > int i;
> > + unsigned char total = 0, bad_count = 0, errors = 0;
>
> No need to initialize errors here.

I'm going for the record of 'most times a patch gets submitted in one day'.
And to think we were complaining that patches don't get enough review ? :)
If every change had this much polish, we'd be awesome.

Dave


In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <[email protected]>

diff --git a/mm/slab.c b/mm/slab.c
index 21ba060..39f1183 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1638,10 +1638,28 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total = 0, bad_count = 0, errors;
printk(KERN_ERR "%03x:", offset);
- for (i = 0; i < limit; i++)
+ for (i = 0; i < limit; i++) {
+ if (data[offset + i] != POISON_FREE) {
+ total += data[offset + i];
+ bad_count++;
+ }
printk(" %02x", (unsigned char)data[offset + i]);
+ }
printk("\n");
+
+ if (bad_count == 1) {
+ errors = total ^ POISON_FREE;
+ if (errors && !(errors & (errors-1))) {
+ printk (KERN_ERR "Single bit error detected. Probably bad RAM.\n");
+#ifdef CONFIG_X86
+ printk (KERN_ERR "Run memtest86+ or a similar memory test tool.\n");
+#else
+ printk (KERN_ERR "Run a memory test tool.\n");
+#endif
+ }
+ }
}
#endif


--
http://www.codemonkey.org.uk

2006-08-02 00:16:41

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Tue, Aug 01, 2006 at 07:51:09PM -0400, Dave Jones wrote:
> I'm going for the record of 'most times a patch gets submitted in one day'.
> And to think we were complaining that patches don't get enough review ? :)
> If every change had this much polish, we'd be awesome.

Sigh. Spaces before printk. Whatever next.
I am now officially bored of seeing this patch.

Dave


In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <[email protected]>

diff --git a/mm/slab.c b/mm/slab.c
index 21ba060..39f1183 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1638,10 +1638,28 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total = 0, bad_count = 0, errors;
printk(KERN_ERR "%03x:", offset);
- for (i = 0; i < limit; i++)
+ for (i = 0; i < limit; i++) {
+ if (data[offset + i] != POISON_FREE) {
+ total += data[offset + i];
+ bad_count++;
+ }
printk(" %02x", (unsigned char)data[offset + i]);
+ }
printk("\n");
+
+ if (bad_count == 1) {
+ errors = total ^ POISON_FREE;
+ if (errors && !(errors & (errors-1))) {
+ printk(KERN_ERR "Single bit error detected. Probably bad RAM.\n");
+#ifdef CONFIG_X86
+ printk(KERN_ERR "Run memtest86+ or a similar memory test tool.\n");
+#else
+ printk(KERN_ERR "Run a memory test tool.\n");
+#endif
+ }
+ }
}
#endif


--
http://www.codemonkey.org.uk

2006-08-02 06:17:55

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: single bit flip detector.

Dave Jones wrote:
> On Tue, Aug 01, 2006 at 07:51:09PM -0400, Dave Jones wrote:
> > I'm going for the record of 'most times a patch gets submitted in one
> > day'. And to think we were complaining that patches don't get enough
> > review ? :) If every change had this much polish, we'd be awesome.
>
> Sigh. Spaces before printk. Whatever next.
> I am now officially bored of seeing this patch.

> diff --git a/mm/slab.c b/mm/slab.c
> index 21ba060..39f1183 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1638,10 +1638,28 @@ static void poison_obj(struct kmem_cache
> static void dump_line(char *data, int offset, int limit)
> {
> int i;
> + unsigned char total = 0, bad_count = 0, errors;
> printk(KERN_ERR "%03x:", offset);
> - for (i = 0; i < limit; i++)
> + for (i = 0; i < limit; i++) {

You might want to add a newline before the actual code to make it more
readable :)

Eike


Attachments:
(No filename) (894.00 B)
(No filename) (189.00 B)
Download all attachments

2006-08-02 07:10:04

by Jan Engelhardt

[permalink] [raw]
Subject: Re: single bit flip detector.

>
>> + ++bad_count;
>
>How about post increment?
>

There would no difference, in effect.
Same goes for the return; mentioned elsewhere in the thread.


Jan Engelhardt
--

2006-08-02 07:11:31

by Jan Engelhardt

[permalink] [raw]
Subject: Re: single bit flip detector.


> printk(" %02x", (unsigned char)data[offset + i]);

Remove cast. (Or does it spew a warning message for you?)


Jan Engelhardt
--

2006-08-02 15:24:15

by Patrick McLean

[permalink] [raw]
Subject: Re: single bit flip detector.

Dave Jones wrote:
> On Tue, Aug 01, 2006 at 07:51:09PM -0400, Dave Jones wrote:
> > I'm going for the record of 'most times a patch gets submitted in one day'.
> > And to think we were complaining that patches don't get enough review ? :)
> > If every change had this much polish, we'd be awesome.
>
> Sigh. Spaces before printk. Whatever next.
> I am now officially bored of seeing this patch.
>
> Dave
>
>
> In case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug. In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages
> in those cases, in the hope that users will try memtest before
> they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Run memtest86.
>
> Signed-off-by: Dave Jones <[email protected]>
>
> + if (errors && !(errors & (errors-1))) {
> + printk(KERN_ERR "Single bit error detected. Probably bad RAM.\n");
> +#ifdef CONFIG_X86

#if defined(CONFIG_X86) || defined(CONFIG_X86_64)

memtest86+ runs fine on x86_64 machines as well.

> + printk(KERN_ERR "Run memtest86+ or a similar memory test tool.\n");
> +#else
> + printk(KERN_ERR "Run a memory test tool.\n");

2006-08-02 16:09:32

by Randy Dunlap

[permalink] [raw]
Subject: Re: single bit flip detector.

On Wed, 02 Aug 2006 11:24:13 -0400 Patrick McLean wrote:

> Dave Jones wrote:
> > On Tue, Aug 01, 2006 at 07:51:09PM -0400, Dave Jones wrote:
> > > I'm going for the record of 'most times a patch gets submitted in one day'.
> > > And to think we were complaining that patches don't get enough review ? :)
> > > If every change had this much polish, we'd be awesome.
> >
> > Sigh. Spaces before printk. Whatever next.
> > I am now officially bored of seeing this patch.
> >
> > Dave
> >
> >
> > In case where we detect a single bit has been flipped, we spew
> > the usual slab corruption message, which users instantly think
> > is a kernel bug. In a lot of cases, single bit errors are
> > down to bad memory, or other hardware failure.
> >
> > This patch adds an extra line to the slab debug messages
> > in those cases, in the hope that users will try memtest before
> > they report a bug.
> >
> > 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> > Single bit error detected. Possibly bad RAM. Run memtest86.
> >
> > Signed-off-by: Dave Jones <[email protected]>
> >
> > + if (errors && !(errors & (errors-1))) {
> > + printk(KERN_ERR "Single bit error detected. Probably bad RAM.\n");
> > +#ifdef CONFIG_X86
>
> #if defined(CONFIG_X86) || defined(CONFIG_X86_64)

CONFIG_X86 is set on both x86-32 and x86-64.


> memtest86+ runs fine on x86_64 machines as well.
>
> > + printk(KERN_ERR "Run memtest86+ or a similar memory test tool.\n");
> > +#else
> > + printk(KERN_ERR "Run a memory test tool.\n");
> -


---
~Randy

2006-08-04 21:20:15

by Andrew Morton

[permalink] [raw]
Subject: Re: single bit flip detector.

On Tue, 1 Aug 2006 20:16:26 -0400
Dave Jones <[email protected]> wrote:

> In case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug. In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages
> in those cases, in the hope that users will try memtest before
> they report a bug.

Well boy, this has to be the most-reviewed patch ever. You'd think that
I'd apply it with great confidence and warm fuzzies.

However...


From: Andrew Morton <[email protected]>

- one decl per line is more patching-friendly and a bit more idiomatic.

- make `bad_count' an int: a uchar might overflow

- Put a blank line between decls and code

- rename `total' to `error', remove `errors'.

- there's no need to sum up the errors.

- don't need to check for non-zero `errors': we know it is != POISON_FREE.

- make it look non-crapful in an 80-col window.

- add missing spaces in arithmetic

Cc: Dave Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---


diff -puN mm/slab.c~single-bit-flip-detector-tidy mm/slab.c
--- a/mm/slab.c~single-bit-flip-detector-tidy
+++ a/mm/slab.c
@@ -1637,11 +1637,13 @@ static void poison_obj(struct kmem_cache
static void dump_line(char *data, int offset, int limit)
{
int i;
- unsigned char total = 0, bad_count = 0, errors;
+ unsigned char error = 0;
+ int bad_count = 0;
+
printk(KERN_ERR "%03x:", offset);
for (i = 0; i < limit; i++) {
if (data[offset + i] != POISON_FREE) {
- total += data[offset + i];
+ error = data[offset + i];
bad_count++;
}
printk(" %02x", (unsigned char)data[offset + i]);
@@ -1649,11 +1651,13 @@ static void dump_line(char *data, int of
printk("\n");

if (bad_count == 1) {
- errors = total ^ POISON_FREE;
- if (errors && !(errors & (errors-1))) {
- printk(KERN_ERR "Single bit error detected. Probably bad RAM.\n");
+ error ^= POISON_FREE;
+ if (!(error & (error - 1))) {
+ printk(KERN_ERR "Single bit error detected. Probably "
+ "bad RAM.\n");
#ifdef CONFIG_X86
- printk(KERN_ERR "Run memtest86+ or a similar memory test tool.\n");
+ printk(KERN_ERR "Run memtest86+ or a similar memory "
+ "test tool.\n");
#else
printk(KERN_ERR "Run a memory test tool.\n");
#endif
_

2006-08-04 22:06:54

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Fri, Aug 04, 2006 at 02:19:55PM -0700, Andrew Morton wrote:
> On Tue, 1 Aug 2006 20:16:26 -0400
> Dave Jones <[email protected]> wrote:
>
> > In case where we detect a single bit has been flipped, we spew
> > the usual slab corruption message, which users instantly think
> > is a kernel bug. In a lot of cases, single bit errors are
> > down to bad memory, or other hardware failure.
> >
> > This patch adds an extra line to the slab debug messages
> > in those cases, in the hope that users will try memtest before
> > they report a bug.
>
> Well boy, this has to be the most-reviewed patch ever. You'd think that
> I'd apply it with great confidence and warm fuzzies.

I should stick to one-liner fixes.

> - one decl per line is more patching-friendly and a bit more idiomatic.
> - make `bad_count' an int: a uchar might overflow
> - Put a blank line between decls and code
> - rename `total' to `error', remove `errors'.
> - there's no need to sum up the errors.
> - don't need to check for non-zero `errors': we know it is != POISON_FREE.
> - make it look non-crapful in an 80-col window.
> - add missing spaces in arithmetic

With this much iteration, I do have one question..
Does it still work ? :-)

Dave

--
http://www.codemonkey.org.uk

2006-08-04 22:25:22

by Andrew Morton

[permalink] [raw]
Subject: Re: single bit flip detector.

On Fri, 4 Aug 2006 18:06:44 -0400
Dave Jones <[email protected]> wrote:

> Does it still work ? :-)

Awww man. And I thought everyone _else_ was being fussy.

I expect it'll work. We'll find out when someone hits a single-bit error.
The worst it can do is crash the machine, except it's already crashed...

2006-08-04 22:28:43

by Dave Jones

[permalink] [raw]
Subject: Re: single bit flip detector.

On Fri, Aug 04, 2006 at 03:25:07PM -0700, Andrew Morton wrote:
> On Fri, 4 Aug 2006 18:06:44 -0400
> Dave Jones <[email protected]> wrote:
>
> > Does it still work ? :-)
>
> Awww man. And I thought everyone _else_ was being fussy.
>
> I expect it'll work. We'll find out when someone hits a single-bit error.
> The worst it can do is crash the machine, except it's already crashed...

Ok, if it turns out to be broken, and my version worked, you owe
me a beer at the next conference :-)

Apathetically-Signed-off-by: Dave Jones <[email protected]>

Dave

--
http://www.codemonkey.org.uk

2006-08-06 11:06:06

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: single bit flip detector.

On Wed, 2 Aug 2006, Jan Engelhardt wrote:
> > printk(" %02x", (unsigned char)data[offset + i]);
>
> Remove cast. (Or does it spew a warning message for you?)

No warning, but you still want the cast...

On PPC and ARM that will work fine, since char is unsigned.

But on most other platforms char is signed, and contrary to popular belief,
`%02x' doesn't mean `limit this field to 2 characters', so it would print e.g.
ffffffff instead of ff for -1.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds