2006-02-02 19:24:16

by Dave Jones

[permalink] [raw]
Subject: discriminate single bit error hardware failure from slab corruption.

In the case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages in those
cases, in the hope that users will try memtest before they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Please run memtest86.

Signed-off-by: Dave Jones <[email protected]>

--- linux-2.6.15/mm/slab.c~ 2006-01-09 13:25:17.000000000 -0500
+++ linux-2.6.15/mm/slab.c 2006-01-09 13:26:01.000000000 -0500
@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total=0;
printk(KERN_ERR "%03x:", offset);
for (i = 0; i < limit; i++) {
+ if (data[offset+i] != POISON_FREE)
+ total += data[offset+i];
printk(" %02x", (unsigned char)data[offset + i]);
}
printk("\n");
@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
}
}
printk("\n");
+ switch (total) {
+ case 0x36:
+ case 0x6a:
+ case 0x6f:
+ case 0x81:
+ case 0xac:
+ case 0xd3:
+ case 0xd5:
+ case 0xea:
+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
+ return;
+ }
}
#endif


2006-02-02 19:28:07

by Randy Dunlap

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On Thu, 2 Feb 2006, Dave Jones wrote:

> In the case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug. In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages in those
> cases, in the hope that users will try memtest before they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Please run memtest86.

does memtest86 run on all $ARCHes ?
or this is good for <large percentage>, so it's Good. :)
Just checking; it is a good idea.

> Signed-off-by: Dave Jones <[email protected]>
>
> --- linux-2.6.15/mm/slab.c~ 2006-01-09 13:25:17.000000000 -0500
> +++ linux-2.6.15/mm/slab.c 2006-01-09 13:26:01.000000000 -0500
> @@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
> static void dump_line(char *data, int offset, int limit)
> {
> int i;
> + unsigned char total=0;
> printk(KERN_ERR "%03x:", offset);
> for (i = 0; i < limit; i++) {
> + if (data[offset+i] != POISON_FREE)
> + total += data[offset+i];
> printk(" %02x", (unsigned char)data[offset + i]);
> }
> printk("\n");
> @@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> }
> }
> printk("\n");
> + switch (total) {
> + case 0x36:
> + case 0x6a:
> + case 0x6f:
> + case 0x81:
> + case 0xac:
> + case 0xd3:
> + case 0xd5:
> + case 0xea:
> + printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
> + return;
> + }
> }
> #endif
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
~Randy

2006-02-02 19:38:12

by Jesper Juhl

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On 2/2/06, Dave Jones <[email protected]> wrote:
> In the case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug. In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages in those
> cases, in the hope that users will try memtest before they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Please run memtest86.
>
May I suggest that the text be
Single bit error detected. Possibly bad RAM. Please run memtest86
and/or memtest86+.

both programs are good memory testers, but they are different and
sometimes one finds problems not detected by the other.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-02-02 19:53:55

by Pekka Enberg

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On 2/2/06, Dave Jones <[email protected]> wrote:
> In the case where we detect a single bit has been flipped, we spew
> the usual slab corruption message, which users instantly think
> is a kernel bug. In a lot of cases, single bit errors are
> down to bad memory, or other hardware failure.
>
> This patch adds an extra line to the slab debug messages in those
> cases, in the hope that users will try memtest before they report a bug.
>
> 000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Single bit error detected. Possibly bad RAM. Please run memtest86.
>
> Signed-off-by: Dave Jones <[email protected]>

Looks good to me.

Pekka

2006-02-03 00:44:57

by Avi Kivity

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

Dave Jones wrote:

>In the case where we detect a single bit has been flipped, we spew
>the usual slab corruption message, which users instantly think
>is a kernel bug. In a lot of cases, single bit errors are
>down to bad memory, or other hardware failure.
>
>This patch adds an extra line to the slab debug messages in those
>cases, in the hope that users will try memtest before they report a bug.
>
>000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>Single bit error detected. Possibly bad RAM. Please run memtest86.
>
>Signed-off-by: Dave Jones <[email protected]>
>
>--- linux-2.6.15/mm/slab.c~ 2006-01-09 13:25:17.000000000 -0500
>+++ linux-2.6.15/mm/slab.c 2006-01-09 13:26:01.000000000 -0500
>@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
> static void dump_line(char *data, int offset, int limit)
> {
> int i;
>+ unsigned char total=0;
> printk(KERN_ERR "%03x:", offset);
> for (i = 0; i < limit; i++) {
>+ if (data[offset+i] != POISON_FREE)
>+ total += data[offset+i];
>
>
how about

total += hweight8(data[offset+i] ^ POISON_FREE);

> printk(" %02x", (unsigned char)data[offset + i]);
> }
> printk("\n");
>@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> }
> }
> printk("\n");
>+ switch (total) {
>+ case 0x36:
>+ case 0x6a:
>+ case 0x6f:
>+ case 0x81:
>+ case 0xac:
>+ case 0xd3:
>+ case 0xd5:
>+ case 0xea:
>+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM. Please run memtest86.\n");
>+ return;
>+ }
>
>
and a

if (total == 1)
printk(...);

here? it seems more readable and more correct as well.

> }
> #endif
>
>
>


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-02-03 01:47:07

by Dave Jones

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On Fri, Feb 03, 2006 at 02:44:52AM +0200, Avi Kivity wrote:

> total += hweight8(data[offset+i] ^ POISON_FREE);
>
> > printk(" %02x", (unsigned char)data[offset + i]);
> > }
> > printk("\n");
> >@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> > }
> > }
> > printk("\n");
> >+ switch (total) {
> >+ case 0x36:
> >+ case 0x6a:
> >+ case 0x6f:
> >+ case 0x81:
> >+ case 0xac:
> >+ case 0xd3:
> >+ case 0xd5:
> >+ case 0xea:
> >+ printk (KERN_ERR "Single bit error detected.
> >Possibly bad RAM. Please run memtest86.\n");
> >+ return;
> >+ }
> >
> >
> and a
>
> if (total == 1)
> printk(...);
>
> here? it seems more readable and more correct as well.

More readable ? Are you kidding ?
What I wrote is smack-you-in-the-face-obvious what it's doing.
With your variant, I have to sit down and think it through.

wrt correctness, what do you see wrong with my approach?

Dave

2006-02-03 02:05:30

by Avi Kivity

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

Dave Jones wrote:

>On Fri, Feb 03, 2006 at 02:44:52AM +0200, Avi Kivity wrote:
>
> > total += hweight8(data[offset+i] ^ POISON_FREE);
> >
> > > printk(" %02x", (unsigned char)data[offset + i]);
> > > }
> > > printk("\n");
> > >@@ -1019,6 +1023,18 @@ static void dump_line(char *data, int of
> > > }
> > > }
> > > printk("\n");
> > >+ switch (total) {
> > >+ case 0x36:
> > >+ case 0x6a:
> > >+ case 0x6f:
> > >+ case 0x81:
> > >+ case 0xac:
> > >+ case 0xd3:
> > >+ case 0xd5:
> > >+ case 0xea:
> > >+ printk (KERN_ERR "Single bit error detected.
> > >Possibly bad RAM. Please run memtest86.\n");
> > >+ return;
> > >+ }
> > >
> > >
> > and a
> >
> > if (total == 1)
> > printk(...);
> >
> > here? it seems more readable and more correct as well.
>
>More readable ? Are you kidding ?
>What I wrote is smack-you-in-the-face-obvious what it's doing.
>With your variant, I have to sit down and think it through.
>
>
Looks like we have mirror image brains :) - I had to scratch my scalp to
figure out where all the magic numbers in the switch came from.

Perhaps well named variables will help:

unsigned char modified_bits = data[offset+i] ^ POSION_FREE;
int modified_bits_count = hweight8(modified_bits);
total += modified_bits_count;

>wrt correctness, what do you see wrong with my approach?
>
>
Your code will generate a false positive 8 times in 256 runs, or 1 in
32. A 3% false positive rate seems excessive, It's also sensitive to
changes to POISON_FREE.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-02-03 04:20:46

by Dave Jones

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On Fri, Feb 03, 2006 at 04:05:23AM +0200, Avi Kivity wrote:

> unsigned char modified_bits = data[offset+i] ^ POSION_FREE;
> int modified_bits_count = hweight8(modified_bits);
> total += modified_bits_count;
>
> >wrt correctness, what do you see wrong with my approach?
> Your code will generate a false positive 8 times in 256 runs, or 1 in
> 32. A 3% false positive rate seems excessive, It's also sensitive to
> changes to POISON_FREE.

Hmm, I made a mistake in my maths somewhere, and some of those values
are incorrect, so having the compiler do the work would have stopped
me screwing up, but once the correct values are used, I doubt there's
ever a really compelling reason to change the slab poison pattern.

Dave

In case where we detect a single bit has been flipped, we spew
the usual slab corruption message, which users instantly think
is a kernel bug. In a lot of cases, single bit errors are
down to bad memory, or other hardware failure.

This patch adds an extra line to the slab debug messages
in those cases, in the hope that users will try memtest before
they report a bug.

000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Single bit error detected. Possibly bad RAM. Run memtest86.

Signed-off-by: Dave Jones <[email protected]>

--- linux-2.6.15/mm/slab.c~ 2006-01-09 13:25:17.000000000 -0500
+++ linux-2.6.15/mm/slab.c 2006-01-09 13:26:01.000000000 -0500
@@ -1313,8 +1313,11 @@ static void poison_obj(kmem_cache_t *cac
static void dump_line(char *data, int offset, int limit)
{
int i;
+ unsigned char total=0;
printk(KERN_ERR "%03x:", offset);
for (i = 0; i < limit; i++) {
+ if (data[offset+i] != POISON_FREE)
+ total += data[offset+i];
printk(" %02x", (unsigned char)data[offset + i]);
}
printk("\n");
@@ -1019,6 +1023,22 @@ static void dump_line(char *data, int of
}
}
printk("\n");
+ switch (total) {
+ /* 01101011 (0x6b - SLAB_POISON) */
+ case 0x6a: /* 01101010 bit 0 flipped */
+ case 0x69: /* 01101001 bit 1 flipped */
+ case 0x6f: /* 01101111 bit 2 flipped */
+ case 0x63: /* 01100011 bit 3 flipped */
+ case 0x7b: /* 01111011 bit 4 flipped */
+ case 0x4b: /* 01001011 bit 5 flipped */
+ case 0x2b: /* 00101011 bit 6 flipped */
+ case 0xeb: /* 11101011 bit 7 flipped */
+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n"
+#ifdef CONFIG_X86
+ printk (KERN_ERR "Run memtest86 or other memory test tool.\n");
+#endif
+ return;
+ }
}
#endif

2006-02-03 04:41:35

by Roland Dreier

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

Dave> Hmm, I made a mistake in my maths somewhere, and some of
Dave> those values are incorrect, so having the compiler do the
Dave> work would have stopped me screwing up, but once the correct
Dave> values are used, I doubt there's ever a really compelling
Dave> reason to change the slab poison pattern.

But Avi is still correct about false positives. For example, if
something stomps on the slab poison and leaves it as

e0 08 03 00

then that will add up to eb and still trigger your message, even
though it's far from a single bit error.

Maybe making the loop be something like

unsigned char total = 0, bad_count = 0;
printk(KERN_ERR "%03x:", offset);
for (i = 0; i < limit; i++) {
if (data[offset+i] != POISON_FREE) {
total += data[offset+i];
++bad_count;
}
printk(" %02x", (unsigned char)data[offset + i]);
}

and then you can put

if (bad_count == 1)

before the switch statement.

I have to admit that Avi's code seems clearer to me too, though.

- R.

2006-02-03 05:03:46

by Dave Jones

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On Thu, Feb 02, 2006 at 08:41:26PM -0800, Roland Dreier wrote:
> Dave> Hmm, I made a mistake in my maths somewhere, and some of
> Dave> those values are incorrect, so having the compiler do the
> Dave> work would have stopped me screwing up, but once the correct
> Dave> values are used, I doubt there's ever a really compelling
> Dave> reason to change the slab poison pattern.
>
> But Avi is still correct about false positives. For example, if
> something stomps on the slab poison and leaves it as
>
> e0 08 03 00
>
> then that will add up to eb and still trigger your message, even
> though it's far from a single bit error.

Ah, now I see the point Avi was making.

> Maybe making the loop be something like
>
> unsigned char total = 0, bad_count = 0;
> printk(KERN_ERR "%03x:", offset);
> for (i = 0; i < limit; i++) {
> if (data[offset+i] != POISON_FREE) {
> total += data[offset+i];
> ++bad_count;
> }
> printk(" %02x", (unsigned char)data[offset + i]);
> }
>
> and then you can put
>
> if (bad_count == 1)
>
> before the switch statement.
>
> I have to admit that Avi's code seems clearer to me too, though.

I'm easily persuaded either way really, as long as
we arrive at a desirable end-result ;)

Dave

2006-02-03 09:25:35

by George Spelvin

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

Um... case values are allowed to be expressions.

Isn't
+ switch (total) {
+ case SLAB_POISON ^ 0x01:
+ case SLAB_POISON ^ 0x02:
+ case SLAB_POISON ^ 0x04:
+ case SLAB_POISON ^ 0x08:
+ case SLAB_POISON ^ 0x10:
+ case SLAB_POISON ^ 0x20:
+ case SLAB_POISON ^ 0x40:
+ case SLAB_POISON ^ 0x80:
+ printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n"

Infinitely clearer, even without the comments? Or, if you want to
be cleverer:

total ^= SLAB_POISON;
if ((total & (total-1)) == 0) {
printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n"
}


If you wanted to get the bit-counting exactly accurate, you'd do:

unsigned char total = 0, total2 = 0;

for (i = 0; i < limit; i++) {
unsigned char delta = data[offset+i];
printk(" %02x", delta;
delta ^= POISON_FREE;
total2 |= total & delta;
total |= delta;
}
printk("\n");

/* If total2 has 0 bits set and total1 has at most 1 bit set... */
if (!total2 && !(total1 & (total1 - 1)) {
printk (KERN_ERR "Single bit error detected. Possibly bad RAM\n"

}

2006-02-03 11:05:17

by Olivier Galibert

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

On Thu, Feb 02, 2006 at 11:20:35PM -0500, Dave Jones wrote:
> + case 0x6a: /* 01101010 bit 0 flipped */
> + case 0x69: /* 01101001 bit 1 flipped */
> + case 0x6f: /* 01101111 bit 2 flipped */
> + case 0x63: /* 01100011 bit 3 flipped */
> + case 0x7b: /* 01111011 bit 4 flipped */
> + case 0x4b: /* 01001011 bit 5 flipped */
> + case 0x2b: /* 00101011 bit 6 flipped */
> + case 0xeb: /* 11101011 bit 7 flipped */

What about simply:
case 0x6b^0x01:
case 0x6b^0x02:
case 0x6b^0x04:
case 0x6b^0x08:
case 0x6b^0x10:
case 0x6b^0x20:
case 0x6b^0x40:
case 0x6b^0x80:

OG.

2006-02-03 14:09:47

by Jan Engelhardt

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

>
>000: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
>Single bit error detected. Possibly bad RAM. Please run memtest86.
>
>--- linux-2.6.15/mm/slab.c~ 2006-01-09 13:25:17.000000000 -0500
>+++ linux-2.6.15/mm/slab.c 2006-01-09 13:26:01.000000000 -0500

So, and what do non-x86 users use?


Jan Engelhardt
--

2006-02-03 14:12:24

by Jan Engelhardt

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

>
>I have to admit that Avi's code seems clearer to me too, though.
>
"me2". The second (longer) form of it, that is.

Jan Engelhardt
--

2006-02-03 14:14:50

by Jan Engelhardt

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.


>Um... case values are allowed to be expressions.

Whatever they are, they must be able to be reduced to an integer constant.


Jan Engelhardt
--

2006-02-06 20:19:39

by Pavel Machek

[permalink] [raw]
Subject: Re: discriminate single bit error hardware failure from slab corruption.

Hi!

> > here? it seems more readable and more correct as well.
>
> More readable ? Are you kidding ?

Well, his method really counts flipped bits. It looks very obvious
to me -- unlike magic numbers.

--
Thanks, Sharp!