2018-06-16 21:23:37

by Sergei Trofimovich

[permalink] [raw]
Subject: x86_64: movdqu rarely stores bad data (movdqu works fine). Kernel bug, fried CPU or glibc bug?

TL;DR: on master string/test-memmove glibc test fails on my machine
and I don't know why. Other tests work fine.

$ elf/ld.so --inhibit-cache --library-path . string/test-memmove
simple_memmove __memmove_ssse3_rep __memmove_ssse3 __memmove_sse2_unaligned __memmove_ia32
string/test-memmove: Wrong result in function __memmove_sse2_unaligned dst "0x70000084" src "0x70000000" offset "43297733"

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/test-memmove.c;h=64e3651ba40604e47ddf6d633f4d0aea4644f60a;hb=HEAD

Long story:

I've trimmed __memmove_sse2_unaligned implementation down to
test-memmove-xmm-unaligned.c (attached). It's supposed to show
failed memmove attempts when those happen:

$ gcc -ggdb3 -O2 -m32 test-memmove-xmm-unaligned.c -o test-memmove-xmm-unaligned -Wall && ./test-memmove-xmm-unaligned
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset= 3786689; expected=0039C7C1( 3786689) actual=0039C7C3( 3786691) bit_mismatch=00000002; iteration=1
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset= 3786689; expected=0039C7C1( 3786689) actual=0039C7C3( 3786691) bit_mismatch=00000002; iteration=3
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset= 5448641; expected=005323C1( 5448641) actual=005323C3( 5448643) bit_mismatch=00000002; iteration=5
Bad result in memmove(dst=0xe7d44110, src=0xe7d44010, len=134217728): offset=29022145; expected=01BAD7C1(29022145) actual=01BAD7C3(29022147) bit_mismatch=00000002; iteration=9

$ gcc -ggdb3 -O2 -m64 test-memmove-xmm-unaligned.c -o test-memmove-xmm-unaligned -Wall && ./test-memmove-xmm-unaligned
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=25257857; expected=01816781(25257857) actual=01816783(25257859) bit_mismatch=00000002; iteration=43
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=28109697; expected=01ACEB81(28109697) actual=01ACEB83(28109699) bit_mismatch=00000002; iteration=112
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=18257633; expected=011696E1(18257633) actual=011696E3(18257635) bit_mismatch=00000002; iteration=363
Bad result in memmove(dst=0x7fa4658bf110, src=0x7fa4658bf010, len=134217728): offset=26981249; expected=019BB381(26981249) actual=019BB383(26981251) bit_mismatch=00000002; iteration=437

Note it is a single-bit corruption happening occasionally (not on every iteration).
-m32 is way more error prone that -m64.

Test example roughly implements these 2 loops:
This fails:
sfence
loop {
movdqu [src++],%xmm0
movntdq %xmm0,[dst++]
}
sfence
This works:
sfence
loop {
movdqu [src++],%xmm0
movdqu %xmm0,[dst++]
}
sfence

Failures happen only on sandybridge CPU:
Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz
kernel is 4.17.0-11928-g2837461dbe6f.

Problem is not reproducible instantly after reboot. Machine has to be
heavily loaded to start corrupting memory. A few hours of memtest86+
does not reveal any memory failures.

I wonder if anyone else can reproduce this failure or should I start
looking for a new CPU.

From the above it looks like as if movntdq does not play well with XMM
context save/restore and there is an 'mfence' missing somewhere in
interrupt handling.

If there is no obvious problems with glibc's memove() or my small test
what can I do to rule-out/pin-down hardware or kernel problem?

Thanks!

--

Sergei


Attachments:
(No filename) (3.48 kB)
test-memmove-xmm-unaligned.c (4.27 kB)
(No filename) (201.00 B)
Цифровая подпись OpenPGP
Download all attachments

2018-06-17 21:40:33

by Sergei Trofimovich

[permalink] [raw]
Subject: Re: x86_64: movdqu rarely stores bad data (movdqu works fine). Kernel bug, fried CPU or glibc bug?

On Sat, 16 Jun 2018 22:22:50 +0100
Sergei Trofimovich <[email protected]> wrote:

> TL;DR: on master string/test-memmove glibc test fails on my machine
> and I don't know why. Other tests work fine.
> ...
> This fails:
> loop {
> movdqu [src++],%xmm0
> movntdq %xmm0,[dst++]
> }
> sfence
> This works:
> loop {
> movdqu [src++],%xmm0
> movdqu %xmm0,[dst++]
> }
> sfence
> ...
> If there is no obvious problems with glibc's memove() or my small test
> what can I do to rule-out/pin-down hardware or kernel problem?

Found the cause: bad RAM module.

After I've tweaked test to allocate most of available physical RAM
I've got fully reproducible failure.

I unplugged RAM modules one by one and ran the test. That way I've
nailed down to one bad chip. Removing single bad chip restored
string/test-memmove test on this machine \o/

Sorry for the noise!

--

Sergei


Attachments:
(No filename) (201.00 B)
Цифровая подпись OpenPGP