LinuxLists.cc - Regression: memory corruption on Atmel SAMA5D31

2022-03-03 01:00:28

Subject: Regression: memory corruption on Atmel SAMA5D31

Hi!

I'm seeing a weird problem, and I'd like some help with further
things to try in order to track down what's going on. I have
bisected the issue to

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

The symptoms are that I get (seemingly) random memory corruption
when processing large amounts of data (compared to system size).
I have two known reproducers, but I'm sure there are more if I
keep digging. One is to do this:

$ dd if=/dev/urandom of=testfile bs=1024 count=40000
40000+0 records in
40000+0 records out
40960000 bytes (41 MB, 39 MiB) copied, 19.7759 s, 2.1 MB/s
$ for i in 1 2 3 4; do cat testfile | sha256sum; done
d8c85f816e08baa5ad27050bf0413e11a09f325fb0a8843b7b2b45b9333ab542 -
f223c1cbb6dbecb02d1741e7991dc98cd8d5b40ffee05bb32dc2c15eb73d6b1f -
d6f3e7f3d325c67e83a6104934dd8a7c891ebfd9a2cf59633dbe97fb2cbb9c81 -
cf8ada47e7e2fee299314440b225ba83fca3cef1f6286adc160a5d4f207caccd -

It is harder to tickle the problem if I redirect the testfile to
sha256sum w/o involving cat or give the file as an argument to
sha256sum. I can also get things to behave better by getting rid
of a bunch of USB interrupts by doing the following:

$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB2/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB3/latency_timer

With the lower interrupt pressure I get this:

$ for i in 1 2 3 4; do cat testfile | sha256sum; done
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -

Nice. However, I need the latency to be lower than the default
16ms, 3ms could perhaps work in theory, but preferably 1ms, so
the above 100ms is far off. The initial hash run was with latency
set to 1ms, which makes it easy to trigger the issue. The latency
timer setting is for this driver: drivers/usb/serial/ftdi_sio.c

And also, that does not help with the other reproducer, namely
to copy that same random testfile with scp to a working system...

$ scp testfile peda@xyzzy:testfile1
testfile 100% 39MB 2.0MB/s 00:19
$ scp testfile peda@xyzzy:testfile2
testfile 100% 39MB 2.1MB/s 00:18
$ scp testfile peda@xyzzy:testfile3
testfile 100% 39MB 2.1MB/s 00:18
$ scp testfile peda@xyzzy:testfile4
testfile 100% 39MB 2.1MB/s 00:19

...and then perform the sha256sum on that xyzzy host instead:

$ sha256sum testfile?
39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4
$ sha256sum testfile?
39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4

Same output every time. Of course. xyzzy is a working system...
Converting these files to hex (hexdump -C) and diffing yields this:

$ diff -u0 testfile1.hex testfile2.hex
--- testfile1.hex 2022-03-02 23:56:38.273149516 +0100
+++ testfile2.hex 2022-03-03 00:00:57.912747033 +0100
@@ -8658,2 +8658,2 @@
-00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
-00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
+00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
+00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
@@ -20592,2 +20592,2 @@
-000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
-00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
+000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
+00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
@@ -461019,2 +461019,2 @@
-00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
-00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
+00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|
+00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
@@ -902788,2 +902788,2 @@
-00dc6830 f2 41 23 1b ec 54 d5 fe f0 33 51 f7 d2 fc bf bd |.A#..T...3Q.....|
-00dc6840 e5 1f 58 df 24 2f e3 dc 65 87 b2 27 12 86 d1 9a |..X.$/..e..'....|
+00dc6830 44 82 94 b5 c9 26 08 42 bd 89 e1 96 41 66 8a b5 |D....&.B....Af..|
+00dc6840 a5 34 46 5e fd 1b c1 73 86 33 24 fd 4d e1 e1 68 |.4F^...s.3$.M..h|
@@ -931900,2 +931900,2 @@
-00e383b0 ee 64 c5 6f 38 44 5b 31 41 e1 2c 64 49 d5 f8 ad |.d.o8D[1A.,dI...|
-00e383c0 fb 85 52 4f 00 1f 80 7a f3 de ee 8e db ac d5 bb |..RO...z........|
+00e383b0 4b 4d 29 a1 0a 99 8f f7 32 71 8c de 23 ca a0 f1 |KM).....2q..#...|
+00e383c0 e2 af e3 c4 a0 95 d3 1c ed 58 c4 c5 30 da 56 b9 |.........X..0.V.|
@@ -1170109,2 +1170109,2 @@
-011dabc0 6a 7c 0c 3c 86 1a b6 48 50 d7 98 68 0c 01 e3 1c |j|.<...HP..h....|
-011dabd0 a3 a8 b0 f2 62 21 86 b9 d1 52 9d 74 9e 26 42 51 |....b!...R.t.&BQ|
+011dabc0 5b 1a 9e 23 ae 58 42 68 83 58 df d6 c1 57 6b b0 |[..#.XBh.X...Wk.|
+011dabd0 ec d5 50 8b 76 5e 96 b4 49 21 f7 e4 b7 8f a3 45 |..P.v^..I!.....E|
@@ -1880164,2 +1880164,2 @@
-01cb0630 1c 74 74 16 75 b4 de f7 ce 4b 5e 4d 97 d6 36 d4 |.tt.u....K^M..6.|
-01cb0640 44 d9 fd 69 c5 d0 f0 a6 c6 44 26 53 7f 91 f3 62 |D..i.....D&S...b|
+01cb0630 73 bc 40 ce f8 9d 99 91 1b 14 8b a8 52 2a 7b 39 |[email protected]*{9|
+01cb0640 6b ff f5 c5 02 b9 ab c2 c2 08 5e e7 3a 5e 69 c4 |k.........^.:^i.|

Grepping (some of the above) for duplicates yields this:

$ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile1.hex
00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
$ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile2.hex
00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|*
00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|*

$ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile1.hex
0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|*
00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|*
$ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile2.hex
0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|

$ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile1.hex
00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|
00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
$ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile2.hex
00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|
00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|*
00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|*

I.e. testfile1 is (probably) corrupted at 000506f0..70f while
testfile2 is (probably) corrupted at 00021d10..2f and 00708da0..bf
(correpted lines marked with hand-made asterisks above)

If I keep grepping like this, the pattern is similar both within
these files and within testfile3 and testfile4. I.e. with
corruptions in 32-byte blocks at (seemingly) random positions
in the files. The corruption is always 16-byte-aligned and the bad
data seems to be a copy from exactly one page up in the file.

As stated above, I have bisected the issue to patch

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

which was added between v5.10-rc3 and v5.10-rc4. Every kernel I have
tried with that patch applied have exhibited the issue, and I have
had no trouble like this with any kernel without that patch. Apart
from a whole bunch of kernels prior to v5.10-rc3, that includes some
later kernels with the patch reverted (along with the dependent
followup 2d09e6eb4a6f). The latest I have tried is 5.11.22. Those
two patches does not revert cleanly in 5.12 (and thereafter) so I
have not tried anything beyond 5.11 with the patch reverted.

I fail to understand how that patch might cause this issue. I have
compared boot messages before and after the patch and there is no
(significant) difference. Everything seems to happen in the same
order with the same result. But that comparison is of course limited
to what is logged.

In some random attempt I tried to disable the D-Cache bit, and that
makes it all very slow but it also (seemingly) fixes the issue. But
that may of course be due to vastly different timings.

Some background:

We have a "Linea" CPU module, with a design based on the Atmel (now
Microchip) SAMA5D31 evaluation board. This CPU module is used on e.g.
our TSE-850 for which there is a device tree in
arch/arm/boot/dts/at91-tse850-3.dts
It has a nand flash for the rootfs and 64 MB RAM. The 40 MB random
testfile is thus big enough to cause page cache churn.

We have used this module in thousands of delivered units (however,
not that many TSE-850) and have never observed anything like this
before. But that has been with older kernels. 4.13.<something> and
4.15.<something> was what we were on until this recent activity.

We're now developing a new product (preliminary device tree included)
and the trusty old CPU module was used again and a fresh new kernel
was built for it. I then started to notice this issue and have tried
to include as much relevant data as possible. If you need more data
or would like me to test something, please ask.

I'm stumped.

Cheers,
Peter

Attachments:

.config (104.44 kB)
dmesg (17.48 kB)
at91-me20.dts (6.48 kB)
Download all attachments

2022-03-03 03:05:29

by Saravana Kannan

[permalink] [raw]

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>
> Hi!
>
> I'm seeing a weird problem, and I'd like some help with further
> things to try in order to track down what's going on. I have
> bisected the issue to
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")

I skimmed through your email and I'll read it more closely tomorrow,
but it wasn't clear if you see this on Linus's tip of the tree too.
Asking because of:
https://lore.kernel.org/lkml/[email protected]/

Also, a couple of other data points that _might_ help. Try kernel
command line option fw_devlink=permissive vs fw_devlink=on (I forget
if this was the default by 5.10) vs fw_devlink=off.

I'm expecting "off" to fix the issue for you. But if permissive vs on
shows a difference driver issues would start becoming a real
possibility.

-Saravana

>
> The symptoms are that I get (seemingly) random memory corruption
> when processing large amounts of data (compared to system size).
> I have two known reproducers, but I'm sure there are more if I
> keep digging. One is to do this:
>
> $ dd if=/dev/urandom of=testfile bs=1024 count=40000
> 40000+0 records in
> 40000+0 records out
> 40960000 bytes (41 MB, 39 MiB) copied, 19.7759 s, 2.1 MB/s
> $ for i in 1 2 3 4; do cat testfile | sha256sum; done
> d8c85f816e08baa5ad27050bf0413e11a09f325fb0a8843b7b2b45b9333ab542 -
> f223c1cbb6dbecb02d1741e7991dc98cd8d5b40ffee05bb32dc2c15eb73d6b1f -
> d6f3e7f3d325c67e83a6104934dd8a7c891ebfd9a2cf59633dbe97fb2cbb9c81 -
> cf8ada47e7e2fee299314440b225ba83fca3cef1f6286adc160a5d4f207caccd -
>
> It is harder to tickle the problem if I redirect the testfile to
> sha256sum w/o involving cat or give the file as an argument to
> sha256sum. I can also get things to behave better by getting rid
> of a bunch of USB interrupts by doing the following:
>
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB2/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB3/latency_timer
>
> With the lower interrupt pressure I get this:
>
> $ for i in 1 2 3 4; do cat testfile | sha256sum; done
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>
> Nice. However, I need the latency to be lower than the default
> 16ms, 3ms could perhaps work in theory, but preferably 1ms, so
> the above 100ms is far off. The initial hash run was with latency
> set to 1ms, which makes it easy to trigger the issue. The latency
> timer setting is for this driver: drivers/usb/serial/ftdi_sio.c
>
> And also, that does not help with the other reproducer, namely
> to copy that same random testfile with scp to a working system...
>
> $ scp testfile peda@xyzzy:testfile1
> testfile 100% 39MB 2.0MB/s 00:19
> $ scp testfile peda@xyzzy:testfile2
> testfile 100% 39MB 2.1MB/s 00:18
> $ scp testfile peda@xyzzy:testfile3
> testfile 100% 39MB 2.1MB/s 00:18
> $ scp testfile peda@xyzzy:testfile4
> testfile 100% 39MB 2.1MB/s 00:19
>
> ...and then perform the sha256sum on that xyzzy host instead:
>
> $ sha256sum testfile?
> 39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
> 9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
> c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
> 60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4
> $ sha256sum testfile?
> 39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
> 9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
> c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
> 60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4
>
> Same output every time. Of course. xyzzy is a working system...
> Converting these files to hex (hexdump -C) and diffing yields this:
>
> $ diff -u0 testfile1.hex testfile2.hex
> --- testfile1.hex 2022-03-02 23:56:38.273149516 +0100
> +++ testfile2.hex 2022-03-03 00:00:57.912747033 +0100
> @@ -8658,2 +8658,2 @@
> -00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
> -00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
> +00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
> +00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
> @@ -20592,2 +20592,2 @@
> -000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
> -00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
> +000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
> +00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
> @@ -461019,2 +461019,2 @@
> -00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
> -00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
> +00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|
> +00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
> @@ -902788,2 +902788,2 @@
> -00dc6830 f2 41 23 1b ec 54 d5 fe f0 33 51 f7 d2 fc bf bd |.A#..T...3Q.....|
> -00dc6840 e5 1f 58 df 24 2f e3 dc 65 87 b2 27 12 86 d1 9a |..X.$/..e..'....|
> +00dc6830 44 82 94 b5 c9 26 08 42 bd 89 e1 96 41 66 8a b5 |D....&.B....Af..|
> +00dc6840 a5 34 46 5e fd 1b c1 73 86 33 24 fd 4d e1 e1 68 |.4F^...s.3$.M..h|
> @@ -931900,2 +931900,2 @@
> -00e383b0 ee 64 c5 6f 38 44 5b 31 41 e1 2c 64 49 d5 f8 ad |.d.o8D[1A.,dI...|
> -00e383c0 fb 85 52 4f 00 1f 80 7a f3 de ee 8e db ac d5 bb |..RO...z........|
> +00e383b0 4b 4d 29 a1 0a 99 8f f7 32 71 8c de 23 ca a0 f1 |KM).....2q..#...|
> +00e383c0 e2 af e3 c4 a0 95 d3 1c ed 58 c4 c5 30 da 56 b9 |.........X..0.V.|
> @@ -1170109,2 +1170109,2 @@
> -011dabc0 6a 7c 0c 3c 86 1a b6 48 50 d7 98 68 0c 01 e3 1c |j|.<...HP..h....|
> -011dabd0 a3 a8 b0 f2 62 21 86 b9 d1 52 9d 74 9e 26 42 51 |....b!...R.t.&BQ|
> +011dabc0 5b 1a 9e 23 ae 58 42 68 83 58 df d6 c1 57 6b b0 |[..#.XBh.X...Wk.|
> +011dabd0 ec d5 50 8b 76 5e 96 b4 49 21 f7 e4 b7 8f a3 45 |..P.v^..I!.....E|
> @@ -1880164,2 +1880164,2 @@
> -01cb0630 1c 74 74 16 75 b4 de f7 ce 4b 5e 4d 97 d6 36 d4 |.tt.u....K^M..6.|
> -01cb0640 44 d9 fd 69 c5 d0 f0 a6 c6 44 26 53 7f 91 f3 62 |D..i.....D&S...b|
> +01cb0630 73 bc 40 ce f8 9d 99 91 1b 14 8b a8 52 2a 7b 39 |[email protected]*{9|
> +01cb0640 6b ff f5 c5 02 b9 ab c2 c2 08 5e e7 3a 5e 69 c4 |k.........^.:^i.|
>
> Grepping (some of the above) for duplicates yields this:
>
> $ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile1.hex
> 00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
> 00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
> 00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
> 00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
> $ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile2.hex
> 00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
> 00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
> 00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|*
> 00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|*
>
> $ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile1.hex
> 0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
> 0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
> 000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|*
> 00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|*
> $ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile2.hex
> 0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
> 0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
> 000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
> 00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
>
> $ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile1.hex
> 00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|
> 00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
> 00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
> 00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
> $ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile2.hex
> 00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|
> 00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
> 00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |[email protected]......|*
> 00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|*
>
> I.e. testfile1 is (probably) corrupted at 000506f0..70f while
> testfile2 is (probably) corrupted at 00021d10..2f and 00708da0..bf
> (correpted lines marked with hand-made asterisks above)
>
> If I keep grepping like this, the pattern is similar both within
> these files and within testfile3 and testfile4. I.e. with
> corruptions in 32-byte blocks at (seemingly) random positions
> in the files. The corruption is always 16-byte-aligned and the bad
> data seems to be a copy from exactly one page up in the file.
>
> As stated above, I have bisected the issue to patch
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> which was added between v5.10-rc3 and v5.10-rc4. Every kernel I have
> tried with that patch applied have exhibited the issue, and I have
> had no trouble like this with any kernel without that patch. Apart
> from a whole bunch of kernels prior to v5.10-rc3, that includes some
> later kernels with the patch reverted (along with the dependent
> followup 2d09e6eb4a6f). The latest I have tried is 5.11.22. Those
> two patches does not revert cleanly in 5.12 (and thereafter) so I
> have not tried anything beyond 5.11 with the patch reverted.
>
> I fail to understand how that patch might cause this issue. I have
> compared boot messages before and after the patch and there is no
> (significant) difference. Everything seems to happen in the same
> order with the same result. But that comparison is of course limited
> to what is logged.
>
> In some random attempt I tried to disable the D-Cache bit, and that
> makes it all very slow but it also (seemingly) fixes the issue. But
> that may of course be due to vastly different timings.
>
> Some background:
>
> We have a "Linea" CPU module, with a design based on the Atmel (now
> Microchip) SAMA5D31 evaluation board. This CPU module is used on e.g.
> our TSE-850 for which there is a device tree in
> arch/arm/boot/dts/at91-tse850-3.dts
> It has a nand flash for the rootfs and 64 MB RAM. The 40 MB random
> testfile is thus big enough to cause page cache churn.
>
> We have used this module in thousands of delivered units (however,
> not that many TSE-850) and have never observed anything like this
> before. But that has been with older kernels. 4.13.<something> and
> 4.15.<something> was what we were on until this recent activity.
>
> We're now developing a new product (preliminary device tree included)
> and the trusty old CPU module was used again and a fresh new kernel
> was built for it. I then started to notice this issue and have tried
> to include as much relevant data as possible. If you need more data
> or would like me to test something, please ask.
>
> I'm stumped.
>
> Cheers,
> Peter

2022-03-03 09:40:48

Subject: Regression: memory corruption on Atmel SAMA5D31

Attachments:

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: RE: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Subject: Re: Regression: memory corruption on Atmel SAMA5D31