2022-03-03 01:00:28

by Peter Rosin

Subject: Regression: memory corruption on Atmel SAMA5D31

Hi!

I'm seeing a weird problem, and I'd like some help with further
things to try in order to track down what's going on. I have
bisected the issue to

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

The symptoms are that I get (seemingly) random memory corruption
when processing large amounts of data (compared to system size).
I have two known reproducers, but I'm sure there are more if I
keep digging. One is to do this:

$ dd if=/dev/urandom of=testfile bs=1024 count=40000
40000+0 records in
40000+0 records out
40960000 bytes (41 MB, 39 MiB) copied, 19.7759 s, 2.1 MB/s
$ for i in 1 2 3 4; do cat testfile | sha256sum; done
d8c85f816e08baa5ad27050bf0413e11a09f325fb0a8843b7b2b45b9333ab542 -
f223c1cbb6dbecb02d1741e7991dc98cd8d5b40ffee05bb32dc2c15eb73d6b1f -
d6f3e7f3d325c67e83a6104934dd8a7c891ebfd9a2cf59633dbe97fb2cbb9c81 -
cf8ada47e7e2fee299314440b225ba83fca3cef1f6286adc160a5d4f207caccd -

It is harder to tickle the problem if I redirect the testfile to
sha256sum w/o involving cat or give the file as an argument to
sha256sum. I can also get things to behave better by getting rid
of a bunch of USB interrupts by doing the following:

$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB2/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB3/latency_timer

With the lower interrupt pressure I get this:

$ for i in 1 2 3 4; do cat testfile | sha256sum; done
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -

Nice. However, I need the latency to be lower than the default
16ms, 3ms could perhaps work in theory, but preferably 1ms, so
the above 100ms is far off. The initial hash run was with latency
set to 1ms, which makes it easy to trigger the issue. The latency
timer setting is for this driver: drivers/usb/serial/ftdi_sio.c
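
The four echo commands above can be collapsed into one loop. A minimal
sketch, assuming the ports of interest show up as ttyUSB* under
/sys/bus/usb-serial/devices; ports without a latency_timer attribute
are skipped. The function name set_latency and its optional second
argument are made up for illustration:

```shell
# Write the same latency (in ms) to every USB serial port that has a
# latency_timer attribute. The base directory can be overridden, which
# is only meant for trying the function out away from the target.
set_latency() {
    ms=$1
    base=${2:-/sys/bus/usb-serial/devices}
    for f in "$base"/ttyUSB*/latency_timer; do
        # An unmatched glob stays literal, so check before writing.
        if [ -e "$f" ]; then
            echo "$ms" > "$f"
        fi
    done
}

# Example: set_latency 100   (and back again: set_latency 1)
```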

And also, that does not help with the other reproducer, namely
to copy that same random testfile with scp to a working system...

$ scp testfile peda@xyzzy:testfile1
testfile 100% 39MB 2.0MB/s 00:19
$ scp testfile peda@xyzzy:testfile2
testfile 100% 39MB 2.1MB/s 00:18
$ scp testfile peda@xyzzy:testfile3
testfile 100% 39MB 2.1MB/s 00:18
$ scp testfile peda@xyzzy:testfile4
testfile 100% 39MB 2.1MB/s 00:19

...and then perform the sha256sum on that xyzzy host instead:

$ sha256sum testfile?
39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4
$ sha256sum testfile?
39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4

Same output every time. Of course. xyzzy is a working system...
Converting these files to hex (hexdump -C) and diffing yields this:

$ diff -u0 testfile1.hex testfile2.hex
--- testfile1.hex 2022-03-02 23:56:38.273149516 +0100
+++ testfile2.hex 2022-03-03 00:00:57.912747033 +0100
@@ -8658,2 +8658,2 @@
-00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
-00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
+00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
+00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
@@ -20592,2 +20592,2 @@
-000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
-00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
+000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
+00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
@@ -461019,2 +461019,2 @@
-00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
-00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
+00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
+00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
@@ -902788,2 +902788,2 @@
-00dc6830 f2 41 23 1b ec 54 d5 fe f0 33 51 f7 d2 fc bf bd |.A#..T...3Q.....|
-00dc6840 e5 1f 58 df 24 2f e3 dc 65 87 b2 27 12 86 d1 9a |..X.$/..e..'....|
+00dc6830 44 82 94 b5 c9 26 08 42 bd 89 e1 96 41 66 8a b5 |D....&.B....Af..|
+00dc6840 a5 34 46 5e fd 1b c1 73 86 33 24 fd 4d e1 e1 68 |.4F^...s.3$.M..h|
@@ -931900,2 +931900,2 @@
-00e383b0 ee 64 c5 6f 38 44 5b 31 41 e1 2c 64 49 d5 f8 ad |.d.o8D[1A.,dI...|
-00e383c0 fb 85 52 4f 00 1f 80 7a f3 de ee 8e db ac d5 bb |..RO...z........|
+00e383b0 4b 4d 29 a1 0a 99 8f f7 32 71 8c de 23 ca a0 f1 |KM).....2q..#...|
+00e383c0 e2 af e3 c4 a0 95 d3 1c ed 58 c4 c5 30 da 56 b9 |.........X..0.V.|
@@ -1170109,2 +1170109,2 @@
-011dabc0 6a 7c 0c 3c 86 1a b6 48 50 d7 98 68 0c 01 e3 1c |j|.<...HP..h....|
-011dabd0 a3 a8 b0 f2 62 21 86 b9 d1 52 9d 74 9e 26 42 51 |....b!...R.t.&BQ|
+011dabc0 5b 1a 9e 23 ae 58 42 68 83 58 df d6 c1 57 6b b0 |[..#.XBh.X...Wk.|
+011dabd0 ec d5 50 8b 76 5e 96 b4 49 21 f7 e4 b7 8f a3 45 |..P.v^..I!.....E|
@@ -1880164,2 +1880164,2 @@
-01cb0630 1c 74 74 16 75 b4 de f7 ce 4b 5e 4d 97 d6 36 d4 |.tt.u....K^M..6.|
-01cb0640 44 d9 fd 69 c5 d0 f0 a6 c6 44 26 53 7f 91 f3 62 |D..i.....D&S...b|
+01cb0630 73 bc 40 ce f8 9d 99 91 1b 14 8b a8 52 2a 7b 39 |s.@.........R*{9|
+01cb0640 6b ff f5 c5 02 b9 ab c2 c2 08 5e e7 3a 5e 69 c4 |k.........^.:^i.|

Grepping (some of the above) for duplicates yields this:

$ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile1.hex
00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
$ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile2.hex
00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|*
00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|*

$ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile1.hex
0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|*
00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|*
$ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile2.hex
0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|

$ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile1.hex
00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
$ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile2.hex
00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|*
00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|*

I.e. testfile1 is (probably) corrupted at 000506f0..70f while
testfile2 is (probably) corrupted at 00021d10..2f and 00708da0..bf
(corrupted lines marked with hand-made asterisks above).

If I keep grepping like this, the pattern is similar both within
these files and within testfile3 and testfile4. I.e. with
corruptions in 32-byte blocks at (seemingly) random positions
in the files. The corruption is always 16-byte-aligned and the bad
data seems to be a copy from exactly one page up in the file.
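
The distance between each corrupted block and the earlier offset that
holds the same data can be worked out from the egrep output above. A
small sketch; the helper name delta is made up, and the offsets are
copied from the listings:

```shell
# Byte distance between a corrupted block and the earlier offset that
# contains identical data (both given as hex file offsets).
delta() { echo $(( $1 - $2 )); }

delta 0x00021d10 0x00020d40   # testfile2, first corrupted block
delta 0x000506f0 0x0004f6f0   # testfile1
delta 0x00708da0 0x00707dd0   # testfile2, second corrupted block
```

For the pairs shown above this gives 4048, 4096 and 4048 bytes, i.e.
one 4 KiB page in one case and 48 bytes short of a page in the other
two.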

As stated above, I have bisected the issue to patch

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

which was added between v5.10-rc3 and v5.10-rc4. Every kernel I have
tried with that patch applied has exhibited the issue, and I have
had no trouble like this with any kernel without that patch. Apart
from a whole bunch of kernels prior to v5.10-rc3, that includes some
later kernels with the patch reverted (along with the dependent
followup 2d09e6eb4a6f). The latest I have tried is 5.11.22. Those
two patches do not revert cleanly in 5.12 (and thereafter) so I
have not tried anything beyond 5.11 with the patch reverted.

I fail to understand how that patch might cause this issue. I have
compared boot messages before and after the patch and there is no
(significant) difference. Everything seems to happen in the same
order with the same result. But that comparison is of course limited
to what is logged.

In some random attempt I tried to disable the D-Cache bit, and that
makes it all very slow but it also (seemingly) fixes the issue. But
that may of course be due to vastly different timings.

Some background:

We have a "Linea" CPU module, with a design based on the Atmel (now
Microchip) SAMA5D31 evaluation board. This CPU module is used on e.g.
our TSE-850 for which there is a device tree in
arch/arm/boot/dts/at91-tse850-3.dts
It has a nand flash for the rootfs and 64 MB RAM. The 40 MB random
testfile is thus big enough to cause page cache churn.

We have used this module in thousands of delivered units (however,
not that many TSE-850) and have never observed anything like this
before. But that has been with older kernels. 4.13.<something> and
4.15.<something> were what we were on until this recent activity.

We're now developing a new product (preliminary device tree included)
and the trusty old CPU module was used again and a fresh new kernel
was built for it. I then started to notice this issue and have tried
to include as much relevant data as possible. If you need more data
or would like me to test something, please ask.

I'm stumped.

Cheers,
Peter


Attachments:
.config (104.44 kB)
dmesg (17.48 kB)
at91-me20.dts (6.48 kB)

2022-03-03 03:05:29

by Saravana Kannan

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin wrote:
>
> Hi!
>
> I'm seeing a weird problem, and I'd like some help with further
> things to try in order to track down what's going on. I have
> bisected the issue to
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")

I skimmed through your email and I'll read it more closely tomorrow,
but it wasn't clear if you see this on Linus's tip of the tree too.
Asking because of:
https://lore.kernel.org/lkml/[email protected]/

Also, a couple of other data points that _might_ help. Try kernel
command line option fw_devlink=permissive vs fw_devlink=on (I forget
if this was the default by 5.10) vs fw_devlink=off.

I'm expecting "off" to fix the issue for you. But if permissive vs on
shows a difference driver issues would start becoming a real
possibility.
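
To double-check which setting a given boot actually used, the
parameter can be read back from /proc/cmdline. A rough sketch; the
function name fw_devlink_mode is made up, it reports the last
occurrence on the command line, and "unset" only means the option was
not passed, so the kernel's built-in default applies:

```shell
# Print the fw_devlink= value from the kernel command line, or "unset"
# if the option is absent. The file argument defaults to /proc/cmdline.
fw_devlink_mode() {
    cmdline_file=${1:-/proc/cmdline}
    # One parameter per line, keep only the value of fw_devlink=...
    mode=$(tr ' ' '\n' < "$cmdline_file" | sed -n 's/^fw_devlink=//p' | tail -n 1)
    echo "${mode:-unset}"
}

# Example: fw_devlink_mode
```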

-Saravana

2022-03-03 09:40:48

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-03 04:02, Saravana Kannan wrote:
> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin wrote:
>>
>> Hi!
>>
>> I'm seeing a weird problem, and I'd like some help with further
>> things to try in order to track down what's going on. I have
>> bisected the issue to
>>
>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> I skimmed through your email and I'll read it more closely tomorrow,
> but it wasn't clear if you see this on Linus's tip of the tree too.
> Asking because of:
> https://lore.kernel.org/lkml/[email protected]/
>
> Also, a couple of other data points that _might_ help. Try kernel
> command line option fw_devlink=permissive vs fw_devlink=on (I forget
> if this was the default by 5.10) vs fw_devlink=off.
>
> I'm expecting "off" to fix the issue for you. But if permissive vs on
> shows a difference driver issues would start becoming a real
> possibility.
>
> -Saravana

Thanks for the quick reply! I don't think I tested the very tip of
Linus's tree before, only the latest rc or something like that, but
now I have. I.e.

5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")

It would have been typical if an issue that has existed for a couple
of years had been fixed in the last few weeks, but alas, no.

On that kernel, and with whatever the default fw_devlink value is, the
issue is there. It's a bit hard to tell if the incident probability
is the same for the different fw_devlink arguments, but it is roughly
so, and I do not have to wait long to get a bad hash with the first
reproducer:

while :; do cat testfile | sha256sum; done

The output is typical:
78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -

Setting fw_devlink=off makes no difference, AFAICT.
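
To put a rough number on the failure rate when comparing settings, the
reproducer loop can count mismatches against the known-good hash. A
sketch only; count_bad is a made-up name, and the good hash has to
come from a run that is trusted:

```shell
# Hash FILE via "cat | sha256sum" RUNS times and print how many runs
# did not match the expected hash, as "bad/runs".
count_bad() {
    file=$1; good=$2; runs=$3
    bad=0; i=0
    while [ "$i" -lt "$runs" ]; do
        sum=$(cat "$file" | sha256sum | cut -d' ' -f1)
        [ "$sum" = "$good" ] || bad=$((bad + 1))
        i=$((i + 1))
    done
    echo "$bad/$runs"
}

# Example, with the good hash seen above:
# count_bad testfile 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d 100
```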

So, just to double-check I went back to 5.11.22 with the two
mentioned patches reverted [1], plus an added backport of

c73960bb0a43 ("gpiolib: allow line names from device props to override driver names")

in order to make userspace behave as similarly as possible.
I left that running for an hour or so with 350-ish hashes
calculated correctly. Which is no proof that there is no latent
issue of course, but at the very least a great deal more stable
than later kernels.

Cheers,
Peter

[1]
f9aa460672c9 ("driver core: Refactor fw_devlink feature")
2d09e6eb4a6f ("driver core: Delete pointless parameter in fwnode_operations.add_links")

2022-03-04 11:20:12

by Thorsten Leemhuis

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already
from similar mails.]

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

Thanks for the report.

To be sure the issue below doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced f9aa460672c9
#regzbot title memory corruption on Atmel SAMA5D31
#regzbot ignore-activity

Reminder for developers: when fixing the issue, please add 'Link:'
tags pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. This allows the bot to connect
the report with any patches posted or committed to fix the issue; this
again allows the bot to show the current status of regressions and
automatically resolve the issue when the fix hits the right tree.

I'm sending this to everyone that got the initial report, to make them
aware of the tracking. I also hope that messages like this motivate
people to directly get at least the regression mailing list and ideally
even regzbot involved when dealing with regressions, as messages like
this wouldn't be needed then. And don't worry, if I need to send other
mails regarding this regression only relevant for regzbot I'll send them
to the regressions lists only (with a tag in the subject so people can
filter them away). With a bit of luck no such messages will be needed
anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.



On 03.03.22 01:29, Peter Rosin wrote:
> Hi!
>
> I'm seeing a weird problem, and I'd like some help with further
> things to try in order to track down what's going on. I have
> bisected the issue to
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> The symptoms are that I get (seemingly) random memory corruption
> when processing large amounts of data (compared to system size).
> I have two known reproducers, but I'm sure there are more if I
> keep digging. One is to do this:
>
> $ dd if=/dev/urandom of=testfile bs=1024 count=40000
> 40000+0 records in
> 40000+0 records out
> 40960000 bytes (41 MB, 39 MiB) copied, 19.7759 s, 2.1 MB/s
> $ for i in 1 2 3 4; do cat testfile | sha256sum; done
> d8c85f816e08baa5ad27050bf0413e11a09f325fb0a8843b7b2b45b9333ab542 -
> f223c1cbb6dbecb02d1741e7991dc98cd8d5b40ffee05bb32dc2c15eb73d6b1f -
> d6f3e7f3d325c67e83a6104934dd8a7c891ebfd9a2cf59633dbe97fb2cbb9c81 -
> cf8ada47e7e2fee299314440b225ba83fca3cef1f6286adc160a5d4f207caccd -
>
> It is harder to tickle the problem if I redirect the testfile to
> sha256sum w/o involving cat or give the file as an argument to
> sha256sum. I can also get things to behave better by getting rid
> of a bunch of USB interrupts by doing the following:
>
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB2/latency_timer
> $ echo 100 > /sys/bus/usb-serial/devices/ttyUSB3/latency_timer
>
> With the lower interrupt pressure I get this:
>
> $ for i in 1 2 3 4; do cat testfile | sha256sum; done
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>
> Nice. However, I need the latency to be lower than the default
> 16ms, 3ms could perhaps work in theory, but preferably 1ms, so
> the above 100ms is far off. The initial hash run was with latency
> set to 1ms, which makes it easy to trigger the issue. The latency
> timer setting is for this driver: drivers/usb/serial/ftdi_sio.c
>
> And also, that does not help with the other reproducer, namely
> to copy that same random testfile with scp to a working system...
>
> $ scp testfile peda@xyzzy:testfile1
> testfile 100% 39MB 2.0MB/s 00:19
> $ scp testfile peda@xyzzy:testfile2
> testfile 100% 39MB 2.1MB/s 00:18
> $ scp testfile peda@xyzzy:testfile3
> testfile 100% 39MB 2.1MB/s 00:18
> $ scp testfile peda@xyzzy:testfile4
> testfile 100% 39MB 2.1MB/s 00:19
>
> ...and then perform the sha256sum on that xyzzy host instead:
>
> $ sha256sum testfile?
> 39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
> 9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
> c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
> 60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4
> $ sha256sum testfile?
> 39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66 testfile1
> 9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206 testfile2
> c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d testfile3
> 60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310 testfile4
>
> Same output every time. Of course. xyzzy is a working system...
> Converting these files to hex (hexdump -C) and diffing yields this:
>
> $ diff -u0 testfile1.hex testfile2.hex
> --- testfile1.hex 2022-03-02 23:56:38.273149516 +0100
> +++ testfile2.hex 2022-03-03 00:00:57.912747033 +0100
> @@ -8658,2 +8658,2 @@
> -00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
> -00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
> +00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
> +00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
> @@ -20592,2 +20592,2 @@
> -000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
> -00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
> +000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
> +00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
> @@ -461019,2 +461019,2 @@
> -00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
> -00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
> +00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
> +00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
> @@ -902788,2 +902788,2 @@
> -00dc6830 f2 41 23 1b ec 54 d5 fe f0 33 51 f7 d2 fc bf bd |.A#..T...3Q.....|
> -00dc6840 e5 1f 58 df 24 2f e3 dc 65 87 b2 27 12 86 d1 9a |..X.$/..e..'....|
> +00dc6830 44 82 94 b5 c9 26 08 42 bd 89 e1 96 41 66 8a b5 |D....&.B....Af..|
> +00dc6840 a5 34 46 5e fd 1b c1 73 86 33 24 fd 4d e1 e1 68 |.4F^...s.3$.M..h|
> @@ -931900,2 +931900,2 @@
> -00e383b0 ee 64 c5 6f 38 44 5b 31 41 e1 2c 64 49 d5 f8 ad |.d.o8D[1A.,dI...|
> -00e383c0 fb 85 52 4f 00 1f 80 7a f3 de ee 8e db ac d5 bb |..RO...z........|
> +00e383b0 4b 4d 29 a1 0a 99 8f f7 32 71 8c de 23 ca a0 f1 |KM).....2q..#...|
> +00e383c0 e2 af e3 c4 a0 95 d3 1c ed 58 c4 c5 30 da 56 b9 |.........X..0.V.|
> @@ -1170109,2 +1170109,2 @@
> -011dabc0 6a 7c 0c 3c 86 1a b6 48 50 d7 98 68 0c 01 e3 1c |j|.<...HP..h....|
> -011dabd0 a3 a8 b0 f2 62 21 86 b9 d1 52 9d 74 9e 26 42 51 |....b!...R.t.&BQ|
> +011dabc0 5b 1a 9e 23 ae 58 42 68 83 58 df d6 c1 57 6b b0 |[..#.XBh.X...Wk.|
> +011dabd0 ec d5 50 8b 76 5e 96 b4 49 21 f7 e4 b7 8f a3 45 |..P.v^..I!.....E|
> @@ -1880164,2 +1880164,2 @@
> -01cb0630 1c 74 74 16 75 b4 de f7 ce 4b 5e 4d 97 d6 36 d4 |.tt.u....K^M..6.|
> -01cb0640 44 d9 fd 69 c5 d0 f0 a6 c6 44 26 53 7f 91 f3 62 |D..i.....D&S...b|
> +01cb0630 73 bc 40 ce f8 9d 99 91 1b 14 8b a8 52 2a 7b 39 |s.@.........R*{9|
> +01cb0640 6b ff f5 c5 02 b9 ab c2 c2 08 5e e7 3a 5e 69 c4 |k.........^.:^i.|
>
> Grepping (some of the above) for duplicates yields this:
>
> $ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile1.hex
> 00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
> 00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
> 00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
> 00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
> $ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile2.hex
> 00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
> 00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
> 00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|*
> 00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|*
>
> $ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile1.hex
> 0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
> 0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
> 000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|*
> 00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|*
> $ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile2.hex
> 0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
> 0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
> 000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
> 00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
>
> $ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile1.hex
> 00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
> 00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
> 00708da0 0d 49 c3 e8 57 06 20 5a c1 27 74 29 f8 83 af 69 |.I..W. Z.'t)...i|
> 00708db0 94 4d 5b 71 9f 3e e5 d2 91 cc cb cd aa ff 44 8b |.M[q.>........D.|
> $ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile2.hex
> 00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
> 00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
> 00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|*
> 00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|*
>
> I.e. testfile1 is (probably) corrupted at 000506f0..70f while
> testfile2 is (probably) corrupted at 00021d10..2f and 00708da0..bf
> (corrupted lines marked with hand-made asterisks above)
>
> If I keep grepping like this, the pattern is similar both within
> these files and within testfile3 and testfile4. I.e. with
> corruptions in 32-byte blocks at (seemingly) random positions
> in the files. The corruption is always 16-byte-aligned and the bad
> data seems to be a copy from exactly one page up in the file.
>
> As stated above, I have bisected the issue to patch
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> which was added between v5.10-rc3 and v5.10-rc4. Every kernel I have
> tried with that patch applied has exhibited the issue, and I have
> had no trouble like this with any kernel without that patch. Apart
> from a whole bunch of kernels prior to v5.10-rc3, that includes some
> later kernels with the patch reverted (along with the dependent
> followup 2d09e6eb4a6f). The latest I have tried is 5.11.22. Those
> two patches do not revert cleanly in 5.12 (and thereafter) so I
> have not tried anything beyond 5.11 with the patch reverted.
>
> I fail to understand how that patch might cause this issue. I have
> compared boot messages before and after the patch and there is no
> (significant) difference. Everything seems to happen in the same
> order with the same result. But that comparison is of course limited
> to what is logged.
>
> In some random attempt I tried to disable the D-Cache bit, and that
> makes it all very slow but it also (seemingly) fixes the issue. But
> that may of course be due to vastly different timings.
>
> Some background:
>
> We have a "Linea" CPU module, with a design based on the Atmel (now
> Microchip) SAMA5D31 evaluation board. This CPU module is used on e.g.
> our TSE-850 for which there is a device tree in
> arch/arm/boot/dts/at91-tse850-3.dts
> It has a nand flash for the rootfs and 64 MB RAM. The 40 MB random
> testfile is thus big enough to cause page cache churn.
>
> We have used this module in thousands of delivered units (however,
> not that many TSE-850) and have never observed anything like this
> before. But that has been with older kernels. 4.13.<something> and
> 4.15.<something> was what we were on until this recent activity.
>
> We're now developing a new product (preliminary device tree included)
> and the trusty old CPU module was used again and a fresh new kernel
> was built for it. I then started to notice this issue and have tried
> to include as much relevant data as possible. If you need more data
> or would like me to test something, please ask.
>
> I'm stumped.
>
> Cheers,
> Peter

--
Additional information about regzbot:

If you want to know more about regzbot, check out its web interface, the
getting started guide, and the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker -- that's in your interest, as it ensures your report won't fall
through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include 'Link:' tags in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.

2022-03-04 12:07:02

by Peter Rosin

[permalink] [raw]
Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-04 07:57, Peter Rosin wrote:
> On 2022-03-04 04:55, Saravana Kannan wrote:
>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>>>
>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>> things to try in order to track down what's going on. I have
>>>>> bisected the issue to
>>>>>
>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>
>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>> Asking because of:
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>
>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>> shows a difference driver issues would start becoming a real
>>>> possibility.
>>>>
>>>> -Saravana
>>>
>>> Thanks for the quick reply! I don't think I tested the very tip of
>>> Linus tree before, only latest rc or something like that, but now I
>>> have. I.e.
>>>
>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>
>>> It would have been typical if an issue that existed for a couple of
>>> years had been fixed the last few weeks, but alas, no.
>>>
>>> On that kernel, and with whatever the default fw_devlink value is, the
>>
>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>
>>> issue is there. It's a bit hard to tell if the incident probability
>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>> do not have to wait for long to get a bad hash with the first
>>> reproducer
>>>
>>> while :; do cat testfile | sha256sum; done
>>>
>>> The output is typical:
>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>
>>> Setting fw_devlink=off makes no difference, AFAICT.
>>
>> By this, I'm assuming you set fw_devlink=off in the kernel command
>> line and you still saw the corruption.
>
> Yes. On a bad kernel it's the same with all of the following kernel
> command lines.
>
> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>
> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>
> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>
>> If that's the case, I can't see how this could possibly have anything
>> to do with:
>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>
>> If you look at fw_devlink_link_device(), you'll see that the function
>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>> there, the rest of the code in the series doesn't run because more
>> fields wouldn't get set, etc. That pretty much disables ALL the code
>> in the entire series. The only remaining diff would be header file
>> changes where I add/remove fields. But that's unlikely to cause any
>> issues here because I'm either deleting fields that aren't used or
>> adding fields that won't be used (with fw_devlink=off). I think the
>> patch was just causing enough timing changes that it's masking the
>> real issue.
>
> When I compare fw_devlink_link_device() from before and after
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> I notice that you also removed an unconditional call to
> device_link_add_missing_supplier_links() that was live before,
> regardless of any fw_devlink parameter.
>
> I don't know if that's relevant. Is it?
>
> Not knowing this code at all, and without any serious attempt
> at reading it, from here the comment of that removed function
> sure looks like it might cause a different ordering before and
> after the patch that is not restored with any fw_devlink
> argument.

It appears that the device_link_add_missing_supplier_links() difference
is not relevant after all. What actually happened in the header file in
the "bad" commit was that two fields were removed (none added). Like so:

struct dev_links_info {
struct list_head suppliers;
struct list_head consumers;
- struct list_head needs_suppliers;
struct list_head defer_sync;
- bool need_for_probe;
enum dl_dev_state status;
};

If I restore those fields on a bad kernel, the issue is no longer
visible. That is true for the first bad kernel, i.e.

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

and for tip of Linus as of recently, i.e.

5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")

Which is of course insane and a whole different level of bad. WTF!?!
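
For anyone who wants to see what restoring the fields changes in memory,
a possible next step is to compare structure layouts between a good and a
bad build. struct device embeds struct dev_links_info (as its "links"
member), so dropping two fields moves every member that follows it. The
sketch below is only a suggestion: it assumes pahole (from the dwarves
package) is installed and that both kernels were built with debug info,
neither of which is stated in this thread, and struct_layout_diff and the
vmlinux paths are made-up names.

```shell
# Show how the layout of one struct differs between two kernel builds.
# Usage: struct_layout_diff STRUCT GOOD_VMLINUX BAD_VMLINUX
# Assumption: pahole is available and both images carry debug info.
struct_layout_diff() {
    name=$1 good=$2 bad=$3
    good_out=$(mktemp) || return 2
    bad_out=$(mktemp) || { rm -f "$good_out"; return 2; }
    # pahole -C prints the members, offsets and total size of one struct.
    pahole -C "$name" "$good" > "$good_out"
    pahole -C "$name" "$bad" > "$bad_out"
    # diff exits 1 when the layouts differ; pass that status through.
    diff -u "$good_out" "$bad_out" && rc=0 || rc=$?
    rm -f "$good_out" "$bad_out"
    return "$rc"
}
```

For example, "struct_layout_diff device good/vmlinux bad/vmlinux" would
list the members of struct device whose offsets moved. A layout change by
itself does not explain the corruption; it only narrows down what differs
between the two builds.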

I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
there? I think that's next on the list...

Cheers,
Peter

2022-03-04 13:17:46

by Tudor Ambarus

[permalink] [raw]
Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Hi, Peter!

On 3/4/22 12:57, Peter Rosin wrote:
> On 2022-03-04 07:57, Peter Rosin wrote:
>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>>>>
>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>> things to try in order to track down what's going on. I have
>>>>>> bisected the issue to
>>>>>>
>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>
>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>> Asking because of:
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>
>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>> shows a difference driver issues would start becoming a real
>>>>> possibility.
>>>>>
>>>>> -Saravana
>>>>
>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>> Linus tree before, only latest rc or something like that, but now I
>>>> have. I.e.
>>>>
>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>
>>>> It would have been typical if an issue that existed for a couple of
>>>> years had been fixed the last few weeks, but alas, no.
>>>>
>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>
>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>
>>>> issue is there. It's a bit hard to tell if the incident probability
>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>> do not have to wait for long to get a bad hash with the first
>>>> reproducer
>>>>
>>>> while :; do cat testfile | sha256sum; done
>>>>
>>>> The output is typical:
>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>
>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>
>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>> line and you still saw the corruption.
>>
>> Yes. On a bad kernel it's the same with all of the following kernel
>> command lines.
>>
>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>
>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>
>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>
>>> If that's the case, I can't see how this could possibly have anything
>>> to do with:
>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>
>>> If you look at fw_devlink_link_device(), you'll see that the function
>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>> there, the rest of the code in the series doesn't run because more
>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>> in the entire series. The only remaining diff would be header file
>>> changes where I add/remove fields. But that's unlikely to cause any
>>> issues here because I'm either deleting fields that aren't used or
>>> adding fields that won't be used (with fw_devlink=off). I think the
>>> patch was just causing enough timing changes that it's masking the
>>> real issue.
>>
>> When I compare fw_devlink_link_device() from before and after
>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>> I notice that you also removed an unconditional call to
>> device_link_add_missing_supplier_links() that was live before,
>> regardless of any fw_devlink parameter.
>>
>> I don't know if that's relevant. Is it?
>>
>> Not knowing this code at all, and without any serious attempt
>> at reading it, from here the comment of that removed function
>> sure looks like it might cause a different ordering before and
>> after the patch that is not restored with any fw_devlink
>> argument.
>
> It appears that the device_link_add_missing_supplier_links() difference
> is not relevant after all. What actually happened in the header file in
> the "bad" commit was that two fields were removed (none added). Like so:
>
> struct dev_links_info {
> struct list_head suppliers;
> struct list_head consumers;
> - struct list_head needs_suppliers;
> struct list_head defer_sync;
> - bool need_for_probe;
> enum dl_dev_state status;
> };
>
> If I restore those fields on a bad kernel, the issue is no longer
> visible. That is true for the first bad kernel, i.e.
>
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> and for tip of Linus as of recently, i.e.
>
> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>
> Which is of course insane and a whole different level of bad. WTF!?!
>
> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
> there? I think that's next on the list...
>

I have a sama5d3_xplained that uses a SAMA5D36 and has 256 MB of DDR2 and
256 MB of NAND flash. I tried a test with a 200 MB file, with the rootfs on
an SD card, and couldn't reproduce the bug. I'm using Linus's latest kernel:
38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link

root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
200000+0 records in
200000+0 records out
204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
root@sama5d3-xplained-sd:~#

I'll put the rootfs on NAND and retest. I may also run some other tests
in parallel to generate more interrupts on the system. Will let you know
if I can reproduce the bug on sama5d3_xplained.

Cheers,
ta

2022-03-04 13:58:26

by Peter Rosin

[permalink] [raw]
Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-04 04:55, Saravana Kannan wrote:
> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>>
>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I'm seeing a weird problem, and I'd like some help with further
>>>> things to try in order to track down what's going on. I have
>>>> bisected the issue to
>>>>
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>
>>> I skimmed through your email and I'll read it more closely tomorrow,
>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>> Asking because of:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Also, a couple of other data points that _might_ help. Try kernel
>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>> if this was the default by 5.10) vs fw_devlink=off.
>>>
>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>> shows a difference driver issues would start becoming a real
>>> possibility.
>>>
>>> -Saravana
>>
>> Thanks for the quick reply! I don't think I tested the very tip of
>> Linus tree before, only latest rc or something like that, but now I
>> have. I.e.
>>
>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>
>> It would have been typical if an issue that existed for a couple of
>> years had been fixed the last few weeks, but alas, no.
>>
>> On that kernel, and with whatever the default fw_devlink value is, the
>
> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>
>> issue is there. It's a bit hard to tell if the incident probability
>> is the same when trying fw_devlink arguments, but roughly so, and I
>> do not have to wait for long to get a bad hash with the first
>> reproducer
>>
>> while :; do cat testfile | sha256sum; done
>>
>> The output is typical:
>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>
>> Setting fw_devlink=off makes no difference, AFAICT.
>
> By this, I'm assuming you set fw_devlink=off in the kernel command
> line and you still saw the corruption.

Yes. On a bad kernel it's the same with all of the following kernel
command lines.

console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)

console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)

console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
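
To rule out a boot loader silently appending or overriding arguments, the
value that actually reached the kernel can be read back from
/proc/cmdline on the running system. A minimal sketch, where
fw_devlink_value is a hypothetical helper and reporting the last
occurrence is an assumption about how a repeated option would be
resolved:

```shell
# Print the value of the last fw_devlink= option on a kernel command line.
# With no argument, the running kernel's /proc/cmdline is used.
# Prints "(not set)" when the option is absent.
# The body runs in a subshell so that "set -f" does not leak to the caller.
fw_devlink_value() (
    set -f    # no filename expansion while splitting the line into words
    if [ "$#" -gt 0 ]; then
        cmdline=$1
    else
        cmdline=$(cat /proc/cmdline)
    fi
    value="(not set)"
    for word in $cmdline; do
        case $word in
            fw_devlink=*) value=${word#fw_devlink=} ;;
        esac
    done
    printf '%s\n' "$value"
)
```

Running "fw_devlink_value" with no argument after each boot shows which
of the three settings was in effect for that test run.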

> If that's the case, I can't see how this could possibly have anything
> to do with:
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> If you look at fw_devlink_link_device(), you'll see that the function
> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
> there, the rest of the code in the series doesn't run because more
> fields wouldn't get set, etc. That pretty much disables ALL the code
> in the entire series. The only remaining diff would be header file
> changes where I add/remove fields. But that's unlikely to cause any
> issues here because I'm either deleting fields that aren't used or
> adding fields that won't be used (with fw_devlink=off). I think the
> patch was just causing enough timing changes that it's masking the
> real issue.

When I compare fw_devlink_link_device() from before and after
f9aa460672c9 ("driver core: Refactor fw_devlink feature")
I notice that you also removed an unconditional call to
device_link_add_missing_supplier_links() that was live before,
regardless of any fw_devlink parameter.

I don't know if that's relevant. Is it?

Not knowing this code at all, and without any serious attempt
at reading it, from here the comment of that removed function
sure looks like it might cause a different ordering before and
after the patch that is not restored with any fw_devlink
argument.

> IIRC (it's been more than a year), the series [1] that brings in this
> patch has a few reverts. Those reverts undo subtle device probe
> ordering changes brought in by a bunch of earlier patches. You could
> go back to before those patches were added and see if you still see
> this corruption and then start bisecting from there. Basically try
> going to a point before:
> 42926ac3cd50 ("driver core: Move code to the right part of the file")

That patch was added after 5.7-rc5, so just to make sure, I have now
also tested 5.6. As expected, it looks like a good kernel from here.
It's been running while I have written this mail and has consistently
produced good hashes.
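
"Consistently good" is easier to compare across kernels when the runs are
counted. A small sketch of such a tally, assuming the reference hash is
computed once on a known-good system (such as the xyzzy host from the
first mail); count_bad_hashes is a hypothetical helper name:

```shell
# Hash FILE through a pipe RUNS times and count the results that differ
# from the reference hash REF.
# Usage: count_bad_hashes FILE REF RUNS
# Prints "bad/total" and returns non-zero if any run mismatched.
count_bad_hashes() {
    file=$1 ref=$2 runs=$3
    bad=0 i=0
    while [ "$i" -lt "$runs" ]; do
        # Go through cat on purpose: the thread reports that the piped
        # form triggers the problem more readily than a plain redirect.
        sum=$(cat "$file" | sha256sum | cut -d' ' -f1)
        [ "$sum" = "$ref" ] || bad=$((bad + 1))
        i=$((i + 1))
    done
    printf '%s/%s\n' "$bad" "$runs"
    [ "$bad" -eq 0 ]
}
```

With the reference hash in "$ref", a call such as
"count_bad_hashes testfile "$ref" 100" gives one bad/total figure per
kernel, which can then be compared between builds.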

I arrived at the bad patch by first noticing that 5.15.6 was bad and
that 4.14 was good. I then did a manual preliminary bisect-like
thing and concluded that 5.1 was good, 5.8 was good, 5.11 was bad,
and that 5.10 was good (I think that was the order anyway, not that
it matters all that much). I then did a "proper" bisect between 5.10
and 5.11.

$ git bisect log
git bisect start
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect bad f40ddce88593482919761f74910f42f4b84c004b
# bad: [538fcf57aaee6ad78a05f52b69a99baa22b33418] Merge branches 'acpi-scan', 'acpi-pnp' and 'acpi-sleep'
git bisect bad 538fcf57aaee6ad78a05f52b69a99baa22b33418
# good: [15b447361794271f4d03c04d82276a841fe06328] mm/lru: revise the comments of lru_lock
git bisect good 15b447361794271f4d03c04d82276a841fe06328
# good: [d635a69dd4981cc51f90293f5f64268620ed1565] Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good d635a69dd4981cc51f90293f5f64268620ed1565
# bad: [2911ed9f47b47cb5ab87d03314b3b9fe008e607f] Merge tag 'char-misc-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 2911ed9f47b47cb5ab87d03314b3b9fe008e607f
# good: [c367caf1a38b6f0a1aababafd88b00fefa625f9e] Merge tag 'sound-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good c367caf1a38b6f0a1aababafd88b00fefa625f9e
# good: [93f998879cd95b3e4f2836e7b17d6d5ae035cf90] Merge tag 'extcon-next-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-next
git bisect good 93f998879cd95b3e4f2836e7b17d6d5ae035cf90
# good: [b5206275b46c30a8236feb34a1dc247fa3683d83] usb: typec: tcpm: convert comma to semicolon
git bisect good b5206275b46c30a8236feb34a1dc247fa3683d83
# good: [9e1792727ead477f49958578d0dbd466a7deea48] tty: use const parameters in port-flag accessors
git bisect good 9e1792727ead477f49958578d0dbd466a7deea48
# good: [157f809894f3cf8e62b4011915a00398603215c9] Merge tag 'tty-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good 157f809894f3cf8e62b4011915a00398603215c9
# good: [25ac86c6dbe62fba9b97e997fa648cdbe2d40173] driver core: Use device's fwnode to check if it is waiting for suppliers
git bisect good 25ac86c6dbe62fba9b97e997fa648cdbe2d40173
# bad: [9c30921fe7994907e0b3e0637b2c8c0fc4b5171f] driver core: platform: use bus_type functions
git bisect bad 9c30921fe7994907e0b3e0637b2c8c0fc4b5171f
# bad: [5b6164d3465fcc13b5679c860c452963443172a7] driver core: Reorder devices on successful probe
git bisect bad 5b6164d3465fcc13b5679c860c452963443172a7
# good: [e82a840cb1c1c83d01a9b81bb63b6cf1c09239d7] efi: Update implementation of add_links() to create fwnode links
git bisect good e82a840cb1c1c83d01a9b81bb63b6cf1c09239d7
# bad: [2d09e6eb4a6f20273959f4905ccf009da8c64c7a] driver core: Delete pointless parameter in fwnode_operations.add_links
git bisect bad 2d09e6eb4a6f20273959f4905ccf009da8c64c7a
# bad: [f9aa460672c9c56896cdc12a521159e3e67000ba] driver core: Refactor fw_devlink feature
git bisect bad f9aa460672c9c56896cdc12a521159e3e67000ba
# first bad commit: [f9aa460672c9c56896cdc12a521159e3e67000ba] driver core: Refactor fw_devlink feature

Since I need drivers that were added for 5.11, and it was easy
to revert there, I landed at 5.11.22. And while that seems
workable at the moment, it's of course not at all where I want
to be.

Since then, I have tried a fair few kernels after 5.11, and
they have all been bad. I'm sad to say that I have not kept a
log of exactly which ones though.

> TL;DR: is that since you are reproducing this with fw_devlink=off, I'm
> pretty sure the problem is not actually because of my changes or any
> changes related to fw_devlink.

I too don't get it, but it's a little bit too consistent with
everything pointing at this one patch across so many changes.
Nothing is good after this patch, and it all behaves a little
bit too similar across the bad kernels for it to be some subtle
timing issue. Methinks. But maybe I just need to stumble on
to some later good kernel. Not holding my breath though...

But it does seem related to interrupts, as I mentioned in the
original mail, I can take a bad kernel and reduce the interrupt
pressure from USB from slightly more than 1kHz down to a
trickle and things behave much better when it comes to sha256sum.
Copying with scp might cause network interrupts, so the two
reproducers I have are perhaps quite similar? If that's the
case, then the trigger would be page cache churn, interrupts and a
fair bit of CPU usage (calculating hashes or encrypting).
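
To put rough numbers on "behave much better", the hash loop can be
wrapped so that it counts how many runs disagree. This is only a
sketch: the names hash_loop and count_outliers are made up for
illustration, and it assumes that the most frequent digest is the
correct one, which matches the output seen so far but is not
guaranteed.

```shell
#!/bin/sh
# Run the "cat | sha256sum" reproducer N times, printing one digest per run.
hash_loop() {
    file=$1
    runs=$2
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Keep the pipe through cat on purpose: reading the file directly
        # was reported to make the problem harder to trigger.
        cat "$file" | sha256sum | cut -d' ' -f1
        i=$((i + 1))
    done
}

# Read digests on stdin and print how many differ from the most common one.
# Assumption: the most common digest is the good one.
count_outliers() {
    sort | uniq -c | sort -rn |
        awk 'NR > 1 { bad += $1 } END { print bad + 0 }'
}
```

For example, "hash_loop testfile 100 | count_outliers" prints the
number of runs whose digest differed from the most common one, which
makes it easier to compare kernels or latency_timer settings.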

Cheers,
Peter

2022-03-04 15:37:02

by Saravana Kannan

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>
> On 2022-03-03 04:02, Saravana Kannan wrote:
> > On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
> >>
> >> Hi!
> >>
> >> I'm seeing a weird problem, and I'd like some help with further
> >> things to try in order to track down what's going on. I have
> >> bisected the issue to
> >>
> >> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> >
> > I skimmed through your email and I'll read it more closely tomorrow,
> > but it wasn't clear if you see this on Linus's tip of the tree too.
> > Asking because of:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Also, a couple of other data points that _might_ help. Try kernel
> > command line option fw_devlink=permissive vs fw_devlink=on (I forget
> > if this was the default by 5.10) vs fw_devlink=off.
> >
> > I'm expecting "off" to fix the issue for you. But if permissive vs on
> > shows a difference driver issues would start becoming a real
> > possibility.
> >
> > -Saravana
>
> Thanks for the quick reply! I don't think I tested the very tip of
> Linus tree before, only latest rc or something like that, but now I
> have. I.e.
>
> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>
> It would have been typical if an issue that existed for a couple of
> years had been fixed the last few weeks, but alas, no.
>
> On that kernel, and with whatever the default fw_devlink value is, the

It's fw_devlink=on by default from at least 5.12-rc4 or so.

> issue is there. It's a bit hard to tell if the incident probability
> is the same when trying fw_devlink arguments, but roughly so, and I
> do not have to wait for long to get a bad hash with the first
> reproducer
>
> while :; do cat testfile | sha256sum; done
>
> The output is typical:
> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>
> Setting fw_devlink=off makes no difference, AFAICT.

By this, I'm assuming you set fw_devlink=off in the kernel command
line and you still saw the corruption.

If that's the case, I can't see how this could possibly have anything
to do with:
f9aa460672c9 ("driver core: Refactor fw_devlink feature")

If you look at fw_devlink_link_device(), you'll see that the function
is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
there, the rest of the code in the series doesn't run because more
fields wouldn't get set, etc. That pretty much disables ALL the code
in the entire series. The only remaining diff would be header file
changes where I add/remove fields. But that's unlikely to cause any
issues here because I'm either deleting fields that aren't used or
adding fields that won't be used (with fw_devlink=off). I think the
patch was just causing enough timing changes that it's masking the
real issue.

IIRC (it's been more than a year), the series [1] that brings in this
patch has a few reverts. Those reverts undo subtle device probe
ordering changes brought in by a bunch of earlier patches. You could
go back to before those patches were added and see if you still see
this corruption and then start bisecting from there. Basically try
going to a point before:
42926ac3cd50 ("driver core: Move code to the right part of the file")

TL;DR: is that since you are reproducing this with fw_devlink=off, I'm
pretty sure the problem is not actually because of my changes or any
changes related to fw_devlink.

-Saravana
[1] - https://lore.kernel.org/all/[email protected]/

>
> So, just to double-check I went back to 5.11.22 with the two
> mentioned patches reverted [1], plus an added backport of
>
> c73960bb0a43 ("gpiolib: allow line names from device props to override driver names")
>
> in order to make userspace behave as similarly as possible.
> I left that running for an hour or so with 350-ish hashes
> calculated correctly. Which is no proof that there is no latent
> issue of course, but at the very least a great deal more stable
> than later kernels.
>
> Cheers,
> Peter
>
> [1]
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> 2d09e6eb4a6f ("driver core: Delete pointless parameter in fwnode_operations.add_links")
>

2022-03-04 16:49:07

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Hi!

On 2022-03-04 12:12, [email protected] wrote:
> Hi, Peter!
>
> On 3/4/22 12:57, Peter Rosin wrote:
>>
>> On 2022-03-04 07:57, Peter Rosin wrote:
>>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>>>>>
>>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>>> things to try in order to track down what's going on. I have
>>>>>>> bisected the issue to
>>>>>>>
>>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>
>>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>>> Asking because of:
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>>
>>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>>> shows a difference driver issues would start becoming a real
>>>>>> possibility.
>>>>>>
>>>>>> -Saravana
>>>>>
>>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>>> Linus tree before, only latest rc or something like that, but now I
>>>>> have. I.e.
>>>>>
>>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>>
>>>>> It would have been typical if an issue that existed for a couple of
>>>>> years had been fixed the last few weeks, but alas, no.
>>>>>
>>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>>
>>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>>
>>>>> issue is there. It's a bit hard to tell if the incident probability
>>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>>> do not have to wait for long to get a bad hash with the first
>>>>> reproducer
>>>>>
>>>>> while :; do cat testfile | sha256sum; done
>>>>>
>>>>> The output is typical:
>>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>
>>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>>
>>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>>> line and you still saw the corruption.
>>>
>>> Yes. On a bad kernel it's the same with all of the following kernel
>>> command lines.
>>>
>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>
>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>
>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>
>>>> If that's the case, I can't see how this could possibly have anything
>>>> to do with:
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>
>>>> If you look at fw_devlink_link_device(), you'll see that the function
>>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>>> there, the rest of the code in the series doesn't run because more
>>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>>> in the entire series. The only remaining diff would be header file
>>>> changes where I add/remove fields. But that's unlikely to cause any
>>>> issues here because I'm either deleting fields that aren't used or
>>>> adding fields that won't be used (with fw_devlink=off). I think the
>>>> patch was just causing enough timing changes that it's masking the
>>>> real issue.
>>>
>>> When I compare fw_devlink_link_device() from before and after
>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>> I notice that you also removed an unconditional call to
>>> device_link_add_missing_supplier_links() that was live before,
>>> regardless of any fw_devlink parameter.
>>>
>>> I don't know if that's relevant. Is it?
>>>
>>> Not knowing this code at all, and without any serious attempt
>>> at reading it, from here the comment of that removed function
>>> sure looks like it might cause a different ordering before and
>>> after the patch that is not restored with any fw_devlink
>>> argument.
>>
>> It appears that the device_link_add_missing_supplier_links() difference
>> is not relevant after all. What actually happened in the header file in
>> the "bad" commit was that two fields were removed (none added). Like so:
>>
>> struct dev_links_info {
>> struct list_head suppliers;
>> struct list_head consumers;
>> - struct list_head needs_suppliers;
>> struct list_head defer_sync;
>> - bool need_for_probe;
>> enum dl_dev_state status;
>> };
>>
>> If I restore those fields on a bad kernel, the issue is no longer
>> visible. That is true for the first bad kernel, i.e.
>>
>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>
>> and for tip of Linus as of recently, i.e.
>>
>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>
>> Which is of course insane and a whole different level of bad. WTF!?!
>>
>> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
>> there? I think that's next on the list...
>>
>
> I have a sama5d3_xplained that uses a SAMA5D36 and has a 256MBytes DDR2 and a
> 256MBytes NAND Flash. I tried a test with a 200MB file, rootfs on sdcard and
> I couldn't reproduce the bug. I'm using Linus's latest kernel:
> 38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link
>
> root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
> 200000+0 records in
> 200000+0 records out
> 204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
> root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> root@sama5d3-xplained-sd:~#
>
> I'll put the rootfs on NAND and try to retest. Maybe to do some other tests
> in parallel to have more interrupts on the system. Will let you know if I can
> reproduce the bug on sama5d3_xplained.

Thanks for testing!

Since you (probably) don't have the interrupt source from the USB
serial chip that I have, that is not completely unexpected.

$ lsusb
Bus 001 Device 002: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
$ cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
1
1
1
1
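
The interrupt load on a given board can be estimated by sampling
/proc/interrupts twice. A minimal sketch, assuming the usual layout
(a header line naming the CPUs, then one line per interrupt source
with the per-CPU counters first); irq_total and irq_rate are
hypothetical helper names, and the result covers all interrupt
sources, not only USB.

```shell
#!/bin/sh
# Sum the per-CPU counters of every line in a /proc/interrupts-style file.
irq_total() {
    awk 'NR == 1 { ncpu = NF; next }
         { for (i = 2; i <= ncpu + 1; i++) if ($i ~ /^[0-9]+$/) total += $i }
         END { print total + 0 }' "${1:-/proc/interrupts}"
}

# Print the average number of interrupts per second over $1 seconds.
irq_rate() {
    secs=${1:-5}
    before=$(irq_total)
    sleep "$secs"
    after=$(irq_total)
    echo $(( (after - before) / secs ))
}
```

Running "irq_rate 10" with latency_timer at 1 and again at 100 should
show whether a board gets anywhere near the interrupt rate needed to
trigger the problem.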

Also, your file is perhaps too small? You leave approx 50MB for the
system, so it might be the case that the page cache can hold the whole
file?

So, can you please try that again with a slightly bigger file, or
restrict how much RAM you allow the kernel to see?
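
One way to pick a file size that the page cache cannot hold is to
derive it from MemTotal. The helper below is a sketch only:
testfile_kib is a made-up name, and the 25% default margin is an
arbitrary assumption, not a tested threshold.

```shell
#!/bin/sh
# Print a test-file size in KiB that cannot fit in the page cache:
# MemTotal (read from a /proc/meminfo-style file) plus a percentage margin.
testfile_kib() {
    meminfo=${1:-/proc/meminfo}
    margin_pct=${2:-25}
    awk -v pct="$margin_pct" \
        '$1 == "MemTotal:" { printf "%d\n", $2 * (100 + pct) / 100 }' "$meminfo"
}
```

The result can be passed straight to the reproducer, e.g.
dd if=/dev/urandom of=testfile bs=1024 count="$(testfile_kib)".
The alternative is to shrink the RAM instead of growing the file, for
instance with the mem= kernel command line parameter.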

And if you don't have the FTDI usb-serial chip, you should probably go
with the other reproducer, namely to simply copy the random file to a
different host using scp.

Thanks again!

Cheers,
Peter

2022-03-04 19:24:31

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 3/4/22 14:38, Peter Rosin wrote:
>
> Hi!

Hi, Peter!

>
> On 2022-03-04 12:12, [email protected] wrote:
>> Hi, Peter!
>>
>> On 3/4/22 12:57, Peter Rosin wrote:
>>>
>>> On 2022-03-04 07:57, Peter Rosin wrote:
>>>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>>>>>>
>>>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>>>> things to try in order to track down what's going on. I have
>>>>>>>> bisected the issue to
>>>>>>>>
>>>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>>
>>>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>>>> Asking because of:
>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>
>>>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>>>
>>>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>>>> shows a difference driver issues would start becoming a real
>>>>>>> possibility.
>>>>>>>
>>>>>>> -Saravana
>>>>>>
>>>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>>>> Linus tree before, only latest rc or something like that, but now I
>>>>>> have. I.e.
>>>>>>
>>>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>>>
>>>>>> It would have been typical if an issue that existed for a couple of
>>>>>> years had been fixed the last few weeks, but alas, no.
>>>>>>
>>>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>>>
>>>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>>>
>>>>>> issue is there. It's a bit hard to tell if the incident probability
>>>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>>>> do not have to wait for long to get a bad hash with the first
>>>>>> reproducer
>>>>>>
>>>>>> while :; do cat testfile | sha256sum; done
>>>>>>
>>>>>> The output is typical:
>>>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>>>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>>>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>>>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>>>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>
>>>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>>>
>>>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>>>> line and you still saw the corruption.
>>>>
>>>> Yes. On a bad kernel it's the same with all of the following kernel
>>>> command lines.
>>>>
>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>
>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>
>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>
>>>>> If that's the case, I can't see how this could possibly have anything
>>>>> to do with:
>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>
>>>>> If you look at fw_devlink_link_device(), you'll see that the function
>>>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>>>> there, the rest of the code in the series doesn't run because more
>>>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>>>> in the entire series. The only remaining diff would be header file
>>>>> changes where I add/remove fields. But that's unlikely to cause any
>>>>> issues here because I'm either deleting fields that aren't used or
>>>>> adding fields that won't be used (with fw_devlink=off). I think the
>>>>> patch was just causing enough timing changes that it's masking the
>>>>> real issue.
>>>>
>>>> When I compare fw_devlink_link_device() from before and after
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>> I notice that you also removed an unconditional call to
>>>> device_link_add_missing_supplier_links() that was live before,
>>>> regardless of any fw_devlink parameter.
>>>>
>>>> I don't know if that's relevant. Is it?
>>>>
>>>> Not knowing this code at all, and without any serious attempt
>>>> at reading it, from here the comment of that removed function
>>>> sure looks like it might cause a different ordering before and
>>>> after the patch that is not restored with any fw_devlink
>>>> argument.
>>>
>>> It appears that the device_link_add_missing_supplier_links() difference
>>> is not relevant after all. What actually happened in the header file in
>>> the "bad" commit was that two fields were removed (none added). Like so:
>>>
>>> struct dev_links_info {
>>> struct list_head suppliers;
>>> struct list_head consumers;
>>> - struct list_head needs_suppliers;
>>> struct list_head defer_sync;
>>> - bool need_for_probe;
>>> enum dl_dev_state status;
>>> };
>>>
>>> If I restore those fields on a bad kernel, the issue is no longer
>>> visible. That is true for the first bad kernel, i.e.
>>>
>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>
>>> and for tip of Linus as of recently, i.e.
>>>
>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>
>>> Which is of course insane and a whole different level of bad. WTF!?!
>>>
>>> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
>>> there? I think that's next on the list...
>>>
>>
>> I have a sama5d3_xplained that uses a SAMA5D36 and has a 256MBytes DDR2 and a
>> 256MBytes NAND Flash. I tried a test with a 200MB file, rootfs on sdcard and
>> I couldn't reproduce the bug. I'm using Linus's latest kernel:
>> 38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link
>>
>> root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
>> 200000+0 records in
>> 200000+0 records out
>> 204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
>> root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>> root@sama5d3-xplained-sd:~#
>>
>> I'll put the rootfs on NAND and try to retest. Maybe to do some other tests
>> in parallel to have more interrupts on the system. Will let you know if I can
>> reproduce the bug on sama5d3_xplained.
>
> Thanks for testing!

You're welcome, no worries.
>
> Since you (probably) don't have the interrupt source from the USB
> serial chip that I have, that is not completely unexpected.
>
> $ lsusb
> Bus 001 Device 002: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
> $ cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
> 1
> 1
> 1
> 1
>
> Also, your file is perhaps too small? You leave approx 50MB for the
> system, so it might be the case that the page cache can hold the whole
> file?
>
> So, can you please try that again with a slightly bigger file or if you
> restrict how much RAM you allow the kernel to see?
>
> And if you don't have the FTDI usb-serial chip, you should probably go
> with the other reproducer, namely to simply copy the random file to a
> different host using scp.

I kept the rootfs on sdcard but this time I generated a 300MB random file.
I ran an mtd_stresstest on the NAND flash while doing the sha256sum and scp
tests. All went fine.

Here's the mtd_stresstest being successful https://pastebin.com/eWQNHAsE
While the stresstest was running I did the following sha256 and scp tests:
https://pastebin.com/wjutw63C

On my laptop the sha256sum is matching the one on the board:
$ sha256sum /tmp/testfile?
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile1
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile2
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile3
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile4
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile5
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile6
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile7
d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile8

Here's what "top" cmd was showing when doing the scp and the mtd_stresstest:
top - 14:40:13 up 39 min, 3 users, load average: 1.95, 1.88, 1.80
Tasks: 54 total, 3 running, 51 sleeping, 0 stopped, 0 zombie
%Cpu(s): 35.1 us, 48.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 16.9 si, 0.0 st
MiB Mem : 242.3 total, 2.5 free, 15.2 used, 224.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 220.1 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
464 root 20 0 4296 3292 2940 R 46.6 1.3 0:17.53 ssh
401 root 20 0 1668 760 676 R 45.0 0.3 17:57.11 modprobe
463 root 20 0 3456 2232 2000 S 5.2 0.9 0:02.04 scp

Here's what "top" cmd was showing when doing the sha256sum and the mtd_stresstest:
top - 14:12:47 up 12 min, 3 users, load average: 2.14, 1.92, 1.08
Tasks: 54 total, 3 running, 51 sleeping, 0 stopped, 0 zombie
%Cpu(s): 37.4 us, 58.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 4.2 si, 0.0 st
MiB Mem : 242.3 total, 3.0 free, 14.8 used, 224.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 220.6 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
420 root 20 0 1396 784 692 R 47.2 0.3 0:06.42 sha256sum
401 root 20 0 1668 1208 1124 R 43.0 0.5 4:50.34 modprobe
419 root 20 0 1520 868 680 S 6.5 0.3 0:00.92 cat

Peter, do you think it is worth doing some other tests on sama5d3_xplained?
I'll try to find a SAMA5D31 evaluation kit meanwhile.

Cheers,
ta

2022-03-04 21:02:56

by Saravana Kannan

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On Fri, Mar 4, 2022 at 2:57 AM Peter Rosin <[email protected]> wrote:
>
> On 2022-03-04 07:57, Peter Rosin wrote:
> > On 2022-03-04 04:55, Saravana Kannan wrote:
> >> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
> >>>
> >>> On 2022-03-03 04:02, Saravana Kannan wrote:
> >>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> I'm seeing a weird problem, and I'd like some help with further
> >>>>> things to try in order to track down what's going on. I have
> >>>>> bisected the issue to
> >>>>>
> >>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> >>>>
> >>>> I skimmed through your email and I'll read it more closely tomorrow,
> >>>> but it wasn't clear if you see this on Linus's tip of the tree too.
> >>>> Asking because of:
> >>>> https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>> Also, a couple of other data points that _might_ help. Try kernel
> >>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
> >>>> if this was the default by 5.10) vs fw_devlink=off.
> >>>>
> >>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
> >>>> shows a difference driver issues would start becoming a real
> >>>> possibility.
> >>>>
> >>>> -Saravana
> >>>
> >>> Thanks for the quick reply! I don't think I tested the very tip of
> >>> Linus tree before, only latest rc or something like that, but now I
> >>> have. I.e.
> >>>
> >>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
> >>>
> >>> It would have been typical if an issue that existed for a couple of
> >>> years had been fixed the last few weeks, but alas, no.
> >>>
> >>> On that kernel, and with whatever the default fw_devlink value is, the
> >>
> >> It's fw_devlink=on by default from at least 5.12-rc4 or so.
> >>
> >>> issue is there. It's a bit hard to tell if the incident probability
> >>> is the same when trying fw_devlink arguments, but roughly so, and I
> >>> do not have to wait for long to get a bad hash with the first
> >>> reproducer
> >>>
> >>> while :; do cat testfile | sha256sum; done
> >>>
> >>> The output is typical:
> >>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
> >>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
> >>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
> >>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
> >>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>>
> >>> Setting fw_devlink=off makes no difference, AFAICT.
> >>
> >> By this, I'm assuming you set fw_devlink=off in the kernel command
> >> line and you still saw the corruption.
> >
> > Yes. On a bad kernel it's the same with all of the following kernel
> > command lines.
> >
> > console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
> >
> > console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
> >
> > console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
> >
> >> If that's the case, I can't see how this could possibly have anything
> >> to do with:
> >> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> >>
> >> If you look at fw_devlink_link_device(), you'll see that the function
> >> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
> >> there, the rest of the code in the series doesn't run because more
> >> fields wouldn't get set, etc. That pretty much disables ALL the code
> >> in the entire series. The only remaining diff would be header file
> >> changes where I add/remove fields. But that's unlikely to cause any
> >> issues here because I'm either deleting fields that aren't used or
> >> adding fields that won't be used (with fw_devlink=off). I think the
> >> patch was just causing enough timing changes that it's masking the
> >> real issue.
> >
> > When I compare fw_devlink_link_device() from before and after
> > f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> > I notice that you also removed an unconditional call to
> > device_link_add_missing_supplier_links() that was live before,
> > regardless of any fw_devlink parameter.
> >
> > I don't know if that's relevant. Is it?
> >
> > Not knowing this code at all, and without any serious attempt
> > at reading it, from here the comment of that removed function
> > sure looks like it might cause a different ordering before and
> > after the patch that is not restored with any fw_devlink
> > argument.
>
> It appears that the device_link_add_missing_supplier_links() difference
> is not relevant after all. What actually happened in the header file in
> the "bad" commit was that two fields were removed (none added). Like so:
>
> struct dev_links_info {
> struct list_head suppliers;
> struct list_head consumers;
> - struct list_head needs_suppliers;
> struct list_head defer_sync;
> - bool need_for_probe;
> enum dl_dev_state status;
> };
>
> If I restore those fields on a bad kernel, the issue is no longer
> visible. That is true for the first bad kernel, i.e.

Ha... I thought this might be a possibility but I wasn't sure. Which
is why I kinda left it at:
"The only remaining diff would be header file
changes where I add/remove fields. But that's unlikely to cause any
issues here because I'm either deleting fields that aren't used or
adding fields that won't be used (with fw_devlink=off)."

Ok, at this point I'm going to ignore this thread. Call me out
explicitly if you want me to pay attention :)

-Saravana

2022-03-07 10:41:27

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 3/4/22 18:48, [email protected] wrote:
> On 3/4/22 14:38, Peter Rosin wrote:
>> Hi!
>
> Hi, Peter!
>
>>
>> On 2022-03-04 12:12, [email protected] wrote:
>>> Hi, Peter!
>>>
>>> On 3/4/22 12:57, Peter Rosin wrote:
>>>> On 2022-03-04 07:57, Peter Rosin wrote:
>>>>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>>>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>>>>> things to try in order to track down what's going on. I have
>>>>>>>>> bisected the issue to
>>>>>>>>>
>>>>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>>>
>>>>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>>>>> Asking because of:
>>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>>
>>>>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>>>>
>>>>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>>>>> shows a difference driver issues would start becoming a real
>>>>>>>> possibility.
>>>>>>>>
>>>>>>>> -Saravana
>>>>>>>
>>>>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>>>>> Linus tree before, only latest rc or something like that, but now I
>>>>>>> have. I.e.
>>>>>>>
>>>>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>>>>
>>>>>>> It would have been typical if an issue that existed for a couple of
>>>>>>> years had been fixed the last few weeks, but alas, no.
>>>>>>>
>>>>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>>>>
>>>>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>>>>
>>>>>>> issue is there. It's a bit hard to tell if the incident probability
>>>>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>>>>> do not have to wait for long to get a bad hash with the first
>>>>>>> reproducer
>>>>>>>
>>>>>>> while :; do cat testfile | sha256sum; done
>>>>>>>
>>>>>>> The output is typical:
>>>>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>>>>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>>>>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>>>>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>>>>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>>>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>>>
>>>>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>>>>
>>>>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>>>>> line and you still saw the corruption.
>>>>>
>>>>> Yes. On a bad kernel it's the same with all of the following kernel
>>>>> command lines.
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>>>
>>>>>> If that's the case, I can't see how this could possibly have anything
>>>>>> to do with:
>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>
>>>>>> If you look at fw_devlink_link_device(), you'll see that the function
>>>>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>>>>> there, the rest of the code in the series doesn't run because more
>>>>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>>>>> in the entire series. The only remaining diff would be header file
>>>>>> changes where I add/remove fields. But that's unlikely to cause any
>>>>>> issues here because I'm either deleting fields that aren't used or
>>>>>> adding fields that won't be used (with fw_devlink=off). I think the
>>>>>> patch was just causing enough timing changes that it's masking the
>>>>>> real issue.
>>>>>
>>>>> When I compare fw_devlink_link_device() from before and after
>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>> I notice that you also removed an unconditional call to
>>>>> device_link_add_missing_supplier_links() that was live before,
>>>>> regardless of any fw_devlink parameter.
>>>>>
>>>>> I don't know if that's relevant. Is it?
>>>>>
>>>>> Not knowing this code at all, and without any serious attempt
>>>>> at reading it, from here the comment of that removed function
>>>>> sure looks like it might cause a different ordering before and
>>>>> after the patch that is not restored with any fw_devlink
>>>>> argument.
>>>>
>>>> It appears that the device_link_add_missing_supplier_links() difference
>>>> is not relevant after all. What actually happened in the header file in
>>>> the "bad" commit was that two fields were removed (none added). Like so:
>>>>
>>>> struct dev_links_info {
>>>> struct list_head suppliers;
>>>> struct list_head consumers;
>>>> - struct list_head needs_suppliers;
>>>> struct list_head defer_sync;
>>>> - bool need_for_probe;
>>>> enum dl_dev_state status;
>>>> };
>>>>
>>>> If I restore those fields on a bad kernel, the issue is no longer
>>>> visible. That is true for the first bad kernel, i.e.
>>>>
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>
>>>> and for tip of Linus as of recently, i.e.
>>>>
>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>
>>>> Which is of course insane and a whole different level of bad. WTF!?!
>>>>
>>>> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
>>>> there? I think that's next on the list...
>>>>
>>>
>>> I have a sama5d3_xplained that uses a SAMA5D36 and has a 256MBytes DDR2 and a
>>> 256MBytes NAND Flash. I tried a test with a 200MB file, rootfs on sdcard and
>>> I couldn't reproduce the bug. I'm using Linus's latest kernel:
>>> 38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link
>>>
>>> root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
>>> 200000+0 records in
>>> 200000+0 records out
>>> 204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
>>> root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
>>> root@sama5d3-xplained-sd:~#
>>>
>>> I'll put the rootfs on NAND and try to retest. Maybe to do some other tests
>>> in parallel to have more interrupts on the system. Will let you know if I can
>>> reproduce the bug on sama5d3_xplained.
>>
>> Thanks for testing!
>
> you're welcome, no worries.
>>
>> Since you (probably) don't have the interrupt source from the USB
>> serial chip that I have, that is not completely unexpected.
>>
>> $ lsusb
>> Bus 001 Device 002: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
>> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>> Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
>> $ cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
>> 1
>> 1
>> 1
>> 1
>>
>> Also, your file is perhaps too small? You leave approx 50MB for the
>> system, so it might be the case that the page cache can hold the whole
>> file?
>>
>> So, can you please try that again with a slightly bigger file or if you
>> restrict how much RAM you allow the kernel to see?
>>
>> And if you don't have the FTDI usb-serial chip, you should probably go
>> with the other reproducer, namely to simply copy the random file to a
>> different host using scp.
>
> I kept the rootfs on sdcard but this time I generated a 300MB random file.
> I ran a mtd_stresstest on the NAND flash while doing the sha256sum or scp
> tests. All went fine.
>
> Here's the mtd_stresstest being successful https://pastebin.com/eWQNHAsE
> While the stresstest was running I did the following sha256 and scp tests:
> https://pastebin.com/wjutw63C
>
> On my laptop the sha256sum is matching the one on the board:
> $ sha256sum /tmp/testfile?
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile1
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile2
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile3
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile4
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile5
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile6
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile7
> d9232cee3ac29c3a9aaff8b23b4cb2914edd54e21550a555656988596fbd0b58 /tmp/testfile8
>
> Here's what "top" cmd was showing when doing the scp and the mtd_stresstest:
> top - 14:40:13 up 39 min, 3 users, load average: 1.95, 1.88, 1.80
> Tasks: 54 total, 3 running, 51 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 35.1 us, 48.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 16.9 si, 0.0 st
> MiB Mem : 242.3 total, 2.5 free, 15.2 used, 224.6 buff/cache
> MiB Swap: 0.0 total, 0.0 free, 0.0 used. 220.1 avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 464 root 20 0 4296 3292 2940 R 46.6 1.3 0:17.53 ssh
> 401 root 20 0 1668 760 676 R 45.0 0.3 17:57.11 modprobe
> 463 root 20 0 3456 2232 2000 S 5.2 0.9 0:02.04 scp
>
> Here's what "top" cmd was showing when doing the sha256sum and the mtd_stresstest:
> top - 14:12:47 up 12 min, 3 users, load average: 2.14, 1.92, 1.08
> Tasks: 54 total, 3 running, 51 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 37.4 us, 58.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 4.2 si, 0.0 st
> MiB Mem : 242.3 total, 3.0 free, 14.8 used, 224.5 buff/cache
> MiB Swap: 0.0 total, 0.0 free, 0.0 used. 220.6 avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 420 root 20 0 1396 784 692 R 47.2 0.3 0:06.42 sha256sum
> 401 root 20 0 1668 1208 1124 R 43.0 0.5 4:50.34 modprobe
> 419 root 20 0 1520 868 680 S 6.5 0.3 0:00.92 cat
>
> Peter, do you think it is worth to do some other tests on sama5d3_xplained?
> I'll try to find a SAMA5D31 evaluation kit meanwhile.
>

Peter, would it be worth doing a test on your board similar to what I did?
I'm wondering whether the source of interrupts matters or not. So can you
disable your USB and use an mtd NAND stress test as the source of interrupts,
i.e. mtd_stresstest together with scp or hexdump?

Cheers,
ta

2022-03-07 21:45:19

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-07 12:32, Peter Rosin wrote:
> On 2022-03-07 10:45, [email protected] wrote:
>> Peter, would it worth to do on your board a similar test to what I did?
>> I'm thinking whether the source of interrupts matters or not. So can you
>> disable your USB and use a mtd NAND stress test as a source of interrupts?
>> mtd_stresstest together with scp or hexdump.
>
> That's not a quick test for me, since I don't have modules enabled.
> I have located my SAMA5D31 evaluation kit, and I think I will try
> to get that running instead.

Hi again!

I got my SAMA5D31EK board running, using a freshly built at91bootstrap
and u-boot according to linux4sam.org and using the cross compiler I
have used from buildroot 2021.08, i.e. gcc 10.3.0, then using the
dtb for the ME20 from the original post and the same kernel and userspace
as I have used previously. Now, that dtb describes things that may not
actually be there etc etc, and I will try with a proper dtb as well
tomorrow, so this was just a quick-n-dirty test. I also added mem=64MB
to the kernel command line, to mimic our "Linea" CPU module and get a
bit quicker turnaround in the page cache.
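
Whether the test file can stay resident in the page cache seems to
matter for this reproducer, since a file that fits might never be read
back from storage. A rough way to check that up front is sketched
below. It assumes a kernel that reports MemAvailable in /proc/meminfo;
the function name fits_in_page_cache and its optional second argument
are made up for illustration, and MemAvailable is only an estimate.

```shell
#!/bin/sh
# fits_in_page_cache FILE [MEMINFO]
# Exit status 0 when FILE is no larger than MemAvailable, i.e. when the
# page cache could plausibly hold the whole file; non-zero otherwise
# (including when MemAvailable is missing from MEMINFO).
# MEMINFO defaults to /proc/meminfo; it is a parameter only so that the
# check can be tried against canned data (an assumption of this sketch).
fits_in_page_cache() {
    file=$1
    meminfo=${2:-/proc/meminfo}
    # File size rounded up to KiB, to match the kB unit used in meminfo.
    size_kb=$(( ($(wc -c < "$file") + 1023) / 1024 ))
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "$meminfo")
    [ "$size_kb" -le "$avail_kb" ]
}

# Example use:
#   fits_in_page_cache testfile && echo "testfile may be served from cache"
```

If the check succeeds, a bigger file or a smaller mem= value is
probably needed before the loop says anything about reads from flash.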

Anyway, with that setup I can reproduce the problem on the EK board.

$ while :; do cat testfile | sha256sum; done
5a939c69dd60a1f991e43d278d2e824a0e7376600a6b20c8e8b347871c546f9b -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
250556db0a6ac3c3e101ae2845da48ebb39a0c12d4c9b9eec5ea229c426bcce9 -
874c694ed002b04b67bf354a95ee521effd07e198f363e02cd63069a94fd4df8 -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
c3a918a923ff2d504a45ffa51289e69fb6d8aa1140cca3fd9f30703b18d9e509 -
1577ed72d2f296f9adc50707e0e56547ecb311fa21ad875a3d55ca908c440307 -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -


But apparently only if I have an FTDI usb-serial adapter attached
while I boot. I also start to get good hashes if I remove and
reinsert the FTDI adapter, which is interesting.

$ while :; do cat testfile | sha256sum; done
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
...
*snip many dozens of lines*
...
7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -

It's of course hard to prove the absence of trouble, but it feels
like it is working in both of those latter cases (booting without the
adapter attached, and removing and reinserting it)...

(for the "real" case the FTDI usb-serial adapter is soldered in,
with no easy way to make it go away, so it is not as easy to do the
same test there.)
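
To put a number on "feels like it is working", the loop can be bounded
and its output tallied instead of eyeballed. A minimal sketch, assuming
POSIX sh and the coreutils sha256sum; hash_histogram is a made-up name
and not something used elsewhere in this thread.

```shell
#!/bin/sh
# hash_histogram FILE ROUNDS
# Hash FILE through a pipe ROUNDS times and print each distinct hash
# with the number of times it was seen, most frequent first.
hash_histogram() {
    file=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Keep the pipe through cat on purpose: the original report says
        # the problem is harder to trigger when sha256sum reads the file
        # directly.
        cat "$file" | sha256sum
        i=$((i + 1))
    done | sort | uniq -c | sort -rn
}

# Example use:
#   hash_histogram testfile 200
```

A single output line means every round produced the same hash. Several
lines mean the hashes disagree, and the most frequent one is presumably
the good one.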

I'll try to reduce the number of local parts of the setup further
tomorrow, such as the dtb mentioned above and the rootfs. I was
hoping for a binary download of prebuilt parts, but some links on

https://www.linux4sam.org/bin/view/Linux4SAM/Sama5d3xekMainPage

are bogus. E.g.

ftp://twiki.lnx4mchp_backend/pub/demo/linux4sam_4.7/linux4sam-poky-sama5d3xek-4.7.zip

What's up with that twiki.lnx4mchp_backend "host"?

Cheers,
Peter

2022-03-08 08:15:41

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-07 10:45, [email protected] wrote:
> Peter, would it worth to do on your board a similar test to what I did?
> I'm thinking whether the source of interrupts matters or not. So can you
> disable your USB and use a mtd NAND stress test as a source of interrupts?
> mtd_stresstest together with scp or hexdump.

That's not a quick test for me, since I don't have modules enabled.
I have located my SAMA5D31 evaluation kit, and I think I will try
to get that running instead.


Meanwhile, during the weekend I ran tests with a slightly permuted
"old style" struct dev_links_info, i.e. swapping the places of the
defer_sync and needs_suppliers list heads, giving this layout:

struct dev_links_info {
	struct list_head suppliers;
	struct list_head consumers;
	struct list_head defer_sync;
	struct list_head needs_suppliers;
	bool need_for_probe;
	enum dl_dev_state status;
};

This produces a new failure mode and hits a BUG. Maybe that's a hint
for someone? I have several more of these reports if someone is
interested, but they all look very similar to me.

$ while :; do cat testfile | sha256sum; done
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
[ 690.196564] ------------[ cut here ]------------
[ 690.201193] kernel BUG at drivers/dma/dmaengine.h:54!
[ 690.206249] Internal error: Oops - BUG: 0 [#1] ARM
[ 690.211057] CPU: 0 PID: 1753 Comm: cat Not tainted 5.17.0-rc6+ #72
[ 690.217245] Hardware name: Atmel SAMA5
[ 690.220998] PC is at atc_chain_complete+0x114/0x174
[ 690.225885] LR is at atc_advance_work+0x7c/0x190
[ 690.230510] pc : [<c03de48c>] lr : [<c03de6d4>] psr: 600f0193
[ 690.236793] sp : c0a718e8 ip : 00000000 fp : c0a71a74
[ 690.242030] r10: c0f28000 r9 : c03dd624 r8 : 00000002
[ 690.247267] r7 : 600f0113 r6 : c0d5bae8 r5 : c0d5ba78 r4 : c0d5bacc
[ 690.253811] r3 : 00000000 r2 : c0d5b800 r1 : 00000000 r0 : c0d5ba78
[ 690.260358] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
[ 690.267605] Control: 10c53c7d Table: 20b6c059 DAC: 00000051
[ 690.273361] Register r0 information: slab kmalloc-2k start c0d5b800 pointer offset 632 size 2048
[ 690.282193] Register r1 information: NULL pointer
[ 690.286906] Register r2 information: slab kmalloc-2k start c0d5b800 pointer offset 0 size 2048
[ 690.295545] Register r3 information: NULL pointer
[ 690.300258] Register r4 information: slab kmalloc-2k start c0d5b800 pointer offset 716 size 2048
[ 690.309073] Register r5 information: slab kmalloc-2k start c0d5b800 pointer offset 632 size 2048
[ 690.317887] Register r6 information: slab kmalloc-2k start c0d5b800 pointer offset 744 size 2048
[ 690.326702] Register r7 information: non-paged memory
[ 690.331764] Register r8 information: non-paged memory
[ 690.336825] Register r9 information: non-slab/vmalloc memory
[ 690.342498] Register r10 information: slab kmalloc-4k start c0f28000 pointer offset 0 size 4096
[ 690.351225] Register r11 information: non-slab/vmalloc memory
[ 690.356985] Register r12 information: NULL pointer
[ 690.361786] Process cat (pid: 1753, stack limit = 0x2b6a6c18)
[ 690.367547] Stack: (0xc0a718e8 to 0xc0a72000)
[ 690.371921] 18e0: c0f28000 c0a71a74 00000000 47a25045 c0478b9c c0d5ba78
[ 690.380128] 1900: c0d5bb18 20f28000 00000800 c03de6d4 c0f28000 c0a71a74 00000000 47a25045
[ 690.388330] 1920: c0f15a00 c0a71940 20f28000 00000800 00000002 c0478b9c 00000003 00000000
[ 690.396534] 1940: 00000001 c0a71944 c0a71944 47a25045 0002eb42 c0f1d050 c0f15a00 00000000
[ 690.404737] 1960: 00000000 c0f28000 c0f15a00 c047a23c 00000002 00000000 00000000 c0f1d050
[ 690.412942] 1980: 000005f0 00000000 00000000 00000000 c047a350 c047a36c 00000000 00000000
[ 690.421145] 19a0: 00000000 c0468570 00001030 00000000 00000004 c0f94000 00000000 00000800
[ 690.429348] 19c0: 0002dd98 00000001 c0f1d12c 00000000 c1dec000 c0f28210 00000210 00001030
[ 690.437552] 19e0: 0002dd98 00000000 00000000 c0f1d50c 00000000 00000040 00000000 c0b438e0
[ 690.445757] 1a00: 00000000 00000000 00000000 00000000 00001540 c0f1d050 c0f94000 00000000
[ 690.453959] 1a20: 00000000 00000000 c0a71a74 00000000 06ecc210 c045b078 c0a71a74 c01bc8b8
[ 690.462164] 1a40: 000000bc c0f94000 00000000 06ecc210 c0a71ad0 c1dec000 00000000 c091b928
[ 690.470367] 1a60: 06ecc210 c045b19c c0a71a74 c0484388 000000bc 00000000 00001030 00000000
[ 690.478570] 1a80: 00000000 00000000 00000000 c1dec000 00000000 47a25045 00000004 c16a6000
[ 690.486773] 1aa0: 0000c210 00000376 00001030 c04865e4 00001030 c0a71ad0 c1dec000 0000000a
[ 690.494977] 1ac0: c1dec000 c08e52c4 c091b8c0 c0b44228 00000000 47a25045 c16a6000 00000000
[ 690.503182] 1ae0: c16a6000 c16b9000 00000018 c1dec000 c16a6000 00000000 00000376 c0484250
[ 690.511385] 1b00: 00001030 47a25045 00000540 c1d33480 a0000013 c02f1604 c0c00100 00000000
[ 690.519589] 1b20: 00000018 0000b210 c16b9000 c1dec000 c16a6000 00000000 00001030 c0483144
[ 690.527792] 1b40: 0000b210 00001030 00000000 00000540 c0b62734 c16b7000 c16b7000 0000b210
[ 690.535995] 1b60: 00000018 00001030 c0a71c98 00000018 0000b210 c02cfd38 00001030 00000000
[ 690.544199] 1b80: c16b7000 c0a71c28 c1dec000 c0b40028 c16b7000 c02d2e6c 00001030 00000001
[ 690.552403] 1ba0: 00000000 c02d53a4 60000013 00001030 00000001 c0a71c24 c0a17d00 00000018
[ 690.560607] 1bc0: c16b7000 c0100b14 c16b70e4 c0a71c98 c1dec000 c0a70000 c16b7000 c0a71c98
[ 690.568810] 1be0: 00000000 00000000 c1dec000 47a25045 00000018 c16b7000 c0a71c98 00000000
[ 690.577014] 1c00: 00000000 c1dec000 c16b70e4 00000018 00000000 c02d57ec c0b52718 00000000
[ 690.585218] 1c20: 00000000 c0a17d00 0000007d 20001082 00000000 00000018 0000b210 00001030
[ 690.593423] 1c40: 01140cca 47a25045 70586723 c3fd4ee0 c2cf7000 c12b4b30 c1dec000 0000007d
[ 690.601625] 1c60: 00001082 c0b3f800 c16b7000 c02c5870 00000000 c0182df8 c12b4c20 c0803744
[ 690.609830] 1c80: c0a71cd4 c12b4c18 02710000 00002710 c16c6180 00000000 0000007d 20001082
[ 690.618033] 1ca0: c0a71c9c 47a25045 c3fd4ee0 c3fd4ee0 c16b7000 00001082 00001082 c12b4b30
[ 690.626236] 1cc0: 00001081 00000000 c12b4c24 c02c5f80 00000000 c0b3dd0c 60000013 c0b5b900
[ 690.634440] 1ce0: c0b5b8d8 c0b1d6f4 c01995ac c3fd4ee0 00000000 00000cc0 00001082 47a25045
[ 690.642644] 1d00: c3fd4ee0 c3fd4ee0 c16c6180 c0a71dc4 00001082 c12b4c18 c3fd4ee0 00000000
[ 690.650847] 1d20: c12b4c24 c0176090 000010a0 c0a71e30 c0a71dc4 c0177d40 00000002 c0a70000
[ 690.659050] 1d40: c16c6180 c16c61d8 c0a71f18 61c88647 c0a71d84 c16c6180 c12b4c18 c16c61d8
[ 690.667254] 1d60: 00001082 00000000 00000000 47a25045 00001000 c12b4b30 c0a71dc8 00000000
[ 690.675457] 1d80: 00001000 c0a71f18 c0a71e30 00002000 c16c6180 c017a064 c0b52718 c0a71dc8
[ 690.683660] 1da0: 00000000 c0d8b268 c0b0eb40 00000000 02710000 00000000 c12b4b30 c12b4c18
[ 690.691864] 1dc0: 200f0193 00000000 c3fd4ec0 00000006 c0a71de4 c01381cc 00000001 c0c24040
[ 690.700069] 1de0: c0a71e04 c01382ac 00000040 c0b52730 40000006 c0a70000 00000100 c0b52718
[ 690.708272] 1e00: c0b52734 47a25045 00000000 c16c6180 00000000 c0a71f38 00000000 c0a71f18
[ 690.716475] 1e20: 00000000 00000000 00004004 c01c1f3c c16c6180 00000000 01082000 00000000
[ 690.724680] 1e40: 00000000 00000000 00000000 40040000 00000000 00000000 c12b4b30 47a25045
[ 690.732883] 1e60: 00008000 c0a71f18 c0a71f18 c1dd8600 c16c6180 c0a71f38 00000000 00000001
[ 690.741089] 1e80: c0a71f30 c01c205c 00000000 c0a71ea8 c02fee04 c02fbe10 c0a71f18 c0a71f80
[ 690.749290] 1ea0: c1dd8600 c1c54e80 00000000 00000000 c0a71ecc c02ff91c c07aa990 c0136e98
[ 690.757494] 1ec0: 60000013 ffffffff 00000051 c16c6180 00000000 47a25045 00000002 c1dd8600
[ 690.765697] 1ee0: c0a71f80 00000001 00020000 00000000 00000000 00004004 00020000 c01c37f0
[ 690.773903] 1f00: 00020000 c0d8b240 c0a70000 c0100264 b6c7c000 00020000 00000000 00002000
[ 690.782105] 1f20: 0001e000 c0a71f10 00000001 00000000 c1dd8600 00000000 01080000 00000000
[ 690.790308] 1f40: 00000000 00000000 00000000 40040000 00000000 00000000 b6c7c000 47a25045
[ 690.798512] 1f60: c1dd8600 c1dd8600 01080000 00000000 c0100264 c0a70000 00000003 c01c419c
[ 690.806716] 1f80: 01080000 00000000 00000000 47a25045 00020000 b6c7c000 00020000 b6fdc550
[ 690.814920] 1fa0: 00000003 c0100060 b6c7c000 00020000 00000003 b6c7c000 00020000 00000000
[ 690.823122] 1fc0: b6c7c000 00020000 b6fdc550 00000003 00000003 00000000 0000005e 00020000
[ 690.831327] 1fe0: 00000003 bed35b58 b6dce1db b6dcffc6 600f0030 00000003 00000000 00000000
[ 690.839537] atc_chain_complete from atc_advance_work+0x7c/0x190
[ 690.845562] atc_advance_work from atmel_nand_dma_transfer+0x118/0x234
[ 690.852109] atmel_nand_dma_transfer from atmel_hsmc_nand_pmecc_read_pg+0xd8/0x1c8
[ 690.859698] atmel_hsmc_nand_pmecc_read_pg from atmel_hsmc_nand_pmecc_read_page+0x1c/0x24
[ 690.867901] atmel_hsmc_nand_pmecc_read_page from nand_read_oob+0x268/0x7f8
[ 690.874883] nand_read_oob from mtd_read_oob+0x84/0x148
[ 690.880121] mtd_read_oob from mtd_read+0x60/0x90
[ 690.884832] mtd_read from ubi_io_read+0xf0/0x3fc
[ 690.889545] ubi_io_read from ubi_eba_read_leb+0xb0/0x468
[ 690.894956] ubi_eba_read_leb from ubi_leb_read+0x90/0x104
[ 690.900454] ubi_leb_read from ubifs_leb_read+0x2c/0x78
[ 690.905693] ubifs_leb_read from fallible_read_node+0x84/0x2b0
[ 690.911537] fallible_read_node from ubifs_tnc_locate+0x140/0x1dc
[ 690.917647] ubifs_tnc_locate from do_readpage+0x10c/0x4c4
[ 690.923146] do_readpage from ubifs_readpage+0x4c/0x4e0
[ 690.928381] ubifs_readpage from filemap_read_folio+0x34/0xac
[ 690.934144] filemap_read_folio from filemap_get_pages+0x4c0/0x670
[ 690.940337] filemap_get_pages from filemap_read+0xc4/0x390
[ 690.945922] filemap_read from do_iter_readv_writev+0x128/0x1c0
[ 690.951859] do_iter_readv_writev from do_iter_read+0x88/0x1f0
[ 690.957704] do_iter_read from ovl_read_iter+0x1f4/0x248
[ 690.963030] ovl_read_iter from vfs_read+0x204/0x314
[ 690.968003] vfs_read from ksys_read+0x60/0xe4
[ 690.972454] ksys_read from ret_fast_syscall+0x0/0x58
[ 690.977513] Exception stack(0xc0a71fa8 to 0xc0a71ff0)
[ 690.982586] 1fa0: b6c7c000 00020000 00000003 b6c7c000 00020000 00000000
[ 690.990791] 1fc0: b6c7c000 00020000 b6fdc550 00000003 00000003 00000000 0000005e 00020000
[ 690.998989] 1fe0: 00000003 bed35b58 b6dce1db b6dcffc6
[ 691.004061] Code: c5940028 c580100c c584301c caffffca (e7f001f2)
[ 691.010166] ---[ end trace 0000000000000000 ]---






$ while :; do cat testfile | sha256sum; done
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
[ 1928.214666] ------------[ cut here ]------------
[ 1928.219293] kernel BUG at drivers/dma/dmaengine.h:54!
[ 1928.224350] Internal error: Oops - BUG: 0 [#1] ARM
[ 1928.229157] CPU: 0 PID: 4427 Comm: cat Not tainted 5.17.0-rc6+ #72
[ 1928.235346] Hardware name: Atmel SAMA5
[ 1928.239100] PC is at atc_chain_complete+0x114/0x174
[ 1928.243988] LR is at atc_advance_work+0x7c/0x190
[ 1928.248612] pc : [<c03de48c>] lr : [<c03de6d4>] psr: 600f0193
[ 1928.254895] sp : c17358e8 ip : 00000000 fp : c1735a74
[ 1928.260131] r10: c0f28000 r9 : c03dd624 r8 : 00000002
[ 1928.265367] r7 : 600f0113 r6 : c0d5bae8 r5 : c0d5ba78 r4 : c0d5bacc
[ 1928.271913] r3 : 00000000 r2 : c0d5b800 r1 : 00000000 r0 : c0d5ba78
[ 1928.278460] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
[ 1928.285707] Control: 10c53c7d Table: 20044059 DAC: 00000051
[ 1928.291463] Register r0 information: slab kmalloc-2k start c0d5b800 pointer offset 632 size 2048
[ 1928.300295] Register r1 information: NULL pointer
[ 1928.305007] Register r2 information: slab kmalloc-2k start c0d5b800 pointer offset 0 size 2048
[ 1928.313647] Register r3 information: NULL pointer
[ 1928.318360] Register r4 information: slab kmalloc-2k start c0d5b800 pointer offset 716 size 2048
[ 1928.327174] Register r5 information: slab kmalloc-2k start c0d5b800 pointer offset 632 size 2048
[ 1928.335989] Register r6 information: slab kmalloc-2k start c0d5b800 pointer offset 744 size 2048
[ 1928.344803] Register r7 information: non-paged memory
[ 1928.349865] Register r8 information: non-paged memory
[ 1928.354927] Register r9 information: non-slab/vmalloc memory
[ 1928.360600] Register r10 information: slab kmalloc-4k start c0f28000 pointer offset 0 size 4096
[ 1928.369327] Register r11 information: non-slab/vmalloc memory
[ 1928.375087] Register r12 information: NULL pointer
[ 1928.379887] Process cat (pid: 4427, stack limit = 0x41e59390)
[ 1928.385648] Stack: (0xc17358e8 to 0xc1736000)
[ 1928.390027] 58e0: c0f28000 c1735a74 00000000 f09b5186 c0478b9c c0d5ba78
[ 1928.398228] 5900: c0d5bb18 20f28000 00000800 c03de6d4 00000051 c03dcdc8 c0f15a00 f09b5186
[ 1928.406433] 5920: c0f15a00 c1735940 20f28000 00000800 00000002 c0478b9c 00000003 00000000
[ 1928.414636] 5940: 00000001 c1735944 c1735944 f09b5186 00028218 c0f1d050 c0f15a00 00000000
[ 1928.422839] 5960: 00000000 c0f28000 c0f15a00 c047a23c 00000002 00000000 00000000 c0f1d050
[ 1928.431042] 5980: 000007d0 00000000 00000000 00000000 c047a350 c047a36c 00000000 00000000
[ 1928.439247] 59a0: 00000000 c0468570 00001030 00000000 c159d300 c0f94000 00000000 00000800
[ 1928.447449] 59c0: 0002c404 00000001 c0f1d12c 00000000 c1c78000 c0f28030 00000030 00001030
[ 1928.455654] 59e0: 0002c404 00000000 00000000 c0f1d50c 00000000 00000040 00000000 c0b438e0
[ 1928.463858] 5a00: 00000000 00000000 00000000 00000000 0000c1c0 c0f1d050 c0f94000 00000000
[ 1928.472061] 5a20: 00000000 00000000 c1735a74 00000000 06202030 c045b078 c1735a74 c01bc8b8
[ 1928.480264] 5a40: 000000bc c0f94000 00000000 06202030 c1735ad0 c1c78000 00000000 c091b928
[ 1928.488468] 5a60: 06202030 c045b19c c1735a74 c0484388 000000bc 00000000 00001030 00000000
[ 1928.496671] 5a80: 00000000 00000000 00000000 c1c78000 00000000 f09b5186 00000004 c16b7000
[ 1928.504876] 5aa0: 00002030 00000310 00001030 c04865e4 00001030 c1735ad0 c1c78000 a0000113
[ 1928.513080] 5ac0: c1c78000 c08e52c4 c091b8c0 c0b44228 00000000 f09b5186 c16b7000 00000000
[ 1928.521283] 5ae0: c16b7000 c16be000 0000005a c1c78000 c16b7000 00000000 00000310 c0484250
[ 1928.529487] 5b00: 00001030 f09b5186 0000b1c0 c1d37300 a0000013 c02f1604 c0c00100 00000000
[ 1928.537690] 5b20: 0000005a 00001030 c16be000 c1c78000 c16b7000 00000000 00001030 c0483144
[ 1928.545894] 5b40: 00001030 00001030 00000000 0000b1c0 c0c06018 c16b8000 c16b8000 00001030
[ 1928.554098] 5b60: 0000005a 00001030 c1735c98 0000005a 00001030 c02cfd38 00001030 00000000
[ 1928.562302] 5b80: c16b8000 c1735c28 c1c78000 c0b40028 c16b8000 c02d2e6c 00001030 00000001
[ 1928.570505] 5ba0: 00000000 c02d53a4 00000041 00001030 00000001 c1735c24 c17b5400 f09b5186
[ 1928.578709] 5bc0: c0b52718 00000a20 00000000 c159f400 c159f4f8 00000040 00000000 00000006
[ 1928.586912] 5be0: c0b52718 c04eb804 c159d300 f09b5186 00000000 c16b8000 c1735c98 00000000
[ 1928.595116] 5c00: 00000000 c1c78000 c16b80e4 00000018 00000000 c02d57ec 20000193 00000000
[ 1928.603319] 5c20: 00000002 c17b5400 0000007d 200004fc 00000000 0000005a 00001030 00001030
[ 1928.611523] 5c40: 01140cca f09b5186 70586723 c3fee7c0 c39be000 c12b4b30 c1c78000 0000007d
[ 1928.619727] 5c60: 000004fc c0b3f800 c16b8000 c02c5870 00000000 c0b5b94c c12b4c20 c0b3dd0c
[ 1928.627931] 5c80: c1735cd4 c12b4c18 02710000 00002710 c16e3000 00000000 0000007d 200004fc
[ 1928.636134] 5ca0: c1734000 f09b5186 c3fee7c0 c3fee7c0 c16b8000 000004fc 000004fc c12b4b30
[ 1928.644338] 5cc0: 000004f1 00000000 c12b4c24 c02c5f80 c0b3e164 c12b4c1c 000004fc 003c0000
[ 1928.652542] 5ce0: c12b6e40 00000000 c01995ac f09b5186 00000013 c3fee7c0 c1735e30 f09b5186
[ 1928.660746] 5d00: c3fee7c0 c3fee7c0 c16e3000 c1735dc4 000004fc c12b4c18 c3fee7c0 00000000
[ 1928.668948] 5d20: c12b4c24 c0176090 00000500 c1735e30 c1735dc4 c0177d40 00000002 c1734000
[ 1928.677153] 5d40: c16e3000 c16e3058 c1735f18 61c88647 c1735d84 c16e3000 c12b4c18 c16e3058
[ 1928.685356] 5d60: 000004fc 00000000 00000000 f09b5186 00001000 c12b4b30 c1735dec 00000000
[ 1928.693560] 5d80: 00001000 c1735f18 c1735e30 0001c000 c16e3000 c017a064 c014965c c1735dec
[ 1928.701762] 5da0: 00000000 c3fb0120 00000000 00000000 02710000 00000000 c12b4b30 c12b4c18
[ 1928.709968] 5dc0: 70586723 c1730000 c3fc38e0 c3fae3c0 c3fae140 c3fae2a0 c3fae300 c3fae360
[ 1928.718170] 5de0: c3fadfa0 c3faee60 c3faf480 c3fb0120 00000010 00000000 00000000 c0b032f4
[ 1928.726375] 5e00: c1c00000 f09b5186 c1735e20 c16e3000 00000000 c1735f38 00000000 c1735f18
[ 1928.734578] 5e20: 00000000 00000000 00004004 c01c1f3c c16e3000 00000000 004fc000 00000000
[ 1928.742779] 5e40: 00000000 00000000 00000000 40040000 00000000 00000000 00000006 f09b5186
[ 1928.750985] 5e60: c1735f18 c1735f18 c1735f18 c16e3840 c16e3000 c1735f38 00000000 00000001
[ 1928.759191] 5e80: c1735f30 c01c205c 00000000 70729076 c159d000 f09b5186 c1735f18 c1735f80
[ 1928.767391] 5ea0: c16e3840 c0c22f80 00000000 00000000 c1735ecc c02ff91c c0803c00 00500cc2
[ 1928.775595] 5ec0: 00000001 c0c25240 c013f468 c16e3000 00000000 f09b5186 00000002 c16e3840
[ 1928.783799] 5ee0: c1735f80 00000001 00020000 00000000 00000000 00004004 00020000 c01c37f0
[ 1928.792003] 5f00: 00020000 c0c25240 c1734000 c0100264 b6c2a000 00020000 00000000 0001c000
[ 1928.800206] 5f20: 00004000 c1735f10 00000001 00000000 c16e3840 00000000 004e0000 00000000
[ 1928.808409] 5f40: 00000000 00000000 00000000 40040000 00000000 00000000 b6c2a000 f09b5186
[ 1928.816614] 5f60: c16e3840 c16e3840 004e0000 00000000 c0100264 c1734000 00000003 c01c419c
[ 1928.824817] 5f80: 004e0000 00000000 10c53c7d f09b5186 00020000 b6c2a000 00020000 b6f8a550
[ 1928.833022] 5fa0: 00000003 c0100060 b6c2a000 00020000 00000003 b6c2a000 00020000 00000000
[ 1928.841224] 5fc0: b6c2a000 00020000 b6f8a550 00000003 00000003 00000000 0000005e 00020000
[ 1928.849428] 5fe0: 00000003 bebf4b58 b6d7c1db b6d7dfc6 600f0030 00000003 00000000 00000000
[ 1928.857638] atc_chain_complete from atc_advance_work+0x7c/0x190
[ 1928.863664] atc_advance_work from atmel_nand_dma_transfer+0x118/0x234
[ 1928.870209] atmel_nand_dma_transfer from atmel_hsmc_nand_pmecc_read_pg+0xd8/0x1c8
[ 1928.877799] atmel_hsmc_nand_pmecc_read_pg from atmel_hsmc_nand_pmecc_read_page+0x1c/0x24
[ 1928.886003] atmel_hsmc_nand_pmecc_read_page from nand_read_oob+0x268/0x7f8
[ 1928.892985] nand_read_oob from mtd_read_oob+0x84/0x148
[ 1928.898222] mtd_read_oob from mtd_read+0x60/0x90
[ 1928.902933] mtd_read from ubi_io_read+0xf0/0x3fc
[ 1928.907647] ubi_io_read from ubi_eba_read_leb+0xb0/0x468
[ 1928.913057] ubi_eba_read_leb from ubi_leb_read+0x90/0x104
[ 1928.918555] ubi_leb_read from ubifs_leb_read+0x2c/0x78
[ 1928.923792] ubifs_leb_read from fallible_read_node+0x84/0x2b0
[ 1928.929639] fallible_read_node from ubifs_tnc_locate+0x140/0x1dc
[ 1928.935748] ubifs_tnc_locate from do_readpage+0x10c/0x4c4
[ 1928.941246] do_readpage from ubifs_readpage+0x4c/0x4e0
[ 1928.946482] ubifs_readpage from filemap_read_folio+0x34/0xac
[ 1928.952244] filemap_read_folio from filemap_get_pages+0x4c0/0x670
[ 1928.958439] filemap_get_pages from filemap_read+0xc4/0x390
[ 1928.964023] filemap_read from do_iter_readv_writev+0x128/0x1c0
[ 1928.969961] do_iter_readv_writev from do_iter_read+0x88/0x1f0
[ 1928.975805] do_iter_read from ovl_read_iter+0x1f4/0x248
[ 1928.981131] ovl_read_iter from vfs_read+0x204/0x314
[ 1928.986104] vfs_read from ksys_read+0x60/0xe4
[ 1928.990555] ksys_read from ret_fast_syscall+0x0/0x58
[ 1928.995614] Exception stack(0xc1735fa8 to 0xc1735ff0)
[ 1929.000687] 5fa0: b6c2a000 00020000 00000003 b6c2a000 00020000 00000000
[ 1929.008892] 5fc0: b6c2a000 00020000 b6f8a550 00000003 00000003 00000000 0000005e 00020000
[ 1929.017091] 5fe0: 00000003 bebf4b58 b6d7c1db b6d7dfc6
[ 1929.022161] Code: c5940028 c580100c c584301c caffffca (e7f001f2)
[ 1929.028267] ---[ end trace 0000000000000000 ]---

2022-03-08 10:29:29

by Nicolas Ferre

[permalink] [raw]
Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 07/03/2022 at 21:32, Peter Rosin wrote:
> On 2022-03-07 12:32, Peter Rosin wrote:
>> On 2022-03-07 10:45, [email protected] wrote:
>>> Peter, would it be worth doing a similar test on your board to what I did?
>>> I'm thinking whether the source of interrupts matters or not. So can you
>>> disable your USB and use a mtd NAND stress test as a source of interrupts?
>>> mtd_stresstest together with scp or hexdump.
>>
>> That's not a quick test for me, since I don't have modules enabled.
>> I have located my SAMA5D31 evaluation kit, and I think I will try
>> to get that running instead.
>
> Hi again!
>
> I got my SAMA5D31EK board running, using a freshly built at91bootstrap
> and u-boot according to linux4sam.org and using the cross compiler I
> have used from buildroot 2021.08, i.e. gcc 10.3.0, then using the
> dtb for the ME20 from the original post and the same kernel and userspace
> as I have used previously. Now, that dtb describes things that may not
> actually be there etc etc, and I will try with a proper dtb as well
> tomorrow, so this was just a quick-n-dirty test. I also added mem=64MB
> to the kernel command line, to mimic our "Linea" CPU module and get a
> bit quicker turnaround in the page cache.
>
> Anyway, with that setup I can reproduce the problem on the EK board.
>
> $ while :; do cat testfile | sha256sum; done
> 5a939c69dd60a1f991e43d278d2e824a0e7376600a6b20c8e8b347871c546f9b -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> 250556db0a6ac3c3e101ae2845da48ebb39a0c12d4c9b9eec5ea229c426bcce9 -
> 874c694ed002b04b67bf354a95ee521effd07e198f363e02cd63069a94fd4df8 -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> c3a918a923ff2d504a45ffa51289e69fb6d8aa1140cca3fd9f30703b18d9e509 -
> 1577ed72d2f296f9adc50707e0e56547ecb311fa21ad875a3d55ca908c440307 -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>
>
> But apparently only if I have an FTDI usb-serial adapter attached
> while I boot. I also start to get good hashes if I remove and
> reinsert the FTDI adapter, which is interesting.
>
> $ while :; do cat testfile | sha256sum; done
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
> ...
> *snip many dozens of lines*
> ...
> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>
> It's of course hard to prove the absence of trouble, but it feels
> like it is working from both of those latter cases...
>
> (for the "real" case the FTDI usb-serial adapter is soldered in,
> with no easy way to make it go away, so it is not as easy to do the
> same test there.)
>
> I'll try to reduce the number of local parts of the setup further
> tomorrow, such as the dtb mentioned above and the rootfs. I was
> hoping for a binary download of prebuilt parts, but some links on
>
> https://www.linux4sam.org/bin/view/Linux4SAM/Sama5d3xekMainPage
>
> are bogus. E.g.
>
> ftp://twiki.lnx4mchp_backend/pub/demo/linux4sam_4.7/linux4sam-poky-sama5d3xek-4.7.zip

Okay, that's a bug in the TWiki page.
> What's up with that twiki.lnx4mchp_backend "host"?

URL must be:
https://files.linux4sam.org/pub/demo/linux4sam_4.7/linux4sam-poky-sama5d3xek-4.7.zip

Regards,
Nicolas

--
Nicolas Ferre

2022-03-09 09:29:45

by Peter Rosin

[permalink] [raw]
Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-08 08:55, Nicolas Ferre wrote:
> On 07/03/2022 at 21:32, Peter Rosin wrote:
>> On 2022-03-07 12:32, Peter Rosin wrote:
>>> On 2022-03-07 10:45, [email protected] wrote:
>>>> Peter, would it be worth doing a similar test on your board to what I did?
>>>> I'm thinking whether the source of interrupts matters or not. So can you
>>>> disable your USB and use a mtd NAND stress test as a source of interrupts?
>>>> mtd_stresstest together with scp or hexdump.
>>>
>>> That's not a quick test for me, since I don't have modules enabled.
>>> I have located my SAMA5D31 evaluation kit, and I think I will try
>>> to get that running instead.
>>
>> Hi again!
>>
>> I got my SAMA5D31EK board running, using a freshly built at91bootstrap
>> and u-boot according to linux4sam.org and using the cross compiler I
>> have used from buildroot 2021.08, i.e. gcc 10.3.0, then using the
>> dtb for the ME20 from the original post and the same kernel and userspace
>> as I have used previously. Now, that dtb describes things that may not
>> actually be there etc etc, and I will try with a proper dtb as well
>> tomorrow, so this was just a quick-n-dirty test. I also added mem=64MB
>> to the kernel command line, to mimic our "Linea" CPU module and get a
>> bit quicker turnaround in the page cache.
>>
>> Anyway, with that setup I can reproduce the problem on the EK board.
>>
>> $ while :; do cat testfile | sha256sum; done
>> 5a939c69dd60a1f991e43d278d2e824a0e7376600a6b20c8e8b347871c546f9b -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> 250556db0a6ac3c3e101ae2845da48ebb39a0c12d4c9b9eec5ea229c426bcce9 -
>> 874c694ed002b04b67bf354a95ee521effd07e198f363e02cd63069a94fd4df8 -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> c3a918a923ff2d504a45ffa51289e69fb6d8aa1140cca3fd9f30703b18d9e509 -
>> 1577ed72d2f296f9adc50707e0e56547ecb311fa21ad875a3d55ca908c440307 -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>>
>>
>> But apparently only if I have an FTDI usb-serial adapter attached
>> while I boot. I also start to get good hashes if I remove and
>> reinsert the FTDI adapter, which is interesting.
>>
>> $ while :; do cat testfile | sha256sum; done
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>> ...
>> *snip many dozens of lines*
>> ...
>> 7bf74cf37c8bf81ad4f8e86da8eb129a8ae0ee0f5a22ac584ad39233b97acb4d -
>>
>> It's of course hard to prove the absence of trouble, but it feels
>> like it is working from both of those latter cases...
>>
>> (for the "real" case the FTDI usb-serial adapter is soldered in,
>> with no easy way to make it go away, so it is not as easy to do the
>> same test there.)
>>
>> I'll try to reduce the number of local parts of the setup further
>> tomorrow, such as the dtb mentioned above and the rootfs. I was
>> hoping for a binary download of prebuilt parts, but some links on
>>
>> https://www.linux4sam.org/bin/view/Linux4SAM/Sama5d3xekMainPage
>>
>> are bogus. E.g.
>>
>> ftp://twiki.lnx4mchp_backend/pub/demo/linux4sam_4.7/linux4sam-poky-sama5d3xek-4.7.zip
>
> Okay, that's a bug in the TWiki page.
>> What's up with that twiki.lnx4mchp_backend "host"?
>
> URL must be:
> https://files.linux4sam.org/pub/demo/linux4sam_4.7/linux4sam-poky-sama5d3xek-4.7.zip

Thanks,

I ended up not using that anyway since it didn't reproduce right
away. So, I went back to something I knew was workable and built
a smaller reproducer that isn't depending on any of our code. I
uploaded it to github.

https://github.com/peda-r/sama5d31

I make that, then flash it from the output sam-ba dir with sam-ba 3.2.

$ make
... *snip* *snip* *snip* *snip* ...
$ cd sam-ba
$ .../sam-ba_3.2.1/sam-ba -x prog-sama5d31ek.qml ttyACM0
... *snip* ...

Then on first boot, I append mem=64MB to the kernel command line.
Also, since I no longer have anything else that accesses the serial
ports I need something to make them fire USB interrupts, hence the
"cat /dev/ttyUSB0 &" etc commands in the below transcript. I have
also bumped the testfile to 50MB since there are fewer things going
on, and thus more memory available for the page cache.
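
For anyone repeating the test, the hash loop can be wrapped so that it
reports a single number instead of a screenful of digests. This is only
a rough sketch: the function name count_distinct_hashes is made up for
illustration, and the dd line in the comment is an assumed way of
getting a file of roughly 50MB, not the exact command used here.

```shell
#!/bin/sh
# Hypothetical helper: hash FILE through a pipe RUNS times and print
# how many distinct digests were seen. A result of 1 is the healthy
# case; anything larger means two reads of the same file differed.
count_distinct_hashes() {
    file=$1
    runs=$2
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Same load as in the thread: go through cat and a pipe.
        cat "$file" | sha256sum
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}

# Example (assumed sizes; the file should not fit in the page cache):
#   dd if=/dev/urandom of=testfile bs=1024 count=50000
#   count_distinct_hashes testfile 20
```

A count above 1 only says that corruption was observed; a count of 1
does not prove its absence.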

I have the FTDI serial adapter in the top USB slot since the udev
rule that sets the latency_timer to 1ms is written as it is; it
is based on what we use for the soldered in case on the "real"
hardware. It shouldn't really matter; I can connect the FTDI serial
adapter to the other USB port and set the latency_timers to 1ms
manually and still reproduce.
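
Setting the latency timers by hand could look roughly like the sketch
below. The sysfs path is the one used in the original post; the
function name set_latency_timers and the optional second argument are
assumptions added so the helper can be tried against a scratch
directory, and it is not the udev rule mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: write the same latency_timer value (in ms) to
# every usb-serial port found below a sysfs root. The root defaults to
# the path from the original post.
set_latency_timers() {
    ms=$1
    root=${2:-/sys/bus/usb-serial/devices}
    for f in "$root"/ttyUSB*/latency_timer; do
        [ -e "$f" ] || continue   # glob matched nothing, skip
        echo "$ms" > "$f"
    done
}

# Example: set_latency_timers 1
```

Writing to the real sysfs files needs root, and the ports still need a
reader (the "cat /dev/ttyUSB0 &" commands) before they generate any
interrupt load.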

I have some trouble getting the network going on the EK board,
and I plan to dig into that next and check if I can also reproduce
with the scp load. I'm not too hopeful though, since I fail to
reproduce even with the "cat testfile | sha256sum" load when the FTDI
serial adapter has not been connected all the time since boot. That
makes me think that the issue is there for the scp load only because
the FTDI serial adapter is always there on the "real" board, and
that it will be hard to reproduce without that adapter in place.

Cheers,
Peter

-------------- transcript --------------

RomBOOT


AT91Bootstrap 3.10.3 (2022-03-08 17:40:20)

1-Wire: Loading 1-Wire information ...
1-Wire: ROM Searching ... Done, 2 1-Wire chips found

1-Wire: BoardName | [Revid] | VendorName
#0 SAMA5D31-CM [DD4] EMBEST
#1 SAMA5D3x-MB [CC3] FLEX

1-Wire: Board sn: 0x480002a revision: 0x620803

NAND: ONFI flash detected
NAND: Manufacturer ID: 0x2c Chip ID: 0xda
NAND: Page Bytes: 2048, Spare Bytes: 64
NAND: ECC Correctability Bits: 4, ECC Sector Bytes: 512
NAND: Disable On-Die ECC
NAND: Initialize PMECC params, cap: 4, sector: 512
NAND: Image: Copy 0xa0000 bytes from 0x40000 to 0x26f00000
NAND: Done to load image
<debug_uart>

U-Boot 2017.03-linux4sam_5.8 (Mar 08 2022 - 17:40:32 +0100)

CPU: SAMA5D31
Crystal frequency: 12 MHz
CPU clock : 528 MHz
Master clock : 132 MHz
DRAM: 512 MiB
Flash: 16 MiB
NAND: 256 MiB
MMC: Atmel mci: 0, Atmel mci: 1
*** Warning - bad CRC, using default environment

In: serial
Out: serial
Err: serial
Net:
Error: ethernet@f0028000 address not set.
No ethernet found.
Hit any key to stop autoboot: 0
=> printenv bootargs
bootargs=console=ttyS0,115200 earlyprintk mtdparts=atmel_nand:256k(bootstrap)ro,768k(uboot)ro,256K(env_redundant),256k(env),512k(dtb),6M(kernel)ro,-(rootfs) rootfstype=ubifs ubi.mtd=6 root=ubi0:rootfs
=> setenv bootargs console=ttyS0,115200 earlyprintk mtdparts=atmel_nand:256k(bootstrap)ro,768k(uboot)ro,256K(env_redundant),256k(env),512k(dtb),6M(kernel)ro,-(rootfs) rootfstype=ubifs ubi.mtd=6 root=ubi0:rootfs mem=64MB
=> saveenv
Saving Environment to NAND...
Erasing redundant NAND...
Erasing at 0x100000 -- 100% complete.
Writing to redundant NAND... OK
=> boot

NAND read: device 0 offset 0x180000, size 0x80000
524288 bytes read: OK

NAND read: device 0 offset 0x200000, size 0x600000
6291456 bytes read: OK
## Flattened Device Tree blob at 21000000
Booting using the fdt blob at 0x21000000
Loading Device Tree to 3fb42000, end 3fb4b8cf ... OK

Starting kernel ...

[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 5.17.0-rc7 (peda@orc) (arm-buildroot-linux-gnueabihf-gcc.br_real (Buildroot 2021.08.3) 10.3.0, GNU ld (GNU Binutils) 2.36.1) #1 Tue Mar 8 17:48:36 CET 2022
[ 0.000000] CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Atmel SAMA5D31-EK
[ 0.000000] Memory policy: Data cache writeback
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000020000000-0x0000000023ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000020000000-0x0000000023ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000020000000-0x0000000023ffffff]
[ 0.000000] CPU: All CPU(s) started in SVC mode.
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 16256
[ 0.000000] Kernel command line: console=ttyS0,115200 earlyprintk mtdparts=atmel_nand:256k(bootstrap)ro,768k(uboot)ro,256K(env_redundant),256k(env),512k(dtb),6M(kernel)ro,-(rootfs) rootfstype=ubifs ubi.mtd=6 root=ubi0:rootfs mem=64MB
[ 0.000000] Unknown kernel command line parameters "earlyprintk", will be passed to user space.
[ 0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 54160K/65536K available (7168K kernel code, 325K rwdata, 1344K rodata, 1024K init, 104K bss, 11376K reserved, 0K cma-reserved)
[ 0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[ 0.000000] random: get_random_bytes called from start_kernel+0x3ec/0x524 with crng_init=0
[ 0.000000] clocksource: timer@f0010000: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 115833966437 ns
[ 0.000004] sched_clock: 32 bits at 16MHz, resolution 60ns, wraps every 130150523873ns
[ 0.000056] Switching to timer-based delay loop, resolution 60ns
[ 0.000477] clocksource: pit: mask: 0xfffffff max_cycles: 0xfffffff, max_idle_ns: 14479245754 ns
[ 0.001100] Console: colour dummy device 80x30
[ 0.001189] Calibrating delay loop (skipped), value calculated using timer frequency.. 33.00 BogoMIPS (lpj=165000)
[ 0.001241] pid_max: default: 32768 minimum: 301
[ 0.001504] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.001565] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.002635] CPU: Testing write buffer coherency: ok
[ 0.003882] Setting up static identity map for 0x20100000 - 0x20100060
[ 0.005538] devtmpfs: initialized
[ 0.016983] VFP support v0.3: implementor 41 architecture 2 part 30 variant 5 rev 1
[ 0.017461] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 0.017533] futex hash table entries: 256 (order: -1, 3072 bytes, linear)
[ 0.017699] pinctrl core: initialized pinctrl subsystem
[ 0.019515] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[ 0.020668] DMA: preallocated 256 KiB pool for atomic coherent allocations
[ 0.057473] AT91: PM: standby: standby, suspend: ulp0
[ 0.057529] No ATAGs?
[ 0.058784] gpio-at91 fffff200.gpio: at address (ptrval)
[ 0.060184] gpio-at91 fffff400.gpio: at address (ptrval)
[ 0.061631] gpio-at91 fffff600.gpio: at address (ptrval)
[ 0.063128] gpio-at91 fffff800.gpio: at address (ptrval)
[ 0.064739] gpio-at91 fffffa00.gpio: at address (ptrval)
[ 0.066585] pinctrl-at91 ahb:apb:pinctrl@fffff200: initialized AT91 pinctrl driver
[ 0.080562] at_hdmac ffffe600.dma-controller: Atmel AHB DMA Controller ( cpy set slave ), 8 channels
[ 0.082434] at_hdmac ffffe800.dma-controller: Atmel AHB DMA Controller ( cpy set slave ), 8 channels
[ 0.084762] AT91: Detected SoC family: sama5d3
[ 0.084805] AT91: Detected SoC: sama5d31, revision 2
[ 0.085672] SCSI subsystem initialized
[ 0.086186] usbcore: registered new interface driver usbfs
[ 0.086329] usbcore: registered new interface driver hub
[ 0.086466] usbcore: registered new device driver usb
[ 0.087663] at91_i2c f0014000.i2c: using dma0chan0 (tx) and dma0chan1 (rx) for DMA transfers
[ 0.088083] i2c i2c-0: using pinctrl states for GPIO recovery
[ 0.088224] i2c i2c-0: using generic GPIOs for recovery
[ 0.088698] at91_i2c f0014000.i2c: AT91 i2c bus driver (hw version: 0x402).
[ 0.089833] at91_i2c f0018000.i2c: using dma0chan2 (tx) and dma0chan3 (rx) for DMA transfers
[ 0.090295] i2c i2c-1: using pinctrl states for GPIO recovery
[ 0.090433] i2c i2c-1: using generic GPIOs for recovery
[ 0.092266] at91_i2c f0018000.i2c: AT91 i2c bus driver (hw version: 0x402).
[ 0.093647] Advanced Linux Sound Architecture Driver Initialized.
[ 0.095756] clocksource: Switched to clocksource timer@f0010000
[ 0.118209] NET: Registered PF_INET protocol family
[ 0.118510] IP idents hash table entries: 2048 (order: 2, 16384 bytes, linear)
[ 0.119613] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[ 0.119704] TCP established hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.119761] TCP bind hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.119809] TCP: Hash tables configured (established 1024 bind 1024)
[ 0.120102] UDP hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.120186] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.120498] NET: Registered PF_UNIX/PF_LOCAL protocol family
[ 0.122255] workingset: timestamp_bits=30 max_order=14 bucket_order=0
[ 0.123405] io scheduler mq-deadline registered
[ 0.123457] io scheduler kyber registered
[ 0.136138] brd: module loaded
[ 0.149150] loop: module loaded
[ 0.149846] ssc f0008000.ssc: Atmel SSC device at 0x(ptrval) (irq 21)
[ 0.151507] atmel_usart_serial.0.auto: ttyS2 at MMIO 0xf0020000 (irq = 24, base_baud = 4125000) is a ATMEL_SERIAL
[ 0.153485] atmel_usart_serial.1.auto: ttyS0 at MMIO 0xffffee00 (irq = 34, base_baud = 8250000) is a ATMEL_SERIAL
[ 0.705038] printk: console [ttyS0] enabled
[ 0.716182] macb f802c000.ethernet: invalid hw address, using random
[ 0.751175] macb f802c000.ethernet eth0: Cadence MACB rev 0x0001010c at 0xf802c000 irq 42 (d2:e4:fe:11:9c:b2)
[ 0.761741] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[ 0.768332] ehci-atmel: EHCI Atmel driver
[ 0.776581] atmel-ehci 700000.ehci: EHCI Host Controller
[ 0.781999] atmel-ehci 700000.ehci: new USB bus registered, assigned bus number 1
[ 0.789730] atmel-ehci 700000.ehci: irq 44, io mem 0x00700000
[ 0.820024] atmel-ehci 700000.ehci: USB 2.0 started, EHCI 1.00
[ 0.826275] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.17
[ 0.834597] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 0.841838] usb usb1: Product: EHCI Host Controller
[ 0.846697] usb usb1: Manufacturer: Linux 5.17.0-rc7 ehci_hcd
[ 0.852481] usb usb1: SerialNumber: 700000.ehci
[ 0.858200] hub 1-0:1.0: USB hub found
[ 0.862177] hub 1-0:1.0: 3 ports detected
[ 0.867560] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 0.873824] ohci-atmel: OHCI Atmel driver
[ 0.879375] at91_ohci 600000.ohci: USB Host Controller
[ 0.884635] at91_ohci 600000.ohci: new USB bus registered, assigned bus number 2
[ 0.892287] at91_ohci 600000.ohci: irq 44, io mem 0x00600000
[ 0.964328] usb usb2: New USB device found, idVendor=1d6b, idProduct=0001, bcdDevice= 5.17
[ 0.972650] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 0.979859] usb usb2: Product: USB Host Controller
[ 0.984686] usb usb2: Manufacturer: Linux 5.17.0-rc7 ohci_hcd
[ 0.990440] usb usb2: SerialNumber: at91
[ 0.995541] hub 2-0:1.0: USB hub found
[ 0.999369] hub 2-0:1.0: 3 ports detected
[ 1.005896] usbcore: registered new interface driver uas
[ 1.011436] usbcore: registered new interface driver usb-storage
[ 1.017508] usbcore: registered new interface driver ums-alauda
[ 1.023564] usbcore: registered new interface driver ums-cypress
[ 1.029629] usbcore: registered new interface driver ums-datafab
[ 1.035731] usbcore: registered new interface driver ums_eneub6250
[ 1.042038] usbcore: registered new interface driver ums-freecom
[ 1.048095] usbcore: registered new interface driver ums-isd200
[ 1.054109] usbcore: registered new interface driver ums-jumpshot
[ 1.060291] usbcore: registered new interface driver ums-karma
[ 1.066174] usbcore: registered new interface driver ums-onetouch
[ 1.072367] usbcore: registered new interface driver ums-realtek
[ 1.078434] usbcore: registered new interface driver ums-sddr09
[ 1.084463] usbcore: registered new interface driver ums-sddr55
[ 1.090482] usbcore: registered new interface driver ums-usbat
[ 1.096513] usbcore: registered new interface driver ftdi_sio
[ 1.102376] usbserial: USB Serial support registered for FTDI USB Serial Device
[ 1.110721] atmel_usba_udc 500000.gadget: MMIO registers at [mem 0xf8030000-0xf8033fff] mapped at (ptrval)
[ 1.120587] atmel_usba_udc 500000.gadget: FIFO at [mem 0x00500000-0x005fffff] mapped at (ptrval)
[ 1.132265] g_serial gadget: Gadget Serial v2.4
[ 1.136785] g_serial gadget: g_serial ready
[ 1.143169] at91_rtc fffffeb0.rtc: registered as rtc0
[ 1.148247] at91_rtc fffffeb0.rtc: setting system clock to 2015-05-16T14:19:33 UTC (1431785973)
[ 1.157038] at91_rtc fffffeb0.rtc: AT91 Real Time Clock driver.
[ 1.163257] i2c_dev: i2c /dev entries driver
[ 1.169663] at91-reset fffffe00.rstc: Starting after user reset
[ 1.176794] at91_wdt fffffe40.watchdog: watchdog is disabled
[ 1.182495] at91_wdt: probe of fffffe40.watchdog failed with error -22
[ 1.190832] atmel_aes f8038000.aes: version: 0x135
[ 1.196145] atmel_aes f8038000.aes: Atmel AES - Using dma1chan0, dma1chan1 for DMA transfers
[ 1.205373] atmel_sha f8034000.sha: version: 0x410
[ 1.210437] atmel_sha f8034000.sha: using dma1chan2 for DMA transfers
[ 1.216976] atmel_sha f8034000.sha: Atmel SHA1/SHA256/SHA224/SHA384/SHA512
[ 1.224567] atmel_tdes f803c000.tdes: version: 0x701
[ 1.229943] atmel_tdes f803c000.tdes: using dma1chan3, dma1chan4 for DMA transfers
[ 1.237747] atmel_tdes f803c000.tdes: Atmel DES/TDES
[ 1.243284] usbcore: registered new interface driver usbhid
[ 1.248839] usbhid: USB HID core driver
[ 1.258067] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xda
[ 1.264472] nand: Micron MT29F2G08ABAEAWP
[ 1.268447] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[ 1.276874] usb 1-2: new high-speed USB device number 2 using atmel-ehci
[ 1.286322] Bad block table not found for chip 0
[ 1.293124] Bad block table not found for chip 0
[ 1.297723] Scanning device for bad blocks
[ 1.500885] Bad block table written to 0x00000ffe0000, version 0x01
[ 1.508187] Bad block table written to 0x00000ffc0000, version 0x01
[ 1.514569] 7 cmdlinepart partitions found on MTD device atmel_nand
[ 1.520869] Creating 7 MTD partitions on "atmel_nand":
[ 1.525983] 0x000000000000-0x000000040000 : "bootstrap"
[ 1.532203] mtdblock: MTD device 'bootstrap' is NAND, please consider using UBI block devices instead.
[ 1.543914] 0x000000040000-0x000000100000 : "uboot"
[ 1.549845] mtdblock: MTD device 'uboot' is NAND, please consider using UBI block devices instead.
[ 1.560705] 0x000000100000-0x000000140000 : "env_redundant"
[ 1.567241] mtdblock: MTD device 'env_redundant' is NAND, please consider using UBI block devices instead.
[ 1.579007] 0x000000140000-0x000000180000 : "env"
[ 1.584771] mtdblock: MTD device 'env' is NAND, please consider using UBI block devices instead.
[ 1.595462] 0x000000180000-0x000000200000 : "dtb"
[ 1.601194] mtdblock: MTD device 'dtb' is NAND, please consider using UBI block devices instead.
[ 1.611800] 0x000000200000-0x000000800000 : "kernel"
[ 1.617732] mtdblock: MTD device 'kernel' is NAND, please consider using UBI block devices instead.
[ 1.629150] 0x000000800000-0x000010000000 : "rootfs"
[ 1.637058] usb 1-2: New USB device found, idVendor=0403, idProduct=6011, bcdDevice= 8.00
[ 1.645340] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 1.652514] usb 1-2: Product: Quad RS232-HS
[ 1.656687] usb 1-2: Manufacturer: FTDI
[ 1.661074] mtdblock: MTD device 'rootfs' is NAND, please consider using UBI block devices instead.
[ 1.673740] iio iio:device0: Resolution used: 12 bits
[ 1.679427] input: at91_adc as /devices/platform/ahb/ahb:apb/f8018000.adc/input/input0
[ 1.687403] random: fast init done
[ 1.694695] ftdi_sio 1-2:1.0: FTDI USB Serial Device converter detected
[ 1.701699] usb 1-2: Detected FT4232H
[ 1.707813] xt_time: kernel timezone is -0000
[ 1.712672] gre: GRE over IPv4 demultiplexor driver
[ 1.717681] Initializing XFRM netlink socket
[ 1.722278] NET: Registered PF_INET6 protocol family
[ 1.729303] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB0
[ 1.738202] Segment Routing with IPv6
[ 1.741969] In-situ OAM (IOAM) with IPv6
[ 1.746187] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 1.753545] NET: Registered PF_PACKET protocol family
[ 1.759611] ftdi_sio 1-2:1.1: FTDI USB Serial Device converter detected
[ 1.766611] usb 1-2: Detected FT4232H
[ 1.771399] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB1
[ 1.783894] ftdi_sio 1-2:1.2: FTDI USB Serial Device converter detected
[ 1.790948] usb 1-2: Detected FT4232H
[ 1.798470] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB2
[ 1.807567] ftdi_sio 1-2:1.3: FTDI USB Serial Device converter detected
[ 1.814557] usb 1-2: Detected FT4232H
[ 1.820308] usb 1-2: FTDI USB Serial Device converter now attached to ttyUSB3
[ 1.843413] ubi0: attaching mtd6
[ 2.648198] ubi0: scanning is finished
[ 2.674329] gluebi (pid 1): gluebi_resized: got update notification for unknown UBI device 0 volume 0
[ 2.683623] ubi0: volume 0 ("rootfs") re-sized from 90 to 1940 LEBs
[ 2.691016] ubi0: attached mtd6 (name "rootfs", size 248 MiB)
[ 2.696764] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[ 2.703703] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[ 2.710494] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[ 2.717454] ubi0: good PEBs: 1980, bad PEBs: 4, corrupted PEBs: 0
[ 2.723582] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128
[ 2.730822] ubi0: max/mean erase counter: 1/0, WL threshold: 4096, image sequence number: 1391204677
[ 2.739970] ubi0: available PEBs: 0, total reserved PEBs: 1980, PEBs reserved for bad PEB handling: 36
[ 2.749603] ubi0: background thread "ubi_bgt0d" started, PID 67
[ 2.758960] ALSA device list:
[ 2.761952] No soundcards found.
[ 2.769813] UBIFS (ubi0:0): Mounting in unauthenticated mode
[ 2.882936] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "rootfs", R/O mode
[ 2.891290] UBIFS (ubi0:0): LEB size: 126976 bytes (124 KiB), min./max. I/O unit sizes: 2048 bytes/2048 bytes
[ 2.901241] UBIFS (ubi0:0): FS size: 244936704 bytes (233 MiB, 1929 LEBs), max 2048 LEBs, journal size 9023488 bytes (8 MiB, 72 LEBs)
[ 2.913292] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB)
[ 2.919115] UBIFS (ubi0:0): media format: w4/r0 (latest is w5/r0), UUID 6AAC8EC5-1B1E-4E71-9F6F-EEB719E02AFC, small LPT model
[ 2.935358] VFS: Mounted root (ubifs filesystem) readonly on device 0:13.
[ 2.945679] devtmpfs: mounted
[ 2.951449] Freeing unused kernel image (initmem) memory: 1024K
[ 2.958144] Run /sbin/init as init process
[ 3.303513] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 70
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Populating /dev using udev: [ 4.159275] udevd[97]: starting version 3.2.10
[ 4.193870] random: udevd: uninitialized urandom read (16 bytes read)
[ 4.226640] random: udevd: uninitialized urandom read (16 bytes read)
[ 4.236763] random: udevd: uninitialized urandom read (16 bytes read)
[ 4.325369] random: crng init done
[ 4.353551] udevd[98]: starting eudev-3.2.10
[ 6.162090] ubi0 error: ubi_open_volume: cannot open device 0, volume 0, error -16
[ 6.214815] ubi0 error: ubi_open_volume: cannot open device 0, volume 0, error -16
done
Saving random seed: OK
Starting network: [ 7.090546] macb f802c000.ethernet eth0: PHY [f802c000.ethernet-ffffffff:01] driver [Micrel KSZ8031] (irq=45)
[ 7.150491] macb f802c000.ethernet eth0: configuring for phy/rmii link mode
udhcpc: started, v1.33.1
udhcpc: sending discover
udhcpc: sending discover
udhcpc: sending discover
udhcpc: no lease, failing
FAIL
ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519
Starting sshd: OK

Welcome to Buildroot
buildroot login: root
root@buildroot:~# cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
1
1
1
1
root@buildroot:~# cat inittest.sh
#! /bin/sh

echo "generating random file"
dd if=/dev/urandom of=testfile bs=1024 count=50000
root@buildroot:~# ./inittest.sh
generating random file
50000+0 records in
50000+0 records out
root@buildroot:~# cat /dev/ttyUSB0 &
root@buildroot:~# cat /dev/ttyUSB1 &
root@buildroot:~# cat /dev/ttyUSB2 &
root@buildroot:~# cat /dev/ttyUSB3 &
root@buildroot:~# cat runtest.sh
#! /bin/sh

while :; do cat testfile | sha256sum; done
root@buildroot:~# ./runtest.sh
abd6ded5a6eb1467e4b48909bfae35cea2191d417c3f27022954cee103c334ca -
98d03c79185168cbff6dc8db32e931061aa9e7c35025b7507f89faa208e12b6f -
1464940fc3cc527f89f153ec79ae7c8c892948ae013e6f54fba64664930e9ec4 -
98d03c79185168cbff6dc8db32e931061aa9e7c35025b7507f89faa208e12b6f -
326320e5a50777f8db772b6d06ac1beab246c32c66c75cefc0ace12f73394d68 -
d79664b5e2d461ce6617be24c1fbeab551b8fed0501e596ba09f1977b0fd70ee -
c362e254b14024fc46c4f18d7d10dc9424688c4d842ba6672361da12420a58fa -
be35c862a57e8a751af8517f3dc6f257ba1f18157b643ca3e8919f827e37e241 -
98d03c79185168cbff6dc8db32e931061aa9e7c35025b7507f89faa208e12b6f -
087eba1b603365320c9379391521791c5cd2ddce9a77e230ccb5bd67b2e856d0 -
22b6b0eb1d9428360fcd930c47bc41e566337a824bd66c5a468bfdf8adf89b36 -
^C
root@buildroot:~#
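
The runtest.sh loop above relies on comparing hashes by eye. As a rough sketch only (not something used in this thread), the same check can be automated in C: hash the file repeatedly and count the passes that disagree with the first one. The function names are hypothetical, FNV-1a is swapped in for SHA-256 to keep the example self-contained, and the first pass is assumed to be good, which the output above shows is not guaranteed. Reading the file directly instead of piping it through cat may also make the corruption harder to trigger.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a 64-bit hash of a whole file; returns 0 on success, -1 on I/O error. */
static int hash_file(const char *path, uint64_t *out)
{
	unsigned char buf[4096];
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */
	size_t n;
	FILE *f;

	assert(out != NULL);
	f = fopen(path, "rb");
	if (!f)
		return -1;
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
		for (size_t i = 0; i < n; i++) {
			h ^= buf[i];
			h *= 0x100000001b3ULL;	/* FNV-1a prime */
		}
	}
	if (ferror(f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	*out = h;
	return 0;
}

/*
 * Hash the file `passes` times. Returns the number of later passes whose
 * hash differs from the first pass, or -1 on error or if passes < 1.
 */
int count_mismatches(const char *path, int passes)
{
	uint64_t first, cur;
	int bad = 0;

	if (passes < 1 || hash_file(path, &first) != 0)
		return -1;
	for (int i = 1; i < passes; i++) {
		if (hash_file(path, &cur) != 0)
			return -1;
		if (cur != first)
			bad++;
	}
	return bad;
}
```

On a healthy system count_mismatches("testfile", 10) should return 0; any positive value means at least one pass read different data than the first.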

2022-03-10 10:48:36

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

[bringing this threadlet back to the lists, hope that's ok]

On 2022-03-10 09:27, Nicolas Ferre wrote:
> On 09/03/2022 at 12:42, Peter Rosin wrote:
>> On 2022-03-09 11:38, Nicolas Ferre wrote:
>>> Hi Peter,
>>>

*snip*

>>> One of my colleagues had an idea about this issue and in particular with
>>> the fact that removing some of the entries in the structure triggered
>>> the problem: "isn't it some kind of misalignment between structures that
>>> are supposed to be treated in 64 bits machines and our 32 bits core that
>>> we use?"
>>> This misalignment or "wrong assumption" of using 64 bits machine might
>>> be present in the USB stack as it seems to be related to this sub-system
>>> somehow.
>>
>> Yes, something like that has been creeping around in the back of my
>> head too. And it could be something much later in struct device that
>> is no longer sufficiently aligned when struct dev_links_info changes.
>> But what?

I verified the alignment of various things. With the old working
struct dev_links_info, i.e.

struct dev_links_info {
	struct list_head suppliers;
	struct list_head consumers;
	struct list_head needs_suppliers;
	struct list_head defer_sync;
	bool need_for_probe;
	enum dl_dev_state status;
};

I get

sizeof(struct device) 440
sizeof(struct dev_links_info) 40
offsetof(struct device, links) 80
offsetof(struct device, power) 120

"power" is the next member after "struct dev_links_info links" in
struct device, and I find no other uses of struct dev_links_info.
With the new problematic layout, i.e.

struct dev_links_info {
	struct list_head suppliers;
	struct list_head consumers;
	struct list_head defer_sync;
	enum dl_dev_state status;
};

I get:

sizeof(struct device) 432
sizeof(struct dev_links_info) 28
offsetof(struct device, links) 80
offsetof(struct device, power) 112

This means that everything around and within dev_links_info is 8-byte
aligned in the same way in either case. The one exception is that
"status" no longer shares an 8-byte slot with "need_for_probe" (which is
gone). But that should only make things better, no?

That combined with the test with this permuted version (swapped two
list_heads in the middle):

struct dev_links_info {
	struct list_head suppliers;
	struct list_head consumers;
	struct list_head defer_sync;
	struct list_head needs_suppliers;
	bool need_for_probe;
	enum dl_dev_state status;
};

which displayed a new failure mode (BUG instead of corruption, see
upthread) indicates that this is not an alignment issue. Famous last
words...
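
The sizes and offsets quoted above can be approximated on any host with a small model. This is a sketch under stated assumptions, not the real struct device: list_head32 uses 4-byte integers to mimic 32-bit pointers, the enum is a placeholder, the 80 bytes before "links" are an opaque array, and "power" is assumed to need 8-byte alignment (which is what the 112 and 120 offsets suggest).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for struct list_head on a 32-bit target: two 4-byte pointers. */
struct list_head32 {
	uint32_t next, prev;
};

/* Placeholder enum; only its size (4 bytes with the usual ABIs) matters. */
enum demo_state { DEMO_STATE_A, DEMO_STATE_B };

/* Old layout: four list heads, a bool, then the enum. */
struct old_links {
	struct list_head32 suppliers;
	struct list_head32 consumers;
	struct list_head32 needs_suppliers;
	struct list_head32 defer_sync;
	bool need_for_probe;
	enum demo_state status;
};

/* New layout: three list heads, then the enum. */
struct new_links {
	struct list_head32 suppliers;
	struct list_head32 consumers;
	struct list_head32 defer_sync;
	enum demo_state status;
};

/* Hypothetical models of the part of struct device around "links". */
struct old_device_model {
	unsigned char before_links[80];		/* members preceding "links" */
	struct old_links links;
	_Alignas(8) unsigned char power[8];	/* assumed 8-byte aligned */
};

struct new_device_model {
	unsigned char before_links[80];
	struct new_links links;
	_Alignas(8) unsigned char power[8];
};

/* Print the modelled sizes and offsets for both layouts. */
void print_layout(void)
{
	printf("old: sizeof(links) %zu, offsetof(power) %zu\n",
	       sizeof(struct old_links),
	       offsetof(struct old_device_model, power));
	printf("new: sizeof(links) %zu, offsetof(power) %zu\n",
	       sizeof(struct new_links),
	       offsetof(struct new_device_model, power));
}
```

With a typical GCC or Clang host ABI the model gives 40 and 120 for the old layout and 28 and 112 for the new one. In the new layout "links" ends at offset 108, so 4 bytes of padding precede "power" and its alignment is unchanged.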

> From that article:
> https://lwn.net/Articles/885941/
>
> I read:

> "Koschel included a patch fixing a bug in the USB subsystem where the
> iterator passed to this macro was used after the exit from the macro,
> which is a dangerous thing to do. Depending on what happens within the
> list, the contents of that iterator could be something surprising, even
> in the absence of speculative execution. Koschel fixed the problem by
> reworking the code in question to stop using the iterator after the loop. "
>
> USB subsystem, "struct list_head *next, *prev;"... Some keywords present
> there... worth a try?
>
> Regards,
> Nicolas

gr_udc.c is not built with the config that is in use, which is sad because
it looked like a good candidate.

Cheers,
Peter

2022-03-10 11:19:16

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 2022-03-10 10:58, Peter Rosin wrote:
> [bringing this threadlet back to the lists, hope that's ok]
>
> On 2022-03-10 09:27, Nicolas Ferre wrote:
>> From that article:
>> https://lwn.net/Articles/885941/
>>
>> I read:
>>
>> "Koschel included a patch fixing a bug in the USB subsystem where the
>> iterator passed to this macro was used after the exit from the macro,
>> which is a dangerous thing to do. Depending on what happens within the
>> list, the contents of that iterator could be something surprising, even
>> in the absence of speculative execution. Koschel fixed the problem by
>> reworking the code in question to stop using the iterator after the loop. "
>>
>> USB subsystem, "struct list_head *next, *prev;"... Some keywords present
>> there... worth a try?
>>
>> Regards,
>> Nicolas
>
> gr_udc.c is not built with the config that is in use, which is sad because
> it looked like a good candidate.

atmel_usba_udc.c, which is included, has the same pattern. But alas, doing
the equivalent patch there does not fix things either. I.e. (whitespace
damaged)

--- a/drivers/usb/gadget/udc/atmel_usba_udc.c
+++ b/drivers/usb/gadget/udc/atmel_usba_udc.c
@@ -863,6 +863,7 @@ static int usba_ep_dequeue(struct usb_ep *_ep, struct usb_request *_req)
struct usba_request *req;
unsigned long flags;
u32 status;
+ bool found = false;

DBG(DBG_GADGET | DBG_QUEUE, "ep_dequeue: %s, req %p\n",
ep->ep.name, _req);
@@ -870,11 +871,13 @@ static int usba_ep_dequeue(struct usb_ep *_ep, struct usb_request *_req)
spin_lock_irqsave(&udc->lock, flags);

list_for_each_entry(req, &ep->queue, queue) {
- if (&req->req == _req)
+ if (&req->req == _req) {
+ found = true;
break;
+ }
}

- if (&req->req != _req) {
+ if (!found) {
spin_unlock_irqrestore(&udc->lock, flags);
return -EINVAL;
}

The test started out with 3 good hashes though, so I got my hopes up. But
no, it's about the same failure rate as usual. I have the feeling that I
will never again trust a single sha256sum...
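
For readers following along, the hazard that the patch above avoids can be sketched in userspace. This is only an illustration: the list helpers are minimal stand-ins for the kernel's list macros, every demo_* name is hypothetical, and __typeof__ assumes GCC or Clang. When list_for_each_entry() runs to completion without a break, the iterator ends up as container_of() of the list head itself, which is not a real entry, so a separate "found" flag is checked instead of the iterator.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel list helpers (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Hypothetical, simplified versions of the request and endpoint types. */
struct demo_req { int id; };
struct demo_request { struct demo_req req; struct list_head queue; };
struct demo_ep { struct list_head queue; };

/* Unlink _req from the endpoint queue; -EINVAL if it is not queued. */
static int demo_dequeue(struct demo_ep *ep, struct demo_req *_req)
{
	struct demo_request *req;
	bool found = false;

	list_for_each_entry(req, &ep->queue, queue) {
		if (&req->req == _req) {
			found = true;
			break;
		}
	}

	/*
	 * Without a match, req was computed from the list head and does not
	 * point at a real struct demo_request, so it is not used here.
	 */
	if (!found)
		return -EINVAL;

	req->queue.prev->next = req->queue.next;
	req->queue.next->prev = req->queue.prev;
	return 0;
}
```

Dequeuing a request that was never queued, or one that was already removed, returns -EINVAL without the iterator being read after the loop.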

Cheers,
Peter

2022-04-10 16:56:45

by Thorsten Leemhuis

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Can somebody please provide a status update on the outcome of this
thread? It started as a regression report, that's why I'm tracking it --
but it seems nothing has happened for a while. Was it fixed? Did it fall
through the cracks? Or did it turn out that this is not a regression? If
the latter: please feel free to include a paragraph like "#regzbot
invalid: a few words why this is invalid in the length of a mail subject"

Ciao, Thorsten

#regzbot poke

2022-04-12 06:03:32

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31


On 4/9/22 16:02, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> to make this easily accessible to everyone.
>
> Can somebody please provide a status update on the outcome of this
> thread? It started as a regression report, that's why I'm tracking it --

Hi, Thorsten,

There are some concurrency bugs in the at-hdmac (DMA) driver. I'm handling them
and will come back with a resolution. With DMA disabled, the bug is no longer
reproducible.

> but it seems nothing has happened for a while. Was it fixed? Did it fall
> through the cracks? Or did it turn out that this is not a regression? If

Not yet sure if it's a regression or not, as the bugs have been there since the
beginning. Maybe they were just harder to reproduce before.

> the latter: please feel free to include a paragraph like "#regzbot
> invalid: a few words why this is invalid in the length of a mail subject"
>

Will follow up after I fix the DMA bugs.

Cheers,
ta

2022-05-18 04:55:14

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

2022-04-11 at 08:21, [email protected] wrote:
> There are some concurrency bugs in the at-hdmac (DMA) driver. I'm handling them
> and will come back with a resolution. With DMA disabled, the bug is no longer
> reproducible.

Any news?

Cheers,
Peter

2022-05-18 06:25:48

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 5/17/22 17:50, Peter Rosin wrote:
> 2022-04-11 at 08:21, [email protected] wrote:
>> There are some concurrency bugs in the at-hdmac (DMA) driver. I'm handling them
>> and will come back with a resolution. With DMA disabled, the bug is no longer
>> reproducible.
>
> Any news?
>

I'm now allocated to this, so I've started looking into what has to be done.
I'm thinking of using virt-dma to manage the channels and the request queues.
Will get back to you once I have something working.

Cheers,
ta

2022-05-18 07:52:13

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

2022-05-18 at 08:21, [email protected] wrote:
> On 5/17/22 17:50, Peter Rosin wrote:
>> 2022-04-11 at 08:21, [email protected] wrote:
>>> There are some concurrency bugs in the at-hdmac (DMA) driver. I'm handling them
>>> and will come back with a resolution. With DMA disabled, the bug is no longer
>>> reproducible.
>>
>> Any news?
>>
>
> I'm now allocated to this, so I've started looking into what has to be done.
> I'm thinking of using virt-dma to manage the channels and the request queues.
> Will get back to you once I have something working.

Sounds good, thanks!

Cheers,
Peter

2022-06-20 07:14:22

by Thorsten Leemhuis

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 18.05.22 09:51, Peter Rosin wrote:
> 2022-05-18 at 08:21, [email protected] wrote:
>> On 5/17/22 17:50, Peter Rosin wrote:
>>> 2022-04-11 at 08:21, [email protected] wrote:
>>>> There are some concurrency bugs in the at-hdmac (DMA) driver. I'm handling them
>>>> and will come back with a resolution. With DMA disabled, the bug is no longer
>>>> reproducible.
>>>
>>> Any news?
>>
>> I'm now allocated to this, so I've started looking into what has to be done.
>> I'm thinking of using virt-dma to manage the channels and the request queues.
>> Will get back to you once I have something working.
>
> Sounds good, thanks!

That was about a month ago. Has any progress been made to get this
regression fixed?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

#regzbot poke


2022-06-20 09:21:12

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31


On 6/20/22 10:04, Thorsten Leemhuis wrote:
> On 18.05.22 09:51, Peter Rosin wrote:
>> 2022-05-18 at 08:21, [email protected] wrote:
>>> On 5/17/22 17:50, Peter Rosin wrote:
>>>> 2022-04-11 at 08:21, [email protected] wrote:
>>>>> There are some concurrency bugs in the at-hdmac (DMA) driver. I'm handling them
>>>>> and will come back with a resolution. With DMA disabled, the bug is no longer
>>>>> reproducible.
>>>>
>>>> Any news?
>>>
>>> I'm now allocated to this, so I've started looking into what has to be done.
>>> I'm thinking of using virt-dma to manage the channels and the request queues.
>>> Will get back to you once I have something working.
>>
>> Sounds good, thanks!
>
> That was about a month ago. Has any progress been made to get this
> regression fixed?

Hi, Thorsten, Peter,

I was mostly out of office last month, and I'll still be offline this week.
I made some progress and tried to address the bugs incrementally. I now
encounter the memory corruption less often, but I still hit it. I put some
drafts at [1] if someone is curious. Anyway, I'm modifying the driver to use
virt-dma, and I'm also trying to move the selection of the next transfer into
the IRQ handler instead of the tasklet. I couldn't find a quick non-invasive
fix, so this is still in progress.

Cheers,
ta

[1] git@github.com:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt

>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.
>
> #regzbot poke
>
>

2022-06-20 15:45:31

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31


>
> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>

Hi, Peter,

I've just force-pushed to this branch. I had a typo somewhere, and with that fixed I could
no longer reproduce the bug. Tested for ~20 minutes. Would you please test the last 3 patches
and tell me if you can still reproduce the bug?

Thanks,
ta

2022-06-21 07:13:13

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

2022-06-20 at 16:22, Tudor Ambarus wrote:
>
>>
>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>
>
> Hi, Peter,
>
> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
> and tell me if you can still reproduce the bug?
>
> Thanks,
> ta

Hi!

Great news! I will test today.

Cheers,
Peter

2022-06-21 10:51:09

by Peter Rosin

Subject: RE: Regression: memory corruption on Atmel SAMA5D31

2022-06-20 at 16:22, Tudor Ambarus wrote:
>
>>
>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>
>
> Hi, Peter,
>
> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
> and tell me if you can still reproduce the bug?

Hi!

I rebased your patches onto my current branch which is v5.18.2 plus a few unrelated
changes (at least they are unrelated after removing the previous workaround to disable
nand-dma entirely).

The unrelated patches are two backports so that drivers recognize new compatibles [1][2],
which should be completely harmless, plus a couple of proposed fixes that happen to fix
EEPROM issues with the at91 I2C driver from Codrin Ciubotariu [3].

On that kernel, I can still reproduce. It seems a bit harder to reproduce the problem now
though. If the system is otherwise idle, the sha256sum test did not reproduce in a run of
150+ attempts, but if I let the "real" application run while I do the test, I get a failure rate
of about 10%, see below. The real application burns some CPU (but not all of it) and
communicates with HW using I2C, native UARTs and two of the four USB-serial ports
(FTDI, with the latency set to 1ms as mentioned earlier), so I guess there is more DMA
pressure or something? There is a 100 Mbps network connection, but it was left "idle"
during this test.

Cheers,
Peter

$ dd if=/dev/urandom of=testfile bs=1024 count=40000
40000+0 records in
40000+0 records out
40960000 bytes (41 MB, 39 MiB) copied, 80.0485 s, 512 kB/s
$ while :; do cat testfile | sha256sum; done
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
a4850c1bb0226f14659035cdf1461c7df03d50bff8af560e3bd204942556b73f -
43c1941e15bd7e048e9d5f1d41ce67517cb6e59dae1d3af256d1507168100fcb -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
4a35af384455853a24b943ef94353663e8c22a9aa29d2e275194fd544d0b194a -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
f6710849b36e6954c26ff62cd974ecb082b93fa6e53ecf0aea7e0c93acc0a445 -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
2c2f4ac91f435439d2d640c34ee89b4d1ebf3adb8438efbf064a4139247241c5 -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
b93e0c56cfba75cb9e2b35ff6769abdb6c1a5d17cadc28cec1979188e044cf3d -
^C
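
Counting the mismatches in a run like the one above by eye gets tedious for
longer runs. A minimal sketch of a helper that does the counting follows. The
function name summarize_hashes is made up for illustration, and it assumes
that the most frequent hash in the output is the correct one, which only
holds while corrupted reads are the minority.

```shell
#!/bin/sh
# Summarize sha256sum output read from stdin.
# Assumption: the most frequent hash is the correct one, so every other
# line counts as a corrupted read. Prints "<mismatches> <total>".
summarize_hashes() {
    awk '
        { count[$1]++; total++ }          # tally each distinct hash
        END {
            best = 0
            for (h in count)
                if (count[h] > best)
                    best = count[h]       # size of the largest group
            printf "%d %d\n", total - best, total
        }'
}
```

A bounded run could then be piped through it, for example
`for i in $(seq 100); do cat testfile | sha256sum; done | summarize_hashes`
(the count of 100 is arbitrary). Applied to the 46 hashes listed above it
reports 5 mismatches, roughly 11%, in line with the "about 10%" estimate.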

[1] https://lore.kernel.org/linux-kernel/[email protected]/
[2] https://lore.kernel.org/linux-kernel/[email protected]/
[3] https://lore.kernel.org/linux-kernel/[email protected]/

2022-06-27 12:56:48

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 6/21/22 13:46, Peter Rosin wrote:
> 2022-06-20 at 16:22, [email protected] wrote:
>>
>>>
>>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>>
>>
>> Hi, Peter,
>>
>> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
>> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
>> and tell me if you can still reproduce the bug?
>
> Hi!
>
> I rebased your patches onto my current branch which is v5.18.2 plus a few unrelated
> changes (at least they are unrelated after removing the previous workaround to disable
> nand-dma entirely).
>
> The unrelated patches are two backports so that drivers recognize new compatibles [1][2],
> which should be completely harmless, plus a couple of proposed fixes that happens to fix
> eeprom issues with the at91 I2C driver from Codrin Ciubotariu [3].
>
> On that kernel, I can still reproduce. It seems a bit harder to reproduce the problem now
> though. If the system is otherwise idle, the sha256sum test did not reproduce in a run of
> 150+ attempts, but if I let the "real" application run while I do the test, I get a failure rate
> of about 10%, see below. The real application burns some CPU (but not all of it) and
> communicates with HW using I2C, native UARTs and two of the four USB-serial ports
> (FTDI, with the latency set to 1ms as mentioned earlier), so I guess there is more DMA
> pressure or something? There is a 100mbps network connection, but it was left "idle"
> during this test.
>

Thanks, Peter.
I'm back in the office and rechecking what could go wrong.

ta

2022-06-27 17:10:27

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 6/27/22 15:26, Tudor Ambarus wrote:
> On 6/21/22 13:46, Peter Rosin wrote:
>> 2022-06-20 at 16:22, [email protected] wrote:
>>>
>>>>
>>>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>>>
>>>
>>> Hi, Peter,
>>>
>>> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
>>> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
>>> and tell me if you can still reproduce the bug?
>>
>> Hi!
>>
>> I rebased your patches onto my current branch which is v5.18.2 plus a few unrelated
>> changes (at least they are unrelated after removing the previous workaround to disable
>> nand-dma entirely).
>>
>> The unrelated patches are two backports so that drivers recognize new compatibles [1][2],
>> which should be completely harmless, plus a couple of proposed fixes that happens to fix
>> eeprom issues with the at91 I2C driver from Codrin Ciubotariu [3].
>>
>> On that kernel, I can still reproduce. It seems a bit harder to reproduce the problem now
>> though. If the system is otherwise idle, the sha256sum test did not reproduce in a run of
>> 150+ attempts, but if I let the "real" application run while I do the test, I get a failure rate
>> of about 10%, see below. The real application burns some CPU (but not all of it) and
>> communicates with HW using I2C, native UARTs and two of the four USB-serial ports
>> (FTDI, with the latency set to 1ms as mentioned earlier), so I guess there is more DMA
>> pressure or something? There is a 100mbps network connection, but it was left "idle"
>> during this test.
>>
>
> Thanks, Peter.
> I got back to the office, I'm rechecking what could go wrong.
>

Hi, Peter,

Would you please help me with another round of testing? I'm having difficulty
reproducing the bug, and maybe you can speed up the process while I copy
your testing setup. I made two more patches on top of the same branch [1].
My assumption is that the last problem you saw is that a transfer
could be started multiple times. I think these are the last of the less
invasive changes that I'll try; I'll have to rewrite the logic anyway.

Thanks!

[1] To github.com:ambarus/linux-0day.git
cbb2ddca4618..79c7784dbcf2 dma-regression-hdmac-v5.18-rc7-4th-attempt -> dma-regression-hdmac-v5.18-rc7-4th-attempt

2022-06-30 05:44:40

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Hi!

2022-06-27 at 18:53, Tudor Ambarus wrote:
> On 6/27/22 15:26, [email protected] wrote:
>> On 6/21/22 13:46, Peter Rosin wrote:
>>> 2022-06-20 at 16:22, [email protected] wrote:
>>>>
>>>>>
>>>>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>>>>
>>>>
>>>> Hi, Peter,
>>>>
>>>> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
>>>> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
>>>> and tell me if you can still reproduce the bug?
>>>
>>> Hi!
>>>
>>> I rebased your patches onto my current branch which is v5.18.2 plus a few unrelated
>>> changes (at least they are unrelated after removing the previous workaround to disable
>>> nand-dma entirely).
>>>
>>> The unrelated patches are two backports so that drivers recognize new compatibles [1][2],
>>> which should be completely harmless, plus a couple of proposed fixes that happens to fix
>>> eeprom issues with the at91 I2C driver from Codrin Ciubotariu [3].
>>>
>>> On that kernel, I can still reproduce. It seems a bit harder to reproduce the problem now
>>> though. If the system is otherwise idle, the sha256sum test did not reproduce in a run of
>>> 150+ attempts, but if I let the "real" application run while I do the test, I get a failure rate
>>> of about 10%, see below. The real application burns some CPU (but not all of it) and
>>> communicates with HW using I2C, native UARTs and two of the four USB-serial ports
>>> (FTDI, with the latency set to 1ms as mentioned earlier), so I guess there is more DMA
>>> pressure or something? There is a 100mbps network connection, but it was left "idle"
>>> during this test.
>>>
>>
>> Thanks, Peter.
>> I got back to the office, I'm rechecking what could go wrong.
>>
>
> Hi, Peter,
>
> Would you please help me with another round of testing? I have difficulties
> in reproducing the bug and maybe you can speed up the process while I copy
> your testing setup. I made two more patches on top of the same branch [1].
> My assumption is that the last problem that you saw is that a transfer
> could be started multiple times. I think these are the last less invasive
> changes that I try, I'll have to rewrite the logic anyway.
>
> Thanks!
>
> [1] To github.com:ambarus/linux-0day.git
> cbb2ddca4618..79c7784dbcf2 dma-regression-hdmac-v5.18-rc7-4th-attempt -> dma-regression-hdmac-v5.18-rc7-4th-attempt

I was out of office, but I managed to get a test running overnight and can
report that it still fails. This was a longer run of about 500 iterations with
a failure rate of 5%, compared to 10% last time. I tend
to think that the observed difference in failure rate may well be statistical
noise, but who knows? Would a longer run without the last two patches be
useful, to see if they make a difference?

Cheers,
Peter
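
One rough way to judge whether 10% versus 5% is within noise is a pooled
two-proportion z-test. The sketch below is only an illustration. The function
name two_prop_z is made up, the first sample is taken as 5 failures in 46
runs (the count visible in the earlier log), and the second as 25 failures in
500 runs, which is an assumption derived from "about 500" and "5%".

```shell
#!/bin/sh
# two_prop_z FAILS1 RUNS1 FAILS2 RUNS2
# Prints the pooled two-proportion z statistic, rounded to two decimals.
# Assumes both run counts are non-zero and that the combined sample has
# at least one failure and one success (otherwise the standard error is 0).
two_prop_z() {
    awk -v f1="$1" -v n1="$2" -v f2="$3" -v n2="$4" 'BEGIN {
        p1 = f1 / n1                        # failure rate, first run
        p2 = f2 / n2                        # failure rate, second run
        p  = (f1 + f2) / (n1 + n2)          # pooled failure rate
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        printf "%.2f\n", (p1 - p2) / se
    }'
}

# Assumed counts: 5/46 from the earlier log, 25/500 from "about 500" at 5%.
two_prop_z 5 46 25 500
```

With those assumed counts the statistic comes out at about 1.67, which is
below the usual 1.96 threshold for a two-sided test at the 5% level, so the
data would not contradict the "statistical noise" reading. With only five
failures in the smaller sample the normal approximation is crude, so this is
a sanity check rather than a firm conclusion.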

2022-06-30 09:57:11

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 6/30/22 08:20, Peter Rosin wrote:
> Hi!

Hi, Peter!
>
> 2022-06-27 at 18:53, [email protected] wrote:
>> On 6/27/22 15:26, [email protected] wrote:
>>> On 6/21/22 13:46, Peter Rosin wrote:
>>>> 2022-06-20 at 16:22, [email protected] wrote:
>>>>>
>>>>>>
>>>>>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>>>>>
>>>>>
>>>>> Hi, Peter,
>>>>>
>>>>> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
>>>>> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
>>>>> and tell me if you can still reproduce the bug?
>>>>
>>>> Hi!
>>>>
>>>> I rebased your patches onto my current branch which is v5.18.2 plus a few unrelated
>>>> changes (at least they are unrelated after removing the previous workaround to disable
>>>> nand-dma entirely).
>>>>
>>>> The unrelated patches are two backports so that drivers recognize new compatibles [1][2],
>>>> which should be completely harmless, plus a couple of proposed fixes that happens to fix
>>>> eeprom issues with the at91 I2C driver from Codrin Ciubotariu [3].
>>>>
>>>> On that kernel, I can still reproduce. It seems a bit harder to reproduce the problem now
>>>> though. If the system is otherwise idle, the sha256sum test did not reproduce in a run of
>>>> 150+ attempts, but if I let the "real" application run while I do the test, I get a failure rate
>>>> of about 10%, see below. The real application burns some CPU (but not all of it) and
>>>> communicates with HW using I2C, native UARTs and two of the four USB-serial ports
>>>> (FTDI, with the latency set to 1ms as mentioned earlier), so I guess there is more DMA
>>>> pressure or something? There is a 100mbps network connection, but it was left "idle"
>>>> during this test.
>>>>
>>>
>>> Thanks, Peter.
>>> I got back to the office, I'm rechecking what could go wrong.
>>>
>>
>> Hi, Peter,
>>
>> Would you please help me with another round of testing? I have difficulties
>> in reproducing the bug and maybe you can speed up the process while I copy
>> your testing setup. I made two more patches on top of the same branch [1].
>> My assumption is that the last problem that you saw is that a transfer
>> could be started multiple times. I think these are the last less invasive
>> changes that I try, I'll have to rewrite the logic anyway.
>>
>> Thanks!
>>
>> [1] To github.com:ambarus/linux-0day.git
>> cbb2ddca4618..79c7784dbcf2 dma-regression-hdmac-v5.18-rc7-4th-attempt -> dma-regression-hdmac-v5.18-rc7-4th-attempt
>
> I was out of office, but I managed to get a test running over night and can
> report that It still fails. This is a longer run of about 500 with a failure
> rate of 5% compared to the last time when the failure rate was 10%. I tend

Thanks!

> to think that the observed difference in failure rate may well be statistical
> noise, but who knows? Would it be useful with a longer run without the last
> two patches to see if they make a difference?

I pushed another patch where I added a write memory barrier to make sure everything
is in place before starting the transfer. Could you also take the last patch
and re-test, if it's not too complicated? I still can't reproduce it on my side;
I'm checking what else I can add to stress test the DMA.

Thanks!
ta

2022-06-30 10:24:56

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 6/30/22 12:23, Tudor Ambarus wrote:
> On 6/30/22 08:20, Peter Rosin wrote:
>> Hi!
>
> Hi, Peter!
>>
>> 2022-06-27 at 18:53, [email protected] wrote:
>>> On 6/27/22 15:26, [email protected] wrote:
>>>> On 6/21/22 13:46, Peter Rosin wrote:
>>>>> 2022-06-20 at 16:22, [email protected] wrote:
>>>>>>
>>>>>>>
>>>>>>> [email protected]:ambarus/linux-0day.git, branch dma-regression-hdmac-v5.18-rc7-4th-attempt
>>>>>>>
>>>>>>
>>>>>> Hi, Peter,
>>>>>>
>>>>>> I've just forced pushed on this branch, I had a typo somewhere and with that fixed I could
>>>>>> no longer reproduce the bug. Tested for ~20 minutes. Would you please test last 3 patches
>>>>>> and tell me if you can still reproduce the bug?
>>>>>
>>>>> Hi!
>>>>>
>>>>> I rebased your patches onto my current branch which is v5.18.2 plus a few unrelated
>>>>> changes (at least they are unrelated after removing the previous workaround to disable
>>>>> nand-dma entirely).
>>>>>
>>>>> The unrelated patches are two backports so that drivers recognize new compatibles [1][2],
>>>>> which should be completely harmless, plus a couple of proposed fixes that happens to fix
>>>>> eeprom issues with the at91 I2C driver from Codrin Ciubotariu [3].
>>>>>
>>>>> On that kernel, I can still reproduce. It seems a bit harder to reproduce the problem now
>>>>> though. If the system is otherwise idle, the sha256sum test did not reproduce in a run of
>>>>> 150+ attempts, but if I let the "real" application run while I do the test, I get a failure rate
>>>>> of about 10%, see below. The real application burns some CPU (but not all of it) and
>>>>> communicates with HW using I2C, native UARTs and two of the four USB-serial ports
>>>>> (FTDI, with the latency set to 1ms as mentioned earlier), so I guess there is more DMA
>>>>> pressure or something? There is a 100mbps network connection, but it was left "idle"
>>>>> during this test.
>>>>>
>>>>
>>>> Thanks, Peter.
>>>> I got back to the office, I'm rechecking what could go wrong.
>>>>
>>>
>>> Hi, Peter,
>>>
>>> Would you please help me with another round of testing? I have difficulties
>>> in reproducing the bug and maybe you can speed up the process while I copy
>>> your testing setup. I made two more patches on top of the same branch [1].
>>> My assumption is that the last problem that you saw is that a transfer
>>> could be started multiple times. I think these are the last less invasive
>>> changes that I try, I'll have to rewrite the logic anyway.
>>>
>>> Thanks!
>>>
>>> [1] To github.com:ambarus/linux-0day.git
>>> cbb2ddca4618..79c7784dbcf2 dma-regression-hdmac-v5.18-rc7-4th-attempt -> dma-regression-hdmac-v5.18-rc7-4th-attempt
>>
>> I was out of office, but I managed to get a test running over night and can
>> report that It still fails. This is a longer run of about 500 with a failure
>> rate of 5% compared to the last time when the failure rate was 10%. I tend
>
> Thanks!
>
>> to think that the observed difference in failure rate may well be statistical
>> noise, but who knows? Would it be useful with a longer run without the last
>> two patches to see if they make a difference?

I forgot to answer, sorry. No, not needed as it still fails.
>
> I pushed another patch were I added a write mem barrier to make sure everything
> is in place before starting the transfer. Could you also take the last patch
> and re-test if it's not too complicated? I still can't reproduce it on my side,
> I'm checking what else I can add to stress test the DMA.

I could reproduce the bug even with the wmb(). I'm rechecking what I missed.

Cheers,
ta

2022-07-13 17:11:54

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

Hi, Peter,

Thanks for the patience. I was still out of office last week,
but now I have some news.

On 6/27/22 19:53, Tudor Ambarus wrote:
> I think these are the last less invasive
> changes that I try, I'll have to rewrite the logic anyway.

I've chopped the driver to use virt-dma (check [1]). It's not clean, but
it works and one can see how the logic has changed. Unfortunately the memory
corruption is still present under high load. Maybe it's a coherency problem.
I need more time on it. Will get back to you.

Cheers,
ta

[1] To github.com:ambarus/linux-0day.git
a7351e6f4c12..1557e0df0fd0 at-hdmac-virt-dma -> at-hdmac-virt-dma

2022-07-28 08:02:26

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 7/13/22 19:01, Tudor Ambarus wrote:
> Hi, Peter,
>
> Thanks for the patience. I was still out of office last week,
> but now I have some news.
>
> On 6/27/22 19:53, [email protected] wrote:
>> I think these are the last less invasive
>> changes that I try, I'll have to rewrite the logic anyway.
>
> I've chopped the driver to use virt-dma (check [1]). It's not clean, but
> it works and one can see how the logic is changed. Unfortunately the mem
> corruption is still present on high loads. Maybe it's a coherency problem.
> I need more time on it. Will get back to you.
>
> Cheers,
> ta
>
> [1] To github.com:ambarus/linux-0day.git
> a7351e6f4c12..1557e0df0fd0 at-hdmac-virt-dma -> at-hdmac-virt-dma

Hi, Peter,

Does this one-line patch [1] solve the memory corruption on your side?
Even if it does, there are still bugs in at-hdmac that can be squashed by
using virt-dma. I'd like to follow up with patches that integrate
the virt-dma logic into at-hdmac.

Cheers,
ta

[1] https://lore.kernel.org/linux-mtd/[email protected]/T/#u

2022-07-28 08:53:45

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 7/28/22 10:45, Tudor Ambarus wrote:
> On 7/13/22 19:01, [email protected] wrote:
>> Hi, Peter,
>>
>> Thanks for the patience. I was still out of office last week,
>> but now I have some news.
>>
>> On 6/27/22 19:53, [email protected] wrote:
>>> I think these are the last less invasive
>>> changes that I try, I'll have to rewrite the logic anyway.
>>
>> I've chopped the driver to use virt-dma (check [1]). It's not clean, but
>> it works and one can see how the logic is changed. Unfortunately the mem
>> corruption is still present on high loads. Maybe it's a coherency problem.
>> I need more time on it. Will get back to you.
>>
>> Cheers,
>> ta
>>
>> [1] To github.com:ambarus/linux-0day.git
>> a7351e6f4c12..1557e0df0fd0 at-hdmac-virt-dma -> at-hdmac-virt-dma
>
> Hi, Peter,
>
> Does this [1] one line patch solve the mem corruption on your side?
> Even if yes, there are still bugs in at-hdmac that can be squashed by
> using virt-dma. I'd like to follow up with patches that integrate
> virt-dma logic in at-hdmac.
>
> Cheers,
> ta
>
> [1] https://lore.kernel.org/linux-mtd/[email protected]/T/#u

Hi, Peter,

Looks like I've already caught an oops in the at-hdmac driver when not using virt-dma,
see below. Would you please test with all the patches from [2] instead of just
using the patch from [1]? I ran stress tests overnight using [2] and
everything went fine on my side.

Cheers,
ta

[2] To github.com:ambarus/linux-0day.git
* [new branch] at-hdmac-virt-dma-2nd-iteration -> at-hdmac-virt-dma-2nd-iteration

root@sama5d3-xplained:~# while :; do cat testfile | sha256sum; done
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
dad5c65bf4c2a009ad6bad0d279096841de91631636057dd5dc47d0a45f9ec84 -
[ 4115.100000] ------------[ cut here ]------------
[ 4115.100000] kernel BUG at drivers/dma/dmaengine.h:54!
[ 4115.100000] Internal error: Oops - BUG: 0 [#1] ARM
[ 4115.100000] CPU: 0 PID: 480 Comm: cat Not tainted 5.18.0-rc7+ #40
[ 4115.100000] Hardware name: Atmel SAMA5
[ 4115.100000] PC is at atc_chain_complete+0x150/0x168
[ 4115.100000] LR is at atc_advance_work+0x78/0x184
[ 4115.100000] pc : [<c03d1fe4>] lr : [<c03d21d4>] psr: 60030093
[ 4115.100000] sp : c4e597e8 ip : c0d72800 fp : 00000001
[ 4115.100000] r10: c4e59848 r9 : c17c4300 r8 : 60030013
[ 4115.100000] r7 : c0d72c58 r6 : c17c4300 r5 : c0d72c88 r4 : c0d72be8
[ 4115.100000] r3 : 00000000 r2 : 00000000 r1 : c0d72c3c r0 : c0d72be8
[ 4115.100000] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
[ 4115.100000] Control: 10c53c7d Table: 20ae0059 DAC: 00000051
[ 4115.100000] Register r0 information: slab kmalloc-2k start c0d72800 pointer offset 1000 size 2048
[ 4115.100000] Register r1 information: slab kmalloc-2k start c0d72800 pointer offset 1084 size 2048
[ 4115.100000] Register r2 information: NULL pointer
[ 4115.100000] Register r3 information: NULL pointer
[ 4115.100000] Register r4 information: slab kmalloc-2k start c0d72800 pointer offset 1000 size 2048
[ 4115.100000] Register r5 information: slab kmalloc-2k start c0d72800 pointer offset 1160 size 2048
[ 4115.100000] Register r6 information: slab task_struct start c17c4300 pointer offset 0
[ 4115.100000] Register r7 information: slab kmalloc-2k start c0d72800 pointer offset 1112 size 2048
[ 4115.100000] Register r8 information: non-paged memory
[ 4115.100000] Register r9 information: slab task_struct start c17c4300 pointer offset 0
[ 4115.100000] Register r10 information: 2-page vmalloc region starting at 0xc4e58000 allocated at kernel_clone+0xb4/0x358
[ 4115.100000] Register r11 information: non-paged memory
[ 4115.100000] Register r12 information: slab kmalloc-2k start c0d72800 pointer offset 0 size 2048
[ 4115.100000] Process cat (pid: 480, stack limit = 0x511e7a27)
[ 4115.100000] Stack: (0xc4e597e8 to 0xc4e5a000)
[ 4115.100000] 97e0: c4e59848 00000001 00000010 4f21ffb9 c0d72be8 c0d72c88
[ 4115.100000] 9800: c17c4300 00000002 00000800 c03d21d4 00000800 c17c4300 c4e59848 4f21ffb9
[ 4115.100000] 9820: 00000010 c0e7d100 00000000 20f28000 00000002 00000800 c17c4300 c0467490
[ 4115.100000] 9840: 00000003 00000000 00000001 c4e5984c c4e5984c 4f21ffb9 c0e6e040 c0f28000
[ 4115.100000] 9860: 00000800 c0e7d100 c090e8bc c090e99c c0b41d18 c0467688 00000002 00000000
[ 4115.100000] 9880: c4e59888 00000000 00000000 c4e59990 c08f4bd4 00000014 00000001 c04677fc
[ 4115.100000] 98a0: c4e59990 c045a918 00000000 00000000 00000000 00000004 c0833744 00000001
[ 4115.100000] 98c0: 00000000 00000000 00000001 c4e59990 c4e59990 c4e59990 00000001 c0832df4
[ 4115.100000] 98e0: c0b41d18 00000000 c4e59984 c0e6e050 c17c4300 c090e904 c4e59990 00000001
[ 4115.100000] 9900: 00000000 c4e59990 00000001 00000000 00000000 c4e59990 00000001 00000000
[ 4115.100000] 9920: c4e59990 00000001 00000000 00000000 c4e59990 00000001 00000000 c4e59990
[ 4115.100000] 9940: 00000000 00000000 00000000 4f21ffb9 00000200 00000000 c0e6e050 c17c4300
[ 4115.100000] 9960: c0f28000 c0464d54 00000800 c0831be0 c4e59ac4 c04595b4 c17c4300 00000000
[ 4115.100000] 9980: c4e59ac4 00000000 c4e59990 00000001 00000002 00000800 c0f28000 4f21ff00
[ 4115.100000] 99a0: 00000000 4f21ffb9 00000000 c0e6e050 00000000 c0f28000 00000000 c0467b38
[ 4115.100000] 99c0: 00000000 c0466dc4 00000000 ffffffff c0e6e050 c0e6e050 00000480 00000000
[ 4115.100000] 99e0: 00000000 c0467b54 00000000 00000000 c4e59ac4 c04569b4 00001030 00000000
[ 4115.100000] 9a00: 00000000 00000000 c4e59a4c c0f28000 00000800 00000001 0001e872 c0e6e12c
[ 4115.100000] 9a20: 00000000 c1704bb0 c0f28000 00000480 00000000 0001e872 00000000 00000000
[ 4115.100000] 9a40: 00000000 00000040 00000000 c0b41d18 00000134 00000000 00000002 00000000
[ 4115.100000] 9a60: 00000000 c0e6e050 c0f67000 0ec38450 00000000 00000134 00000000 00000000
[ 4115.100000] 9a80: c4e59ac4 c0448d8c c4e59ac4 c171f0c0 c0913590 c17c4300 c4e59ac4 c4e59b38
[ 4115.100000] 9aa0: c1704000 0ec38450 00000000 00001030 c0f67000 c0448eac c4e59ac4 c01b6d98
[ 4115.100000] 9ac0: 000000bc 00000000 00001030 00000000 00000000 00000000 00000000 c1704000
[ 4115.100000] 9ae0: 00000000 4f21ffb9 c1410800 c08dcbf8 c0fee000 00001030 00000004 00000761
[ 4115.100000] 9b00: 00018450 c1704000 c0913590 c0473c24 00001030 c4e59b38 c1704000 c0470f9c
[ 4115.100000] 9b20: 0ec38450 00000000 00000000 c17c4300 c0913528 c0b42660 c17c4300 4f21ffb9
[ 4115.100000] 9b40: 4f21ffb9 00000000 c0fee000 c1410800 000006c5 c17c4300 00000000 c1704000
[ 4115.100000] 9b60: 00000761 c047177c 00001030 c02c363c c1630000 00001ec0 00000000 4f21ffb9
[ 4115.100000] 9b80: 00000761 4f21ffb9 c171f0c0 c1410800 00001030 00000000 00017450 000006c5
[ 4115.100000] 9ba0: c0fee000 c1704000 00000000 c047050c 00017450 00001030 00000000 4f21ffb9
[ 4115.100000] 9bc0: 4f21ffb9 c1630000 c1630000 00001030 000006c5 00017450 c17c4300 c4e59d10
[ 4115.100000] 9be0: 000006c5 c02c3050 00001030 00000000 c1630000 c0b3e460 c4e59c98 c1704000
[ 4115.100000] 9c00: c1630000 c02c6ddc 00001030 00000001 00001013 c02c84b0 c1630000 00017450
[ 4115.100000] 9c20: 00001030 00000001 c0be9e00 c0b0ec00 c17c4300 c0100bb8 00000001 4f21ffb9
[ 4115.100000] 9c40: c4e59ca8 c0100bb8 c27c8c20 c1704c50 00000360 29307c32 92b74a1c 4f21ffb9
[ 4115.100000] 9c60: 24da6e32 c1630000 00000000 c4e59d10 c16300e4 00000000 c1704000 00000000
[ 4115.100000] 9c80: c17c4300 c02c8874 20030013 00000801 00000003 c0be9e00 000042eb 2000460c
[ 4115.100000] 9ca0: 00000000 000006c5 00017450 00001030 c17c4300 4f21ffb9 c1704000 c3fca920
[ 4115.100000] 9cc0: 00004f00 c11f7410 c0b3dc20 0000460c c17c4300 000042eb c1704000 c02b88c4
[ 4115.100000] 9ce0: 00000000 c0b57ef4 00000000 c0b3c32c 8af8af8b 00140cca 00352ac1 04f00000
[ 4115.100000] 9d00: c27c9000 c1630000 c17c4300 c3fca920 000042eb 2000460c c0b57ea4 4f21ffb9
[ 4115.100000] 9d20: c4e59d70 c3fca920 c17c4300 c1630000 c19ff600 0000460c 0000460b c11f7410
[ 4115.100000] 9d40: c11f7504 c02b9c90 c0b57ea4 c11f74f8 c3fca920 00000001 00000cc0 c01715d4
[ 4115.100000] 9d60: c4e59dc0 00000001 c17c4300 c3fca93c c11f74fc 4f21ffb9 000c0000 c3fca920
[ 4115.100000] 9d80: c17c4300 c4e59f28 c19ff600 c3fca920 00000000 00004620 c11f7504 c0170364
[ 4115.100000] 9da0: 00000000 00000cc0 c3fca920 0000460c c3fca920 c0171780 c4e59dc0 4f21ffb9
[ 4115.100000] 9dc0: 0034d869 c11f74f8 c4e59e8c c4e59f28 0000460c c3fca920 00000000 c017242c
[ 4115.100000] 9de0: c198aabc c19ff600 c19ff650 c17c4300 00100008 c4e59f10 c0b02800 61c88647
[ 4115.100000] 9e00: c4e59f08 0000b000 c19ff600 c11f74f8 c19ff650 0000460c 00000000 00000000
[ 4115.100000] 9e20: c0b0b9cc 4f21ffb9 00007000 c11f7410 c4e59e90 c4e59f28 0000c000 c4e59f10
[ 4115.100000] 9e40: 00000000 00001000 00000000 c0175ed8 00001000 00000000 04620000 00000000
[ 4115.100000] 9e60: 10b072e6 c4e59f10 c19ff600 00000000 04f00000 00000000 00000000 c11f7410
[ 4115.100000] 9e80: c11f74f8 c17c4300 00000000 00020000 c3fca900 c19ff540 c4e59f08 c01c637c
[ 4115.100000] 9ea0: 0000005e c0100bb8 c4e59f2c 00000000 ffffffe4 00000000 00020000 c19ff600
[ 4115.100000] 9ec0: 00000000 00004004 c4e59f78 4f21ffb9 00020000 00020000 c19ff600 04600000
[ 4115.100000] 9ee0: 00000000 c4e59f78 c17c4300 00020000 0000005e c01bdd74 00020000 4f21ffb9
[ 4115.100000] 9f00: b6ebe000 00020000 b6ebe000 00020000 00000000 0000c000 00014000 c4e59f08
[ 4115.100000] 9f20: 00000001 00000000 c19ff600 00000000 0460c000 00000000 00000000 00000000
[ 4115.100000] 9f40: 00000000 00004004 00000000 00000000 004660b4 4f21ffb9 b6ff5940 c19ff600
[ 4115.100000] 9f60: c19ff600 04600000 00000000 c17c4300 b6ebe000 c01be668 04600000 00000000
[ 4115.100000] 9f80: 00000003 4f21ffb9 004660b4 00020000 b6ff5940 00000003 c01002c4 c17c4300
[ 4115.100000] 9fa0: 00000003 c0100060 004660b4 00020000 00000003 b6ebe000 00020000 00000000
[ 4115.100000] 9fc0: 004660b4 00020000 b6ff5940 00000003 00000003 00020000 00020000 0000005e
[ 4115.100000] 9fe0: 00000003 be907b58 b6f6d187 b6ef9c66 60070030 00000003 00000000 00000000
[ 4115.100000] atc_chain_complete from atc_advance_work+0x78/0x184
[ 4115.100000] atc_advance_work from atmel_nand_dma_transfer+0x114/0x23c
[ 4115.100000] atmel_nand_dma_transfer from atmel_nand_data_in+0xd0/0x108
[ 4115.100000] atmel_nand_data_in from atmel_hsmc_exec_rw+0x34/0x3c
[ 4115.100000] atmel_hsmc_exec_rw from nand_op_parser_exec_op+0x3b0/0x5c4
[ 4115.100000] nand_op_parser_exec_op from nand_read_data_op+0x1a4/0x27c
[ 4115.100000] nand_read_data_op from atmel_nand_pmecc_read_pg.constprop.8+0x68/0xc4
[ 4115.100000] atmel_nand_pmecc_read_pg.constprop.8 from atmel_hsmc_nand_pmecc_read_page+0x1c/0x24
[ 4115.100000] atmel_hsmc_nand_pmecc_read_page from nand_read_oob+0x1bc/0x7c0
[ 4115.100000] nand_read_oob from mtd_read_oob+0x88/0x14c
[ 4115.100000] mtd_read_oob from mtd_read+0x5c/0x80
[ 4115.100000] mtd_read from ubi_io_read+0xe4/0x36c
[ 4115.100000] ubi_io_read from ubi_eba_read_leb+0xd0/0x4e8
[ 4115.100000] ubi_eba_read_leb from ubi_leb_read+0x94/0x100
[ 4115.100000] ubi_leb_read from ubifs_leb_read+0x2c/0x78
[ 4115.100000] ubifs_leb_read from fallible_read_node+0x84/0x27c
[ 4115.100000] fallible_read_node from ubifs_tnc_locate+0xfc/0x1cc
[ 4115.100000] ubifs_tnc_locate from do_readpage+0x19c/0x494
[ 4115.100000] do_readpage from ubifs_readpage+0x48/0x4a4
[ 4115.100000] ubifs_readpage from filemap_read_folio+0x44/0x1fc
[ 4115.100000] filemap_read_folio from filemap_get_pages+0x4cc/0x790
[ 4115.100000] filemap_get_pages from filemap_read+0xcc/0x3bc
[ 4115.100000] filemap_read from vfs_read+0x25c/0x2e4
[ 4115.100000] vfs_read from ksys_read+0xa0/0xd0
[ 4115.100000] ksys_read from ret_fast_syscall+0x0/0x54
[ 4115.100000] Exception stack(0xc4e59fa8 to 0xc4e59ff0)
[ 4115.100000] 9fa0: 004660b4 00020000 00000003 b6ebe000 00020000 00000000
[ 4115.100000] 9fc0: 004660b4 00020000 b6ff5940 00000003 00000003 00020000 00020000 0000005e
[ 4115.100000] 9fe0: 00000003 be907b58 b6f6d187 b6ef9c66
[ 4115.100000] Code: ebf78ba1 e3a03000 e5c43068 eaffffc3 (e7f001f2)
[ 4115.100000] ---[ end trace 0000000000000000 ]---

2022-07-29 20:11:47

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

2022-07-28 at 10:39, [email protected] wrote:
> On 7/28/22 10:45, [email protected] wrote:
>> On 7/13/22 19:01, [email protected] wrote:
>>> Hi, Peter,
>>>
>>> Thanks for the patience. I was still out of office last week,
>>> but now I have some news.
>>>
>>> On 6/27/22 19:53, [email protected] wrote:
>>>> I think these are the last of the less invasive
>>>> changes that I'll try; I'll have to rewrite the logic anyway.
>>>
>>> I've chopped the driver to use virt-dma (check [1]). It's not clean, but
>>> it works and one can see how the logic is changed. Unfortunately the memory
>>> corruption is still present under high load. Maybe it's a coherency problem.
>>> I need more time on it. Will get back to you.
>>>
>>> Cheers,
>>> ta
>>>
>>> [1] To github.com:ambarus/linux-0day.git
>>> a7351e6f4c12..1557e0df0fd0 at-hdmac-virt-dma -> at-hdmac-virt-dma
>>
>> Hi, Peter,
>>
>> Does this [1] one-line patch solve the memory corruption on your side?
>> Even if yes, there are still bugs in at-hdmac that can be squashed by
>> using virt-dma. I'd like to follow up with patches that integrate
>> virt-dma logic in at-hdmac.
>>
>> Cheers,
>> ta
>>
>> [1] https://lore.kernel.org/linux-mtd/[email protected]/T/#u
>
> Hi, Peter,
>
> Looks like I've already caught an oops in the at-hdmac driver when not using virt-dma,
> see below. Would you please test with all the patches from [2] instead of just
> using the patch from [1]? I've run stress tests overnight using [2] and
> everything went fine on my side.
>
> Cheers,
> ta
>
> [2] To github.com:ambarus/linux-0day.git
> * [new branch] at-hdmac-virt-dma-2nd-iteration -> at-hdmac-virt-dma-2nd-iteration

Hi Tudor,

This last one feels very promising! It's been running for a few hours without
incidents, so even if it isn't fixed it's several orders of magnitude better.

I'll leave it running for the night. Fingers crossed...

Cheers,
Peter

2022-07-30 11:59:56

by Peter Rosin

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

2022-07-29 at 22:09, Peter Rosin wrote:
> 2022-07-28 at 10:39, [email protected] wrote:
>> Looks like I've already caught an oops in the at-hdmac driver when not using virt-dma,
>> see below. Would you please test with all the patches from [2] instead of just
>> using the patch from [1]? I've run stress tests overnight using [2] and
>> everything went fine on my side.
>>
>> Cheers,
>> ta
>>
>> [2] To github.com:ambarus/linux-0day.git
>> * [new branch] at-hdmac-virt-dma-2nd-iteration -> at-hdmac-virt-dma-2nd-iteration
>
> Hi Tudor,
>
> This last one feels very promising! It's been running for a few hours without
> incidents, so even if it isn't fixed it's several orders of magnitude better.
>
> I'll leave it running for the night. Fingers crossed...

Reporting that it's still all good and that I think it's time to declare
victory.

Thanks a bunch for your effort!

Looking through the patches on that branch, I suspect not all of them will
be submitted upstream in that exact form. Please let me know when you have
a cleaned-up series so that I can retest and add some Tested-by tags.

Cheers and thanks again,
Peter

2022-07-31 03:50:41

by Tudor Ambarus

Subject: Re: Regression: memory corruption on Atmel SAMA5D31

On 7/30/22 14:37, Peter Rosin wrote:
> 2022-07-29 at 22:09, Peter Rosin wrote:
>> 2022-07-28 at 10:39, [email protected] wrote:
>>> Looks like I've already caught an oops in the at-hdmac driver when not using virt-dma,
>>> see below. Would you please test with all the patches from [2] instead of just
>>> using the patch from [1]? I've run stress tests overnight using [2] and
>>> everything went fine on my side.
>>>
>>> Cheers,
>>> ta
>>>
>>> [2] To github.com:ambarus/linux-0day.git
>>> * [new branch] at-hdmac-virt-dma-2nd-iteration -> at-hdmac-virt-dma-2nd-iteration
>>
>> Hi Tudor,
>>
>> This last one feels very promising! It's been running for a few hours without
>> incidents, so even if it isn't fixed it's several orders of magnitude better.
>>
>> I'll leave it running for the night. Fingers crossed...
>
> Reporting that it's still all good and that I think it's time to declare
> victory.
>
> Thanks a bunch for your effort!
>
> Looking through the patches on that branch, I suspect not all of them will
> be submitted upstream in that exact form. Please let me know when you have

Right, they're just some quick drafts which demonstrate where the problem
resides.

> a cleaned-up series so that I can retest and add some Tested-by tags.

Sure. Will add your Reported-by tag when submitting. Thanks for the detailed
bug report and for the help since then!

>
> Cheers and thanks again,
> Peter


--
Cheers,
ta