Hello,
I've encountered an interesting case of data corruption while accessing
IPU (Image Processing Unit) on i.MX6 (rev1.2, Cortex A9). What I'm doing
here is basically:
openat(AT_FDCWD, "/dev/mem", O_RDWR|O_SYNC) = 3
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x2630000) = ptr
write(1, ptr, 32) = 32
Normally, the write() should end up with:
04008A00 02FF03FF 02FF03FF 00000000 00000000 00000000 00000000 00000000
However, with current kernels, the first 32 bits (the first IPU
register) are dropped:
02FF03FF 02FF03FF 00000000 00000000 00000000 00000000 00000000 00000000
0x2630000 is IPU1 CSI0 address (i.e., a register block). The same
happens with other IPU regions. Writes shorter than 8 * 32 bits are not
affected.
write() uses arm_copy_from_user() and since commit f441882a5229:
ARM: 8812/1: Optimise copy_{from/to}_user for !CPU_USE_DOMAINS
ARMv6+ processors do not use CONFIG_CPU_USE_DOMAINS and use privileged
ldr/str instructions in copy_{from/to}_user. They are currently
unnecessarily using single ldr/str instructions and can use ldm/stm
instructions instead like memcpy does (but with appropriate fixup
tables).
apparently uses 8 * 32-bit ldmia instruction to copy data:
.macro ldr8w ptr reg1 reg2 reg3 reg4 reg5 reg6 reg7 reg8 abort
USERL(\abort, ldmia \ptr!, {\reg1, \reg2, \reg3, \reg4, \reg5, \reg6, \reg7, \reg8})
.endm
Before this commit it used ldr instruction (single 32-bit value) and the
problem didn't show up (reverting f441882a5229 on v5.11 fixes it as
well). The i.MX6 errata doesn't seem to list this problem.
I wonder what the theory says about this case. Is it at all valid to
read 8 IPU registers at a time using LDM instruction? If so, should
something be done with this problem, or should it be left as is?
--
Krzysztof Hałasa
Sieć Badawcza Łukasiewicz
Przemysłowy Instytut Automatyki i Pomiarów PIAP
Al. Jerozolimskie 202, 02-486 Warszawa
On Wed, May 26, 2021 at 10:26:50AM +0200, Krzysztof Hałasa wrote:
> Hello,
>
> I've encountered an interesting case of data corruption while accessing
> IPU (Image Processing Unit) on i.MX6 (rev1.2, Cortex A9). What I'm doing
> here is basically:
>
> openat(AT_FDCWD, "/dev/mem", O_RDWR|O_SYNC) = 3
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x2630000) = ptr
> write(1, ptr, 32) = 32
>
> Normally, the write() should end up with:
> 04008A00 02FF03FF 02FF03FF 00000000 00000000 00000000 00000000 00000000
>
> However, with current kernels, the first 32 bits (the first IPU
> register) are dropped:
> 02FF03FF 02FF03FF 00000000 00000000 00000000 00000000 00000000 00000000
>
> 0x2630000 is IPU1 CSI0 address (i.e., a register block). The same
> happens with other IPU regions. Writes shorter than 8 * 32 bits are not
> affected.
>
> write() uses arm_copy_from_user() and since commit f441882a5229:
> ARM: 8812/1: Optimise copy_{from/to}_user for !CPU_USE_DOMAINS
>
> ARMv6+ processors do not use CONFIG_CPU_USE_DOMAINS and use privileged
> ldr/str instructions in copy_{from/to}_user. They are currently
> unnecessarily using single ldr/str instructions and can use ldm/stm
> instructions instead like memcpy does (but with appropriate fixup
> tables).
>
> apparently uses 8 * 32-bit ldmia instruction to copy data:
> .macro ldr8w ptr reg1 reg2 reg3 reg4 reg5 reg6 reg7 reg8 abort
> USERL(\abort, ldmia \ptr!, {\reg1, \reg2, \reg3, \reg4, \reg5, \reg6, \reg7, \reg8})
> .endm
>
> Before this commit it used ldr instruction (single 32-bit value) and the
> problem didn't show up (reverting f441882a5229 on v5.11 fixes it as
> well). The i.MX6 errata doesn't seem to list this problem.
>
> I wonder what the theory says about this case. Is it at all valid to
> read 8 IPU registers at a time using LDM instruction? If so, should
> something be done with this problem, or should it be left as is?
Surely someone is not using copy_*_user() to copy data from userspace
direct to MMIO space... that would be crazy.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
"Russell King (Oracle)" <[email protected]> writes:
> Surely someone is not using copy_*_user() to copy data from userspace
> direct to MMIO space... that would be crazy.
No, it's the other way around: reading MMIO mapped to userspace (mmap
on /dev/mem) and copying it to simple kernel buffer (e.g. pipe buffer).
I.e., the MMIO is the userspace here (thus copy_from_user()).
--
Krzysztof Hałasa
Sieć Badawcza Łukasiewicz
Przemysłowy Instytut Automatyki i Pomiarów PIAP
Al. Jerozolimskie 202, 02-486 Warszawa
On Wed, May 26, 2021 at 02:29:07PM +0200, Krzysztof Hałasa wrote:
> "Russell King (Oracle)" <[email protected]> writes:
>
> > Surely someone is not using copy_*_user() to copy data from userspace
> > direct to MMIO space... that would be crazy.
>
> No, it's the other way around: reading MMIO mapped to userspace (mmap
> on /dev/mem) and copying it to simple kernel buffer (e.g. pipe buffer).
> I.e., the MMIO is the userspace here (thus copy_from_user()).
Ah. I think we assume copy_from_user() will be used on memory only and
not device mappings.
In any case, looking at the architecture reference manual, LDM is
permitted on device and strongly ordered mappings, and the memory
subsystem is required to decompose it into a series of 32-bit accesses.
So, it sounds to me like there could be a hardware bug in the buses/IPU
causing this.
Can you try using LDM directly inside the kernel and seeing what effect
it has when reading the IPU? A simple test module should be sufficient.
I suspect it'll show the same thing - basically, that using LDM to the
IPU is broken.
Thanks.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
From: Russell King <[email protected]>
> Sent: 26 May 2021 14:19
>
> On Wed, May 26, 2021 at 02:29:07PM +0200, Krzysztof Hałasa wrote:
> > "Russell King (Oracle)" <[email protected]> writes:
> >
> > > Surely someone is not using copy_*_user() to copy data from userspace
> > > direct to MMIO space... that would be crazy.
> >
> > No, it's the other way around: reading MMIO mapped to userspace (mmap
> > on /dev/mem) and copying it to simple kernel buffer (e.g. pipe buffer).
> > I.e., the MMIO is the userspace here (thus copy_from_user()).
>
> Ah. I think we assume copy_from_user() will be used on memory only and
> not device mappings.
>
> In any case, looking at the architecture reference manual, LDM is
> permitted on device and strongly ordered mappings, and the memory
> subsystem is required to decompose it into a series of 32-bit accesses.
> So, it sounds to me like there could be a hardware bug in the buses/IPU
> causing this.
>
> Can you try using LDM directly inside the kernel and seeing what effect
> it has when reading the IPU? A simple test module should be sufficient.
> I suspect it'll show the same thing - basically, that using LDM to the
> IPU is broken.
I was wondering if there is some kind of page fault on the first access?
What happens if you repeat the write() ?
FWIW you don't want to try this on x86.
The MMIO addresses are likely to be uncached but the copy
function is likely to decide to use the ERMS 'rep movsb' so
suddenly you get single byte PCIe reads!
What you really want is the largest AVX register available.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
"Russell King (Oracle)" <[email protected]> writes:
> In any case, looking at the architecture reference manual, LDM is
> permitted on device and strongly ordered mappings, and the memory
> subsystem is required to decompose it into a series of 32-bit accesses.
> So, it sounds to me like there could be a hardware bug in the buses/IPU
> causing this.
It seems so.
I modified the kernel IPU module a bit, initialized a bunch of IPU
registers to known values (1..0xD). Results (from 1 to 13 IPU
registers) obtained with different instructions:
readl(13 consecutive registers): CSI = 1 2 3 4 5 6 7 8 9 A B C D
1 = register #0 and so on - readl() results are obviously correct.
LDM1: 1 (not corrupted)
LDM2: 1 3
LDM3: 1 3 4
LDM4: 2 3 4 4
LDM5: 1 3 4 5 6
LDM6: 1 3 4 5 6 7
LDM7: 1 3 4 5 6 7 8
LDM8: 2 3 4 5 6 7 8 8
LDM9: 1 3 4 5 6 7 8 9 A
LDM10: 1 3 4 5 6 7 8 9 A B
LDM11: 1 3 4 5 6 7 8 9 A B C
LDM12: 1 3 4 5 6 7 8 9 A B C D
The last one uses:
ldm r4, {r1, r2, r3, r4, r5, r6, r7, r8, r9, sl, fp, ip}.
I haven't tested more than 12 registers in one kernel LDMIA instruction.
The results don't depend on the address offset (adding 4, 8 or 12 to the
address doesn't change anything).
The arm_copy_from_user() is a specific case of the same corruption. It
uses a number of PLDs and 8-register LDMIAs (and then possibly LDRs
which don't fail). Each LDMIA ("LDM8") returns again:
LMD8: 2 3 4 5 6 7 8 8
(the same with subsequent LDMIAs: 10 11 12 13 14 15 16 16 and so on).
Summary: it appears all 64-bit and longer LDMIA instructions fail. The
first or the second 32-bit access is skipped (possibly somewhere between
AXI and IPU). In case of 4- and 8-register LDMs, the first (#0) value is
skipped, otherwise, it's the second (#1) value.
Now the PLDs ring a bell:
"ERR003730 ARM: 743623—Bad interaction between a minimum of seven PLDs
and one Non-Cacheable LDM can lead to a deadlock". Looking at the
disassembly I can count 6 PLDs (the first two seem to be the same,
though I don't claim I understand this (source) .s code). Also this
problem happens with IPU and not other devices, so I think it's not
related to this erratum after all.
size_t arm_copy_from_user(void *to, const void *from, size_t n)
... for n = 32 = 8 * 4 bytes:
2c: subs r2, r2, #4 ; = 28
30: blt e4 ; NOP
34: ands ip, r0, #3 ; r0 = destination
38: pld [r1]
3c: bne 108 ; NOP
40: ands ip, r1, #3 ; r1 = address in IPU
44: bne 138 ; NOP
48: subs r2, r2, #28
4c: push {r5, r6, r7, r8}
50: blt 88 ; NOP
54: pld [r1] ; duplicate PLD?
58: subs r2, r2, #0x60
5c: pld [r1, #28]
60: blt 70
64: pld [r1, #0x3c]
68: pld [r1, #0x5c]
6c: pld [r1, #0x7c]
70: ldm r1!, {r3, r4, r5, r6, r7, r8, ip, lr} ; <<<<< fails
I also wonder if STMs may have similar problems - will check.
--
Krzysztof Hałasa
Sieć Badawcza Łukasiewicz
Przemysłowy Instytut Automatyki i Pomiarów PIAP
Al. Jerozolimskie 202, 02-486 Warszawa
On Fri, May 28, 2021 at 12:02:52PM +0200, Krzysztof Hałasa wrote:
> "Russell King (Oracle)" <[email protected]> writes:
>
> > In any case, looking at the architecture reference manual, LDM is
> > permitted on device and strongly ordered mappings, and the memory
> > subsystem is required to decompose it into a series of 32-bit accesses.
> > So, it sounds to me like there could be a hardware bug in the buses/IPU
> > causing this.
>
> It seems so.
>
> I modified the kernel IPU module a bit, initialized a bunch of IPU
> registers to known values (1..0xD). Results (from 1 to 13 IPU
> registers) obtained with different instructions:
>
> readl(13 consecutive registers): CSI = 1 2 3 4 5 6 7 8 9 A B C D
> 1 = register #0 and so on - readl() results are obviously correct.
>
> LDM1: 1 (not corrupted)
> LDM2: 1 3
> LDM3: 1 3 4
> LDM4: 2 3 4 4
> LDM5: 1 3 4 5 6
> LDM6: 1 3 4 5 6 7
> LDM7: 1 3 4 5 6 7 8
> LDM8: 2 3 4 5 6 7 8 8
> LDM9: 1 3 4 5 6 7 8 9 A
> LDM10: 1 3 4 5 6 7 8 9 A B
> LDM11: 1 3 4 5 6 7 8 9 A B C
> LDM12: 1 3 4 5 6 7 8 9 A B C D
That's rather sad, and does look very much like a hardware bug.
The question is what to do about it... there's Linus' "do not break
userspace" edict and that's exactly what this change has done. So I
suppose we're going to have to revert the change and put up with
everything being slightly slower on arm32 than it otherwise would
have been. That probably means we'll end up with almost every kernel
tree out there carrying a revert of the revert to work around the
fact that seemingly NXP broke their hardware - which itself is not
a good idea. I guess we're just going to have to put up with that.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
"Russell King (Oracle)" <[email protected]> writes:
>> LDM12: 1 3 4 5 6 7 8 9 A B C D
>
> That's rather sad, and does look very much like a hardware bug.
>
> The question is what to do about it... there's Linus' "do not break
> userspace" edict and that's exactly what this change has done. So I
> suppose we're going to have to revert the change and put up with
> everything being slightly slower on arm32 than it otherwise would
> have been. That probably means we'll end up with almost every kernel
> tree out there carrying a revert of the revert to work around the
> fact that seemingly NXP broke their hardware - which itself is not
> a good idea. I guess we're just going to have to put up with that.
For userspace, it's quite a corner case, basically development-only -
and I guess there are very few people who will do things like this.
The same problem can manifest itself without any kernel involvement -
it's enough to mmap /dev/mem and use LDM on in completely in userspace.
This can't be fixed - unless we disallow IPU mmap.
Perhaps making sure the bug is clearly documented is better than doing
a partial fix. Ideally NXP should document it in their papers, and we
should add notes to IPU driver code.
The last one - I guess I can do.
--
Krzysztof Hałasa
Sieć Badawcza Łukasiewicz
Przemysłowy Instytut Automatyki i Pomiarów PIAP
Al. Jerozolimskie 202, 02-486 Warszawa
BTW writes (STMIA) to IPU registers are apparently not corrupted, only
LDMIAs are affected - in pure userspace as well.
--
Krzysztof Hałasa
Sieć Badawcza Łukasiewicz
Przemysłowy Instytut Automatyki i Pomiarów PIAP
Al. Jerozolimskie 202, 02-486 Warszawa