2017-07-24 13:07 GMT+02:00 Ard Biesheuvel <[email protected]>:
> On 24 July 2017 at 11:57, Romain Izard <[email protected]> wrote:
>>
>> While upgrading the kernel from 4.9 to 4.12 for a custom board with a
>> Cortex-A5 based CPU, I have encountered a compilation issue that leads to
>> a data abort during the execution of the LZ4 decompression code in
>> zImage.
>>
>> [...]
>>
>> The compilation options are a little different between both cases:
>> The library is built with -O3, whereas the zImage decompressor is built
>> with -O2, -DDISABLE_BRANCH_PROFILING, -fpic, -mno-single-pic-base,
>> -fno-builtin. All other compilation options are shared in both cases.
>>
This is a red herring: the critical option here is '-fno-builtin'. If it is
not set, the bug disappears. It also disappears if we replace it with
'-fno-builtin-putc'. But it only changes the optimizations applied by
the compiler itself, and cannot explain the issue.
Before updating the LZ4 decompressor, the LZ4 header contained specific
code for handling alignment issues, which has been changed.
>> For Linux 4.9, the LZ4 decompressor code is completely different, which
>> explains why the issue appeared when changing kernel versions.
>>
>
> I see some void* to u32* casts in the new code, which makes me think
> that it is perhaps not valid C, and has maybe not been tested on an
> architecture that has stricter alignment requirements than x86?
>
I can reproduce it easily on v4.13 with GCC6.3:
- Configure with allnoconfig
- Enable CONFIG_MMU, CONFIG_KERNEL_LZ4
- Check the generated assembly for arch/arm/boot/compressed/decompress.o:
In the LZ4_decompress_fast function, the memory access after the third
branch uses ldm and stm. This is invalid, as the addresses can be unaligned.
With this configuration, HAVE_EFFICIENT_UNALIGNED_ACCESS is set, but this is
wrong. On 32-bit ARM, the compiler is free to generate LDM or LDRD access
that will always fail on unaligned addresses. In this case, we have two
LDR/STR access to adjascent addresses that appear in inline code. The
get_unaligned functions in "include/linux/unaligned/access_ok.h" cast the
pointers directly as regular 32-bit access, and as those are by default
aligned, the compiler will optimise and combine the access.
If we use the functions from "include/linux/unaligned/le_struct.h", the
get_unaligned() function correctly tells the compiler that the access is
special, and that it should not merge memory access. But we do not fall back
to byte-by-byte access, as the compiler itself knows how to use 32-bit
access when -funaligned-access is set (by default for ARMv7).
The issue is probably hidden by the kernel fault handler in normal kernel
code, but for this case it does nothing as we are working in the boot
decompressor, that cannot use the fault handler. But it should have a
performance inpact.
As a result, this means that HAVE_EFFICIENT_UNALIGNED_ACCESS should not
be set at least in the context of "include/asm-generic/unaligned.h". But
as this option is also used in other places, where it is not related to
the get_unaligned functions, it is not possible to remove it on ARM 32-bit
without further study.
--
Romain Izard
On Mon, Sep 4, 2017 at 6:19 PM, Romain Izard <[email protected]> wrote:
> 2017-07-24 13:07 GMT+02:00 Ard Biesheuvel <[email protected]>:
>> On 24 July 2017 at 11:57, Romain Izard <[email protected]> wrote:
>>>
>>> While upgrading the kernel from 4.9 to 4.12 for a custom board with a
>>> Cortex-A5 based CPU, I have encountered a compilation issue that leads to
>>> a data abort during the execution of the LZ4 decompression code in
>>> zImage.
>>>
>>> [...]
>>>
>>> The compilation options are a little different between both cases:
>>> The library is built with -O3, whereas the zImage decompressor is built
>>> with -O2, -DDISABLE_BRANCH_PROFILING, -fpic, -mno-single-pic-base,
>>> -fno-builtin. All other compilation options are shared in both cases.
>>>
>
> This is a red herring: the critical option here is '-fno-builtin'. If it is
> not set, the bug disappears. It also disappears if we replace it with
> '-fno-builtin-putc'. But it only changes the optimizations applied by
> the compiler itself, and cannot explain the issue.
>
> Before updating the LZ4 decompressor, the LZ4 header contained specific
> code for handling alignment issues, which has been changed.
>
>>> For Linux 4.9, the LZ4 decompressor code is completely different, which
>>> explains why the issue appeared when changing kernel versions.
>>>
>>
>> I see some void* to u32* casts in the new code, which makes me think
>> that it is perhaps not valid C, and has maybe not been tested on an
>> architecture that has stricter alignment requirements than x86?
>>
>
> I can reproduce it easily on v4.13 with GCC6.3:
> - Configure with allnoconfig
> - Enable CONFIG_MMU, CONFIG_KERNEL_LZ4
> - Check the generated assembly for arch/arm/boot/compressed/decompress.o:
> In the LZ4_decompress_fast function, the memory access after the third
> branch uses ldm and stm. This is invalid, as the addresses can be unaligned.
>
> With this configuration, HAVE_EFFICIENT_UNALIGNED_ACCESS is set, but this is
> wrong. On 32-bit ARM, the compiler is free to generate LDM or LDRD access
> that will always fail on unaligned addresses. In this case, we have two
> LDR/STR access to adjascent addresses that appear in inline code. The
> get_unaligned functions in "include/linux/unaligned/access_ok.h" cast the
> pointers directly as regular 32-bit access, and as those are by default
> aligned, the compiler will optimise and combine the access.
>
> If we use the functions from "include/linux/unaligned/le_struct.h", the
> get_unaligned() function correctly tells the compiler that the access is
> special, and that it should not merge memory access. But we do not fall back
> to byte-by-byte access, as the compiler itself knows how to use 32-bit
> access when -funaligned-access is set (by default for ARMv7).
Right, I've come across this in the past as well.
> The issue is probably hidden by the kernel fault handler in normal kernel
> code, but for this case it does nothing as we are working in the boot
> decompressor, that cannot use the fault handler. But it should have a
> performance inpact.
>
> As a result, this means that HAVE_EFFICIENT_UNALIGNED_ACCESS should not
> be set at least in the context of "include/asm-generic/unaligned.h". But
> as this option is also used in other places, where it is not related to
> the get_unaligned functions, it is not possible to remove it on ARM 32-bit
> without further study.
This is a patch I prototyped in the past https://pastebin.com/apPTPXys
I'm not entirely sure if this produces good object code with all compilers
on all architectures, but it should solve the problem you observed and
more.
Arnd
On 6 September 2017 at 21:57, Arnd Bergmann <[email protected]> wrote:
> On Mon, Sep 4, 2017 at 6:19 PM, Romain Izard <[email protected]> wrote:
>> 2017-07-24 13:07 GMT+02:00 Ard Biesheuvel <[email protected]>:
>>> On 24 July 2017 at 11:57, Romain Izard <[email protected]> wrote:
>>>>
>>>> While upgrading the kernel from 4.9 to 4.12 for a custom board with a
>>>> Cortex-A5 based CPU, I have encountered a compilation issue that leads to
>>>> a data abort during the execution of the LZ4 decompression code in
>>>> zImage.
>>>>
>>>> [...]
>>>>
>>>> The compilation options are a little different between both cases:
>>>> The library is built with -O3, whereas the zImage decompressor is built
>>>> with -O2, -DDISABLE_BRANCH_PROFILING, -fpic, -mno-single-pic-base,
>>>> -fno-builtin. All other compilation options are shared in both cases.
>>>>
>>
>> This is a red herring: the critical option here is '-fno-builtin'. If it is
>> not set, the bug disappears. It also disappears if we replace it with
>> '-fno-builtin-putc'. But it only changes the optimizations applied by
>> the compiler itself, and cannot explain the issue.
>>
>> Before updating the LZ4 decompressor, the LZ4 header contained specific
>> code for handling alignment issues, which has been changed.
>>
>>>> For Linux 4.9, the LZ4 decompressor code is completely different, which
>>>> explains why the issue appeared when changing kernel versions.
>>>>
>>>
>>> I see some void* to u32* casts in the new code, which makes me think
>>> that it is perhaps not valid C, and has maybe not been tested on an
>>> architecture that has stricter alignment requirements than x86?
>>>
>>
>> I can reproduce it easily on v4.13 with GCC6.3:
>> - Configure with allnoconfig
>> - Enable CONFIG_MMU, CONFIG_KERNEL_LZ4
>> - Check the generated assembly for arch/arm/boot/compressed/decompress.o:
>> In the LZ4_decompress_fast function, the memory access after the third
>> branch uses ldm and stm. This is invalid, as the addresses can be unaligned.
>>
>> With this configuration, HAVE_EFFICIENT_UNALIGNED_ACCESS is set, but this is
>> wrong. On 32-bit ARM, the compiler is free to generate LDM or LDRD access
>> that will always fail on unaligned addresses. In this case, we have two
>> LDR/STR access to adjascent addresses that appear in inline code. The
>> get_unaligned functions in "include/linux/unaligned/access_ok.h" cast the
>> pointers directly as regular 32-bit access, and as those are by default
>> aligned, the compiler will optimise and combine the access.
>>
>> If we use the functions from "include/linux/unaligned/le_struct.h", the
>> get_unaligned() function correctly tells the compiler that the access is
>> special, and that it should not merge memory access. But we do not fall back
>> to byte-by-byte access, as the compiler itself knows how to use 32-bit
>> access when -funaligned-access is set (by default for ARMv7).
>
> Right, I've come across this in the past as well.
>
>> The issue is probably hidden by the kernel fault handler in normal kernel
>> code, but for this case it does nothing as we are working in the boot
>> decompressor, that cannot use the fault handler. But it should have a
>> performance inpact.
>>
>> As a result, this means that HAVE_EFFICIENT_UNALIGNED_ACCESS should not
>> be set at least in the context of "include/asm-generic/unaligned.h". But
>> as this option is also used in other places, where it is not related to
>> the get_unaligned functions, it is not possible to remove it on ARM 32-bit
>> without further study.
>
> This is a patch I prototyped in the past https://pastebin.com/apPTPXys
>
> I'm not entirely sure if this produces good object code with all compilers
> on all architectures, but it should solve the problem you observed and
> more.
>
HAVE_EFFICIENT_UNALIGNED_ACCESS only affects explicit unaligned
accesses, and selects between fixups in hardware or in software.
AFAICT the issue here is implicit unaligned accesses, where char
pointers are passed as u32 * arguments.
It seems to me that the updated decompression code simply violates the
C spec in ways that you get away with on x86. On ARM, we fix u
unaligned accesses in software (on *all* arch revisions, given that we
are dealing with ldm/stm instructions here), and in the decompressor,
you either get a crash or incorrect data, depending on the version of
the CPU and its current setting for handling of unaligned accesses.
I guess this means we should revert the LZ4 update, or blacklist it
for ARM. Fiddling with HAVE_EFFICIENT_UNALIGNED_ACCESS is not going to
help afaict.
On Thu, Sep 7, 2017 at 12:23 AM, Ard Biesheuvel
<[email protected]> wrote:
> On 6 September 2017 at 21:57, Arnd Bergmann <[email protected]> wrote:
>> On Mon, Sep 4, 2017 at 6:19 PM, Romain Izard <[email protected]> wrote:
>
> HAVE_EFFICIENT_UNALIGNED_ACCESS only affects explicit unaligned
> accesses, and selects between fixups in hardware or in software.
> AFAICT the issue here is implicit unaligned accesses, where char
> pointers are passed as u32 * arguments.
The problem with include/linux/unaligned/access_ok.h is that it
converts pointers
that are known by the caller to be potentially unaligned and accesses them as if
they were aligned. This means we require a software fixup through the
trap handler
on ARM in cases that the compiler already knows how to handle correctly when
using linux/unaligned/le_struct.h. On ARMv7 this means it ends up using normal
load/store instructures but not the ldm/stm or ldrd/stdr instructions
that are not
allowed on unaligned pointers.
Doing that solves the problem that Romain ran into and also makes other
code much more efficient on ARMv7.
Arnd
On 6 September 2017 at 23:38, Arnd Bergmann <[email protected]> wrote:
> On Thu, Sep 7, 2017 at 12:23 AM, Ard Biesheuvel
> <[email protected]> wrote:
>> On 6 September 2017 at 21:57, Arnd Bergmann <[email protected]> wrote:
>>> On Mon, Sep 4, 2017 at 6:19 PM, Romain Izard <[email protected]> wrote:
>>
>> HAVE_EFFICIENT_UNALIGNED_ACCESS only affects explicit unaligned
>> accesses, and selects between fixups in hardware or in software.
>> AFAICT the issue here is implicit unaligned accesses, where char
>> pointers are passed as u32 * arguments.
>
> The problem with include/linux/unaligned/access_ok.h is that it
> converts pointers
> that are known by the caller to be potentially unaligned and accesses them as if
> they were aligned. This means we require a software fixup through the
> trap handler
> on ARM in cases that the compiler already knows how to handle correctly when
> using linux/unaligned/le_struct.h. On ARMv7 this means it ends up using normal
> load/store instructures but not the ldm/stm or ldrd/stdr instructions
> that are not
> allowed on unaligned pointers.
>
Ah ok, I missed that part. The distinction between ldr/str and
ldm/stm/ldrd is a bit fiddly, but if we can solve this using C code, I
am all for it.
> Doing that solves the problem that Romain ran into and also makes other
> code much more efficient on ARMv7.
>
It is not entirely clear to me why casting to a pointer-to-struct type
makes any difference here. Is it simply because of the __packed
attribute?
Anyway, the issue I spotted in the LZ4 code did not use unaligned
accessors at all, so we must be talking about different things here.
But perhaps the solution there is to simply update that code to use
these accessors in places where such casts are being done. If we then
compile the decompressor with -mno-unaligned-access (which we should
be doing already in any case), these issues should be eliminated.
On Thu, Sep 7, 2017 at 12:48 AM, Ard Biesheuvel
<[email protected]> wrote:
> On 6 September 2017 at 23:38, Arnd Bergmann <[email protected]> wrote:
>> On Thu, Sep 7, 2017 at 12:23 AM, Ard Biesheuvel
>> <[email protected]> wrote:
>>> On 6 September 2017 at 21:57, Arnd Bergmann <[email protected]> wrote:
>>>> On Mon, Sep 4, 2017 at 6:19 PM, Romain Izard <[email protected]> wrote:
>>>
>>> HAVE_EFFICIENT_UNALIGNED_ACCESS only affects explicit unaligned
>>> accesses, and selects between fixups in hardware or in software.
>>> AFAICT the issue here is implicit unaligned accesses, where char
>>> pointers are passed as u32 * arguments.
>>
>> The problem with include/linux/unaligned/access_ok.h is that it
>> converts pointers
>> that are known by the caller to be potentially unaligned and accesses them as if
>> they were aligned. This means we require a software fixup through the
>> trap handler
>> on ARM in cases that the compiler already knows how to handle correctly when
>> using linux/unaligned/le_struct.h. On ARMv7 this means it ends up using normal
>> load/store instructures but not the ldm/stm or ldrd/stdr instructions
>> that are not
>> allowed on unaligned pointers.
>>
>
> Ah ok, I missed that part. The distinction between ldr/str and
> ldm/stm/ldrd is a bit fiddly, but if we can solve this using C code, I
> am all for it.
>
>> Doing that solves the problem that Romain ran into and also makes other
>> code much more efficient on ARMv7.
>>
>
> It is not entirely clear to me why casting to a pointer-to-struct type
> makes any difference here. Is it simply because of the __packed
> attribute?
The problem is code like
struct twoint {
int a; int b;
};
void __noinline access_unaligned_8bytes(struct twoint *s, int a, int b)
{
put_unaligned(a, &s->a);
put_unaligned(b, &s->b);
}
int caller(char *c, int offset, int a, int b)
{
access_unaligned_8bytes((void *)c + offset, a, b);
}
With include/linux/unaligned/access_ok.h, this turns into two stores
that gcc can combine into a single 'strd' or 'stm'. With the
linux/unaligned/le_struct.h version, gcc knows that the pointer
may be unaligned, so it will use instructions that it knows are
safe, either byte accesses (on armv5 and earlier) or normal
str (on armv6+).
> Anyway, the issue I spotted in the LZ4 code did not use unaligned
> accessors at all, so we must be talking about different things here.
I see lots of unaligned helpers in the lz4 code, is this not what
we hit?
$ git grep unaligned lib/
lib/lz4/lz4_compress.c:#include <asm/unaligned.h>
lib/lz4/lz4_decompress.c:#include <asm/unaligned.h>
lib/lz4/lz4defs.h:#include <asm/unaligned.h>
lib/lz4/lz4defs.h: return get_unaligned((const U16 *)ptr);
lib/lz4/lz4defs.h: return get_unaligned((const U32 *)ptr);
lib/lz4/lz4defs.h: return get_unaligned((const size_t *)ptr);
lib/lz4/lz4defs.h: put_unaligned(value, (U16 *)memPtr);
Arnd
On 7 September 2017 at 00:18, Arnd Bergmann <[email protected]> wrote:
> On Thu, Sep 7, 2017 at 12:48 AM, Ard Biesheuvel
> <[email protected]> wrote:
>> On 6 September 2017 at 23:38, Arnd Bergmann <[email protected]> wrote:
>>> On Thu, Sep 7, 2017 at 12:23 AM, Ard Biesheuvel
>>> <[email protected]> wrote:
>>>> On 6 September 2017 at 21:57, Arnd Bergmann <[email protected]> wrote:
>>>>> On Mon, Sep 4, 2017 at 6:19 PM, Romain Izard <[email protected]> wrote:
>>>>
>>>> HAVE_EFFICIENT_UNALIGNED_ACCESS only affects explicit unaligned
>>>> accesses, and selects between fixups in hardware or in software.
>>>> AFAICT the issue here is implicit unaligned accesses, where char
>>>> pointers are passed as u32 * arguments.
>>>
>>> The problem with include/linux/unaligned/access_ok.h is that it
>>> converts pointers
>>> that are known by the caller to be potentially unaligned and accesses them as if
>>> they were aligned. This means we require a software fixup through the
>>> trap handler
>>> on ARM in cases that the compiler already knows how to handle correctly when
>>> using linux/unaligned/le_struct.h. On ARMv7 this means it ends up using normal
>>> load/store instructures but not the ldm/stm or ldrd/stdr instructions
>>> that are not
>>> allowed on unaligned pointers.
>>>
>>
>> Ah ok, I missed that part. The distinction between ldr/str and
>> ldm/stm/ldrd is a bit fiddly, but if we can solve this using C code, I
>> am all for it.
>>
>>> Doing that solves the problem that Romain ran into and also makes other
>>> code much more efficient on ARMv7.
>>>
>>
>> It is not entirely clear to me why casting to a pointer-to-struct type
>> makes any difference here. Is it simply because of the __packed
>> attribute?
>
> The problem is code like
>
> struct twoint {
> int a; int b;
> };
> void __noinline access_unaligned_8bytes(struct twoint *s, int a, int b)
> {
> put_unaligned(a, &s->a);
> put_unaligned(b, &s->b);
> }
> int caller(char *c, int offset, int a, int b)
> {
> access_unaligned_8bytes((void *)c + offset, a, b);
> }
>
> With include/linux/unaligned/access_ok.h, this turns into two stores
> that gcc can combine into a single 'strd' or 'stm'. With the
> linux/unaligned/le_struct.h version, gcc knows that the pointer
> may be unaligned, so it will use instructions that it knows are
> safe, either byte accesses (on armv5 and earlier) or normal
> str (on armv6+).
>
>> Anyway, the issue I spotted in the LZ4 code did not use unaligned
>> accessors at all, so we must be talking about different things here.
>
> I see lots of unaligned helpers in the lz4 code, is this not what
> we hit?
>
> $ git grep unaligned lib/
> lib/lz4/lz4_compress.c:#include <asm/unaligned.h>
> lib/lz4/lz4_decompress.c:#include <asm/unaligned.h>
> lib/lz4/lz4defs.h:#include <asm/unaligned.h>
> lib/lz4/lz4defs.h: return get_unaligned((const U16 *)ptr);
> lib/lz4/lz4defs.h: return get_unaligned((const U32 *)ptr);
> lib/lz4/lz4defs.h: return get_unaligned((const size_t *)ptr);
> lib/lz4/lz4defs.h: put_unaligned(value, (U16 *)memPtr);
>
Yes, you are right. The code I looked at before does cast a char* to a
U32*, but it is in the compression path, so it has nothing to do with
this issue.
So I agree that access_ok.h is unsuitable for any 32-bit ARM core, and
we should be using the struct version instead. My only remaining
question is why we need access_ok.h in the first place: it is worth a
try to check whether both produce the same code on AArch64.
On Thu, Sep 7, 2017 at 1:31 AM, Ard Biesheuvel
<[email protected]> wrote:
> On 7 September 2017 at 00:18, Arnd Bergmann <[email protected]> wrote:
>> On Thu, Sep 7, 2017 at 12:48 AM, Ard Biesheuvel
>> I see lots of unaligned helpers in the lz4 code, is this not what
>> we hit?
>>
>> $ git grep unaligned lib/
>> lib/lz4/lz4_compress.c:#include <asm/unaligned.h>
>> lib/lz4/lz4_decompress.c:#include <asm/unaligned.h>
>> lib/lz4/lz4defs.h:#include <asm/unaligned.h>
>> lib/lz4/lz4defs.h: return get_unaligned((const U16 *)ptr);
>> lib/lz4/lz4defs.h: return get_unaligned((const U32 *)ptr);
>> lib/lz4/lz4defs.h: return get_unaligned((const size_t *)ptr);
>> lib/lz4/lz4defs.h: put_unaligned(value, (U16 *)memPtr);
>>
>
> Yes, you are right. The code I looked at before does cast a char* to a
> U32*, but it is in the compression path, so it has nothing to do with
> this issue.
>
> So I agree that access_ok.h is unsuitable for any 32-bit ARM core, and
> we should be using the struct version instead. My only remaining
> question is why we need access_ok.h in the first place: it is worth a
> try to check whether both produce the same code on AArch64.
It's been a while since I looked into this problem, but from my memory,
it turned out rather hard to analyze single files after the change, as
gcc inlining decisions and register allocation tend to be non-deterministic.
However, my conclusion then was that those changes are rather random,
usually no effect, sometimes better and sometimes worse by chance,
with the only real differences being the few cases we avoid the ldrd/ldm/...
instructions.
I have no idea which compiler version I tried back then, so it's very
possible that some older compilers actually do produce slightly worse
code with the struct version, the question is what the oldest compiler
is that we care about enough to investigate. Maybe gcc-4.8?
Arnd