2020-08-19 00:11:07

by Arvind Sankar

Subject: [PATCH] lib/string.c: Disable tree-loop-distribute-patterns

gcc can transform the loop in a naive implementation of memset/memcpy
etc into a call to the function itself. This optimization is enabled by
-ftree-loop-distribute-patterns.

This has been the case for a while (see eg [0]), but gcc-10.x enables
this option at -O2 rather than -O3 as in previous versions.

Add -ffreestanding, which implicitly disables this optimization with
gcc. It is unclear whether clang performs such optimizations, but
hopefully it will also not do so in a freestanding environment.

This by itself is insufficient for gcc if the optimization was
explicitly enabled by CFLAGS, so also add a flag to explicitly disable
it.

[0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888

Signed-off-by: Arvind Sankar <[email protected]>
---
lib/Makefile | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/Makefile b/lib/Makefile
index e290fc5707ea..80edea49613f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -15,11 +15,18 @@ KCOV_INSTRUMENT_debugobjects.o := n
KCOV_INSTRUMENT_dynamic_debug.o := n
KCOV_INSTRUMENT_fault-inject.o := n

+# string.o implements standard library functions like memset/memcpy etc.
+# Use -ffreestanding to ensure that the compiler does not try to "optimize"
+# them into calls to themselves.
+# The optimization pass that does such transformations in gcc is
+# tree-loop-distribute-patterns. Explicitly disable it just in case.
+CFLAGS_string.o := -ffreestanding $(call cc-option,-fno-tree-loop-distribute-patterns)
+
# Early boot use of cmdline, don't instrument it
ifdef CONFIG_AMD_MEM_ENCRYPT
KASAN_SANITIZE_string.o := n

-CFLAGS_string.o := -fno-stack-protector
+CFLAGS_string.o += -fno-stack-protector
endif

# Used by KCSAN while enabled, avoid recursion.
--
2.26.2
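
The transformation described in the patch description can be illustrated with a minimal sketch. The function name naive_memset is hypothetical and stands in for a byte-at-a-time memset like the generic one in lib/string.c; it is not the kernel's code.

```c
#include <stddef.h>

/*
 * Hypothetical byte-at-a-time fill, named naive_memset for this sketch
 * only. It mirrors the shape of a simple generic memset.
 */
void *naive_memset(void *s, int c, size_t count)
{
	char *xs = s;

	/*
	 * A loop of this shape is what gcc's loop-distribution pass can
	 * recognise as a memset idiom and replace with a call to memset().
	 * If the function were itself called memset, that generated call
	 * would recurse into the same function.
	 */
	while (count--)
		*xs++ = (char)c;
	return s;
}
```

One way to check a given compiler, assuming gcc at -O2, is to build this with -S and look for a call to memset in the output, then repeat the build with -ffreestanding added and compare.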


2020-08-19 00:47:27

by Linus Torvalds

Subject: Re: [PATCH] lib/string.c: Disable tree-loop-distribute-patterns

On Tue, Aug 18, 2020 at 4:43 PM Arvind Sankar <[email protected]> wrote:
>
> This by itself is insufficient for gcc if the optimization was
> explicitly enabled by CFLAGS, so also add a flag to explicitly disable
> it.

Using -fno-tree-loop-distribute-patterns seems to really be a bit too
incestuous with internal compiler knowledge.

That generic memcpy implementation is horrible anyway. It should never be used.

So I'd rather see this either removed entirely, or possibly rewritten
to be a somewhat proper memcpy implementation, and in the process made
to not be recognizable by the compiler (possibly by adding a dummy
barrier() or something like that).

Looking at the implementation of "strscpy()" in the same file, and
then comparing that to the ludicrously simplistic "memcpy()", I
really get the feeling that that memcpy() is not worth having.

Linus
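
The "dummy barrier()" idea above might look roughly like the sketch below. The names sketch_barrier and sketch_memcpy are hypothetical, the empty asm statement is a GNU C extension, and whether it defeats loop-idiom recognition in every compiler version is an assumption rather than something the thread confirms.

```c
#include <stddef.h>

/*
 * Compiler-only barrier, similar in spirit to the kernel's barrier()
 * macro. Hypothetical name; relies on GNU C inline asm (gcc, clang).
 */
#define sketch_barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical byte-wise copy that is no longer a pure copy loop. */
void *sketch_memcpy(void *dest, const void *src, size_t count)
{
	char *d = dest;
	const char *s = src;

	while (count--) {
		*d++ = *s++;
		/*
		 * The empty asm with a "memory" clobber tells the compiler
		 * that memory may be observed or changed here, so the loop
		 * body no longer matches a plain memcpy pattern.
		 */
		sketch_barrier();
	}
	return dest;
}
```

The barrier emits no instructions, but it also blocks other optimizations of the loop, so this trades speed for predictability.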

2020-08-19 03:06:15

by Arvind Sankar

Subject: Re: [PATCH] lib/string.c: Disable tree-loop-distribute-patterns

On Tue, Aug 18, 2020 at 05:44:03PM -0700, Linus Torvalds wrote:
> On Tue, Aug 18, 2020 at 4:43 PM Arvind Sankar <[email protected]> wrote:
> >
> > This by itself is insufficient for gcc if the optimization was
> > explicitly enabled by CFLAGS, so also add a flag to explicitly disable
> > it.
>
> Using -fno-tree-loop-distribute-patterns seems to really be a bit too
> incestuous with internal compiler knowledge.

Fair enough -- you ok with just the -ffreestanding? That's what protects
the memset in arch/x86/boot/compressed/string.c.

I think this is worthwhile to be safe.

>
> That generic memcpy implementation is horrible anyway. It should never be used.
>
> So I'd rather see this either removed entirely, or possibly rewritten
> to be a somewhat proper memcpy implementation, and in the process made
> to not be recognizable by the compiler (possibly by adding a dummy
> barrier() or something like that).
>
> Looking at the implementation of "strscpy()" in the same file, and
> then comparing that to the ludicrously simplistic "memcpy()", I
> really get the feeling that that memcpy() is not worth having.
>
> Linus

I don't think anything actually uses the generic memcpy, and I think
only c6x uses the generic memset. Might be worth optimizing strnlen etc
with the word-at-a-time thing though.
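
For readers unfamiliar with the "word-at-a-time thing", the core trick is a branch-free test for a zero byte anywhere in a machine word. The sketch below shows that test for a 64-bit word; sketch_has_zero, ONES and HIGHS are hypothetical names for illustration and do not reproduce the kernel's own word-at-a-time helpers.

```c
#include <stdint.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/*
 * Returns nonzero if and only if at least one of the eight bytes in
 * 'word' is 0x00. Subtracting 1 from each byte sets its high bit when
 * the byte was zero (or already >= 0x81); masking with ~word removes
 * the bytes whose high bit was set to begin with.
 */
static inline uint64_t sketch_has_zero(uint64_t word)
{
	return (word - ONES) & ~word & HIGHS;
}
```

A string routine built on this would load eight bytes per iteration and fall back to byte-wise handling only for the word in which a zero byte is detected; alignment and reads past the end of the buffer need separate care and are not covered here.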

2020-08-19 03:44:26

by Linus Torvalds

Subject: Re: [PATCH] lib/string.c: Disable tree-loop-distribute-patterns

On Tue, Aug 18, 2020 at 8:04 PM Arvind Sankar <[email protected]> wrote:
>
> On Tue, Aug 18, 2020 at 05:44:03PM -0700, Linus Torvalds wrote:
> > Using -fno-tree-loop-distribute-patterns seems to really be a bit too
> > incestuous with internal compiler knowledge.
>
> Fair enough -- you ok with just the -ffreestanding? That's what protects
> the memset in arch/x86/boot/compressed/string.c.

Yeah, I think -ffreestanding makes sense. It may not be optimal, but
it doesn't smell wrong to me.

> > Looking at the implementation of "strscpy()" in the same file, and
> > then comparing that to the ludicrously simplistic "memcpy()", I
> > really get the feeling that that memcpy() is not worth having.
>
> I don't think anything actually uses the generic memcpy, and I think
> only c6x uses the generic memset.

I do think maybe we should just remove the generic memcpy and memset
and say "hey people, you really do need to implement your own".

Even if you don't have this "recognize and recurse" issue, you end up
having other issues like just tracing etc. Yeah, we've hopefully
turned everything like that off, but considering that no major
architecture uses this, I wonder how many small details we've missed
with ftrace recursion etc?

> Might be worth optimizing strnlen etc with the word-at-a-time thing though.

Yeah, possibly. Except the kernel almost never uses strnlen for
anything bigger. At least I haven't seen it very much in the profiles.

The "strncpy_from_user()" stuff shows up like a sore thumb on some
loads (lots and lots of strings from user space for pathnames and
execve), but the kernel itself tends to seldom deal a lot with any
longer strings. Stuff like device names etc, I guess, but any time I
see string handling in profiles, it tends to be in user space (GNU
make spends all of its time in string handling, it sometimes seems).

Of course, that may be just me looking at very particular profiles, so
maybe I've just not seen the loads where the kernel strnlen matters.

memcpy and memset? Those matter. A lot.

Linus

2020-08-19 13:25:47

by Arvind Sankar

Subject: Re: [PATCH] lib/string.c: Disable tree-loop-distribute-patterns

On Tue, Aug 18, 2020 at 08:32:58PM -0700, Linus Torvalds wrote:
> On Tue, Aug 18, 2020 at 8:04 PM Arvind Sankar <[email protected]> wrote:
>
> > Might be worth optimizing strnlen etc with the word-at-a-time thing though.
>
> Yeah, possibly. Except the kernel almost never uses strnlen for
> anything bigger. At least I haven't seen it very much in the profiles.

strscpy could be implemented as strnlen+memcpy. I'd think that wouldn't
be much slower, especially if strnlen is optimized and the arch has a
good implementation of memcpy?
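
A userspace sketch of that idea, under stated assumptions: sketch_strscpy and sketch_strnlen are hypothetical names, the bounded length is computed with memchr() as a portable stand-in for strnlen(), and the return convention (length on success, -E2BIG on truncation) follows the documented behaviour of the kernel's strscpy(). It is not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for strnlen(): length of s, but never more than max. */
static size_t sketch_strnlen(const char *s, size_t max)
{
	const char *end = memchr(s, '\0', max);

	return end ? (size_t)(end - s) : max;
}

/*
 * Copy src into dest (capacity 'count' bytes), always NUL-terminating
 * when count > 0. Returns the number of characters copied, or -E2BIG
 * if the string had to be truncated or count was zero.
 */
long sketch_strscpy(char *dest, const char *src, size_t count)
{
	size_t len;

	if (count == 0)
		return -E2BIG;

	len = sketch_strnlen(src, count);
	if (len == count) {
		/* No NUL within 'count' bytes: copy what fits and terminate. */
		memcpy(dest, src, count - 1);
		dest[count - 1] = '\0';
		return -E2BIG;
	}

	/* The string fits: copy it together with its terminating NUL. */
	memcpy(dest, src, len + 1);
	return (long)len;
}
```

The cost of this shape is that the source is traversed twice, once to find the length and once to copy, so whether it is competitive depends on how fast the length scan and memcpy are on a given architecture.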

2020-08-19 14:11:46

by Arvind Sankar

Subject: [PATCH v2] lib/string.c: Use freestanding environment

gcc can transform the loop in a naive implementation of memset/memcpy
etc into a call to the function itself. This optimization is enabled by
-ftree-loop-distribute-patterns.

This has been the case for a while (see eg [0]), but gcc-10.x enables
this option at -O2 rather than -O3 as in previous versions.

Add -ffreestanding, which implicitly disables this optimization with
gcc. It is unclear whether clang performs such optimizations, but
hopefully it will also not do so in a freestanding environment.

[0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888

Signed-off-by: Arvind Sankar <[email protected]>
---
lib/Makefile | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/lib/Makefile b/lib/Makefile
index e290fc5707ea..a4a4c6864f51 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -15,11 +15,16 @@ KCOV_INSTRUMENT_debugobjects.o := n
KCOV_INSTRUMENT_dynamic_debug.o := n
KCOV_INSTRUMENT_fault-inject.o := n

+# string.o implements standard library functions like memset/memcpy etc.
+# Use -ffreestanding to ensure that the compiler does not try to "optimize"
+# them into calls to themselves.
+CFLAGS_string.o := -ffreestanding
+
# Early boot use of cmdline, don't instrument it
ifdef CONFIG_AMD_MEM_ENCRYPT
KASAN_SANITIZE_string.o := n

-CFLAGS_string.o := -fno-stack-protector
+CFLAGS_string.o += -fno-stack-protector
endif

# Used by KCSAN while enabled, avoid recursion.
--
2.26.2

2020-08-19 18:38:30

by Nick Desaulniers

Subject: Re: [PATCH v2] lib/string.c: Use freestanding environment

On Wed, Aug 19, 2020 at 7:08 AM Arvind Sankar <[email protected]> wrote:
>
> gcc can transform the loop in a naive implementation of memset/memcpy
> etc into a call to the function itself. This optimization is enabled by
> -ftree-loop-distribute-patterns.
>
> This has been the case for a while (see eg [0]), but gcc-10.x enables
> this option at -O2 rather than -O3 as in previous versions.
>
> Add -ffreestanding, which implicitly disables this optimization with
> gcc. It is unclear whether clang performs such optimizations, but
> hopefully it will also not do so in a freestanding environment.
>
> [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888
>
> Signed-off-by: Arvind Sankar <[email protected]>

For Clang:
For x86_64 defconfig:
This results in no change for the code generated.

For aarch64 defconfig:
This results in calls to bcmp being replaced with calls to memcmp in
strstr and strnstr. I plan on adding -fno-builtin-bcmp then removing
bcmp anyways. Not a bug either way, just noting the difference in
disassembly.

For arm defconfig:
This results in no change for the code generated.

I should check the other architectures we support, but my local build
doesn't have all backends enabled currently; we'll catch it once it's
being tested in -next if it's an issue, but I don't foresee it
(knocks on wood, famous last words, ...)

If it helps GCC not optimize these core functions into infinite
recursion, I'm for that, especially since I'd bet these get called
frequently and early on in boot, which is my least favorite time to
debug.

Reviewed-by: Nick Desaulniers <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>

> ---
> lib/Makefile | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/lib/Makefile b/lib/Makefile
> index e290fc5707ea..a4a4c6864f51 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -15,11 +15,16 @@ KCOV_INSTRUMENT_debugobjects.o := n
> KCOV_INSTRUMENT_dynamic_debug.o := n
> KCOV_INSTRUMENT_fault-inject.o := n
>
> +# string.o implements standard library functions like memset/memcpy etc.
> +# Use -ffreestanding to ensure that the compiler does not try to "optimize"
> +# them into calls to themselves.
> +CFLAGS_string.o := -ffreestanding
> +
> # Early boot use of cmdline, don't instrument it
> ifdef CONFIG_AMD_MEM_ENCRYPT
> KASAN_SANITIZE_string.o := n
>
> -CFLAGS_string.o := -fno-stack-protector
> +CFLAGS_string.o += -fno-stack-protector
> endif
>
> # Used by KCSAN while enabled, avoid recursion.
> --
> 2.26.2
>


--
Thanks,
~Nick Desaulniers

2020-08-19 19:09:49

by Arvind Sankar

Subject: Re: [PATCH v2] lib/string.c: Use freestanding environment

On Wed, Aug 19, 2020 at 11:35:11AM -0700, Nick Desaulniers wrote:
> On Wed, Aug 19, 2020 at 7:08 AM Arvind Sankar <[email protected]> wrote:
> >
> > gcc can transform the loop in a naive implementation of memset/memcpy
> > etc into a call to the function itself. This optimization is enabled by
> > -ftree-loop-distribute-patterns.
> >
> > This has been the case for a while (see eg [0]), but gcc-10.x enables
> > this option at -O2 rather than -O3 as in previous versions.
> >
> > Add -ffreestanding, which implicitly disables this optimization with
> > gcc. It is unclear whether clang performs such optimizations, but
> > hopefully it will also not do so in a freestanding environment.
> >
> > [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888
> >
> > Signed-off-by: Arvind Sankar <[email protected]>
>
> For Clang:
> For x86_64 defconfig:
> This results in no change for the code generated.
>
> For aarch64 defconfig:
> This results in calls to bcmp being replaced with calls to memcmp in
> strstr and strnstr. I plan on adding -fno-builtin-bcmp then removing
> bcmp anyways. Not a bug either way, just noting the difference in
> disassembly.
>
> For arm defconfig:
> This results in no change for the code generated.
>
> I should check the other architectures we support, but my local build
> doesn't have all backends enabled currently; we'll catch it once it's
> being tested in -next if it's an issue, but I don't foresee it
> (knocks on wood, famous last words, ...)
>
> If it helps GCC not optimize these core functions into infinite
> recursion, I'm for that, especially since I'd bet these get called
> frequently and early on in boot, which is my least favorite time to
> debug.
>
> Reviewed-by: Nick Desaulniers <[email protected]>
> Tested-by: Nick Desaulniers <[email protected]>
>

I verified that arch/c6x with gcc-10 downloaded from kernel.org has the
broken memset with CC_OPTIMIZE_FOR_PERFORMANCE and gets fixed with this
patch. The default is optimize for size though, which doesn't seem to be
busted.

Thanks.