Recent compilers know the meaning of builtins (`memset`,
`memcpy`, ...) and can replace calls by inline code when
deemed better. For example, `memset(p, 0, 4)` will be lowered
to a four-byte zero store.
When using -ffreestanding (this is the case e.g. building on
clang), these optimizations are disabled. This means that **all**
memsets, including those with small, constant sizes, will result
in an actual call to memset.
We have identified several spots where we have high CPU usage
because of this. For example, a single one of these memsets is
responsible for about 0.3% of our total CPU usage in the kernel.
Aliasing `memset` to `__builtin_memset` allows the compiler to
perform this optimization even when -ffreestanding is used.
There is no change when -ffreestanding is not used.
Below is a diff (clang) for `update_sg_lb_stats()`, which
includes the aforementioned hot memset:
memset(sgs, 0, sizeof(*sgs));
Diff:
movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
callq <memset> ~~~> movq $0x0, 0x18(%r8)
~~~> movq $0x0, 0x10(%r8)
~~~> movq $0x0, 0x8(%r8)
~~~> movq $0x0, (%r8)
In terms of code size, this grows the clang-built kernel a
bit (+0.022%):
440285512 vmlinux.clang.after
440383608 vmlinux.clang.before
Signed-off-by: Clement Courbet <[email protected]>
---
arch/x86/include/asm/string_64.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..7073c25aa4a3 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
+/* Recent compilers can generate much better code for known size and/or
+ * fill values, and will fallback on `memset` if they fail.
+ * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
+ * perform this optimization even when -ffreestanding is used.
+ */
+#if (__GNUC__ >= 4)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#endif
+
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
--
2.25.1.696.g5e7596f4ac-goog
On Mon, Mar 23, 2020 at 12:42:06PM +0100, 'Clement Courbet' via Clang Built Linux wrote:
> Recent compilers know the meaning of builtins (`memset`,
> `memcpy`, ...) and can replace calls by inline code when
> deemed better. For example, `memset(p, 0, 4)` will be lowered
> to a four-byte zero store.
>
> When using -ffreestanding (this is the case e.g. building on
> clang), these optimizations are disabled. This means that **all**
> memsets, including those with small, constant sizes, will result
> in an actual call to memset.
>
> We have identified several spots where we have high CPU usage
> because of this. For example, a single one of these memsets is
> responsible for about 0.3% of our total CPU usage in the kernel.
>
> Aliasing `memset` to `__builtin_memset` allows the compiler to
> perform this optimization even when -ffreestanding is used.
> There is no change when -ffreestanding is not used.
>
> Below is a diff (clang) for `update_sg_lb_stats()`, which
> includes the aforementioned hot memset:
> memset(sgs, 0, sizeof(*sgs));
>
> Diff:
> movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
> movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
> movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
> movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
> xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
> callq <memset> ~~~> movq $0x0, 0x18(%r8)
> ~~~> movq $0x0, 0x10(%r8)
> ~~~> movq $0x0, 0x8(%r8)
> ~~~> movq $0x0, (%r8)
>
> In terms of code size, this grows the clang-built kernel a
> bit (+0.022%):
> 440285512 vmlinux.clang.after
> 440383608 vmlinux.clang.before
>
> Signed-off-by: Clement Courbet <[email protected]>
> ---
> arch/x86/include/asm/string_64.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 75314c3dbe47..7073c25aa4a3 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> void *memset(void *s, int c, size_t n);
> void *__memset(void *s, int c, size_t n);
>
> +/* Recent compilers can generate much better code for known size and/or
> + * fill values, and will fallback on `memset` if they fail.
> + * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
> + * perform this optimization even when -ffreestanding is used.
> + */
> +#if (__GNUC__ >= 4)
This ifdef is unnecessary; it will always be true because the minimum
supported GCC version is 4.6 and clang pretends it is 4.2.1. It appears
the Intel compiler will pretend to be whatever GCC version the host is
using (no idea if it is still used by anyone or truly supported, but
still worth mentioning), so it would still always be true because of
the GCC 4.6 requirement.
I cannot comment on the validity of the patch even though the reasoning
seems sound to me.
Cheers,
Nathan
> +#define memset(s, c, count) __builtin_memset(s, c, count)
> +#endif
> +
> #define __HAVE_ARCH_MEMSET16
> static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
> {
> @@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
> #undef memcpy
> #define memcpy(dst, src, len) __memcpy(dst, src, len)
> #define memmove(dst, src, len) __memmove(dst, src, len)
> +#undef memset
> #define memset(s, c, n) __memset(s, c, n)
>
> #ifndef __NO_FORTIFY
> --
> 2.25.1.696.g5e7596f4ac-goog
>
On Mon, Mar 23, 2020 at 4:43 AM 'Clement Courbet' via Clang Built
Linux <[email protected]> wrote:
>
> Recent compilers know the meaning of builtins (`memset`,
> `memcpy`, ...) and can replace calls by inline code when
> deemed better. For example, `memset(p, 0, 4)` will be lowered
> to a four-byte zero store.
>
> When using -ffreestanding (this is the case e.g. building on
> clang), these optimizations are disabled. This means that **all**
> memsets, including those with small, constant sizes, will result
> in an actual call to memset.
Isn't this only added for 32b x86 (if I'm reading arch/x86/Makefile
right)? Who's adding it for 64b?
arch/x86/Makefile has a comment:
88 # temporary until string.h is fixed
89 KBUILD_CFLAGS += -ffreestanding
Did you look into fixing that?
>
> We have identified several spots where we have high CPU usage
> because of this. For example, a single one of these memsets is
> responsible for about 0.3% of our total CPU usage in the kernel.
>
> Aliasing `memset` to `__builtin_memset` allows the compiler to
> perform this optimization even when -ffreestanding is used.
> There is no change when -ffreestanding is not used.
>
> Below is a diff (clang) for `update_sg_lb_stats()`, which
> includes the aforementioned hot memset:
> memset(sgs, 0, sizeof(*sgs));
>
> Diff:
> movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
> movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
> movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
> movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
> xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
> callq <memset> ~~~> movq $0x0, 0x18(%r8)
> ~~~> movq $0x0, 0x10(%r8)
> ~~~> movq $0x0, 0x8(%r8)
> ~~~> movq $0x0, (%r8)
>
> In terms of code size, this grows the clang-built kernel a
> bit (+0.022%):
> 440285512 vmlinux.clang.after
> 440383608 vmlinux.clang.before
The before number looks bigger? Did it shrink in size, or was the
above mislabeled?
>
> Signed-off-by: Clement Courbet <[email protected]>
> ---
> arch/x86/include/asm/string_64.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 75314c3dbe47..7073c25aa4a3 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> void *memset(void *s, int c, size_t n);
> void *__memset(void *s, int c, size_t n);
>
> +/* Recent compilers can generate much better code for known size and/or
> + * fill values, and will fallback on `memset` if they fail.
> + * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
> + * perform this optimization even when -ffreestanding is used.
> + */
> +#if (__GNUC__ >= 4)
> +#define memset(s, c, count) __builtin_memset(s, c, count)
> +#endif
> +
> #define __HAVE_ARCH_MEMSET16
> static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
> {
> @@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
> #undef memcpy
> #define memcpy(dst, src, len) __memcpy(dst, src, len)
> #define memmove(dst, src, len) __memmove(dst, src, len)
> +#undef memset
> #define memset(s, c, n) __memset(s, c, n)
>
> #ifndef __NO_FORTIFY
> --
--
Thanks,
~Nick Desaulniers
On Mon, Mar 23, 2020 at 6:22 PM Nick Desaulniers
<[email protected]> wrote:
>
> On Mon, Mar 23, 2020 at 4:43 AM 'Clement Courbet' via Clang Built
> Linux <[email protected]> wrote:
> >
> > Recent compilers know the meaning of builtins (`memset`,
> > `memcpy`, ...) and can replace calls by inline code when
> > deemed better. For example, `memset(p, 0, 4)` will be lowered
> > to a four-byte zero store.
> >
> > When using -ffreestanding (this is the case e.g. building on
> > clang), these optimizations are disabled. This means that **all**
> > memsets, including those with small, constant sizes, will result
> > in an actual call to memset.
>
> Isn't this only added for 32b x86 (if I'm reading arch/x86/Makefile
> right)? Who's adding it for 64b?
>
> arch/x86/Makefile has a comment:
> 88 # temporary until string.h is fixed
> 89 KBUILD_CFLAGS += -ffreestanding
> Did you look into fixing that?
>
> >
> > We have identified several spots where we have high CPU usage
> > because of this. For example, a single one of these memsets is
> > responsible for about 0.3% of our total CPU usage in the kernel.
> >
> > Aliasing `memset` to `__builtin_memset` allows the compiler to
> > perform this optimization even when -ffreestanding is used.
> > There is no change when -ffreestanding is not used.
> >
> > Below is a diff (clang) for `update_sg_lb_stats()`, which
> > includes the aforementioned hot memset:
> > memset(sgs, 0, sizeof(*sgs));
Further, `make CC=clang -j71 kernel/sched/fair.o V=1` doesn't show
`-ffreestanding` being used. Any idea where it's coming from in your
build? Maybe a local modification to be removed?
> >
> > Diff:
> > movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
> > movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
> > movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
> > movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
> > xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
> > callq <memset> ~~~> movq $0x0, 0x18(%r8)
> > ~~~> movq $0x0, 0x10(%r8)
> > ~~~> movq $0x0, 0x8(%r8)
> > ~~~> movq $0x0, (%r8)
> >
> > In terms of code size, this grows the clang-built kernel a
> > bit (+0.022%):
> > 440285512 vmlinux.clang.after
> > 440383608 vmlinux.clang.before
>
> The before number looks bigger? Did it shrink in size, or was the
> above mislabeled?
>
> >
> > Signed-off-by: Clement Courbet <[email protected]>
> > ---
> > arch/x86/include/asm/string_64.h | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> > index 75314c3dbe47..7073c25aa4a3 100644
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> > @@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> > void *memset(void *s, int c, size_t n);
> > void *__memset(void *s, int c, size_t n);
> >
> > +/* Recent compilers can generate much better code for known size and/or
> > + * fill values, and will fallback on `memset` if they fail.
> > + * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
> > + * perform this optimization even when -ffreestanding is used.
> > + */
> > +#if (__GNUC__ >= 4)
> > +#define memset(s, c, count) __builtin_memset(s, c, count)
> > +#endif
> > +
> > #define __HAVE_ARCH_MEMSET16
> > static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
> > {
> > @@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
> > #undef memcpy
> > #define memcpy(dst, src, len) __memcpy(dst, src, len)
> > #define memmove(dst, src, len) __memmove(dst, src, len)
> > +#undef memset
> > #define memset(s, c, n) __memset(s, c, n)
> >
> > #ifndef __NO_FORTIFY
> > --
>
> --
> Thanks,
> ~Nick Desaulniers
--
Thanks,
~Nick Desaulniers
Thanks for the comments. Answers below.
> This ifdef is unnecessary
> This needs to be within #ifndef CONFIG_FORTIFY_SOURCE
Thanks, fixed.
> shouldn't this just be done universally ?
It looks like every architecture does its own magic with memory
functions. I'm not very familiar with how the Linux kernel is
organized; do you have a pointer for where this would go if enabled
globally?
> Who's adding it for 64b?
> Any idea where it's coming from in your
> build? Maybe a local modification to be removed?
Actually this is from our local build configuration. Apparently this
is needed to build on some architectures that redefine common
symbols, but the authors seemed to feel strongly that this should be
on for all architectures. I've reached out to the authors for an
extended rationale.
On the other hand I think that there is no reason to prevent people
from building with freestanding if we can easily allow them to.
Recent compilers know the meaning of builtins (`memset`,
`memcpy`, ...) and can replace calls by inline code when
deemed better. For example, `memset(p, 0, 4)` will be lowered
to a four-byte zero store.
When using -ffreestanding (this is the case e.g. building on
clang), these optimizations are disabled. This means that **all**
memsets, including those with small, constant sizes, will result
in an actual call to memset.
We have identified several spots where we have high CPU usage
because of this. For example, a single one of these memsets is
responsible for about 0.3% of our total CPU usage in the kernel.
Aliasing `memset` to `__builtin_memset` allows the compiler to
perform this optimization even when -ffreestanding is used.
There is no change when -ffreestanding is not used.
Below is a diff (clang) for `update_sg_lb_stats()`, which
includes the aforementioned hot memset:
memset(sgs, 0, sizeof(*sgs));
Diff:
movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
callq <memset> ~~~> movq $0x0, 0x18(%r8)
~~~> movq $0x0, 0x10(%r8)
~~~> movq $0x0, 0x8(%r8)
~~~> movq $0x0, (%r8)
In terms of code size, this grows the clang-built kernel a
bit (+0.022%):
440285512 vmlinux.clang.after
440383608 vmlinux.clang.before
Signed-off-by: Clement Courbet <[email protected]>
---
changes in v2:
- Removed ifdef(GNUC >= 4).
- Disabled change when CONFIG_FORTIFY_SOURCE is set.
---
arch/x86/include/asm/string_64.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..9cfce0a840a4 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
+/* Recent compilers can generate much better code for known size and/or
+ * fill values, and will fallback on `memset` if they fail.
+ * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
+ * perform this optimization even when -ffreestanding is used.
+ */
+#if !defined(CONFIG_FORTIFY_SOURCE)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#endif
+
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
--
2.25.1.696.g5e7596f4ac-goog
On Tue, 2020-03-24 at 15:07 +0100, Clement Courbet wrote:
> Recent compilers know the meaning of builtins (`memset`,
> `memcpy`, ...) and can replace calls by inline code when
> deemed better. For example, `memset(p, 0, 4)` will be lowered
> to a four-byte zero store.
>
> When using -ffreestanding (this is the case e.g. building on
> clang), these optimizations are disabled. This means that **all**
> memsets, including those with small, constant sizes, will result
> in an actual call to memset.
[]
> In terms of code size, this grows the clang-built kernel a
> bit (+0.022%):
> 440285512 vmlinux.clang.after
> 440383608 vmlinux.clang.before
This shows the kernel getting smaller not larger.
Is this still reversed or is this correct?
Recent compilers know the meaning of builtins (`memset`,
`memcpy`, ...) and can replace calls by inline code when
deemed better. For example, `memset(p, 0, 4)` will be lowered
to a four-byte zero store.
When using -ffreestanding (this is the case e.g. building on
clang), these optimizations are disabled. This means that **all**
memsets, including those with small, constant sizes, will result
in an actual call to memset.
We have identified several spots where we have high CPU usage
because of this. For example, a single one of these memsets is
responsible for about 0.3% of our total CPU usage in the kernel.
Aliasing `memset` to `__builtin_memset` allows the compiler to
perform this optimization even when -ffreestanding is used.
There is no change when -ffreestanding is not used.
Below is a diff (clang) for `update_sg_lb_stats()`, which
includes the aforementioned hot memset:
memset(sgs, 0, sizeof(*sgs));
Diff:
movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
callq <memset> ~~~> movq $0x0, 0x18(%r8)
~~~> movq $0x0, 0x10(%r8)
~~~> movq $0x0, 0x8(%r8)
~~~> movq $0x0, (%r8)
In terms of code size, this shrinks the clang-built kernel a
bit (-0.022%):
440383608 vmlinux.clang.before
440285512 vmlinux.clang.after
Signed-off-by: Clement Courbet <[email protected]>
---
changes in v2:
- Removed ifdef(GNUC >= 4).
- Disabled change when CONFIG_FORTIFY_SOURCE is set.
changes in v3:
- Fixed commit message: the kernel shrinks in size.
---
arch/x86/include/asm/string_64.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..9cfce0a840a4 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
+/* Recent compilers can generate much better code for known size and/or
+ * fill values, and will fallback on `memset` if they fail.
+ * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
+ * perform this optimization even when -ffreestanding is used.
+ */
+#if !defined(CONFIG_FORTIFY_SOURCE)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#endif
+
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
--
2.25.1.696.g5e7596f4ac-goog
> This shows the kernel getting smaller not larger.
> Is this still reversed or is this correct?
Yes, sorry. This was wrong. The kernel actually shrinks. Fixed.
On Tue, Mar 24, 2020 at 7:06 AM Clement Courbet <[email protected]> wrote:
>
> Thanks for the comments. Answers below.
>
> > This ifdef is unnecessary
> > This needs to be within #ifndef CONFIG_FORTIFY_SOURCE
>
> Thanks, fixed.
>
> > shouldn't this just be done universally ?
>
> It looks like every architecture does its own magic with memory
> functions. I'm not very familiar with how the Linux kernel is
> organized; do you have a pointer for where this would go if enabled
> globally?
>
> > Who's adding it for 64b?
> > Any idea where it's coming from in your
> > build? Maybe a local modification to be removed?
>
> Actually this is from our local build configuration. Apparently this
Not sure we should modify this for someone's downstream tree to work
around an issue they introduced. If you want to file a bug
internally, I'd be happy to comment on it.
Does __builtin_memset detect support for `rep stosb`, then patch the
kernel to always use it or not? The kernel is limited in that we use
-mno-sse and friends to avoid saving/restoring vector registers on
context switch unless kernel_fpu_{begin|end}() is called, which the
compiler doesn't insert for memcpy's.
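(For reference, kernel code that really wants vector registers has to
bracket the region explicitly, roughly like this sketch:

    #include <asm/fpu/api.h>

    static void do_vector_work(void)
    {
            kernel_fpu_begin();
            /* SSE/AVX instructions are only legal inside this region */
            kernel_fpu_end();
    }

and the compiler cannot insert those calls around an inlined
memcpy/memset on its own.)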
Did you verify what this change does for other compilers?
> is needed to build on some architectures that redefine common
> symbols, but the authors seemed to feel strongly that this should be
Sounds like a linkage error or issue with symbol visibility; I don't
see how -ffreestanding should have any bearing on that.
> on for all architectures. I've reached out to the authors for an
> extended rationale.
> On the other hand I think that there is no reason to prevent people
> from building with freestanding if we can easily allow them to.
I disagree. The codegen in the kernel is very tricky to get right;
it's somewhat an embedded system, yet provides many symbols that would
have been provided by libc, yet it doesn't use vector operations for
the general case. Adding -ffreestanding to optimize one hot memset in
one function is using a really big hammer to solve the wrong problem.
--
Thanks,
~Nick Desaulniers
> Not sure we should modify this for someone's downstream tree to work
> around an issue they introduced. If you want to file a bug
> internally, I'd be happy to comment on it.
I don't have a strong opinion on that. I don't know the policy of the
linux kernel w.r.t. this. There is an internal bug for this, where
kernel maintainers suggested I upstream this patch.
> Does __builtin_memset detect support for `rep stosb`, then patch the
> kernel to always use it or not?
__builtin_memset just allows the compiler to recognize that this has the
semantics of a memset, exactly as happens when -ffreestanding is not
specified.
In terms of what compilers do when expanding the memset, it depends on
the size.
gcc and clang obviously do not generate vector code when -mno-sse is
specified, as this would break that promise.
clang inlines stores for small sizes and switches to memset as the size
increases: https://godbolt.org/z/_X16xt
gcc inlines stores for tiny sizes, then switches to rep stos for larger
sizes: https://godbolt.org/z/m-G233 This behaviour is additionally
configurable through command-line flags: https://godbolt.org/z/wsoJpQ
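For a standalone illustration of those thresholds (sizes and comments
are mine, not measured on the kernel), something like:

    #include <string.h>

    /* Exact cutoffs depend on the compiler, -O level and -m flags. */
    void f_small(char *p)  { memset(p, 0, 8);    } /* typically one 8-byte store */
    void f_medium(char *p) { memset(p, 0, 64);   } /* a short run of stores      */
    void f_large(char *p)  { memset(p, 0, 4096); } /* call to memset (clang) or  */
                                                   /* rep stos (gcc)             */

compiled with -O2 -mno-sse shows the behaviour described above.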
> Did you verify what this change does for other compilers?
Are there other compilers used to build the kernel on x86?
icc does the same as gcc and clang: https://godbolt.org/z/yCju_D
> yet it doesn't use vector operations for the general case
I'm not sure how vector operations relate to freestanding, or this change.
> Adding -ffreestanding to optimize one hot memset in
> one function is using a really big hammer to solve the wrong
> problem.
-ffreestanding was not added to optimize anything. It was already there.
If anything -ffreestanding actually pessimizes a lot of the code
generation, as it prevents the compiler from recognizing the semantics
of common primitives. This is what this change is trying to fix.
Removing -ffreestanding from the options is another (easier?) way to fix
the problem. This change is a stab at accommodating both those who want
freestanding and those who want performance.
The memset I mentioned was just the hottest one. But as you can imagine
there are many constant-size memsets spread across many functions, some of
which are also very hot, the others constituting a long tail which is not
negligible.
Hi all!
Sorry for being late to the party:
On 24/03/2020 16:59, Clement Courbet wrote:
[...]
> ---
> arch/x86/include/asm/string_64.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 75314c3dbe47..9cfce0a840a4 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> void *memset(void *s, int c, size_t n);
> void *__memset(void *s, int c, size_t n);
>
> +/* Recent compilers can generate much better code for known size and/or
> + * fill values, and will fallback on `memset` if they fail.
> + * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
> + * perform this optimization even when -ffreestanding is used.
> + */
> +#if !defined(CONFIG_FORTIFY_SOURCE)
> +#define memset(s, c, count) __builtin_memset(s, c, count)
To be on the safe side, the usual way to write macros is like
#define memset(s, c, count) __builtin_memset((s), (c), (count))
as no one knows what is passed as a parameter to memset(), and the extra
parentheses don't hurt.
And similar below (and I fear there are more places).
Or did I miss something in the Kernel?
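For context, the general reason behind the convention is operator
precedence; a toy example (nothing to do with memset itself):

    #define DOUBLE_BAD(x)   x + x
    #define DOUBLE_GOOD(x)  ((x) + (x))

    int a = 3 * DOUBLE_BAD(2);  /* expands to 3 * 2 + 2  == 8  */
    int b = 3 * DOUBLE_GOOD(2); /* expands to 3 * ((2) + (2)) == 12 */

For a macro that only forwards its arguments to a function call the
parentheses don't change the expansion, but they are cheap insurance.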
> +#endif
> +
> #define __HAVE_ARCH_MEMSET16
> static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
> {
> @@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
> #undef memcpy
> #define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memcpy(dst, src, len) __memcpy((dst), (src), (len))
> #define memmove(dst, src, len) __memmove(dst, src, len)
#define memmove(dst, src, len) __memmove((dst), (src), (len))
> +#undef memset
> #define memset(s, c, n) __memset(s, c, n)
#define memset(s, c, n) __memset((s), (c), (n))
>
> #ifndef __NO_FORTIFY
>
MfG,
Bernd
--
Bernd Petrovitsch Email : [email protected]
LUGA : http://www.luga.at
Recent compilers know the meaning of builtins (`memset`,
`memcpy`, ...) and can replace calls by inline code when
deemed better. For example, `memset(p, 0, 4)` will be lowered
to a four-byte zero store.
When using -ffreestanding (this is the case e.g. building on
clang), these optimizations are disabled. This means that **all**
memsets, including those with small, constant sizes, will result
in an actual call to memset.
We have identified several spots where we have high CPU usage
because of this. For example, a single one of these memsets is
responsible for about 0.3% of our total CPU usage in the kernel.
Aliasing `memset` to `__builtin_memset` allows the compiler to
perform this optimization even when -ffreestanding is used.
There is no change when -ffreestanding is not used.
Below is a diff (clang) for `update_sg_lb_stats()`, which
includes the aforementioned hot memset:
memset(sgs, 0, sizeof(*sgs));
Diff:
movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
callq <memset> ~~~> movq $0x0, 0x18(%r8)
~~~> movq $0x0, 0x10(%r8)
~~~> movq $0x0, 0x8(%r8)
~~~> movq $0x0, (%r8)
In terms of code size, this shrinks the clang-built kernel a
bit (-0.022%):
440383608 vmlinux.clang.before
440285512 vmlinux.clang.after
Signed-off-by: Clement Courbet <[email protected]>
---
changes in v2:
- Removed ifdef(GNUC >= 4).
- Disabled change when CONFIG_FORTIFY_SOURCE is set.
changes in v3:
- Fixed commit message: the kernel shrinks in size.
changes in v4:
- Properly parenthesize the macro.
---
arch/x86/include/asm/string_64.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..1bfa825e9ad3 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
+/* Recent compilers can generate much better code for known size and/or
+ * fill values, and will fallback on `memset` if they fail.
+ * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
+ * perform this optimization even when -ffreestanding is used.
+ */
+#if !defined(CONFIG_FORTIFY_SOURCE)
+#define memset(s, c, count) __builtin_memset((s), (c), (count))
+#endif
+
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
--
2.25.1.696.g5e7596f4ac-goog
I discussed with the original authors who added freestanding to our
build. It turns out that it was added globally just to work around
powerpc not compiling under clang, but they felt the fix was
appropriate globally.
Now Nick has dug up https://lkml.org/lkml/2019/8/29/1300, which
advises against freestanding. Also, I did some research and
discovered that the original reason for using freestanding for
powerpc has been fixed here:
https://lore.kernel.org/linuxppc-dev/[email protected]/
I'm going to remove -ffreestanding from downstream, so we don't really need
this anymore, sorry for wasting people's time.
I wonder if the freestanding fix from the aforementioned patch is really needed
though. I think that clang is actually right to point out the issue.
I don't see any reason why setjmp()/longjmp() are declared as taking longs
rather than ints. The implementation looks like it only ever propagates the
value (in longjmp) or sets it to 1 (in setjmp), and we only ever call longjmp
with integer parameters. But I'm not a PowerPC expert, so I might
be misreading the code.
So it seems that we could just remove freestanding altogether and rewrite the
code to:
diff --git a/arch/powerpc/include/asm/setjmp.h b/arch/powerpc/include/asm/setjmp.h
index 279d03a1eec6..7941ae68fe21 100644
--- a/arch/powerpc/include/asm/setjmp.h
+++ b/arch/powerpc/include/asm/setjmp.h
@@ -12,7 +12,9 @@
#define JMP_BUF_LEN 23
-extern long setjmp(long *);
-extern void longjmp(long *, long);
+typedef long * jmp_buf;
+
+extern int setjmp(jmp_buf);
+extern void longjmp(jmp_buf, int);
I'm happy to send a patch for this, and get rid of more -ffreestanding.
Opinions ?
On Thu, Mar 26, 2020 at 5:38 AM Clement Courbet <[email protected]> wrote:
>
> I discussed with the original authors who added freestanding to our
> build. It turns out that it was added globally just to work around
> powerpc not compiling under clang, but they felt the fix was
> appropriate globally.
>
> Now Nick has dug up https://lkml.org/lkml/2019/8/29/1300, which
> advises against freestanding. Also, I did some research and
> discovered that the original reason for using freestanding for
> powerpc has been fixed here:
> https://lore.kernel.org/linuxppc-dev/[email protected]/
>
> I'm going to remove -ffreestanding from downstream, so we don't really need
> this anymore, sorry for wasting people's time.
>
> I wonder if the freestanding fix from the aforementioned patch is really needed
> though. I think that clang is actually right to point out the issue.
> I don't see any reason why setjmp()/longjmp() are declared as taking longs
> rather than ints. The implementation looks like it only ever propagates the
> value (in longjmp) or sets it to 1 (in setjmp), and we only ever call longjmp
> with integer parameters. But I'm not a PowerPC expert, so I might
> be misreading the code.
>
>
> So it seems that we could just remove freestanding altogether and rewrite the
> code to:
>
> diff --git a/arch/powerpc/include/asm/setjmp.h b/arch/powerpc/include/asm/setjmp.h
> index 279d03a1eec6..7941ae68fe21 100644
> --- a/arch/powerpc/include/asm/setjmp.h
> +++ b/arch/powerpc/include/asm/setjmp.h
> @@ -12,7 +12,9 @@
>
> #define JMP_BUF_LEN 23
> -extern long setjmp(long *);
> -extern void longjmp(long *, long);
> +typedef long * jmp_buf;
> +
> +extern int setjmp(jmp_buf);
> +extern void longjmp(jmp_buf, int);
>
> I'm happy to send a patch for this, and get rid of more -ffreestanding.
> Opinions ?
Yes, I think the above diff and additionally cleaning up
-ffreestanding from arch/powerpc/kernel/Makefile and
arch/powerpc/xmon/Makefile would be a much better approach. If that's
good enough to fix the warning, I kind of can't believe we didn't spot
that in the original code review!
Actually, the god awful warning was:
./arch/powerpc/include/asm/setjmp.h:10:13: error: declaration of
built-in function 'setjmp' requires the declaration of the 'jmp_buf'
type, commonly provided in the header <setjmp.h>.
[-Werror,-Wincomplete-setjmp-declaration]
extern long setjmp(long *) __attribute__((returns_twice));
^
So jmp_buf was missing and unused, and on top of that there was the
long vs. int confusion.
I tested the above diff; all calls to setjmp under arch/powerpc/ just
compare the return value against 0, and callers of longjmp just pass 1
for the final parameter. So the above changes should result in no
functional change.
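The caller pattern looks roughly like this (a sketch with made-up
names, not a verbatim excerpt from arch/powerpc/):

    #include <asm/setjmp.h>

    static long fault_jmp[JMP_BUF_LEN];

    static int try_faulting_access(void)
    {
            if (setjmp(fault_jmp) == 0) {
                    /* do the access; a fault handler would call
                     * longjmp(fault_jmp, 1) on failure */
                    return 0;
            }
            /* only ever reached with longjmp value 1 */
            return -1;
    }

so int vs long makes no difference to any caller.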
--
Thanks,
~Nick Desaulniers
Clement Courbet <[email protected]> writes:
> I discussed with the original authors who added freestanding to our
> build. It turns out that it was added globally just to work around
> powerpc not compiling under clang, but they felt the fix was
> appropriate globally.
>
> Now Nick has dug up https://lkml.org/lkml/2019/8/29/1300, which
> advises against freestanding. Also, I did some research and
> discovered that the original reason for using freestanding for
> powerpc has been fixed here:
> https://lore.kernel.org/linuxppc-dev/[email protected]/
>
> I'm going to remove -ffreestanding from downstream, so we don't really need
> this anymore, sorry for wasting people's time.
>
> I wonder if the freestanding fix from the aforementioned patch is really needed
> though. I think that clang is actually right to point out the issue.
> I don't see any reason why setjmp()/longjmp() are declared as taking longs
> rather than ints. The implementation looks like it only ever propagates the
> value (in longjmp) or sets it to 1 (in setjmp), and we only ever call longjmp
> with integer parameters. But I'm not a PowerPC expert, so I might
> be misreading the code.
>
>
> So it seems that we could just remove freestanding altogether and rewrite the
> code to:
>
> diff --git a/arch/powerpc/include/asm/setjmp.h b/arch/powerpc/include/asm/setjmp.h
> index 279d03a1eec6..7941ae68fe21 100644
> --- a/arch/powerpc/include/asm/setjmp.h
> +++ b/arch/powerpc/include/asm/setjmp.h
> @@ -12,7 +12,9 @@
>
> #define JMP_BUF_LEN 23
> -extern long setjmp(long *);
> -extern void longjmp(long *, long);
> +typedef long * jmp_buf;
> +
> +extern int setjmp(jmp_buf);
> +extern void longjmp(jmp_buf, int);
>
> I'm happy to send a patch for this, and get rid of more -ffreestanding.
> Opinions ?
If it works then it looks like a much better fix than using -ffreestanding.
Please submit a patch with a change log etc. and I'd be happy to merge
it.
cheers
Hi!
On Thu, Mar 26, 2020 at 01:38:39PM +0100, Clement Courbet wrote:
> --- a/arch/powerpc/include/asm/setjmp.h
> +++ b/arch/powerpc/include/asm/setjmp.h
> @@ -12,7 +12,9 @@
>
> #define JMP_BUF_LEN 23
> -extern long setjmp(long *);
> -extern void longjmp(long *, long);
> +typedef long * jmp_buf;
> +
> +extern int setjmp(jmp_buf);
> +extern void longjmp(jmp_buf, int);
>
> I'm happy to send a patch for this, and get rid of more -ffreestanding.
> Opinions ?
Pedantically, jmp_buf should be an array type. But, this will probably
work fine, and it certainly is better than what we had before.
You could do
typedef long jmp_buf[JMP_BUF_LEN];
perhaps?
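i.e., just spelling that suggestion out, something like:

    typedef long jmp_buf[JMP_BUF_LEN];

    extern int setjmp(jmp_buf env);        /* env is adjusted to long * */
    extern void longjmp(jmp_buf env, int val);

which also matches the shape of the hosted <setjmp.h> interface.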
Thanks,
Segher