2009-09-28 09:34:12

by Arjan van de Ven

Subject: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy


From ebb81aab0c3df19771ebc0eec1261ae314ddc0af Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <[email protected]>
Date: Mon, 28 Sep 2009 11:21:32 +0200
Subject: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

GCC provides reasonable memset/memcpy functions itself, with __builtin_memset
and __builtin_memcpy. For the "unknown" cases, it'll fall back to our
existing functions, but for fixed-size versions it'll inline
something smart. Quite often that will be the same as we have now,
but sometimes it can do something smarter (for example, if the code
then sets the first member of a struct, it can do a shorter memset).

In addition, and this is more important, gcc knows which registers and
such are not clobbered (while for our asm version it pretty much
acts like a compiler barrier), so for various cases it can avoid reloading
values.

The effect on codesize is shown below on my typical laptop .config:

text data bss dec hex filename
5605675 2041100 6525148 14171923 d83f13 vmlinux.before
5595849 2041668 6525148 14162665 d81ae9 vmlinux.after

Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86/include/asm/string_32.h | 11 ++---------
1 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index ae907e6..23a0bc2 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -177,10 +177,7 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
*/

#ifndef CONFIG_KMEMCHECK
-#define memcpy(t, f, n) \
- (__builtin_constant_p((n)) \
- ? __constant_memcpy((t), (f), (n)) \
- : __memcpy((t), (f), (n)))
+#define memcpy(t, f, n) __builtin_memcpy(t, f, n)
#else
/*
* kmemcheck becomes very happy if we use the REP instructions unconditionally,
@@ -316,11 +313,7 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
: __memset_generic((s), (c), (count)))

#define __HAVE_ARCH_MEMSET
-#define memset(s, c, count) \
- (__builtin_constant_p(c) \
- ? __constant_c_x_memset((s), (0x01010101UL * (unsigned char)(c)), \
- (count)) \
- : __memset((s), (c), (count)))
+#define memset(s, c, count) __builtin_memset(s, c, count)

/*
* find the first occurrence of byte 'c', or 1 past the area if none
--
1.6.2.5


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org


2009-09-28 12:21:00

by Arjan van de Ven

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

On Mon, 28 Sep 2009 11:34:33 +0200

Turns out doing this unconditionally is not a good idea due to gcc 3.x;
the updated patch below only does the change for gcc 4.x.

From e86cf2618a2e489cafcafb4361569c4d0853156e Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <[email protected]>
Date: Mon, 28 Sep 2009 11:21:32 +0200
Subject: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

GCC provides reasonable memset/memcpy functions itself, with __builtin_memset
and __builtin_memcpy. For the "unknown" cases, it'll fall back to our
existing functions, but for fixed-size versions it'll inline
something smart. Quite often that will be the same as we have now,
but sometimes it can do something smarter (for example, if the code
then sets the first member of a struct, it can do a shorter memset).

In addition, and this is more important, gcc knows which registers and
such are not clobbered (while for our asm version it pretty much
acts like a compiler barrier), so for various cases it can avoid reloading
values.

The effect on codesize is shown below on my typical laptop .config:

text data bss dec hex filename
5605675 2041100 6525148 14171923 d83f13 vmlinux.before
5595849 2041668 6525148 14162665 d81ae9 vmlinux.after

Due to some not-so-good behavior in the gcc 3.x series, this change
is only done for GCC 4.x and above.

Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/x86/include/asm/string_32.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index ae907e6..3d3e835 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -177,10 +177,15 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
*/

#ifndef CONFIG_KMEMCHECK
+
+#if (__GNUC__ >= 4)
+#define memcpy(t, f, n) __builtin_memcpy(t, f, n)
+#else
#define memcpy(t, f, n) \
(__builtin_constant_p((n)) \
? __constant_memcpy((t), (f), (n)) \
: __memcpy((t), (f), (n)))
+#endif
#else
/*
* kmemcheck becomes very happy if we use the REP instructions unconditionally,
@@ -316,11 +321,15 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
: __memset_generic((s), (c), (count)))

#define __HAVE_ARCH_MEMSET
+#if (__GNUC__ >= 4)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#else
#define memset(s, c, count) \
(__builtin_constant_p(c) \
? __constant_c_x_memset((s), (0x01010101UL * (unsigned char)(c)), \
(count)) \
: __memset((s), (c), (count)))
+#endif

/*
* find the first occurrence of byte 'c', or 1 past the area if none
--
1.6.2.5




2009-09-28 23:48:03

by Arjan van de Ven

Subject: [tip:x86/asm] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

Commit-ID: ff60fab71bb3b4fdbf8caf57ff3739ffd0887396
Gitweb: http://git.kernel.org/tip/ff60fab71bb3b4fdbf8caf57ff3739ffd0887396
Author: Arjan van de Ven <[email protected]>
AuthorDate: Mon, 28 Sep 2009 14:21:22 +0200
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 28 Sep 2009 16:43:15 -0700

x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

GCC provides reasonable memset/memcpy functions itself, with __builtin_memset
and __builtin_memcpy. For the "unknown" cases, it'll fall back to our
existing functions, but for fixed-size versions it'll inline
something smart. Quite often that will be the same as we have now,
but sometimes it can do something smarter (for example, if the code
then sets the first member of a struct, it can do a shorter memset).

In addition, and this is more important, gcc knows which registers and
such are not clobbered (while for our asm version it pretty much
acts like a compiler barrier), so for various cases it can avoid reloading
values.

The effect on codesize is shown below on my typical laptop .config:

text data bss dec hex filename
5605675 2041100 6525148 14171923 d83f13 vmlinux.before
5595849 2041668 6525148 14162665 d81ae9 vmlinux.after

Due to some not-so-good behavior in the gcc 3.x series, this change
is only done for GCC 4.x and above.

Signed-off-by: Arjan van de Ven <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>


---
arch/x86/include/asm/string_32.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index ae907e6..3d3e835 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -177,10 +177,15 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
*/

#ifndef CONFIG_KMEMCHECK
+
+#if (__GNUC__ >= 4)
+#define memcpy(t, f, n) __builtin_memcpy(t, f, n)
+#else
#define memcpy(t, f, n) \
(__builtin_constant_p((n)) \
? __constant_memcpy((t), (f), (n)) \
: __memcpy((t), (f), (n)))
+#endif
#else
/*
* kmemcheck becomes very happy if we use the REP instructions unconditionally,
@@ -316,11 +321,15 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
: __memset_generic((s), (c), (count)))

#define __HAVE_ARCH_MEMSET
+#if (__GNUC__ >= 4)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#else
#define memset(s, c, count) \
(__builtin_constant_p(c) \
? __constant_c_x_memset((s), (0x01010101UL * (unsigned char)(c)), \
(count)) \
: __memset((s), (c), (count)))
+#endif

/*
* find the first occurrence of byte 'c', or 1 past the area if none

2009-09-29 12:46:03

by Arnd Bergmann

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

On Monday 28 September 2009, Arjan van de Ven wrote:
>
> GCC provides reasonable memset/memcpy functions itself, with __builtin_memset
> and __builtin_memcpy. For the "unknown" cases, it'll fall back to our
> existing functions, but for fixed-size versions it'll inline
> something smart. Quite often that will be the same as we have now,
> but sometimes it can do something smarter (for example, if the code
> then sets the first member of a struct, it can do a shorter memset).
>
> In addition, and this is more important, gcc knows which registers and
> such are not clobbered (while for our asm version it pretty much
> acts like a compiler barrier), so for various cases it can avoid reloading
> values.
>
> The effect on codesize is shown below on my typical laptop .config:
>
> text data bss dec hex filename
> 5605675 2041100 6525148 14171923 d83f13 vmlinux.before
> 5595849 2041668 6525148 14162665 d81ae9 vmlinux.after
>

The patch looks good, but is there a reason to keep it architecture
specific? I would guess that the same logic applies to all architectures
with gcc-4.x and could be put into include/linux/compiler-gcc4.h.

Arnd <><

2009-09-29 12:52:36

by Arjan van de Ven

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

On Tue, 29 Sep 2009 14:44:06 +0200
Arnd Bergmann <[email protected]> wrote:

> >
>
> The patch looks good, but is there a reason to keep it architecture
> specific? I would guess that the same logic applies to all
> architectures with gcc-4.x and could be put into
> include/linux/compiler-gcc4.h.

There are some requirements on the architecture for this to work...
and as always with compiler things, the tradeoffs and how well it works
will vary per architecture.

If a large majority of the architectures make this switch, we could move
it generic... but it's a bit early for that.




2009-09-29 15:36:11

by H. Peter Anvin

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

On 09/29/2009 05:44 AM, Arnd Bergmann wrote:
>
> The patch looks good, but is there a reason to keep it architecture
> specific? I would guess that the same logic applies to all architectures
> with gcc-4.x and could be put into include/linux/compiler-gcc4.h.
>

Each architecture has its own implementation, and in some cases gcc will
even generate code which is illegal in the kernel. It really needs to
be enabled by each architecture.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-02 19:19:19

by Andi Kleen

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

Arjan van de Ven <[email protected]> writes:

> From ebb81aab0c3df19771ebc0eec1261ae314ddc0af Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <[email protected]>
> Date: Mon, 28 Sep 2009 11:21:32 +0200
> Subject: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy
>
> GCC provides reasonable memset/memcpy functions itself, with __builtin_memset
> and __builtin_memcpy. For the "unknown" cases, it'll fall back to our
> existing functions, but for fixed-size versions it'll inline
> something smart. Quite often that will be the same as we have now,
> but sometimes it can do something smarter (for example, if the code
> then sets the first member of a struct, it can do a shorter memset).
>
> In addition, and this is more important, gcc knows which registers and
> such are not clobbered (while for our asm version it pretty much
> acts like a compiler barrier), so for various cases it can avoid reloading
> values.
>
> The effect on codesize is shown below on my typical laptop .config:
>
> text data bss dec hex filename
> 5605675 2041100 6525148 14171923 d83f13 vmlinux.before
> 5595849 2041668 6525148 14162665 d81ae9 vmlinux.after

I tried this some time ago, but it generates bad code on some
gcc 3 versions.

You really need to test this kind of change on a wide variety
of compilers, not assume everyone uses the same version as you.

-Andi

--
[email protected] -- Speaking for myself only.

2009-10-02 20:08:30

by H. Peter Anvin

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

On 10/02/2009 12:19 PM, Andi Kleen wrote:
>
> I tried this some time ago, but it generates bad code on some
> gcc 3 versions.
>
> You really need to test this kind of change on a wide variety
> of compilers, not assume everyone uses the same version as you.
>

The version that went in is for gcc 4 only.

-hpa


2009-10-02 20:12:07

by Andi Kleen

Subject: Re: [PATCH] x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy

On Fri, Oct 02, 2009 at 01:04:25PM -0700, H. Peter Anvin wrote:
> On 10/02/2009 12:19 PM, Andi Kleen wrote:
> >
> > I tried this some time ago, but it generates bad code on some
> > gcc 3 versions.
> >
> > You really need to test this kind of change on a wide variety
> > of compilers, not assume everyone uses the same version as you.
> >
>
> The version that went in is for gcc 4 only.

Thanks.

My memory might be faulty, but iirc 4.0/4.1 were also slightly
problematic. So better take a look at the code for that too,
if you haven't yet (code size numbers alone are not enough).

-Andi
