00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV18GL [Quadro4 380 XGL] (rev a2)
05:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782 Gigabit Ethernet (rev 02)
05:09.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)
* Jeff Garzik <[email protected]> wrote:
> On x86-32, this warning now appears for me in 2.6.30-rc3, and did
> not appear in 2.6.29.
>
> drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
> /spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning:
> array subscript is above array bounds
Last I checked it was a GCC bounds-check bogosity. All attempts to
work around it or annotate it sanely (without changing the assembly
code) failed. (New ideas welcome.)
The closest I came was the hacklet below to the assembly code [with
an intentionally corrupted patch header to make it harder to apply
accidentally].
The hacklet writes the fifth byte by doing a two-byte write starting
at byte offset 3.
> lspci, dmesg and .config attached.
>
> Jeff
>
> P.S. It is unclear in MAINTAINERS whether [email protected] should
> be CC'd in addition to the other addresses listed, or as a
> replacement for individual emails.
at your option. Cc:-ing maintainers directly can get faster
treatment occasionally. Using the alias is shorter.
Ingo
NOT-Signed-off-by-me:
+++ linux/arch/x86/include/asm/string_32.h
@@ -72,7 +72,7 @@ static __always_inline void *__constant_
return to;
case 5:
*(int *)to = *(int *)from;
- *((char *)to + 4) = *((char *)from + 4);
+ *((short *)((char *)to + 3)) = *((short *)((char *)from + 3));
return to;
case 6:
*(int *)to = *(int *)from;
Ingo Molnar <[email protected]> writes:
> * Jeff Garzik <[email protected]> wrote:
>
>> On x86-32, this warning now appears for me in 2.6.30-rc3, and did
>> not appear in 2.6.29.
>>
>> drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
>> /spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning:
>> array subscript is above array bounds
>
> Last i checked it was a GCC bounds check bogosity. All attempts to
> work it around or annotate it sanely (without changing the assembly
> code) failed. (new ideas welcome)
>
> The closest i came was the hacklet below to the assembly code. [with
> an intentionally corrupted patch header to make it harder to apply
> accidentally.]
Modern gcc (and that is all that is supported now) should be able to
generate this code on its own already. So if you call __builtin_* it
will just work (that is what 64bit does) without that explicit code.
Here's a patch that does that with some numbers. It gives about 3k
smaller kernels on my configuration.
IMHO it's long overdue to do this for 32bit too.
It's a very attractive patch because it removes a lot of code:
arch/x86/include/asm/string_32.h | 127 ++-------------------------------------
arch/x86/lib/memcpy_32.c | 16 ++++
-Andi
---
X86-32: Let gcc decide whether to inline memcpy
Modern gccs have their own heuristics to decide whether string functions should be inlined
or not. This used to not be the case with old gccs, but Linux doesn't support them
anymore. The 64bit kernel always did it this way. Just define memcpy to
__builtin_memcpy and gcc should do the right thing. Also supply
an out-of-line memcpy that gcc can fall back to when it decides
not to inline.
First this fixes the
arch/x86/include/asm/string_32.h:75: warning: array subscript is above array bounds
warnings which have been creeping up recently by just
removing that code.
Then trusting gcc actually makes the kernel smaller by nearly 3K:
5503146 529444 1495040 7527630 72dcce vmlinux
5500373 529444 1495040 7524857 72d1f9 vmlinux-string
Also it removes some quite ugly code and will likely speed up
compilation by a tiny bit by having less inline code to process
for every file.
I did some quick boot tests and everything worked as expected.
I left the 3dnow case alone for now.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/string_32.h | 127 ++-------------------------------------
arch/x86/lib/memcpy_32.c | 16 ++++
2 files changed, 23 insertions(+), 120 deletions(-)
Index: linux-2.6.30-rc2-ak/arch/x86/include/asm/string_32.h
===================================================================
--- linux-2.6.30-rc2-ak.orig/arch/x86/include/asm/string_32.h 2009-04-22 10:22:33.000000000 +0200
+++ linux-2.6.30-rc2-ak/arch/x86/include/asm/string_32.h 2009-04-22 10:23:12.000000000 +0200
@@ -29,121 +29,10 @@
#define __HAVE_ARCH_STRLEN
extern size_t strlen(const char *s);
-static __always_inline void *__memcpy(void *to, const void *from, size_t n)
-{
- int d0, d1, d2;
- asm volatile("rep ; movsl\n\t"
- "movl %4,%%ecx\n\t"
- "andl $3,%%ecx\n\t"
- "jz 1f\n\t"
- "rep ; movsb\n\t"
- "1:"
- : "=&c" (d0), "=&D" (d1), "=&S" (d2)
- : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
- : "memory");
- return to;
-}
-
-/*
- * This looks ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static __always_inline void *__constant_memcpy(void *to, const void *from,
- size_t n)
-{
- long esi, edi;
- if (!n)
- return to;
-
- switch (n) {
- case 1:
- *(char *)to = *(char *)from;
- return to;
- case 2:
- *(short *)to = *(short *)from;
- return to;
- case 4:
- *(int *)to = *(int *)from;
- return to;
-
- case 3:
- *(short *)to = *(short *)from;
- *((char *)to + 2) = *((char *)from + 2);
- return to;
- case 5:
- *(int *)to = *(int *)from;
- *((char *)to + 4) = *((char *)from + 4);
- return to;
- case 6:
- *(int *)to = *(int *)from;
- *((short *)to + 2) = *((short *)from + 2);
- return to;
- case 8:
- *(int *)to = *(int *)from;
- *((int *)to + 1) = *((int *)from + 1);
- return to;
- }
-
- esi = (long)from;
- edi = (long)to;
- if (n >= 5 * 4) {
- /* large block: use rep prefix */
- int ecx;
- asm volatile("rep ; movsl"
- : "=&c" (ecx), "=&D" (edi), "=&S" (esi)
- : "0" (n / 4), "1" (edi), "2" (esi)
- : "memory"
- );
- } else {
- /* small block: don't clobber ecx + smaller code */
- if (n >= 4 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- if (n >= 3 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- if (n >= 2 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- if (n >= 1 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- }
- switch (n % 4) {
- /* tail */
- case 0:
- return to;
- case 1:
- asm volatile("movsb"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- return to;
- case 2:
- asm volatile("movsw"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- return to;
- default:
- asm volatile("movsw\n\tmovsb"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- return to;
- }
-}
-
#define __HAVE_ARCH_MEMCPY
+extern void *__memcpy(void *to, const void *from, size_t n);
+
#ifdef CONFIG_X86_USE_3DNOW
#include <asm/mmx.h>
@@ -155,7 +44,7 @@
static inline void *__constant_memcpy3d(void *to, const void *from, size_t len)
{
if (len < 512)
- return __constant_memcpy(to, from, len);
+ return __memcpy(to, from, len);
return _mmx_memcpy(to, from, len);
}
@@ -168,20 +57,18 @@
#define memcpy(t, f, n) \
(__builtin_constant_p((n)) \
- ? __constant_memcpy3d((t), (f), (n)) \
+ ? __builtin_memcpy((t), (f), (n)) \
: __memcpy3d((t), (f), (n)))
#else
/*
* No 3D Now!
+ *
+ * Let gcc figure it out.
*/
-#define memcpy(t, f, n) \
- (__builtin_constant_p((n)) \
- ? __constant_memcpy((t), (f), (n)) \
- : __memcpy((t), (f), (n)))
-
+#define memcpy(t, f, n) __builtin_memcpy(t,f,n)
#endif
#define __HAVE_ARCH_MEMMOVE
Index: linux-2.6.30-rc2-ak/arch/x86/lib/memcpy_32.c
===================================================================
--- linux-2.6.30-rc2-ak.orig/arch/x86/lib/memcpy_32.c 2009-04-22 10:22:33.000000000 +0200
+++ linux-2.6.30-rc2-ak/arch/x86/lib/memcpy_32.c 2009-04-22 10:23:56.000000000 +0200
@@ -4,6 +4,22 @@
#undef memcpy
#undef memset
+void *__memcpy(void *to, const void *from, size_t n)
+{
+ int d0, d1, d2;
+ asm volatile("rep ; movsl\n\t"
+ "movl %4,%%ecx\n\t"
+ "andl $3,%%ecx\n\t"
+ "jz 1f\n\t"
+ "rep ; movsb\n\t"
+ "1:"
+ : "=&c" (d0), "=&D" (d1), "=&S" (d2)
+ : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
+ : "memory");
+ return to;
+}
+EXPORT_SYMBOL(__memcpy);
+
void *memcpy(void *to, const void *from, size_t n)
{
#ifdef CONFIG_X86_USE_3DNOW
--
[email protected] -- Speaking for myself only.
Commit-ID: 1405ae250ec86802b32ca9f7aea977a5ab551b22
Gitweb: http://git.kernel.org/tip/1405ae250ec86802b32ca9f7aea977a5ab551b22
Author: Andi Kleen <[email protected]>
AuthorDate: Wed, 22 Apr 2009 10:45:15 +0200
Committer: H. Peter Anvin <[email protected]>
CommitDate: Wed, 22 Apr 2009 10:55:20 -0700
x86: use __builtin_memcpy() on 32 bits
Modern gccs have their own heuristics to decide whether string functions
should be inlined or not. This used to not be the case with old gccs,
but Linux doesn't support them anymore. The 64bit kernel always did it
this way. Just define memcpy to __builtin_memcpy and gcc should do the
right thing. Also supply an out-of-line memcpy that gcc can fall back
to when it decides not to inline.
First this fixes the
arch/x86/include/asm/string_32.h:75: warning: array subscript is above array bounds
warnings which have been creeping up recently by just
removing that code.
Then trusting gcc actually makes the kernel smaller by nearly 3K:
5503146 529444 1495040 7527630 72dcce vmlinux
5500373 529444 1495040 7524857 72d1f9 vmlinux-string
Also it removes some quite ugly code and will likely speed up
compilation by a tiny bit by having less inline code to process
for every file.
I did some quick boot tests and everything worked as expected.
I left the 3dnow case alone for now.
[ Impact: fixes warning, reduces code size ]
Signed-off-by: Andi Kleen <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/string_32.h | 127 ++-----------------------------------
arch/x86/lib/memcpy_32.c | 16 +++++
2 files changed, 23 insertions(+), 120 deletions(-)
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 0e0e3ba..29fff54 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -29,121 +29,10 @@ extern char *strchr(const char *s, int c);
#define __HAVE_ARCH_STRLEN
extern size_t strlen(const char *s);
-static __always_inline void *__memcpy(void *to, const void *from, size_t n)
-{
- int d0, d1, d2;
- asm volatile("rep ; movsl\n\t"
- "movl %4,%%ecx\n\t"
- "andl $3,%%ecx\n\t"
- "jz 1f\n\t"
- "rep ; movsb\n\t"
- "1:"
- : "=&c" (d0), "=&D" (d1), "=&S" (d2)
- : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
- : "memory");
- return to;
-}
-
-/*
- * This looks ugly, but the compiler can optimize it totally,
- * as the count is constant.
- */
-static __always_inline void *__constant_memcpy(void *to, const void *from,
- size_t n)
-{
- long esi, edi;
- if (!n)
- return to;
-
- switch (n) {
- case 1:
- *(char *)to = *(char *)from;
- return to;
- case 2:
- *(short *)to = *(short *)from;
- return to;
- case 4:
- *(int *)to = *(int *)from;
- return to;
-
- case 3:
- *(short *)to = *(short *)from;
- *((char *)to + 2) = *((char *)from + 2);
- return to;
- case 5:
- *(int *)to = *(int *)from;
- *((char *)to + 4) = *((char *)from + 4);
- return to;
- case 6:
- *(int *)to = *(int *)from;
- *((short *)to + 2) = *((short *)from + 2);
- return to;
- case 8:
- *(int *)to = *(int *)from;
- *((int *)to + 1) = *((int *)from + 1);
- return to;
- }
-
- esi = (long)from;
- edi = (long)to;
- if (n >= 5 * 4) {
- /* large block: use rep prefix */
- int ecx;
- asm volatile("rep ; movsl"
- : "=&c" (ecx), "=&D" (edi), "=&S" (esi)
- : "0" (n / 4), "1" (edi), "2" (esi)
- : "memory"
- );
- } else {
- /* small block: don't clobber ecx + smaller code */
- if (n >= 4 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- if (n >= 3 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- if (n >= 2 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- if (n >= 1 * 4)
- asm volatile("movsl"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- }
- switch (n % 4) {
- /* tail */
- case 0:
- return to;
- case 1:
- asm volatile("movsb"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- return to;
- case 2:
- asm volatile("movsw"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- return to;
- default:
- asm volatile("movsw\n\tmovsb"
- : "=&D"(edi), "=&S"(esi)
- : "0"(edi), "1"(esi)
- : "memory");
- return to;
- }
-}
-
#define __HAVE_ARCH_MEMCPY
+extern void *__memcpy(void *to, const void *from, size_t n);
+
#ifdef CONFIG_X86_USE_3DNOW
#include <asm/mmx.h>
@@ -155,7 +44,7 @@ static __always_inline void *__constant_memcpy(void *to, const void *from,
static inline void *__constant_memcpy3d(void *to, const void *from, size_t len)
{
if (len < 512)
- return __constant_memcpy(to, from, len);
+ return __memcpy(to, from, len);
return _mmx_memcpy(to, from, len);
}
@@ -168,20 +57,18 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
#define memcpy(t, f, n) \
(__builtin_constant_p((n)) \
- ? __constant_memcpy3d((t), (f), (n)) \
+ ? __builtin_memcpy((t), (f), (n)) \
: __memcpy3d((t), (f), (n)))
#else
/*
* No 3D Now!
+ *
+ * Let gcc figure it out.
*/
-#define memcpy(t, f, n) \
- (__builtin_constant_p((n)) \
- ? __constant_memcpy((t), (f), (n)) \
- : __memcpy((t), (f), (n)))
-
+#define memcpy(t, f, n) __builtin_memcpy(t,f,n)
#endif
#define __HAVE_ARCH_MEMMOVE
diff --git a/arch/x86/lib/memcpy_32.c b/arch/x86/lib/memcpy_32.c
index 5415a9d..16dc123 100644
--- a/arch/x86/lib/memcpy_32.c
+++ b/arch/x86/lib/memcpy_32.c
@@ -4,6 +4,22 @@
#undef memcpy
#undef memset
+void *__memcpy(void *to, const void *from, size_t n)
+{
+ int d0, d1, d2;
+ asm volatile("rep ; movsl\n\t"
+ "movl %4,%%ecx\n\t"
+ "andl $3,%%ecx\n\t"
+ "jz 1f\n\t"
+ "rep ; movsb\n\t"
+ "1:"
+ : "=&c" (d0), "=&D" (d1), "=&S" (d2)
+ : "0" (n / 4), "g" (n), "1" ((long)to), "2" ((long)from)
+ : "memory");
+ return to;
+}
+EXPORT_SYMBOL(__memcpy);
+
void *memcpy(void *to, const void *from, size_t n)
{
#ifdef CONFIG_X86_USE_3DNOW
On Wed, 22 Apr 2009, Andi Kleen wrote:
>
> Modern gcc (and that is all that is supported now) should be able to
> generate this code on its own already. So if you call __builtin_* it
> will just work (that is what 64bit does) without that explicit code.
Last time we tried that, it wasn't true. Gcc wouldn't inline even trivial
cases of constant sizes.
I do not recall what gcc version that was all about, and maybe the
versions we now require are all modern enough that it isn't an issue, but
I'd want somebody to actually _verify_ that the oldest version we support
does an ok job, and doesn't do stupid things for constant-sized memcpy's.
> IMHO it's long overdue to do this for 32bit too.
With actual testing, I'll happily merge this.
Linus
On Wed, Apr 22, 2009 at 01:56:54PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 22 Apr 2009, Andi Kleen wrote:
> >
> > Modern gcc (and that is all that is supported now) should be able to
> > generate this code on its own already. So if you call __builtin_* it
> > will just work (that is what 64bit does) without that explicit code.
>
> Last time we tried that, it wasn't true. Gcc wouldn't inline even trivial
> cases of constant sizes.
AFAIK it's all true on 3.2+ when it can figure out the alignment
(but some gcc versions had problems passing the alignment around e.g.
through inlining), under the assumption that out of line can do
a better job with unaligned data. That's not true with my patch,
but could be true in theory.
Quick test here:
#include <string.h>

char a[10];
char b[2];
char c[4];
char d[8];
short x;
long y;
char xyz[100];
void f(void)
{
#define C(x) memcpy(&x, xyz, sizeof(x));
C(x)
C(y)
C(a)
C(b)
C(c)
C(d)
}
and everything gets inlined with gcc 3.2 which is the oldest
we still care about:
gcc version 3.2.3
movzwl xyz+8(%rip), %eax
movzwl xyz(%rip), %ecx
movq xyz(%rip), %rdx
movw %ax, a+8(%rip)
movw %cx, x(%rip)
movw %cx, b(%rip)
movl xyz(%rip), %eax
movq %rdx, y(%rip)
movq %rdx, a(%rip)
movq %rdx, d(%rip)
movl %eax, c(%rip)
ret
-Andi
--
[email protected] -- Speaking for myself only.
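A minimal way to repeat the inlining check above (a sketch): save the test
snippet as memcpy_inline_test.c (a placeholder name), compile it with the
compiler under test, and count the memcpy calls that survive in the
generated assembly. The compiler name, flags and grep pattern are
illustrative, not taken from the thread.

# Compile to assembly only and count out-of-line memcpy calls;
# 0 means every constant-size copy in the test was inlined.
gcc-3.2 -O2 -S -o memcpy_inline_test.s memcpy_inline_test.c
grep -c 'call.*memcpy' memcpy_inline_test.s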
On Wed, 22 Apr 2009, Andi Kleen wrote:
>
> AFAIK it's all true on 3.2+ when it can figure out the alignment
> (but some gcc versions had problems passing the alignment around e.g.
> through inlining), under the assumption that out of line can do
> a better job with unaligned data. That's not true with my patch,
> but could be true in theory.
Maybe it was the unaligned case that I remember.
Because it's definitely not true that out-of-line code can do any better
with unaligned data, at least not for small constants. And the case I
remember was for some silly 8-byte case or similar.
> Quick test here:
How about you just compile the kernel with gcc-3.2 and compare the number
of calls to memcpy before-and-after instead? That's the real test.
Linus
> > Quick test here:
>
> How about you just compile the kernel with gcc-3.2 and compare the number
> of calls to memcpy before-and-after instead? That's the real test.
I waited over 10 minutes for the full vmlinux objdumps to finish; sorry, I
lost patience. If someone has a fast disassembler we can try it. I'll leave
them running overnight; maybe there will be exact numbers tomorrow.
But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
very few files which call it with the patch, so there's some
evidence that there isn't a dramatic increase.
-Andi
--
[email protected] -- Speaking for myself only.
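The quick nm check above lists every memcpy symbol line; a slightly tighter
variant (a sketch of the same idea, not a command from the thread) counts
only the object files with an undefined reference, i.e. the files that will
actually call the out-of-line copy:

# Count distinct object files that reference memcpy/__memcpy as an
# undefined symbol in the build tree.
find . -name '*.o' -print0 | xargs -0 nm -A 2>/dev/null \
	| grep -E ' U (__)?memcpy$' | cut -d: -f1 | sort -u | wc -l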
On Wed, Apr 22, 2009 at 1:45 AM, Andi Kleen <[email protected]> wrote:
> Ingo Molnar <[email protected]> writes:
>
>> * Jeff Garzik <[email protected]> wrote:
>>
>>> On x86-32, this warning now appears for me in 2.6.30-rc3, and did
>>> not appear in 2.6.29.
>>>
>>> drivers/acpi/acpica/tbfadt.c: In function 'acpi_tb_create_local_fadt':
>>> /spare/repo/linux-2.6/arch/x86/include/asm/string_32.h:75: warning:
>>> array subscript is above array bounds
>>
>> Last i checked it was a GCC bounds check bogosity. All attempts to
>> work it around or annotate it sanely (without changing the assembly
>> code) failed. (new ideas welcome)
>>
>> The closest i came was the hacklet below to the assembly code. [with
>> an intentionally corrupted patch header to make it harder to apply
>> accidentally.]
>
> Modern gcc (and that is all that is supported now) should be able to
> generate this code on its own already. So if you call __builtin_* it
> will just work (that is what 64bit does) without that explicit code.
>
> Here's a patch that does that with some numbers. It gives about 3k
> smaller kernels on my configuration.
>
> IMHO it's long overdue to do this for 32bit too.
>
> It's a very attractive patch because it removes a lot of code:
I think this patch is great. Perhaps this would be a good time to also
clean out memset for x86_32? (If needed, I can start a new email
thread with this patch) I've built and booted this on my x86_32
hardware.
[danger: a newb's patch below]
Joe
--
Define memset to __builtin_memset and gcc should do the right thing.
Keep around the generic memset routine that gcc can use when it does
not inline. Removes a bunch of ugly code, too.
Signed-off-by: Joe Damato <[email protected]>
---
arch/x86/include/asm/string_32.h | 117 +-------------------------------------
arch/x86/lib/memcpy_32.c | 12 ++++
2 files changed, 15 insertions(+), 114 deletions(-)
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 29fff54..aedee80 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -30,7 +30,6 @@ extern char *strchr(const char *s, int c);
extern size_t strlen(const char *s);
#define __HAVE_ARCH_MEMCPY
-
extern void *__memcpy(void *to, const void *from, size_t n);
#ifdef CONFIG_X86_USE_3DNOW
@@ -79,42 +78,8 @@ void *memmove(void *dest, const void *src, size_t n);
#define __HAVE_ARCH_MEMCHR
extern void *memchr(const void *cs, int c, size_t count);
-static inline void *__memset_generic(void *s, char c, size_t count)
-{
- int d0, d1;
- asm volatile("rep\n\t"
- "stosb"
- : "=&c" (d0), "=&D" (d1)
- : "a" (c), "1" (s), "0" (count)
- : "memory");
- return s;
-}
-
-/* we might want to write optimized versions of these later */
-#define __constant_count_memset(s, c, count) __memset_generic((s), (c), (count))
-
-/*
- * memset(x, 0, y) is a reasonably common thing to do, so we want to fill
- * things 32 bits at a time even when we don't know the size of the
- * area at compile-time..
- */
-static __always_inline
-void *__constant_c_memset(void *s, unsigned long c, size_t count)
-{
- int d0, d1;
- asm volatile("rep ; stosl\n\t"
- "testb $2,%b3\n\t"
- "je 1f\n\t"
- "stosw\n"
- "1:\ttestb $1,%b3\n\t"
- "je 2f\n\t"
- "stosb\n"
- "2:"
- : "=&c" (d0), "=&D" (d1)
- : "a" (c), "q" (count), "0" (count/4), "1" ((long)s)
- : "memory");
- return s;
-}
+#define __HAVE_ARCH_MEMSET
+extern void *__memset(void *s, char c, size_t count);
/* Added by Gertjan van Wingerde to make minix and sysv module work */
#define __HAVE_ARCH_STRNLEN
@@ -124,83 +89,7 @@ extern size_t strnlen(const char *s, size_t count);
#define __HAVE_ARCH_STRSTR
extern char *strstr(const char *cs, const char *ct);
-/*
- * This looks horribly ugly, but the compiler can optimize it totally,
- * as we by now know that both pattern and count is constant..
- */
-static __always_inline
-void *__constant_c_and_count_memset(void *s, unsigned long pattern,
- size_t count)
-{
- switch (count) {
- case 0:
- return s;
- case 1:
- *(unsigned char *)s = pattern & 0xff;
- return s;
- case 2:
- *(unsigned short *)s = pattern & 0xffff;
- return s;
- case 3:
- *(unsigned short *)s = pattern & 0xffff;
- *((unsigned char *)s + 2) = pattern & 0xff;
- return s;
- case 4:
- *(unsigned long *)s = pattern;
- return s;
- }
-
-#define COMMON(x) \
- asm volatile("rep ; stosl" \
- x \
- : "=&c" (d0), "=&D" (d1) \
- : "a" (eax), "0" (count/4), "1" ((long)s) \
- : "memory")
-
- {
- int d0, d1;
-#if __GNUC__ == 4 && __GNUC_MINOR__ == 0
- /* Workaround for broken gcc 4.0 */
- register unsigned long eax asm("%eax") = pattern;
-#else
- unsigned long eax = pattern;
-#endif
-
- switch (count % 4) {
- case 0:
- COMMON("");
- return s;
- case 1:
- COMMON("\n\tstosb");
- return s;
- case 2:
- COMMON("\n\tstosw");
- return s;
- default:
- COMMON("\n\tstosw\n\tstosb");
- return s;
- }
- }
-
-#undef COMMON
-}
-
-#define __constant_c_x_memset(s, c, count) \
- (__builtin_constant_p(count) \
- ? __constant_c_and_count_memset((s), (c), (count)) \
- : __constant_c_memset((s), (c), (count)))
-
-#define __memset(s, c, count) \
- (__builtin_constant_p(count) \
- ? __constant_count_memset((s), (c), (count)) \
- : __memset_generic((s), (c), (count)))
-
-#define __HAVE_ARCH_MEMSET
-#define memset(s, c, count) \
- (__builtin_constant_p(c) \
- ? __constant_c_x_memset((s), (0x01010101UL * (unsigned char)(c)), \
- (count)) \
- : __memset((s), (c), (count)))
+#define memset(s, c, count) __builtin_memset(s, c, count)
/*
* find the first occurrence of byte 'c', or 1 past the area if none
diff --git a/arch/x86/lib/memcpy_32.c b/arch/x86/lib/memcpy_32.c
index 16dc123..16b10d9 100644
--- a/arch/x86/lib/memcpy_32.c
+++ b/arch/x86/lib/memcpy_32.c
@@ -30,6 +30,18 @@ void *memcpy(void *to, const void *from, size_t n)
}
EXPORT_SYMBOL(memcpy);
+void *__memset(void *s, char c, size_t count)
+{
+ int d0, d1;
+ asm volatile("rep\n\t"
+ "stosb"
+ : "=&c" (d0), "=&D" (d1)
+ : "a" (c), "1" (s), "0" (count)
+ : "memory");
+ return s;
+}
+EXPORT_SYMBOL(__memset);
+
void *memset(void *s, int c, size_t count)
{
return __memset(s, c, count);
--
1.6.2
Joe Damato wrote:
>
> I think this patch is great. Perhaps this would be a good time to also
> clean out memset for x86_32? (If needed, I can start a new email
> thread with this patch) I've built and booted this on my x86_32
> hardware.
>
Under the same conditions, i.e. making sure that with the oldest gcc we
still care about (3.2) we don't end up uninlining any new cases we
shouldn't.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
> > It's a very attractive patch because it removes a lot of code:
>
> I think this patch is great. Perhaps this would be a good time to also
> clean out memset for x86_32? (If needed, I can start a new email
Yes looks reasonable. You can add
Reviewed-by: Andi Kleen <[email protected]>
if you fix the comment below.
In fact it should be synced to do the same as on 64bit which already
does all that and was originally written for gcc 3.1/3.2 ...
-Andi
> +void *__memset(void *s, char c, size_t count)
> +{
> + int d0, d1;
> + asm volatile("rep\n\t"
> + "stosb"
> + : "=&c" (d0), "=&D" (d1)
> + : "a" (c), "1" (s), "0" (count)
> + : "memory");
> + return s;
> +}
> +EXPORT_SYMBOL(__memset);
I suspect __memset is not needed anymore; just memset() alone
should be enough. So remove the wrapper below.
> +
> void *memset(void *s, int c, size_t count)
> {
> return __memset(s, c, count);
> --
> 1.6.2
>
--
[email protected] -- Speaking for myself only.
Andi Kleen <[email protected]> writes:
>> > Quick test here:
>>
>> How about you just compile the kernel with gcc-3.2 and compare the number
>> of calls to memcpy before-and-after instead? That's the real test.
>
> I waited over 10 minutes for the full vmlinux objdumps to finish. sorry lost
> patience. If someone has a fast disassembler we can try it. I'll leave
> them running over night, maybe there are exact numbers tomorrow.
>
> But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
> very little files which call it with the patch, so there's some
> evidence that there isn't a dramatic increase.
I let the objdumps finish overnight. On my setup (defconfig + some
additions) there are actually fewer calls to the out-of-line memcpy/__memcpy
with the patch. I see only one for my defconfig, while there are
~10 without the patch. So it makes very little difference.
The code size savings must come from more efficient code generation
for the inline case. I haven't investigated that in detail though.
So the patch seems like an overall win.
-Andi
--
[email protected] -- Speaking for myself only.
* Andi Kleen <[email protected]> wrote:
> > > Quick test here:
> >
> > How about you just compile the kernel with gcc-3.2 and compare
> > the number of calls to memcpy before-and-after instead? That's
> > the real test.
>
> I waited over 10 minutes for the full vmlinux objdumps to finish.
> sorry lost patience. If someone has a fast disassembler we can try
> it. I'll leave them running over night, maybe there are exact
> numbers tomorrow.
Uhm, the test Linus requested is very simple, it doesn't need 'full'
objdumps, just a plain defconfig [*] - an objdump takes less than 10
seconds here even on an old box I tried it on.
I just did this - it all took less than 5 minutes to do the whole
test with gcc34:
vmlinux.gcc34.vanilla: 679 calls to memcpy
vmlinux.gcc34.gcc-memcpy: 1393 calls to memcpy
So your patch more than doubles the number of calls to out-of-line
memcpy on older GCC. That's not really acceptable, so I'm NAK-ing
this patch.
Next time you send such patches please test with older GCCs straight
away - it's a basic act of testing when doing a patch that 'lets GCC
decide' anything. GCC has a very bad track record in that area.
Ingo
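The exact command behind the two counts above isn't shown in the thread; a
plausible way to produce such numbers (a sketch, the grep pattern may need
tuning for a given toolchain) is:

# Count call sites that end up in the out-of-line memcpy/__memcpy.
objdump -d vmlinux.gcc34.vanilla    | grep -cE 'call.*<(__)?memcpy>'
objdump -d vmlinux.gcc34.gcc-memcpy | grep -cE 'call.*<(__)?memcpy>'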
* Andi Kleen <[email protected]> wrote:
> Andi Kleen <[email protected]> writes:
>
> >> > Quick test here:
> >>
> >> How about you just compile the kernel with gcc-3.2 and compare the number
> >> of calls to memcpy before-and-after instead? That's the real test.
> >
> > I waited over 10 minutes for the full vmlinux objdumps to finish. sorry lost
> > patience. If someone has a fast disassembler we can try it. I'll leave
> > them running over night, maybe there are exact numbers tomorrow.
> >
> > But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
> > very little files which call it with the patch, so there's some
> > evidence that there isn't a dramatic increase.
>
> I let the objdumps finish over night. [...]
objdump -d never took me more than a minute - let alone a full
night. You must be doing something really wrong there. Looking at
objdump -d is an essential, unavoidable component of my workflow
with x86 architecture patches, you need to find a way to do it
efficiently if you want to send patches for this area of the kernel.
> [...] On my setup (defconfig + some additions) there are actually
> less calls to out of line memcpy/__memcpy with the patch. I see
> only one for my defconfig, while there are ~10 without the patch.
> So it makes very little difference. The code size savings must
> come from more efficient code generation for the inline case. I
> haven't investigated that in detail though.
>
> So the patch seems like a overall win.
It's a clear loss here with GCC 3.4, and it took me less than 5
minutes to figure that out.
With what precise compiler version did you test (please paste the
gcc -v output), and could you send me the precise .config you used,
and describe the method you used to determine the number of
out-of-line memcpy calls? I'd like to double-check your numbers.
Ingo
On Thu, Apr 23, 2009 at 08:36:25AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > Andi Kleen <[email protected]> writes:
> >
> > >> > Quick test here:
> > >>
> > >> How about you just compile the kernel with gcc-3.2 and compare the number
> > >> of calls to memcpy before-and-after instead? That's the real test.
> > >
> > > I waited over 10 minutes for the full vmlinux objdumps to finish. sorry lost
> > > patience. If someone has a fast disassembler we can try it. I'll leave
> > > them running over night, maybe there are exact numbers tomorrow.
> > >
> > > But from a quick check (find -name '*.o' | xargs nm | grep memcpy) there are
> > > very little files which call it with the patch, so there's some
> > > evidence that there isn't a dramatic increase.
> >
> > I let the objdumps finish over night. [...]
>
> objdump -d never took me more than a minute - let alone a full
I use objdump -S. Maybe that's slower than -d.
Hmm, quick test: yes, -S seems to be much slower than -d. Thanks for
the hint. I guess I should switch to -d for these cases; unfortunately
-S seems to be hardcoded in my fingers, and of course it gives much
nicer output if you have debug info.
> night. You must be doing something really wrong there. Looking at
> objdump -d is an essential, unavoidable component of my workflow
> with x86 architecture patches, you need to find a way to do it
I do it all the time too, but only for specific functions, not
for full kernels. I have an objdump-symbol script for that that
looks up a symbol in the symbol table and only disassembles
the function I'm interested in
(ftp://firstfloor.org/pub/ak/perl/objdump-symbol).
I normally don't look at full listings of the complete kernel.
> > [...] On my setup (defconfig + some additions) there are actually
> > less calls to out of line memcpy/__memcpy with the patch. I see
> > only one for my defconfig, while there are ~10 without the patch.
> > So it makes very little difference. The code size savings must
> > come from more efficient code generation for the inline case. I
> > haven't investigated that in detail though.
> >
> > So the patch seems like a overall win.
>
> It's a clear loss here with GCC 3.4, and it took me less than 5
> minutes to figure that out.
Loss in what way?
>
> With what precise compiler version did you test (please paste the
> gcc -v output), and could you send me the precise .config you used,
See the 2nd previous mail: 3.2.3
I didn't do tests with later versions, assuming there are no
regressions.
> and describe the method you used to determine the number of
> out-of-line memcpy calls? I'd like to double-check your numbers.
objdump -S ... | grep call.*memcpy (gives some false positives,
you have to weed them out)
In addition I did a quick find -name '*.o' | xargs nm | grep 'U.*memcpy$'
to (under) estimate the calls
-Andi
--
[email protected] -- Speaking for myself only.
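The objdump-symbol helper mentioned above isn't reproduced in the thread; a
rough shell equivalent of the same idea (an approximation, not the script
from that URL) looks like this:

#!/bin/sh
# objdump-symbol.sh: disassemble a single function by symbol name instead
# of listing the whole image.  Usage: objdump-symbol.sh vmlinux memcpy
# (The shell arithmetic is fine for 32-bit images; very large 64-bit
# addresses may overflow it.)
obj=$1 sym=$2
line=$(nm -S --defined-only "$obj" | awk -v s="$sym" '$4 == s { print $1, $2; exit }')
[ -n "$line" ] || { echo "symbol $sym not found in $obj" >&2; exit 1; }
set -- $line
start=$((0x$1))
stop=$((0x$1 + 0x$2))
objdump -d --start-address="$start" --stop-address="$stop" "$obj"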
On Thu, Apr 23, 2009 at 08:30:53AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > > Quick test here:
> > >
> > > How about you just compile the kernel with gcc-3.2 and compare
> > > the number of calls to memcpy before-and-after instead? That's
> > > the real test.
> >
> > I waited over 10 minutes for the full vmlinux objdumps to finish.
> > sorry lost patience. If someone has a fast disassembler we can try
> > it. I'll leave them running over night, maybe there are exact
> > numbers tomorrow.
>
> Uhm, the test Linus requested is very simple, it doesnt need 'full'
> objdumps, just a plain defconfig [*] - an objdump takes less than 10
> seconds here even on an old box i tried it on.
>
> I just did this - it all took less than 5 minutes to do the whole
> test with gcc34:
>
> vmlinux.gcc34.vanilla: 679 calls to memcpy
> vmlinux.gcc34.gcc-memcpy: 1393 calls to memcpy
>
> So your patch more than doubles the number of calls to out-of-line
> memcpy on older GCC. That's not really acceptable
How do you determine it's not acceptable?
It seems not nice to me, but not a fatal problem. I think
Linus was more interested in dramatic growth, but a factor of 2 over
a very large code base doesn't seem dramatic to me, especially
since it's very likely most of them are slow-path code.
> Next time you send such patches please test with older GCCs straight
> away - it's a basic act of testing when doing a patch that 'lets GCC
I tested with 3.2.3, but not with 3.4.
3.2.3 doesn't do that. AFAIK 3.2 was a pretty common compiler; at least
several SUSE releases shipped with it.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Apr 22, 2009 at 6:48 PM, H. Peter Anvin <[email protected]> wrote:
> Joe Damato wrote:
>>
>> I think this patch is great. Perhaps this would be a good time to also
>> clean out memset for x86_32? (If needed, I can start a new email
>> thread with this patch) I've built and booted this on my x86_32
>> hardware.
>>
>
> Under the same conditions, i.e. making sure that with the oldest gcc we
> still care about (3.2) we don't end up uninlining any new cases we
> shouldn't.
Looks like this thread is dead/dying, but figured I should reply with
my test findings. The number of out-of-line calls (as determined by:
make mrproper && make defconfig && make && objdump -d vmlinux | grep
"call.*\<memset" | wc -l):
gcc 4.2.4 - withOUT memset patch: 20
gcc 4.2.4 - with memset patch: 365
gcc 3.4 - withOUT memset patch: 17
gcc 3.4 - with memset patch: 349
I'm guessing this is probably not acceptable, so I won't bother
installing/trying gcc-3.2 unless anyone thinks that a 300+ increase in
out-of-line calls is OK.
Joe
Joe Damato wrote:
>
> Looks like this thread is dead/dying, but figured I should reply with
> my test findings. The number of out-of-line calls (as determined by:
> make mrproper && make defconfig && make && objdump -d vmlinux | grep
> "call.*\<memset" | wc -l))
>
> gcc 4.2.4 - withOUT memset patch: 20
> gcc 4.2.4 - with memset patch: 365
>
> gcc 3.4 - withOUT memset patch: 17
> gcc 3.4 - with memset patch: 349
>
> I'm guessing this is probably not acceptable, so I won't bother
> installing/trying gcc-3.2 unless anyone thinks that a 300+ increase in
> out-of-line calls is OK.
>
Not unless it can be proven those calls are in non-performance-critical
contexts. That's a lot of work to go through, though.
-hpa
> gcc 4.2.4 - withOUT memset patch: 20
> gcc 4.2.4 - with memset patch: 365
>
> gcc 3.4 - withOUT memset patch: 17
> gcc 3.4 - with memset patch: 349
Yes it sounds like 3.4 is worse on that than 3.2. Too bad.
> I'm guessing this is probably not acceptable, so I won't bother
It depends on whether the calls are in critical code, or how big they
are (for a 1K memset it's totally fine to have it out of line).
For example, for any memsets in __init functions we wouldn't
care. You could filter those out. And perhaps eyeball the code.
-Andi
--
[email protected] -- Speaking for myself only.
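A sketch of the filtering suggested above: tally the memset call sites per
section, so calls coming from .init.text (discarded after boot) can be
ignored when judging the numbers. The patterns follow GNU objdump's output
format and may need tuning; swap memset for memcpy as needed.

# Per-section count of calls to the out-of-line memset/__memset.
objdump -d vmlinux | awk '
	/^Disassembly of section / { sect = $4; sub(/:$/, "", sect) }
	/call.*<(__)?memset>/      { count[sect]++ }
	END { for (s in count) printf "%-20s %d\n", s, count[s] }'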