2005-03-18 09:22:33

by Denis Vlasenko

Subject: [PATCH] reduce inlined x86 memcpy by 2 bytes

This memcpy() is 2 bytes shorter than the one currently in mainline
and it has one branch fewer. It is also 3-4% faster in microbenchmarks
on small blocks if the block size is a multiple of 4. Mainline is slower
because it has to branch twice per memcpy, with both branches mispredicted
(although in a microbenchmark the branch predictor hides that cost).

The last remaining branch can be dropped too, but then we always execute
the second 'rep movsb', even if blocksize%4==0. That is slower than
mainline because 'rep movsb' is microcoded. I wonder, though, whether
'branchlessness' wins over this in real-world use (not just in the benchmark).
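
For illustration, the one-branch variant boils down to inline asm roughly
like the following. This is a sketch reconstructed from the generated code
quoted later in this thread, not the attached string.memcpy.diff itself;
the exact constraint letters are my assumption:

static inline void *__memcpy(void *to, const void *from, size_t n)
{
        int d0, d1, d2;             /* dummy outputs: %ecx/%edi/%esi are clobbered */
        __asm__ __volatile__(
                "rep ; movsl\n\t"   /* copy n/4 dwords (%ecx preloaded with n/4) */
                "movl %4,%%ecx\n\t" /* reload the original byte count */
                "andl $3,%%ecx\n\t" /* keep only the 0..3 tail bytes */
                "jz 1f\n\t"         /* the single remaining branch */
                "rep ; movsb\n"     /* copy the tail bytes */
                "1:"
                : "=&c" (d0), "=&D" (d1), "=&S" (d2)
                : "0" (n / 4), "q" (n), "1" ((long) to), "2" ((long) from)
                : "memory");
        return to;
}

Dropping the 'jz 1f' and the '1:' label gives the branchless variant,
at the price of always issuing the microcoded 'rep movsb'.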

I think blocksize%4==0 happens more than 25% of the time.

This is how much an 'allyesconfig' vmlinux gains with the branchless memcpy():

# size vmlinux.org vmlinux.memcpy
text data bss dec hex filename
18178950 6293427 1808916 26281293 191054d vmlinux.org
18165160 6293427 1808916 26267503 190cf6f vmlinux.memcpy

# echo $(( (18178950-18165160) ))
13790 <============= bytes saved on allyesconfig

# echo $(( (18178950-18165160)/4 ))
3447 <============= memcpy() callsites optimized

The attached patch (which keeps one branch) would save 6.5k instead of 13k.

The patch is run-tested.
--
vda


Attachments:
string.memcpy.diff (709.00 B)

2005-03-18 10:08:07

by Denis Vlasenko

Subject: Re: [PATCH] reduce inlined x86 memcpy by 2 bytes

On Friday 18 March 2005 11:21, Denis Vlasenko wrote:
> This memcpy() is 2 bytes shorter than the one currently in mainline
> and it has one branch fewer. It is also 3-4% faster in microbenchmarks
> on small blocks if the block size is a multiple of 4. Mainline is slower
> because it has to branch twice per memcpy, with both branches mispredicted
> (although in a microbenchmark the branch predictor hides that cost).
>
> The last remaining branch can be dropped too, but then we always execute
> the second 'rep movsb', even if blocksize%4==0. That is slower than
> mainline because 'rep movsb' is microcoded. I wonder, though, whether
> 'branchlessness' wins over this in real-world use (not just in the benchmark).
>
> I think blocksize%4==0 happens more than 25% of the time.

s/%4/&3/, of course.
--
vda

2005-03-20 13:17:42

by Adrian Bunk

Subject: Re: [PATCH] reduce inlined x86 memcpy by 2 bytes

Hi Denis,

what do your benchmarks say about replacing the whole assembler code
with a

#define __memcpy __builtin_memcpy

?

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2005-03-22 06:45:28

by Denis Vlasenko

Subject: Re: [PATCH] reduce inlined x86 memcpy by 2 bytes

On Sunday 20 March 2005 15:17, Adrian Bunk wrote:
> Hi Denis,
>
> what do your benchmarks say about replacing the whole assembler code
> with a
>
> #define __memcpy __builtin_memcpy

It generates a call to the out-of-line memcpy()
if the count is not a compile-time constant.

# cat t.c
extern char *a, *b;
extern int n;

void f() {
        __builtin_memcpy(a, b, n);
}

void g() {
        __builtin_memcpy(a, b, 24);
}
# gcc -S -O2 -fomit-frame-pointer t.c
# cat t.s
        .file   "t.c"
        .text
        .p2align 2,,3
.globl f
        .type   f, @function
f:
        subl    $16, %esp
        pushl   n
        pushl   b
        pushl   a
        call    memcpy
        addl    $28, %esp
        ret
        .size   f, .-f
        .p2align 2,,3
.globl g
        .type   g, @function
g:
        pushl   %edi
        pushl   %esi
        movl    a, %edi
        movl    b, %esi
        cld
        movl    $6, %ecx
        rep
        movsl
        popl    %esi
        popl    %edi
        ret
        .size   g, .-g
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4.1"

Proving that it is slower than the inline version is left
as an exercise for the reader :)

The kernel one is always inlined.
void h() { __memcpy(a,b,n); } generates:
        movl    n, %eax
        pushl   %edi
        movl    %eax, %ecx
        pushl   %esi
        movl    a, %edi
        movl    b, %esi
        shrl    $2, %ecx
#APP
        rep ; movsl
        movl    %eax,%ecx
        andl    $3,%ecx
        jz      1f
        rep ; movsb
1:
#NO_APP
        popl    %esi
        popl    %edi
        ret
--
vda