2009-11-06 09:36:51

by Ma Ling

Subject: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

From: Ma Ling <[email protected]>

Hi All

Intel Nehalem improves the performance of REP strings significantly
over previous microarchitectures in several ways:

1. Startup overhead has been reduced in most cases.
2. Data transfer throughput is improved.
3. REP string can operate in "fast string" mode even if the address is
not aligned to 16 bytes.

According to our experiments, when the copy size is big enough MOVSQ
can achieve almost 16 bytes of throughput per cycle, which approximates
the SSE instruction set. The patch utilizes this optimization when the
copy size is over 1024 bytes.
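For illustration, here is the same idea as a minimal user-space C
sketch (GNU inline asm; a sketch of the approach, not the kernel code
itself): copy the bulk of the buffer as 8-byte words with REP MOVSQ,
then finish the 0-7 byte tail with REP MOVSB.

#include <stddef.h>

/* Sketch of the fast-string path: the "+D"/"+S"/"+c" constraints place
 * dst/src/count in RDI/RSI/RCX, which REP MOVSQ consumes directly. */
static void *movsq_memcpy(void *dst, const void *src, size_t len)
{
	void *ret = dst;
	size_t qwords = len >> 3;	/* number of 8-byte words */
	size_t tail = len & 7;		/* remaining 0..7 bytes */

	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (tail)
		     : : "memory");
	return ret;
}

The .Lmore_0x400 path in the patch below does the same thing in
assembly.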

Experimental speedup data on the Nehalem platform:
Len alignment Speedup
1024, 0/ 0: 1.04x
2048, 0/ 0: 1.36x
3072, 0/ 0: 1.51x
4096, 0/ 0: 1.60x
5120, 0/ 0: 1.70x
6144, 0/ 0: 1.74x
7168, 0/ 0: 1.77x
8192, 0/ 0: 1.80x
9216, 0/ 0: 1.82x
10240, 0/ 0: 1.83x
11264, 0/ 0: 1.85x
12288, 0/ 0: 1.86x
13312, 0/ 0: 1.92x
14336, 0/ 0: 1.84x
15360, 0/ 0: 1.74x

Data before the patch, from 'perf stat --repeat 10 ./static_orig':

Performance counter stats for './static_orig' (10 runs):

2835.650105 task-clock-msecs # 0.999 CPUs ( +- 0.051% )
3 context-switches # 0.000 M/sec ( +- 6.503% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4429 page-faults # 0.002 M/sec ( +- 0.003% )
7941098692 cycles # 2800.451 M/sec ( +- 0.051% )
10848100323 instructions # 1.366 IPC ( +- 0.000% )
322808 cache-references # 0.114 M/sec ( +- 1.467% )
280716 cache-misses # 0.099 M/sec ( +- 0.618% )

2.838006377 seconds time elapsed ( +- 0.051% )

Data after the patch, from 'perf stat --repeat 10 ./static_new':

Performance counter stats for './static_new' (10 runs):

7401.423466 task-clock-msecs # 0.999 CPUs ( +- 0.108% )
10 context-switches # 0.000 M/sec ( +- 2.797% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
20727280183 cycles # 2800.445 M/sec ( +- 0.107% )
1472673654 instructions # 0.071 IPC ( +- 0.013% )
1092221 cache-references # 0.148 M/sec ( +- 12.414% )
290550 cache-misses # 0.039 M/sec ( +- 1.577% )

7.407006046 seconds time elapsed ( +- 0.108% )

Appreciate your comments.

Thanks
Ma Ling

---
arch/x86/lib/memcpy_64.S | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index ad5441e..2ea3561 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -50,6 +50,12 @@ ENTRY(memcpy)
 	movl %edx, %ecx
 	shrl $6, %ecx
 	jz .Lhandle_tail
+	/*
+	 * If the length is 1024 bytes or more, use the optimized MOVSQ,
+	 * which has higher throughput.
+	 */
+	cmpl $0x400, %edx
+	jae .Lmore_0x400

 	.p2align 4
 .Lloop_64:
@@ -119,6 +125,17 @@ ENTRY(memcpy)

 .Lend:
 	ret
+
+	.p2align 4
+.Lmore_0x400:
+	movq %rdi, %rax
+	movl %edx, %ecx
+	shrl $3, %ecx
+	andl $7, %edx
+	rep movsq
+	movl %edx, %ecx
+	rep movsb
+	ret
 	CFI_ENDPROC
 ENDPROC(memcpy)
 ENDPROC(__memcpy)
--
1.6.2.5


2009-11-06 16:51:29

by Andi Kleen

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

[email protected] writes:

> Intel Nehalem improves the performance of REP strings significantly
> over previous microarchitectures in several ways:

The problem is that it's not necessarily a win on older CPUs to
do it this way.

-Andi

--
[email protected] -- Speaking for myself only.

2009-11-06 17:08:00

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/06/2009 01:41 AM, [email protected] wrote:
>
> Performance counter stats for './static_orig' (10 runs):
>
> 2835.650105 task-clock-msecs # 0.999 CPUs ( +- 0.051% )
> 3 context-switches # 0.000 M/sec ( +- 6.503% )
> 0 CPU-migrations # 0.000 M/sec ( +- nan% )
> 4429 page-faults # 0.002 M/sec ( +- 0.003% )
> 7941098692 cycles # 2800.451 M/sec ( +- 0.051% )
> 10848100323 instructions # 1.366 IPC ( +- 0.000% )
> 322808 cache-references # 0.114 M/sec ( +- 1.467% )
> 280716 cache-misses # 0.099 M/sec ( +- 0.618% )
>
> 2.838006377 seconds time elapsed ( +- 0.051% )
>
> 'perf stat --repeat 10 ./static_new' command get data after patch:
>
> Performance counter stats for './static_new' (10 runs):
>
> 7401.423466 task-clock-msecs # 0.999 CPUs ( +- 0.108% )
> 10 context-switches # 0.000 M/sec ( +- 2.797% )
> 0 CPU-migrations # 0.000 M/sec ( +- nan% )
> 4428 page-faults # 0.001 M/sec ( +- 0.003% )
> 20727280183 cycles # 2800.445 M/sec ( +- 0.107% )
> 1472673654 instructions # 0.071 IPC ( +- 0.013% )
> 1092221 cache-references # 0.148 M/sec ( +- 12.414% )
> 290550 cache-misses # 0.039 M/sec ( +- 1.577% )
>
> 7.407006046 seconds time elapsed ( +- 0.108% )
>

I assume these are backwards? If so, it's a dramatic performance
improvement.

Where did the 1024 byte threshold come from? It seems a bit high to me,
and is at the very best a CPU-specific tuning factor.

Andi is of course correct that older CPUs might suffer (sadly enough),
which is why we'd at the very least need some idea of what the
performance impact on those older CPUs would look like -- at that point
we can make a decision to just unconditionally do the rep movs or
consider some system where we point at different implementations for
different processors -- memcpy is probably one of the very few
operations for which something like that would make sense.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-06 19:26:14

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>
> Where did the 1024 byte threshold come from? It seems a bit high to me,
> and is at the very best a CPU-specific tuning factor.
>
> Andi is of course correct that older CPUs might suffer (sadly enough),
> which is why we'd at the very least need some idea of what the
> performance impact on those older CPUs would look like -- at that point
> we can make a decision to just unconditionally do the rep movs or
> consider some system where we point at different implementations for
> different processors -- memcpy is probably one of the very few
> operations for which something like that would make sense.
>

To be explicit: Ling, would you be willing to run some benchmarks across
processors to see how this performs on non-Nehalem CPUs?

-hpa

2009-11-08 10:19:07

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.


* Andi Kleen <[email protected]> wrote:

> [email protected] writes:
>
> > Intel Nehalem improves the performance of REP strings significantly
> > over previous microarchitectures in several ways:
>
> The problem is that it's not necessarily a win on older CPUs to do it
> this way.

I'm wondering, why are you writing such obtuse comments on Intel-submitted
patches? Both you and I know which older CPUs have a
slow string implementation, and you know the rough order of magnitude
and significance as well, and you have ideas for how to solve it all.

Instead you injected just the minimal amount of information into this
thread to derail this patch you can see a problem with, but you didn't at
all explain your full opinion openly and honestly, and you certainly
didn't give enough information to allow Ling Ma to act upon your opinion
with maximum efficiency.

I.e. you are not being helpful at all here and you are obstructing Intel
folks actively, making their workflow and progress as inefficient as you
possibly can. Why are you doing that?

Ingo

2009-11-09 07:25:17

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

Hi All

Today we ran our benchmark on Core2 and Sandy Bridge:

1. Results on Core2
Speedup on Core2
Len Alignment Speedup
1024, 0/ 0: 0.95x
2048, 0/ 0: 1.03x
3072, 0/ 0: 1.02x
4096, 0/ 0: 1.09x
5120, 0/ 0: 1.13x
6144, 0/ 0: 1.13x
7168, 0/ 0: 1.14x
8192, 0/ 0: 1.13x
9216, 0/ 0: 1.14x
10240, 0/ 0: 0.99x
11264, 0/ 0: 1.14x
12288, 0/ 0: 1.14x
13312, 0/ 0: 1.10x
14336, 0/ 0: 1.10x
15360, 0/ 0: 1.13x
Application run through perf:
for (i = 1024; i < 1024 * 16; i = i + 64)
	do_memcpy(0, 0, i);
Run the application with 'perf stat --repeat 10 ./static_orig/new'.
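(The static_orig/static_new binaries themselves are not included here;
the following is a hypothetical reconstruction of the driver's shape,
with plain memcpy standing in for whichever implementation each binary
actually linked in:)

#include <stddef.h>
#include <string.h>

#define BUF_SIZE (1024 * 16)
static char src_buf[BUF_SIZE], dst_buf[BUF_SIZE];

/* hypothetical helper: copy len bytes at the given dst/src offsets */
static void do_memcpy(size_t dst_off, size_t src_off, size_t len)
{
	memcpy(dst_buf + dst_off, src_buf + src_off, len);
}

int main(void)
{
	size_t i;
	int rep;

	/* repeat the sweep so 'perf stat' has enough work to measure */
	for (rep = 0; rep < 200000; rep++)
		for (i = 1024; i < 1024 * 16; i += 64)
			do_memcpy(0, 0, i);
	return 0;
}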
Before the patch:
Performance counter stats for './static_orig' (10 runs):

3323.041832 task-clock-msecs # 0.998 CPUs ( +- 0.016% )
22 context-switches # 0.000 M/sec ( +- 31.913% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
9921549804 cycles # 2985.683 M/sec ( +- 0.016% )
10863809359 instructions # 1.095 IPC ( +- 0.000% )
972283451 cache-references # 292.588 M/sec ( +- 0.018% )
17703 cache-misses # 0.005 M/sec ( +- 4.304% )

3.330714469 seconds time elapsed ( +- 0.021% )
After the patch:
Performance counter stats for './static_new' (10 runs):
3392.902871 task-clock-msecs # 0.998 CPUs ( +- 0.226% )
21 context-switches # 0.000 M/sec ( +- 30.982% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
10130188030 cycles # 2985.699 M/sec ( +- 0.227% )
391981414 instructions # 0.039 IPC ( +- 0.013% )
874161826 cache-references # 257.644 M/sec ( +- 3.034% )
17628 cache-misses # 0.005 M/sec ( +- 4.577% )

3.400681174 seconds time elapsed ( +- 0.219% )

2. Results on Sandy Bridge
Speedup on Sandy Bridge
Len Alignment Speedup
1024, 0/ 0: 1.08x
2048, 0/ 0: 1.42x
3072, 0/ 0: 1.51x
4096, 0/ 0: 1.63x
5120, 0/ 0: 1.67x
6144, 0/ 0: 1.72x
7168, 0/ 0: 1.75x
8192, 0/ 0: 1.77x
9216, 0/ 0: 1.80x
10240, 0/ 0: 1.80x
11264, 0/ 0: 1.82x
12288, 0/ 0: 1.85x
13312, 0/ 0: 1.85x
14336, 0/ 0: 1.88x
15360, 0/ 0: 1.88x

Application run through perf:
for (i = 1024; i < 1024 * 16; i = i + 64)
	do_memcpy(0, 0, i);
Run the application with 'perf stat --repeat 10 ./static_orig/new'.
Before the patch:
Performance counter stats for './static_orig' (10 runs):

3787.441240 task-clock-msecs # 0.995 CPUs ( +- 0.140% )
8 context-switches # 0.000 M/sec ( +- 22.602% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.001 M/sec ( +- 0.003% )
6053487926 cycles # 1598.305 M/sec ( +- 0.140% )
10861025194 instructions # 1.794 IPC ( +- 0.001% )
2823963 cache-references # 0.746 M/sec ( +- 69.345% )
266000 cache-misses # 0.070 M/sec ( +- 0.980% )

3.805400837 seconds time elapsed ( +- 0.139% )
After the patch:
Performance counter stats for './static_new' (10 runs):

2879.424879 task-clock-msecs # 0.995 CPUs ( +- 0.076% )
10 context-switches # 0.000 M/sec ( +- 24.761% )
0 CPU-migrations # 0.000 M/sec ( +- nan% )
4428 page-faults # 0.002 M/sec ( +- 0.003% )
4602155158 cycles # 1598.290 M/sec ( +- 0.076% )
386146993 instructions # 0.084 IPC ( +- 0.005% )
520008 cache-references # 0.181 M/sec ( +- 8.077% )
267345 cache-misses # 0.093 M/sec ( +- 0.792% )

2.893813235 seconds time elapsed ( +- 0.085% )

Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:[email protected]]
>Sent: November 7, 2009 3:26
>To: Ma, Ling
>Cc: [email protected]; [email protected]; [email protected]
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>>
>> Where did the 1024 byte threshold come from? It seems a bit high to me,
>> and is at the very best a CPU-specific tuning factor.
>>
>> Andi is of course correct that older CPUs might suffer (sadly enough),
>> which is why we'd at the very least need some idea of what the
>> performance impact on those older CPUs would look like -- at that point
>> we can make a decision to just unconditionally do the rep movs or
>> consider some system where we point at different implementations for
>> different processors -- memcpy is probably one of the very few
>> operations for which something like that would make sense.
>>
>
>To be expicit: Ling, would you be willing to run some benchmarks across
>processors to see how this performs on non-Nehalem CPUs?
>
> -hpa

2009-11-09 07:36:16

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/08/2009 11:24 PM, Ma, Ling wrote:
> Hi All
>
> Today we run our benchmark on Core2 and Sandy Bridge:
>

Hi Ling,

Thanks for doing that. Do you also have access to any older CPUs? I
suspect that the CPUs that Andi is worried about are older CPUs like
P4, K8 or Pentium M/Core 1. (Andi: please do clarify if you have
additional information.)

My personal opinion is that if we can show no significant slowdown on
P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this code
unconditionally. If one of them is radically worse than baseline, then
we have to do something conditional, which is a lot more complicated.

[Ingo, Thomas: do you agree?]

Thanks,

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-09 08:08:39

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.


* H. Peter Anvin <[email protected]> wrote:

> On 11/08/2009 11:24 PM, Ma, Ling wrote:
> > Hi All
> >
> > Today we run our benchmark on Core2 and Sandy Bridge:
> >
>
> Hi Ling,
>
> Thanks for doing that. Do you also have access to any older CPUs? I
> suspect that the CPUs that Andi are worried about are older CPUs like
> P4, K8 or Pentium M/Core 1. (Andi: please do clarify if you have
> additional information.)
>
> My personal opinion is that if we can show no significant slowdown on
> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this
> code unconditionally. If one of them is radically worse than
> baseline, then we have to do something conditional, which is a lot
> more complicated.
>
> [Ingo, Thomas: do you agree?]

Yeah. IIRC the worst case was the old P2s, which had really slow,
microcode-based string ops. (Some of them even had errata in early
prototypes, although we can certainly ignore those, as string ops get
relied on quite frequently.)

IIRC the original PPro core came up with some nifty, hardwired string
ops, but those had to be dumbed down and emulated in microcode due to
SMP bugs - making it an inferior choice in the end.

But that should be ancient history and i'd suggest we ignore the P4
dead-end too, unless it's some really big slowdown (which i doubt). If
anyone cares then some optional assembly implementations could be added
back.

Ling, if you are interested, could you send a user-space test-app to
this thread that everyone could just compile and run on various older
boxes, to gather a performance profile of hand-coded versus string ops
performance?

( And i think we can make a judgement based on cache-hot performance
alone - if anything, the string ops will perform comparatively better
in cache-cold scenarios, so the cache-hot numbers would be a
conservative estimate. )
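
As a starting point, here is a minimal sketch of such a test app
(assuming TSC-based timing on a box with a constant-rate TSC; the
buffer sizes, repeat counts and the bare REP MOVSB used as the
"string ops" side are arbitrary choices for illustration, not Ling's
actual program):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

/* string-op copy: moves len bytes from src to dst via REP MOVSB */
static void movsb_copy(void *dst, const void *src, size_t len)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
}

static char src_buf[16 * 1024], dst_buf[16 * 1024];

int main(void)
{
	size_t len;
	int i, reps = 100000;
	uint64_t t0, t1, t2;

	for (len = 1024; len < sizeof(src_buf); len += 1024) {
		t0 = rdtsc();
		for (i = 0; i < reps; i++)	/* "hand-coded" side */
			memcpy(dst_buf, src_buf, len);
		t1 = rdtsc();
		for (i = 0; i < reps; i++)	/* string-ops side */
			movsb_copy(dst_buf, src_buf, len);
		t2 = rdtsc();

		printf("len %5lu: memcpy %.1f cycles, rep movsb %.1f cycles\n",
		       (unsigned long)len,
		       (t1 - t0) / (double)reps, (t2 - t1) / (double)reps);
	}
	return 0;
}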

Ingo

2009-11-09 09:26:14

by Andi Kleen

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

"H. Peter Anvin" <[email protected]> writes:
>
> My personal opinion is that if we can show no significant slowdown on
> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this code

The issue is Core 2.

P4 uses a different path, and Core 1 doesn't use the 64bit code.

> unconditionally. If one of them is radically worse than baseline, then
> we have to do something conditional, which is a lot more complicated.

I have an older patchkit which did this, and some more optimizations
to this code.

There was still one open issue, which is why I didn't post it. If there's
interest I can post it.

-Andi
--
[email protected] -- Speaking for myself only.

2009-11-09 16:41:53

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/09/2009 01:26 AM, Andi Kleen wrote:
> "H. Peter Anvin" <[email protected]> writes:
>>
>> My personal opinion is that if we can show no significant slowdown on
>> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this code
>
> The issue is Core 2.
>
> P4 uses a different path, and Core 1 doesn't use the 64bit code.
>

Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
was something like 0.95x baseline in the worst case, and most of the
cases were positive) so Core 2 doesn't seem to have a problem.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-09 18:54:31

by Andi Kleen

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

> Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
> was something like 0.95x baseline in the worst case, and most of the
> cases were positive) so Core 2 doesn't seem to have a problem.

I ran quite a lot of micro-benchmarks with various alignments and sizes;
the 'q' variant was not always a win. I haven't checked that particular
version though.

There's also K8 of course.

-Andi
--
[email protected] -- Speaking for myself only.

2009-11-09 22:37:18

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/09/2009 10:54 AM, Andi Kleen wrote:
>> Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
>> was something like 0.95x baseline in the worst case, and most of the
>> cases were positive) so Core 2 doesn't seem to have a problem.
>
> I ran quite a lot of micro benchmarks with various alignments and sizes
> the 'q' variant was not always a win. I haven't checked that particular
> version though.

Well, if you have concrete information about what the problem cases are,
then please provide it. If you don't, but have a hunch where these
potential problems may lie, then please indicate what they might be.
Otherwise, there isn't any actionable information here.

-hpa

2009-11-11 07:06:28

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

Hi All
Please use the attached memcpy.c (cc -o memcpy memcpy.c -O2) to test more cases,
if you are interested. In this program we made a simple modification
to the memcpy_new function.

Thanks
Ling


>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: November 9, 2009 16:09
>To: H. Peter Anvin
>Cc: Ma, Ling; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>
>* H. Peter Anvin <[email protected]> wrote:
>
>> On 11/08/2009 11:24 PM, Ma, Ling wrote:
>> > Hi All
>> >
>> > Today we run our benchmark on Core2 and Sandy Bridge:
>> >
>>
>> Hi Ling,
>>
>> Thanks for doing that. Do you also have access to any older CPUs? I
>> suspect that the CPUs that Andi are worried about are older CPUs like
>> P4, K8 or Pentium M/Core 1. (Andi: please do clarify if you have
>> additional information.)
>>
>> My personal opinion is that if we can show no significant slowdown on
>> P4, K8, P-M/Core 1, Core 2, and Nehalem then we can simply use this
>> code unconditionally. If one of them is radically worse than
>> baseline, then we have to do something conditional, which is a lot
>> more complicated.
>>
>> [Ingo, Thomas: do you agree?]
>
>Yeah. IIRC the worst-case were the old P2's which had a really slow,
>microcode based string ops. (Some of them even had erratums in early
>prototypes although we can certainly ignore those as string ops get
>relied on quite frequently.)
>
>IIRC the original PPro core came up with some nifty, hardwired string
>ops, but those had to be dumbed down and emulated in microcode due to
>SMP bugs - making it an inferior choice in the end.
>
>But that should be ancient history and i'd suggest we ignore the P4
>dead-end too, unless it's some really big slowdown (which i doubt). If
>anyone cares then some optional assembly implementations could be added
>back.
>
>Ling, if you are interested, could you send a user-space test-app to
>this thread that everyone could just compile and run on various older
>boxes, to gather a performance profile of hand-coded versus string ops
>performance?
>
>( And i think we can make a judgement based on cache-hot performance
> alone - if then the strings ops will perform comparatively better in
> cache-cold scenarios, so the cache-hot numbers would be a conservative
> estimate. )
>
> Ingo


Attachments:
memcpy.c (5.55 kB)

2009-11-11 07:18:41

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.


* Ma, Ling <[email protected]> wrote:

> Hi All
> Please use the memcpy.c(cc -o memcpy memcpy.c -O2) to test more cases,
> if you have interest. In this program we did simple modification
> on memcpy_new function.

FYI:

earth4:~/s> cc -o memcpy memcpy.c -O2
memcpy.c: In function 'do_one_throughput':
memcpy.c:45: error: impossible register constraint in 'asm'
memcpy.c:53: error: impossible register constraint in 'asm'
memcpy.c:47: error: impossible register constraint in 'asm'
memcpy.c:53: error: impossible register constraint in 'asm'

Ingo

2009-11-11 07:59:01

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

Hi Ingo

This program is a 64-bit version, so please use 'cc -o memcpy memcpy.c -O2 -m64'

Thanks
Ling

>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: November 11, 2009 15:19
>To: Ma, Ling
>Cc: H. Peter Anvin; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>
>* Ma, Ling <[email protected]> wrote:
>
>> Hi All
>> Please use the memcpy.c(cc -o memcpy memcpy.c -O2) to test more cases,
>> if you have interest. In this program we did simple modification
>> on memcpy_new function.
>
>FYI:
>
>earth4:~/s> cc -o memcpy memcpy.c -O2
>memcpy.c: In function 'do_one_throughput':
>memcpy.c:45: error: impossible register constraint in 'asm'
>memcpy.c:53: error: impossible register constraint in 'asm'
>memcpy.c:47: error: impossible register constraint in 'asm'
>memcpy.c:53: error: impossible register constraint in 'asm'
>
> Ingo

2009-11-11 20:39:53

by Cyrill Gorcunov

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On Wed, Nov 11, 2009 at 03:05:34PM +0800, Ma, Ling wrote:
> Hi All
> Please use the memcpy.c(cc -o memcpy memcpy.c -O2) to test more cases,
> if you have interest. In this program we did simple modification
> on memcpy_new function.
>
> Thanks
> Ling

Just my 0.2$ :)

-- Cyrill
---
memcpy_orig memcpy_new
TPT: Len 1024, alignment 8/ 0: 490 570
TPT: Len 2048, alignment 8/ 0: 826 329
TPT: Len 3072, alignment 8/ 0: 441 464
TPT: Len 4096, alignment 8/ 0: 579 596
TPT: Len 5120, alignment 8/ 0: 723 729
TPT: Len 6144, alignment 8/ 0: 859 861
TPT: Len 7168, alignment 8/ 0: 996 994
TPT: Len 8192, alignment 8/ 0: 1165 1127
TPT: Len 9216, alignment 8/ 0: 1273 1260
TPT: Len 10240, alignment 8/ 0: 1402 1395
TPT: Len 11264, alignment 8/ 0: 1543 1525
TPT: Len 12288, alignment 8/ 0: 1682 1659
TPT: Len 13312, alignment 8/ 0: 1869 1815
TPT: Len 14336, alignment 8/ 0: 1982 1951
TPT: Len 15360, alignment 8/ 0: 2185 2110
---

I've run this test a few times and the results are almost the same;
for lengths 1024, 3072, 4096, 5120, and 6144 the new version is a bit slower.

---
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 4189.60
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 4189.46
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

2009-11-11 22:42:00

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
> memcpy_orig memcpy_new
> TPT: Len 1024, alignment 8/ 0: 490 570
> TPT: Len 2048, alignment 8/ 0: 826 329
> TPT: Len 3072, alignment 8/ 0: 441 464
> TPT: Len 4096, alignment 8/ 0: 579 596
> TPT: Len 5120, alignment 8/ 0: 723 729
> TPT: Len 6144, alignment 8/ 0: 859 861
> TPT: Len 7168, alignment 8/ 0: 996 994
> TPT: Len 8192, alignment 8/ 0: 1165 1127
> TPT: Len 9216, alignment 8/ 0: 1273 1260
> TPT: Len 10240, alignment 8/ 0: 1402 1395
> TPT: Len 11264, alignment 8/ 0: 1543 1525
> TPT: Len 12288, alignment 8/ 0: 1682 1659
> TPT: Len 13312, alignment 8/ 0: 1869 1815
> TPT: Len 14336, alignment 8/ 0: 1982 1951
> TPT: Len 15360, alignment 8/ 0: 2185 2110
>
> I've run this test a few times and results almost the same,
> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
>

Was the result for 2048 consistent (it seems odd in the extreme)... the
discrepancy between this result and Ling's results bothers me; perhaps
the right answer is to leave the current code for Core2 and use new code
(with a lower than 1024 threshold?) for NHM and K8?

-hpa

2009-11-11 23:21:36

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/10/2009 11:57 PM, Ma, Ling wrote:
> Hi Ingo
>
> This program is for 64bit version, so please use 'cc -o memcpy memcpy.c -O2 -m64'
>

I did some measurements with this program; I added power-of-two
measurements from 1-512 bytes, plus some different alignments, and found
some very interesting results:

Nehalem:
memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
bytes, where the old code apparently performs appallingly bad.

memcpy_new loses in the 64-512 byte range, so the 1024
threshold is probably justified.

Core2:
memcpy_new is a win for <= 512 bytes, but a loss for larger
copies (possibly a win again for 16K+ copies, but those are
very rare in the Linux kernel.) Surprise...

However, the difference is very small.

However, I had overlooked something much more fundamental about your
patch. On Nehalem, at least *it will never get executed* (except during
very early startup), because we replace the memcpy code with a jmp to
memcpy_c on any CPU which has X86_FEATURE_REP_GOOD, which includes Nehalem.

So the patch is a no-op on Nehalem, and any other modern CPU.
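
(For readers unfamiliar with that mechanism: a rough user-space
analogue of the boot-time replacement is sketched below. This is a
concept sketch only - the kernel's alternatives code rewrites the
instructions in place instead of indirecting through a pointer, and
the flag argument stands in for the real X86_FEATURE_REP_GOOD test.)

#include <stddef.h>
#include <string.h>

/* stand-in bodies; in the kernel these are the unrolled-loop copy and
 * the REP MOVSQ copy (memcpy_c) in arch/x86/lib/memcpy_64.S */
static void *memcpy_unrolled(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

static void *memcpy_string(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

static void *(*resolved_memcpy)(void *, const void *, size_t) =
	memcpy_unrolled;

/* the "boot-time" step: run once, before resolved_memcpy is ever used */
static void memcpy_select(int cpu_has_rep_good)
{
	if (cpu_has_rep_good)
		resolved_memcpy = memcpy_string;
}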

Am I right in guessing that the perf numbers you posted originally were
all from your user-space test program?

-hpa

2009-11-12 02:13:21

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

>-----Original Message-----
>From: H. Peter Anvin [mailto:[email protected]]
>Sent: November 12, 2009 7:21
>To: Ma, Ling
>Cc: Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/10/2009 11:57 PM, Ma, Ling wrote:
>> Hi Ingo
>>
>> This program is for 64bit version, so please use 'cc -o memcpy memcpy.c -O2
>-m64'
>>
>
>I did some measurements with this program; I added power-of-two
>measurements from 1-512 bytes, plus some different alignments, and found
>some very interesting results:
>
>Nehalem:
> memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
> bytes, where the old code apparently performs appallingly bad.
>
> memcpy_new loses in the 64-512 byte range, so the 1024
> threshold is probably justified.
>
>Core2:
> memcpy_new is a win for <= 512 bytes, but a lose for larger
> copies (possibly a win again for 16K+ copies, but those are
> very rare in the Linux kernel.) Surprise...
>
> However, the difference is very small.
>
>However, I had overlooked something much more fundamental about your
>patch. On Nehalem, at least *it will never get executed* (except during
>very early startup), because we replace the memcpy code with a jmp to
>memcpy_c on any CPU which has X86_FEATURE_REP_GOOD, which includes Nehalem.
>
>So the patch is a no-op on Nehalem, and any other modern CPU.

[Ma Ling]
That is good for modern CPUs; our original intention was also to introduce MOVSQ for Nehalem, and the above method is smarter.

>Am I guessing that the perf numbers you posted originally were all from
>your user space test program?

[Ma Ling]
Yes, they are all from this program, and I'm confused about why the measured values differ for only one case after multiple tests
(3 times at least on my Core2 platform).

Thanks
Ling

2009-11-12 04:28:00

by Cyrill Gorcunov

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On Thu, Nov 12, 2009 at 1:39 AM, H. Peter Anvin <[email protected]> wrote:
> On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
>>                                              memcpy_orig     memcpy_new
>> TPT: Len 1024, alignment  8/ 0:              490             570
>> TPT: Len 2048, alignment  8/ 0:              826             329
>> TPT: Len 3072, alignment  8/ 0:              441             464
>> TPT: Len 4096, alignment  8/ 0:              579             596
>> TPT: Len 5120, alignment  8/ 0:              723             729
>> TPT: Len 6144, alignment  8/ 0:              859             861
>> TPT: Len 7168, alignment  8/ 0:              996             994
>> TPT: Len 8192, alignment  8/ 0:              1165            1127
>> TPT: Len 9216, alignment  8/ 0:              1273            1260
>> TPT: Len 10240, alignment  8/ 0:             1402            1395
>> TPT: Len 11264, alignment  8/ 0:             1543            1525
>> TPT: Len 12288, alignment  8/ 0:             1682            1659
>> TPT: Len 13312, alignment  8/ 0:             1869            1815
>> TPT: Len 14336, alignment  8/ 0:             1982            1951
>> TPT: Len 15360, alignment  8/ 0:             2185            2110
>>
>> I've run this test a few times and results almost the same,
>> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
>>
>
> Was the result for 2048 consistent (it seems odd in the extreme)... the
> discrepancy between this result and Ling's results bothers me; perhaps
> the right answer is to leave the current code for Core2 and use new code
> (with a lower than 1024 threshold?) for NHM and K8?
>
>        -hpa
>

Hi Peter,

no, the results for 2048 are not repeatable (that is why I didn't mention this number
in the earlier report).

Test1:
TPT: Len 2048, alignment 8/ 0: 826 329
Test2:
TPT: Len 2048, alignment 8/ 0: 359 329
Test3:
TPT: Len 2048, alignment 8/ 0: 306 331
Test4:
TPT: Len 2048, alignment 8/ 0: 415 329

I guess this was due to the CPU frequency changing from 800 MHz to 2.1 GHz,
since I ran the tests manually rather than using any kind of bash loop
to run the test program.

2009-11-12 04:49:57

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

Hi All
The attachment is the latest memcpy.c; please rebuild with
"cc -o memcpy memcpy.c -O2 -m64".

Thanks
Ling


>-----Original Message-----
>From: Cyrill Gorcunov [mailto:[email protected]]
>Sent: November 12, 2009 12:28
>To: H. Peter Anvin
>Cc: Ma, Ling; Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On Thu, Nov 12, 2009 at 1:39 AM, H. Peter Anvin <[email protected]> wrote:
>> On 11/11/2009 12:34 PM, Cyrill Gorcunov wrote:
>>>                                               memcpy_orig     memcpy_new
>>> TPT: Len 1024, alignment  8/ 0:               490             570
>>> TPT: Len 2048, alignment  8/ 0:               826             329
>>> TPT: Len 3072, alignment  8/ 0:               441             464
>>> TPT: Len 4096, alignment  8/ 0:               579             596
>>> TPT: Len 5120, alignment  8/ 0:               723             729
>>> TPT: Len 6144, alignment  8/ 0:               859             861
>>> TPT: Len 7168, alignment  8/ 0:               996             994
>>> TPT: Len 8192, alignment  8/ 0:               1165            1127
>>> TPT: Len 9216, alignment  8/ 0:               1273            1260
>>> TPT: Len 10240, alignment  8/ 0:      1402            1395
>>> TPT: Len 11264, alignment  8/ 0:      1543            1525
>>> TPT: Len 12288, alignment  8/ 0:      1682            1659
>>> TPT: Len 13312, alignment  8/ 0:      1869            1815
>>> TPT: Len 14336, alignment  8/ 0:      1982            1951
>>> TPT: Len 15360, alignment  8/ 0:      2185            2110
>>>
>>> I've run this test a few times and results almost the same,
>>> with alignment 1024, 3072, 4096, 5120, 6144, new version a bit slowly.
>>>
>>
>> Was the result for 2048 consistent (it seems odd in the extreme)... the
>> discrepancy between this result and Ling's results bothers me; perhaps
>> the right answer is to leave the current code for Core2 and use new code
>> (with a lower than 1024 threshold?) for NHM and K8?
>>
>>        -hpa
>>
>
>Hi Peter,
>
>no, results for 2048 is not repeatable (that is why I didn't mention this number
>in a former report).
>
>Test1:
>TPT: Len 2048, alignment 8/ 0: 826 329
>Test2:
>TPT: Len 2048, alignment 8/ 0: 359 329
>Test3:
>TPT: Len 2048, alignment 8/ 0: 306 331
>Test4:
>TPT: Len 2048, alignment 8/ 0: 415 329
>
>I guess this was due to cpu frequency change from 800 to 2.1Ghz since
>I did tests manually
>not using any kind of bash cycle to run the test program.


Attachments:
memcpy.c (5.37 kB)

2009-11-12 05:26:49

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/11/2009 08:49 PM, Ma, Ling wrote:
> Hi All
> The attachment is latest memcpy.c, please update by
> "cc -o memcpy memcpy.c -O2 -m64".

OK... given that there seems to be no point since the actual code we're
talking about modifying doesn't ever actually get executed in the real
kernel, we can just drop this, right?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-12 07:43:15

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

Hi H. Peter Anvin

After running the test program from my attached memcpy.c on the Nehalem platform:
when the copy size is less than 1024, the memcpy_c function has a very big
regression compared with the original memcpy function. I think we have to combine
the original memcpy and memcpy_c for Nehalem and other modern CPUs, so memcpy_new
is on the right track.

Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:[email protected]]
>Sent: November 12, 2009 13:27
>To: Ma, Ling
>Cc: Cyrill Gorcunov; Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/11/2009 08:49 PM, Ma, Ling wrote:
>> Hi All
>> The attachment is latest memcpy.c, please update by
>> "cc -o memcpy memcpy.c -O2 -m64".
>
>OK... given that there seems to be no point since the actual code we're
>talking about modifying doesn't ever actually get executed on the real
>kernel, we can just drop this, right?
>
> -hpa
>
>--
>H. Peter Anvin, Intel Open Source Technology Center
>I work for Intel. I don't speak on their behalf.


Attachments:
memcpy.c (5.99 kB)

2009-11-12 09:54:09

by Cyrill Gorcunov

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On Thu, Nov 12, 2009 at 7:49 AM, Ma, Ling <[email protected]> wrote:
> Hi All
> The attachment is latest memcpy.c, please update by
> "cc -o memcpy memcpy.c -O2 -m64".
>
> Thanks
> Ling
>
>

Here it goes

memcpy_orig memcpy_new memcpy_c
TPT: Len 0, alignment 0/ 0: 34482 31920 123564
TPT: Len 1, alignment 0/ 0: 31815 31710 123564
TPT: Len 2, alignment 0/ 0: 39606 31773 123522
TPT: Len 3, alignment 0/ 0: 175329 37212 123522
TPT: Len 4, alignment 0/ 0: 55440 42357 297129
TPT: Len 5, alignment 0/ 0: 63294 47607 296898
TPT: Len 6, alignment 0/ 0: 71148 52794 296856
TPT: Len 7, alignment 0/ 0: 79023 58044 296877
TPT: Len 8, alignment 0/ 0: 32403 32424 123564
TPT: Len 9, alignment 0/ 0: 31752 31815 123522
TPT: Len 10, alignment 0/ 0: 34482 34545 123522
TPT: Len 11, alignment 0/ 0: 42294 39732 123522
TPT: Len 12, alignment 0/ 0: 50211 42378 296856
TPT: Len 13, alignment 0/ 0: 58107 48279 329007
TPT: Len 14, alignment 0/ 0: 65898 53781 296877
TPT: Len 15, alignment 0/ 0: 73773 58065 296877
TPT: Len 16, alignment 0/ 0: 34482 37107 123522
TPT: Len 17, alignment 0/ 0: 31836 31815 123543
TPT: Len 18, alignment 0/ 0: 39627 37044 123522
TPT: Len 19, alignment 0/ 0: 47565 42294 123522
TPT: Len 20, alignment 0/ 0: 55566 47754 296898
TPT: Len 21, alignment 0/ 0: 63273 52773 296877
TPT: Len 22, alignment 0/ 0: 71148 58149 296856
TPT: Len 23, alignment 0/ 0: 79086 63273 296856
TPT: Len 24, alignment 0/ 0: 39816 45024 123522
TPT: Len 25, alignment 0/ 0: 37086 39753 123522
TPT: Len 26, alignment 0/ 0: 44877 44919 123522
TPT: Len 27, alignment 0/ 0: 52773 50253 123522
TPT: Len 28, alignment 0/ 0: 60690 55545 296898
TPT: Len 29, alignment 0/ 0: 68544 60690 296877
TPT: Len 30, alignment 0/ 0: 76398 65961 296877
TPT: Len 31, alignment 0/ 0: 84273 71211 296856
TPT: Len 32, alignment 0/ 0: 45045 52899 123522
TPT: Len 33, alignment 0/ 0: 42315 47628 123522
TPT: Len 34, alignment 0/ 0: 50127 52773 123522
TPT: Len 35, alignment 0/ 0: 58044 58107 123522
TPT: Len 36, alignment 0/ 0: 129612 63462 297129
TPT: Len 37, alignment 0/ 0: 257607 68733 902034
TPT: Len 38, alignment 0/ 0: 81879 73857 296919
TPT: Len 39, alignment 0/ 0: 89460 79023 296856
TPT: Len 40, alignment 0/ 0: 50253 60753 123543
TPT: Len 41, alignment 0/ 0: 47607 55545 123564
TPT: Len 42, alignment 0/ 0: 55356 60627 123522
TPT: Len 43, alignment 0/ 0: 63357 822843 123585
TPT: Len 44, alignment 0/ 0: 71337 71169 297087
TPT: Len 45, alignment 0/ 0: 79023 353388 297129
TPT: Len 46, alignment 0/ 0: 87024 81690 296856
TPT: Len 47, alignment 0/ 0: 94689 86940 296877
TPT: Len 48, alignment 0/ 0: 55482 68523 123522
TPT: Len 49, alignment 0/ 0: 52857 63336 123564
TPT: Len 50, alignment 0/ 0: 60690 68607 123522
TPT: Len 51, alignment 0/ 0: 68502 73731 123522
TPT: Len 52, alignment 0/ 0: 76419 79086 296856
TPT: Len 53, alignment 0/ 0: 84336 126147 296877
TPT: Len 54, alignment 0/ 0: 92190 89607 296877
TPT: Len 55, alignment 0/ 0: 100023 94668 296856
TPT: Len 56, alignment 0/ 0: 60690 76440 123522
TPT: Len 57, alignment 0/ 0: 58065 71211 123522
TPT: Len 58, alignment 0/ 0: 65877 76356 123522
TPT: Len 59, alignment 0/ 0: 73773 81606 196224
TPT: Len 60, alignment 0/ 0: 81732 86961 297129
TPT: Len 61, alignment 0/ 0: 89523 136689 296877
TPT: Len 62, alignment 0/ 0: 97377 97440 296877
TPT: Len 63, alignment 0/ 0: 105210 102564 296877
TPT: Len 1023, alignment 0/ 0: 457569 457107 719502
TPT: Len 1024, alignment 0/ 0: 422856 542535 575526
TPT: Len 2048, alignment 0/ 0: 819651 8217489 982779

2009-11-12 23:03:51

by Pavel Machek

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On Mon 2009-11-09 15:24:03, Ma, Ling wrote:
> Hi All
>
> Today we run our benchmark on Core2 and Sandy Bridge:
>
> 1. Retrieve result on Core2
> Speedup on Core2
> Len Alignement Speedup
> 1024, 0/ 0: 0.95x
> 2048, 0/ 0: 1.03x

Well, so you are running cache hot and it is only a win on huge
copies... how common are those?

> Application run through perf
> For (i= 1024; i < 1024 * 16; i = i + 64)
> do_memcpy(0, 0, i);

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-11-12 23:04:13

by Pavel Machek

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.


> Ling, if you are interested, could you send a user-space test-app to
> this thread that everyone could just compile and run on various older
> boxes, to gather a performance profile of hand-coded versus string ops
> performance?
>
> ( And i think we can make a judgement based on cache-hot performance
> alone - if then the strings ops will perform comparatively better in
> cache-cold scenarios, so the cache-hot numbers would be a conservative
> estimate. )

Ugh, really? I'd expect cache-cold performance to be not helped at all
(memory bandwidth limit) and you'll get a slowdown from additional
i-cache misses...
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-11-13 05:34:18

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

>Well, so you are running cache hot and it is only a win on huge
>copies... how common are those?
>
Hi Pavel Machek,
Yes, we intend to introduce MOVSQ for huge hot sizes (over 1024 bytes)
and to avoid a regression below 1024 bytes. I guess you are suggesting
prefetch instructions for cold data (if I am wrong please correct me).
memcpy doesn't know whether the data is already in cache or not, so
prefetch only gets a benefit when the copy size is over (L1 cache size)/2
and under (last level cache size)/2. Currently the L1 cache size of most
CPUs is around 32KB, so prefetch is useful when the copy size is over
16KB, but as H. Peter Anvin mentioned in his last email, copies over
16KB are rare in the kernel.

Thanks
Ling

2009-11-13 06:04:27

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/12/2009 09:33 PM, Ma, Ling wrote:
>> Well, so you are running cache hot and it is only a win on huge
>> copies... how common are those?
>>
> Hi Pavel Machek
> Yes, we intend to introduce movsq for huge hot size(over 1024bytes)
> and avoid regression for less 1024bytes. I guess you suggest using
> prefetch instruction for cold data (if I was wrong please correct me).
> memcpy don't know whether data has been in cache or not,
> so only when copy size is over (first level 1 cache)/2 and lower
> (last level cache)/2 , prefetch will get benefit. Currently first
> level cache size of most cpus is around 32KB, so it is useful for prefetch
> when copy size is over 16KB, but as H. Peter Anvin mentioned in last email,
> over 16KB copy in kernel is rare.
>

What it sounds like to me is that for Nehalem, we want to use memcpy_c
for >= 1024 bytes and the old code for < 1024 bytes; for Core2 it might
be the exact opposite.

Either way, whatever we do should use the appropriate static replacement
mechanism.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-13 07:23:39

by Ma Ling

Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

Hi H. Peter Anvin
>What it sounds to me is that for Nehalem, we want to use memcpy_c for >=
>1024 bytes and the old code for < 1024 bytes;

Yes, so would modifying memcpy_c to work like memcpy_new for Nehalem,
and keeping the old code for Core2, be acceptable?

Thanks
Ling

2009-11-13 07:30:23

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/12/2009 11:23 PM, Ma, Ling wrote:
> Hi H. Peter Anvin
>> What it sounds to me is that for Nehalem, we want to use memcpy_c for >=
>> 1024 bytes and the old code for < 1024 bytes;
>
> Yes, so we modify memcpy_c as memcpy_new for Nehalem, and keep old
> code for Core2 is acceptable?

No, what I think we should do is to rename the old memcpy to something
like memcpy_o, and then have the actual memcpy routine look like:

cmpq $1024, %rcx
ja memcpy_c
jmp memcpy_o

... where the constant as well as the ja opcode can be patched by the
alternatives mechanism (to a jb if needed).

memcpy is *definitely* frequent enough that static patching is justified.
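
(In C terms the proposed front end would behave like the sketch below.
The placeholder bodies are mine; in the kernel, memcpy_o would be the
existing unrolled-loop code, memcpy_c the REP MOVSQ code, and the
dispatch itself just the three patchable instructions above rather than
a real C function:)

#include <stddef.h>
#include <string.h>

/* placeholder bodies, standing in for the two assembly implementations */
static void *memcpy_o(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

static void *memcpy_c(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

/* the alternatives mechanism would patch the 1024 constant and flip
 * the sense of the comparison per CPU model */
static void *memcpy_dispatch(void *d, const void *s, size_t n)
{
	if (n > 1024)
		return memcpy_c(d, s, n);
	return memcpy_o(d, s, n);
}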

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-13 07:34:17

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.


* Pavel Machek <[email protected]> wrote:

> > Ling, if you are interested, could you send a user-space test-app to
> > this thread that everyone could just compile and run on various older
> > boxes, to gather a performance profile of hand-coded versus string ops
> > performance?
> >
> > ( And i think we can make a judgement based on cache-hot performance
> > alone - if then the strings ops will perform comparatively better in
> > cache-cold scenarios, so the cache-hot numbers would be a conservative
> > estimate. )
>
> Ugh, really? I'd expect cache-cold performance to be not helped at all
> (memory bandwidth limit) and you'll get slow down from additional
> i-cache misses...

That's my point - the new code is shorter, which will run comparatively
faster in a cache-cold environment.

Ingo

2009-11-13 08:05:06

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.

On 11/12/2009 11:33 PM, Ingo Molnar wrote:
>
> * Pavel Machek <[email protected]> wrote:
>
>>> Ling, if you are interested, could you send a user-space test-app to
>>> this thread that everyone could just compile and run on various older
>>> boxes, to gather a performance profile of hand-coded versus string ops
>>> performance?
>>>
>>> ( And i think we can make a judgement based on cache-hot performance
>>> alone - if then the strings ops will perform comparatively better in
>>> cache-cold scenarios, so the cache-hot numbers would be a conservative
>>> estimate. )
>>
>> Ugh, really? I'd expect cache-cold performance to be not helped at all
>> (memory bandwidth limit) and you'll get slow down from additional
>> i-cache misses...
>
> That's my point - the new code is shorter, which will run comparatively
> faster in a cache-cold environment.
>

memcpy_c by itself is by far the shortest variant, of course.

The question is if it makes sense to use the long variants for short (<
1024 bytes) copies.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-11-13 08:10:48

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast string.


* H. Peter Anvin <[email protected]> wrote:

> On 11/12/2009 11:33 PM, Ingo Molnar wrote:
> >
> > * Pavel Machek <[email protected]> wrote:
> >
> >>> Ling, if you are interested, could you send a user-space test-app to
> >>> this thread that everyone could just compile and run on various older
> >>> boxes, to gather a performance profile of hand-coded versus string ops
> >>> performance?
> >>>
> >>> ( And i think we can make a judgement based on cache-hot performance
> >>> alone - if then the strings ops will perform comparatively better in
> >>> cache-cold scenarios, so the cache-hot numbers would be a conservative
> >>> estimate. )
> >>
> >> Ugh, really? I'd expect cache-cold performance to be not helped at all
> >> (memory bandwidth limit) and you'll get slow down from additional
> >> i-cache misses...
> >
> > That's my point - the new code is shorter, which will run comparatively
> > faster in a cache-cold environment.
> >
>
> memcpy_c by itself is by far the shortest variant, of course.

yep. The argument i made was for when a long function was compared to a
short one. As you noted, we don't actually enable the long function all
that often - which inverts the same argument.

> The question is if it makes sense to use the long variants for short
> (< 1024 bytes) copies.

I'd say not - the kernel executes in an icache-cold environment most of
the time (as user-space is far more cache-intense in the majority of
workloads and kernel processing starts with a cold icache), so
optimizing the kernel for code size is very important. (But numbers from
real workloads can convince me of the opposite.)

Ingo