2011-06-17 15:37:20

by tip-bot for Ma Ling

Subject: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

From: Ma Ling <[email protected]>

A program's temporal and spatial locality is what makes caches effective at
bridging the processor-memory performance gap, and hardware prefetch is very
important for improving performance by reducing cache misses. Modern CPU
micro-architectures mainly support two kinds of prefetch mechanism in the L1
data cache:

a. Data cache unit (DCU) prefetcher. Spatial locality suggests fetching
adjacent data while handling the current data. A larger cache line size is
one option, but it would evict more cached data and increase load latency,
so the hardware simply prefetches the next line when the current data is
accessed. This mode only prefetches data at ascending addresses.

b. Instruction pointer (IP)-based strided prefetcher. Based on the address
of a load/store instruction, this mechanism predicts and prefetches data
with an adaptive stride, at both ascending and descending addresses.

DCU mode works well when the time a program spends operating on the data is
longer than the time to prefetch the next line. The copy-page function breaks
that assumption, so DCU mode is hardly helpful, especially since we issue
software prefetches and the data may already be in cache; the extra bus
traffic then hurts performance seriously.
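
As an illustration (a C sketch only, not the kernel code; the function
names here are made up), the copy direction determines whether the
next-line prefetcher fires:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/*
 * Forward copy: an ascending stream. Every cache line we touch can
 * trigger the DCU prefetcher to fetch the following line as well --
 * redundant bus traffic when software prefetch already covers the
 * stream or the data is already in cache.
 */
static void copy_forward(uint64_t *dst, const uint64_t *src)
{
	size_t i;

	for (i = 0; i < PAGE_SIZE / 8; i++)
		dst[i] = src[i];
}

/*
 * Backward copy: a descending stream that the next-line prefetcher
 * does not follow, so only explicit software prefetches (and the
 * IP-based strided prefetcher, which handles descending strides)
 * issue memory requests.
 */
static void copy_backward(uint64_t *dst, const uint64_t *src)
{
	size_t i;

	for (i = PAGE_SIZE / 8; i-- > 0; )
		dst[i] = src[i];
}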

In this patch we introduce backward copy, which avoids the HW prefetch
(DCU prefetcher) impact and simplifies the original code.
Performance improves by about 15% on Core2 and 36% on SNB respectively.
(We used our own micro-benchmark, and will do further testing according to your requirements.)
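
The benchmark itself is not posted here; a minimal hot-cache harness in
this spirit (a sketch only, with memcpy standing in for the routine under
test) would measure cycles per page with the TSC:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>		/* __rdtsc() */

#define PAGE_SIZE 4096
#define ITERS 100000

/* stand-in for the copy routine under test */
static void copy_page_test(void *to, const void *from)
{
	memcpy(to, from, PAGE_SIZE);
}

int main(void)
{
	void *src = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
	void *dst = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
	unsigned long long start, cycles;
	int i;

	memset(src, 0x5a, PAGE_SIZE);
	copy_page_test(dst, src);	/* warm up caches and TLB */

	start = __rdtsc();
	for (i = 0; i < ITERS; i++)
		copy_page_test(dst, src);
	cycles = __rdtsc() - start;

	printf("%.1f cycles/page\n", (double)cycles / ITERS);
	free(src);
	free(dst);
	return 0;
}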

Thanks
Ling

---
arch/x86/lib/copy_page_64.S | 124 +++++++++++++++++++-----------------------
1 files changed, 56 insertions(+), 68 deletions(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 6fec2d1..3d17280 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -1,4 +1,5 @@
/* Written 2003 by Andi Kleen, based on a kernel by Evandro Menezes */
+/* Updated 2011 by Ma Ling to introduce backward copy */

#include <linux/linkage.h>
#include <asm/dwarf2.h>
@@ -17,83 +18,70 @@ ENDPROC(copy_page_c)

/* Could vary the prefetch distance based on SMP/UP */

+/*
+ * By backward copy we manage to reduce impact from HW prefetch
+ * when data is in L1 cache, and get benefit when data is not in L1 cache.
+ */
ENTRY(copy_page)
CFI_STARTPROC
- subq $3*8,%rsp
- CFI_ADJUST_CFA_OFFSET 3*8
- movq %rbx,(%rsp)
- CFI_REL_OFFSET rbx, 0
- movq %r12,1*8(%rsp)
- CFI_REL_OFFSET r12, 1*8
- movq %r13,2*8(%rsp)
- CFI_REL_OFFSET r13, 2*8
-
- movl $(4096/64)-5,%ecx
- .p2align 4
+ lea 4096(%rsi), %rsi
+ lea 4096(%rdi), %rdi
+ mov $(4096/64)-5, %cl
+ mov $5, %dl
+ /*
+ * The nop forces the following instruction to be 16-byte aligned.
+ */
+ nop
.Loop64:
- dec %rcx
-
- movq (%rsi), %rax
- movq 8 (%rsi), %rbx
- movq 16 (%rsi), %rdx
- movq 24 (%rsi), %r8
- movq 32 (%rsi), %r9
- movq 40 (%rsi), %r10
- movq 48 (%rsi), %r11
- movq 56 (%rsi), %r12
-
- prefetcht0 5*64(%rsi)
-
- movq %rax, (%rdi)
- movq %rbx, 8 (%rdi)
- movq %rdx, 16 (%rdi)
- movq %r8, 24 (%rdi)
- movq %r9, 32 (%rdi)
- movq %r10, 40 (%rdi)
- movq %r11, 48 (%rdi)
- movq %r12, 56 (%rdi)
-
- leaq 64 (%rsi), %rsi
- leaq 64 (%rdi), %rdi
+ prefetchnta -5*64(%rsi)
+ dec %cl
+
+ movq -0x8*1(%rsi), %rax
+ movq -0x8*2(%rsi), %r8
+ movq -0x8*3(%rsi), %r9
+ movq -0x8*4(%rsi), %r10
+ movq %rax, -0x8*1(%rdi)
+ movq %r8, -0x8*2(%rdi)
+ movq %r9, -0x8*3(%rdi)
+ movq %r10, -0x8*4(%rdi)
+
+ movq -0x8*5(%rsi), %rax
+ movq -0x8*6(%rsi), %r8
+ movq -0x8*7(%rsi), %r9
+ movq -0x8*8(%rsi), %r10
+ leaq -64(%rsi), %rsi
+ movq %rax, -0x8*5(%rdi)
+ movq %r8, -0x8*6(%rdi)
+ movq %r9, -0x8*7(%rdi)
+ movq %r10, -0x8*8(%rdi)
+ leaq -64(%rdi), %rdi

jnz .Loop64

- movl $5,%ecx
- .p2align 4
.Loop2:
- decl %ecx
-
- movq (%rsi), %rax
- movq 8 (%rsi), %rbx
- movq 16 (%rsi), %rdx
- movq 24 (%rsi), %r8
- movq 32 (%rsi), %r9
- movq 40 (%rsi), %r10
- movq 48 (%rsi), %r11
- movq 56 (%rsi), %r12
-
- movq %rax, (%rdi)
- movq %rbx, 8 (%rdi)
- movq %rdx, 16 (%rdi)
- movq %r8, 24 (%rdi)
- movq %r9, 32 (%rdi)
- movq %r10, 40 (%rdi)
- movq %r11, 48 (%rdi)
- movq %r12, 56 (%rdi)
-
- leaq 64(%rdi),%rdi
- leaq 64(%rsi),%rsi
-
+ dec %dl
+
+ movq -0x8*1(%rsi), %rax
+ movq -0x8*2(%rsi), %r8
+ movq -0x8*3(%rsi), %r9
+ movq -0x8*4(%rsi), %r10
+ movq %rax, -0x8*1(%rdi)
+ movq %r8, -0x8*2(%rdi)
+ movq %r9, -0x8*3(%rdi)
+ movq %r10, -0x8*4(%rdi)
+
+ movq -0x8*5(%rsi), %rax
+ movq -0x8*6(%rsi), %r8
+ movq -0x8*7(%rsi), %r9
+ movq -0x8*8(%rsi), %r10
+ leaq -64(%rsi), %rsi
+ movq %rax, -0x8*5(%rdi)
+ movq %r8, -0x8*6(%rdi)
+ movq %r9, -0x8*7(%rdi)
+ movq %r10, -0x8*8(%rdi)
+ leaq -64(%rdi), %rdi
jnz .Loop2

- movq (%rsp),%rbx
- CFI_RESTORE rbx
- movq 1*8(%rsp),%r12
- CFI_RESTORE r12
- movq 2*8(%rsp),%r13
- CFI_RESTORE r13
- addq $3*8,%rsp
- CFI_ADJUST_CFA_OFFSET -3*8
ret
.Lcopy_page_end:
CFI_ENDPROC
--
1.6.5.2


2011-06-22 20:07:30

by Andi Kleen

Subject: Re: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

[email protected] writes:
> (DCU prefetcher) impact and simplifies the original code.
> Performance improves by about 15% on Core2 and 36% on SNB respectively.
> (We used our own micro-benchmark, and will do further testing according to your requirements.)

This doesn't make a lot of sense because neither Core-2 nor SNB use the
code path you patched. They both use the rep ; movs path.

-Andi

--
[email protected] -- Speaking for myself only

2011-06-23 01:02:05

by tip-bot for Ma Ling

Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

Yes, I have also tested 64-bit Atom; it got an 11.6% improvement.
Because older CPUs almost all use the prefetch-next-line mechanism, the patch should be useful to them.

Thanks
Ling
> -----Original Message-----
> From: Ma, Ling
> Sent: Monday, June 20, 2011 11:43 AM
> To: Ma, Ling; [email protected]
> Cc: [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH RFC V2] [x86] Optimize copy-page by reducing impact
> from HW prefetch
>
> A new experiment shows, for 4096 bytes: no improvement on SNB,
> 10~15% improvement on Core2, and 11.6% improvement on 64-bit Atom.
>
> Thanks
> Ling

> -----Original Message-----
> From: Andi Kleen [mailto:[email protected]]
> Sent: Thursday, June 23, 2011 4:06 AM
> To: Ma, Ling
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> from HW prefetch
>
> [email protected] writes:
> > (DCU prefetcher) impact and simplifies the original code.
> > Performance improves by about 15% on Core2 and 36% on SNB
> > respectively.
> > (We used our own micro-benchmark, and will do further testing
> > according to your requirements.)
>
> This doesn't make a lot of sense because neither Core-2 nor SNB use the
> code path you patched. They both use the rep ; movs path.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only

2011-06-23 02:29:16

by Andi Kleen

Subject: Re: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

On Thu, Jun 23, 2011 at 09:01:19AM +0800, Ma, Ling wrote:
> Yes, I have also tested 64-bit Atom; it got an 11.6% improvement.

That's a nice improvement; however, I should add that in my experience
copy_page micro-benchmark improvements do not necessarily translate into
real-world improvements. Most simple micro-benchmarks do not simulate
the typical page-fault access pattern very well.

> Because older CPUs almost all use the prefetch-next-line mechanism, the patch should be useful to them.

Old in this case means P4 and ancient early-stepping K8 only.

-Andi

2011-06-23 07:04:59

by Ingo Molnar

Subject: Re: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch


* Andi Kleen <[email protected]> wrote:

> [email protected] writes:
>
> > (DCU prefetcher) impact and simplifies the original code.
> > Performance improves by about 15% on Core2 and 36% on SNB
> > respectively. (We used our own micro-benchmark, and will do further
> > testing according to your requirements.)
>
> This doesn't make a lot of sense because neither Core-2 nor SNB use
> the code path you patched. They both use the rep ; movs path.

Ling, mind double checking which one is the faster/better one on SNB,
in cold-cache and hot-cache situations, copy_page or copy_page_c?

Also, while looking at this file please fix the countless pieces of
style excrements it has before modifying it:

- non-Linux comment style (and needless two comments - it can
be in one comment block):

/* Don't use streaming store because it's better when the target
ends up in cache. */

/* Could vary the prefetch distance based on SMP/UP */

- (there's other non-standard comment blocks in this file as well)

- The copy_page/copy_page_c naming is needlessly obfuscated, it
should be copy_page, copy_page_norep or so - the _c postfix has no
obvious meaning.

- all #include's should be at the top

- please standardize it on the 'instrn %x, %y' pattern that we
generally use in arch/x86/, not 'instrn %x,%y' pattern.

and do this cleanup patch first and the speedup on top of it, and
keep the two in two separate patches so that the modification to the
assembly code can be reviewed more easily.

Thanks,

Ingo

2011-06-24 02:02:36

by tip-bot for Ma Ling

Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

Sure, I will separate this into two patches ASAP: one for the performance-tuning code after some experiments,
and another for the code-style cleanup.

Thanks
Ling

> -----Original Message-----
> From: Ingo Molnar [mailto:[email protected]]
> Sent: Thursday, June 23, 2011 3:05 PM
> To: Andi Kleen
> Cc: Ma, Ling; [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> from HW prefetch
>
>
> * Andi Kleen <[email protected]> wrote:
>
> > [email protected] writes:
> >
> > > (DCU prefetcher) impact and simplifies the original code.
> > > Performance improves by about 15% on Core2 and 36% on SNB
> > > respectively. (We used our own micro-benchmark, and will do further
> > > testing according to your requirements.)
> >
> > This doesn't make a lot of sense because neither Core-2 nor SNB use
> > the code path you patched. They both use the rep ; movs path.
>
> Ling, mind double checking which one is the faster/better one on SNB,
> in cold-cache and hot-cache situations, copy_page or copy_page_c?
>
> Also, while looking at this file please fix the countless pieces of
> style excrements it has before modifying it:
>
> - non-Linux comment style (and needless two comments - it can
> be in one comment block):
>
> /* Don't use streaming store because it's better when the target
> ends up in cache. */
>
> /* Could vary the prefetch distance based on SMP/UP */
>
> - (there's other non-standard comment blocks in this file as well)
>
> - The copy_page/copy_page_c naming is needlessly obfuscated, it
> should be copy_page, copy_page_norep or so - the _c postfix has no
> obvious meaning.
>
> - all #include's should be at the top
>
> - please standardize it on the 'instrn %x, %y' pattern that we
> generally use in arch/x86/, not 'instrn %x,%y' pattern.
>
> and do this cleanup patch first and the speedup on top of it, and
> keep the two in two separate patches so that the modification to the
> assembly code can be reviewed more easily.
>
> Thanks,
>
> Ingo

2011-06-24 02:09:52

by tip-bot for Ma Ling

Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

Yes, the clean-up patch comes first.

> -----Original Message-----
> From: Ma, Ling
> Sent: Friday, June 24, 2011 10:01 AM
> To: 'Ingo Molnar'; Andi Kleen
> Cc: [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> from HW prefetch
>
> Sure, I separate two patches ASAP, one is for performance tuning code
> after some experiments,
> another code style patch.
>
> Thanks
> Ling
>
> > [snip]

2011-06-28 15:27:22

by tip-bot for Ma Ling

Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

Hi Ingo

> Ling, mind double checking which one is the faster/better one on SNB,
> in cold-cache and hot-cache situations, copy_page or copy_page_c?
copy_page_c.

In the hot-cache case, copy_page_c on SNB combines data into 128-bit writes (the processor's limit is 128 bits/cycle for writes) after the startup latency, so it is faster than copy_page, which provides 64 bits/cycle for writes.

In the cold-cache case, copy_page_c doesn't use prefetch, while copy_page prefetches according to the copy size, so the copy_page function is better.
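
For reference, the rep-movs variant reduces to a single string
instruction; a minimal C sketch with inline asm (illustrative only, not
the kernel's actual copy_page_c):

#include <stddef.h>

#define PAGE_SIZE 4096

/*
 * rep movsq copies %rcx quadwords from (%rsi) to (%rdi). CPUs with
 * fast string operations can widen the stores internally (toward the
 * 128 bits/cycle write limit on SNB), which is why this wins in the
 * hot-cache case despite the startup latency.
 */
static void copy_page_rep(void *to, const void *from)
{
	size_t qwords = PAGE_SIZE / 8;

	asm volatile("rep movsq"
		     : "+D" (to), "+S" (from), "+c" (qwords)
		     : : "memory");
}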

Thanks
Ling