This patchset uses cacheable versions of memset and memcpy when the
length is at least the cache line size and the destination is in RAM.

On MPC885, we observe a 7% increase in transfer rate on FTP transfers.
Christophe Leroy (4):
Partially revert "powerpc: Remove duplicate cacheable_memcpy/memzero
functions"
powerpc32: swap r4 and r5 in cacheable_memzero
powerpc32: memset(0): use cacheable_memzero
powerpc32: memcpy: use cacheable_memcpy
arch/powerpc/lib/copy_32.S | 148 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 148 insertions(+)
--
2.1.0
This partially reverts
commit 'powerpc: Remove duplicate cacheable_memcpy/memzero functions
("f909a35bdfb7cb350d078a2cf888162eeb20381c")'
Functions cacheable_memcpy/memzero are more efficient than
memcpy/memset as they use the dcbz instruction which avoids refill
of the cacheline with the data that we will overwrite.
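
For illustration, the idea behind cacheable_memzero is roughly the
following C sketch (simplified and with illustrative names: the real
assembly also word-aligns the head and handles the tail word by word,
and the L1_CACHE_BYTES value is only an example):

#define L1_CACHE_BYTES	16	/* example value; must match the CPU's L1 line size */

static inline void dcbz(void *addr)
{
	/* Zero a whole cache line without fetching it from memory first */
	__asm__ __volatile__("dcbz 0,%0" : : "r"(addr) : "memory");
}

static void cacheable_memzero_sketch(void *p, unsigned long n)
{
	char *q = p;

	/* Byte stores until q is cache line aligned */
	while (n && ((unsigned long)q & (L1_CACHE_BYTES - 1))) {
		*q++ = 0;
		n--;
	}
	/* dcbz full lines: no refill of data we are about to overwrite */
	while (n >= L1_CACHE_BYTES) {
		dcbz(q);
		q += L1_CACHE_BYTES;
		n -= L1_CACHE_BYTES;
	}
	/* Remaining tail */
	while (n--)
		*q++ = 0;
}

This only works if the destination is cacheable, which is why the later
patches in the series check that the destination is in RAM.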
Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/lib/copy_32.S | 127 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 6813f80..55f19f9 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -69,6 +69,54 @@ CACHELINE_BYTES = L1_CACHE_BYTES
LG_CACHELINE_BYTES = L1_CACHE_SHIFT
CACHELINE_MASK = (L1_CACHE_BYTES-1)
+/*
+ * Use dcbz on the complete cache lines in the destination
+ * to set them to zero. This requires that the destination
+ * area is cacheable. -- paulus
+ */
+_GLOBAL(cacheable_memzero)
+ mr r5,r4
+ li r4,0
+ addi r6,r3,-4
+ cmplwi 0,r5,4
+ blt 7f
+ stwu r4,4(r6)
+ beqlr
+ andi. r0,r6,3
+ add r5,r0,r5
+ subf r6,r0,r6
+ clrlwi r7,r6,32-LG_CACHELINE_BYTES
+ add r8,r7,r5
+ srwi r9,r8,LG_CACHELINE_BYTES
+ addic. r9,r9,-1 /* total number of complete cachelines */
+ ble 2f
+ xori r0,r7,CACHELINE_MASK & ~3
+ srwi. r0,r0,2
+ beq 3f
+ mtctr r0
+4: stwu r4,4(r6)
+ bdnz 4b
+3: mtctr r9
+ li r7,4
+10: dcbz r7,r6
+ addi r6,r6,CACHELINE_BYTES
+ bdnz 10b
+ clrlwi r5,r8,32-LG_CACHELINE_BYTES
+ addi r5,r5,4
+2: srwi r0,r5,2
+ mtctr r0
+ bdz 6f
+1: stwu r4,4(r6)
+ bdnz 1b
+6: andi. r5,r5,3
+7: cmpwi 0,r5,0
+ beqlr
+ mtctr r5
+ addi r6,r6,3
+8: stbu r4,1(r6)
+ bdnz 8b
+ blr
+
_GLOBAL(memset)
rlwimi r4,r4,8,16,23
rlwimi r4,r4,16,0,15
@@ -94,6 +142,85 @@ _GLOBAL(memset)
bdnz 8b
blr
+/*
+ * This version uses dcbz on the complete cache lines in the
+ * destination area to reduce memory traffic. This requires that
+ * the destination area is cacheable.
+ * We only use this version if the source and dest don't overlap.
+ * -- paulus.
+ */
+_GLOBAL(cacheable_memcpy)
+ add r7,r3,r5 /* test if the src & dst overlap */
+ add r8,r4,r5
+ cmplw 0,r4,r7
+ cmplw 1,r3,r8
+ crand 0,0,4 /* cr0.lt &= cr1.lt */
+ blt memcpy /* if regions overlap */
+
+ addi r4,r4,-4
+ addi r6,r3,-4
+ neg r0,r3
+ andi. r0,r0,CACHELINE_MASK /* # bytes to start of cache line */
+ beq 58f
+
+ cmplw 0,r5,r0 /* is this more than total to do? */
+ blt 63f /* if not much to do */
+ andi. r8,r0,3 /* get it word-aligned first */
+ subf r5,r0,r5
+ mtctr r8
+ beq+ 61f
+70: lbz r9,4(r4) /* do some bytes */
+ stb r9,4(r6)
+ addi r4,r4,1
+ addi r6,r6,1
+ bdnz 70b
+61: srwi. r0,r0,2
+ mtctr r0
+ beq 58f
+72: lwzu r9,4(r4) /* do some words */
+ stwu r9,4(r6)
+ bdnz 72b
+
+58: srwi. r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
+ clrlwi r5,r5,32-LG_CACHELINE_BYTES
+ li r11,4
+ mtctr r0
+ beq 63f
+53:
+ dcbz r11,r6
+ COPY_16_BYTES
+#if L1_CACHE_BYTES >= 32
+ COPY_16_BYTES
+#if L1_CACHE_BYTES >= 64
+ COPY_16_BYTES
+ COPY_16_BYTES
+#if L1_CACHE_BYTES >= 128
+ COPY_16_BYTES
+ COPY_16_BYTES
+ COPY_16_BYTES
+ COPY_16_BYTES
+#endif
+#endif
+#endif
+ bdnz 53b
+
+63: srwi. r0,r5,2
+ mtctr r0
+ beq 64f
+30: lwzu r0,4(r4)
+ stwu r0,4(r6)
+ bdnz 30b
+
+64: andi. r0,r5,3
+ mtctr r0
+ beq+ 65f
+40: lbz r0,4(r4)
+ stb r0,4(r6)
+ addi r4,r4,1
+ addi r6,r6,1
+ bdnz 40b
+65: blr
+
_GLOBAL(memmove)
cmplw 0,r3,r4
bgt backwards_memcpy
--
2.1.0
Swap r4 and r5 in cacheable_memzero: this avoids having to move the
length, received in r4, into r5.
Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/lib/copy_32.S | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index 55f19f9..cbca76c 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -75,18 +75,17 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
* area is cacheable. -- paulus
*/
_GLOBAL(cacheable_memzero)
- mr r5,r4
- li r4,0
+ li r5,0
addi r6,r3,-4
- cmplwi 0,r5,4
+ cmplwi 0,r4,4
blt 7f
- stwu r4,4(r6)
+ stwu r5,4(r6)
beqlr
andi. r0,r6,3
- add r5,r0,r5
+ add r4,r0,r4
subf r6,r0,r6
clrlwi r7,r6,32-LG_CACHELINE_BYTES
- add r8,r7,r5
+ add r8,r7,r4
srwi r9,r8,LG_CACHELINE_BYTES
addic. r9,r9,-1 /* total number of complete cachelines */
ble 2f
@@ -94,26 +93,26 @@ _GLOBAL(cacheable_memzero)
srwi. r0,r0,2
beq 3f
mtctr r0
-4: stwu r4,4(r6)
+4: stwu r5,4(r6)
bdnz 4b
3: mtctr r9
li r7,4
10: dcbz r7,r6
addi r6,r6,CACHELINE_BYTES
bdnz 10b
- clrlwi r5,r8,32-LG_CACHELINE_BYTES
- addi r5,r5,4
-2: srwi r0,r5,2
+ clrlwi r4,r8,32-LG_CACHELINE_BYTES
+ addi r4,r4,4
+2: srwi r0,r4,2
mtctr r0
bdz 6f
-1: stwu r4,4(r6)
+1: stwu r5,4(r6)
bdnz 1b
-6: andi. r5,r5,3
-7: cmpwi 0,r5,0
+6: andi. r4,r4,3
+7: cmpwi 0,r4,0
beqlr
- mtctr r5
+ mtctr r4
addi r6,r6,3
-8: stbu r4,1(r6)
+8: stbu r5,1(r6)
bdnz 8b
blr
--
2.1.0
cacheable_memzero uses the dcbz instruction and is more efficient than
memset(0) when the destination is in RAM.

This patch renames memset as generic_memset, and defines memset
as a prolog to cacheable_memzero. This prolog checks that the byte
to set is 0, that the length is at least one cache line, and that the
buffer is in RAM. If not, it falls back to generic_memset().
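
The added prolog is roughly equivalent to the following C sketch
(__pa() here stands for the tophys() done in the assembly; max_pfn,
PAGE_SHIFT and L1_CACHE_BYTES are the kernel symbols used above):

extern void *generic_memset(void *s, int c, size_t n);
extern void cacheable_memzero(void *s, size_t n);

void *memset(void *s, int c, size_t n)
{
	unsigned long pfn = __pa(s) >> PAGE_SHIFT;	/* tophys + srwi */

	if (c != 0 || n < L1_CACHE_BYTES || pfn >= max_pfn)
		return generic_memset(s, c, n);

	cacheable_memzero(s, n);
	return s;
}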
Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index cbca76c..d8a9a86 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -12,6 +12,7 @@
#include <asm/cache.h>
#include <asm/errno.h>
#include <asm/ppc_asm.h>
+#include <asm/page.h>
#define COPY_16_BYTES \
lwz r7,4(r4); \
@@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
* to set them to zero. This requires that the destination
* area is cacheable. -- paulus
*/
+_GLOBAL(memset)
+ cmplwi r4,0
+ bne- generic_memset
+ cmplwi r5,L1_CACHE_BYTES
+ blt- generic_memset
+ lis r8,max_pfn@ha
+ lwz r8,max_pfn@l(r8)
+ tophys (r9,r3)
+ srwi r9,r9,PAGE_SHIFT
+ cmplw r9,r8
+ bge- generic_memset
+ mr r4,r5
_GLOBAL(cacheable_memzero)
li r5,0
addi r6,r3,-4
@@ -116,7 +129,7 @@ _GLOBAL(cacheable_memzero)
bdnz 8b
blr
-_GLOBAL(memset)
+_GLOBAL(generic_memset)
rlwimi r4,r4,8,16,23
rlwimi r4,r4,16,0,15
addi r6,r3,-4
--
2.1.0
cacheable_memcpy uses the dcbz instruction and is more efficient than
memcpy when the destination is in RAM.

This patch renames memcpy as generic_memcpy, and defines memcpy as a
prolog to cacheable_memcpy. This prolog checks that the length is at
least one cache line and that the destination buffer is in RAM. If not,
it falls back to generic_memcpy().

On MPC885, we get approximately a 7% increase in transfer rate on
FTP reception.
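
The resulting dispatch is roughly the following C sketch (__pa()
mirrors the tophys() in the assembly):

extern void *generic_memcpy(void *dst, const void *src, size_t n);
extern void *cacheable_memcpy(void *dst, const void *src, size_t n);

void *memcpy(void *dst, const void *src, size_t n)
{
	unsigned long pfn = __pa(dst) >> PAGE_SHIFT;

	if (n < L1_CACHE_BYTES || pfn >= max_pfn)
		return generic_memcpy(dst, src, n);

	/*
	 * cacheable_memcpy itself falls back to generic_memcpy when
	 * src < dst + n && dst < src + n, i.e. when the areas overlap.
	 */
	return cacheable_memcpy(dst, src, n);
}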
Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/lib/copy_32.S | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index d8a9a86..8f76d49 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -161,13 +161,27 @@ _GLOBAL(generic_memset)
* We only use this version if the source and dest don't overlap.
* -- paulus.
*/
+_GLOBAL(memmove)
+ cmplw 0,r3,r4
+ bgt backwards_memcpy
+ /* fall through */
+
+_GLOBAL(memcpy)
+ cmplwi r5,L1_CACHE_BYTES
+ blt- generic_memcpy
+ lis r8,max_pfn@ha
+ lwz r8,max_pfn@l(r8)
+ tophys (r9,r3)
+ srwi r9,r9,PAGE_SHIFT
+ cmplw r9,r8
+ bge- generic_memcpy
_GLOBAL(cacheable_memcpy)
add r7,r3,r5 /* test if the src & dst overlap */
add r8,r4,r5
cmplw 0,r4,r7
cmplw 1,r3,r8
crand 0,0,4 /* cr0.lt &= cr1.lt */
- blt memcpy /* if regions overlap */
+ blt generic_memcpy /* if regions overlap */
addi r4,r4,-4
addi r6,r3,-4
@@ -233,12 +247,7 @@ _GLOBAL(cacheable_memcpy)
bdnz 40b
65: blr
-_GLOBAL(memmove)
- cmplw 0,r3,r4
- bgt backwards_memcpy
- /* fall through */
-
-_GLOBAL(memcpy)
+_GLOBAL(generic_memcpy)
srwi. r7,r5,3
addi r6,r3,-4
addi r4,r4,-4
--
2.1.0
On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
> This partially reverts
> commit 'powerpc: Remove duplicate cacheable_memcpy/memzero functions
> ("f909a35bdfb7cb350d078a2cf888162eeb20381c")'
I don't have that SHA. Do you mean
b05ae4ee602b7dc90771408ccf0972e1b3801a35?
> Functions cacheable_memcpy/memzero are more efficient than
> memcpy/memset as they use the dcbz instruction which avoids refill
> of the cacheline with the data that we will overwrite.
I don't see anything in this patchset that addresses the "NOTE: The old
routines are just flat buggy on kernels that support hardware with
different cacheline sizes" comment.
-Scott
On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
> cacheable_memzero uses dcbz instruction and is more efficient than
> memset(0) when the destination is in RAM
>
> This patch renames memset as generic_memset, and defines memset
> as a prolog to cacheable_memzero. This prolog checks if the byte
> to set is 0 and if the buffer is in RAM. If not, it falls back to
> generic_memcpy()
>
> Signed-off-by: Christophe Leroy <[email protected]>
> ---
> arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
> index cbca76c..d8a9a86 100644
> --- a/arch/powerpc/lib/copy_32.S
> +++ b/arch/powerpc/lib/copy_32.S
> @@ -12,6 +12,7 @@
> #include <asm/cache.h>
> #include <asm/errno.h>
> #include <asm/ppc_asm.h>
> +#include <asm/page.h>
>
> #define COPY_16_BYTES \
> lwz r7,4(r4); \
> @@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
> * to set them to zero. This requires that the destination
> * area is cacheable. -- paulus
> */
> +_GLOBAL(memset)
> + cmplwi r4,0
> + bne- generic_memset
> + cmplwi r5,L1_CACHE_BYTES
> + blt- generic_memset
> + lis r8,max_pfn@ha
> + lwz r8,max_pfn@l(r8)
> + tophys (r9,r3)
> + srwi r9,r9,PAGE_SHIFT
> + cmplw r9,r8
> + bge- generic_memset
> + mr r4,r5
max_pfn includes highmem, and tophys only works on normal kernel
addresses.
If we were to point memset_io, memcpy_toio, etc. at noncacheable
versions, are there any other callers left that can reasonably point at
uncacheable memory?
-Scott
On 14/05/2015 02:55, Scott Wood wrote:
> On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
>> cacheable_memzero uses dcbz instruction and is more efficient than
>> memset(0) when the destination is in RAM
>>
>> This patch renames memset as generic_memset, and defines memset
>> as a prolog to cacheable_memzero. This prolog checks if the byte
>> to set is 0 and if the buffer is in RAM. If not, it falls back to
>> generic_memcpy()
>>
>> Signed-off-by: Christophe Leroy <[email protected]>
>> ---
>> arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
>> index cbca76c..d8a9a86 100644
>> --- a/arch/powerpc/lib/copy_32.S
>> +++ b/arch/powerpc/lib/copy_32.S
>> @@ -12,6 +12,7 @@
>> #include <asm/cache.h>
>> #include <asm/errno.h>
>> #include <asm/ppc_asm.h>
>> +#include <asm/page.h>
>>
>> #define COPY_16_BYTES \
>> lwz r7,4(r4); \
>> @@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
>> * to set them to zero. This requires that the destination
>> * area is cacheable. -- paulus
>> */
>> +_GLOBAL(memset)
>> + cmplwi r4,0
>> + bne- generic_memset
>> + cmplwi r5,L1_CACHE_BYTES
>> + blt- generic_memset
>> + lis r8,max_pfn@ha
>> + lwz r8,max_pfn@l(r8)
>> + tophys (r9,r3)
>> + srwi r9,r9,PAGE_SHIFT
>> + cmplw r9,r8
>> + bge- generic_memset
>> + mr r4,r5
> max_pfn includes highmem, and tophys only works on normal kernel
> addresses.
Is there any other simple way to determine whether an address is in RAM
or not?
I did that because of the function below from mm/mem.c:
int page_is_ram(unsigned long pfn)
{
#ifndef CONFIG_PPC64	/* XXX for now */
	return pfn < max_pfn;
#else
	unsigned long paddr = (pfn << PAGE_SHIFT);
	struct memblock_region *reg;

	for_each_memblock(memory, reg)
		if (paddr >= reg->base && paddr < (reg->base + reg->size))
			return 1;
	return 0;
#endif
}
>
> If we were to point memset_io, memcpy_toio, etc. at noncacheable
> versions, are there any other callers left that can reasonably point at
> uncacheable memory?
Do you mean we could just consider that memcpy() and memset() are only
called with a destination in RAM, and thus we could avoid the check?
copy_tofrom_user() already makes this assumption (although a user app
could possibly provide a buffer located in an ALSA-mapped IO area)
Christophe
On Thu, 2015-05-14 at 10:50 +0200, christophe leroy wrote:
>
> On 14/05/2015 02:55, Scott Wood wrote:
> > On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
> >> cacheable_memzero uses dcbz instruction and is more efficient than
> >> memset(0) when the destination is in RAM
> >>
> >> This patch renames memset as generic_memset, and defines memset
> >> as a prolog to cacheable_memzero. This prolog checks if the byte
> >> to set is 0 and if the buffer is in RAM. If not, it falls back to
> >> generic_memcpy()
> >>
> >> Signed-off-by: Christophe Leroy <[email protected]>
> >> ---
> >> arch/powerpc/lib/copy_32.S | 15 ++++++++++++++-
> >> 1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
> >> index cbca76c..d8a9a86 100644
> >> --- a/arch/powerpc/lib/copy_32.S
> >> +++ b/arch/powerpc/lib/copy_32.S
> >> @@ -12,6 +12,7 @@
> >> #include <asm/cache.h>
> >> #include <asm/errno.h>
> >> #include <asm/ppc_asm.h>
> >> +#include <asm/page.h>
> >>
> >> #define COPY_16_BYTES \
> >> lwz r7,4(r4); \
> >> @@ -74,6 +75,18 @@ CACHELINE_MASK = (L1_CACHE_BYTES-1)
> >> * to set them to zero. This requires that the destination
> >> * area is cacheable. -- paulus
> >> */
> >> +_GLOBAL(memset)
> >> + cmplwi r4,0
> >> + bne- generic_memset
> >> + cmplwi r5,L1_CACHE_BYTES
> >> + blt- generic_memset
> >> + lis r8,max_pfn@ha
> >> + lwz r8,max_pfn@l(r8)
> >> + tophys (r9,r3)
> >> + srwi r9,r9,PAGE_SHIFT
> >> + cmplw r9,r8
> >> + bge- generic_memset
> >> + mr r4,r5
> > max_pfn includes highmem, and tophys only works on normal kernel
> > addresses.
> Is there any other simple way to determine whether an address is in RAM
> or not ?
If you want to do it based on the virtual address, rather than doing a
tablewalk or TLB search, you need to limit it to lowmem.
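Something like this (just a sketch using the usual lowmem bounds;
whether it is sufficient for all callers is another question):

static inline int dst_is_lowmem(const void *p)
{
	unsigned long addr = (unsigned long)p;

	/* Lowmem only: virt->phys is a constant offset and it's cacheable RAM */
	return addr >= PAGE_OFFSET && addr < (unsigned long)high_memory;
}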
> I did that because of the below function from mm/mem.c
>
> int page_is_ram(unsigned long pfn)
> {
> #ifndef CONFIG_PPC64	/* XXX for now */
> 	return pfn < max_pfn;
> #else
> 	unsigned long paddr = (pfn << PAGE_SHIFT);
> 	struct memblock_region *reg;
>
> 	for_each_memblock(memory, reg)
> 		if (paddr >= reg->base && paddr < (reg->base + reg->size))
> 			return 1;
> 	return 0;
> #endif
> }
Right, the problem is figuring out the pfn in the first place.
> > If we were to point memset_io, memcpy_toio, etc. at noncacheable
> > versions, are there any other callers left that can reasonably point at
> > uncacheable memory?
> Do you mean we could just consider that memcpy() and memset() are called
> only with destination on RAM and thus we could avoid the check ?
Maybe. If that's not a safe assumption I hope someone will point it
out.
> copy_tofrom_user() already does this assumption (allthought a user app
> could possibly provide a buffer located in an ALSA mapped IO area)
The user could also pass in NULL. That's what the fixups are for. :-)
-Scott
On 14/05/2015 02:49, Scott Wood wrote:
> On Tue, 2015-05-12 at 15:32 +0200, Christophe Leroy wrote:
>> This partially reverts
>> commit 'powerpc: Remove duplicate cacheable_memcpy/memzero functions
>> ("f909a35bdfb7cb350d078a2cf888162eeb20381c")'
> I don't have that SHA. Do you mean
> b05ae4ee602b7dc90771408ccf0972e1b3801a35?
Right, I took it from the wrong tree, sorry.
>
>> Functions cacheable_memcpy/memzero are more efficient than
>> memcpy/memset as they use the dcbz instruction which avoids refill
>> of the cacheline with the data that we will overwrite.
> I don't see anything in this patchset that addresses the "NOTE: The old
> routines are just flat buggy on kernels that support hardware with
> different cacheline sizes" comment.
I believe the NOTE means that if a kernel is compiled for several CPUs
having different cache line sizes, then it will not work. But that is
also the case for other functions using the dcbz instruction, like
copy_page(), clear_page() and copy_tofrom_user().

And indeed, this seems only possible in three cases:

1/ With CONFIG_44x, as 47x has a different cache line size than 44x and
46x. However, it is explicitly stated in
arch/powerpc/platforms/44x/Kconfig: "config PPC_47x: This option enables
support for the 47x family of processors and is not currently compatible
with other 44x or 46x varients"

2/ With CONFIG_PPC_85xx, as PPC_E500MC has a different cache line size
than other E500. However, it is explicitly stated in
arch/powerpc/platforms/Kconfig.cputype: "config PPC_E500MC: This must be
enabled for running on e500mc (and derivatives such as e5500/e6500), and
must be disabled for running on e500v1 or e500v2."

3/ With CONFIG_403GCX, as 403GCX has a different cache line size than
other 40x. However, there seems to be no way to select CONFIG_403GCX
from arch/powerpc/platforms/40x/Kconfig.
Christophe