2013-06-02 16:59:37

by Jussi Kivilinna

Subject: [PATCH 1/2] crypto: twofish - disable AVX2 implementation

It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
workload (tested on Core i5-4570) and causes twofish_avx2 to be significantly
slower than twofish_avx. So disable the AVX2 implementation to avoid
performance regressions.

Signed-off-by: Jussi Kivilinna <[email protected]>
---
crypto/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index d1ca631..678a6ed 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1318,6 +1318,7 @@ config CRYPTO_TWOFISH_AVX_X86_64
config CRYPTO_TWOFISH_AVX2_X86_64
tristate "Twofish cipher algorithm (x86_64/AVX2)"
depends on X86 && 64BIT
+ depends on BROKEN
select CRYPTO_ALGAPI
select CRYPTO_CRYPTD
select CRYPTO_ABLK_HELPER_X86
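
For reference, 'vpgatherdd' performs four 32-bit table loads in a single instruction, taking the indices from one vector register and a write-mask from another (the mask is consumed by the instruction, so it has to be reset before each gather). A minimal sketch of the semantics, using the matching AVX2 intrinsic next to a scalar equivalent; the table name and size here are illustrative only:

#include <stdint.h>
#include <immintrin.h>

static uint32_t tbl[256];	/* illustrative look-up table */

/* dst[i] = tbl[idx[i]] for i = 0..3, scale 4 bytes; the mask-free
 * intrinsic corresponds to vpgatherdd with an all-ones mask. */
static __m128i gather_avx2(__m128i idx)
{
	return _mm_i32gather_epi32((const int *)tbl, idx, 4);
}

/* the same look-up done one lane at a time */
static __m128i gather_scalar(__m128i idx)
{
	uint32_t ind[4], out[4];
	int i;

	_mm_storeu_si128((__m128i *)ind, idx);
	for (i = 0; i < 4; i++)
		out[i] = tbl[ind[i]];
	return _mm_loadu_si128((const __m128i *)out);
}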


2013-06-02 16:57:48

by Jussi Kivilinna

Subject: [PATCH 2/2] crypto: blowfish - disable AVX2 implementation

It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
workload (tested on Core i5-4570) and causes blowfish-avx2 to be significantly
slower than blowfish-amd64. So disable the AVX2 implementation to avoid
performance regressions.

Signed-off-by: Jussi Kivilinna <[email protected]>
---
crypto/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 678a6ed..8ca52c5 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -842,6 +842,7 @@ config CRYPTO_BLOWFISH_X86_64
config CRYPTO_BLOWFISH_AVX2_X86_64
tristate "Blowfish cipher algorithm (x86_64/AVX2)"
depends on X86 && 64BIT
+ depends on BROKEN
select CRYPTO_ALGAPI
select CRYPTO_CRYPTD
select CRYPTO_ABLK_HELPER_X86

2013-06-05 08:34:29

by Herbert Xu

Subject: Re: [PATCH 2/2] crypto: blowfish - disable AVX2 implementation

On Sun, Jun 02, 2013 at 07:51:52PM +0300, Jussi Kivilinna wrote:
> It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
> workload (tested on Core i5-4570) and causes blowfish-avx2 to be significantly
> slower than blowfish-amd64. So disable the AVX2 implementation to avoid
> performance regressions.
>
> Signed-off-by: Jussi Kivilinna <[email protected]>

Both patches applied to crypto. I presume you're working on
a more permanent solution to this?

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2013-06-05 12:26:27

by Jussi Kivilinna

Subject: Re: [PATCH 2/2] crypto: blowfish - disable AVX2 implementation

On 05.06.2013 11:34, Herbert Xu wrote:
> On Sun, Jun 02, 2013 at 07:51:52PM +0300, Jussi Kivilinna wrote:
>> It appears that the performance of 'vpgatherdd' is suboptimal for this kind of
>> workload (tested on Core i5-4570) and causes blowfish-avx2 to be significantly
>> slower than blowfish-amd64. So disable the AVX2 implementation to avoid
>> performance regressions.
>>
>> Signed-off-by: Jussi Kivilinna <[email protected]>
>
> Both patches applied to crypto. I presume you're working on
> a more permanent solution to this?

Yes, I've been looking for a solution. The problem is that I assumed vgather would be quicker than emulating the gather with vpextr/vpinsr instructions, but it appears that vgather is about the same speed as a group of vpextr/vpinsr instructions doing the gather manually. So doing

asm volatile(
	/* dependent chain: each gather's result provides the next gather's indices */
	"vpgatherdd %%xmm0, (%[ptr], %%xmm8, 4), %%xmm9; \n\t"
	"vpcmpeqd %%xmm0, %%xmm0, %%xmm0; /* reset mask */ \n\t"
	"vpgatherdd %%xmm0, (%[ptr], %%xmm9, 4), %%xmm8; \n\t"
	"vpcmpeqd %%xmm0, %%xmm0, %%xmm0; \n\t"
	:: [ptr] "r" (&mem[0]) : "memory", "xmm0", "xmm8", "xmm9"
);

in a loop is slightly _slower_ than manually extracting & inserting the values with

asm volatile(
	/* first gather: extract indices from xmm8, accumulate in xmm10, result in xmm9 */
	"vmovd %%xmm8, %%eax; \n\t"
	"vpextrd $1, %%xmm8, %%edx; \n\t"
	"vmovd (%[ptr], %%rax, 4), %%xmm10; \n\t"
	"vpextrd $2, %%xmm8, %%eax; \n\t"
	"vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10; \n\t"
	"vpextrd $3, %%xmm8, %%edx; \n\t"
	"vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10; \n\t"
	"vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm9; \n\t"

	/* second gather: use the just-gathered values in xmm9 as indices */
	"vmovd %%xmm9, %%eax; \n\t"
	"vpextrd $1, %%xmm9, %%edx; \n\t"
	"vmovd (%[ptr], %%rax, 4), %%xmm10; \n\t"
	"vpextrd $2, %%xmm9, %%eax; \n\t"
	"vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10; \n\t"
	"vpextrd $3, %%xmm9, %%edx; \n\t"
	"vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10; \n\t"
	"vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm8; \n\t"
	:: [ptr] "r" (&mem[0]) : "memory", "eax", "edx", "xmm8", "xmm9", "xmm10"
);
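
A rough sketch of the kind of loop that can be used to time such fragments (illustrative harness, not the exact benchmark: the table contents, iteration count and rdtsc-based timing are all assumptions):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>	/* __rdtsc() */

#define ITERS 10000000UL

static uint32_t mem[256];

int main(void)
{
	uint64_t start, end;
	unsigned long i;

	for (i = 0; i < 256; i++)
		mem[i] = i;	/* values double as in-range indices */

	/* xmm8 = starting indices (zeroed), xmm0 = all-ones gather mask */
	asm volatile("vpxor %%xmm8, %%xmm8, %%xmm8; \n\t"
		     "vpcmpeqd %%xmm0, %%xmm0, %%xmm0; \n\t"
		     ::: "xmm0", "xmm8");

	start = __rdtsc();
	for (i = 0; i < ITERS; i++) {
		/* body under test: either of the two fragments above */
		asm volatile(
			"vpgatherdd %%xmm0, (%[ptr], %%xmm8, 4), %%xmm9; \n\t"
			"vpcmpeqd %%xmm0, %%xmm0, %%xmm0; \n\t"
			"vpgatherdd %%xmm0, (%[ptr], %%xmm9, 4), %%xmm8; \n\t"
			"vpcmpeqd %%xmm0, %%xmm0, %%xmm0; \n\t"
			:: [ptr] "r" (&mem[0])
			: "memory", "xmm0", "xmm8", "xmm9");
	}
	end = __rdtsc();

	printf("%.2f cycles/iteration\n", (double)(end - start) / ITERS);
	return 0;
}

The point of the dependent chain (each gather's result feeding the next gather's indices) is to expose the instruction's latency rather than its throughput, since that is what the cipher round structure looks like.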

vpextr/vpinsr cannot be used with the 256-bit wide ymm registers, so 'vinserti128/vextracti128' are needed as well, which makes the manual gather about the same speed as vpgatherdd.
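
Schematically, the 256-bit variant then looks like the sketch below: split the ymm index register, run the 128-bit vpextr/vpinsr sequence on each half, and recombine. Register allocation is illustrative:

#include <stdint.h>

static uint32_t mem[256];	/* same kind of look-up table as above */

/* one 256-bit "manual gather" step: indices in ymm8, result back in ymm8 */
static void gather256_manual(void)
{
	asm volatile(
		"vextracti128 $1, %%ymm8, %%xmm9; \n\t"	/* high 4 indices -> xmm9 */

		/* low four indices: xmm8 -> xmm10 */
		"vmovd %%xmm8, %%eax; \n\t"
		"vpextrd $1, %%xmm8, %%edx; \n\t"
		"vmovd (%[ptr], %%rax, 4), %%xmm10; \n\t"
		"vpextrd $2, %%xmm8, %%eax; \n\t"
		"vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10; \n\t"
		"vpextrd $3, %%xmm8, %%edx; \n\t"
		"vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10; \n\t"
		"vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10; \n\t"

		/* high four indices: xmm9 -> xmm11 */
		"vmovd %%xmm9, %%eax; \n\t"
		"vpextrd $1, %%xmm9, %%edx; \n\t"
		"vmovd (%[ptr], %%rax, 4), %%xmm11; \n\t"
		"vpextrd $2, %%xmm9, %%eax; \n\t"
		"vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm11, %%xmm11; \n\t"
		"vpextrd $3, %%xmm9, %%edx; \n\t"
		"vpinsrd $2, (%[ptr], %%rax, 4), %%xmm11, %%xmm11; \n\t"
		"vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm11, %%xmm11; \n\t"

		/* recombine the two gathered halves */
		"vinserti128 $1, %%xmm11, %%ymm10, %%ymm8; \n\t"
		:: [ptr] "r" (&mem[0])
		: "memory", "eax", "edx", "xmm8", "xmm9", "xmm10", "xmm11");
}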

Now, the block cipher implementations need to use all bytes of a vector register for table look-ups, and the way this is done in the AVX implementation of Twofish (move data from the vector register to general-purpose registers, do the byte extraction and table look-ups there, and move the processed data back to the vector register) is about two to three times faster than the current AVX2 implementation's use of vgather.
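
The actual twofish-avx assembly is considerably more involved (it keeps several blocks in flight at once), but the shape of the approach is roughly the sketch below; the table names and register choices are illustrative only:

#include <stdint.h>

/* illustrative stand-ins for the four Twofish s-box tables */
static uint32_t s0[256], s1[256], s2[256], s3[256];

/* one 32-bit word's worth of look-ups done via general-purpose
 * registers; assumes the input word is in xmm8, result lands in xmm9 */
static void lookup_via_gprs(void)
{
	asm volatile(
		"vmovd %%xmm8, %%eax; \n\t"		/* vector data -> eax */
		"movzbl %%al, %%edx; \n\t"		/* byte 0 */
		"movl (%[s0], %%rdx, 4), %%ecx; \n\t"	/* t  = s0[b0] */
		"movzbl %%ah, %%edx; \n\t"		/* byte 1 */
		"xorl (%[s1], %%rdx, 4), %%ecx; \n\t"	/* t ^= s1[b1] */
		"shrl $16, %%eax; \n\t"
		"movzbl %%al, %%edx; \n\t"		/* byte 2 */
		"xorl (%[s2], %%rdx, 4), %%ecx; \n\t"	/* t ^= s2[b2] */
		"movzbl %%ah, %%edx; \n\t"		/* byte 3 */
		"xorl (%[s3], %%rdx, 4), %%ecx; \n\t"	/* t ^= s3[b3] */
		"vmovd %%ecx, %%xmm9; \n\t"		/* result back to a vector reg */
		:: [s0] "r" (s0), [s1] "r" (s1),
		   [s2] "r" (s2), [s3] "r" (s3)
		: "memory", "eax", "ecx", "edx", "xmm9");
}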

Blowfish does not do much processing besides the table look-ups, so there is not much that can be done there. With Twofish, the table look-ups are the most computationally heavy part, and I don't think the wider vector registers are going to give much of a boost in the other parts. So the permanent solution is likely to be a revert.

-Jussi

>
> Thanks,
>