2007-10-25 11:44:08

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH0/5] camellia: cleanup, de-unrolling, and 64bit-ization

Hi Hervert,

Please review and maybe propagate upstream following patches.

camellia1.diff:
Move code blocks around so that related pieces are closer together:
e.g. CAMELLIA_ROUNDSM macro does not need to be separated
from the rest of the code by huge array of constants.

Remove unused macros (COPY4WORD, SWAP4WORD, XOR4WORD[2])

Drop SUBL(), SUBR() macros which only obscure things.
Same for CAMELLIA_SP1110() macro and KEY_TABLE_TYPE typedef.

Remove useless comments:
/* encryption */ -- well it's obvious enough already!
void camellia_encrypt128(...)

Combine swap with copying at the beginning/end of encrypt/decrypt.


camellia2.diff
Rename some macros to shorter names: CAMELLIA_RR8 -> ROR8,
making it easier to understand that it is just a right rotation,
nothing camellia-specific in it.
CAMELLIA_SUBKEY_L() -> SUBKEY_L() - just shorter.

Move be32 <-> cpu conversions out of en/decrypt128/256 and into
camellia_en/decrypt - no reason to have that code duplicated twice.


camellia3.diff
Optimize GETU32 to use 4-byte memcpy (modern gcc will convert
such memcpy to single move instruction on i386).
Original GETU32 did four byte fetches, and shifted/XORed those.


camellia4.diff
Move huge unrolled pieces of code (3 screenfuls) at the end of
128/256 key setup routines into common camellia_setup_tail(),
convert it to loop there.
Loop is still unrolled six times, so performance hit is very small,
code size win is big.


camellia5.diff
Use alternative key setup implementation with mostly 64-bit ops
if BITS_PER_LONG >= 64. Both much smaller and much faster.

Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
Code was similar, with just one additional if() we can use came code.

If CONFIG_CC_OPTIMIZE_FOR_SIZE is defined,
use loop in camellia_do_en/decrypt instead of unrolled code.
~5% encrypt/decrypt slowdown.

Replace (x & 0xff) with (u8)x, gcc is not smart enough to realize
that it can do (x & 0xff) this way (which is smaller at least on i386).

Don't do (x & 0xff) in a few places where x cannot be > 255 anyway:
t0 = il >> 16; v = camellia_sp0222[(t1 >> 8) & 0xff];
il16 is u32, (thus t1 >> 8) is one byte!



Benchmarking was done in userspace (see attached tarball for code).
All times are in microseconds. Two runs give some idea of test variability.
"Setup NN: NNNNNN NNNNNN" - time taken by 100000 key setups (two runs).
"Encrypt: NNNNNN NNNNNN" - time taken by 1000 encryptions of 8K buffer.
"Decrypt: NNNNNN NNNNNN" - time taken by 1000 decryptions of 8K buffer.
"(matches)" - encrypt/decrypt cycle produced non corrupted plaintext.

CONFIG_CC_OPTIMIZE_FOR_SIZE is not set:

$ ./camellia
Setup 16:32779 33169 Encrypt:153582 153740 Decrypt:150985 149811 (matches)
Setup 24:49333 48987 Encrypt:197973 198853 Decrypt:201240 197585 (matches)
Setup 32:46700 47680 Encrypt:195650 195800 Decrypt:195450 195469 (matches)
$ ./camellia5
Setup 16:33417 32968 Encrypt:149195 149095 Decrypt:148593 148661 (matches)
Setup 24:50082 50064 Encrypt:201214 199204 Decrypt:197078 197579 (matches)
Setup 32:48938 48824 Encrypt:200231 199545 Decrypt:198954 198996 (matches)
$ ./camellia_64
Setup 16:22247 22473 Encrypt:152321 149860 Decrypt:149058 148451 (matches)
Setup 24:33832 34017 Encrypt:200428 202969 Decrypt:196789 195524 (matches)
Setup 32:32884 32821 Encrypt:200414 200640 Decrypt:197857 195987 (matches)

$ size camellia.o camellia7.o camellia_64.o
text data bss dec hex filename
24586 0 0 24586 600a camellia.o
21714 0 0 21714 54d2 camellia5.o
18666 0 0 18666 48ea camellia_64.o

Very small speed loss in camellia -> camellia5, noticeably smaller size.
Big key setup speedup in 64-bit camellia_64, and it is even smaller.


CONFIG_CC_OPTIMIZE_FOR_SIZE is set:

$ ./camellia_Os
Setup 16:32573 34985 Encrypt:151825 152011 Decrypt:147581 147630 (matches)
Setup 24:48528 49250 Encrypt:196223 199056 Decrypt:198811 196394 (matches)
Setup 32:46650 47538 Encrypt:197466 196412 Decrypt:196290 196550 (matches)
$ ./camellia5_Os
Setup 16:33360 34487 Encrypt:154718 154499 Decrypt:157432 157135 (matches)
Setup 24:53969 54304 Encrypt:205184 205818 Decrypt:210675 208552 (matches)
Setup 32:53064 52904 Encrypt:205350 205439 Decrypt:211654 208468 (matches)
$ ./camellia_64_Os
Setup 16:24696 25894 Encrypt:155903 155747 Decrypt:157385 155696 (matches)
Setup 24:33873 33230 Encrypt:206111 206385 Decrypt:208111 207650 (matches)
Setup 32:32799 32325 Encrypt:209715 205973 Decrypt:207578 207644 (matches)

$ size camellia_Os.o camellia7_Os.o camellia_64_Os.o
text data bss dec hex filename
24586 0 0 24586 600a camellia_Os.o
15906 0 0 15906 3e22 camellia5_Os.o
13098 0 0 13098 332a camellia_64_Os.o

Some speed loss in camellia -> camellia5, much smaller size.
Big key setup speedup in 64-bit camellia_64, and it is even smaller still.


Above sizes are for userspace test programs. Kernel sizes are similar.
For example, kernel module sizes with CONFIG_CC_OPTIMIZE_FOR_SIZE set, AMD64:

$ size */camellia.o
text data bss dec hex filename
23208 272 0 23480 5bb8 crypto.org/camellia.o
11328 272 0 11600 2d50 crypto/camellia.o

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (5.33 kB)
test_camellia.tar.bz2 (17.67 kB)
Download all attachments

2007-10-25 11:45:14

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH 1/5] camellia: cleanup

On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> Hi Hervert,
>
> Please review and maybe propagate upstream following patches.
>
> camellia1.diff:
> Move code blocks around so that related pieces are closer together:
> e.g. CAMELLIA_ROUNDSM macro does not need to be separated
> from the rest of the code by huge array of constants.
>
> Remove unused macros (COPY4WORD, SWAP4WORD, XOR4WORD[2])
>
> Drop SUBL(), SUBR() macros which only obscure things.
> Same for CAMELLIA_SP1110() macro and KEY_TABLE_TYPE typedef.
>
> Remove useless comments:
> /* encryption */ -- well it's obvious enough already!
> void camellia_encrypt128(...)
>
> Combine swap with copying at the beginning/end of encrypt/decrypt.

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (824.00 B)
camellia1.diff (42.44 kB)
Download all attachments

2007-10-25 11:45:54

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH 2/5] camellia: cleanup

On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> Hi Hervert,
>
> Please review and maybe propagate upstream following patches.
>
> camellia2.diff
> Rename some macros to shorter names: CAMELLIA_RR8 -> ROR8,
> making it easier to understand that it is just a right rotation,
> nothing camellia-specific in it.
> CAMELLIA_SUBKEY_L() -> SUBKEY_L() - just shorter.
>
> Move be32 <-> cpu conversions out of en/decrypt128/256 and into
> camellia_en/decrypt - no reason to have that code duplicated twice.

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (601.00 B)
camellia2.diff (55.22 kB)
Download all attachments

2007-10-25 11:46:43

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH 3/5] camellia: cleanup

On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> Hi Hervert,
>
> Please review and maybe propagate upstream following patches.
>
> camellia3.diff
> Optimize GETU32 to use 4-byte memcpy (modern gcc will convert
> such memcpy to single move instruction on i386).
> Original GETU32 did four byte fetches, and shifted/XORed those.

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (416.00 B)
camellia3.diff (2.06 kB)
Download all attachments

2007-10-25 11:47:24

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH 4/5] camellia: de-unrolling

On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> Hi Hervert,
>
> Please review and maybe propagate upstream following patches.
>
> camellia4.diff
> Move huge unrolled pieces of code (3 screenfuls) at the end of
> 128/256 key setup routines into common camellia_setup_tail(),
> convert it to loop there.
> Loop is still unrolled six times, so performance hit is very small,
> code size win is big.

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (494.00 B)
camellia4.diff (6.69 kB)
Download all attachments

2007-10-25 11:48:38

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> Hi Hervert,
>
> Please review and maybe propagate upstream following patches.
>
> camellia5.diff
> Use alternative key setup implementation with mostly 64-bit ops
> if BITS_PER_LONG >= 64. Both much smaller and much faster.
>
> Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
> Code was similar, with just one additional if() we can use came code.
>
> If CONFIG_CC_OPTIMIZE_FOR_SIZE is defined,
> use loop in camellia_do_en/decrypt instead of unrolled code.
> ~5% encrypt/decrypt slowdown.
>
> Replace (x & 0xff) with (u8)x, gcc is not smart enough to realize
> that it can do (x & 0xff) this way (which is smaller at least on i386).
>
> Don't do (x & 0xff) in a few places where x cannot be > 255 anyway:
> t0 = il >> 16; v = camellia_sp0222[(t1 >> 8) & 0xff];
> il16 is u32, (thus t1 >> 8) is one byte!

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (0.98 kB)
camellia5.diff (54.42 kB)
Download all attachments

2007-10-25 11:57:48

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH0/5] camellia: cleanup, de-unrolling, and 64bit-ization

Hi HERBERT (with B!)

On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> Hi Hervert,
^

Sorry.
--
vda

2007-10-26 09:11:59

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [PATCH 3/5] camellia: cleanup

>> Thu, 25 Oct 2007 12:46:35 +0100
>> [Subject: [PATCH 3/5] camellia: cleanup]
>> Denys Vlasenko <[email protected]> wrote...

> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia3.diff
> > Optimize GETU32 to use 4-byte memcpy (modern gcc will convert
> > such memcpy to single move instruction on i386).
> > Original GETU32 did four byte fetches, and shifted/XORed those.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Acked-by: Noriaki TAKAMIYA <[email protected]>

--
Noriaki TAKAMIYA

2007-10-26 09:12:07

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [PATCH 4/5] camellia: de-unrolling

Hi,

>> Thu, 25 Oct 2007 12:47:16 +0100
>> [Subject: [PATCH 4/5] camellia: de-unrolling]
>> Denys Vlasenko <[email protected]> wrote...

> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia4.diff
> > Move huge unrolled pieces of code (3 screenfuls) at the end of
> > 128/256 key setup routines into common camellia_setup_tail(),
> > convert it to loop there.
> > Loop is still unrolled six times, so performance hit is very small,
> > code size win is big.
>
> Signed-off-by: Denys Vlasenko <[email protected]>
> --
> vda
Acked-by: Noriaki TAKAMIYA <[email protected]>

--
Noriaki TAKAMIYA

2007-10-26 09:12:21

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [PATCH 2/5] camellia: cleanup

>> Thu, 25 Oct 2007 12:45:42 +0100
>> [Subject: [PATCH 2/5] camellia: cleanup]
>> Denys Vlasenko <[email protected]> wrote...

> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia2.diff
> > Rename some macros to shorter names: CAMELLIA_RR8 -> ROR8,
> > making it easier to understand that it is just a right rotation,
> > nothing camellia-specific in it.
> > CAMELLIA_SUBKEY_L() -> SUBKEY_L() - just shorter.
> >
> > Move be32 <-> cpu conversions out of en/decrypt128/256 and into
> > camellia_en/decrypt - no reason to have that code duplicated twice.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Acked-by: Noriaki TAKAMIYA <[email protected]>

--
Noriaki TAKAMIYA

2007-10-26 09:12:31

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [PATCH 1/5] camellia: cleanup

Hi,

>> Thu, 25 Oct 2007 12:45:04 +0100
>> [Subject: [PATCH 1/5] camellia: cleanup]
>> Denys Vlasenko <[email protected]> wrote...

> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia1.diff:
> > Move code blocks around so that related pieces are closer together:
> > e.g. CAMELLIA_ROUNDSM macro does not need to be separated
> > from the rest of the code by huge array of constants.
> >
> > Remove unused macros (COPY4WORD, SWAP4WORD, XOR4WORD[2])
> >
> > Drop SUBL(), SUBR() macros which only obscure things.
> > Same for CAMELLIA_SP1110() macro and KEY_TABLE_TYPE typedef.
> >
> > Remove useless comments:
> > /* encryption */ -- well it's obvious enough already!
> > void camellia_encrypt128(...)
> >
> > Combine swap with copying at the beginning/end of encrypt/decrypt.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Acked-by: Noriaki TAKAMIYA <[email protected]>

--
Noriaki TAKAMIYA

2007-10-26 09:13:03

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

>> Thu, 25 Oct 2007 12:48:29 +0100
>> [Subject: [PATCH 5/5] camellia: de-unrolling, 64bit-ization]
>> Denys Vlasenko <[email protected]> wrote...

> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia5.diff
> > Use alternative key setup implementation with mostly 64-bit ops
> > if BITS_PER_LONG >= 64. Both much smaller and much faster.
> >
> > Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
> > Code was similar, with just one additional if() we can use came code.
> >
> > If CONFIG_CC_OPTIMIZE_FOR_SIZE is defined,
> > use loop in camellia_do_en/decrypt instead of unrolled code.
> > ~5% encrypt/decrypt slowdown.
> >
> > Replace (x & 0xff) with (u8)x, gcc is not smart enough to realize
> > that it can do (x & 0xff) this way (which is smaller at least on i386).
> >
> > Don't do (x & 0xff) in a few places where x cannot be > 255 anyway:
> > t0 = il >> 16; v = camellia_sp0222[(t1 >> 8) & 0xff];
> > il16 is u32, (thus t1 >> 8) is one byte!
>
> Signed-off-by: Denys Vlasenko <[email protected]>
> --
> vda
Acked-by: Noriaki TAKAMIYA <[email protected]>

--
Noriaki TAKAMIYA

2007-11-06 14:17:45

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] camellia: cleanup

On Thu, Oct 25, 2007 at 12:45:04PM +0100, Denys Vlasenko wrote:
> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia1.diff:
> > Move code blocks around so that related pieces are closer together:
> > e.g. CAMELLIA_ROUNDSM macro does not need to be separated
> > from the rest of the code by huge array of constants.
> >
> > Remove unused macros (COPY4WORD, SWAP4WORD, XOR4WORD[2])
> >
> > Drop SUBL(), SUBR() macros which only obscure things.
> > Same for CAMELLIA_SP1110() macro and KEY_TABLE_TYPE typedef.
> >
> > Remove useless comments:
> > /* encryption */ -- well it's obvious enough already!
> > void camellia_encrypt128(...)
> >
> > Combine swap with copying at the beginning/end of encrypt/decrypt.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Patch applied. Thanks Denis!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-06 14:19:49

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 2/5] camellia: cleanup

On Thu, Oct 25, 2007 at 12:45:42PM +0100, Denys Vlasenko wrote:
> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia2.diff
> > Rename some macros to shorter names: CAMELLIA_RR8 -> ROR8,
> > making it easier to understand that it is just a right rotation,
> > nothing camellia-specific in it.
> > CAMELLIA_SUBKEY_L() -> SUBKEY_L() - just shorter.
> >
> > Move be32 <-> cpu conversions out of en/decrypt128/256 and into
> > camellia_en/decrypt - no reason to have that code duplicated twice.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Patch applied. BTW, we have a function called ror32 that can
replace all of these ROR* macros.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-06 14:21:11

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 3/5] camellia: cleanup

On Thu, Oct 25, 2007 at 12:46:35PM +0100, Denys Vlasenko wrote:
> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia3.diff
> > Optimize GETU32 to use 4-byte memcpy (modern gcc will convert
> > such memcpy to single move instruction on i386).
> > Original GETU32 did four byte fetches, and shifted/XORed those.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Patch applied.

> +# define GETU32(v, pt) \
> + do { \
> + /* latest breed of gcc is clever enough to use move */ \
> + memcpy(&(v), (pt), 4); \
> + (v) = be32_to_cpu(v); \
> + } while(0)

You can get rid of this memcpy too since camellia declares an
alignmask of 3 which means that the key is 32-bit aligned.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-06 14:21:43

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 4/5] camellia: de-unrolling

On Thu, Oct 25, 2007 at 12:47:16PM +0100, Denys Vlasenko wrote:
> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia4.diff
> > Move huge unrolled pieces of code (3 screenfuls) at the end of
> > 128/256 key setup routines into common camellia_setup_tail(),
> > convert it to loop there.
> > Loop is still unrolled six times, so performance hit is very small,
> > code size win is big.
>
> Signed-off-by: Denys Vlasenko <[email protected]>

Good work! Patch applied.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-06 14:23:52

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Thu, Oct 25, 2007 at 12:48:29PM +0100, Denys Vlasenko wrote:
> On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > Hi Hervert,
> >
> > Please review and maybe propagate upstream following patches.
> >
> > camellia5.diff
> > Use alternative key setup implementation with mostly 64-bit ops
> > if BITS_PER_LONG >= 64. Both much smaller and much faster.
> >
> > Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
> > Code was similar, with just one additional if() we can use came code.
> >
> > If CONFIG_CC_OPTIMIZE_FOR_SIZE is defined,
> > use loop in camellia_do_en/decrypt instead of unrolled code.
> > ~5% encrypt/decrypt slowdown.

Having two versions of the cdoe is unmaintainable. So please
either decide that 5% is worth it or isn't.

The rest of this patch looks fine.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-07 13:23:07

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Tuesday 06 November 2007 14:23, Herbert Xu wrote:
> On Thu, Oct 25, 2007 at 12:48:29PM +0100, Denys Vlasenko wrote:
> > On Thursday 25 October 2007 12:43, Denys Vlasenko wrote:
> > > Hi Hervert,
> > >
> > > Please review and maybe propagate upstream following patches.
> > >
> > > camellia5.diff
> > > Use alternative key setup implementation with mostly 64-bit ops
> > > if BITS_PER_LONG >= 64. Both much smaller and much faster.
> > >
> > > Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
> > > Code was similar, with just one additional if() we can use came code.
> > >
> > > If CONFIG_CC_OPTIMIZE_FOR_SIZE is defined,
> > > use loop in camellia_do_en/decrypt instead of unrolled code.
> > > ~5% encrypt/decrypt slowdown.
>
> Having two versions of the cdoe is unmaintainable. So please
> either decide that 5% is worth it or isn't.

*I* am happy with 5% speed sacrifice. I'm afraid other people won't be.

I just want to escape vicious cycle of -Os people arguing with
-O2 people to no end. I don't want somebody to come later
and unroll the loop again. And then me to come
and de-unroll it again...

It's better for everybody to recognize that both POVs are valid,
and have provisions for tuning size/speed tradeoff by the user
(person which builds the binary).

That's why I made a patch where unrolling can be enabled by CONFIG_xxx.

If you disagree with me and don't want this type of selectability,
the updated patch is attached.

Thanks!
--
vda


Attachments:
(No filename) (1.47 kB)
camellia5.diff (54.09 kB)
Download all attachments

2007-11-08 13:30:28

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Wed, Nov 07, 2007 at 01:22:52PM +0000, Denys Vlasenko wrote:
>
> *I* am happy with 5% speed sacrifice. I'm afraid other people won't be.

I'd like to hear the opinion of the author.

Takamiya-san, what do you think about this change?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-13 06:08:04

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

Hi,

Sorry for late reply

>> Thu, 8 Nov 2007 21:30:20 +0800 $B:"!"(B
>> [Subject: Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization] $B$K$*$$$F!"(B
>> Herbert Xu <[email protected]>$B$5$s$,=q$-$^$7$?(B....

> On Wed, Nov 07, 2007 at 01:22:52PM +0000, Denys Vlasenko wrote:
> >
> > *I* am happy with 5% speed sacrifice. I'm afraid other people won't be.
>
> I'd like to hear the opinion of the author.
>
> Takamiya-san, what do you think about this change?

For IPsec processing, I think performance is important.

If this fix improves the performance, it is acceptable.

But, there are many duplicate decralations between camellia.c and
camellia_64.c...
(e.g., CAMELLIA_MIN_KEY_SIZE and so on)

Regards,

--
Noriaki TAKAMIYA

2007-11-13 06:26:44

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

Hi,

sorry, again.

>> Tue, 13 Nov 2007 15:07:02 +0900 (JST)
>> [Subject: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization]
>> Noriaki TAKAMIYA <[email protected]> wrote...

> > I'd like to hear the opinion of the author.
> >
> > Takamiya-san, what do you think about this change?
>
> For IPsec processing, I think performance is important.
>
> If this fix improves the performance, it is acceptable.

I misunderstood the meaning. If this fix decreases the performance,
I wouldn't prefer this patch(and the below is also one of the
reason).

> But, there are many duplicate decralations between camellia.c and
> camellia_64.c...
> (e.g., CAMELLIA_MIN_KEY_SIZE and so on)

Regards,

--
Noriaki TAKAMIYA

2007-11-13 22:34:42

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Monday 12 November 2007 23:25, Noriaki TAKAMIYA wrote:
> Hi,
>
> sorry, again.
>
> >> Tue, 13 Nov 2007 15:07:02 +0900 (JST)
> >> [Subject: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling,
> >> 64bit-ization] Noriaki TAKAMIYA <[email protected]> wrote...
> >>
> > > I'd like to hear the opinion of the author.
> > >
> > > Takamiya-san, what do you think about this change?
> >
> > For IPsec processing, I think performance is important.
> >
> > If this fix improves the performance, it is acceptable.
>
> I misunderstood the meaning. If this fix decreases the performance,
> I wouldn't prefer this patch(and the below is also one of the
> reason).

My preferred solution is to make loop unrolling conditional on
CONFIG_CC_OPTIMIZE_FOR_SIZE - and this is what is done in my
(first) patch (see attached). This part:

+#ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
+ while (1) {
+ i -= 8;
+ ROUNDS(i);
+ if (i == 0)
+ break;
+ FLS(i);
+ }
+#else
+ if (i == 32) {
+ ROUNDS(24);
+ FLS(24);
+ }
+ ROUNDS(16);
+ FLS(16);
+ ROUNDS(8);
+ FLS(8);
+ ROUNDS(0);
+#endif

Do you agree that this solution does not look too ugly
and would satisfy both "speed" and "size" camps?


For reference, size and speed numbers again:

All times are in microseconds. Two runs give some idea of test variability.
"Setup NN: NNNNNN NNNNNN" - time taken by 100000 key setups (two runs).
"Encrypt: NNNNNN NNNNNN" - time taken by 1000 encryptions of 8K buffer.
"Decrypt: NNNNNN NNNNNN" - time taken by 1000 decryptions of 8K buffer.
"(matches)" - encrypt/decrypt cycle produced non corrupted plaintext.

CONFIG_CC_OPTIMIZE_FOR_SIZE is not set:

$ ./camellia
Setup 16:32779 33169 Encrypt:153582 153740 Decrypt:150985 149811 (matches)
Setup 24:49333 48987 Encrypt:197973 198853 Decrypt:201240 197585 (matches)
Setup 32:46700 47680 Encrypt:195650 195800 Decrypt:195450 195469 (matches)
$ ./camellia5
Setup 16:33417 32968 Encrypt:149195 149095 Decrypt:148593 148661 (matches)
Setup 24:50082 50064 Encrypt:201214 199204 Decrypt:197078 197579 (matches)
Setup 32:48938 48824 Encrypt:200231 199545 Decrypt:198954 198996 (matches)
$ ./camellia_64
Setup 16:22247 22473 Encrypt:152321 149860 Decrypt:149058 148451 (matches)
Setup 24:33832 34017 Encrypt:200428 202969 Decrypt:196789 195524 (matches)
Setup 32:32884 32821 Encrypt:200414 200640 Decrypt:197857 195987 (matches)
$ size camellia.o camellia7.o camellia_64.o
? ?text ? ?data ? ? bss ? ? dec ? ? hex filename
? 24586 ? ? ? 0 ? ? ? 0 ? 24586 ? ?600a camellia.o
? 21714 ? ? ? 0 ? ? ? 0 ? 21714 ? ?54d2 camellia5.o
? 18666 ? ? ? 0 ? ? ? 0 ? 18666 ? ?48ea camellia_64.o

Very small speed loss in camellia -> camellia5, noticeably smaller size.
Big key setup speedup in 64-bit camellia_64, and it is even smaller.

CONFIG_CC_OPTIMIZE_FOR_SIZE is set:

$ ./camellia_Os
Setup 16:32573 34985 Encrypt:151825 152011 Decrypt:147581 147630 (matches)
Setup 24:48528 49250 Encrypt:196223 199056 Decrypt:198811 196394 (matches)
Setup 32:46650 47538 Encrypt:197466 196412 Decrypt:196290 196550 (matches)
$ ./camellia5_Os
Setup 16:33360 34487 Encrypt:154718 154499 Decrypt:157432 157135 (matches)
Setup 24:53969 54304 Encrypt:205184 205818 Decrypt:210675 208552 (matches)
Setup 32:53064 52904 Encrypt:205350 205439 Decrypt:211654 208468 (matches)
$ ./camellia_64_Os
Setup 16:24696 25894 Encrypt:155903 155747 Decrypt:157385 155696 (matches)
Setup 24:33873 33230 Encrypt:206111 206385 Decrypt:208111 207650 (matches)
Setup 32:32799 32325 Encrypt:209715 205973 Decrypt:207578 207644 (matches)
$ size camellia_Os.o camellia7_Os.o camellia_64_Os.o
text data bss dec hex filename
24586 0 0 24586 600a camellia_Os.o
15906 0 0 15906 3e22 camellia5_Os.o
13098 0 0 13098 332a camellia_64_Os.o

~5% speed loss in camellia -> camellia5, much smaller size.
Big key setup speedup in 64-bit camellia_64, and it is even smaller still.
--
vda


Attachments:
(No filename) (3.99 kB)
camellia5.diff (54.42 kB)
Download all attachments

2007-11-14 01:41:05

by David Miller

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

From: Denys Vlasenko <[email protected]>
Date: Tue, 13 Nov 2007 15:34:33 -0700

> My preferred solution is to make loop unrolling conditional on
> CONFIG_CC_OPTIMIZE_FOR_SIZE - and this is what is done in my
> (first) patch (see attached). This part:

The default build is going to be CONFIG_CC_OPTIMIZE_FOR_SIZE
basically for everyone, this is what people get by default
and this is what every distribution uses.

Therefore %99.9999 of folks will get the slowdown.

So in my book this is not an acceptable way to deal with
this problem.

2007-11-14 02:47:16

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Tuesday 13 November 2007 18:41, David Miller wrote:
> From: Denys Vlasenko <[email protected]>
> Date: Tue, 13 Nov 2007 15:34:33 -0700
>
> > My preferred solution is to make loop unrolling conditional on
> > CONFIG_CC_OPTIMIZE_FOR_SIZE - and this is what is done in my
> > (first) patch (see attached). This part:
>
> The default build is going to be CONFIG_CC_OPTIMIZE_FOR_SIZE
> basically for everyone, this is what people get by default
> and this is what every distribution uses.
>
> Therefore %99.9999 of folks will get the slowdown.
>
> So in my book this is not an acceptable way to deal with
> this problem.

Loop unrolling here amounts to 25% code growth:

text data bss dec hex filename
21714 0 0 21714 54d2 camellia5.o
15906 0 0 15906 3e22 camellia5_Os.o

Saving 25% or code size and going 5% slower is perfectly acceptable
tradeof for some users. NB: I'm not saying all, ut some significant
part of users would like to be able to have this choice.

If CONFIG_CC_OPTIMIZE_FOR_SIZE is not an acceptable method,
do you have other ideas?
--
vda

2007-11-14 03:49:53

by David Miller

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

From: Denys Vlasenko <[email protected]>
Date: Tue, 13 Nov 2007 19:47:08 -0700

> If CONFIG_CC_OPTIMIZE_FOR_SIZE is not an acceptable method,
> do you have other ideas?

Look at ways to make the code run faster without loop unrolling?

2007-11-14 04:18:47

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

Hi,

>> Tue, 13 Nov 2007 19:47:08 -0700
>> [Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization]
>> Denys Vlasenko <[email protected]> wrote...

> On Tuesday 13 November 2007 18:41, David Miller wrote:
> > From: Denys Vlasenko <[email protected]>
> > Date: Tue, 13 Nov 2007 15:34:33 -0700
> >
> > > My preferred solution is to make loop unrolling conditional on
> > > CONFIG_CC_OPTIMIZE_FOR_SIZE - and this is what is done in my
> > > (first) patch (see attached). This part:
> >
> > The default build is going to be CONFIG_CC_OPTIMIZE_FOR_SIZE
> > basically for everyone, this is what people get by default
> > and this is what every distribution uses.
> >
> > Therefore %99.9999 of folks will get the slowdown.
> >
> > So in my book this is not an acceptable way to deal with
> > this problem.
>
> Loop unrolling here amounts to 25% code growth:
>
> text data bss dec hex filename
> 21714 0 0 21714 54d2 camellia5.o
> 15906 0 0 15906 3e22 camellia5_Os.o
>
> Saving 25% or code size and going 5% slower is perfectly acceptable
> tradeof for some users. NB: I'm not saying all, ut some significant
> part of users would like to be able to have this choice.

IMHO, if you are going to use camellia on the embedded system, size
of code will be important.

On the other hand, I think typically the CPU performance is
restricted on the embedded system, so the performance of code will
be important...

I'm not sure 5% slow down is important or not. It will depend on the
system.

Regards,

--
Noriaki TAKAMYA

2007-11-14 05:31:00

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Tuesday 13 November 2007 20:49, David Miller wrote:
> From: Denys Vlasenko <[email protected]>
> Date: Tue, 13 Nov 2007 19:47:08 -0700
>
> > If CONFIG_CC_OPTIMIZE_FOR_SIZE is not an acceptable method,
> > do you have other ideas?
>
> Look at ways to make the code run faster without loop unrolling?

I did it. I noticed that key setup is mostly operating on 64-bit
quantities, and provided alternative implementation which
exploits that fact. It's smaller and faster.

However, after I've done that, the question still stands:
should I unroll the loop or not?

The situation we are in now is exactly the sutiation I want to
avoid:

On Wednesday 07 November 2007 06:22, Denys Vlasenko wrote:
> > Having two versions of the cdoe is unmaintainable. So please
> > either decide that 5% is worth it or isn't.
>
> *I* am happy with 5% speed sacrifice. I'm afraid other people won't be.
>
> I just want to escape vicious cycle of -Os people arguing with
> -O2 people to no end. I don't want somebody to come later
> and unroll the loop again. And then me to come
> and de-unroll it again...
>
> It's better for everybody to recognize that both POVs are valid,
> and have provisions for tuning size/speed tradeoff by the user
> (person which builds the binary).

That's why I made a patch where unrolling can be enabled by CONFIG_xxx.

I will resubmit the patch without de-unrolling.
Meanwhile, I'd like to ask you guys to think about ways
to make size/speed tradeoffs selectable at build time.
--
vda

2007-11-14 06:10:42

by David Miller

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

From: Denys Vlasenko <[email protected]>
Date: Tue, 13 Nov 2007 22:30:47 -0700

> On Tuesday 13 November 2007 20:49, David Miller wrote:
> > From: Denys Vlasenko <[email protected]>
> > Date: Tue, 13 Nov 2007 19:47:08 -0700
> >
> > > If CONFIG_CC_OPTIMIZE_FOR_SIZE is not an acceptable method,
> > > do you have other ideas?
> >
> > Look at ways to make the code run faster without loop unrolling?
>
> I did it. I noticed that key setup is mostly operating on 64-bit
> quantities, and provided alternative implementation which
> exploits that fact. It's smaller and faster.

Great, then you don't have to unroll the loop and performance
is at least as good as before _and_ you save code space.

It's perfect, you don't need compile time checks or anything
silly like that.

Please submit this new version :-)

2007-11-14 07:15:42

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Tuesday 13 November 2007 22:30, Denys Vlasenko wrote:
> I will resubmit the patch without de-unrolling.
> Meanwhile, I'd like to ask you guys to think about ways
> to make size/speed tradeoffs selectable at build time.

Here is the patch which has loops still unrolled,
but otherwise unchanged.

Description:

Use alternative key setup implementation with mostly 64-bit ops
if BITS_PER_LONG >= 64. Both much smaller and much faster.

Unify camellia_en/decrypt128/256 into camellia_do_en/decrypt.
Code was similar, with just one additional if() we can use came code.

Replace (x & 0xff) with (u8)x, gcc is not smart enough to realize
that it can do (x & 0xff) this way (which is smaller at least on i386).

Don't do (x & 0xff) in a few places where x cannot be > 255 anyway:
t0 = il >> 16; v = camellia_sp0222[(t1 >> 8) & 0xff];
il16 is u32, (thus t1 >> 8) is one byte!

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (977.00 B)
linux-2.6.23.1.camellia5.diff (54.08 kB)
Download all attachments

2007-11-14 07:38:41

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Tuesday 13 November 2007 23:10, David Miller wrote:
> From: Denys Vlasenko <[email protected]>
> Date: Tue, 13 Nov 2007 22:30:47 -0700
>
> > On Tuesday 13 November 2007 20:49, David Miller wrote:
> > > From: Denys Vlasenko <[email protected]>
> > > Date: Tue, 13 Nov 2007 19:47:08 -0700
> > >
> > > > If CONFIG_CC_OPTIMIZE_FOR_SIZE is not an acceptable method,
> > > > do you have other ideas?
> > >
> > > Look at ways to make the code run faster without loop unrolling?
> >
> > I did it. I noticed that key setup is mostly operating on 64-bit
> > quantities, and provided alternative implementation which
> > exploits that fact. It's smaller and faster.
>
> Great, then you don't have to unroll the loop and performance
> is at least as good as before _and_ you save code space.

Unfortunately, it's applicable only to key setup,
and unrolling happens in actual encryption.

But the point still stands: irrespective of other optimizations,
unrolled and non-unrolled forms will still have different sizes
and speeds, and in some cases (like this one) you can't
pick one form which fits all.

> Please submit this new version :-)

Just did it. It's linux-2.6.23.1.camellia5.diff
--
vda

2007-11-14 14:14:36

by Herbert Xu

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Wed, Nov 14, 2007 at 12:15:19AM -0700, Denys Vlasenko wrote:
>
> Use alternative key setup implementation with mostly 64-bit ops
> if BITS_PER_LONG >= 64. Both much smaller and much faster.

Can we please not have two versions of the same algorithm in C?
They're a pain to maintain and test.

Where performance is paramount you could look at doing an assembly
version. Unlike two C versions at least that can be easily tested
by someone who has access to the platform in question.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-14 21:28:33

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Wednesday 14 November 2007 07:14, Herbert Xu wrote:
> On Wed, Nov 14, 2007 at 12:15:19AM -0700, Denys Vlasenko wrote:
> > Use alternative key setup implementation with mostly 64-bit ops
> > if BITS_PER_LONG >= 64. Both much smaller and much faster.
>
> Can we please not have two versions of the same algorithm in C?
> They're a pain to maintain and test.
>
> Where performance is paramount you could look at doing an assembly
> version. Unlike two C versions at least that can be easily tested
> by someone who has access to the platform in question.

Having two versions, one in C and another in assembly cannot be easier
than two C versions. Moreover, asm version will be arch specific -
one needs to write separate amd64/ppc64/sparc64/etc versions.
It means even more versions to maintain.

It would be faster too, though, and I think it makes sense to do it
for most popular arches sometime in future.

What I have now is a generic 64-bit C implentation which is
likely to be much faster and a bit smaller than 32-bit one
on _all_ 64-bit arches. For i386 it's 33% faster.

I think this win is big enough to justify having two versions.

I think that you are right that having separate camellia_64.c
with substantial duplication is bad. I reworked ot so that
both 32-bit and 64-bit code is now in camellia.c,
and I removed (merged) all duplicated stuff (constants, macros,
and whole encryption/decryption part).

I also split this patch into two parts for easier review:
camellia5:
adds 64-bit key setup
camellia6:
unifies encrypt/decrypt routines for different key lengths.
This reduces module size by ~25%, with tiny (less than 1%)
speed impact.
Also collapses encrypt/decrypt into more readable
(visually shorter) form using macros.

Compiled it on i385 and amd64:

text data bss dec hex filename
29724 224 0 29948 74fc 2.6.23.1.camellia.t/crypto/camellia.o
29233 224 0 29457 7311 2.6.23.1.camellia5.t/crypto/camellia.o
21190 224 0 21414 53a6 2.6.23.1.camellia6.t/crypto/camellia.o

22498 288 0 22786 5902 2.6.23.1.camellia.t64/crypto/camellia.o
21134 288 0 21422 53ae 2.6.23.1.camellia5.t64/crypto/camellia.o
16067 288 0 16355 3fe3 2.6.23.1.camellia6.t64/crypto/camellia.o

Takamiya-san, can you review attached patches please?

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (2.42 kB)
linux-2.6.23.1.camellia5.diff (31.81 kB)
linux-2.6.23.1.camellia6.diff (14.92 kB)
Download all attachments

2007-11-18 13:22:03

by Herbert Xu

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Wed, Nov 14, 2007 at 02:28:25PM -0700, Denys Vlasenko wrote:
>
> I also split this patch into two parts for easier review:
> camellia5:
> adds 64-bit key setup

Sorry but this still duplicates way too much code. Also key
setup is the slow path relatively speaking so it's even less
justifiable.

> camellia6:
> unifies encrypt/decrypt routines for different key lengths.
> This reduces module size by ~25%, with tiny (less than 1%)
> speed impact.
> Also collapses encrypt/decrypt into more readable
> (visually shorter) form using macros.

This looks pretty neat though. I'll merge it unless I hear any
objections.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-19 04:30:26

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

Hi Herbert,

On Sunday 18 November 2007 05:21, Herbert Xu wrote:
> On Wed, Nov 14, 2007 at 02:28:25PM -0700, Denys Vlasenko wrote:
> > I also split this patch into two parts for easier review:
> > camellia5:
> > adds 64-bit key setup
>
> Sorry but this still duplicates way too much code. Also key
> setup is the slow path relatively speaking so it's even less
> justifiable.

Oh, Herbert, have heart, my camellia.c source file is smaller
than the one I started from. It's not like it's twice as big.
It's smaller already.

64-bit key setup is not just faster, it is also smaller
by ~4k, and this benefit is always there, not only when
key setup is performed.

With attached camellia7 patch, I further reduce the size
of key setup routines by reusing a bit of the code
at the end of them. 2 screenfuls of code less.

I hope it makes code duplication a bit more tolerable.

> > camellia6:
> > unifies encrypt/decrypt routines for different key lengths.
> > This reduces module size by ~25%, with tiny (less than 1%)
> > speed impact.
> > Also collapses encrypt/decrypt into more readable
> > (visually shorter) form using macros.

And here is

camellia7:
Move "key XOR is end of F-function" code part into
camellia_setup_tail(), it is sufficiently similar
between camellia_setup128 and camellia_setup256.
This shaves off another ~1k:
dec hex filename
21414 53a6 2.6.23.1.camellia6.t/crypto/camellia.o
20518 5026 2.6.23.1.camellia7.t/crypto/camellia.o
16355 3fe3 2.6.23.1.camellia6.t64/crypto/camellia.o
15813 3dc5 2.6.23.1.camellia7.t64/crypto/camellia.o


At the moment I cannot run test it, try to do it ASAP.

Takamiya-san, can you review attached patch please?

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (1.83 kB)
linux-2.6.23.1.camellia7.diff (21.47 kB)
Download all attachments

2007-11-19 19:01:52

by Noriaki TAKAMIYA

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

Hi,

>> Sun, 18 Nov 2007 20:30:16 -0800
>> [Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization]
>> Denys Vlasenko <[email protected]> wrote...

> > > camellia6:
> > > unifies encrypt/decrypt routines for different key lengths.
> > > This reduces module size by ~25%, with tiny (less than 1%)
> > > speed impact.
> > > Also collapses encrypt/decrypt into more readable
> > > (visually shorter) form using macros.
>
> And here is
>
> camellia7:
> Move "key XOR is end of F-function" code part into
> camellia_setup_tail(), it is sufficiently similar
> between camellia_setup128 and camellia_setup256.
> This shaves off another ~1k:
> dec hex filename
> 21414 53a6 2.6.23.1.camellia6.t/crypto/camellia.o
> 20518 5026 2.6.23.1.camellia7.t/crypto/camellia.o
> 16355 3fe3 2.6.23.1.camellia6.t64/crypto/camellia.o
> 15813 3dc5 2.6.23.1.camellia7.t64/crypto/camellia.o
>
>
> At the moment I cannot run test it, try to do it ASAP.
>
> Takamiya-san, can you review attached patch please?

Sorry for late reply.

I think you're testing now:-), and if speed impact is less than 1%
as you say, I think it is acceptable.

The smaller code size is, the easier to enable camellia in the
embedded systems.

Regards,

Acked-by: Noriaki TAKAMIYA <[email protected]>

--
Noriaki TAKAMIYA

2007-11-21 02:44:46

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

Hi Herbert, Noriaki,

On Monday 19 November 2007 10:49, Noriaki TAKAMIYA wrote:
> > > > camellia6:
> > > > unifies encrypt/decrypt routines for different key lengths.
> > > > This reduces module size by ~25%, with tiny (less than 1%)
> > > > speed impact.
> > > > Also collapses encrypt/decrypt into more readable
> > > > (visually shorter) form using macros.
> >
> > And here is
> >
> > camellia7:
> > Move "key XOR is end of F-function" code part into
> > camellia_setup_tail(), it is sufficiently similar
> > between camellia_setup128 and camellia_setup256.
> > This shaves off another ~1k:
> > dec hex filename
> > 21414 53a6 2.6.23.1.camellia6.t/crypto/camellia.o
> > 20518 5026 2.6.23.1.camellia7.t/crypto/camellia.o
> > 16355 3fe3 2.6.23.1.camellia6.t64/crypto/camellia.o
> > 15813 3dc5 2.6.23.1.camellia7.t64/crypto/camellia.o
> >
> >
> > At the moment I cannot run test it, try to do it ASAP.
> >
> > Takamiya-san, can you review attached patch please?
>
> Sorry for late reply.
>
> I think you're testing now:-), and if speed impact is less than 1%
> as you say, I think it is acceptable.

Actaually I tested it only now. Explored wonders of ubuntu
package management. Why it doesn't rebuild module
when I do "touch camellia.c" and then run their magic stuff?
I suspected that these .deb, .rpm, .whatever are evil,
but it changed today. Now I *know* that. ;)

Back to patches.

It turns out that two more key setup stages can be easily folded into
common "tail". Attached patches do this.

I'm also attaching all previous patches not yet ACKed by Herbert
(patches 5,6,7).

camellia8:
Analogously to camellia7 patch, move
"absorb kw2 to other subkeys" and "absorb kw4 to other subkeys"
code parts into camellia_setup_tail(). This further reduces
source and object code size at the cost of two brances
in key setup code.

Code sizes (starting from the state with pathces 1-4 already applied):
64-bit:
dec hex filename
22786 5902 2.6.23.1.camellia4.t64/crypto/camellia.o
21422 53ae 2.6.23.1.camellia5.t64/crypto/camellia.o
16355 3fe3 2.6.23.1.camellia6.t64/crypto/camellia.o
15813 3dc5 2.6.23.1.camellia7.t64/crypto/camellia.o
15670 3d36 2.6.23.1.camellia8.t64/crypto/camellia.o
32-bit:
29948 74fc 2.6.23.1.camellia4.t/crypto/camellia.o
29457 7311 2.6.23.1.camellia5.t/crypto/camellia.o
21414 53a6 2.6.23.1.camellia6.t/crypto/camellia.o
20518 5026 2.6.23.1.camellia7.t/crypto/camellia.o
18454 4816 2.6.23.1.camellia8.t/crypto/camellia.o

Code is compile-tested for 32/64-bit x86 and run tested on 32-bit
(tcrypt tests all pass).

Herbert, please let me know what you think about them.

Signed-off-by: Denys Vlasenko <[email protected]>
--
vda


Attachments:
(No filename) (2.81 kB)
linux-2.6.23.1.camellia5.diff (31.81 kB)
linux-2.6.23.1.camellia6.diff (14.92 kB)
linux-2.6.23.1.camellia8.diff (13.42 kB)
linux-2.6.23.1.camellia7.diff (21.47 kB)
Download all attachments

2007-11-21 04:50:02

by Herbert Xu

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Sun, Nov 18, 2007 at 08:30:16PM -0800, Denys Vlasenko wrote:
>
> Oh, Herbert, have heart, my camellia.c source file is smaller
> than the one I started from. It's not like it's twice as big.
> It's smaller already.
>
> 64-bit key setup is not just faster, it is also smaller
> by ~4k, and this benefit is always there, not only when
> key setup is performed.

The key setup path is the slow path so I don't see why we can't
just switch to the 64-bit version if it's better.

BTW I tried to apply your -6 patch but it doesn't apply against
cryptodev-2.6 so I had to drop it.

Please make sure that you only send one patch per email and that
they apply against cryptodev-2.6. If the patches have dependencies
then please make that clear as well.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-21 08:09:25

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Tuesday 20 November 2007 19:53, Herbert Xu wrote:
> On Sun, Nov 18, 2007 at 08:30:16PM -0800, Denys Vlasenko wrote:
> > Oh, Herbert, have heart, my camellia.c source file is smaller
> > than the one I started from. It's not like it's twice as big.
> > It's smaller already.
> >
> > 64-bit key setup is not just faster, it is also smaller
> > by ~4k, and this benefit is always there, not only when
> > key setup is performed.
>
> The key setup path is the slow path so I don't see why we can't
> just switch to the 64-bit version if it's better.

Yes, with minor modifications "64-bit" version
can be compiled and will work correctly on 32-bit CPU.
But it will be larger. This is what I got on i386:

text data bss dec hex filename
18230 224 0 18454 4816 t/crypto/camellia.o
20198 224 0 20422 4fc6 t_fake64/crypto/camellia.o

> BTW I tried to apply your -6 patch but it doesn't apply against
> cryptodev-2.6 so I had to drop it.

Will correct this and re-post.
--
vda

2007-11-21 08:12:51

by Herbert Xu

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Wed, Nov 21, 2007 at 12:08:57AM -0800, Denys Vlasenko wrote:
>
> Yes, with minor modifications "64-bit" version
> can be compiled and will work correctly on 32-bit CPU.
> But it will be larger. This is what I got on i386:
>
> text data bss dec hex filename
> 18230 224 0 18454 4816 t/crypto/camellia.o
> 20198 224 0 20422 4fc6 t_fake64/crypto/camellia.o

What are the size differences on x86-64?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-11-21 08:38:31

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [camellia-oss:00952] Re: [PATCH 5/5] camellia: de-unrolling, 64bit-ization

On Wednesday 21 November 2007 00:12, Herbert Xu wrote:
> On Wed, Nov 21, 2007 at 12:08:57AM -0800, Denys Vlasenko wrote:
> > Yes, with minor modifications "64-bit" version
> > can be compiled and will work correctly on 32-bit CPU.
> > But it will be larger. This is what I got on i386:
> >
> > text data bss dec hex filename
> > 18230 224 0 18454 4816 t/crypto/camellia.o
> > 20198 224 0 20422 4fc6 t_fake64/crypto/camellia.o
>
> What are the size differences on x86-64?

The above sizes were: final code (with all patches applied)
built for i386
versus same code with #if BITS_PER_LONG >= 64 replaced by #if 1,
and a few fixes for "integer is too big for long" warnings)

For 64-bit, replacing that #if is a no-op, sizes
will be the same.

If you are asking about 64-bit size comparison *across patches*
5..8, here they are:

64-bit:
dec hex filename
22786 5902 2.6.23.1.camellia4.t64/crypto/camellia.o
21422 53ae 2.6.23.1.camellia5.t64/crypto/camellia.o
16355 3fe3 2.6.23.1.camellia6.t64/crypto/camellia.o
15813 3dc5 2.6.23.1.camellia7.t64/crypto/camellia.o
15670 3d36 2.6.23.1.camellia8.t64/crypto/camellia.o

--
vda