2004-10-02 17:54:14

by Florian Bohrer

Subject: [PATCH] AES x86-64-asm impl.

hi,

this is my first public kernel patch. it is an x86_64 asm optimized version of AES for the
crypto-framework. the patch is against 2.6.9-rc2-mm1 but should work with other
versions too.


the asm-code is from Jari Ruusu (loop-aes).
the org. glue-code is from Fruhwirth Clemens.



--- linux-2.6.9-rc2-mm1/arch/x86_64/crypto/aes-x86_64-asm.S 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc2-mm1-aes/arch/x86_64/crypto/aes-x86_64-asm.S 2004-09-26 23:57:35.936380752 +0200
@@ -0,0 +1,896 @@
+//
+// Copyright (c) 2001, Dr Brian Gladman <[email protected]>, Worcester, UK.
+// All rights reserved.
+//
+// TERMS
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted subject to the following conditions:
+//
+// 1. Redistributions of source code must retain the above copyright
+// notice, this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// 3. The copyright holder's name must not be used to endorse or promote
+// any products derived from this software without his specific prior
+// written permission.
+//
+// This software is provided 'as is' with no express or implied warranties
+// of correctness or fitness for purpose.
+
+// Modified by Jari Ruusu, December 24 2001
+// - Converted syntax to GNU CPP/assembler syntax
+// - C programming interface converted back to "old" API
+// - Minor portability cleanups and speed optimizations
+
+// Modified by Jari Ruusu, April 11 2002
+// - Added above copyright and terms to resulting object code so that
+// binary distributions can avoid legal trouble
+
+// Modified by Jari Ruusu, June 12 2004
+// - Converted 32 bit x86 code to 64 bit AMD64 code
+// - Re-wrote encrypt and decrypt code from scratch
+
+// Modified by Florian Bohrer, September 26 2004
+// - Switched in/out
+
+// An AES (Rijndael) implementation for the AMD64. This version only
+// implements the standard AES block length (128 bits, 16 bytes). This code
+// does not preserve the rax, rcx, rdx, rsi, rdi or r8-r11 registers or the
+// arithmetic status flags. However, the rbx, rbp and r12-r15 registers are
+// preserved across calls.
+
+// void aes_set_key(aes_context *cx, const unsigned char key[], const int key_len, const int f)
+// void aes_encrypt(const aes_context *cx, unsigned char out_blk[], const unsigned char in_blk[])
+// void aes_decrypt(const aes_context *cx, unsigned char out_blk[], const unsigned char in_blk[])
+
+#if defined(USE_UNDERLINE)
+# define aes_set_key _aes_set_key
+# define aes_encrypt _aes_encrypt
+# define aes_decrypt _aes_decrypt
+#endif
+#if !defined(ALIGN64BYTES)
+# define ALIGN64BYTES 64
+#endif
+
+ .file "aes-x86_64-asm.S"
+ .globl aes_set_key
+ .globl aes_encrypt
+ .globl aes_decrypt
+
+ .section .rodata
+copyright:
+ .ascii " \000"
+ .ascii "Copyright (c) 2001, Dr Brian Gladman <[email protected]>, Worcester, UK.\000"
+ .ascii "All rights reserved.\000"
+ .ascii " \000"
+ .ascii "TERMS\000"
+ .ascii " \000"
+ .ascii " Redistribution and use in source and binary forms, with or without\000"
+ .ascii " modification, are permitted subject to the following conditions:\000"
+ .ascii " \000"
+ .ascii " 1. Redistributions of source code must retain the above copyright\000"
+ .ascii " notice, this list of conditions and the following disclaimer.\000"
+ .ascii " \000"
+ .ascii " 2. Redistributions in binary form must reproduce the above copyright\000"
+ .ascii " notice, this list of conditions and the following disclaimer in the\000"
+ .ascii " documentation and/or other materials provided with the distribution.\000"
+ .ascii " \000"
+ .ascii " 3. The copyright holder's name must not be used to endorse or promote\000"
+ .ascii " any products derived from this software without his specific prior\000"
+ .ascii " written permission.\000"
+ .ascii " \000"
+ .ascii " This software is provided 'as is' with no express or implied warranties\000"
+ .ascii " of correctness or fitness for purpose.\000"
+ .ascii " \000"
+
+#define tlen 1024 // length of each of 4 'xor' arrays (256 32-bit words)
+
+// offsets in context structure
+
+#define nkey 0 // key length, size 4
+#define nrnd 4 // number of rounds, size 4
+#define ekey 8 // encryption key schedule base address, size 256
+#define dkey 264 // decryption key schedule base address, size 256
+
+// This macro performs a forward encryption cycle. It is entered with
+// the previous round's column values in I1E, I2E, I3E and I4E and
+// exits with the final values in the OU1, OU2, OU3 and OU4 registers.
+
+#define fwd_rnd(p1,p2,I1E,I1B,I1H,I2E,I2B,I2H,I3E,I3B,I3R,I4E,I4B,I4R,OU1,OU2,OU3,OU4) \
+ movl p2(%rbp),OU1 ;\
+ movl p2+4(%rbp),OU2 ;\
+ movl p2+8(%rbp),OU3 ;\
+ movl p2+12(%rbp),OU4 ;\
+ movzbl I1B,%edi ;\
+ movzbl I2B,%esi ;\
+ movzbl I3B,%r8d ;\
+ movzbl I4B,%r13d ;\
+ shrl $8,I3E ;\
+ shrl $8,I4E ;\
+ xorl p1(,%rdi,4),OU1 ;\
+ xorl p1(,%rsi,4),OU2 ;\
+ xorl p1(,%r8,4),OU3 ;\
+ xorl p1(,%r13,4),OU4 ;\
+ movzbl I2H,%esi ;\
+ movzbl I3B,%r8d ;\
+ movzbl I4B,%r13d ;\
+ movzbl I1H,%edi ;\
+ shrl $8,I3E ;\
+ shrl $8,I4E ;\
+ xorl p1+tlen(,%rsi,4),OU1 ;\
+ xorl p1+tlen(,%r8,4),OU2 ;\
+ xorl p1+tlen(,%r13,4),OU3 ;\
+ xorl p1+tlen(,%rdi,4),OU4 ;\
+ shrl $16,I1E ;\
+ shrl $16,I2E ;\
+ movzbl I3B,%r8d ;\
+ movzbl I4B,%r13d ;\
+ movzbl I1B,%edi ;\
+ movzbl I2B,%esi ;\
+ xorl p1+2*tlen(,%r8,4),OU1 ;\
+ xorl p1+2*tlen(,%r13,4),OU2 ;\
+ xorl p1+2*tlen(,%rdi,4),OU3 ;\
+ xorl p1+2*tlen(,%rsi,4),OU4 ;\
+ shrl $8,I4E ;\
+ movzbl I1H,%edi ;\
+ movzbl I2H,%esi ;\
+ shrl $8,I3E ;\
+ xorl p1+3*tlen(,I4R,4),OU1 ;\
+ xorl p1+3*tlen(,%rdi,4),OU2 ;\
+ xorl p1+3*tlen(,%rsi,4),OU3 ;\
+ xorl p1+3*tlen(,I3R,4),OU4
+
+// This macro performs an inverse encryption cycle. It is entered with
+// the previous round's column values in I1E, I2E, I3E and I4E and
+// exits with the final values in the OU1, OU2, OU3 and OU4 registers.
+
+#define inv_rnd(p1,p2,I1E,I1B,I1R,I2E,I2B,I2R,I3E,I3B,I3H,I4E,I4B,I4H,OU1,OU2,OU3,OU4) \
+ movl p2+12(%rbp),OU4 ;\
+ movl p2+8(%rbp),OU3 ;\
+ movl p2+4(%rbp),OU2 ;\
+ movl p2(%rbp),OU1 ;\
+ movzbl I4B,%edi ;\
+ movzbl I3B,%esi ;\
+ movzbl I2B,%r8d ;\
+ movzbl I1B,%r13d ;\
+ shrl $8,I2E ;\
+ shrl $8,I1E ;\
+ xorl p1(,%rdi,4),OU4 ;\
+ xorl p1(,%rsi,4),OU3 ;\
+ xorl p1(,%r8,4),OU2 ;\
+ xorl p1(,%r13,4),OU1 ;\
+ movzbl I3H,%esi ;\
+ movzbl I2B,%r8d ;\
+ movzbl I1B,%r13d ;\
+ movzbl I4H,%edi ;\
+ shrl $8,I2E ;\
+ shrl $8,I1E ;\
+ xorl p1+tlen(,%rsi,4),OU4 ;\
+ xorl p1+tlen(,%r8,4),OU3 ;\
+ xorl p1+tlen(,%r13,4),OU2 ;\
+ xorl p1+tlen(,%rdi,4),OU1 ;\
+ shrl $16,I4E ;\
+ shrl $16,I3E ;\
+ movzbl I2B,%r8d ;\
+ movzbl I1B,%r13d ;\
+ movzbl I4B,%edi ;\
+ movzbl I3B,%esi ;\
+ xorl p1+2*tlen(,%r8,4),OU4 ;\
+ xorl p1+2*tlen(,%r13,4),OU3 ;\
+ xorl p1+2*tlen(,%rdi,4),OU2 ;\
+ xorl p1+2*tlen(,%rsi,4),OU1 ;\
+ shrl $8,I1E ;\
+ movzbl I4H,%edi ;\
+ movzbl I3H,%esi ;\
+ shrl $8,I2E ;\
+ xorl p1+3*tlen(,I1R,4),OU4 ;\
+ xorl p1+3*tlen(,%rdi,4),OU3 ;\
+ xorl p1+3*tlen(,%rsi,4),OU2 ;\
+ xorl p1+3*tlen(,I2R,4),OU1
+
+// AES (Rijndael) Encryption Subroutine
+
+// rdi = pointer to AES context
+// rsi = pointer to output ciphertext bytes
+// rdx = pointer to input plaintext bytes
+
+ .text
+ .align ALIGN64BYTES
+aes_encrypt:
+ movl (%rdx),%eax // read in plaintext
+ movl 4(%rdx),%ecx
+ movl 8(%rdx),%r10d
+ movl 12(%rdx),%r11d
+
+ pushq %rbp
+ leaq ekey+16(%rdi),%rbp // encryption key pointer
+ movq %rsi,%r9 // pointer to out block
+ movl nrnd(%rdi),%edx // number of rounds
+ pushq %rbx
+ pushq %r13
+ pushq %r14
+ pushq %r15
+
+ xorl -16(%rbp),%eax // xor in first round key
+ xorl -12(%rbp),%ecx
+ xorl -8(%rbp),%r10d
+ xorl -4(%rbp),%r11d
+
+ subl $10,%edx
+ je aes_15
+ addq $32,%rbp
+ subl $2,%edx
+ je aes_13
+ addq $32,%rbp
+
+ fwd_rnd(aes_ft_tab,-64,%eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_ft_tab,-48,%ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+ jmp aes_13
+ .align ALIGN64BYTES
+aes_13: fwd_rnd(aes_ft_tab,-32,%eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_ft_tab,-16,%ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+ jmp aes_15
+ .align ALIGN64BYTES
+aes_15: fwd_rnd(aes_ft_tab,0, %eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_ft_tab,16, %ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+ fwd_rnd(aes_ft_tab,32, %eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_ft_tab,48, %ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+ fwd_rnd(aes_ft_tab,64, %eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_ft_tab,80, %ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+ fwd_rnd(aes_ft_tab,96, %eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_ft_tab,112,%ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+ fwd_rnd(aes_ft_tab,128,%eax,%al,%ah,%ecx,%cl,%ch,%r10d,%r10b,%r10,%r11d,%r11b,%r11,%ebx,%edx,%r14d,%r15d)
+ fwd_rnd(aes_fl_tab,144,%ebx,%bl,%bh,%edx,%dl,%dh,%r14d,%r14b,%r14,%r15d,%r15b,%r15,%eax,%ecx,%r10d,%r11d)
+
+ popq %r15
+ popq %r14
+ popq %r13
+ popq %rbx
+ popq %rbp
+
+ movl %eax,(%r9) // move final values to the output array.
+ movl %ecx,4(%r9)
+ movl %r10d,8(%r9)
+ movl %r11d,12(%r9)
+ ret
+
+// AES (Rijndael) Decryption Subroutine
+
+// rdi = pointer to AES context
+// rsi = pointer to output plaintext bytes
+// rdx = pointer to input ciphertext bytes
+
+ .align ALIGN64BYTES
+aes_decrypt:
+ movl 12(%rdx),%eax // read in ciphertext
+ movl 8(%rdx),%ecx
+ movl 4(%rdx),%r10d
+ movl (%rdx),%r11d
+
+ pushq %rbp
+ leaq dkey+16(%rdi),%rbp // decryption key pointer
+ movq %rsi,%r9 // pointer to out block
+ movl nrnd(%rdi),%edx // number of rounds
+ pushq %rbx
+ pushq %r13
+ pushq %r14
+ pushq %r15
+
+ xorl -4(%rbp),%eax // xor in first round key
+ xorl -8(%rbp),%ecx
+ xorl -12(%rbp),%r10d
+ xorl -16(%rbp),%r11d
+
+ subl $10,%edx
+ je aes_25
+ addq $32,%rbp
+ subl $2,%edx
+ je aes_23
+ addq $32,%rbp
+
+ inv_rnd(aes_it_tab,-64,%r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_it_tab,-48,%r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+ jmp aes_23
+ .align ALIGN64BYTES
+aes_23: inv_rnd(aes_it_tab,-32,%r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_it_tab,-16,%r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+ jmp aes_25
+ .align ALIGN64BYTES
+aes_25: inv_rnd(aes_it_tab,0, %r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_it_tab,16, %r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+ inv_rnd(aes_it_tab,32, %r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_it_tab,48, %r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+ inv_rnd(aes_it_tab,64, %r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_it_tab,80, %r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+ inv_rnd(aes_it_tab,96, %r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_it_tab,112,%r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+ inv_rnd(aes_it_tab,128,%r11d,%r11b,%r11,%r10d,%r10b,%r10,%ecx,%cl,%ch,%eax,%al,%ah,%r15d,%r14d,%edx,%ebx)
+ inv_rnd(aes_il_tab,144,%r15d,%r15b,%r15,%r14d,%r14b,%r14,%edx,%dl,%dh,%ebx,%bl,%bh,%r11d,%r10d,%ecx,%eax)
+
+ popq %r15
+ popq %r14
+ popq %r13
+ popq %rbx
+ popq %rbp
+
+ movl %eax,12(%r9) // move final values to the output array.
+ movl %ecx,8(%r9)
+ movl %r10d,4(%r9)
+ movl %r11d,(%r9)
+ ret
+
+// AES (Rijndael) Key Schedule Subroutine
+
+// This macro performs a column mixing operation on an input 32-bit
+// word to give a 32-bit result. It uses each of the 4 bytes in
+// the input column to index 4 different tables of 256 32-bit words
+// that are xored together to form the output value.
+
+#define mix_col(p1) \
+ movzbl %bl,%ecx ;\
+ movl p1(,%rcx,4),%eax ;\
+ movzbl %bh,%ecx ;\
+ ror $16,%ebx ;\
+ xorl p1+tlen(,%rcx,4),%eax ;\
+ movzbl %bl,%ecx ;\
+ xorl p1+2*tlen(,%rcx,4),%eax ;\
+ movzbl %bh,%ecx ;\
+ xorl p1+3*tlen(,%rcx,4),%eax
+
+// Key Schedule Macros
+
+#define ksc4(p1) \
+ rol $24,%ebx ;\
+ mix_col(aes_fl_tab) ;\
+ ror $8,%ebx ;\
+ xorl 4*p1+aes_rcon_tab,%eax ;\
+ xorl %eax,%esi ;\
+ xorl %esi,%ebp ;\
+ movl %esi,16*p1(%rdi) ;\
+ movl %ebp,16*p1+4(%rdi) ;\
+ xorl %ebp,%edx ;\
+ xorl %edx,%ebx ;\
+ movl %edx,16*p1+8(%rdi) ;\
+ movl %ebx,16*p1+12(%rdi)
+
+#define ksc6(p1) \
+ rol $24,%ebx ;\
+ mix_col(aes_fl_tab) ;\
+ ror $8,%ebx ;\
+ xorl 4*p1+aes_rcon_tab,%eax ;\
+ xorl 24*p1-24(%rdi),%eax ;\
+ movl %eax,24*p1(%rdi) ;\
+ xorl 24*p1-20(%rdi),%eax ;\
+ movl %eax,24*p1+4(%rdi) ;\
+ xorl %eax,%esi ;\
+ xorl %esi,%ebp ;\
+ movl %esi,24*p1+8(%rdi) ;\
+ movl %ebp,24*p1+12(%rdi) ;\
+ xorl %ebp,%edx ;\
+ xorl %edx,%ebx ;\
+ movl %edx,24*p1+16(%rdi) ;\
+ movl %ebx,24*p1+20(%rdi)
+
+#define ksc8(p1) \
+ rol $24,%ebx ;\
+ mix_col(aes_fl_tab) ;\
+ ror $8,%ebx ;\
+ xorl 4*p1+aes_rcon_tab,%eax ;\
+ xorl 32*p1-32(%rdi),%eax ;\
+ movl %eax,32*p1(%rdi) ;\
+ xorl 32*p1-28(%rdi),%eax ;\
+ movl %eax,32*p1+4(%rdi) ;\
+ xorl 32*p1-24(%rdi),%eax ;\
+ movl %eax,32*p1+8(%rdi) ;\
+ xorl 32*p1-20(%rdi),%eax ;\
+ movl %eax,32*p1+12(%rdi) ;\
+ pushq %rbx ;\
+ movl %eax,%ebx ;\
+ mix_col(aes_fl_tab) ;\
+ popq %rbx ;\
+ xorl %eax,%esi ;\
+ xorl %esi,%ebp ;\
+ movl %esi,32*p1+16(%rdi) ;\
+ movl %ebp,32*p1+20(%rdi) ;\
+ xorl %ebp,%edx ;\
+ xorl %edx,%ebx ;\
+ movl %edx,32*p1+24(%rdi) ;\
+ movl %ebx,32*p1+28(%rdi)
+
+// rdi = pointer to AES context
+// rsi = pointer to key bytes
+// rdx = key length, bytes or bits
+// rcx = ed_flag, 1=encrypt only, 0=both encrypt and decrypt
+
+ .align ALIGN64BYTES
+aes_set_key:
+ pushfq
+ pushq %rbp
+ pushq %rbx
+
+ movq %rcx,%r11 // ed_flg
+ movq %rdx,%rcx // key length
+ movq %rdi,%r10 // AES context
+
+ cmpl $128,%ecx
+ jb aes_30
+ shrl $3,%ecx
+aes_30: cmpl $32,%ecx
+ je aes_32
+ cmpl $24,%ecx
+ je aes_32
+ movl $16,%ecx
+aes_32: shrl $2,%ecx
+ movl %ecx,nkey(%r10)
+ leaq 6(%rcx),%rax // 10/12/14 for 4/6/8 32-bit key length
+ movl %eax,nrnd(%r10)
+ leaq ekey(%r10),%rdi // key position in AES context
+ cld
+ movl %ecx,%eax // save key length in eax
+ rep ; movsl // words in the key schedule
+ movl -4(%rsi),%ebx // put some values in registers
+ movl -8(%rsi),%edx // to allow faster code
+ movl -12(%rsi),%ebp
+ movl -16(%rsi),%esi
+
+ cmpl $4,%eax // jump on key size
+ je aes_36
+ cmpl $6,%eax
+ je aes_35
+
+ ksc8(0)
+ ksc8(1)
+ ksc8(2)
+ ksc8(3)
+ ksc8(4)
+ ksc8(5)
+ ksc8(6)
+ jmp aes_37
+aes_35: ksc6(0)
+ ksc6(1)
+ ksc6(2)
+ ksc6(3)
+ ksc6(4)
+ ksc6(5)
+ ksc6(6)
+ ksc6(7)
+ jmp aes_37
+aes_36: ksc4(0)
+ ksc4(1)
+ ksc4(2)
+ ksc4(3)
+ ksc4(4)
+ ksc4(5)
+ ksc4(6)
+ ksc4(7)
+ ksc4(8)
+ ksc4(9)
+aes_37: cmpl $0,%r11d // ed_flg
+ jne aes_39
+
+// compile decryption key schedule from encryption schedule - reverse
+// order and do mix_column operation on round keys except first and last
+
+ movl nrnd(%r10),%eax // kt = cx->d_key + nc * cx->Nrnd
+ shl $2,%rax
+ leaq dkey(%r10,%rax,4),%rdi
+ leaq ekey(%r10),%rsi // kf = cx->e_key
+
+ movsq // copy first round key (unmodified)
+ movsq
+ subq $32,%rdi
+ movl $1,%r9d
+aes_38: // do mix column on each column of
+ lodsl // each round key
+ movl %eax,%ebx
+ mix_col(aes_im_tab)
+ stosl
+ lodsl
+ movl %eax,%ebx
+ mix_col(aes_im_tab)
+ stosl
+ lodsl
+ movl %eax,%ebx
+ mix_col(aes_im_tab)
+ stosl
+ lodsl
+ movl %eax,%ebx
+ mix_col(aes_im_tab)
+ stosl
+ subq $32,%rdi
+
+ incl %r9d
+ cmpl nrnd(%r10),%r9d
+ jb aes_38
+
+ movsq // copy last round key (unmodified)
+ movsq
+aes_39: popq %rbx
+ popq %rbp
+ popfq
+ ret
+
+
+// finite field multiplies by {02}, {04} and {08}
+
+#define f2(x) ((x<<1)^(((x>>7)&1)*0x11b))
+#define f4(x) ((x<<2)^(((x>>6)&1)*0x11b)^(((x>>6)&2)*0x11b))
+#define f8(x) ((x<<3)^(((x>>5)&1)*0x11b)^(((x>>5)&2)*0x11b)^(((x>>5)&4)*0x11b))
+
+// finite field multiplies required in table generation
+
+#define f3(x) (f2(x) ^ x)
+#define f9(x) (f8(x) ^ x)
+#define fb(x) (f8(x) ^ f2(x) ^ x)
+#define fd(x) (f8(x) ^ f4(x) ^ x)
+#define fe(x) (f8(x) ^ f4(x) ^ f2(x))
+
+// These defines generate the forward table entries
+
+#define u0(x) ((f3(x) << 24) | (x << 16) | (x << 8) | f2(x))
+#define u1(x) ((x << 24) | (x << 16) | (f2(x) << 8) | f3(x))
+#define u2(x) ((x << 24) | (f2(x) << 16) | (f3(x) << 8) | x)
+#define u3(x) ((f2(x) << 24) | (f3(x) << 16) | (x << 8) | x)
+
+// These defines generate the inverse table entries
+
+#define v0(x) ((fb(x) << 24) | (fd(x) << 16) | (f9(x) << 8) | fe(x))
+#define v1(x) ((fd(x) << 24) | (f9(x) << 16) | (fe(x) << 8) | fb(x))
+#define v2(x) ((f9(x) << 24) | (fe(x) << 16) | (fb(x) << 8) | fd(x))
+#define v3(x) ((fe(x) << 24) | (fb(x) << 16) | (fd(x) << 8) | f9(x))
+
+// These defines generate entries for the last round tables
+
+#define w0(x) (x)
+#define w1(x) (x << 8)
+#define w2(x) (x << 16)
+#define w3(x) (x << 24)
+
+// macro to generate inverse mix column tables (needed for the key schedule)
+
+#define im_data0(p1) \
+ .long p1(0x00),p1(0x01),p1(0x02),p1(0x03),p1(0x04),p1(0x05),p1(0x06),p1(0x07) ;\
+ .long p1(0x08),p1(0x09),p1(0x0a),p1(0x0b),p1(0x0c),p1(0x0d),p1(0x0e),p1(0x0f) ;\
+ .long p1(0x10),p1(0x11),p1(0x12),p1(0x13),p1(0x14),p1(0x15),p1(0x16),p1(0x17) ;\
+ .long p1(0x18),p1(0x19),p1(0x1a),p1(0x1b),p1(0x1c),p1(0x1d),p1(0x1e),p1(0x1f)
+#define im_data1(p1) \
+ .long p1(0x20),p1(0x21),p1(0x22),p1(0x23),p1(0x24),p1(0x25),p1(0x26),p1(0x27) ;\
+ .long p1(0x28),p1(0x29),p1(0x2a),p1(0x2b),p1(0x2c),p1(0x2d),p1(0x2e),p1(0x2f) ;\
+ .long p1(0x30),p1(0x31),p1(0x32),p1(0x33),p1(0x34),p1(0x35),p1(0x36),p1(0x37) ;\
+ .long p1(0x38),p1(0x39),p1(0x3a),p1(0x3b),p1(0x3c),p1(0x3d),p1(0x3e),p1(0x3f)
+#define im_data2(p1) \
+ .long p1(0x40),p1(0x41),p1(0x42),p1(0x43),p1(0x44),p1(0x45),p1(0x46),p1(0x47) ;\
+ .long p1(0x48),p1(0x49),p1(0x4a),p1(0x4b),p1(0x4c),p1(0x4d),p1(0x4e),p1(0x4f) ;\
+ .long p1(0x50),p1(0x51),p1(0x52),p1(0x53),p1(0x54),p1(0x55),p1(0x56),p1(0x57) ;\
+ .long p1(0x58),p1(0x59),p1(0x5a),p1(0x5b),p1(0x5c),p1(0x5d),p1(0x5e),p1(0x5f)
+#define im_data3(p1) \
+ .long p1(0x60),p1(0x61),p1(0x62),p1(0x63),p1(0x64),p1(0x65),p1(0x66),p1(0x67) ;\
+ .long p1(0x68),p1(0x69),p1(0x6a),p1(0x6b),p1(0x6c),p1(0x6d),p1(0x6e),p1(0x6f) ;\
+ .long p1(0x70),p1(0x71),p1(0x72),p1(0x73),p1(0x74),p1(0x75),p1(0x76),p1(0x77) ;\
+ .long p1(0x78),p1(0x79),p1(0x7a),p1(0x7b),p1(0x7c),p1(0x7d),p1(0x7e),p1(0x7f)
+#define im_data4(p1) \
+ .long p1(0x80),p1(0x81),p1(0x82),p1(0x83),p1(0x84),p1(0x85),p1(0x86),p1(0x87) ;\
+ .long p1(0x88),p1(0x89),p1(0x8a),p1(0x8b),p1(0x8c),p1(0x8d),p1(0x8e),p1(0x8f) ;\
+ .long p1(0x90),p1(0x91),p1(0x92),p1(0x93),p1(0x94),p1(0x95),p1(0x96),p1(0x97) ;\
+ .long p1(0x98),p1(0x99),p1(0x9a),p1(0x9b),p1(0x9c),p1(0x9d),p1(0x9e),p1(0x9f)
+#define im_data5(p1) \
+ .long p1(0xa0),p1(0xa1),p1(0xa2),p1(0xa3),p1(0xa4),p1(0xa5),p1(0xa6),p1(0xa7) ;\
+ .long p1(0xa8),p1(0xa9),p1(0xaa),p1(0xab),p1(0xac),p1(0xad),p1(0xae),p1(0xaf) ;\
+ .long p1(0xb0),p1(0xb1),p1(0xb2),p1(0xb3),p1(0xb4),p1(0xb5),p1(0xb6),p1(0xb7) ;\
+ .long p1(0xb8),p1(0xb9),p1(0xba),p1(0xbb),p1(0xbc),p1(0xbd),p1(0xbe),p1(0xbf)
+#define im_data6(p1) \
+ .long p1(0xc0),p1(0xc1),p1(0xc2),p1(0xc3),p1(0xc4),p1(0xc5),p1(0xc6),p1(0xc7) ;\
+ .long p1(0xc8),p1(0xc9),p1(0xca),p1(0xcb),p1(0xcc),p1(0xcd),p1(0xce),p1(0xcf) ;\
+ .long p1(0xd0),p1(0xd1),p1(0xd2),p1(0xd3),p1(0xd4),p1(0xd5),p1(0xd6),p1(0xd7) ;\
+ .long p1(0xd8),p1(0xd9),p1(0xda),p1(0xdb),p1(0xdc),p1(0xdd),p1(0xde),p1(0xdf)
+#define im_data7(p1) \
+ .long p1(0xe0),p1(0xe1),p1(0xe2),p1(0xe3),p1(0xe4),p1(0xe5),p1(0xe6),p1(0xe7) ;\
+ .long p1(0xe8),p1(0xe9),p1(0xea),p1(0xeb),p1(0xec),p1(0xed),p1(0xee),p1(0xef) ;\
+ .long p1(0xf0),p1(0xf1),p1(0xf2),p1(0xf3),p1(0xf4),p1(0xf5),p1(0xf6),p1(0xf7) ;\
+ .long p1(0xf8),p1(0xf9),p1(0xfa),p1(0xfb),p1(0xfc),p1(0xfd),p1(0xfe),p1(0xff)
+
+// S-box data - 256 entries
+
+#define sb_data0(p1) \
+ .long p1(0x63),p1(0x7c),p1(0x77),p1(0x7b),p1(0xf2),p1(0x6b),p1(0x6f),p1(0xc5) ;\
+ .long p1(0x30),p1(0x01),p1(0x67),p1(0x2b),p1(0xfe),p1(0xd7),p1(0xab),p1(0x76) ;\
+ .long p1(0xca),p1(0x82),p1(0xc9),p1(0x7d),p1(0xfa),p1(0x59),p1(0x47),p1(0xf0) ;\
+ .long p1(0xad),p1(0xd4),p1(0xa2),p1(0xaf),p1(0x9c),p1(0xa4),p1(0x72),p1(0xc0)
+#define sb_data1(p1) \
+ .long p1(0xb7),p1(0xfd),p1(0x93),p1(0x26),p1(0x36),p1(0x3f),p1(0xf7),p1(0xcc) ;\
+ .long p1(0x34),p1(0xa5),p1(0xe5),p1(0xf1),p1(0x71),p1(0xd8),p1(0x31),p1(0x15) ;\
+ .long p1(0x04),p1(0xc7),p1(0x23),p1(0xc3),p1(0x18),p1(0x96),p1(0x05),p1(0x9a) ;\
+ .long p1(0x07),p1(0x12),p1(0x80),p1(0xe2),p1(0xeb),p1(0x27),p1(0xb2),p1(0x75)
+#define sb_data2(p1) \
+ .long p1(0x09),p1(0x83),p1(0x2c),p1(0x1a),p1(0x1b),p1(0x6e),p1(0x5a),p1(0xa0) ;\
+ .long p1(0x52),p1(0x3b),p1(0xd6),p1(0xb3),p1(0x29),p1(0xe3),p1(0x2f),p1(0x84) ;\
+ .long p1(0x53),p1(0xd1),p1(0x00),p1(0xed),p1(0x20),p1(0xfc),p1(0xb1),p1(0x5b) ;\
+ .long p1(0x6a),p1(0xcb),p1(0xbe),p1(0x39),p1(0x4a),p1(0x4c),p1(0x58),p1(0xcf)
+#define sb_data3(p1) \
+ .long p1(0xd0),p1(0xef),p1(0xaa),p1(0xfb),p1(0x43),p1(0x4d),p1(0x33),p1(0x85) ;\
+ .long p1(0x45),p1(0xf9),p1(0x02),p1(0x7f),p1(0x50),p1(0x3c),p1(0x9f),p1(0xa8) ;\
+ .long p1(0x51),p1(0xa3),p1(0x40),p1(0x8f),p1(0x92),p1(0x9d),p1(0x38),p1(0xf5) ;\
+ .long p1(0xbc),p1(0xb6),p1(0xda),p1(0x21),p1(0x10),p1(0xff),p1(0xf3),p1(0xd2)
+#define sb_data4(p1) \
+ .long p1(0xcd),p1(0x0c),p1(0x13),p1(0xec),p1(0x5f),p1(0x97),p1(0x44),p1(0x17) ;\
+ .long p1(0xc4),p1(0xa7),p1(0x7e),p1(0x3d),p1(0x64),p1(0x5d),p1(0x19),p1(0x73) ;\
+ .long p1(0x60),p1(0x81),p1(0x4f),p1(0xdc),p1(0x22),p1(0x2a),p1(0x90),p1(0x88) ;\
+ .long p1(0x46),p1(0xee),p1(0xb8),p1(0x14),p1(0xde),p1(0x5e),p1(0x0b),p1(0xdb)
+#define sb_data5(p1) \
+ .long p1(0xe0),p1(0x32),p1(0x3a),p1(0x0a),p1(0x49),p1(0x06),p1(0x24),p1(0x5c) ;\
+ .long p1(0xc2),p1(0xd3),p1(0xac),p1(0x62),p1(0x91),p1(0x95),p1(0xe4),p1(0x79) ;\
+ .long p1(0xe7),p1(0xc8),p1(0x37),p1(0x6d),p1(0x8d),p1(0xd5),p1(0x4e),p1(0xa9) ;\
+ .long p1(0x6c),p1(0x56),p1(0xf4),p1(0xea),p1(0x65),p1(0x7a),p1(0xae),p1(0x08)
+#define sb_data6(p1) \
+ .long p1(0xba),p1(0x78),p1(0x25),p1(0x2e),p1(0x1c),p1(0xa6),p1(0xb4),p1(0xc6) ;\
+ .long p1(0xe8),p1(0xdd),p1(0x74),p1(0x1f),p1(0x4b),p1(0xbd),p1(0x8b),p1(0x8a) ;\
+ .long p1(0x70),p1(0x3e),p1(0xb5),p1(0x66),p1(0x48),p1(0x03),p1(0xf6),p1(0x0e) ;\
+ .long p1(0x61),p1(0x35),p1(0x57),p1(0xb9),p1(0x86),p1(0xc1),p1(0x1d),p1(0x9e)
+#define sb_data7(p1) \
+ .long p1(0xe1),p1(0xf8),p1(0x98),p1(0x11),p1(0x69),p1(0xd9),p1(0x8e),p1(0x94) ;\
+ .long p1(0x9b),p1(0x1e),p1(0x87),p1(0xe9),p1(0xce),p1(0x55),p1(0x28),p1(0xdf) ;\
+ .long p1(0x8c),p1(0xa1),p1(0x89),p1(0x0d),p1(0xbf),p1(0xe6),p1(0x42),p1(0x68) ;\
+ .long p1(0x41),p1(0x99),p1(0x2d),p1(0x0f),p1(0xb0),p1(0x54),p1(0xbb),p1(0x16)
+
+// Inverse S-box data - 256 entries
+
+#define ib_data0(p1) \
+ .long p1(0x52),p1(0x09),p1(0x6a),p1(0xd5),p1(0x30),p1(0x36),p1(0xa5),p1(0x38) ;\
+ .long p1(0xbf),p1(0x40),p1(0xa3),p1(0x9e),p1(0x81),p1(0xf3),p1(0xd7),p1(0xfb) ;\
+ .long p1(0x7c),p1(0xe3),p1(0x39),p1(0x82),p1(0x9b),p1(0x2f),p1(0xff),p1(0x87) ;\
+ .long p1(0x34),p1(0x8e),p1(0x43),p1(0x44),p1(0xc4),p1(0xde),p1(0xe9),p1(0xcb)
+#define ib_data1(p1) \
+ .long p1(0x54),p1(0x7b),p1(0x94),p1(0x32),p1(0xa6),p1(0xc2),p1(0x23),p1(0x3d) ;\
+ .long p1(0xee),p1(0x4c),p1(0x95),p1(0x0b),p1(0x42),p1(0xfa),p1(0xc3),p1(0x4e) ;\
+ .long p1(0x08),p1(0x2e),p1(0xa1),p1(0x66),p1(0x28),p1(0xd9),p1(0x24),p1(0xb2) ;\
+ .long p1(0x76),p1(0x5b),p1(0xa2),p1(0x49),p1(0x6d),p1(0x8b),p1(0xd1),p1(0x25)
+#define ib_data2(p1) \
+ .long p1(0x72),p1(0xf8),p1(0xf6),p1(0x64),p1(0x86),p1(0x68),p1(0x98),p1(0x16) ;\
+ .long p1(0xd4),p1(0xa4),p1(0x5c),p1(0xcc),p1(0x5d),p1(0x65),p1(0xb6),p1(0x92) ;\
+ .long p1(0x6c),p1(0x70),p1(0x48),p1(0x50),p1(0xfd),p1(0xed),p1(0xb9),p1(0xda) ;\
+ .long p1(0x5e),p1(0x15),p1(0x46),p1(0x57),p1(0xa7),p1(0x8d),p1(0x9d),p1(0x84)
+#define ib_data3(p1) \
+ .long p1(0x90),p1(0xd8),p1(0xab),p1(0x00),p1(0x8c),p1(0xbc),p1(0xd3),p1(0x0a) ;\
+ .long p1(0xf7),p1(0xe4),p1(0x58),p1(0x05),p1(0xb8),p1(0xb3),p1(0x45),p1(0x06) ;\
+ .long p1(0xd0),p1(0x2c),p1(0x1e),p1(0x8f),p1(0xca),p1(0x3f),p1(0x0f),p1(0x02) ;\
+ .long p1(0xc1),p1(0xaf),p1(0xbd),p1(0x03),p1(0x01),p1(0x13),p1(0x8a),p1(0x6b)
+#define ib_data4(p1) \
+ .long p1(0x3a),p1(0x91),p1(0x11),p1(0x41),p1(0x4f),p1(0x67),p1(0xdc),p1(0xea) ;\
+ .long p1(0x97),p1(0xf2),p1(0xcf),p1(0xce),p1(0xf0),p1(0xb4),p1(0xe6),p1(0x73) ;\
+ .long p1(0x96),p1(0xac),p1(0x74),p1(0x22),p1(0xe7),p1(0xad),p1(0x35),p1(0x85) ;\
+ .long p1(0xe2),p1(0xf9),p1(0x37),p1(0xe8),p1(0x1c),p1(0x75),p1(0xdf),p1(0x6e)
+#define ib_data5(p1) \
+ .long p1(0x47),p1(0xf1),p1(0x1a),p1(0x71),p1(0x1d),p1(0x29),p1(0xc5),p1(0x89) ;\
+ .long p1(0x6f),p1(0xb7),p1(0x62),p1(0x0e),p1(0xaa),p1(0x18),p1(0xbe),p1(0x1b) ;\
+ .long p1(0xfc),p1(0x56),p1(0x3e),p1(0x4b),p1(0xc6),p1(0xd2),p1(0x79),p1(0x20) ;\
+ .long p1(0x9a),p1(0xdb),p1(0xc0),p1(0xfe),p1(0x78),p1(0xcd),p1(0x5a),p1(0xf4)
+#define ib_data6(p1) \
+ .long p1(0x1f),p1(0xdd),p1(0xa8),p1(0x33),p1(0x88),p1(0x07),p1(0xc7),p1(0x31) ;\
+ .long p1(0xb1),p1(0x12),p1(0x10),p1(0x59),p1(0x27),p1(0x80),p1(0xec),p1(0x5f) ;\
+ .long p1(0x60),p1(0x51),p1(0x7f),p1(0xa9),p1(0x19),p1(0xb5),p1(0x4a),p1(0x0d) ;\
+ .long p1(0x2d),p1(0xe5),p1(0x7a),p1(0x9f),p1(0x93),p1(0xc9),p1(0x9c),p1(0xef)
+#define ib_data7(p1) \
+ .long p1(0xa0),p1(0xe0),p1(0x3b),p1(0x4d),p1(0xae),p1(0x2a),p1(0xf5),p1(0xb0) ;\
+ .long p1(0xc8),p1(0xeb),p1(0xbb),p1(0x3c),p1(0x83),p1(0x53),p1(0x99),p1(0x61) ;\
+ .long p1(0x17),p1(0x2b),p1(0x04),p1(0x7e),p1(0xba),p1(0x77),p1(0xd6),p1(0x26) ;\
+ .long p1(0xe1),p1(0x69),p1(0x14),p1(0x63),p1(0x55),p1(0x21),p1(0x0c),p1(0x7d)
+
+// The rcon_table (needed for the key schedule)
+//
+// Here is original Dr Brian Gladman's source code:
+// _rcon_tab:
+// %assign x 1
+// %rep 29
+// dd x
+// %assign x f2(x)
+// %endrep
+//
+// Here is precomputed output (it's more portable this way):
+
+ .section .rodata
+ .align ALIGN64BYTES
+aes_rcon_tab:
+ .long 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80
+ .long 0x1b,0x36,0x6c,0xd8,0xab,0x4d,0x9a,0x2f
+ .long 0x5e,0xbc,0x63,0xc6,0x97,0x35,0x6a,0xd4
+ .long 0xb3,0x7d,0xfa,0xef,0xc5
+
+// The forward xor tables
+
+ .align ALIGN64BYTES
+aes_ft_tab:
+ sb_data0(u0)
+ sb_data1(u0)
+ sb_data2(u0)
+ sb_data3(u0)
+ sb_data4(u0)
+ sb_data5(u0)
+ sb_data6(u0)
+ sb_data7(u0)
+
+ sb_data0(u1)
+ sb_data1(u1)
+ sb_data2(u1)
+ sb_data3(u1)
+ sb_data4(u1)
+ sb_data5(u1)
+ sb_data6(u1)
+ sb_data7(u1)
+
+ sb_data0(u2)
+ sb_data1(u2)
+ sb_data2(u2)
+ sb_data3(u2)
+ sb_data4(u2)
+ sb_data5(u2)
+ sb_data6(u2)
+ sb_data7(u2)
+
+ sb_data0(u3)
+ sb_data1(u3)
+ sb_data2(u3)
+ sb_data3(u3)
+ sb_data4(u3)
+ sb_data5(u3)
+ sb_data6(u3)
+ sb_data7(u3)
+
+ .align ALIGN64BYTES
+aes_fl_tab:
+ sb_data0(w0)
+ sb_data1(w0)
+ sb_data2(w0)
+ sb_data3(w0)
+ sb_data4(w0)
+ sb_data5(w0)
+ sb_data6(w0)
+ sb_data7(w0)
+
+ sb_data0(w1)
+ sb_data1(w1)
+ sb_data2(w1)
+ sb_data3(w1)
+ sb_data4(w1)
+ sb_data5(w1)
+ sb_data6(w1)
+ sb_data7(w1)
+
+ sb_data0(w2)
+ sb_data1(w2)
+ sb_data2(w2)
+ sb_data3(w2)
+ sb_data4(w2)
+ sb_data5(w2)
+ sb_data6(w2)
+ sb_data7(w2)
+
+ sb_data0(w3)
+ sb_data1(w3)
+ sb_data2(w3)
+ sb_data3(w3)
+ sb_data4(w3)
+ sb_data5(w3)
+ sb_data6(w3)
+ sb_data7(w3)
+
+// The inverse xor tables
+
+ .align ALIGN64BYTES
+aes_it_tab:
+ ib_data0(v0)
+ ib_data1(v0)
+ ib_data2(v0)
+ ib_data3(v0)
+ ib_data4(v0)
+ ib_data5(v0)
+ ib_data6(v0)
+ ib_data7(v0)
+
+ ib_data0(v1)
+ ib_data1(v1)
+ ib_data2(v1)
+ ib_data3(v1)
+ ib_data4(v1)
+ ib_data5(v1)
+ ib_data6(v1)
+ ib_data7(v1)
+
+ ib_data0(v2)
+ ib_data1(v2)
+ ib_data2(v2)
+ ib_data3(v2)
+ ib_data4(v2)
+ ib_data5(v2)
+ ib_data6(v2)
+ ib_data7(v2)
+
+ ib_data0(v3)
+ ib_data1(v3)
+ ib_data2(v3)
+ ib_data3(v3)
+ ib_data4(v3)
+ ib_data5(v3)
+ ib_data6(v3)
+ ib_data7(v3)
+
+ .align ALIGN64BYTES
+aes_il_tab:
+ ib_data0(w0)
+ ib_data1(w0)
+ ib_data2(w0)
+ ib_data3(w0)
+ ib_data4(w0)
+ ib_data5(w0)
+ ib_data6(w0)
+ ib_data7(w0)
+
+ ib_data0(w1)
+ ib_data1(w1)
+ ib_data2(w1)
+ ib_data3(w1)
+ ib_data4(w1)
+ ib_data5(w1)
+ ib_data6(w1)
+ ib_data7(w1)
+
+ ib_data0(w2)
+ ib_data1(w2)
+ ib_data2(w2)
+ ib_data3(w2)
+ ib_data4(w2)
+ ib_data5(w2)
+ ib_data6(w2)
+ ib_data7(w2)
+
+ ib_data0(w3)
+ ib_data1(w3)
+ ib_data2(w3)
+ ib_data3(w3)
+ ib_data4(w3)
+ ib_data5(w3)
+ ib_data6(w3)
+ ib_data7(w3)
+
+// The inverse mix column tables
+
+ .align ALIGN64BYTES
+aes_im_tab:
+ im_data0(v0)
+ im_data1(v0)
+ im_data2(v0)
+ im_data3(v0)
+ im_data4(v0)
+ im_data5(v0)
+ im_data6(v0)
+ im_data7(v0)
+
+ im_data0(v1)
+ im_data1(v1)
+ im_data2(v1)
+ im_data3(v1)
+ im_data4(v1)
+ im_data5(v1)
+ im_data6(v1)
+ im_data7(v1)
+
+ im_data0(v2)
+ im_data1(v2)
+ im_data2(v2)
+ im_data3(v2)
+ im_data4(v2)
+ im_data5(v2)
+ im_data6(v2)
+ im_data7(v2)
+
+ im_data0(v3)
+ im_data1(v3)
+ im_data2(v3)
+ im_data3(v3)
+ im_data4(v3)
+ im_data5(v3)
+ im_data6(v3)
+ im_data7(v3)
--- linux-2.6.9-rc2-mm1/arch/x86_64/crypto/aes-x86_64-glue.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc2-mm1-aes/arch/x86_64/crypto/aes-x86_64-glue.c 2004-09-26 23:50:32.296783760 +0200
@@ -0,0 +1,91 @@
+/*
+ *
+ * Glue Code for optimized x86_64 assembler version of AES
+ *
+ * Copyright (c) 2001, Dr Brian Gladman <[email protected]>, Worcester, UK.
+ * Copyright (c) 2003, Adam J. Richter <[email protected]> (conversion to
+ * 2.5 API).
+ * Copyright (c) 2003, 2004 Fruhwirth Clemens <[email protected]>
+ * Copyright (c) 2004, Florian Bohrer <[email protected]>
+*/
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/crypto.h>
+#include <linux/linkage.h>
+
+#define AES_MIN_KEY_SIZE 16
+#define AES_MAX_KEY_SIZE 32
+#define AES_BLOCK_SIZE 16
+#define AES_KS_LENGTH (4 * AES_BLOCK_SIZE)
+#define AES_RC_LENGTH ((9 * AES_BLOCK_SIZE) / 8 - 8)
+
+typedef struct
+{
+ u_int32_t aes_Nkey; // the number of words in the key input block
+ u_int32_t aes_Nrnd; // the number of cipher rounds
+ u_int32_t aes_e_key[AES_KS_LENGTH]; // the encryption key schedule
+ u_int32_t aes_d_key[AES_KS_LENGTH]; // the decryption key schedule
+ u_int32_t aes_Ncol; // the number of columns in the cipher state
+} aes_context;
+
+
+asmlinkage void aes_set_key(void *, const unsigned char [], const int, const int);
+asmlinkage void aes_encrypt(void*, unsigned char [], const unsigned char []);
+asmlinkage void aes_decrypt(void*, unsigned char [], const unsigned char []);
+
+
+static int aes_set_key_glue(void *cx, const u8 *key, unsigned int key_length, u32 *flags)
+{
+ if(key_length != 16 && key_length != 24 && key_length != 32)
+ {
+ *flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
+ return -EINVAL;
+ }
+ aes_set_key(cx, key, key_length, 0);
+ return 0;
+}
+
+static void aes_encrypt_glue(void* a, unsigned char b[], const unsigned char c[]) {
+ aes_encrypt(a,b,c);
+}
+static void aes_decrypt_glue(void* a, unsigned char b[], const unsigned char c[]) {
+ aes_decrypt(a,b,c);
+}
+
+static struct crypto_alg aes_alg = {
+ .cra_name = "aes",
+ .cra_flags = CRYPTO_ALG_TYPE_CIPHER,
+ .cra_blocksize = AES_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(aes_context),
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(aes_alg.cra_list),
+ .cra_u = {
+ .cipher = {
+ .cia_min_keysize = AES_MIN_KEY_SIZE,
+ .cia_max_keysize = AES_MAX_KEY_SIZE,
+ .cia_setkey = aes_set_key_glue,
+ .cia_encrypt = aes_encrypt_glue,
+ .cia_decrypt = aes_decrypt_glue
+ }
+ }
+};
+
+static int __init aes_init(void)
+{
+ return crypto_register_alg(&aes_alg);
+}
+
+static void __exit aes_fini(void)
+{
+ crypto_unregister_alg(&aes_alg);
+}
+
+module_init(aes_init);
+module_exit(aes_fini);
+
+MODULE_DESCRIPTION("Rijndael (AES) Cipher Algorithm, x86_64 asm optimized");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_AUTHOR("Florian Bohrer");
+MODULE_ALIAS("aes");
--- linux-2.6.9-rc2-mm1/crypto/Kconfig 2004-09-26 11:50:39.692188448 +0200
+++ linux-2.6.9-rc2-mm1-aes/crypto/Kconfig 2004-09-26 10:24:16.219233840 +0200
@@ -173,6 +173,26 @@

See http://csrc.nist.gov/encryption/aes/ for more information.

+config CRYPTO_AES_X86_64
+ tristate "AES cipher algorithms (x86_64)"
+ depends on CRYPTO && (X86 && X86_64)
+ help
+ AES cipher algorithms (FIPS-197). AES uses the Rijndael
+ algorithm.
+
+ Rijndael appears to be consistently a very good performer in
+ both hardware and software across a wide range of computing
+ environments regardless of its use in feedback or non-feedback
+ modes. Its key setup time is excellent, and its key agility is
+ good. Rijndael's very low memory requirements make it very well
+ suited for restricted-space environments, in which it also
+ demonstrates excellent performance. Rijndael's operations are
+ among the easiest to defend against power and timing attacks.
+
+ The AES specifies three key sizes: 128, 192 and 256 bits.
+
+ See http://csrc.nist.gov/encryption/aes/ for more information.
+
config CRYPTO_CAST5
tristate "CAST5 (CAST-128) cipher algorithm"
depends on CRYPTO
--- linux-2.6.9-rc2-mm1/arch/x86_64/Makefile 2004-09-26 11:50:39.654194224 +0200
+++ linux-2.6.9-rc2-mm1-aes/arch/x86_64/Makefile 2004-09-26 10:25:40.214464624 +0200
@@ -63,7 +63,9 @@
head-y := arch/x86_64/kernel/head.o arch/x86_64/kernel/head64.o arch/x86_64/kernel/init_task.o

libs-y += arch/x86_64/lib/
-core-y += arch/x86_64/kernel/ arch/x86_64/mm/
+core-y += arch/x86_64/kernel/ \
+ arch/x86_64/mm/ \
+ arch/x86_64/crypto/
core-$(CONFIG_IA32_EMULATION) += arch/x86_64/ia32/
drivers-$(CONFIG_PCI) += arch/x86_64/pci/
drivers-$(CONFIG_OPROFILE) += arch/x86_64/oprofile/
--- linux-2.6.9-rc2-mm1/arch/x86_64/crypto/Makefile 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-rc2-mm1-aes/arch/x86_64/crypto/Makefile 2004-09-26 10:22:51.074177856 +0200
@@ -0,0 +1,9 @@
+#
+# x86_64/crypto/Makefile
+#
+# Arch-specific CryptoAPI modules.
+#
+
+obj-$(CONFIG_CRYPTO_AES_X86_64) += aes-x86_64.o
+
+aes-x86_64-y := aes-x86_64-asm.o aes-x86_64-glue.o

--


-----------------------------------------------------------------------------
"Real Programmers consider "what you see is what you get" to
be just as bad a concept in Text Editors as it is in women.
No, the Real Programmer wants a "you asked for it, you got
it" text editor -- complicated, cryptic, powerful,
unforgiving, dangerous."
-----------------------------------------------------------------------------


2004-10-02 19:37:41

by Lee Revell

Subject: Re: [PATCH] AES x86-64-asm impl.

On Sat, 2004-10-02 at 13:53, Florian Bohrer wrote:
> hi,
>
> this is my first public kernel patch. it is an x86_64 asm optimized version of AES for the
> crypto-framework. the patch is against 2.6.9-rc2-mm1 but should work with other
> versions too.
>
>
> the asm-code is from Jari Ruusu (loop-aes).
> the org. glue-code is from Fruhwirth Clemens.

You should have cc'ed Jari and Fruhwirth; you'd probably get an amusing
flame fest.

Lee

2004-10-02 19:41:11

by Andi Kleen

Subject: Re: [PATCH] AES x86-64-asm impl.

[email protected] (Florian Bohrer) writes:

> hi,
>
> this is my first public kernel patch. it is an x86_64 asm optimized version of AES for the
> crypto-framework. the patch is against 2.6.9-rc2-mm1 but should work with other
> versions too.
>
>
> the asm-code is from Jari Ruusu (loop-aes).
> the org. glue-code is from Fruhwirth Clemens.
>

Thanks. I will add it to the x86-64 patchkit. I have a 64bit version
here too, but it has a bug somewhere and I haven't had time to fix it yet.

Unfortunately it's still fundamentally 32bit. Anybody interested
in doing a true 64bit AES?

-Andi


2004-10-04 02:15:47

by dean gaudet

Subject: Re: [PATCH] AES x86-64-asm impl.



On Sat, 2 Oct 2004, Andi Kleen wrote:

> Unfortunately it's still fundamentally 32bit. Anybody interested
> in doing a true 64bit AES?

i doubt it helps any -- except for benchmark-only purposes.

there's a description of the 32-bit T-table approach in section 7.3 of
<http://fp.gladman.plus.com/cryptography_technology/rijndael/aesspec.pdf>

basically the tables are 8-bit -> 32-bit maps, and there are 4 of them (2
for each direction). to go to 64-bit you'd need 16-bit -> 64-bit maps...
512KiB per table. there are some other variations on the tables which are
smaller, but nothing as small as the 1024 bytes per table of the 32-bit
implementation.
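
here's a rough sketch of that lookup structure in C (illustrative only,
not the patch's code -- Te0..Te3 are made-up names standing in for the
four 1 KiB forward maps the patch packs into aes_ft_tab):

#include <stdint.h>

/* four hypothetical 256-entry tables, 1 KiB each: the 8-bit -> 32-bit
 * maps described above. a 16-bit -> 64-bit variant would need
 * 65536 entries * 8 bytes = 512KiB per table, hence the cache problem. */
extern const uint32_t Te0[256], Te1[256], Te2[256], Te3[256];

/* one output column of a forward round: four byte-indexed lookups
 * xored together with a round-key word */
static inline uint32_t fwd_column(uint32_t s0, uint32_t s1,
                                  uint32_t s2, uint32_t s3, uint32_t rk)
{
	return Te0[ s0        & 0xff] ^
	       Te1[(s1 >>  8) & 0xff] ^
	       Te2[(s2 >> 16) & 0xff] ^
	       Te3[(s3 >> 24) & 0xff] ^ rk;
}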

there's a completely different approach using bit-slicing (basically
consider each register as 64 1-bit registers), which yields great
throughput but cruddy latency -- you basically need lots of non-dependent
streams to make this pay off (i.e. it might work for disk crypto
processing multiple blocks simultaneously).
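
a toy illustration of the bit-sliced layout (not AES itself: pack bit i
of 64 independent blocks into word i of an array, and each bitwise
instruction then evaluates one gate across all 64 streams):

#include <stdint.h>

/* 64 parallel 1-bit gates per instruction. throughput scales, but no
 * single block is done until the whole circuit has run -- the latency
 * problem mentioned above. */
static void sliced_gate(const uint64_t a[8], const uint64_t b[8],
                        uint64_t out[8])
{
	int i;
	for (i = 0; i < 8; i++)
		out[i] = a[i] ^ (a[i] & b[i]);
}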

-dean

2004-10-04 11:51:23

by Jari Ruusu

Subject: Re: [PATCH] AES x86-64-asm impl.

Andi Kleen wrote:
> [email protected] (Florian Bohrer) writes:
> > the asm-code is from Jari Ruusu (loop-aes).
> > the org. glue-code is from Fruhwirth Clemens.
>
> Thanks. I will add it to the x86-64 patchkit.

Here we go again...

Linus promised that he will not merge my code, and I am quite happy with my
code not being anywhere near mainline linux cryptoapi.

Linus, please consider dropping this.

--
Jari Ruusu 1024R/3A220F51 5B 4B F9 BB D3 3F 52 E9 DB 1D EB E3 24 0E A9 DD

2004-10-04 12:09:13

by Paolo Ciarrocchi

Subject: Re: [PATCH] AES x86-64-asm impl.

On Mon, 04 Oct 2004 14:51:19 +0300, Jari Ruusu
<[email protected]> wrote:
> Andi Kleen wrote:
> > [email protected] (Florian Bohrer) writes:
> > > the asm-code is from Jari Ruusu (loop-aes).
> > > the org. glue-code is from Fruhwirth Clemens.
> >
> > Thanks. I will add it to the x86-64 patchkit.
>
> Here we go again...
>
> Linus promised that he will not merge my code, and I am quite happy with my
> code not being anywhere near mainline linux cryptoapi.
>
> Linus, please consider dropping this.

I guess Linus will do so,
but may I ask why you don't want to see your code merged into mainline?

Thanks.

--
Paolo
Personal home page: http://www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc

2004-10-04 12:20:56

by Jari Ruusu

Subject: Re: [PATCH] AES x86-64-asm impl.

Paolo Ciarrocchi wrote:
> On Mon, 04 Oct 2004 14:51:19 +0300, Jari Ruusu
> > Linus promised that he will not merge my code, and I am quite happy with my
> > code not being anywhere near mainline linux cryptoapi.
> >
> > Linus, please consider dropping this.
>
> I guess Linus will do so,
> but may I ask why you don't want to see your code merged into mainline?

I don't want my name associated with mainline linux cryptoapi or cryptoloop
or their developers.

--
Jari Ruusu 1024R/3A220F51 5B 4B F9 BB D3 3F 52 E9 DB 1D EB E3 24 0E A9 DD

2004-10-04 12:24:30

by Paolo Ciarrocchi

Subject: Re: [PATCH] AES x86-64-asm impl.

On Mon, 04 Oct 2004 15:20:43 +0300, Jari Ruusu
<[email protected]> wrote:
> Paolo Ciarrocchi wrote:
> > On Mon, 04 Oct 2004 14:51:19 +0300, Jari Ruusu
> > > Linus promised that he will not merge my code, and I am quite happy with my
> > > code not being anywhere near mainline linux cryptoapi.
> > >
> > > Linus, please consider dropping this.
> >
> > I guess Linus will do so,
> > but may I ask why you don't want to see your code merged into mainline?
>
> I don't want my name associated with mainline linux cryptoapi or cryptoloop
> or their developers.

I understand that, but I still don't understand the reason.
But hey, feel free to ignore my question ;)
--
Paolo
Personal home page: http://www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc

2004-10-04 12:34:09

by Jari Ruusu

Subject: Re: [PATCH] AES x86-64-asm impl.

Paolo Ciarrocchi wrote:
> On Mon, 04 Oct 2004 15:20:43 +0300, Jari Ruusu
> I understand that, but I still don't understand the reason.
> But hey, feel free to ignore my question ;)

You haven't looked at cryptoloop security, have you?

No sane person wants to be associated with that kind of broken and
backdoored scam. I certainly don't.

--
Jari Ruusu 1024R/3A220F51 5B 4B F9 BB D3 3F 52 E9 DB 1D EB E3 24 0E A9 DD

2004-10-04 12:35:44

by Paolo Ciarrocchi

Subject: Re: [PATCH] AES x86-64-asm impl.

On Mon, 04 Oct 2004 15:32:29 +0300, Jari Ruusu
<[email protected]> wrote:
> Paolo Ciarrocchi wrote:
> > On Mon, 04 Oct 2004 15:20:43 +0300, Jari Ruusu
> > I understand that, but I still don't understand the reason.
> > But hey, feel free to ignore my question ;)
>
> You haven't looked at cryptoloop security, have you?

Not at all.

> No sane person wants to be associated with that kind of broken and
> backdoored scam. I certainly don't.

It was just curiosity ;)
Thank you for the answer.

--
Paolo
Personal home page: http://www.ciarrocchi.tk
See my photos: http://paolociarrocchi.fotopic.net/
Buy cool stuff here: http://www.cafepress.com/paoloc

2004-10-04 13:08:42

by Andi Kleen

Subject: Re: [PATCH] AES x86-64-asm impl.

On Mon, Oct 04, 2004 at 02:51:19PM +0300, Jari Ruusu wrote:
> Andi Kleen wrote:
> > [email protected] (Florian Bohrer) writes:
> > > the asm-code is from Jari Ruusu (loop-aes).
> > > the org. glue-code is from Fruhwirth Clemens.
> >
> > Thanks. I will add it to the x86-64 patchkit.
>
> Here we go again...
>
> Linus promised that he will not merge my code, and I am quite happy with my
> code not being anywhere near mainline linux cryptoapi.
>
> Linus, please consider dropping this.

Ok, I will drop that version and go back to the older version based
on the i386 code.

-Andi

2004-10-04 19:06:25

by Raul Miller

Subject: Re: [discuss] Re: [PATCH] AES x86-64-asm impl.

On Mon, Oct 04, 2004 at 03:32:29PM +0300, Jari Ruusu wrote:
> You haven't looked at cryptoloop security, have you?
>
> No sane person wants to be associated with that kind of broken and
> backdoored scam. I certainly don't.

Most kernel software is broken, initially -- and eventually it's either
replaced with something better or tossed because no one is interested
in it.

It's not clear to me whether you're more in the "offering something
better" camp or the "not interested" camp. But I'm curious -- what do
you see as the major issues?

Thanks,

--
Raul

2004-10-04 19:30:01

by Bill Davidsen

Subject: Re: [PATCH] AES x86-64-asm impl.

Jari Ruusu wrote:
> Paolo Ciarrocchi wrote:
>
>>On Mon, 04 Oct 2004 15:20:43 +0300, Jari Ruusu
>>I understand that, but I still don't understand the reason.
>>But hey, feel free to ignore my question ;)
>
>
> You haven't looked at cryptoloop security, have you?
>
> No sane person wants to be associated with that kind of broken and
> backdoored scam. I certainly don't.
>
Would you be happy if the code were wrapped as a general use package
like blowfish, or have you decided that because one part of Linux
doesn't meet your standards you don't want to allow any of your code to
be used in it?


--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-10-04 21:20:51

by Lee Revell

Subject: Re: [PATCH] AES x86-64-asm impl.

On Mon, 2004-10-04 at 15:26, Bill Davidsen wrote:
> Jari Ruusu wrote:
> > Paolo Ciarrocchi wrote:
> >
> >>On Mon, 04 Oct 2004 15:20:43 +0300, Jari Ruusu
> >>I understand that, but I still don't understand the reason.
> >>But hey, feel free to ignore my question ;)
> >
> >
> > You haven't looked at cryptoloop security, have you?
> >
> > No sane person wants to be associated with that kind of broken and
> > backdoored scam. I certainly don't.
> >
> Would you be happy if the code were wrapped as a general use package
> like blowfish, or have you decided that because one part of Linux
> doesn't meet your standards you don't want to allow any of your code to
> be used in it?
>

Please check the archives; Jari's reasons are well documented. I cannot
summarize the technical issues here as IANA cryptographer, but please,
let's not start that thread again.

Lee

2004-10-05 00:34:31

by Andy Lutomirski

Subject: Re: [PATCH] AES x86-64-asm impl.

Andi Kleen wrote:
> On Mon, Oct 04, 2004 at 02:51:19PM +0300, Jari Ruusu wrote:

>>
>>Here we go again...
>>
>>Linus promised that he will not merge my code, and I am quite happy with my
>>code not being anywhere near mainline linux cryptoapi.
>>
>>Linus, please consider dropping this.
>
>
> Ok, I will drop that version and go back to the older version based
> on the i386 code.
>
> -Andi

WHAT? We're dropping potentially better code because someone _who
didn't submit it_ disagrees for personal political reasons? (Jari- I'm
not questioning your reasons for not wanting to be involved in
cryptoapi, but IIRC you did release that code under the GPL.)

--Andy

2004-10-05 05:16:46

by Linus Torvalds

Subject: Re: [PATCH] AES x86-64-asm impl.



On Mon, 4 Oct 2004, Andy Lutomirski wrote:
>
> WHAT? We're dropping potentially better code because someone _who
> didn't submit it_ disagrees for personal political reasons? (Jari- I'm
> not questioning your reasons for not wanting to be involved in
> cryptoapi, but IIRC you did release that code under the GPL.)

Guys. Please remember this: don't bother with code that Jari supposedly
"releases". It's simply not worth the bother.

Linus

2004-10-05 15:24:00

by Bill Davidsen

Subject: Re: [PATCH] AES x86-64-asm impl.

On Mon, 4 Oct 2004, Lee Revell wrote:

> On Mon, 2004-10-04 at 15:26, Bill Davidsen wrote:
> > Jari Ruusu wrote:
> > > Paolo Ciarrocchi wrote:
> > >
> > >>On Mon, 04 Oct 2004 15:20:43 +0300, Jari Ruusu
> > >>I understand that, but I still don't understand the reason.
> > >>But hey, feel free to ignore my question ;)
> > >
> > >
> > > You haven't looked at cryptoloop security, have you?
> > >
> > > No sane person wants to be associated with that kind of broken and
> > > backdoored scam. I certainly don't.
> > >
> > Would you be happy if the code were wrapped as a general use package
> > like blowfish, or have you decided that because one part of Linux
> > doesn't meet your standards you don't want to allow any of your code to
> > be used in it?
> >
>
> Please check the archives, Jari's reasons are well documented. I cannot
> summarize the technical issues here as IANA cryptographer but please,
> let's not start that thread again.

I'm not starting a thread; I read the discussion the first time, and I'm
not asking about his reasons. I'm asking a yes/no question, which he will
answer or not as he pleases.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.