2004-11-01 06:57:23

by Marc Bevand

Subject: [rc4-amd64] RC4 optimized for AMD64

I have just published a small paper about optimizing RC4 for
AMD64 (x86-64). A working implementation is also provided:

http://epita.fr/~bevand_m/papers/rc4-amd64.html

Kernel people may be interested given the fact that Linux
already implements RC4.

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.


2004-11-01 07:36:42

by James Morris

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

On Mon, 1 Nov 2004, Marc Bevand wrote:

> I have just published a small paper about optimizing RC4 for
> AMD64 (x86-64). A working implementation is also provided:
>
> http://epita.fr/~bevand_m/papers/rc4-amd64.html
>
> Kernel people may be interested given the fact that Linux
> already implements RC4.

The only problem is that the setkey code is released under a
GPL-incompatible license, although it's probably not difficult to make
the kernel's existing C setkey code work with the new asm code.


- James
--
James Morris
<[email protected]>



2004-11-01 12:24:19

by Marc Bevand

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

On 2004-11-01, James Morris <[email protected]> wrote:
|
| The only problem is that the setkey code is released under a
| GPL-incompatible license, although it's probably not difficult to make
| the kernel's existing C setkey code work with the new asm code.

Yes, it would be very easy to do. This patch (completely untested)
is probably all that is necessary to make Linux arc4_set_key() work
with rc4-amd64:

--- 8< -----------------------------------------------------------------
--- crypto/arc4.c.orig 2004-11-01 13:16:41.739375512 +0100
+++ crypto/arc4.c 2004-11-01 13:18:16.799924112 +0100
@@ -20,8 +20,8 @@
 #define ARC4_BLOCK_SIZE 1
 
 struct arc4_ctx {
-	u8 S[256];
-	u8 x, y;
+	u64 x, y;
+	u64 S[256];
 };
 
 static int arc4_set_key(void *ctx_arg, const u8 *in_key, unsigned int key_len, u32 *flags)
--- 8< -----------------------------------------------------------------
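
For illustration, here is a minimal userspace sketch of a key setup that
fills the widened context. The names arc4_ctx64, arc4_set_key64 and
arc4_crypt64 are hypothetical, the types come from stdint.h instead of
the kernel's u8/u64, and the starting values x = y = 0 are an assumption
(the index convention the asm routine expects should be checked against
its source). The plain C arc4_crypt64 is only a reference to check the
key schedule against; it is not the asm code.

```c
#include <stddef.h>
#include <stdint.h>

/* Widened context mirroring the patch: indices first, then one
 * 64-bit slot per S-box entry. Only the low byte of each slot is used. */
struct arc4_ctx64 {
    uint64_t x, y;
    uint64_t S[256];
};

/* Standard RC4 key schedule; only the storage width differs.
 * key_len must be non-zero. */
static void arc4_set_key64(struct arc4_ctx64 *ctx,
                           const uint8_t *key, size_t key_len)
{
    size_t i, k = 0;
    uint8_t j = 0;

    /* Assumption: indices start at 0 and are incremented before use. */
    ctx->x = 0;
    ctx->y = 0;

    for (i = 0; i < 256; i++)
        ctx->S[i] = i;

    for (i = 0; i < 256; i++) {
        uint64_t a = ctx->S[i];

        j = (uint8_t)(j + key[k] + a);
        ctx->S[i] = ctx->S[j];
        ctx->S[j] = a;
        if (++k >= key_len)
            k = 0;
    }
}

/* Reference keystream generator in plain C, used here only to
 * check that the key schedule above produces a usable state. */
static void arc4_crypt64(struct arc4_ctx64 *ctx, uint8_t *out,
                         const uint8_t *in, size_t len)
{
    uint8_t x = (uint8_t)ctx->x;
    uint8_t y = (uint8_t)ctx->y;
    size_t n;

    for (n = 0; n < len; n++) {
        uint64_t a, b;

        x = (uint8_t)(x + 1);
        a = ctx->S[x];
        y = (uint8_t)(y + a);
        b = ctx->S[y];
        ctx->S[x] = b;              /* swap S[x] and S[y] */
        ctx->S[y] = a;
        out[n] = in[n] ^ (uint8_t)ctx->S[(uint8_t)(a + b)];
    }
    ctx->x = x;
    ctx->y = y;
}
```

The context grows from 258 bytes to a little over 2 KB, which is the
price of letting the asm code load S-box entries as full 64-bit words.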

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.

2004-11-01 22:44:28

by dean gaudet

Subject: Re: [rc4-amd64] RC4 optimized for AMD64



On Mon, 1 Nov 2004, dean gaudet wrote:

> On Mon, 1 Nov 2004, Marc Bevand wrote:
>
> > I have just published a small paper about optimizing RC4 for
> > AMD64 (x86-64). A working implementation is also provided:
> >
> > http://epita.fr/~bevand_m/papers/rc4-amd64.html
> >
> > Kernel people may be interested given the fact that Linux
> > already implements RC4.
>
> you've made a non-portable flags assumption:
>
> > dec %r11b
> > ror $8, %r8 # (ror does not change ZF)
> > jnz 1b
>
> the contents of ZF are undefined after a rotation... most importantly
> they differ between p4 (ZF is set according to result) and k8 (ZF
> unchanged).

ack... it's too early on a monday morning -- i misread the documentation.
this ZF assumption is actually defined and portable... still kind of ugly.
how much benefit do you see?

-dean

2004-11-01 23:39:49

by Marc Bevand

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

dean gaudet wrote:
|
| [...]
| ack... it's too early on a monday morning -- i misread the documentation.
| this ZF assumption is actually defined and portable... still kind of ugly.
| how much benefit do you see?

When "dec" is placed before "ror", throughput goes up by about 5%
on my test system (Opteron 244 rev C0). I don't find it "ugly"
because the optimization is not intrusive at all (only 1 instruction
moved).
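
To make the flag dependency explicit, here is a toy C model of the
"dec / ror / jnz" sequence. It is only a sketch: struct cpu_model and
the model_* helpers are hypothetical names, and the initial counter
value is an assumption chosen for the example. It encodes the
documented behaviour that "dec" writes ZF from its result while "ror"
leaves ZF untouched, so the branch sees the flag produced by "dec".

```c
#include <stdbool.h>
#include <stdint.h>

/* The two registers and the one flag the loop tail cares about. */
struct cpu_model {
    uint64_t r8;    /* data word, rotated one byte per iteration */
    uint8_t  r11b;  /* loop counter */
    bool     zf;    /* zero flag */
};

/* dec %r11b: ZF is written from the result (CF is left alone,
 * which this model does not track). */
static void model_dec_r11b(struct cpu_model *c)
{
    c->r11b = (uint8_t)(c->r11b - 1);
    c->zf = (c->r11b == 0);
}

/* ror $8, %r8: rotates the word right by one byte; ZF is not touched. */
static void model_ror8_r8(struct cpu_model *c)
{
    c->r8 = (c->r8 >> 8) | (c->r8 << 56);
}

/* Runs "dec; ror; jnz 1b" until ZF is set and returns the number
 * of iterations executed. */
static unsigned model_loop(struct cpu_model *c)
{
    unsigned iterations = 0;

    do {
        iterations++;
        model_dec_r11b(c);
        model_ror8_r8(c);
    } while (!c->zf);       /* jnz: loop again while ZF is clear */

    return iterations;
}
```

With a counter of 8, the model runs eight iterations and the rotated
word ends up back at its starting value. With r8 = 0 the loop still runs
for the full count, which would not be the case if the rotation set ZF
from its result.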

Concerning the "dec" versus "sub $1" case, it makes absolutely no
difference on the Opteron; I just used "dec" because the instruction
is 3 bytes long instead of 4.
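
For reference, the two encodings can be written out as byte arrays.
This is a sketch based on my reading of the opcode tables (FE /1 for
"dec r/m8", 80 /5 ib for "sub r/m8, imm8", plus a REX.B prefix to reach
%r11b); the array names and the modrm_reg_direct helper are
hypothetical, and the bytes are worth confirming with an assembler.

```c
#include <stddef.h>
#include <stdint.h>

/* dec %r11b: REX.B prefix, opcode FE, ModRM with /1 */
static const uint8_t enc_dec_r11b[] = { 0x41, 0xFE, 0xCB };

/* sub $1, %r11b: REX.B prefix, opcode 80, ModRM with /5, imm8 */
static const uint8_t enc_sub1_r11b[] = { 0x41, 0x80, 0xEB, 0x01 };

/* ModRM byte for a register-direct operand: mod = 11, reg = opcode
 * extension digit, rm = low 3 bits of the register number (the 4th
 * bit of r8..r15 is carried by the REX.B prefix). */
static uint8_t modrm_reg_direct(unsigned digit, unsigned regnum)
{
    return (uint8_t)(0xC0 | ((digit & 7) << 3) | (regnum & 7));
}
```

The extra byte in the "sub" form is the immediate operand; the prefix,
opcode and ModRM bytes take the same space in both instructions.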

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.

2004-11-02 04:49:38

by dean gaudet

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

On Mon, 1 Nov 2004, Marc Bevand wrote:

> I have just published a small paper about optimizing RC4 for
> AMD64 (x86-64). A working implementation is also provided:
>
> http://epita.fr/~bevand_m/papers/rc4-amd64.html
>
> Kernel people may be interested given the fact that Linux
> already implements RC4.

you've made a non-portable flags assumption:

> dec %r11b
> ror $8, %r8 # (ror does not change ZF)
> jnz 1b

the contents of ZF are undefined after a rotation... most importantly they
differ between p4 (ZF is set according to result) and k8 (ZF unchanged).

do you really measure a perf improvement from this assumption? note that
p4 would prefer "sub $1, %r11b" here instead of dec... but the difference
is likely minimal.

-dean

2004-11-02 18:52:17

by dean gaudet

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

On Tue, 2 Nov 2004, Denis Vlasenko wrote:

> On Monday 01 November 2004 22:44, dean gaudet wrote:
> > note that
> > p4 would prefer "sub $1, %r11b" here instead of dec... but the difference
> > is likely minimal.
>
> p4 is not the only x86 CPU on the planet. Why should I use
> a longer instruction then?

you're asking about spending one byte? one byte extra for code which
could perform better on more CPUs?

i could equally well say "k8 is not the only x86-64 cpu on the planet".

i really don't care whether this change is made or not, i only mentioned a
general perf rule.

go ahead and use -Os for the rest of the kernel if you're worried about
size, it'll likely go faster. but spending 1 byte in code which is perf
critical is nothing.

-dean

2004-11-02 21:46:54

by Denis Vlasenko

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

On Tuesday 02 November 2004 20:52, dean gaudet wrote:
> On Tue, 2 Nov 2004, Denis Vlasenko wrote:
>
> > On Monday 01 November 2004 22:44, dean gaudet wrote:
> > > note that
> > > p4 would prefer "sub $1, %r11b" here instead of dec... but the difference
> > > is likely minimal.
> >
> > p4 is not the only x86 CPU on the planet. Why should I use
> > a longer instruction then?
>
> you're asking about spending one byte? one byte extra for code which
> could perform better on more CPUs?

You're asking about speedup by 1 cycle on a CPU which will be outdated
6 months from now?
--
vda

2004-11-03 10:07:45

by Marc Bevand

Subject: Re: [rc4-amd64] RC4 optimized for AMD64

dean gaudet wrote:
|
| [...]
| you're asking about spending one byte? one byte extra for code which
| could perform better on more CPUs?

Guys, this does not matter _for now_, because AFAIK nobody has
benchmarked this code on an EM64T P4 CPU.

Obviously, if 'sub $1,X' is proved to be faster than 'dec' on the
Intel CPU, then I will change the code (since both instructions are
equivalent on AMD CPUs).

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.