Subject: [PATCH] crypto: rmd128: make it work on my preferred architecture

From: Sebastian Siewior <[email protected]>

Not everybody counts 10 as 01.

Signed-off-by: Sebastian Siewior <[email protected]>
---

Adrian-Ken: I expect the other implementation to be broken. Please fix
it :)

crypto/rmd128.c | 316 +++++++++++++++++++++++++++----------------------------
1 files changed, 153 insertions(+), 163 deletions(-)

diff --git a/crypto/rmd128.c b/crypto/rmd128.c
index 146a167..34e9e4a 100644
--- a/crypto/rmd128.c
+++ b/crypto/rmd128.c
@@ -7,6 +7,8 @@
*
* Copyright (c) 2008 Adrian-Ken Rueegsegger <rueegsegger (at) swiss-it.ch>
*
+ * Sebastian Siewior tried to use this on PowerPC. Now it does work.
+ *
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the Free
* Software Foundation; either version 2 of the License, or (at your option)
@@ -52,190 +54,178 @@ static void rmd128_transform(u32 *state, u32 const *in)
u32 aa, bb, cc, dd, aaa, bbb, ccc, ddd;

/* Initialize left lane */
- aa = state[0];
- bb = state[1];
- cc = state[2];
- dd = state[3];
+ aa = le32_to_cpu(state[0]);
+ bb = le32_to_cpu(state[1]);
+ cc = le32_to_cpu(state[2]);
+ dd = le32_to_cpu(state[3]);

/* Initialize right lane */
- aaa = state[0];
- bbb = state[1];
- ccc = state[2];
- ddd = state[3];
+ aaa = le32_to_cpu(state[0]);
+ bbb = le32_to_cpu(state[1]);
+ ccc = le32_to_cpu(state[2]);
+ ddd = le32_to_cpu(state[3]);

/* round 1: left lane */
- ROUND(aa, bb, cc, dd, F1, K1, in[0], 11);
- ROUND(dd, aa, bb, cc, F1, K1, in[1], 14);
- ROUND(cc, dd, aa, bb, F1, K1, in[2], 15);
- ROUND(bb, cc, dd, aa, F1, K1, in[3], 12);
- ROUND(aa, bb, cc, dd, F1, K1, in[4], 5);
- ROUND(dd, aa, bb, cc, F1, K1, in[5], 8);
- ROUND(cc, dd, aa, bb, F1, K1, in[6], 7);
- ROUND(bb, cc, dd, aa, F1, K1, in[7], 9);
- ROUND(aa, bb, cc, dd, F1, K1, in[8], 11);
- ROUND(dd, aa, bb, cc, F1, K1, in[9], 13);
- ROUND(cc, dd, aa, bb, F1, K1, in[10], 14);
- ROUND(bb, cc, dd, aa, F1, K1, in[11], 15);
- ROUND(aa, bb, cc, dd, F1, K1, in[12], 6);
- ROUND(dd, aa, bb, cc, F1, K1, in[13], 7);
- ROUND(cc, dd, aa, bb, F1, K1, in[14], 9);
- ROUND(bb, cc, dd, aa, F1, K1, in[15], 8);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 0]), 11);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[ 1]), 14);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[ 2]), 15);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[ 3]), 12);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 4]), 5);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[ 5]), 8);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[ 6]), 7);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[ 7]), 9);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 8]), 11);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[ 9]), 13);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[10]), 14);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[11]), 15);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[12]), 6);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[13]), 7);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[14]), 9);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[15]), 8);

/* round 2: left lane */
- ROUND(aa, bb, cc, dd, F2, K2, in[7], 7);
- ROUND(dd, aa, bb, cc, F2, K2, in[4], 6);
- ROUND(cc, dd, aa, bb, F2, K2, in[13], 8);
- ROUND(bb, cc, dd, aa, F2, K2, in[1], 13);
- ROUND(aa, bb, cc, dd, F2, K2, in[10], 11);
- ROUND(dd, aa, bb, cc, F2, K2, in[6], 9);
- ROUND(cc, dd, aa, bb, F2, K2, in[15], 7);
- ROUND(bb, cc, dd, aa, F2, K2, in[3], 15);
- ROUND(aa, bb, cc, dd, F2, K2, in[12], 7);
- ROUND(dd, aa, bb, cc, F2, K2, in[0], 12);
- ROUND(cc, dd, aa, bb, F2, K2, in[9], 15);
- ROUND(bb, cc, dd, aa, F2, K2, in[5], 9);
- ROUND(aa, bb, cc, dd, F2, K2, in[2], 11);
- ROUND(dd, aa, bb, cc, F2, K2, in[14], 7);
- ROUND(cc, dd, aa, bb, F2, K2, in[11], 13);
- ROUND(bb, cc, dd, aa, F2, K2, in[8], 12);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[ 7]), 7);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[ 4]), 6);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[13]), 8);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 1]), 13);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[10]), 11);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[ 6]), 9);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[15]), 7);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 3]), 15);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[12]), 7);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[ 0]), 12);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[ 9]), 15);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 5]), 9);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[ 2]), 11);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[14]), 7);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[11]), 13);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 8]), 12);

/* round 3: left lane */
- ROUND(aa, bb, cc, dd, F3, K3, in[3], 11);
- ROUND(dd, aa, bb, cc, F3, K3, in[10], 13);
- ROUND(cc, dd, aa, bb, F3, K3, in[14], 6);
- ROUND(bb, cc, dd, aa, F3, K3, in[4], 7);
- ROUND(aa, bb, cc, dd, F3, K3, in[9], 14);
- ROUND(dd, aa, bb, cc, F3, K3, in[15], 9);
- ROUND(cc, dd, aa, bb, F3, K3, in[8], 13);
- ROUND(bb, cc, dd, aa, F3, K3, in[1], 15);
- ROUND(aa, bb, cc, dd, F3, K3, in[2], 14);
- ROUND(dd, aa, bb, cc, F3, K3, in[7], 8);
- ROUND(cc, dd, aa, bb, F3, K3, in[0], 13);
- ROUND(bb, cc, dd, aa, F3, K3, in[6], 6);
- ROUND(aa, bb, cc, dd, F3, K3, in[13], 5);
- ROUND(dd, aa, bb, cc, F3, K3, in[11], 12);
- ROUND(cc, dd, aa, bb, F3, K3, in[5], 7);
- ROUND(bb, cc, dd, aa, F3, K3, in[12], 5);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[ 3]), 11);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[10]), 13);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[14]), 6);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[ 4]), 7);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[ 9]), 14);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[15]), 9);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[ 8]), 13);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[ 1]), 15);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[ 2]), 14);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[ 7]), 8);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[ 0]), 13);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[ 6]), 6);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[13]), 5);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[11]), 12);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[ 5]), 7);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[12]), 5);

/* round 4: left lane */
- ROUND(aa, bb, cc, dd, F4, K4, in[1], 11);
- ROUND(dd, aa, bb, cc, F4, K4, in[9], 12);
- ROUND(cc, dd, aa, bb, F4, K4, in[11], 14);
- ROUND(bb, cc, dd, aa, F4, K4, in[10], 15);
- ROUND(aa, bb, cc, dd, F4, K4, in[0], 14);
- ROUND(dd, aa, bb, cc, F4, K4, in[8], 15);
- ROUND(cc, dd, aa, bb, F4, K4, in[12], 9);
- ROUND(bb, cc, dd, aa, F4, K4, in[4], 8);
- ROUND(aa, bb, cc, dd, F4, K4, in[13], 9);
- ROUND(dd, aa, bb, cc, F4, K4, in[3], 14);
- ROUND(cc, dd, aa, bb, F4, K4, in[7], 5);
- ROUND(bb, cc, dd, aa, F4, K4, in[15], 6);
- ROUND(aa, bb, cc, dd, F4, K4, in[14], 8);
- ROUND(dd, aa, bb, cc, F4, K4, in[5], 6);
- ROUND(cc, dd, aa, bb, F4, K4, in[6], 5);
- ROUND(bb, cc, dd, aa, F4, K4, in[2], 12);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[ 1]), 11);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 9]), 12);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[11]), 14);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[10]), 15);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[ 0]), 14);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 8]), 15);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[12]), 9);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[ 4]), 8);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[13]), 9);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 3]), 14);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[ 7]), 5);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[15]), 6);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[14]), 8);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 5]), 6);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[ 6]), 5);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[ 2]), 12);

/* round 1: right lane */
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[5], 8);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[14], 9);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[7], 9);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[0], 11);
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[9], 13);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[2], 15);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[11], 15);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[4], 5);
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[13], 7);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[6], 7);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[15], 8);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[8], 11);
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[1], 14);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[10], 14);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[3], 12);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[12], 6);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[ 5]), 8);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[14]), 9);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[ 7]), 9);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[ 0]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[ 9]), 13);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[ 2]), 15);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[11]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[ 4]), 5);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[13]), 7);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[ 6]), 7);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[15]), 8);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[ 8]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[ 1]), 14);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[10]), 14);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[ 3]), 12);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[12]), 6);

/* round 2: right lane */
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[6], 9);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[11], 13);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[3], 15);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[7], 7);
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[0], 12);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[13], 8);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[5], 9);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[10], 11);
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[14], 7);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[15], 7);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[8], 12);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[12], 7);
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[4], 6);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[9], 15);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[1], 13);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[2], 11);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[ 6]), 9);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[11]), 13);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 3]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[ 7]), 7);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[ 0]), 12);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[13]), 8);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 5]), 9);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[10]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[14]), 7);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[15]), 7);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 8]), 12);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[12]), 7);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[ 4]), 6);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[ 9]), 15);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 1]), 13);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[ 2]), 11);

/* round 3: right lane */
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[15], 9);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[5], 7);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[1], 15);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[3], 11);
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[7], 8);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[14], 6);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[6], 6);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[9], 14);
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[11], 12);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[8], 13);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[12], 5);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[2], 14);
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[10], 13);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[0], 13);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[4], 7);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[13], 5);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[15]), 9);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[ 5]), 7);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[ 1]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[ 3]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[ 7]), 8);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[14]), 6);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[ 6]), 6);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[ 9]), 14);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[11]), 12);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[ 8]), 13);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[12]), 5);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[ 2]), 14);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[10]), 13);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[ 0]), 13);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[ 4]), 7);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[13]), 5);

/* round 4: right lane */
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[8], 15);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[6], 5);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[4], 8);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[1], 11);
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[3], 14);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[11], 14);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[15], 6);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[0], 14);
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[5], 6);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[12], 9);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[2], 12);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[13], 9);
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[9], 12);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[7], 5);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[10], 15);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[14], 8);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 8]), 15);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[ 6]), 5);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[ 4]), 8);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[ 1]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 3]), 14);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[11]), 14);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[15]), 6);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[ 0]), 14);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 5]), 6);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[12]), 9);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[ 2]), 12);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[13]), 9);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 9]), 12);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[ 7]), 5);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[10]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[14]), 8);

/* combine results */
- ddd += cc + state[1]; /* final result for state[0] */
- state[1] = state[2] + dd + aaa;
- state[2] = state[3] + aa + bbb;
- state[3] = state[0] + bb + ccc;
- state[0] = ddd;
+ ddd += cc + le32_to_cpu(state[1]); /* final result for state[0] */
+ le32_add_cpu(&state[2], dd + aaa);
+ state[1] = state[2];

- return;
-}
+ le32_add_cpu(&state[3], aa + bbb);
+ state[2] = state[3];

-static inline void le32_to_cpu_array(u32 *buf, unsigned int words)
-{
- while (words--) {
- le32_to_cpus(buf);
- buf++;
- }
-}
+ le32_add_cpu(&state[0], bb + ccc);
+ state[3] = state[0];

-static inline void cpu_to_le32_array(u32 *buf, unsigned int words)
-{
- while (words--) {
- cpu_to_le32s(buf);
- buf++;
- }
+ state[0] = cpu_to_le32(ddd);
+ return;
}

-static inline void rmd128_transform_helper(struct rmd128_ctx *ctx)
+static void rmd128_transform_helper(struct rmd128_ctx *ctx)
{
- le32_to_cpu_array(ctx->buffer, sizeof(ctx->buffer) / sizeof(u32));
rmd128_transform(ctx->state, ctx->buffer);
}

@@ -245,10 +235,10 @@ static void rmd128_init(struct crypto_tfm *tfm)

rctx->byte_count = 0;

- rctx->state[0] = RMD_H0;
- rctx->state[1] = RMD_H1;
- rctx->state[2] = RMD_H2;
- rctx->state[3] = RMD_H3;
+ rctx->state[0] = __constant_cpu_to_le32(RMD_H0);
+ rctx->state[1] = __constant_cpu_to_le32(RMD_H1);
+ rctx->state[2] = __constant_cpu_to_le32(RMD_H2);
+ rctx->state[3] = __constant_cpu_to_le32(RMD_H3);

memset(rctx->buffer, 0, sizeof(rctx->buffer));
}
@@ -292,8 +282,8 @@ static void rmd128_final(struct crypto_tfm *tfm, u8 *out)
u32 index, padlen;
u64 bits;
static const u8 padding[64] = { 0x80, };
- bits = rctx->byte_count << 3;

+ bits = cpu_to_le64(rctx->byte_count << 3);
/* Pad out to 56 mod 64 */
index = rctx->byte_count & 0x3f;
padlen = (index < 56) ? (56 - index) : ((64+56) - index);
--
1.5.4.3

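(For reference, the le32_add_cpu() helper used in the combine-results hunk above
is roughly the following; this is a sketch of the usual definition from the
kernel's byteorder helpers, not a quote of the exact source:)

/* Add a CPU-endian value to a variable kept little-endian in memory. */
static inline void le32_add_cpu(__le32 *var, u32 val)
{
	*var = cpu_to_le32(le32_to_cpu(*var) + val);
}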


Subject: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

Not everybody counts 10 as 01.

Signed-off-by: Sebastian Siewior <[email protected]>
---
Changelog:
v2: state is now kept CPU-endian and converted in final(). This saves a few
cycles :)
v1: quick fix to make it work

crypto/rmd128.c | 288 ++++++++++++++++++++++++++-----------------------------
1 files changed, 138 insertions(+), 150 deletions(-)

diff --git a/crypto/rmd128.c b/crypto/rmd128.c
index 146a167..0d946a3 100644
--- a/crypto/rmd128.c
+++ b/crypto/rmd128.c
@@ -7,6 +7,8 @@
*
* Copyright (c) 2008 Adrian-Ken Rueegsegger <rueegsegger (at) swiss-it.ch>
*
+ * Sebastian Siewior tried to use this on PowerPC. Now it does work.
+ *
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the Free
* Software Foundation; either version 2 of the License, or (at your option)
@@ -64,148 +66,148 @@ static void rmd128_transform(u32 *state, u32 const *in)
ddd = state[3];

/* round 1: left lane */
- ROUND(aa, bb, cc, dd, F1, K1, in[0], 11);
- ROUND(dd, aa, bb, cc, F1, K1, in[1], 14);
- ROUND(cc, dd, aa, bb, F1, K1, in[2], 15);
- ROUND(bb, cc, dd, aa, F1, K1, in[3], 12);
- ROUND(aa, bb, cc, dd, F1, K1, in[4], 5);
- ROUND(dd, aa, bb, cc, F1, K1, in[5], 8);
- ROUND(cc, dd, aa, bb, F1, K1, in[6], 7);
- ROUND(bb, cc, dd, aa, F1, K1, in[7], 9);
- ROUND(aa, bb, cc, dd, F1, K1, in[8], 11);
- ROUND(dd, aa, bb, cc, F1, K1, in[9], 13);
- ROUND(cc, dd, aa, bb, F1, K1, in[10], 14);
- ROUND(bb, cc, dd, aa, F1, K1, in[11], 15);
- ROUND(aa, bb, cc, dd, F1, K1, in[12], 6);
- ROUND(dd, aa, bb, cc, F1, K1, in[13], 7);
- ROUND(cc, dd, aa, bb, F1, K1, in[14], 9);
- ROUND(bb, cc, dd, aa, F1, K1, in[15], 8);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 0]), 11);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[ 1]), 14);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[ 2]), 15);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[ 3]), 12);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 4]), 5);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[ 5]), 8);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[ 6]), 7);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[ 7]), 9);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 8]), 11);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[ 9]), 13);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[10]), 14);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[11]), 15);
+ ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[12]), 6);
+ ROUND(dd, aa, bb, cc, F1, K1, le32_to_cpu(in[13]), 7);
+ ROUND(cc, dd, aa, bb, F1, K1, le32_to_cpu(in[14]), 9);
+ ROUND(bb, cc, dd, aa, F1, K1, le32_to_cpu(in[15]), 8);

/* round 2: left lane */
- ROUND(aa, bb, cc, dd, F2, K2, in[7], 7);
- ROUND(dd, aa, bb, cc, F2, K2, in[4], 6);
- ROUND(cc, dd, aa, bb, F2, K2, in[13], 8);
- ROUND(bb, cc, dd, aa, F2, K2, in[1], 13);
- ROUND(aa, bb, cc, dd, F2, K2, in[10], 11);
- ROUND(dd, aa, bb, cc, F2, K2, in[6], 9);
- ROUND(cc, dd, aa, bb, F2, K2, in[15], 7);
- ROUND(bb, cc, dd, aa, F2, K2, in[3], 15);
- ROUND(aa, bb, cc, dd, F2, K2, in[12], 7);
- ROUND(dd, aa, bb, cc, F2, K2, in[0], 12);
- ROUND(cc, dd, aa, bb, F2, K2, in[9], 15);
- ROUND(bb, cc, dd, aa, F2, K2, in[5], 9);
- ROUND(aa, bb, cc, dd, F2, K2, in[2], 11);
- ROUND(dd, aa, bb, cc, F2, K2, in[14], 7);
- ROUND(cc, dd, aa, bb, F2, K2, in[11], 13);
- ROUND(bb, cc, dd, aa, F2, K2, in[8], 12);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[ 7]), 7);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[ 4]), 6);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[13]), 8);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 1]), 13);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[10]), 11);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[ 6]), 9);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[15]), 7);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 3]), 15);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[12]), 7);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[ 0]), 12);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[ 9]), 15);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 5]), 9);
+ ROUND(aa, bb, cc, dd, F2, K2, le32_to_cpu(in[ 2]), 11);
+ ROUND(dd, aa, bb, cc, F2, K2, le32_to_cpu(in[14]), 7);
+ ROUND(cc, dd, aa, bb, F2, K2, le32_to_cpu(in[11]), 13);
+ ROUND(bb, cc, dd, aa, F2, K2, le32_to_cpu(in[ 8]), 12);

/* round 3: left lane */
- ROUND(aa, bb, cc, dd, F3, K3, in[3], 11);
- ROUND(dd, aa, bb, cc, F3, K3, in[10], 13);
- ROUND(cc, dd, aa, bb, F3, K3, in[14], 6);
- ROUND(bb, cc, dd, aa, F3, K3, in[4], 7);
- ROUND(aa, bb, cc, dd, F3, K3, in[9], 14);
- ROUND(dd, aa, bb, cc, F3, K3, in[15], 9);
- ROUND(cc, dd, aa, bb, F3, K3, in[8], 13);
- ROUND(bb, cc, dd, aa, F3, K3, in[1], 15);
- ROUND(aa, bb, cc, dd, F3, K3, in[2], 14);
- ROUND(dd, aa, bb, cc, F3, K3, in[7], 8);
- ROUND(cc, dd, aa, bb, F3, K3, in[0], 13);
- ROUND(bb, cc, dd, aa, F3, K3, in[6], 6);
- ROUND(aa, bb, cc, dd, F3, K3, in[13], 5);
- ROUND(dd, aa, bb, cc, F3, K3, in[11], 12);
- ROUND(cc, dd, aa, bb, F3, K3, in[5], 7);
- ROUND(bb, cc, dd, aa, F3, K3, in[12], 5);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[ 3]), 11);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[10]), 13);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[14]), 6);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[ 4]), 7);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[ 9]), 14);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[15]), 9);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[ 8]), 13);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[ 1]), 15);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[ 2]), 14);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[ 7]), 8);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[ 0]), 13);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[ 6]), 6);
+ ROUND(aa, bb, cc, dd, F3, K3, le32_to_cpu(in[13]), 5);
+ ROUND(dd, aa, bb, cc, F3, K3, le32_to_cpu(in[11]), 12);
+ ROUND(cc, dd, aa, bb, F3, K3, le32_to_cpu(in[ 5]), 7);
+ ROUND(bb, cc, dd, aa, F3, K3, le32_to_cpu(in[12]), 5);

/* round 4: left lane */
- ROUND(aa, bb, cc, dd, F4, K4, in[1], 11);
- ROUND(dd, aa, bb, cc, F4, K4, in[9], 12);
- ROUND(cc, dd, aa, bb, F4, K4, in[11], 14);
- ROUND(bb, cc, dd, aa, F4, K4, in[10], 15);
- ROUND(aa, bb, cc, dd, F4, K4, in[0], 14);
- ROUND(dd, aa, bb, cc, F4, K4, in[8], 15);
- ROUND(cc, dd, aa, bb, F4, K4, in[12], 9);
- ROUND(bb, cc, dd, aa, F4, K4, in[4], 8);
- ROUND(aa, bb, cc, dd, F4, K4, in[13], 9);
- ROUND(dd, aa, bb, cc, F4, K4, in[3], 14);
- ROUND(cc, dd, aa, bb, F4, K4, in[7], 5);
- ROUND(bb, cc, dd, aa, F4, K4, in[15], 6);
- ROUND(aa, bb, cc, dd, F4, K4, in[14], 8);
- ROUND(dd, aa, bb, cc, F4, K4, in[5], 6);
- ROUND(cc, dd, aa, bb, F4, K4, in[6], 5);
- ROUND(bb, cc, dd, aa, F4, K4, in[2], 12);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[ 1]), 11);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 9]), 12);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[11]), 14);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[10]), 15);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[ 0]), 14);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 8]), 15);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[12]), 9);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[ 4]), 8);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[13]), 9);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 3]), 14);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[ 7]), 5);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[15]), 6);
+ ROUND(aa, bb, cc, dd, F4, K4, le32_to_cpu(in[14]), 8);
+ ROUND(dd, aa, bb, cc, F4, K4, le32_to_cpu(in[ 5]), 6);
+ ROUND(cc, dd, aa, bb, F4, K4, le32_to_cpu(in[ 6]), 5);
+ ROUND(bb, cc, dd, aa, F4, K4, le32_to_cpu(in[ 2]), 12);

/* round 1: right lane */
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[5], 8);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[14], 9);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[7], 9);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[0], 11);
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[9], 13);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[2], 15);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[11], 15);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[4], 5);
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[13], 7);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[6], 7);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[15], 8);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[8], 11);
- ROUND(aaa, bbb, ccc, ddd, F4, KK1, in[1], 14);
- ROUND(ddd, aaa, bbb, ccc, F4, KK1, in[10], 14);
- ROUND(ccc, ddd, aaa, bbb, F4, KK1, in[3], 12);
- ROUND(bbb, ccc, ddd, aaa, F4, KK1, in[12], 6);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[ 5]), 8);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[14]), 9);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[ 7]), 9);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[ 0]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[ 9]), 13);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[ 2]), 15);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[11]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[ 4]), 5);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[13]), 7);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[ 6]), 7);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[15]), 8);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[ 8]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F4, KK1, le32_to_cpu(in[ 1]), 14);
+ ROUND(ddd, aaa, bbb, ccc, F4, KK1, le32_to_cpu(in[10]), 14);
+ ROUND(ccc, ddd, aaa, bbb, F4, KK1, le32_to_cpu(in[ 3]), 12);
+ ROUND(bbb, ccc, ddd, aaa, F4, KK1, le32_to_cpu(in[12]), 6);

/* round 2: right lane */
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[6], 9);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[11], 13);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[3], 15);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[7], 7);
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[0], 12);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[13], 8);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[5], 9);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[10], 11);
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[14], 7);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[15], 7);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[8], 12);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[12], 7);
- ROUND(aaa, bbb, ccc, ddd, F3, KK2, in[4], 6);
- ROUND(ddd, aaa, bbb, ccc, F3, KK2, in[9], 15);
- ROUND(ccc, ddd, aaa, bbb, F3, KK2, in[1], 13);
- ROUND(bbb, ccc, ddd, aaa, F3, KK2, in[2], 11);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[ 6]), 9);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[11]), 13);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 3]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[ 7]), 7);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[ 0]), 12);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[13]), 8);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 5]), 9);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[10]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[14]), 7);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[15]), 7);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 8]), 12);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[12]), 7);
+ ROUND(aaa, bbb, ccc, ddd, F3, KK2, le32_to_cpu(in[ 4]), 6);
+ ROUND(ddd, aaa, bbb, ccc, F3, KK2, le32_to_cpu(in[ 9]), 15);
+ ROUND(ccc, ddd, aaa, bbb, F3, KK2, le32_to_cpu(in[ 1]), 13);
+ ROUND(bbb, ccc, ddd, aaa, F3, KK2, le32_to_cpu(in[ 2]), 11);

/* round 3: right lane */
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[15], 9);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[5], 7);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[1], 15);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[3], 11);
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[7], 8);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[14], 6);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[6], 6);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[9], 14);
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[11], 12);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[8], 13);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[12], 5);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[2], 14);
- ROUND(aaa, bbb, ccc, ddd, F2, KK3, in[10], 13);
- ROUND(ddd, aaa, bbb, ccc, F2, KK3, in[0], 13);
- ROUND(ccc, ddd, aaa, bbb, F2, KK3, in[4], 7);
- ROUND(bbb, ccc, ddd, aaa, F2, KK3, in[13], 5);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[15]), 9);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[ 5]), 7);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[ 1]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[ 3]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[ 7]), 8);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[14]), 6);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[ 6]), 6);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[ 9]), 14);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[11]), 12);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[ 8]), 13);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[12]), 5);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[ 2]), 14);
+ ROUND(aaa, bbb, ccc, ddd, F2, KK3, le32_to_cpu(in[10]), 13);
+ ROUND(ddd, aaa, bbb, ccc, F2, KK3, le32_to_cpu(in[ 0]), 13);
+ ROUND(ccc, ddd, aaa, bbb, F2, KK3, le32_to_cpu(in[ 4]), 7);
+ ROUND(bbb, ccc, ddd, aaa, F2, KK3, le32_to_cpu(in[13]), 5);

/* round 4: right lane */
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[8], 15);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[6], 5);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[4], 8);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[1], 11);
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[3], 14);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[11], 14);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[15], 6);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[0], 14);
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[5], 6);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[12], 9);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[2], 12);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[13], 9);
- ROUND(aaa, bbb, ccc, ddd, F1, KK4, in[9], 12);
- ROUND(ddd, aaa, bbb, ccc, F1, KK4, in[7], 5);
- ROUND(ccc, ddd, aaa, bbb, F1, KK4, in[10], 15);
- ROUND(bbb, ccc, ddd, aaa, F1, KK4, in[14], 8);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 8]), 15);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[ 6]), 5);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[ 4]), 8);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[ 1]), 11);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 3]), 14);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[11]), 14);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[15]), 6);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[ 0]), 14);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 5]), 6);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[12]), 9);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[ 2]), 12);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[13]), 9);
+ ROUND(aaa, bbb, ccc, ddd, F1, KK4, le32_to_cpu(in[ 9]), 12);
+ ROUND(ddd, aaa, bbb, ccc, F1, KK4, le32_to_cpu(in[ 7]), 5);
+ ROUND(ccc, ddd, aaa, bbb, F1, KK4, le32_to_cpu(in[10]), 15);
+ ROUND(bbb, ccc, ddd, aaa, F1, KK4, le32_to_cpu(in[14]), 8);

/* combine results */
ddd += cc + state[1]; /* final result for state[0] */
@@ -213,29 +215,11 @@ static void rmd128_transform(u32 *state, u32 const *in)
state[2] = state[3] + aa + bbb;
state[3] = state[0] + bb + ccc;
state[0] = ddd;
-
return;
}

-static inline void le32_to_cpu_array(u32 *buf, unsigned int words)
-{
- while (words--) {
- le32_to_cpus(buf);
- buf++;
- }
-}
-
-static inline void cpu_to_le32_array(u32 *buf, unsigned int words)
-{
- while (words--) {
- cpu_to_le32s(buf);
- buf++;
- }
-}
-
-static inline void rmd128_transform_helper(struct rmd128_ctx *ctx)
+static void rmd128_transform_helper(struct rmd128_ctx *ctx)
{
- le32_to_cpu_array(ctx->buffer, sizeof(ctx->buffer) / sizeof(u32));
rmd128_transform(ctx->state, ctx->buffer);
}

@@ -286,14 +270,15 @@ static void rmd128_update(struct crypto_tfm *tfm, const u8 *data,
}

/* Add padding and return the message digest. */
-static void rmd128_final(struct crypto_tfm *tfm, u8 *out)
+static void rmd128_final(struct crypto_tfm *tfm, u8 *out8)
{
struct rmd128_ctx *rctx = crypto_tfm_ctx(tfm);
u32 index, padlen;
u64 bits;
+ u32 *out = (u32 *)out8;
static const u8 padding[64] = { 0x80, };
- bits = rctx->byte_count << 3;

+ bits = cpu_to_le64(rctx->byte_count << 3);
/* Pad out to 56 mod 64 */
index = rctx->byte_count & 0x3f;
padlen = (index < 56) ? (56 - index) : ((64+56) - index);
@@ -303,7 +288,10 @@ static void rmd128_final(struct crypto_tfm *tfm, u8 *out)
rmd128_update(tfm, (const u8 *)&bits, sizeof(bits));

/* Store state in digest */
- memcpy(out, rctx->state, sizeof(rctx->state));
+ out[0] = cpu_to_le32(rctx->state[0]);
+ out[1] = cpu_to_le32(rctx->state[1]);
+ out[2] = cpu_to_le32(rctx->state[2]);
+ out[3] = cpu_to_le32(rctx->state[3]);

/* Wipe context */
memset(rctx, 0, sizeof(*rctx));
--
1.5.4.3


Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

* Sebastian Siewior | 2008-05-17 10:10:03 [+0200]:

>diff --git a/crypto/rmd128.c b/crypto/rmd128.c
>index 146a167..0d946a3 100644
>--- a/crypto/rmd128.c
>+++ b/crypto/rmd128.c
>-static inline void le32_to_cpu_array(u32 *buf, unsigned int words)
>-{
>- while (words--) {
>- le32_to_cpus(buf);
>- buf++;
>- }
>-}
>-
>-static inline void cpu_to_le32_array(u32 *buf, unsigned int words)
>-{
>- while (words--) {
>- cpu_to_le32s(buf);
>- buf++;
>- }
>-}
>-
>-static inline void rmd128_transform_helper(struct rmd128_ctx *ctx)
>+static void rmd128_transform_helper(struct rmd128_ctx *ctx)
> {
>- le32_to_cpu_array(ctx->buffer, sizeof(ctx->buffer) / sizeof(u32));
> rmd128_transform(ctx->state, ctx->buffer);
> }
Now, before someone asks why it is better to do the endian conversion in
rmd128_transform() instead of in those inline functions, here are some
numbers:

Original code fixed:
~~~~~~~~~~~~~~~~~~~~
testing speed of rmd128
test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 104 cycles/operation, 6 cycles/byte
test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 201 cycles/operation, 3 cycles/byte
test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 161 cycles/operation, 2 cycles/byte
test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 518 cycles/operation, 2 cycles/byte
test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 367 cycles/operation, 1 cycles/byte
test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 331 cycles/operation, 1 cycles/byte
test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 1793 cycles/operation, 1 cycles/byte
test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1048 cycles/operation, 1 cycles/byte
test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1005 cycles/operation, 0 cycles/byte
test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 3493 cycles/operation, 1 cycles/byte
test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 2003 cycles/operation, 0 cycles/byte
test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 1919 cycles/operation, 0 cycles/byte
test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 1904 cycles/operation, 0 cycles/byte
test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 6893 cycles/operation, 1 cycles/byte
test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 3913 cycles/operation, 0 cycles/byte
test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 3745 cycles/operation, 0 cycles/byte
test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 3701 cycles/operation, 0 cycles/byte
test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 13694 cycles/operation, 1 cycles/byte
test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 7732 cycles/operation, 0 cycles/byte
test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 7396 cycles/operation, 0 cycles/byte
test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 7311 cycles/operation, 0 cycles/byte
test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 7305 cycles/operation, 0 cycles/byte

moved cpu_to_le32 into rmd128_transform()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
testing speed of rmd128
test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 103 cycles/operation, 6 cycles/byte
test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 197 cycles/operation, 3 cycles/byte
test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 159 cycles/operation, 2 cycles/byte
test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 510 cycles/operation, 1 cycles/byte
test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 361 cycles/operation, 1 cycles/byte
test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 327 cycles/operation, 1 cycles/byte
test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 1771 cycles/operation, 1 cycles/byte
test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1034 cycles/operation, 1 cycles/byte
test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 992 cycles/operation, 0 cycles/byte
test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 3451 cycles/operation, 1 cycles/byte
test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 1979 cycles/operation, 0 cycles/byte
test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 1896 cycles/operation, 0 cycles/byte
test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 1882 cycles/operation, 0 cycles/byte
test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 6812 cycles/operation, 1 cycles/byte
test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 3864 cycles/operation, 0 cycles/byte
test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 3697 cycles/operation, 0 cycles/byte
test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 3655 cycles/operation, 0 cycles/byte
test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 13533 cycles/operation, 1 cycles/byte
test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 7638 cycles/operation, 0 cycles/byte
test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 7304 cycles/operation, 0 cycles/byte
test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 7219 cycles/operation, 0 cycles/byte
test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 7214 cycles/operation, 0 cycles/byte

Switched from cpu_to_le32 to cpu_to_le32p:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
testing speed of rmd128
test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 122 cycles/operation, 7 cycles/byte
test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 235 cycles/operation, 3 cycles/byte
test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 197 cycles/operation, 3 cycles/byte
test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 609 cycles/operation, 2 cycles/byte
test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 458 cycles/operation, 1 cycles/byte
test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 424 cycles/operation, 1 cycles/byte
test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2106 cycles/operation, 2 cycles/byte
test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1367 cycles/operation, 1 cycles/byte
test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1324 cycles/operation, 1 cycles/byte
test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 4104 cycles/operation, 2 cycles/byte
test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 2625 cycles/operation, 1 cycles/byte
test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 2539 cycles/operation, 1 cycles/byte
test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 2524 cycles/operation, 1 cycles/byte
test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 8099 cycles/operation, 1 cycles/byte
test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 5140 cycles/operation, 1 cycles/byte
test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 4968 cycles/operation, 1 cycles/byte
test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 4924 cycles/operation, 1 cycles/byte
test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 16089 cycles/operation, 1 cycles/byte
test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 10169 cycles/operation, 1 cycles/byte
test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 9826 cycles/operation, 1 cycles/byte
test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 9739 cycles/operation, 1 cycles/byte
test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 9733 cycles/operation, 1 cycles/byte

Sebastian

2008-05-17 08:22:41

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

From: Sebastian Siewior <[email protected]>
Date: Sat, 17 May 2008 10:10:03 +0200

> + ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 0]), 11);

Not to nitpick, but if you use le32_to_cpup() this will allow the
use of little-endian load instructions on powerpc and sparc.

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

* David Miller | 2008-05-17 01:22:35 [-0700]:

>From: Sebastian Siewior <[email protected]>
>Date: Sat, 17 May 2008 10:10:03 +0200
>
>> + ROUND(aa, bb, cc, dd, F1, K1, le32_to_cpu(in[ 0]), 11);
>
>Not to nitpick, but if you use le32_to_cpup() this will allow the
>use of little-endian load instructions on powerpc and sparc.

I know that. Please see my follow up mail with some tiny numbers.
gcc-4.1.1 was used on a mpc8544.

Sebastian

2008-05-17 08:37:45

by Herbert Xu

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

On Sat, May 17, 2008 at 10:27:54AM +0200, Sebastian Siewior wrote:
>
> I know that. Please see my follow up mail with some tiny numbers.
> gcc-4.1.1 was used on a mpc8544.

But what do the numbers look like on other architectures? In
particular, x86-* and sparc64?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

* Herbert Xu | 2008-05-17 16:37:38 [+0800]:

>On Sat, May 17, 2008 at 10:27:54AM +0200, Sebastian Siewior wrote:
>>
>> I know that. Please see my follow up mail with some tiny numbers.
>> gcc-4.1.1 was used on a mpc8544.
>
>But what do the numbers look like on other architectures? In
>particular, x86-* and sparc64?
Since x86-* is little endian it should not change much, except that final
now uses four single copy instructions instead of memcpy() (and this is
not the hot path).
The endian conversion is done exactly the same way in sha1 (which is
very similar to this algorithm).

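(Roughly why x86-* is unaffected, as a sketch assuming the generic byteorder
definitions rather than the exact per-arch headers: on a little-endian kernel
le32_to_cpu()/cpu_to_le32() compile away, while on a big-endian one they become
a byte swap.)

#ifdef __LITTLE_ENDIAN
#define cpu_to_le32(x)	((__force __le32)(u32)(x))	/* no-op on x86 */
#define le32_to_cpu(x)	((__force u32)(__le32)(x))	/* no-op on x86 */
#else	/* big endian, e.g. this powerpc */
#define cpu_to_le32(x)	((__force __le32)swab32((u32)(x)))
#define le32_to_cpu(x)	(swab32((__force u32)(__le32)(x)))
#endif
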
David: would you please be so kind to run a test on sparc machine?

>Cheers,

Sebastian

2008-05-17 09:01:28

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

From: Sebastian Siewior <[email protected]>
Date: Sat, 17 May 2008 10:47:35 +0200

> David: would you please be so kind to run a test on sparc machine?

How do I run the test?

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

* David Miller | 2008-05-17 02:01:22 [-0700]:

>From: Sebastian Siewior <[email protected]>
>Date: Sat, 17 May 2008 10:47:35 +0200
>
>> David: would you please be so kind to run a test on sparc machine?
>
>How do I run the test?

modprobe tcrypt mode=314

do you need / want the three patches or do you convert them yourself?

Sebastian

2008-05-17 09:55:47

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

From: Sebastian Siewior <[email protected]>
Date: Sat, 17 May 2008 11:14:51 +0200

> * David Miller | 2008-05-17 02:01:22 [-0700]:
>
> >From: Sebastian Siewior <[email protected]>
> >Date: Sat, 17 May 2008 10:47:35 +0200
> >
> >> David: would you please be so kind to run a test on sparc machine?
> >
> >How do I run the test?
>
> modprobe tcrypt mode=314
>
> do you need / want the three patches or do you convert them yourself?

I'll use your patch and result posting as a guide and let you
know if I need any help :)

2008-05-17 09:56:32

by Herbert Xu

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

On Sat, May 17, 2008 at 11:14:51AM +0200, Sebastian Siewior wrote:
>
> modprobe tcrypt mode=314
>
> do you need / want the three patches or do you convert them yourself?

If you pull my cryptodev-2.6 tree then you'll be able to run
the above test.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-05-18 21:36:00

by Adrian-Ken Rueegsegger

Subject: [PATCH] [CRYPTO] rmd128: Fix endian problems

This patch is based on Sebastian Siewior's patch and
fixes endian issues making rmd128 work properly on
big-endian machines.


Signed-off-by: Adrian-Ken Rueegsegger <[email protected]>
---

I put the le32_to_cpu call in the ROUND define so the code size is smaller
compared to Sebastian's patch. I also removed the three now obsolete
functions (le32_to_cpu_array, cpu_to_le32_array and rmd128_transform_helper),
which makes the code smaller.
The other changes make rmd128_final more "sha1-like".

I will fix the other RIPEMD modules once consensus is reached on how to
fix the endian issues for rmd128.

Sebastian, would you be so kind to test this patch on PowerPC?

crypto/rmd128.c | 37 +++++++++----------------------------
1 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/crypto/rmd128.c b/crypto/rmd128.c
index 146a167..6125a4d 100644
--- a/crypto/rmd128.c
+++ b/crypto/rmd128.c
@@ -43,7 +43,7 @@ struct rmd128_ctx {
#define F4(x, y, z) (y ^ (z & (x ^ y))) /* z ? x : y */

#define ROUND(a, b, c, d, f, k, x, s) { \
- (a) += f((b), (c), (d)) + (x) + (k); \
+ (a) += f((b), (c), (d)) + le32_to_cpu(x) + (k); \
(a) = rol32((a), (s)); \
}

@@ -217,28 +217,6 @@ static void rmd128_transform(u32 *state, u32 const *in)
return;
}

-static inline void le32_to_cpu_array(u32 *buf, unsigned int words)
-{
- while (words--) {
- le32_to_cpus(buf);
- buf++;
- }
-}
-
-static inline void cpu_to_le32_array(u32 *buf, unsigned int words)
-{
- while (words--) {
- cpu_to_le32s(buf);
- buf++;
- }
-}
-
-static inline void rmd128_transform_helper(struct rmd128_ctx *ctx)
-{
- le32_to_cpu_array(ctx->buffer, sizeof(ctx->buffer) / sizeof(u32));
- rmd128_transform(ctx->state, ctx->buffer);
-}
-
static void rmd128_init(struct crypto_tfm *tfm)
{
struct rmd128_ctx *rctx = crypto_tfm_ctx(tfm);
@@ -271,13 +249,13 @@ static void rmd128_update(struct crypto_tfm *tfm, const u8 *data,
memcpy((char *)rctx->buffer + (sizeof(rctx->buffer) - avail),
data, avail);

- rmd128_transform_helper(rctx);
+ rmd128_transform(rctx->state, rctx->buffer);
data += avail;
len -= avail;

while (len >= sizeof(rctx->buffer)) {
memcpy(rctx->buffer, data, sizeof(rctx->buffer));
- rmd128_transform_helper(rctx);
+ rmd128_transform(rctx->state, rctx->buffer);
data += sizeof(rctx->buffer);
len -= sizeof(rctx->buffer);
}
@@ -289,10 +267,12 @@ static void rmd128_update(struct crypto_tfm *tfm, const u8 *data,
static void rmd128_final(struct crypto_tfm *tfm, u8 *out)
{
struct rmd128_ctx *rctx = crypto_tfm_ctx(tfm);
- u32 index, padlen;
+ u32 i, index, padlen;
u64 bits;
+ u32 *dst = (u32 *)out;
static const u8 padding[64] = { 0x80, };
- bits = rctx->byte_count << 3;
+
+ bits = cpu_to_le64(rctx->byte_count << 3);

/* Pad out to 56 mod 64 */
index = rctx->byte_count & 0x3f;
@@ -303,7 +283,8 @@ static void rmd128_final(struct crypto_tfm *tfm, u8 *out)
rmd128_update(tfm, (const u8 *)&bits, sizeof(bits));

/* Store state in digest */
- memcpy(out, rctx->state, sizeof(rctx->state));
+ for (i = 0; i < 4; i++)
+ dst[i] = cpu_to_le32(rctx->state[i]);

/* Wipe context */
memset(rctx, 0, sizeof(*rctx));
--
1.5.2.5


Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems

* Adrian-Ken Rueegsegger | 2008-05-18 23:35:55 [+0200]:

>I put the le32_to_cpu call in the ROUND define so the code size is smaller
>compared to Sebastian's patch. I also removed the three now obsolete
>functions (le32_to_cpu_array, cpu_to_le32_array and rmd128_transform_helper),
>which makes the code smaller.
Looks nice.

>I will fix the other RIPEMD modules once consensus is reached on how to
>fix the endian issues for rmd128.
cool.

>Sebastian, would you be so kind to test this patch on PowerPC?
Sure. I'll do it once I'm able to and let you know.

Sebastian

Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems

* Adrian-Ken Rueegsegger | 2008-05-18 23:35:55 [+0200]:

>Sebastian, would you be so kind to test this patch on PowerPC?
Acked-by: Sebastian Siewior <[email protected]>

2008-05-19 20:37:56

by Adrian-Ken Rueegsegger

Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems


Sebastian Siewior wrote:
> * Adrian-Ken Rueegsegger | 2008-05-18 23:35:55 [+0200]:
>
>> Sebastian, would you be so kind to test this patch on PowerPC?
> Acked-by: Sebastian Siewior <[email protected]>

Sebastian, thanks for testing :)

If there are no objections I will prepare patches for the other RIPEMD modules and submit all of them in one patch set.

-Adrian

Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems

* Adrian-Ken Rüegsegger | 2008-05-19 22:37:31 [+0200]:

>Sebastian Siewior wrote:
>> * Adrian-Ken Rueegsegger | 2008-05-18 23:35:55 [+0200]:
>>
>>> Sebastian, would you be so kind to test this patch on PowerPC?
>> Acked-by: Sebastian Siewior <[email protected]>
>
>Sebastian, thanks for testing :)
You're welcome.
Any particular reason why you named it rmd... instead of ripemd...?

>If there are no objections I will prepare patches for the other RIPEMD modules and submit all of them in one patch set.
Any objections on moving ripemd.h from include/crypto into crypto?
Herbert?
Since you only use it within crypto/ and you don't need it in userspace
or arch/ I would prefer to have it in crypto.

>-Adrian
Sebastian

2008-05-19 21:24:36

by Adrian-Ken Rueegsegger

Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems

Sebastian Siewior wrote:
> * Adrian-Ken Rüegsegger | 2008-05-19 22:37:31 [+0200]:
>> Sebastian Siewior wrote:
>>> * Adrian-Ken Rueegsegger | 2008-05-18 23:35:55 [+0200]:
>>>> Sebastian, would you be so kind to test this patch on PowerPC?
>>> Acked-by: Sebastian Siewior <[email protected]>
>> Sebastian, thanks for testing :)
> You're welcome.
> Any particular reason why you named it rmd... instead of ripemd...?

Looking at other crypto modules like tgr192 or wp512 I got the impression that names should
be kept short.

>> If there are no objections I will prepare patches for the other RIPEMD modules and submit all of them in one patch set.
> Any objections on moving ripemd.h from include/crypto into crypto?

Not from my side. I put it there because the SHA header file is in the same place.

-Adrian

> Herbert?
> Since you only use it within crypto/ and you don't need it in userspace
> or arch/ I would prefer to have it in crypto.
>
>> -Adrian
> Sebastian

2008-05-20 02:28:44

by Herbert Xu

Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems

On Mon, May 19, 2008 at 11:24:32PM +0200, Adrian-Ken Rüegsegger wrote:
>
> >> If there are no objections I will prepare patches for the other RIPEMD modules and submit all of them in one patch set.
> > Any objections on moving ripemd.h from include/crypto into crypto?
>
> Not from my side. I put it there because the SHA header file is in the same place.

OK it's moved.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-05-20 03:42:24

by Herbert Xu

Subject: Re: [PATCH] [CRYPTO] rmd128: Fix endian problems

On Mon, May 19, 2008 at 10:01:45PM +0200, Sebastian Siewior wrote:
> * Adrian-Ken Rueegsegger | 2008-05-18 23:35:55 [+0200]:
>
> >Sebastian, would you be so kind to test this patch on PowerPC?
> Acked-by: Sebastian Siewior <[email protected]>

Patch applied. Thanks a lot!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-05-21 02:47:28

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my preferred architecture

From: Herbert Xu <[email protected]>
Date: Sat, 17 May 2008 17:56:25 +0800

> If you pull my cryptodev-2.6 tree then you'll be able to run
> the above test.

Performance is significantly increased on Niagara2 by using
the little-endian loads inside the transformation loop, as
expected. The numbers below are taken first before, then after,
applying the patch at the very end of this email.

And this is what I suggested in the first place. I was not
suggesting that the endian-converting preparation loop be retained.
Rather, I was suggesting that the in[] array be accessed with the
special loads.

-------------------- before patch --------------------

[452862.338505] testing speed of rmd128
[452862.354441] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 6064 cycles/operation, 379 cycles/byte
[452862.354535] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 12016 cycles/operation, 187 cycles/byte
[452862.354672] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 10800 cycles/operation, 168 cycles/byte
[452862.354795] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 31584 cycles/operation, 123 cycles/byte
[452862.355098] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 26576 cycles/operation, 103 cycles/byte
[452862.355357] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 24768 cycles/operation, 96 cycles/byte
[452862.355616] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 90112 cycles/operation, 88 cycles/byte
[452862.356482] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 41088 cycles/operation, 40 cycles/byte
[452862.356857] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 44656 cycles/operation, 43 cycles/byte
[452862.357248] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 135312 cycles/operation, 66 cycles/byte
[452862.358413] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 78352 cycles/operation, 38 cycles/byte
[452862.359152] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 86464 cycles/operation, 42 cycles/byte
[452862.359887] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 74336 cycles/operation, 36 cycles/byte
[452862.360543] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 258112 cycles/operation, 63 cycles/byte
[452862.362769] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 164992 cycles/operation, 40 cycles/byte
[452862.364202] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 146704 cycles/operation, 35 cycles/byte
[452862.365472] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 164176 cycles/operation, 40 cycles/byte
[452862.366938] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 470432 cycles/operation, 57 cycles/byte
[452862.371087] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 219328 cycles/operation, 26 cycles/byte
[452862.372977] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 212880 cycles/operation, 25 cycles/byte
[452862.374874] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 237872 cycles/operation, 29 cycles/byte
[452862.376857] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 222240 cycles/operation, 27 cycles/byte

-------------------- after patch --------------------

[453226.216294] testing speed of rmd128
[453226.216322] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 2784 cycles/operation, 174 cycles/byte
[453226.216381] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 5296 cycles/operation, 82 cycles/byte
[453226.216448] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 4336 cycles/operation, 67 cycles/byte
[453226.216506] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 13360 cycles/operation, 52 cycles/byte
[453226.216640] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 9856 cycles/operation, 38 cycles/byte
[453226.216745] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 9008 cycles/operation, 35 cycles/byte
[453226.216842] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 46032 cycles/operation, 44 cycles/byte
[453226.217254] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 28640 cycles/operation, 27 cycles/byte
[453226.217519] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 27808 cycles/operation, 27 cycles/byte
[453226.217777] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 89600 cycles/operation, 43 cycles/byte
[453226.218558] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 54800 cycles/operation, 26 cycles/byte
[453226.219046] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 53168 cycles/operation, 25 cycles/byte
[453226.219519] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 52864 cycles/operation, 25 cycles/byte
[453226.219991] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 176640 cycles/operation, 43 cycles/byte
[453226.221511] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 131008 cycles/operation, 31 cycles/byte
[453226.222592] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 103840 cycles/operation, 25 cycles/byte
[453226.223502] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 102960 cycles/operation, 25 cycles/byte
[453226.224402] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 353760 cycles/operation, 43 cycles/byte
[453226.227424] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 214496 cycles/operation, 26 cycles/byte
[453226.229271] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 207952 cycles/operation, 25 cycles/byte
[453226.231063] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 218960 cycles/operation, 26 cycles/byte
[453226.232922] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 205664 cycles/operation, 25 cycles/byte

diff --git a/crypto/rmd128.c b/crypto/rmd128.c
index 89a535a..9cf1a6d 100644
--- a/crypto/rmd128.c
+++ b/crypto/rmd128.c
@@ -44,7 +44,7 @@ struct rmd128_ctx {
#define F4(x, y, z) (y ^ (z & (x ^ y))) /* z ? x : y */

#define ROUND(a, b, c, d, f, k, x, s) { \
- (a) += f((b), (c), (d)) + le32_to_cpu(x) + (k); \
+ (a) += f((b), (c), (d)) + le32_to_cpup(&(x)) + (k); \
(a) = rol32((a), (s)); \
}
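
For reference, the relevant difference between the two helpers is that
le32_to_cpu() takes a value that has already been loaded, while
le32_to_cpup() takes a pointer, which lets an architecture substitute a
single byte-reversed load for a load-plus-swap. A rough conceptual sketch
follows; these are not the real kernel definitions, which live in the
linux/byteorder and per-arch headers.

/* Value form: the word has already been loaded, so on a big-endian CPU
 * the only option left is to swap it in registers.                      */
static inline u32 sketch_le32_to_cpu(u32 x)
{
	return (x << 24) | ((x & 0xff00) << 8) |
	       ((x >> 8) & 0xff00) | (x >> 24);
}

/* Pointer form: the helper sees the address, so an architecture can
 * override it with one byte-reversed load (lwbrx on powerpc, a
 * little-endian ASI load on sparc64) instead of load-then-swap.  The
 * generic fallback is simply the value form applied to *p.              */
static inline u32 sketch_le32_to_cpup(const u32 *p)
{
	return sketch_le32_to_cpu(*p);
}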


Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

* David Miller | 2008-05-20 19:47:23 [-0700]:

>From: Herbert Xu <[email protected]>
I thought you and Herbert were two different people.

>Performance is significantly increased on Niagara2 by using
>the little-endian loads inside of the transformation loop, as
>expected. The numbers below show the results first before, then after,
>applying the patch at the very end of this email.
That was what I expected as well, but the numbers were different. I
checked the assembly code and I had the le loads but more code. I will
check with a different compiler; maybe it will get better here as well.

>And this is what I suggested in the first place. I was not
>suggesting that the endian-converting preparation loop be retained.
>Rather, I was suggesting that the in[] array be accessed with the
>special loads.
This is what I did. The only difference between this patch and mine is
that I hadn't put the le load into the macro.

Sebastian

2008-05-21 07:11:48

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

From: Sebastian Siewior <[email protected]>
Date: Wed, 21 May 2008 09:09:54 +0200

> That was what I expected as well, but the numbers were different. I
> checked the assembly code and I had the le loads but more code. I will
> check with a different compiler; maybe it will get better here as well.

Your particular powerpc cpu might also do these loads more slowly
than other ones.

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

* David Miller | 2008-05-21 00:11:42 [-0700]:

>From: Sebastian Siewior <[email protected]>
>Date: Wed, 21 May 2008 09:09:54 +0200
>
>> That was what I expected as well, but the numbers were different. I
>> checked the assembly code and I had the le loads but more code. I will
>> check with a different compiler; maybe it will get better here as well.
>
>Your particular powerpc cpu might also do these loads more slowly
>than other ones.
Yes, that could be the case. However, a "manual" swap has three opcodes
here and the le load has one. I should not end up with more code in the
latter case, should I?

Sebastian

2008-05-21 07:36:15

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

From: Sebastian Siewior <[email protected]>
Date: Wed, 21 May 2008 09:20:59 +0200

> Yes, that could be the case. However, a "manual" swap has three opcodes
> here and the le load has one. I should not end up with more code in the
> latter case, should I?

You indeed can, because GCC has less information to work with when the
inline asm that powerpc has for byteswapped loads is used.

For example, only limited addressing modes work with those inline
asms, so gcc has to load addresses into registers.

In fact it's even worse: look at the inline asm in asm-powerpc/byteorder.h.
It always loads the final address into a register, so there is zero
possibility of using indexed addressing modes.

So yes, you should in fact see more code :-)
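
For context, the powerpc helper being referred to looked roughly like the
sketch below in that era (reconstructed from memory, so treat the details
as approximate). The address is tied to a plain register operand, so GCC
cannot fold an offset or index into the load and has to materialise the
full address of each in[] word first, while the open-coded swap after a
normal lwz is typically the three-instruction rlwimi sequence mentioned
above -- hence the 3-vs-1 opcode trade-off.

/* Approximate reconstruction of the asm-powerpc/byteorder.h helper of
 * that time -- for illustration only.                                   */
static inline u32 ld_le32(const volatile u32 *addr)
{
	u32 val;

	/* lwbrx: load word byte-reversed from address 0 + register %1.
	 * Because the complete address must already be in %1, a load of
	 * in[7] costs an extra add to form &in[7], whereas a plain lwz
	 * could have used a reg+offset addressing mode directly.        */
	__asm__ __volatile__("lwbrx %0,0,%1"
			     : "=r" (val)
			     : "r" (addr), "m" (*addr));
	return val;
}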

2008-05-27 11:37:12

by David Miller

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

From: Herbert Xu <[email protected]>
Date: Mon, 26 May 2008 21:05:08 +1000

> On Tue, May 20, 2008 at 07:47:23PM -0700, David Miller wrote:
>
> > -------------------- before patch --------------------
> >
> > [452862.338505] testing speed of rmd128
> > [452862.354441] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 6064 cycles/operation, 379 cycles/byte
>
> > -------------------- after patch --------------------
> >
> > [453226.216294] testing speed of rmd128
> > [453226.216322] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 2784 cycles/operation, 174 cycles/byte
>
> Looks good Dave! I've done the same thing for the other rmd* files
> and for the store on the result. Let me know if this looks OK and
> I'll commit.

Looks ok to me, thanks!

2008-05-27 16:24:15

by Herbert Xu

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

On Tue, May 20, 2008 at 07:47:23PM -0700, David Miller wrote:

> -------------------- before patch --------------------
>
> [452862.338505] testing speed of rmd128
> [452862.354441] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 6064 cycles/operation, 379 cycles/byte

> -------------------- after patch --------------------
>
> [453226.216294] testing speed of rmd128
> [453226.216322] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 2784 cycles/operation, 174 cycles/byte

Looks good Dave! I've done the same thing for the other rmd* files
and for the store on the result. Let me know if this looks OK and
I'll commit.

Sebastian, if you're still seeing worse results on powerpc could you
post the actual numbers with/without this patch?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/crypto/rmd128.c b/crypto/rmd128.c
index 89a535a..1a481df 100644
--- a/crypto/rmd128.c
+++ b/crypto/rmd128.c
@@ -44,7 +44,7 @@ struct rmd128_ctx {
#define F4(x, y, z) (y ^ (z & (x ^ y))) /* z ? x : y */

#define ROUND(a, b, c, d, f, k, x, s) { \
- (a) += f((b), (c), (d)) + le32_to_cpu(x) + (k); \
+ (a) += f((b), (c), (d)) + le32_to_cpup(&(x)) + (k); \
(a) = rol32((a), (s)); \
}

@@ -285,7 +285,7 @@ static void rmd128_final(struct crypto_tfm *tfm, u8 *out)

/* Store state in digest */
for (i = 0; i < 4; i++)
- dst[i] = cpu_to_le32(rctx->state[i]);
+ dst[i] = cpu_to_le32p(&rctx->state[i]);

/* Wipe context */
memset(rctx, 0, sizeof(*rctx));
diff --git a/crypto/rmd160.c b/crypto/rmd160.c
index 136e31f..e9fd5f6 100644
--- a/crypto/rmd160.c
+++ b/crypto/rmd160.c
@@ -47,7 +47,7 @@ struct rmd160_ctx {
#define F5(x, y, z) (x ^ (y | ~z))

#define ROUND(a, b, c, d, e, f, k, x, s) { \
- (a) += f((b), (c), (d)) + le32_to_cpu(x) + (k); \
+ (a) += f((b), (c), (d)) + le32_to_cpup(&(x)) + (k); \
(a) = rol32((a), (s)) + (e); \
(c) = rol32((c), 10); \
}
@@ -329,7 +329,7 @@ static void rmd160_final(struct crypto_tfm *tfm, u8 *out)

/* Store state in digest */
for (i = 0; i < 5; i++)
- dst[i] = cpu_to_le32(rctx->state[i]);
+ dst[i] = cpu_to_le32p(&rctx->state[i]);

/* Wipe context */
memset(rctx, 0, sizeof(*rctx));
diff --git a/crypto/rmd256.c b/crypto/rmd256.c
index 88f2203..b088526 100644
--- a/crypto/rmd256.c
+++ b/crypto/rmd256.c
@@ -44,7 +44,7 @@ struct rmd256_ctx {
#define F4(x, y, z) (y ^ (z & (x ^ y))) /* z ? x : y */

#define ROUND(a, b, c, d, f, k, x, s) { \
- (a) += f((b), (c), (d)) + le32_to_cpu(x) + (k); \
+ (a) += f((b), (c), (d)) + le32_to_cpup(&(x)) + (k); \
(a) = rol32((a), (s)); \
}

@@ -304,7 +304,7 @@ static void rmd256_final(struct crypto_tfm *tfm, u8 *out)

/* Store state in digest */
for (i = 0; i < 8; i++)
- dst[i] = cpu_to_le32(rctx->state[i]);
+ dst[i] = cpu_to_le32p(&rctx->state[i]);

/* Wipe context */
memset(rctx, 0, sizeof(*rctx));
diff --git a/crypto/rmd320.c b/crypto/rmd320.c
index 5b172f8..dba03ec 100644
--- a/crypto/rmd320.c
+++ b/crypto/rmd320.c
@@ -47,7 +47,7 @@ struct rmd320_ctx {
#define F5(x, y, z) (x ^ (y | ~z))

#define ROUND(a, b, c, d, e, f, k, x, s) { \
- (a) += f((b), (c), (d)) + le32_to_cpu(x) + (k); \
+ (a) += f((b), (c), (d)) + le32_to_cpup(&(x)) + (k); \
(a) = rol32((a), (s)) + (e); \
(c) = rol32((c), 10); \
}
@@ -353,7 +353,7 @@ static void rmd320_final(struct crypto_tfm *tfm, u8 *out)

/* Store state in digest */
for (i = 0; i < 10; i++)
- dst[i] = cpu_to_le32(rctx->state[i]);
+ dst[i] = cpu_to_le32p(&rctx->state[i]);

/* Wipe context */
memset(rctx, 0, sizeof(*rctx));

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

* Herbert Xu | 2008-05-26 21:05:08 [+1000]:

>Sebastian, if you're still seeing worse results on powerpc could you
>post the actual numbers with/without this patch?
Sure. I will test it around Monday.

>Thanks,
Sebastian

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

* Herbert Xu | 2008-05-26 21:05:08 [+1000]:

>Sebastian, if you're still seeing worse results on powerpc could you
>post the actual numbers with/without this patch?

le32:
~~~~~
|testing speed of rmd128
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 105 cycles/operation, 6 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 201 cycles/operation, 3 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 161 cycles/operation, 2 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 519 cycles/operation, 2 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 365 cycles/operation, 1 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 329 cycles/operation, 1 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 1798 cycles/operation, 1 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1038 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 994 cycles/operation, 0 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 3503 cycles/operation, 1 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 1981 cycles/operation, 0 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 1896 cycles/operation, 0 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 1881 cycles/operation, 0 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 6914 cycles/operation, 1 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 3870 cycles/operation, 0 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 3698 cycles/operation, 0 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 3654 cycles/operation, 0 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 13736 cycles/operation, 1 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 7649 cycles/operation, 0 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 7305 cycles/operation, 0 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 7215 cycles/operation, 0 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 7210 cycles/operation, 0 cycles/byte
|
|testing speed of rmd160
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 144 cycles/operation, 9 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 276 cycles/operation, 4 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 237 cycles/operation, 3 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 706 cycles/operation, 2 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 552 cycles/operation, 2 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 517 cycles/operation, 2 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2432 cycles/operation, 2 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1671 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1628 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 4731 cycles/operation, 2 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 3211 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 3124 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 3109 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 9332 cycles/operation, 2 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 6290 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 6116 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 6072 cycles/operation, 1 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 18532 cycles/operation, 2 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 12450 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 12102 cycles/operation, 1 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 12011 cycles/operation, 1 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 12006 cycles/operation, 1 cycles/byte
|
|testing speed of rmd256
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 116 cycles/operation, 7 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 217 cycles/operation, 3 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 178 cycles/operation, 2 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 551 cycles/operation, 2 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 399 cycles/operation, 1 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 365 cycles/operation, 1 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 1890 cycles/operation, 1 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1147 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1104 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 3677 cycles/operation, 1 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 2190 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 2104 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 2089 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 7251 cycles/operation, 1 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 4276 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 4104 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 4060 cycles/operation, 0 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 14398 cycles/operation, 1 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 8447 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 8103 cycles/operation, 0 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 8015 cycles/operation, 0 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 8011 cycles/operation, 0 cycles/byte
|
|testing speed of rmd320
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 144 cycles/operation, 9 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 270 cycles/operation, 4 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 231 cycles/operation, 3 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 680 cycles/operation, 2 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 529 cycles/operation, 2 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 493 cycles/operation, 1 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2326 cycles/operation, 2 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1579 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1535 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 4521 cycles/operation, 2 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 3026 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 2939 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 2925 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 8911 cycles/operation, 2 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 5922 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 5748 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 5703 cycles/operation, 1 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 17690 cycles/operation, 2 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 11711 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 11363 cycles/operation, 1 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 11275 cycles/operation, 1 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 11271 cycles/operation, 1 cycles/byte

le32p:
~~~~~~
|testing speed of rmd128
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 124 cycles/operation, 7 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 238 cycles/operation, 3 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 199 cycles/operation, 3 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 613 cycles/operation, 2 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 462 cycles/operation, 1 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 426 cycles/operation, 1 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2118 cycles/operation, 2 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1371 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1328 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 4127 cycles/operation, 2 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 2630 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 2545 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 2531 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 8144 cycles/operation, 1 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 5149 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 4979 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 4936 cycles/operation, 1 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 16176 cycles/operation, 1 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 10187 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 9847 cycles/operation, 1 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 9761 cycles/operation, 1 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 9756 cycles/operation, 1 cycles/byte
|
|testing speed of rmd160
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 161 cycles/operation, 10 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 311 cycles/operation, 4 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 273 cycles/operation, 4 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 796 cycles/operation, 3 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 645 cycles/operation, 2 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 610 cycles/operation, 2 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2737 cycles/operation, 2 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1992 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1949 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 5325 cycles/operation, 2 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 3835 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 3749 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 3734 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 10501 cycles/operation, 2 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 7520 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 7348 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 7305 cycles/operation, 1 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 20853 cycles/operation, 2 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 14892 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 14547 cycles/operation, 1 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 14460 cycles/operation, 1 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 14456 cycles/operation, 1 cycles/byte
|
|testing speed of rmd256
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 129 cycles/operation, 8 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 245 cycles/operation, 3 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 206 cycles/operation, 3 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 626 cycles/operation, 2 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 475 cycles/operation, 1 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 443 cycles/operation, 1 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2155 cycles/operation, 2 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 1418 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1370 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 4194 cycles/operation, 2 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 2717 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 2621 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 2606 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 8271 cycles/operation, 2 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 5318 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 5126 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 5077 cycles/operation, 1 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 16426 cycles/operation, 2 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 10518 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 10134 cycles/operation, 1 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 10037 cycles/operation, 1 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 10033 cycles/operation, 1 cycles/byte
|
|testing speed of rmd320
|test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 167 cycles/operation, 10 cycles/byte
|test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 319 cycles/operation, 4 cycles/byte
|test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 280 cycles/operation, 4 cycles/byte
|test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 809 cycles/operation, 3 cycles/byte
|test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 658 cycles/operation, 2 cycles/byte
|test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 623 cycles/operation, 2 cycles/byte
|test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 2774 cycles/operation, 2 cycles/byte
|test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 2028 cycles/operation, 1 cycles/byte
|test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 1985 cycles/operation, 1 cycles/byte
|test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 5394 cycles/operation, 2 cycles/byte
|test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 3902 cycles/operation, 1 cycles/byte
|test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 3815 cycles/operation, 1 cycles/byte
|test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 3801 cycles/operation, 1 cycles/byte
|test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 10634 cycles/operation, 2 cycles/byte
|test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 7650 cycles/operation, 1 cycles/byte
|test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 7476 cycles/operation, 1 cycles/byte
|test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 7433 cycles/operation, 1 cycles/byte
|test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 21115 cycles/operation, 2 cycles/byte
|test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 15145 cycles/operation, 1 cycles/byte
|test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 14799 cycles/operation, 1 cycles/byte
|test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 14711 cycles/operation, 1 cycles/byte
|test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 14706 cycles/operation, 1 cycles/byte

This is an mpc8544 with gcc-4.1.1. The other powerpc machine I have
available on which I could run a test is a PS3. Unfortunately I have to
suspend this for two weeks.
Arnd told me that the powerpc folks were discussing an index field in
their in/out macros. I will check that once I'm back.

Sebastian

2008-06-03 00:00:26

by Herbert Xu

Subject: Re: [PATCH v2] crypto: rmd128: make it work on my prefered architecture

Hi Sebastian:

On Mon, Jun 02, 2008 at 10:17:39PM +0200, Sebastian Siewior wrote:
>
> This is mpc8544 with gcc-4.1.1. The other powerpc machine I have
> available and could run a test is a ps3. Unfortunately I have to
> suspend this for two weeks.
> Arnd told me, that the powerpc folks were discussing an index field in
> their in/out macros. I check that once I'm back again.

Thanks for the results! There is no doubt that this results in
worse numbers on your platform. However, on balance I would say that
the pointer version is a plus when all factors are considered. As
such I'm going to stick with it.

I agree with you that optimising those macros on powerpc would
be a good course of action.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
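
One possible shape for the powerpc macro optimisation mentioned in the last
two messages, purely as a sketch of the idea and not an existing kernel
interface (the helper name and signature are invented here): expose lwbrx's
base+index form to the compiler, so a fixed base register can be reused and
only the offset varies per access.

/* Hypothetical indexed variant -- not in the kernel, for illustration.
 * lwbrx takes base + index registers, so letting GCC pick both can
 * avoid one address-forming add per in[] access.                        */
static inline u32 ld_le32_idx(const u32 *base, unsigned long byte_off)
{
	u32 val;

	/* "b" keeps the base out of r0, which the RA slot of lwbrx would
	 * otherwise treat as the constant zero; the "memory" clobber is a
	 * conservative way of telling GCC that this asm reads memory.    */
	__asm__("lwbrx %0,%1,%2"
		: "=r" (val)
		: "b" (base), "r" (byte_off)
		: "memory");
	return val;
}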