2010-12-19 13:03:51

by Hamid Nassiby

Subject: crypto accelerator driver problems

Hi All,

In a research project, we have developed a crypto accelerator based on the Xilinx
Virtex5 FPGA family. It is connected to the PC through a PCI-Express slot and is
used by IPSec to offload crypto processing from the CPU. The accelerator only
provides the AES and DES3_EDE algorithms, and I am responsible for its driver.
Much of the driver is modeled on geode_aes.c, which is located in the
"drivers/crypto" subdirectory of the kernel source tree. Both algorithms are
registered as blkciphers, providing the cbc wrapper "cbc(aes)" just like the one
registered in geode_aes. Now, after months of work, the accelerator is ready
(correctness of the hardware operation is assured by direct crypto tests, not by
IPSec) and it is time for the driver to provide IPSec access to the accelerator.
On the first try I could get "ping" through the IPSec tunnel. One end of the
tunnel is equipped with our accelerator and the other end uses the kernel's
native IPSec and built-in AES and DES3_EDE algorithms. Now I am faced with two
problems:

1. Ping stops getting replies for packet sizes greater than 1426 bytes
(ping dest_ip -s 1427). I guessed that it might be an MTU problem, but reducing
the MTU with "ifconfig eth1 mtu xxx" or
"echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc"
does not solve the problem. Also, when I ping each tunnel end from the other end
simultaneously with "ping other_node_ip -i 0.001", the kernel hangs completely.

2. The iperf problem. When I try to measure the throughput of the IPSec gateway
equipped with our accelerator (AES-MD5) using iperf in TCP mode, the kernel
hangs so hard that sometimes even the Magic SysRq key does not respond, so I
could not trace the problem at all. Using iperf in UDP mode works, but I get
"UDP bad checksum" in the dmesg output of the other end of the tunnel (native
IPSec and built-in kernel algorithms).

The two gateways are connected by a cross cable, and no router/switch sits
between them that could cause MTU problems. In my tests pcrypt is not used for
now, and booting the kernel with nosmp (so there is no fear of thread
contention) does not change the situation.

I would appreciate your help in solving this problem. Below are the parts of the
driver that differ from geode_aes.c and might give useful information. If
required, I'll post the complete driver source.
----------------------------------------------------------

static struct crypto_alg mydriver_cbc_alg = {
	.cra_name		= "cbc(aes)",
	.cra_driver_name	= "cbc-aes-mydriver",
	.cra_priority		= 400,
	.cra_flags		= CRYPTO_ALG_TYPE_BLKCIPHER |
				  CRYPTO_ALG_NEED_FALLBACK,
	.cra_init		= fallback_init_blk,
	.cra_exit		= fallback_exit_blk,
	.cra_blocksize		= AES_MIN_BLOCK_SIZE,
	.cra_ctxsize		= sizeof(struct mydriver_aes_op),
	.cra_alignmask		= 15,
	.cra_type		= &crypto_blkcipher_type,
	.cra_module		= THIS_MODULE,
	.cra_list		= LIST_HEAD_INIT(mydriver_cbc_alg.cra_list),
	.cra_u			= {
		.blkcipher = {
			.min_keysize	= AES_MIN_KEY_SIZE,
			.max_keysize	= AES_MIN_KEY_SIZE,
			.setkey		= mydriver_setkey_blk,
			.encrypt	= mydriver_cbc_encrypt,
			.decrypt	= mydriver_cbc_decrypt,
			.ivsize		= AES_IV_LENGTH,
		}
	}
};
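
For context, CRYPTO_ALG_NEED_FALLBACK above means the driver keeps a software
implementation around for key sizes the hardware cannot handle. A minimal
sketch of that plumbing, modeled on geode_aes.c (the op->fallback.blk member
is an assumption about struct mydriver_aes_op, not taken from the posted
driver):

static int fallback_init_blk(struct crypto_tfm *tfm)
{
	const char *name = crypto_tfm_alg_name(tfm);
	struct mydriver_aes_op *op = crypto_tfm_ctx(tfm);

	/* ask the crypto API for a software cbc(aes) to fall back on */
	op->fallback.blk = crypto_alloc_blkcipher(name, 0,
			CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK);
	if (IS_ERR(op->fallback.blk)) {
		printk(KERN_ERR "Error allocating fallback algo %s\n", name);
		return PTR_ERR(op->fallback.blk);
	}
	return 0;
}

static void fallback_exit_blk(struct crypto_tfm *tfm)
{
	struct mydriver_aes_op *op = crypto_tfm_ctx(tfm);

	crypto_free_blkcipher(op->fallback.blk);
	op->fallback.blk = NULL;
}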
//---------------
static int
mydriver_cbc_encrypt(struct blkcipher_desc *desc,
		     struct scatterlist *dst, struct scatterlist *src,
		     unsigned int nbytes)
{
	struct mydriver_aes_op *op = crypto_blkcipher_ctx(desc->tfm);
	struct blkcipher_walk walk;
	int err, ret;

	if (unlikely(op->keylen != AES_KEYSIZE_128))
		return fallback_blk_enc(desc, dst, src, nbytes);

	blkcipher_walk_init(&walk, dst, src, nbytes);
	err = blkcipher_walk_virt(desc, &walk);
	op->iv = walk.iv;

	while ((nbytes = walk.nbytes)) {
		op->src = walk.src.virt.addr;
		op->dst = walk.dst.virt.addr;
		op->mode = AES_MODE_CBC;
		op->len = nbytes - (nbytes % AES_MIN_BLOCK_SIZE);
		op->dir = AES_DIR_ENCRYPT;
		/* ret = mydriver_aes_crypt(op); */
		ret = mydriver_transform(op, 0);
		nbytes -= ret;
		err = blkcipher_walk_done(desc, &walk, nbytes);
	}

	return err;
}
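
For reference, blkcipher_walk_virt() may hand back less than the whole request
on each iteration (it segments at page boundaries), and blkcipher_walk_done()
expects the number of bytes still unprocessed. A minimal sketch of that
contract, assuming the hardware consumes every full block it is given:

	/* walk.nbytes can be smaller than the total request: a 1500-byte
	 * request may arrive as a 1300-byte chunk followed by a 200-byte
	 * chunk, and each chunk must be processed and reported via
	 * blkcipher_walk_done() with the count of bytes left over. */
	while ((nbytes = walk.nbytes)) {
		unsigned int chunk = nbytes - (nbytes % AES_MIN_BLOCK_SIZE);

		/* ... en/decrypt 'chunk' bytes from walk.src to walk.dst ... */

		nbytes -= chunk;	/* bytes NOT processed this round */
		err = blkcipher_walk_done(desc, &walk, nbytes);
	}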
/*--------- mydriver_transform, which builds a buffer containing key, IV and
data, plus an additional header required by our accelerator, writes the buffer
to the accelerator by DMA, and then reads the response back from the
hardware. */

static inline int mydriver_transform(struct mydriver_aes_op *op, int alg)
{
	int req_len, err;
	u8 *req_buf = NULL, *res_buf = NULL;
	alg_operation operation;
	u32 my_req_id;

	if (op->len == 0)
		return 0;

	if ((op->dir == AES_DIR_ENCRYPT) || (op->dir == DES3_DIR_ENCRYPT)) {
		operation = SH_ENCRYPT;
		/* This ID is put into our packet and is checked by each
		   thread when the hardware response is ready, to see
		   whether the packet is its own. */
		my_req_id = smp_processor_id();
	} else {
		operation = SH_DECRYPT;
		/* Uniqueness of the ID does not solve the problem
		   described in this mail :( */
		my_req_id = smp_processor_id() + 64;
	}

	err = create_request(alg, op->mode, operation, htonl(my_req_id),
			     op->key, op->iv, op->src, op->len,
			     &req_buf, &req_len);
	if (err) {
		printk(KERN_EMERG "mydriver_transform: Error CreateRequest: errcode = %d\n",
		       err);
		/* goto error; */
	}

	err = write_request(req_buf, req_len);
	if (err) {
		printk(KERN_EMERG "mydriver_transform: Error WriteRequest: errcode = %d\n",
		       err);
		/* goto error; */
	}
	kfree(req_buf);
	req_buf = NULL;

	err = read_response(&res_buf, /*local_hdr.Length*/ my_req_id);

	memcpy(op->dst, res_buf + sizeof(struct response_hdr), op->len);

	kfree(res_buf);
	res_buf = NULL;
	return op->len;
}
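
A hedged sketch of the error handling this function is missing (the goto error
paths are commented out above): if create_request()'s allocation fails or
read_response() bails out, res_buf stays NULL and the memcpy() dereferences it,
which is consistent with the NULL-pointer oops reported later in this thread.

	err = read_response(&res_buf, my_req_id);
	if (err <= 0 || res_buf == NULL) {
		kfree(res_buf);		/* kfree(NULL) is a no-op */
		return 0;		/* report: no bytes processed */
	}
	memcpy(op->dst, res_buf + sizeof(struct response_hdr), op->len);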
//-----------
/* create_request, which builds the packet for mydriver_transform */
static inline int create_request(int alg, char mode, char enc_dec,
				 u32 request_id,
				 char *key, char *iv, char *data, int datalen,
				 u8 **outbuf, int *outlen)
{
	int req_len, n_padding, keylen, blocklen, algid;
	struct request_hdr *p_hdr;
	char *ptr;

	if (alg == 0) {		/* AES algorithm */
		keylen = 16;
		blocklen = 16;
		algid = 4;
	} else if (alg == 1) {	/* DES3 algorithm */
		keylen = 24;
		blocklen = 8;
		algid = 3;
	}

	req_len = sizeof(struct request_hdr) + keylen;
	if (keylen != 0 && keylen % 16 == 0)
		req_len += 8;	/* for the request packet to be 128-bit aligned */
	if (mode == SHAMS_CBC)
		req_len += blocklen;	/* for the IV */

	/* pad the data to a multiple of the block length */
	n_padding = (blocklen - (datalen % blocklen)) % blocklen;

	req_len += (n_padding + datalen);
	*outbuf = kmalloc(req_len, GFP_ATOMIC);
	p_hdr = (struct request_hdr *) *outbuf;
	*outlen = p_hdr->Length = req_len;

	p_hdr->request_id = request_id;
	p_hdr->AlgID_Mode_EncDec = (enc_dec << 15) | (mode << 12) | algid;
	/* Fill in the key */
	ptr = *outbuf + sizeof(struct request_hdr);
	memcpy(ptr, key, keylen);
	ptr += keylen;
	if (keylen != 0 && keylen % 16 == 0) {
		memset(ptr, 0, 8);
		ptr += 8;
	}
	/* Fill in the IV */
	if (mode == SHAMS_CBC) {
		memcpy(ptr, iv, blocklen);
		ptr += blocklen;
	}
	/* Copy the data */
	memcpy(ptr, data, datalen);
	ptr += datalen;
	/* Zero the padding bytes */
	memset(ptr, 0, n_padding);

	return 0;
}
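
One hedged note on the allocation above: kmalloc() with GFP_ATOMIC can fail
under memory pressure (exactly the "page allocation failure" that shows up
later in this thread), so the result should be checked before p_hdr is written
through. A minimal sketch:

	*outbuf = kmalloc(req_len, GFP_ATOMIC);
	if (*outbuf == NULL)
		return -ENOMEM;	/* mydriver_transform already checks the
				   return code of create_request() */
	p_hdr = (struct request_hdr *) *outbuf;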
//--------------------------------
/* write_request, which writes the provided buffer to the device */

static inline int write_request(u8 *buff, unsigned int count)
{
	unsigned long iflags;
	u32 tlp_count, tlp_size, dwords;
	dma_addr_t dma_addr;
	struct x5pcie_dma_desc *desc_table =
		(struct x5pcie_dma_desc *)global_bar[0];

	/* DMA operations: */
	dma_addr = pci_map_single(global_dev, buff, count, PCI_DMA_TODEVICE);
	if (0 == dma_addr) {
		printk(KERN_EMERG "write_request: Map error.\n");
		return -1;
	}

	/* Do the DMA transfer; the device is programmed in dwords */
	dwords = count / 4;
	for (tlp_size = 32; tlp_size > 0; tlp_size--)
		if ((dwords % tlp_size) == 0) {
			tlp_count = dwords / tlp_size;
			break;
		}

	tlp_size = tlp_count | (tlp_size << 16);
	spin_lock_irqsave(&wlock, iflags);
	/*
	down(&my_sem);
	if (down_interruptible(&my_sem)) {
		printk(KERN_EMERG "\nwrite_request: Error Acquire Semaphore!!");
		return -ERESTARTSYS;
	}
	*/
	/* read DMA TLP count: TLPs to transfer */
	writel(cpu_to_le32(tlp_size), &desc_table->rdmatlpc);
	/* physical bus address of the DMA-able buffer */
	writel(cpu_to_le32(dma_addr), &desc_table->rdmatlpa);
	wmb();
	/* read DMA start bit to ddmacr */
	writew(cpu_to_le16(0x0001), (global_bar[0] + 6));
	wmb();
	while (readw(global_bar[0] + 6) != 0x0101);
	spin_unlock_irqrestore(&wlock, iflags);
	/* up(&my_sem); */
	/* Unmap the DMA buffer (with the byte count it was mapped with)
	   so it is safe for normal access again. */
	pci_unmap_single(global_dev, dma_addr, count, PCI_DMA_TODEVICE);

	/* End of DMA section */
	return 0;
}
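
A hedged sketch of how the unbounded polling loop above could be bounded, so a
wedged device cannot spin a CPU forever with interrupts disabled (the iteration
limit and -EIO choice are assumptions; the register offset and the 0x0101
"done" value are from the posted code):

	unsigned int timeout = 1000000;

	while (readw(global_bar[0] + 6) != 0x0101) {
		if (--timeout == 0) {
			spin_unlock_irqrestore(&wlock, iflags);
			printk(KERN_ERR "write_request: DMA completion timeout\n");
			pci_unmap_single(global_dev, dma_addr, count,
					 PCI_DMA_TODEVICE);
			return -EIO;
		}
		cpu_relax();
	}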
//--------------
/* read_response, which reads the en/decrypted buffer back from the device */

static inline int read_response(u8 **buff, u16 my_req_id)
{
	dma_addr_t dma_addr;
	u16 count, tmp_req_id;
	unsigned long iflags1; /* , iflags2; */
	u32 tlp_count, tlp_size, dwords;
	struct x5pcie_dma_desc *desc_table =
		(struct x5pcie_dma_desc *)global_bar[0];

	for (;;) {
		spin_lock_irqsave(&alock, iflags1);
		tmp_req_id = readw(global_bar[0] + 82 + (fifo_entry * 4));
		spin_unlock_irqrestore(&alock, iflags1);
		if (my_req_id == tmp_req_id)	/* Is this packet mine? */
			break;
	}

	/* What is the size of my packet? */
	count = readw(global_bar[0] + 80 + (fifo_entry * 4));
	printk(KERN_EMERG "read_response: my_req_id = %d has count = %d\n",
	       my_req_id, count);

	*buff = kmalloc(count, GFP_ATOMIC);
	dma_addr = pci_map_single(global_dev, *buff, count,
				  PCI_DMA_FROMDEVICE);
	if (0 == dma_addr) {
		printk(KERN_EMERG "read_response: Map error.\n");
		return -1;
	}

	/* the device is programmed in dwords */
	dwords = count / 4;
	for (tlp_size = 32; tlp_size > 0; tlp_size--)
		if ((dwords % tlp_size) == 0) {
			tlp_count = dwords / tlp_size;
			break;
		}

	tlp_size = tlp_count | (tlp_size << 16);
	/*
	down(&my_sem);
	if (down_interruptible(&my_sem)) {
		printk(KERN_EMERG "\nread_response: Error Acquire Semaphore!!");
		return -ERESTARTSYS;
	}
	*/
	/* write DMA TLP count: TLPs to transfer */
	writel(cpu_to_le32(tlp_size), &desc_table->wdmatlpc);
	/* physical bus address of the DMA-able buffer */
	writel(cpu_to_le32(dma_addr), &desc_table->wdmatlpa);
	wmb();
	/* write DMA start bit to ddmacr */
	writew(cpu_to_le16(0x0001), (global_bar[0] + 4));
	wmb();
	while (readw(global_bar[0] + 4) != 0x0101);

	/* 9: number of registers holding request_id and len of the
	   FIFO's elements */
	fifo_entry = (fifo_entry + 1) % 9;
	/* spin_unlock_irqrestore(&rlock, iflags2); */
	/* up(&my_sem); */
	pci_unmap_single(global_dev, dma_addr, count, PCI_DMA_FROMDEVICE);

	return count;
}


Thanks in advance,
Hamid.


2010-12-21 12:13:40

by Hamid Nassiby

Subject: Fwd: crypto accelerator driver problems

Hi,

As some good news and additional information: with the following patch I no
longer get the "UDP bad checksum" error I mentioned earlier with iperf in UDP
mode. But sometimes I get the following call trace in dmesg after running
iperf in UDP mode more than once (and of course iperf stops transferring data
while it uses 100% of the CPU cycles).



[ 130.171909] mydriver-aes: mydriver Crypto-Engine enabled.
[ 134.767846] NET: Registered protocol family 15
[ 200.031846] iperf: page allocation failure. order:0, mode:0x20
[ 200.031850] Pid: 10935, comm: iperf Tainted: P 2.6.36-zen1 #1
[ 200.031852] Call Trace:
[ 200.031860] [<ffffffff8108ab39>] ? __alloc_pages_nodemask+0x6d3/0x722
[ 200.031864] [<ffffffff810b454f>] ? virt_to_head_page+0x9/0x30
[ 200.031867] [<ffffffff810afac2>] ? alloc_pages_current+0xa5/0xce
[ 200.031869] [<ffffffff810899ad>] ? __get_free_pages+0x9/0x46
[ 200.031872] [<ffffffff8102bbbf>] ? need_resched+0x1a/0x23
[ 200.031876] [<ffffffff811a10ad>] ? blkcipher_walk_next+0x68/0x2d9
[ 200.031882] [<ffffffffa001dad4>] ? mydriver_cbc_encrypt+0x47/0x9c
[mydriver_aes2]
[ 200.031886] [<ffffffff81454789>] ? ipt_do_table+0x5d8/0x619
[ 200.031888] [<ffffffff811a0871>] ? async_encrypt+0x35/0x3a
[ 200.031891] [<ffffffff811a1e0c>] ? eseqiv_givencrypt+0x341/0x389
[ 200.031894] [<ffffffff813b8bb5>] ? __skb_to_sgvec+0x49/0x1ea
[ 200.031897] [<ffffffff813b8d1e>] ? __skb_to_sgvec+0x1b2/0x1ea
[ 200.031899] [<ffffffff811a8fc8>] ? crypto_authenc_givencrypt+0x60/0x7c
[ 200.031902] [<ffffffff814492dd>] ? esp_output+0x320/0x357
[ 200.031905] [<ffffffff814658cd>] ? xfrm_output_resume+0x38d/0x48f
[ 200.031908] [<ffffffff813e1f62>] ? nf_hook_slow+0xc8/0xd9
[ 200.031911] [<ffffffff81416f9f>] ? ip_push_pending_frames+0x2cc/0x328
[ 200.031914] [<ffffffff8143339e>] ? udp_push_pending_frames+0x2c4/0x342
[ 200.031917] [<ffffffff814350ca>] ? udp_sendmsg+0x508/0x600
[ 200.031919] [<ffffffff8102bbbf>] ? need_resched+0x1a/0x23
[ 200.031923] [<ffffffff813b3458>] ? sock_aio_write+0xd5/0xe9
[ 200.031926] [<ffffffff8100340e>] ? apic_timer_interrupt+0xe/0x20
[ 200.031928] [<ffffffff810ba2ea>] ? do_sync_write+0xb0/0xf2
[ 200.031931] [<ffffffff8100864b>] ? sched_clock+0x5/0x8
[ 200.031934] [<ffffffff8119c550>] ? security_file_permission+0x18/0x67
[ 200.031937] [<ffffffff810bac07>] ? vfs_write+0xbc/0x101
[ 200.031939] [<ffffffff810bad08>] ? sys_write+0x45/0x6e
[ 200.031941] [<ffffffff81002a42>] ? system_call_fastpath+0x16/0x1b
[ 200.031942] Mem-Info:
[ 200.031944] Node 0 DMA per-cpu:
[ 200.031946] CPU 0: hi: 0, btch: 1 usd: 0
[ 200.031947] CPU 1: hi: 0, btch: 1 usd: 0
[ 200.031949] CPU 2: hi: 0, btch: 1 usd: 0
[ 200.031950] CPU 3: hi: 0, btch: 1 usd: 0
[ 200.031951] Node 0 DMA32 per-cpu:
[ 200.031953] CPU 0: hi: 186, btch: 31 usd: 30
[ 200.032016] CPU 1: hi: 186, btch: 31 usd: 23
[ 200.032018] CPU 2: hi: 186, btch: 31 usd: 182
[ 200.032019] CPU 3: hi: 186, btch: 31 usd: 171
[ 200.032023] active_anon:248219 inactive_anon:82742 isolated_anon:7
[ 200.032024] active_file:10553 inactive_file:11106 isolated_file:27
[ 200.032025] unevictable:0 dirty:19 writeback:1881 unstable:0
[ 200.032026] free:2536 slab_reclaimable:2970 slab_unreclaimable:6490
[ 200.032026] mapped:19597 shmem:292 pagetables:12316 bounce:0
[ 200.032028] Node 0 DMA free:8012kB min:40kB low:48kB high:60kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15768kB
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
[ 200.032036] lowmem_reserve[]: 0 2002 2002 2002
[ 200.032039] Node 0 DMA32 free:2132kB min:5704kB low:7128kB
high:8556kB active_anon:992876kB inactive_anon:330968kB
active_file:42212kB inactive_file:44424kB unevictable:0kB
isolated(anon):28kB isolated(file):108kB present:2050992kB mlocked:0kB
dirty:76kB writeback:7524kB mapped:78388kB shmem:1168kB
slab_reclaimable:11880kB slab_unreclaimable:25944kB
kernel_stack:2320kB pagetables:49264kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 200.032050] lowmem_reserve[]: 0 0 0 0
[ 200.032059] Node 0 DMA: 1*4kB 1*8kB 0*16kB 0*32kB 1*64kB 0*128kB
1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8012kB
[ 200.032066] Node 0 DMA32: 1*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB
0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2132kB
[ 200.032072] 37527 total pagecache pages
[ 200.032074] 15549 pages in swap cache
[ 200.032075] Swap cache stats: add 72816, delete 57267, find 8267/8477
[ 200.032076] Free swap = 3832196kB
[ 200.032078] Total swap = 4096568kB
[ 200.040499] 523951 pages RAM
[ 200.040501] 9684 pages reserved
[ 200.040502] 231120 pages shared
[ 200.040503] 486710 pages non-shared
[ 200.040514] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 200.040517] IP: [<ffffffffa001d395>] mydriver_transform+0x1a3/0x6a8
[mydriver_aes2]
[ 200.040523] PGD 7c3dd067 PUD 41dc067 PMD 0
[ 200.040526] Oops: 0000 [#1] PREEMPT SMP
[ 200.040528] last sysfs file: /sys/devices/virtual/misc/fuse/dev
[ 200.040530] CPU 0
[ 200.040531] Modules linked in: ctr twofish_generic twofish_x86_64
twofish_common camellia serpent blowfish cast5 xcbc rmd160
sha512_generic sha256_generic crypto_null af_key mydriver_aes2 fuse
nvidia(P) r8169 iTCO_wdt iTCO_vendor_support
[ 200.040542]
[ 200.040544] Pid: 10935, comm: iperf Tainted: P
2.6.36-zen1 #1 EP45-UD3P/EP45-UD3P
[ 200.040546] RIP: 0010:[<ffffffffa001d395>] [<ffffffffa001d395>]
mydriver_transform+0x1a3/0x6a8 [mydriver_aes2]
[ 200.040550] RSP: 0018:ffff880072c5b898 EFLAGS: 00010246
[ 200.040551] RAX: ffff880055a3a030 RBX: 0000000000000680 RCX: 00000000000005f0
[ 200.040553] RDX: 0000000000000680 RSI: 0000000000000000 RDI: ffff880055a3a030
[ 200.040555] RBP: 0000000000000000 R08: 0000000000000680 R09: 0000000000000018
[ 200.040556] R10: 000000007d078004 R11: 0000000000013234 R12: 0000000000000010
[ 200.040558] R13: 0000000000000004 R14: 00000000eaef0000 R15: 00000000000005f0
[ 200.040561] FS: 0000000041767950(0063) GS:ffff880001a00000(0000)
knlGS:0000000000000000
[ 200.040562] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 200.040564] CR2: 0000000000000000 CR3: 000000007409d000 CR4: 00000000000406f0
[ 200.040566] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 200.040568] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 200.040570] Process iperf (pid: 10935, threadinfo ffff880072c5a000,
task ffff880006d09000)
[ 200.040571] Stack:
[ 200.040572] ffff880006d09000 0000000000000000 ffffffff817854f0
0000000000000020
[ 200.040574] <0> 0000000000000000 0000efea72c5b9e8 ffff88007d4a7c58
00000001810afac2
[ 200.040577] <0> 0000000000000000 ffff88004d53dc00 ffff880072c5b901
000000000000000f
[ 200.040580] Call Trace:
[ 200.040585] [<ffffffff8102bbbf>] ? need_resched+0x1a/0x23
[ 200.040588] [<ffffffffa001db0b>] ? mydriver_cbc_encrypt+0x7e/0x9c
[mydriver_aes2]
[ 200.040592] [<ffffffff811a0871>] ? async_encrypt+0x35/0x3a
[ 200.040595] [<ffffffff811a1e0c>] ? eseqiv_givencrypt+0x341/0x389
[ 200.040598] [<ffffffff813b8bb5>] ? __skb_to_sgvec+0x49/0x1ea
[ 200.040600] [<ffffffff813b8d1e>] ? __skb_to_sgvec+0x1b2/0x1ea
[ 200.040603] [<ffffffff811a8fc8>] ? crypto_authenc_givencrypt+0x60/0x7c
[ 200.040607] [<ffffffff814492dd>] ? esp_output+0x320/0x357
[ 200.040610] [<ffffffff814658cd>] ? xfrm_output_resume+0x38d/0x48f
[ 200.040613] [<ffffffff813e1f62>] ? nf_hook_slow+0xc8/0xd9
[ 200.040616] [<ffffffff81416f9f>] ? ip_push_pending_frames+0x2cc/0x328
[ 200.040619] [<ffffffff8143339e>] ? udp_push_pending_frames+0x2c4/0x342
[ 200.040621] [<ffffffff814350ca>] ? udp_sendmsg+0x508/0x600
[ 200.040623] [<ffffffff8102bbbf>] ? need_resched+0x1a/0x23
[ 200.040627] [<ffffffff813b3458>] ? sock_aio_write+0xd5/0xe9
[ 200.040630] [<ffffffff8100340e>] ? apic_timer_interrupt+0xe/0x20
[ 200.040633] [<ffffffff810ba2ea>] ? do_sync_write+0xb0/0xf2
[ 200.040636] [<ffffffff8100864b>] ? sched_clock+0x5/0x8
[ 200.040639] [<ffffffff8119c550>] ? security_file_permission+0x18/0x67
[ 200.040641] [<ffffffff810bac07>] ? vfs_write+0xbc/0x101
[ 200.040643] [<ffffffff810bad08>] ? sys_write+0x45/0x6e
[ 200.040646] [<ffffffff81002a42>] ? system_call_fastpath+0x16/0x1b
[ 200.040647] Code: 83 c0 08 80 7c 24 50 01 75 10 48 89 c7 49 63 cc
48 8b 74 24 48 f3 a4 48 89 f8 48 89 c7 48 8b 74 24 40 41 0f b7 d8 49
63 cf 89 da <f3> a4 4c 8b 2d 62 1d 00 00 48 8b 3d 53 1d 00 00 b1 01 48
8b 74
[ 200.040668] RIP [<ffffffffa001d395>]
mydriver_transform+0x1a3/0x6a8 [mydriver_aes2]
[ 200.040671] RSP <ffff880072c5b898>
[ 200.040672] CR2: 0000000000000000
[ 200.040733] ---[ end trace ae2865df0a025f7d ]---
[ 221.687773] SysRq : Emergency Sync



BUT iperf in TCP mode still has its own problems (the system freezes
with no response).

Thanks in advance,
Hamid.


--- mydriver1 2010-12-21 15:20:17.000000000 +0330
+++ mydriver2 2010-12-21 15:24:18.000000000 +0330
@@ -1,4 +1,3 @@
-
static int
mydriver_cbc_decrypt(struct blkcipher_desc *desc,
struct scatterlist *dst, struct scatterlist *src,
@@ -14,18 +13,17 @@ mydriver_cbc_decrypt(struct blkcipher_desc
err = blkcipher_walk_virt(desc, &walk);
op->iv = walk.iv;

- while((nbytes = walk.nbytes)) {
+
op->src = walk.src.virt.addr,
op->dst = walk.dst.virt.addr;
op->mode = AES_MODE_CBC;
- op->len = nbytes - (nbytes % AES_MIN_BLOCK_SIZE);
+ op->len = nbytes;
op->dir = AES_DIR_DECRYPT;
-
ret = mydriver_transform(op, 0);

nbytes -= ret;
err = blkcipher_walk_done(desc, &walk, nbytes);
- }
+

return err;
}
@@ -45,16 +43,17 @@ mydriver_cbc_encrypt(struct blkcipher_desc
err = blkcipher_walk_virt(desc, &walk);
op->iv = walk.iv;

- while((nbytes = walk.nbytes)) {
+
op->src = walk.src.virt.addr,
op->dst = walk.dst.virt.addr;
op->mode = AES_MODE_CBC;
- op->len = nbytes - (nbytes % AES_MIN_BLOCK_SIZE);
+ op->len = nbytes;
op->dir = AES_DIR_ENCRYPT;
ret = mydriver_transform(op, 0);
nbytes -= ret;
err = blkcipher_walk_done(desc, &walk, nbytes);
- }
+

return err;
}
+


2010-12-30 21:19:03

by Herbert Xu

Subject: Re: Fwd: crypto accelerator driver problems

Hamid Nassiby <[email protected]> wrote:
> Hi,
>
> As some good news and additional information: with the following patch
> I no longer get the "UDP bad checksum" error I mentioned earlier with
> iperf in UDP mode. But sometimes I get the following call trace in dmesg
> after running iperf in UDP mode more than once (and of course iperf
> stops transferring data while it uses 100% of the CPU cycles).
>
>
>
> [ 130.171909] mydriver-aes: mydriver Crypto-Engine enabled.
> [ 134.767846] NET: Registered protocol family 15
> [ 200.031846] iperf: page allocation failure. order:0, mode:0x20
> [ 200.031850] Pid: 10935, comm: iperf Tainted: P 2.6.36-zen1 #1
> [ 200.031852] Call Trace:
> [ 200.031860] [<ffffffff8108ab39>] ? __alloc_pages_nodemask+0x6d3/0x722
> [ 200.031864] [<ffffffff810b454f>] ? virt_to_head_page+0x9/0x30
> [ 200.031867] [<ffffffff810afac2>] ? alloc_pages_current+0xa5/0xce
> [ 200.031869] [<ffffffff810899ad>] ? __get_free_pages+0x9/0x46
> [ 200.031872] [<ffffffff8102bbbf>] ? need_resched+0x1a/0x23
> [ 200.031876] [<ffffffff811a10ad>] ? blkcipher_walk_next+0x68/0x2d9

This means that your box has run out of memory temporarily.
If all errors were handled correctly it should continue at this
point.

> --- mydriver1 2010-12-21 15:20:17.000000000 +0330
> +++ mydriver2 2010-12-21 15:24:18.000000000 +0330
> @@ -1,4 +1,3 @@
> -
> static int
> mydriver_cbc_decrypt(struct blkcipher_desc *desc,
> struct scatterlist *dst, struct scatterlist *src,
> @@ -14,18 +13,17 @@ mydriver_cbc_decrypt(struct blkcipher_desc
> err = blkcipher_walk_virt(desc, &walk);
> op->iv = walk.iv;
>
> - while((nbytes = walk.nbytes)) {
> +

However, your patch removes the error checking (and the loop
condition) which is why it crashes.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2011-01-08 07:39:50

by Hamid Nassiby

Subject: Re: Fwd: crypto accelerator driver problems

On Fri, Dec 31, 2010 at 12:49 AM, Herbert Xu
<[email protected]> wrote:
>
> Hamid Nassiby <[email protected]> wrote:
> > Hi,
> >
> > As some good news and additional information: with the following patch
> > I no longer get the "UDP bad checksum" error I mentioned earlier with
> > iperf in UDP mode. But sometimes I get the following call trace in dmesg
> > after running iperf in UDP mode more than once (and of course iperf
> > stops transferring data while it uses 100% of the CPU cycles).
> >
> >
> >
> > [  130.171909] mydriver-aes: mydriver Crypto-Engine enabled.
> > [  134.767846] NET: Registered protocol family 15
> > [  200.031846] iperf: page allocation failure. order:0, mode:0x20
> > [  200.031850] Pid: 10935, comm: iperf Tainted: P            2.6.36-zen1 #1
> > [  200.031852] Call Trace:
> > [  200.031860]  [<ffffffff8108ab39>] ? __alloc_pages_nodemask+0x6d3/0x722
> > [  200.031864]  [<ffffffff810b454f>] ? virt_to_head_page+0x9/0x30
> > [  200.031867]  [<ffffffff810afac2>] ? alloc_pages_current+0xa5/0xce
> > [  200.031869]  [<ffffffff810899ad>] ? __get_free_pages+0x9/0x46
> > [  200.031872]  [<ffffffff8102bbbf>] ? need_resched+0x1a/0x23
> > [  200.031876]  [<ffffffff811a10ad>] ? blkcipher_walk_next+0x68/0x2d9
>
> This means that your box has run out of memory temporarily.
> If all errors were handled correctly it should continue at this
> point.
>
> > --- mydriver1   2010-12-21 15:20:17.000000000 +0330
> > +++ mydriver2   2010-12-21 15:24:18.000000000 +0330
> > @@ -1,4 +1,3 @@
> > -
> > static int
> > mydriver_cbc_decrypt(struct blkcipher_desc *desc,
> >                  struct scatterlist *dst, struct scatterlist *src,
> > @@ -14,18 +13,17 @@ mydriver_cbc_decrypt(struct blkcipher_desc
> >        err = blkcipher_walk_virt(desc, &walk);
> >        op->iv = walk.iv;
> >
> > -       while((nbytes = walk.nbytes)) {
> > +
>
> However, your patch removes the error checking (and the loop
> condition) which is why it crashes.
>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



Hi Herbert,

First I should note that with the while loop removed, the "UDP bad checksum"
error no longer appears in the dmesg output. Digging deeper into the problem,
it seemed to me that when mydriver_transform returns 0, I should not get any
more bytes (belonging to the previous request) to process in the next iteration
of the while loop. But I see that the behavior is not as it should be (with the
while loop removed, mydriver_transform gets, for example, one 1500-byte request,
processes it and copies it back to the destination; but with the while loop
present it gets the same request as one 1300-byte request, processes it, copies
it back and returns 0, and then gets the remaining 200 bytes of the request in
the second iteration of the while loop, so on the other end of the tunnel I see
"UDP bad checksum"). So I conclude that blkcipher_walk_done behaves strangely
and assigns an incorrect value to walk.nbytes, making the loop iterate one
extra time!


My second note is about our accelerator's architecture and the way we should
utilize it. Our device has several crypto engines built in, so for maximum
utilization we should feed it multiple crypto requests simultaneously (I
intended to do this by using pcrypt), and this is the point at which everything
freezes. From another point of view, I found that if I protect write_request
and read_response in mydriver_transform with a single lock (spin_lock(x) before
write_request and spin_unlock(x) after read_response, as shown in the following
code snippet), I am able to run iperf in TCP mode successfully. This leaves me
uncertain, because in such a situation we only utilize one crypto engine of the
device, each request is followed by its response sequentially, and the
arrangement of requests and responses is never interleaved. So I guess that
handing multiple requests to the device and receiving the responses in a
different order than they were delivered might cause the TCP transfer to
freeze, and here my question arises: if my conclusion is true, SHOULD I change
the driver approach to ablkcipher?


Code snippet showing the way write_request and read_response are protected by
the lock, with which iperf in TCP mode progresses:


static inline int mydriver_transform(struct mydriver_aes_op *op, int alg)
{
	.
	.
	.
	spin_lock_irqsave(&glock, tflag);
	write_request(req_buf, req_len);
	kfree(req_buf);
	req_buf = NULL;
	err = read_response(&res_buf, my_req_id);
	spin_unlock_irqrestore(&glock, tflag);
	if (err == 0) {
		kfree(res_buf);
		res_buf = NULL;
		return 0;
	}

	memcpy(op->dst, res_buf + sizeof(struct response_hdr), op->len);

	kfree(res_buf);
	res_buf = NULL;
	return op->len;
}
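
For what it's worth, a minimal sketch of what the move to the asynchronous
interface could look like with the 2.6.3x-era API (every mydriver_* name below
is a placeholder, not the posted driver): the encrypt entry point only queues
the request and returns -EINPROGRESS, and a completion path notifies the crypto
layer when an engine finishes, so several engines can be kept busy without the
caller spinning.

static struct crypto_alg mydriver_cbc_ablk_alg = {
	.cra_name		= "cbc(aes)",
	.cra_driver_name	= "cbc-aes-mydriver-async",
	.cra_priority		= 400,
	.cra_flags		= CRYPTO_ALG_TYPE_ABLKCIPHER |
				  CRYPTO_ALG_ASYNC,
	.cra_blocksize		= AES_MIN_BLOCK_SIZE,
	.cra_ctxsize		= sizeof(struct mydriver_aes_op),
	.cra_type		= &crypto_ablkcipher_type,
	.cra_module		= THIS_MODULE,
	.cra_u.ablkcipher	= {
		.min_keysize	= AES_MIN_KEY_SIZE,
		.max_keysize	= AES_MIN_KEY_SIZE,
		.ivsize		= AES_IV_LENGTH,
		.setkey		= mydriver_ablk_setkey,
		.encrypt	= mydriver_ablk_encrypt, /* queue, return -EINPROGRESS */
		.decrypt	= mydriver_ablk_decrypt,
	},
};

/* When the hardware finishes a request, the IRQ/completion path calls
 *	req->base.complete(&req->base, 0);
 * which lets IPSec resume that packet while other engines keep working. */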


I'm looking forward to hearing from you soon.
Thanks,

Hamid.

2011-01-26 07:09:41

by Herbert Xu

Subject: Re: Fwd: crypto accelerator driver problems

On Wed, Jan 26, 2011 at 10:26:33AM +0330, Hamid Nassiby wrote:
>
> As you know, I posted my problem again to the crypto list and no one
> answered. Now I emphasize one aspect of the problem as a concept related
> to the IPSec protocol, independent of my problem's nature, and I hope to
> get some guidelines this time. The question is as follows:
> If IPSec delivers IP packets to a hardware crypto accelerator in a
> sequential manner (e.g., packets in order 1, 2, 3, ..., 36, 37, 38, ...)
> and the crypto accelerator possibly returns packets to IPSec out of the
> order in which they entered (e.g., packet 37 is returned to IPSec before
> packet 36, so the order of packets is not the same before entering the
> crypto accelerator and after exiting it), is it possible for this to
> cause any problem?

We do not allow such reordering. All crypto drivers must ensure
ordering within a single tfm. Between different tfms there is no
ordering requirement.
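
A hedged sketch of how asynchronous drivers of this era (talitos, hifn_795x
and friends) typically satisfy that per-tfm ordering rule, using the kernel's
crypto_queue helper (the mydriver_* names are assumptions):

#include <crypto/algapi.h>

struct mydriver_device {
	spinlock_t lock;
	struct crypto_queue queue;	/* FIFO of pending requests */
};

static int mydriver_queue_req(struct mydriver_device *dev,
			      struct ablkcipher_request *req)
{
	unsigned long flags;
	int err;

	spin_lock_irqsave(&dev->lock, flags);
	err = crypto_enqueue_request(&dev->queue, &req->base);
	spin_unlock_irqrestore(&dev->lock, flags);

	/* the completion path dequeues strictly in FIFO order with
	 * crypto_dequeue_request(), so completions are delivered in
	 * submission order and per-tfm ordering is preserved */
	return err;	/* normally -EINPROGRESS */
}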

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2011-01-26 07:17:06

by Hamid Nassiby

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

On Sat, Jan 8, 2011 at 11:09 AM, Hamid Nassiby <[email protected]> wrote:
> [snip]



Hi,

As you know, I posted my problem again to the crypto list and no one answered.
Now I emphasize one aspect of the problem as a concept related to the IPSec
protocol, independent of my problem's nature, and I hope to get some guidelines
this time. The question is as follows:
If IPSec delivers IP packets to a hardware crypto accelerator in a sequential
manner (e.g., packets in order 1, 2, 3, ..., 36, 37, 38, ...) and the crypto
accelerator possibly returns packets to IPSec out of the order in which they
entered (e.g., packet 37 is returned to IPSec before packet 36, so the order of
packets is not the same before entering the crypto accelerator and after
exiting it), is it possible for this to cause any problem?

Thanks in advance,

Hamid.

2011-01-26 07:50:53

by Hamid Nassiby

Subject: Re: Fwd: crypto accelerator driver problems

On Wed, Jan 26, 2011 at 10:39 AM, Herbert Xu
<[email protected]> wrote:
> On Wed, Jan 26, 2011 at 10:26:33AM +0330, Hamid Nassiby wrote:
>>
>> As you know, I posted my problem again to the crypto list and no one
>> answered. Now I emphasize one aspect of the problem as a concept related
>> to the IPSec protocol, independent of my problem's nature, and I hope to
>> get some guidelines this time. The question is as follows:
>> If IPSec delivers IP packets to a hardware crypto accelerator in a
>> sequential manner (e.g., packets in order 1, 2, 3, ..., 36, 37, 38, ...)
>> and the crypto accelerator possibly returns packets to IPSec out of the
>> order in which they entered (e.g., packet 37 is returned to IPSec before
>> packet 36, so the order of packets is not the same before entering the
>> crypto accelerator and after exiting it), is it possible for this to
>> cause any problem?
>
> We do not allow such reordering.  All crypto drivers must ensure
> ordering within a single tfm.  Between different tfms there is no
> ordering requirement.
>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>


Do you mean that different IP packets fit into one single block cipher tfm?
Would you please explain in more detail?

Thanks a lot,

2011-01-26 23:33:17

by Herbert Xu

Subject: Re: Fwd: crypto accelerator driver problems

On Wed, Jan 26, 2011 at 11:20:22AM +0330, Hamid Nassiby wrote:
>
> Do you mean that different IP packets fit into one single block cipher tfm?
> Would you please explain in more detail?

We allocate one tfm per SA. So as long as ordering is guaranteed
per tfm then it's guaranteed per SA, which is all that's needed.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2011-07-05 06:45:39

by Hamid Nassiby

Subject: Re: Fwd: crypto accelerator driver problems

On Thu, Jan 27, 2011 at 3:03 AM, Herbert Xu <[email protected]> wrote:
>
> On Wed, Jan 26, 2011 at 11:20:22AM +0330, Hamid Nassiby wrote:
> >
> > Do you mean that different IP packets fit into one single block cipher tfm?
> > Would you please explain in more detail?
>
> We allocate one tfm per SA.  So as long as ordering is guaranteed
> per tfm then it's guaranteed per SA, which is all that's needed.
>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Dear all,
Referring to my previous posts related to the driver for a hardware AES
accelerator (which is to be used to accelerate IPSec block cipher operations),
I would like to ask you about a possibly algorithmic problem in our solution.
As I said earlier, our driver is inspired by the geode_aes driver, so assume
that we have defined our supported algorithm as:

static struct crypto_alg shams_cbc_alg = {
	.cra_name		= "cbc(aes)",
	.cra_driver_name	= "cbc-aes-mine",
	.cra_priority		= 400,
	.cra_flags		= CRYPTO_ALG_TYPE_BLKCIPHER |
				  CRYPTO_ALG_NEED_FALLBACK,
	.cra_init		= fallback_init_blk,
	.cra_exit		= fallback_exit_blk,
	.cra_blocksize		= AES_MIN_BLOCK_SIZE,
	.cra_ctxsize		= sizeof(struct my_aes_op),
	.cra_alignmask		= 0,
	.cra_type		= &crypto_blkcipher_type,
	.cra_module		= THIS_MODULE,
	.cra_list		= LIST_HEAD_INIT(shams_cbc_alg.cra_list),
	.cra_u			= {
		.blkcipher = {
			.min_keysize	= AES_MIN_KEY_SIZE,
			.max_keysize	= AES_MIN_KEY_SIZE,
			.setkey		= my_setkey_blk,
			.encrypt	= my_cbc_encrypt,
			.decrypt	= my_cbc_decrypt,
			.ivsize		= AES_IV_LENGTH,
		}
	}
};

And our encrypt function, my_cbc_encrypt, looks like:

static int
my_cbc_encrypt(struct blkcipher_desc *desc,
	       struct scatterlist *dst, struct scatterlist *src,
	       unsigned int nbytes)
{
	struct my_aes_op *op = crypto_blkcipher_ctx(desc->tfm);
	struct blkcipher_walk walk;
	int err, ret;
	unsigned long flag1, c2flag;
	u32 my_req_id;

	spin_lock_irqsave(&reqlock, c2flag);
	/* Our request id is sent to the device and then retrieved, to be
	   able to distinguish between device responses. */
	my_req_id = (global_reqid++) % 63000;
	spin_unlock_irqrestore(&reqlock, c2flag);

	if (unlikely(op->keylen != AES_KEYSIZE_128))
		return fallback_blk_enc(desc, dst, src, nbytes);

	blkcipher_walk_init(&walk, dst, src, nbytes);
	err = blkcipher_walk_virt(desc, &walk);
	op->iv = walk.iv;

	while ((nbytes = walk.nbytes)) {
		op->src = walk.src.virt.addr;
		op->dst = walk.dst.virt.addr;
		op->mode = AES_MODE_CBC;
		op->len = nbytes /*- (nbytes % AES_MIN_BLOCK_SIZE)*/;
		op->dir = AES_DIR_ENCRYPT;

		/* Critical PSEUDO code */
		spin_lock_irqsave(&lock1, flag1);
		write_to_device(op, 0, my_req_id);
		spin_unlock_irqrestore(&lock1, flag1);

		spin_lock_irqsave(&lock1, flag1);
		ret = read_from_device(op, 0, my_req_id);
		spin_unlock_irqrestore(&lock1, flag1);
		/* End of Critical PSEUDO code */

		nbytes -= ret;
		err = blkcipher_walk_done(desc, &walk, nbytes);
	}

	return err;
}

As I mentioned earlier, we have multiple AES engines in our hardware, so to
utilize the hardware as much as possible we would like to be able to hand
multiple requests to the device and collect each response as soon as it
becomes ready.

Now look at the section of my_cbc_encrypt commented as "Critical PSEUDO code".
This section hands requests to the device and reads back responses (and is the
damn bottleneck). If we protect the write_to_device and read_from_device calls
with one pair of lock/unlock, as in:

/* Critical PSEUDO code */
spin_lock_irqsave(&lock1, flag1);
write_to_device(op, 0, my_req_id);
ret = read_from_device(op, 0, my_req_id);
spin_unlock_irqrestore(&lock1, flag1);
/* End of Critical PSEUDO code*/

then we have no problem: the system works and IPSec en/decrypts using our
hardware. But ONLY one AES engine of our hardware is utilized; the Good (the
system works), the Bad (only one engine is utilized) and the Ugly (throughput
is not awesome). So we must change the section to:

/* Critical PSEUDO code */
spin_lock_irqsave(&lock1, flag1);
write_to_device(op, 0, my_req_id);
spin_unlock_irqrestore(&lock1, flag1);

spin_lock_irqsave(&glock, t2flag);
ret = read_from_device(op, 0, my_req_id);
spin_unlock_irqrestore(&glock, t2flag);
/* End of Critical PSEUDO code */

and preferably to:

/* Critical PSEUDO code */
/* distinct locks for write_to_device and read_from_device */
spin_lock_irqsave(&lock1, flag1);
write_to_device(op, 0, my_req_id);
spin_unlock_irqrestore(&lock1, flag1);

spin_lock_irqsave(&lock2, flag2);
ret = read_from_device(op, 0, my_req_id);
spin_unlock_irqrestore(&lock2, flag2);
/* End of Critical PSEUDO*/


Here it seems we should have no problem, but as soon as one TCP flow starts,
the system hangs.
Finally, I ask for your guidance on this problem.

Thanks in advance,
Hamid.

2011-07-05 06:53:54

by Herbert Xu

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

On Tue, Jul 05, 2011 at 10:15:08AM +0330, Hamid Nassiby wrote:
>
> and preferably to:
>
> /* Critical PSEUDO code */
> /* distinct locks for write_to_device and read_from_device */
> spin_lock_irqsave(&lock1, flag1);
> write_to_device(op, 0, my_req_id);
> spin_unlock_irqrestore(&lock1, flag1);
>
> spin_lock_irqsave(&lock2, flag2);
> ret = read_from_device(op, 0, my_req_id);
> spin_unlock_irqrestore(&lock2, flag2);
> /* End of Critical PSEUDO*/
>
>
> Here it seems we should have no problem, but as soon as one TCP flow
> starts, the system hangs.

Do you know why it hangs?

It sounds like the problem isn't with the synchronisation itself,
which at worst would produce bogus packets, but with something else in
your code that is leading to the deadlock.

Please enable lockdep and related debugging features to track down
the problem.
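For reference, the debugging features in question map to roughly these
.config options (exact names vary a little between kernel versions):

CONFIG_DEBUG_KERNEL=y
CONFIG_PROVE_LOCKING=y		# lockdep: reports lock-order and IRQ-safety violations
CONFIG_DEBUG_SPINLOCK=y		# catches uninitialised and corrupted spinlocks
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_FRAME_POINTER=y		# readable stack traces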

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2011-10-01 09:08:50

by Hamid Nassiby

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

Hi all,

Referring to my previous posts in the crypto list about our hardware AES
accelerator project, I finally managed to deploy the device under IPSec
successfully. As I mentioned earlier, my driver registers itself in the
kernel as a blkcipher for cbc(aes) as follows:

static struct crypto_alg my_cbc_alg = {
	.cra_name		= "cbc(aes)",
	.cra_driver_name	= "cbc-aes-my",
	.cra_priority		= 400,
	.cra_flags		= CRYPTO_ALG_TYPE_BLKCIPHER |
				  CRYPTO_ALG_NEED_FALLBACK,
	.cra_init		= fallback_init_blk,
	.cra_exit		= fallback_exit_blk,
	.cra_blocksize		= AES_MIN_BLOCK_SIZE,
	.cra_ctxsize		= sizeof(struct my_aes_op),
	.cra_alignmask		= 15,
	.cra_type		= &crypto_blkcipher_type,
	.cra_module		= THIS_MODULE,
	.cra_list		= LIST_HEAD_INIT(my_cbc_alg.cra_list),
	.cra_u			= {
		.blkcipher	= {
			.min_keysize	= AES_MIN_KEY_SIZE,
			.max_keysize	= AES_MIN_KEY_SIZE,
			.setkey		= my_setkey_blk,
			.encrypt	= my_cbc_encrypt,
			.decrypt	= my_cbc_decrypt,
			.ivsize		= AES_IV_LENGTH,
		}
	}
};

And my_cbc_encrypt, as PSEUDO/real code (simplified for presentation), is:

static int
my_cbc_encrypt(struct blkcipher_desc *desc,
	       struct scatterlist *dst, struct scatterlist *src,
	       unsigned int nbytes)
{
	SOME__common_preparation_and_initializations;

	spin_lock_irqsave(&mylock, myflags);
	send_request_to_device(&dev);	/* sends request to device; after
					   processing the request, the
					   device writes the result to
					   the destination */
	while (!readl(complete_flag))
		;	/* wait for a completion flag in the device
			   register space */
	spin_unlock_irqrestore(&mylock, myflags);
}

With the above code I can successfully test an IPSec gateway equipped with
our hardware and get 200Mbps throughput using Iperf. Now I am facing another
problem. As I mentioned earlier, our hardware has 4 AES engines built in;
with the above code I utilize only one of them.
From this point we want to go a step further and utilize more than one AES
engine of our device. The simplest solution appeared to be deploying
pcrypt/padata, written by Steffen Klassert. First instantiate on a dual core
gateway:
	modprobe tcrypt alg="pcrypt(authenc(hmac(md5),cbc(aes)))" type=3
and test again. Running Iperf now gives a very low throughput of about
20Mbps, while dmesg shows the following:

BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000001/10
last function: padata_parallel_worker+0x0/0x80
Pid: 10, comm: kworker/0:1 Not tainted 2.6.37 #1
Call Trace:
[<c03e2d7d>] ? printk+0x18/0x1b
[<c014a2b7>] process_one_work+0x177/0x370
[<c0199980>] ? padata_parallel_worker+0x0/0x80
[<c014c467>] worker_thread+0x127/0x390
[<c014c340>] ? worker_thread+0x0/0x390
[<c014fd74>] kthread+0x74/0x80
[<c014fd00>] ? kthread+0x0/0x80
[<c01033f6>] kernel_thread_helper+0x6/0x10
BUG: scheduling while atomic: kworker/0:1/10/0x00000002
Modules linked in: pcrypt my_aes2 binfmt_misc bridge stp
bnep sco rfcomm l2cap crc16 bluetooth rfkill ppdev acpi_cpufreq mperf
cpufreq_stats cpufreq_conservative cpufreq_ondemand cpufreq_userspace
cpufreq_powersave freq_table pci_slot sbs container video output sbshc battery
iptable_filter ip_tables x_tables decnet ctr twofish_i586 twofish_generic
twofish_common camellia serpent blowfish cast5 aes_i586 aes_generic xcbc rmd160
sha512_generic sha256_generic crypto_null af_key ac lp snd_hda_codec_realtek
snd_hda_intel snd_hda_codec snd_pcm_oss evdev snd_mixer_oss snd_pcm psmouse
serio_raw snd_seq_dummy pcspkr parport_pc parport snd_seq_oss snd_seq_midi
snd_rawmidi snd_seq_midi_event option usb_wwan snd_seq usbserial snd_timer
snd_seq_device button processor iTCO_wdt iTCO_vendor_support snd intel_agp
soundcore intel_gtt snd_page_alloc agpgart shpchp pci_hotplug ext3 jbd mbcache
sr_mod cdrom sd_mod sg ata_generic pata_jmicron ata_piix pata_acpi libata floppy
r8169 mii
scsi_mod uhci_hcd ehci_hcd usbcore thermal fan fuse
Pid: 10, comm: kworker/0:1 Not tainted 2.6.37 #1
Call Trace:
[<c012d459>] __schedule_bug+0x59/0x70
[<c03e3757>] schedule+0x6a7/0xa70
[<c0105bf7>] ? show_trace_log_lvl+0x47/0x60
[<c03e2be9>] ? dump_stack+0x6e/0x75
[<c014a308>] ? process_one_work+0x1c8/0x370
[<c0199980>] ? padata_parallel_worker+0x0/0x80
[<c014c51f>] worker_thread+0x1df/0x390
[<c014c340>] ? worker_thread+0x0/0x390
[<c014fd74>] kthread+0x74/0x80
[<c014fd00>] ? kthread+0x0/0x80
[<c01033f6>] kernel_thread_helper+0x6/0x10

I must emphasize again that the goal of deploying pcrypt/padata is to have
more than one request in flight in our hardware (e.g. on a quad-CPU system
we would have 4 encryption and 4 decryption requests in the hardware at
once). I also tried pcrypt/padata on a single-CPU system with one change in
the pcrypt_init_padata function of pcrypt.c: passing 4 as the max_active
parameter of alloc_workqueue. In fact I called alloc_workqueue as:

alloc_workqueue(name, WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 4);
instead of:
alloc_workqueue(name, WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 1);

But this did not give me 4 concurrent encryption requests.
I know a promising solution might be to switch from the blkcipher to the
ablkcipher scheme, but as we need a quick solution and are pressed for time,
I request your comments on my problem.
Can I solve it with pcrypt/padata through some change in my current
blkcipher en/decrypt functions or in pcrypt itself, or should I
take another way?
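(To make the ablkcipher option concrete: a rough registration sketch,
untested and modeled on other asynchronous drivers of this kernel
generation; my_hw_enqueue, my_ablk_setkey and my_ablk_decrypt are
hypothetical names:)

static int my_ablk_encrypt(struct ablkcipher_request *req)
{
	/* queue the request to the hardware and return immediately;
	 * the ISR later reports the result with:
	 *	req->base.complete(&req->base, err);
	 */
	my_hw_enqueue(req);
	return -EINPROGRESS;
}

static struct crypto_alg my_ablk_alg = {
	.cra_name		= "cbc(aes)",
	.cra_driver_name	= "cbc-aes-my-async",
	.cra_priority		= 400,
	.cra_flags		= CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
	.cra_blocksize		= AES_MIN_BLOCK_SIZE,
	.cra_ctxsize		= sizeof(struct my_aes_op),
	.cra_type		= &crypto_ablkcipher_type,
	.cra_module		= THIS_MODULE,
	.cra_u			= {
		.ablkcipher	= {
			.min_keysize	= AES_MIN_KEY_SIZE,
			.max_keysize	= AES_MIN_KEY_SIZE,
			.ivsize		= AES_IV_LENGTH,
			.setkey		= my_ablk_setkey,
			.encrypt	= my_ablk_encrypt,
			.decrypt	= my_ablk_decrypt,
		}
	}
};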

Please bear in mind that solutions requiring only minor changes to our
current code are strongly preferred, given our limited time.

Thanks in advance,

Hamid.

2011-10-04 07:58:03

by Steffen Klassert

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

On Sat, Oct 01, 2011 at 12:38:19PM +0330, Hamid Nassiby wrote:
>
> And my_cbc_encrypt function as PSEUDO/real code (for simplicity of
> representation) is as:
>
> static int
> my_cbc_encrypt(struct blkcipher_desc *desc,
> struct scatterlist *dst, struct scatterlist *src,
> unsigned int nbytes)
> {
> SOME__common_preparation_and_initializations;
>
> spin_lock_irqsave(&myloc, myflags);
> send_request_to_device(&dev); /*sends request to device. After
> processing request,device writes
> result to destination*/
> while(!readl(complete_flag)); /*here we wait for a flag in
> device register space indicating completion. */
> spin_unlock_irqrestore(&mylock, myflags);
>
>
> }

As I already told you in the private mail, it does not make much sense
to parallelize the crypto layer and then hold a global lock during the
crypto operation. So if you really need this lock, you are much better
off without parallelization.

>
> > With the above code I can successfully test an IPSec gateway equipped
> > with our hardware and get 200Mbps throughput using Iperf. Now I am facing
> > another problem. As I mentioned earlier, our hardware has 4 AES engines
> > built in; with the above code I utilize only one of them.
> > From this point we want to go a step further and utilize more than one
> > AES engine of our device. The simplest solution appeared to be deploying
> > pcrypt/padata, written by Steffen Klassert. First instantiate on a dual
> > core gateway:
> > 	modprobe tcrypt alg="pcrypt(authenc(hmac(md5),cbc(aes)))" type=3
> > and test again. Running Iperf now gives a very low throughput of about
> > 20Mbps, while dmesg shows the following:
> >
> > BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000001/10
> >     last function: padata_parallel_worker+0x0/0x80

This looks like the parallel worker exited in atomic context,
but I can't tell you much more as long as you don't show us your code.

>
> > I must emphasize again that the goal of deploying pcrypt/padata is to
> > have more than one request in flight in our hardware (e.g. on a quad-CPU
> > system we would have 4 encryption and 4 decryption requests in the
> > hardware at once). I also tried pcrypt/padata on a single-CPU system with
> > one change in the pcrypt_init_padata function of pcrypt.c: passing 4 as
> > the max_active parameter of alloc_workqueue. In fact I called
> > alloc_workqueue as:
> >
> > alloc_workqueue(name, WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 4);

This does not make sense. max_active has to be 1 as we have to care about the
order of the work items, so we don't want to have more than one work item
executing at the same time per CPU. And as we run the parallel workers with BHs
off, it is not even possible to execute more than one work item at the same
time per CPU.
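To illustrate the contract with a minimal sketch (my_do_crypto is a
placeholder, not pcrypt code):

static void my_parallel_worker(struct padata_priv *padata)
{
	/* may run on any CPU, with BHs off; execution order across
	 * CPUs is undefined at this point */
	my_do_crypto(padata);		/* placeholder for the real work */

	/* hand the request back to padata, which restores the original
	 * submission order before the serial completion runs */
	padata_do_serial(padata);
}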

2011-10-05 10:04:05

by Hamid Nassiby

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

On Tue, Oct 4, 2011 at 11:27 AM, Steffen Klassert
<[email protected]> wrote:
>
> On Sat, Oct 01, 2011 at 12:38:19PM +0330, Hamid Nassiby wrote:
> >
> > And my_cbc_encrypt, as PSEUDO/real code (simplified for presentation), is:
> >
> > static int
> > my_cbc_encrypt(struct blkcipher_desc *desc,
> >                struct scatterlist *dst, struct scatterlist *src,
> >                unsigned int nbytes)
> > {
> >         SOME__common_preparation_and_initializations;
> >
> >         spin_lock_irqsave(&mylock, myflags);
> >         send_request_to_device(&dev);   /* sends request to device; after
> >                                            processing the request, the
> >                                            device writes the result to
> >                                            the destination */
> >         while (!readl(complete_flag))
> >                 ;       /* wait for a completion flag in the device
> >                            register space */
> >         spin_unlock_irqrestore(&mylock, myflags);
> > }
>
> As I already told you in the private mail, it does not make much sense
> to parallelize the crypto layer and then hold a global lock during the
> crypto operation. So if you really need this lock, you are much better
> off without parallelization.
>
Hi Steffen,
Thanks for your reply :).

It makes sense in two ways:
1. If the request transmit time to the device is much shorter than the
request processing time spent in the device, and the device has more than
one processing engine.

2. It can also be advantageous when the device has only one processing
engine and we have multiple blkcipher requests pending at the entrance port
of the device, because the delay between request entrances into the device
will be shorter. The overall advantage is that our IPSec throughput gets
nearer to our device's bulk encryption throughput. (It is interesting to
note that with our current driver and device configuration, if I test
gateway throughput with traffic belonging to two SAs travelling over the one
link that connects them, I get a rate of about 280Mbps (an 80Mbps increase
compared to one SA's traffic), while our device's bulk processing rate is
about 400Mbps.)

Currently we want to take advantage of the latter case and then extend it.

>
>
>
> >
> > With the above code I can successfully test an IPSec gateway equipped
> > with our hardware and get 200Mbps throughput using Iperf. Now I am facing
> > another problem. As I mentioned earlier, our hardware has 4 AES engines
> > built in; with the above code I utilize only one of them.
> > From this point we want to go a step further and utilize more than one
> > AES engine of our device. The simplest solution appeared to be deploying
> > pcrypt/padata, written by Steffen Klassert. First instantiate on a dual
> > core gateway:
> > 	modprobe tcrypt alg="pcrypt(authenc(hmac(md5),cbc(aes)))" type=3
> > and test again. Running Iperf now gives a very low throughput of about
> > 20Mbps, while dmesg shows the following:
> >
> > BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000001/10
> >     last function: padata_parallel_worker+0x0/0x80
>
> This looks like the parallel worker exited in atomic context,
> but I can't tell you much more as long as you don't show us your code.

OK, I presented the code as PSEUDO code just to simplify and focus on the
problem's aspects ;) (but it is also possible that I focused it the wrong
way :D).
This is the my_cbc_encrypt code and the functions it calls, bottom-up:

int write_request(u8 *buff, unsigned int count)
{
	u32 tlp_size = 32;
	struct my_dma_desc *desc_table = (struct my_dma_desc *)global_bar[0];

	tlp_size = (count / 128) | (tlp_size << 16);
	memcpy(g_mydev->rdmaBuf_va, buff, count);
	wmb();

	writel(cpu_to_le32(tlp_size), &desc_table->wdmaperf);
	wmb();

	while ((readl(&desc_table->ddmacr) | 0xFFFF0000) != 0xFFFF0101)
		; /* wait for transfer completion */
	return 0;
}

int my_transform(struct my_aes_op *op, int alg)
{
	int req_len, err;
	unsigned long tflag;
	u8 *req_buf = NULL;
	alg_operation operation;

	if (op->len == 0)
		return 0;
	operation = !(op->dir);

	create_request(alg, op->mode, operation, 0, op->key,
		       op->iv, op->src, op->len, &req_buf, &req_len);
	/* adds a header to the original request and copies it to req_buf */

	spin_lock_irqsave(&glock, tflag);

	err = write_request(req_buf, req_len); /* req_buf is sent to the device;
						  the device en/decrypts it and
						  writes the result to a fixed
						  DMA-mapped address */
	if (err) {
		printk(KERN_EMERG "Error WriteRequest: errcode=%d\n", err);
		/* handle exception (never occurred) */
	}
	kfree(req_buf);
	req_buf = NULL;

	memcpy(op->dst, g_mydev->wdmaBuf_va, op->len); /* copy result from fixed
							  coherent DMA-mapped
							  memory to the
							  destination */
	spin_unlock_irqrestore(&glock, tflag);

	return op->len;
}

static int
my_cbc_encrypt(struct blkcipher_desc *desc,
	       struct scatterlist *dst, struct scatterlist *src,
	       unsigned int nbytes)
{
	struct my_aes_op *op = crypto_blkcipher_ctx(desc->tfm);
	struct blkcipher_walk walk;
	int err, ret;

	if (unlikely(op->keylen != AES_KEYSIZE_128))
		return fallback_blk_enc(desc, dst, src, nbytes);

	blkcipher_walk_init(&walk, dst, src, nbytes);
	err = blkcipher_walk_virt(desc, &walk);
	op->iv = walk.iv;

	while ((nbytes = walk.nbytes)) {
		op->src = walk.src.virt.addr;
		op->dst = walk.dst.virt.addr;
		op->mode = AES_MODE_CBC;
		op->len = nbytes /*- (nbytes % AES_MIN_BLOCK_SIZE)*/;
		op->dir = AES_DIR_ENCRYPT;

		ret = my_transform(op, 0);
		nbytes -= ret;
		err = blkcipher_walk_done(desc, &walk, nbytes);
	}

	return err;
}
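(For what it's worth, the reason glock must currently span both
write_request and the copy-back is that every result lands in the single
wdmaBuf_va buffer. If the device could take a per-request response address,
which would need a gateware change, and if the transform ran in a context
that is allowed to sleep, the lock could shrink to the submission. A rough
sketch; op->done, RESP_ADDR_LO and g_mydev->pdev are hypothetical:)

int my_transform_own_buf(struct my_aes_op *op, u8 *req_buf, int req_len)
{
	dma_addr_t res_dma;
	void *res_va;
	unsigned long flags;

	res_va = dma_alloc_coherent(&g_mydev->pdev->dev, op->len,
				    &res_dma, GFP_ATOMIC);
	if (!res_va)
		return -ENOMEM;

	init_completion(&op->done);	/* new struct completion in my_aes_op */

	spin_lock_irqsave(&glock, flags);
	writel(lower_32_bits(res_dma),
	       (void __iomem *)global_bar[0] + RESP_ADDR_LO);
	write_request(req_buf, req_len);	/* submission only */
	spin_unlock_irqrestore(&glock, flags);

	wait_for_completion(&op->done);		/* ISR completes us */
	memcpy(op->dst, res_va, op->len);	/* private buffer: no lock */

	dma_free_coherent(&g_mydev->pdev->dev, op->len, res_va, res_dma);
	return op->len;
}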

>
> >
> > I must emphasize again that the goal of deploying pcrypt/padata is to
> > have more than one request in flight in our hardware (e.g. on a quad-CPU
> > system we would have 4 encryption and 4 decryption requests in the
> > hardware at once). I also tried pcrypt/padata on a single-CPU system with
> > one change in the pcrypt_init_padata function of pcrypt.c: passing 4 as
> > the max_active parameter of alloc_workqueue. In fact I called
> > alloc_workqueue as:
> >
> > alloc_workqueue(name, WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 4);
>
> This does not make sense. max_active has to be 1 as we have to care about the
> order of the work items, so we don't want to have more than one work item
> executing at the same time per CPU. And as we run the parallel workers with BHs
> off, it is not even possible to execute more than one work item at the same
> time per CPU.
>

Did you turn BHs off to prevent deadlocks between your workqueues and the
network softirqs?
If there is anything else that would help, I would be pleased to hear it.

Thanks.

2011-10-11 09:42:29

by Steffen Klassert

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

On Wed, Oct 05, 2011 at 01:33:33PM +0330, Hamid Nassiby wrote:
>
> OK, I presented the code as PSEUDO code just to simplify and focus on the
> problem's aspects ;) (but it is also possible that I focused it the wrong
> way :D).
> This is the my_cbc_encrypt code and the functions it calls, bottom-up:
>
> int write_request(u8 *buff, unsigned int count)
> {
>         u32 tlp_size = 32;
>         struct my_dma_desc *desc_table = (struct my_dma_desc *)global_bar[0];
>
>         tlp_size = (count / 128) | (tlp_size << 16);
>         memcpy(g_mydev->rdmaBuf_va, buff, count);
>         wmb();
>
>         writel(cpu_to_le32(tlp_size), &desc_table->wdmaperf);
>         wmb();
>
>         while ((readl(&desc_table->ddmacr) | 0xFFFF0000) != 0xFFFF0101)
>                 ; /* wait for transfer completion */
>         return 0;
> }
>
> int my_transform(struct my_aes_op *op, int alg)
> {
>         int req_len, err;
>         unsigned long tflag;
>         u8 *req_buf = NULL;
>         alg_operation operation;
>
>         if (op->len == 0)
>                 return 0;
>         operation = !(op->dir);
>
>         create_request(alg, op->mode, operation, 0, op->key,
>                        op->iv, op->src, op->len, &req_buf, &req_len);
>         /* adds a header to the original request and copies it to req_buf */
>
>         spin_lock_irqsave(&glock, tflag);
>
>         err = write_request(req_buf, req_len); /* req_buf is sent to the
>                                                   device; the device
>                                                   en/decrypts it and writes
>                                                   the result to a fixed
>                                                   DMA-mapped address */
>         if (err) {
>                 printk(KERN_EMERG "Error WriteRequest: errcode=%d\n", err);
>                 /* handle exception (never occurred) */
>         }
>         kfree(req_buf);
>         req_buf = NULL;
>
>         memcpy(op->dst, g_mydev->wdmaBuf_va, op->len); /* copy result from
>                                                           fixed coherent
>                                                           DMA-mapped memory
>                                                           to the destination */
>         spin_unlock_irqrestore(&glock, tflag);
>
>         return op->len;
> }
>
> static int
> my_cbc_encrypt(struct blkcipher_desc *desc,
>                struct scatterlist *dst, struct scatterlist *src,
>                unsigned int nbytes)
> {
>         struct my_aes_op *op = crypto_blkcipher_ctx(desc->tfm);
>         struct blkcipher_walk walk;
>         int err, ret;
>
>         if (unlikely(op->keylen != AES_KEYSIZE_128))
>                 return fallback_blk_enc(desc, dst, src, nbytes);
>
>         blkcipher_walk_init(&walk, dst, src, nbytes);
>         err = blkcipher_walk_virt(desc, &walk);
>         op->iv = walk.iv;
>
>         while ((nbytes = walk.nbytes)) {
>                 op->src = walk.src.virt.addr;
>                 op->dst = walk.dst.virt.addr;
>                 op->mode = AES_MODE_CBC;
>                 op->len = nbytes /*- (nbytes % AES_MIN_BLOCK_SIZE)*/;
>                 op->dir = AES_DIR_ENCRYPT;
>
>                 ret = my_transform(op, 0);
>                 nbytes -= ret;
>                 err = blkcipher_walk_done(desc, &walk, nbytes);
>         }
>
>         return err;
> }
>

I can't tell much from this code snippet. One guess would be that someone
(maybe you) has set the CRYPTO_TFM_REQ_MAY_SLEEP flag, as blkcipher_walk_done
calls crypto_yield(), which in turn may call schedule() if this flag is set.
pcrypt removes this flag explicitly.
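For reference, crypto_yield() is essentially this (include/crypto/algapi.h):

static inline void crypto_yield(u32 flags)
{
	if (flags & CRYPTO_TFM_REQ_MAY_SLEEP)
		cond_resched();
}

so a stray CRYPTO_TFM_REQ_MAY_SLEEP turns every blkcipher_walk_done() into a
potential sleep inside an atomic section.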

>
> > Did you turn BHs off to prevent deadlocks between your workqueues and the
> > network softirqs?
> > If there is anything else that would help, I would be pleased to hear it.
>

Basically, the bottom halves are off to keep up with the network softirqs.
They run at much higher priority and would interrupt the parallel workers
frequently.

2011-10-15 11:26:30

by Hamid Nassiby

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

On 10/11/11, Steffen Klassert <[email protected]> wrote:

>
> I can't tell much from this code snippet. One guess would be that someone
> (maybe you) has set the CRYPTO_TFM_REQ_MAY_SLEEP flag, as blkcipher_walk_done
> calls crypto_yield(), which in turn may call schedule() if this flag is set.
> pcrypt removes this flag explicitly.
>

I've not set such a flag.

>
> Basically, the bottom halves are off to keep up with the network softirqs.
> They run at much higher priority and would interrupt the parallel workers
> frequently.
>

Do you mean that with BHs on, we would only see some performance degradation?

Thanks for your reply.
Any other ideas?

2013-04-25 03:50:04

by Vakul Garg

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

Herbert Xu <herbert <at> gondor.apana.org.au> writes:

>
> On Wed, Jan 26, 2011 at 10:26:33AM +0330, Hamid Nassiby wrote:
> >
> > As you know, I posted my problem again to the crypto list and no one
> > answered. Now I emphasize one aspect of the problem as a concept related
> > to the IPSec protocol, independent of my problem's nature, and I hope to
> > get some guidelines this time. The question is the following:
> > If IPSec delivers IP packets to a hardware crypto accelerator in
> > sequential order (e.g. packets in order 1, 2, 3, ..., 36, 37, 38, ...)
> > and the crypto accelerator possibly returns packets to IPSec out of the
> > order in which they entered (e.g. packet 37 is returned to IPSec before
> > packet 36, so the order of packets is not the same before entering the
> > crypto accelerator and after exiting it), can this cause any problem?
>
> We do not allow such reordering. All crypto drivers must ensure
> ordering within a single tfm. Between different tfms there is no
> ordering requirement.
>
> Cheers,


Hello Herbert,

Does this mean that processing of all the crypto requests from a single tfm
must be serialized even if they execute on multiple different cores?

Regards

2013-04-25 09:31:21

by Herbert Xu

[permalink] [raw]
Subject: Re: Fwd: crypto accelerator driver problems

Vakul Garg <[email protected]> wrote:
> Herbert Xu <herbert <at> gondor.apana.org.au> writes:
>
>>
>> On Wed, Jan 26, 2011 at 10:26:33AM +0330, Hamid Nassiby wrote:
>> >
>> > As you know, I posted my problem again to the crypto list and no one
>> > answered. Now I emphasize one aspect of the problem as a concept related
>> > to the IPSec protocol, independent of my problem's nature, and I hope to
>> > get some guidelines this time. The question is the following:
>> > If IPSec delivers IP packets to a hardware crypto accelerator in
>> > sequential order (e.g. packets in order 1, 2, 3, ..., 36, 37, 38, ...)
>> > and the crypto accelerator possibly returns packets to IPSec out of the
>> > order in which they entered (e.g. packet 37 is returned to IPSec before
>> > packet 36, so the order of packets is not the same before entering the
>> > crypto accelerator and after exiting it), can this cause any problem?
>>
>> We do not allow such reordering. All crypto drivers must ensure
>> ordering within a single tfm. Between different tfms there is no
>> ordering requirement.
>>
>> Cheers,
>
>
> Hello Herbert,
>
> Does this mean that processing of all the crypto requests from a single tfm
> must be serialized even if they execute on multiple different cores?

Correct. Conceptually a single tfm is like a thread: if one wants
parallelism, then multiple tfms should be used. Of course there are
exceptions, such as pcrypt.
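(Illustratively, a consumer wanting parallelism would therefore allocate one
tfm per worker rather than sharing one; a minimal sketch, not from this
thread:)

/* one synchronous cbc(aes) tfm per worker: ordering is only guaranteed
 * within a tfm, so independent tfms may be processed in parallel */
static int alloc_worker_tfms(struct crypto_blkcipher *tfms[], int n)
{
	int i;

	for (i = 0; i < n; i++) {
		tfms[i] = crypto_alloc_blkcipher("cbc(aes)", 0, 0);
		if (IS_ERR(tfms[i]))
			return PTR_ERR(tfms[i]);
	}
	return 0;
}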

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt