2015-03-03 13:21:49

by Martin Hicks

Subject: [PATCH v2 0/5] crypto: talitos: Add crypto async queue handling

I was testing dm-crypt performance on a Freescale P1022 board with
a recent kernel and was getting IO errors while testing with LUKS.
Investigation showed that all hardware FIFO slots were filling up and
the driver was returning EAGAIN to the block layer, which is not an
expected response from an async crypto implementation.
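
For reference, this is roughly what an async caller such as dm-crypt
expects from the return code (illustrative sketch only, assuming an
already-prepared ablkcipher request; not actual dm-crypt code):

        ret = crypto_ablkcipher_encrypt(req);
        switch (ret) {
        case 0:
                /* completed synchronously */
                break;
        case -EINPROGRESS:
                /* accepted; the completion callback reports the final status */
                break;
        case -EBUSY:
                /* backlogged; only valid when CRYPTO_TFM_REQ_MAY_BACKLOG is
                 * set, and the callback first fires with -EINPROGRESS once
                 * the request is actually taken by the driver */
                break;
        default:
                /* anything else, e.g. -EAGAIN, is treated as a hard error */
                break;
        }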

The following patch series adds a few small fixes, and reworks the
submission path to use the crypto_queue mechanism to handle the
request backlog.

Changes since v1:

- Ran checkpatch.pl
- Split the path for submitting new requests vs. issuing backlogged
requests.
- Avoid enqueuing a submitted request to the crypto queue unnecessarily.
- Fix return paths where CRYPTO_TFM_REQ_MAY_BACKLOG is not set.


Martin Hicks (5):
crypto: talitos: Simplify per-channel initialization
crypto: talitos: Remove MD5_BLOCK_SIZE
crypto: talitos: Fix off-by-one and use all hardware slots
crypto: talitos: Reorganize request submission data structures
crypto: talitos: Add software backlog queue handling

drivers/crypto/talitos.c | 240 +++++++++++++++++++++++++++-------------------
drivers/crypto/talitos.h | 44 +++++++--
2 files changed, 177 insertions(+), 107 deletions(-)

--
1.7.10.4


2015-03-03 13:21:51

by Martin Hicks

Subject: [PATCH v2 2/5] crypto: talitos: Remove MD5_BLOCK_SIZE

This is properly defined in the md5 header file.
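
For reference, the constant used instead comes from include/crypto/md5.h:

        #define MD5_HMAC_BLOCK_SIZE     64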

Signed-off-by: Martin Hicks <[email protected]>
---
drivers/crypto/talitos.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index c49d977..89cf4d5 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -637,8 +637,6 @@ static void talitos_unregister_rng(struct device *dev)
#define TALITOS_MAX_KEY_SIZE 96
#define TALITOS_MAX_IV_LENGTH 16 /* max of AES_BLOCK_SIZE, DES3_EDE_BLOCK_SIZE */

-#define MD5_BLOCK_SIZE 64
-
struct talitos_ctx {
struct device *dev;
int ch;
@@ -2195,7 +2193,7 @@ static struct talitos_alg_template driver_algs[] = {
.halg.base = {
.cra_name = "md5",
.cra_driver_name = "md5-talitos",
- .cra_blocksize = MD5_BLOCK_SIZE,
+ .cra_blocksize = MD5_HMAC_BLOCK_SIZE,
.cra_flags = CRYPTO_ALG_TYPE_AHASH |
CRYPTO_ALG_ASYNC,
}
@@ -2285,7 +2283,7 @@ static struct talitos_alg_template driver_algs[] = {
.halg.base = {
.cra_name = "hmac(md5)",
.cra_driver_name = "hmac-md5-talitos",
- .cra_blocksize = MD5_BLOCK_SIZE,
+ .cra_blocksize = MD5_HMAC_BLOCK_SIZE,
.cra_flags = CRYPTO_ALG_TYPE_AHASH |
CRYPTO_ALG_ASYNC,
}
--
1.7.10.4

2015-03-03 13:21:54

by Martin Hicks

Subject: [PATCH v2 4/5] crypto: talitos: Reorganize request submission data structures

This is preparatory work for moving to the crypto async queue handling
code. A talitos_request structure is embedded in each talitos_edesc so
that, when talitos_submit() is called, everything required to defer the
submission to the hardware is contained within the talitos_edesc.

Signed-off-by: Martin Hicks <[email protected]>
---
drivers/crypto/talitos.c | 95 +++++++++++++++-------------------------------
drivers/crypto/talitos.h | 41 +++++++++++++++++---
2 files changed, 66 insertions(+), 70 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 7709805..b0c85ce 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -186,22 +186,16 @@ static int init_device(struct device *dev)
* talitos_submit - submits a descriptor to the device for processing
* @dev: the SEC device to be used
* @ch: the SEC device channel to be used
- * @desc: the descriptor to be processed by the device
- * @callback: whom to call when processing is complete
- * @context: a handle for use by caller (optional)
+ * @edesc: the descriptor to be processed by the device
*
* desc must contain valid dma-mapped (bus physical) address pointers.
* callback must check err and feedback in descriptor header
* for device processing status.
*/
-int talitos_submit(struct device *dev, int ch, struct talitos_desc *desc,
- void (*callback)(struct device *dev,
- struct talitos_desc *desc,
- void *context, int error),
- void *context)
+int talitos_submit(struct device *dev, int ch, struct talitos_edesc *edesc)
{
struct talitos_private *priv = dev_get_drvdata(dev);
- struct talitos_request *request;
+ struct talitos_request *request = &edesc->req;
unsigned long flags;
int head;

@@ -214,19 +208,15 @@ int talitos_submit(struct device *dev, int ch, struct talitos_desc *desc,
}

head = priv->chan[ch].head;
- request = &priv->chan[ch].fifo[head];
-
- /* map descriptor and save caller data */
- request->dma_desc = dma_map_single(dev, desc, sizeof(*desc),
+ request->dma_desc = dma_map_single(dev, request->desc,
+ sizeof(*request->desc),
DMA_BIDIRECTIONAL);
- request->callback = callback;
- request->context = context;

/* increment fifo head */
priv->chan[ch].head = (priv->chan[ch].head + 1) & (priv->fifo_len - 1);

smp_wmb();
- request->desc = desc;
+ priv->chan[ch].fifo[head] = request;

/* GO! */
wmb();
@@ -247,15 +237,16 @@ EXPORT_SYMBOL(talitos_submit);
static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
{
struct talitos_private *priv = dev_get_drvdata(dev);
- struct talitos_request *request, saved_req;
+ struct talitos_request *request;
unsigned long flags;
int tail, status;

spin_lock_irqsave(&priv->chan[ch].tail_lock, flags);

tail = priv->chan[ch].tail;
- while (priv->chan[ch].fifo[tail].desc) {
- request = &priv->chan[ch].fifo[tail];
+ while (priv->chan[ch].fifo[tail]) {
+ request = priv->chan[ch].fifo[tail];
+ status = 0;

/* descriptors with their done bits set don't get the error */
rmb();
@@ -271,14 +262,9 @@ static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
sizeof(struct talitos_desc),
DMA_BIDIRECTIONAL);

- /* copy entries so we can call callback outside lock */
- saved_req.desc = request->desc;
- saved_req.callback = request->callback;
- saved_req.context = request->context;
-
/* release request entry in fifo */
smp_wmb();
- request->desc = NULL;
+ priv->chan[ch].fifo[tail] = NULL;

/* increment fifo tail */
priv->chan[ch].tail = (tail + 1) & (priv->fifo_len - 1);
@@ -287,8 +273,8 @@ static void flush_channel(struct device *dev, int ch, int error, int reset_ch)

atomic_dec(&priv->chan[ch].submit_count);

- saved_req.callback(dev, saved_req.desc, saved_req.context,
- status);
+ request->callback(dev, request->desc, request->context, status);
+
/* channel may resume processing in single desc error case */
if (error && !reset_ch && status == error)
return;
@@ -352,7 +338,8 @@ static u32 current_desc_hdr(struct device *dev, int ch)
tail = priv->chan[ch].tail;

iter = tail;
- while (priv->chan[ch].fifo[iter].dma_desc != cur_desc) {
+ while (priv->chan[ch].fifo[iter] &&
+ priv->chan[ch].fifo[iter]->dma_desc != cur_desc) {
iter = (iter + 1) & (priv->fifo_len - 1);
if (iter == tail) {
dev_err(dev, "couldn't locate current descriptor\n");
@@ -360,7 +347,8 @@ static u32 current_desc_hdr(struct device *dev, int ch)
}
}

- return priv->chan[ch].fifo[iter].desc->hdr;
+ return priv->chan[ch].fifo[iter] ?
+ priv->chan[ch].fifo[iter]->desc->hdr : 0;
}

/*
@@ -702,37 +690,6 @@ badkey:
return -EINVAL;
}

-/*
- * talitos_edesc - s/w-extended descriptor
- * @assoc_nents: number of segments in associated data scatterlist
- * @src_nents: number of segments in input scatterlist
- * @dst_nents: number of segments in output scatterlist
- * @assoc_chained: whether assoc is chained or not
- * @src_chained: whether src is chained or not
- * @dst_chained: whether dst is chained or not
- * @iv_dma: dma address of iv for checking continuity and link table
- * @dma_len: length of dma mapped link_tbl space
- * @dma_link_tbl: bus physical address of link_tbl
- * @desc: h/w descriptor
- * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1)
- *
- * if decrypting (with authcheck), or either one of src_nents or dst_nents
- * is greater than 1, an integrity check value is concatenated to the end
- * of link_tbl data
- */
-struct talitos_edesc {
- int assoc_nents;
- int src_nents;
- int dst_nents;
- bool assoc_chained;
- bool src_chained;
- bool dst_chained;
- dma_addr_t iv_dma;
- int dma_len;
- dma_addr_t dma_link_tbl;
- struct talitos_desc desc;
- struct talitos_ptr link_tbl[0];
-};

static int talitos_map_sg(struct device *dev, struct scatterlist *sg,
unsigned int nents, enum dma_data_direction dir,
@@ -1078,7 +1035,10 @@ static int ipsec_esp(struct talitos_edesc *edesc, struct aead_request *areq,
map_single_talitos_ptr(dev, &desc->ptr[6], ivsize, ctx->iv, 0,
DMA_FROM_DEVICE);

- ret = talitos_submit(dev, ctx->ch, desc, callback, areq);
+ edesc->req.callback = callback;
+ edesc->req.context = areq;
+
+ ret = talitos_submit(dev, ctx->ch, edesc);
if (ret != -EINPROGRESS) {
ipsec_esp_unmap(dev, edesc, areq);
kfree(edesc);
@@ -1209,6 +1169,7 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
edesc->dma_link_tbl = dma_map_single(dev, &edesc->link_tbl[0],
edesc->dma_len,
DMA_BIDIRECTIONAL);
+ edesc->req.desc = &edesc->desc;

return edesc;
}
@@ -1449,7 +1410,10 @@ static int common_nonsnoop(struct talitos_edesc *edesc,
to_talitos_ptr(&desc->ptr[6], 0);
desc->ptr[6].j_extent = 0;

- ret = talitos_submit(dev, ctx->ch, desc, callback, areq);
+ edesc->req.callback = callback;
+ edesc->req.context = areq;
+
+ ret = talitos_submit(dev, ctx->ch, edesc);
if (ret != -EINPROGRESS) {
common_nonsnoop_unmap(dev, edesc, areq);
kfree(edesc);
@@ -1629,7 +1593,10 @@ static int common_nonsnoop_hash(struct talitos_edesc *edesc,
/* last DWORD empty */
desc->ptr[6] = zero_entry;

- ret = talitos_submit(dev, ctx->ch, desc, callback, areq);
+ edesc->req.callback = callback;
+ edesc->req.context = areq;
+
+ ret = talitos_submit(dev, ctx->ch, edesc);
if (ret != -EINPROGRESS) {
common_nonsnoop_hash_unmap(dev, edesc, areq);
kfree(edesc);
@@ -2714,7 +2681,7 @@ static int talitos_probe(struct platform_device *ofdev)
spin_lock_init(&priv->chan[i].head_lock);
spin_lock_init(&priv->chan[i].tail_lock);

- priv->chan[i].fifo = kzalloc(sizeof(struct talitos_request) *
+ priv->chan[i].fifo = kzalloc(sizeof(struct talitos_request *) *
priv->fifo_len, GFP_KERNEL);
if (!priv->chan[i].fifo) {
dev_err(dev, "failed to allocate request fifo %d\n", i);
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 61a1405..914eb15 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -77,7 +77,7 @@ struct talitos_channel {
void __iomem *reg;

/* request fifo */
- struct talitos_request *fifo;
+ struct talitos_request **fifo;

/* number of requests pending in channel h/w fifo */
atomic_t submit_count ____cacheline_aligned;
@@ -133,11 +133,40 @@ struct talitos_private {
struct hwrng rng;
};

-extern int talitos_submit(struct device *dev, int ch, struct talitos_desc *desc,
- void (*callback)(struct device *dev,
- struct talitos_desc *desc,
- void *context, int error),
- void *context);
+/*
+ * talitos_edesc - s/w-extended descriptor
+ * @assoc_nents: number of segments in associated data scatterlist
+ * @src_nents: number of segments in input scatterlist
+ * @dst_nents: number of segments in output scatterlist
+ * @assoc_chained: whether assoc is chained or not
+ * @src_chained: whether src is chained or not
+ * @dst_chained: whether dst is chained or not
+ * @iv_dma: dma address of iv for checking continuity and link table
+ * @dma_len: length of dma mapped link_tbl space
+ * @dma_link_tbl: bus physical address of link_tbl
+ * @desc: h/w descriptor
+ * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1)
+ *
+ * if decrypting (with authcheck), or either one of src_nents or dst_nents
+ * is greater than 1, an integrity check value is concatenated to the end
+ * of link_tbl data
+ */
+struct talitos_edesc {
+ struct talitos_request req;
+ int assoc_nents;
+ int src_nents;
+ int dst_nents;
+ bool assoc_chained;
+ bool src_chained;
+ bool dst_chained;
+ dma_addr_t iv_dma;
+ int dma_len;
+ dma_addr_t dma_link_tbl;
+ struct talitos_desc desc;
+ struct talitos_ptr link_tbl[0];
+};
+
+int talitos_submit(struct device *dev, int ch, struct talitos_edesc *edesc);

/* .features flag */
#define TALITOS_FTR_SRC_LINK_TBL_LEN_INCLUDES_EXTENT 0x00000001
--
1.7.10.4

2015-03-03 13:21:50

by Martin Hicks

Subject: [PATCH v2 1/5] crypto: talitos: Simplify per-channel initialization

There were multiple loops in a row, one for each separate step of the
channel initialization. Simplify them into a single loop.

Signed-off-by: Martin Hicks <[email protected]>
---
drivers/crypto/talitos.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 067ec21..c49d977 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -2706,20 +2706,16 @@ static int talitos_probe(struct platform_device *ofdev)
goto err_out;
}

+ priv->fifo_len = roundup_pow_of_two(priv->chfifo_len);
+
for (i = 0; i < priv->num_channels; i++) {
priv->chan[i].reg = priv->reg + TALITOS_CH_STRIDE * (i + 1);
if (!priv->irq[1] || !(i & 1))
priv->chan[i].reg += TALITOS_CH_BASE_OFFSET;
- }

- for (i = 0; i < priv->num_channels; i++) {
spin_lock_init(&priv->chan[i].head_lock);
spin_lock_init(&priv->chan[i].tail_lock);
- }

- priv->fifo_len = roundup_pow_of_two(priv->chfifo_len);
-
- for (i = 0; i < priv->num_channels; i++) {
priv->chan[i].fifo = kzalloc(sizeof(struct talitos_request) *
priv->fifo_len, GFP_KERNEL);
if (!priv->chan[i].fifo) {
@@ -2727,11 +2723,10 @@ static int talitos_probe(struct platform_device *ofdev)
err = -ENOMEM;
goto err_out;
}
- }

- for (i = 0; i < priv->num_channels; i++)
atomic_set(&priv->chan[i].submit_count,
-(priv->chfifo_len - 1));
+ }

dma_set_mask(dev, DMA_BIT_MASK(36));

--
1.7.10.4

2015-03-03 13:21:55

by Martin Hicks

Subject: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

I was running into situations where the hardware FIFO was filling up, and
the code was returning EAGAIN to dm-crypt and just dropping the submitted
crypto request.

This adds support in talitos for a software backlog queue. When requests
can't be queued to the hardware immediately, EBUSY is returned. The queued
requests are dispatched to the hardware in the order received as hardware
FIFO slots become available.

Signed-off-by: Martin Hicks <[email protected]>
---
drivers/crypto/talitos.c | 135 ++++++++++++++++++++++++++++++++++++----------
drivers/crypto/talitos.h | 3 ++
2 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index b0c85ce..bb7fba0 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -182,55 +182,118 @@ static int init_device(struct device *dev)
return 0;
}

-/**
- * talitos_submit - submits a descriptor to the device for processing
- * @dev: the SEC device to be used
- * @ch: the SEC device channel to be used
- * @edesc: the descriptor to be processed by the device
- *
- * desc must contain valid dma-mapped (bus physical) address pointers.
- * callback must check err and feedback in descriptor header
- * for device processing status.
- */
-int talitos_submit(struct device *dev, int ch, struct talitos_edesc *edesc)
+/* Dispatch 'request' if provided, otherwise a backlogged request */
+static int __talitos_handle_queue(struct device *dev, int ch,
+ struct talitos_edesc *edesc,
+ unsigned long *irq_flags)
{
struct talitos_private *priv = dev_get_drvdata(dev);
- struct talitos_request *request = &edesc->req;
- unsigned long flags;
+ struct talitos_request *request;
+ struct crypto_async_request *areq;
int head;
+ int ret = -EINPROGRESS;

- spin_lock_irqsave(&priv->chan[ch].head_lock, flags);
-
- if (!atomic_inc_not_zero(&priv->chan[ch].submit_count)) {
+ if (!atomic_inc_not_zero(&priv->chan[ch].submit_count))
/* h/w fifo is full */
- spin_unlock_irqrestore(&priv->chan[ch].head_lock, flags);
- return -EAGAIN;
+ return -EBUSY;
+
+ if (!edesc) {
+ /* Dequeue the oldest request */
+ areq = crypto_dequeue_request(&priv->chan[ch].queue);
+ request = container_of(areq, struct talitos_request, base);
+ } else {
+ request = &edesc->req;
}

- head = priv->chan[ch].head;
request->dma_desc = dma_map_single(dev, request->desc,
sizeof(*request->desc),
DMA_BIDIRECTIONAL);

/* increment fifo head */
+ head = priv->chan[ch].head;
priv->chan[ch].head = (priv->chan[ch].head + 1) & (priv->fifo_len - 1);

- smp_wmb();
- priv->chan[ch].fifo[head] = request;
+ spin_unlock_irqrestore(&priv->chan[ch].head_lock, *irq_flags);
+
+ /*
+ * Mark a backlogged request as in-progress.
+ */
+ if (!edesc) {
+ areq = request->context;
+ areq->complete(areq, -EINPROGRESS);
+ }
+
+ spin_lock_irqsave(&priv->chan[ch].head_lock, *irq_flags);

/* GO! */
+ priv->chan[ch].fifo[head] = request;
wmb();
out_be32(priv->chan[ch].reg + TALITOS_FF,
upper_32_bits(request->dma_desc));
out_be32(priv->chan[ch].reg + TALITOS_FF_LO,
lower_32_bits(request->dma_desc));

+ return ret;
+}
+
+/**
+ * talitos_submit - submits a new descriptor to the device for processing
+ *
+ * @dev: the SEC device to be used
+ * @ch: the SEC device channel to be used
+ * @edesc: the request to be processed by the device
+ *
+ * edesc->req must contain valid dma-mapped (bus physical) address pointers.
+ * callback must check err and feedback in descriptor header
+ * for device processing status upon completion.
+ */
+int talitos_submit(struct device *dev, int ch, struct talitos_edesc *edesc)
+{
+ struct talitos_private *priv = dev_get_drvdata(dev);
+ struct talitos_request *request = &edesc->req;
+ unsigned long flags;
+ int ret = -EINPROGRESS;
+
+ spin_lock_irqsave(&priv->chan[ch].head_lock, flags);
+
+ if (priv->chan[ch].queue.qlen) {
+ /*
+ * There are backlogged requests. Just queue this new request
+ * and dispatch the oldest backlogged request to the hardware.
+ */
+ crypto_enqueue_request(&priv->chan[ch].queue,
+ &request->base);
+ __talitos_handle_queue(dev, ch, NULL, &flags);
+ ret = -EBUSY;
+ } else {
+ ret = __talitos_handle_queue(dev, ch, edesc, &flags);
+ if (ret == -EBUSY)
+ /* Hardware FIFO is full */
+ crypto_enqueue_request(&priv->chan[ch].queue,
+ &request->base);
+ }
+
spin_unlock_irqrestore(&priv->chan[ch].head_lock, flags);

- return -EINPROGRESS;
+ return ret;
}
EXPORT_SYMBOL(talitos_submit);

+static int talitos_handle_queue(struct device *dev, int ch)
+{
+ struct talitos_private *priv = dev_get_drvdata(dev);
+ unsigned long flags;
+ int ret = -EINPROGRESS;
+
+ spin_lock_irqsave(&priv->chan[ch].head_lock, flags);
+ /* Queue backlogged requests as long as the hardware has room */
+ while (priv->chan[ch].queue.qlen && ret == -EINPROGRESS)
+ ret = __talitos_handle_queue(dev, ch, NULL, &flags);
+ spin_unlock_irqrestore(&priv->chan[ch].head_lock, flags);
+
+ return 0;
+}
+
/*
* process what was done, notify callback of error if not
*/
@@ -283,6 +346,8 @@ static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
}

spin_unlock_irqrestore(&priv->chan[ch].tail_lock, flags);
+
+ talitos_handle_queue(dev, ch);
}

/*
@@ -1039,7 +1104,8 @@ static int ipsec_esp(struct talitos_edesc *edesc, struct aead_request *areq,
edesc->req.context = areq;

ret = talitos_submit(dev, ctx->ch, edesc);
- if (ret != -EINPROGRESS) {
+ if (ret != -EINPROGRESS &&
+ !(ret == -EBUSY && areq->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG)) {
ipsec_esp_unmap(dev, edesc, areq);
kfree(edesc);
}
@@ -1080,6 +1146,7 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
unsigned int ivsize,
int icv_stashing,
u32 cryptoflags,
+ struct crypto_async_request *areq,
bool encrypt)
{
struct talitos_edesc *edesc;
@@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
edesc->dma_len,
DMA_BIDIRECTIONAL);
edesc->req.desc = &edesc->desc;
+ /* A copy of the crypto_async_request to use the crypto_queue backlog */
+ memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));

return edesc;
}
@@ -1184,7 +1253,7 @@ static struct talitos_edesc *aead_edesc_alloc(struct aead_request *areq, u8 *iv,
return talitos_edesc_alloc(ctx->dev, areq->assoc, areq->src, areq->dst,
iv, areq->assoclen, areq->cryptlen,
ctx->authsize, ivsize, icv_stashing,
- areq->base.flags, encrypt);
+ areq->base.flags, &areq->base, encrypt);
}

static int aead_encrypt(struct aead_request *req)
@@ -1414,7 +1483,8 @@ static int common_nonsnoop(struct talitos_edesc *edesc,
edesc->req.context = areq;

ret = talitos_submit(dev, ctx->ch, edesc);
- if (ret != -EINPROGRESS) {
+ if (ret != -EINPROGRESS &&
+ !(ret == -EBUSY && areq->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG)) {
common_nonsnoop_unmap(dev, edesc, areq);
kfree(edesc);
}
@@ -1430,7 +1500,7 @@ static struct talitos_edesc *ablkcipher_edesc_alloc(struct ablkcipher_request *

return talitos_edesc_alloc(ctx->dev, NULL, areq->src, areq->dst,
areq->info, 0, areq->nbytes, 0, ivsize, 0,
- areq->base.flags, encrypt);
+ areq->base.flags, &areq->base, encrypt);
}

static int ablkcipher_encrypt(struct ablkcipher_request *areq)
@@ -1597,7 +1667,8 @@ static int common_nonsnoop_hash(struct talitos_edesc *edesc,
edesc->req.context = areq;

ret = talitos_submit(dev, ctx->ch, edesc);
- if (ret != -EINPROGRESS) {
+ if (ret != -EINPROGRESS &&
+ !(ret == -EBUSY && areq->base.flags & CRYPTO_TFM_REQ_MAY_BACKLOG)) {
common_nonsnoop_hash_unmap(dev, edesc, areq);
kfree(edesc);
}
@@ -1612,7 +1683,8 @@ static struct talitos_edesc *ahash_edesc_alloc(struct ahash_request *areq,
struct talitos_ahash_req_ctx *req_ctx = ahash_request_ctx(areq);

return talitos_edesc_alloc(ctx->dev, NULL, req_ctx->psrc, NULL, NULL, 0,
- nbytes, 0, 0, 0, areq->base.flags, false);
+ nbytes, 0, 0, 0, areq->base.flags,
+ &areq->base, false);
}

static int ahash_init(struct ahash_request *areq)
@@ -2690,6 +2762,13 @@ static int talitos_probe(struct platform_device *ofdev)
}

atomic_set(&priv->chan[i].submit_count, -priv->chfifo_len);
+
+ /*
+ * The crypto_queue is used to manage the backlog only. While
+ * the hardware FIFO has space, requests are dispatched
+ * immediately.
+ */
+ crypto_init_queue(&priv->chan[i].queue, 0);
}

dma_set_mask(dev, DMA_BIT_MASK(36));
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 914eb15..2762e47 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -65,6 +65,7 @@ struct talitos_desc {
* @context: caller context (optional)
*/
struct talitos_request {
+ struct crypto_async_request base;
struct talitos_desc *desc;
dma_addr_t dma_desc;
void (*callback) (struct device *dev, struct talitos_desc *desc,
@@ -91,6 +92,8 @@ struct talitos_channel {
spinlock_t tail_lock ____cacheline_aligned;
/* index to next in-progress/done descriptor request */
int tail;
+
+ struct crypto_queue queue;
};

struct talitos_private {
--
1.7.10.4

2015-03-03 13:21:52

by Martin Hicks

Subject: [PATCH v2 3/5] crypto: talitos: Fix off-by-one and use all hardware slots

The submission count was off by one.

Signed-off-by: Martin Hicks <[email protected]>
---
drivers/crypto/talitos.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 89cf4d5..7709805 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -2722,8 +2722,7 @@ static int talitos_probe(struct platform_device *ofdev)
goto err_out;
}

- atomic_set(&priv->chan[i].submit_count,
- -(priv->chfifo_len - 1));
+ atomic_set(&priv->chan[i].submit_count, -priv->chfifo_len);
}

dma_set_mask(dev, DMA_BIT_MASK(36));
--
1.7.10.4

2015-03-04 00:28:53

by Kim Phillips

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On Tue, 3 Mar 2015 08:21:37 -0500
Martin Hicks <[email protected]> wrote:

> @@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
> edesc->dma_len,
> DMA_BIDIRECTIONAL);
> edesc->req.desc = &edesc->desc;
> + /* A copy of the crypto_async_request to use the crypto_queue backlog */
> + memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));

this seems backward, or, at least can be done more efficiently IMO:
talitos_cra_init should set the tfm's reqsize so the rest of
the driver can wholly embed its talitos_edesc (which should also
wholly encapsulate its talitos_request (i.e., not via a pointer))
into the crypto API's request handle allocation. This
would absorb and eliminate the talitos_edesc kmalloc and frees, the
above memcpy, and replace the container_of after the
crypto_dequeue_request with an offset_of, right?
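
Roughly what the backlog dequeue side would then look like (untested
sketch for the ahash case; the point being that the edesc now sits at a
fixed offset inside the crypto API request):

        struct crypto_async_request *areq;
        struct ahash_request *req;
        struct talitos_edesc *edesc;

        areq = crypto_dequeue_request(&priv->chan[ch].queue);
        req = ahash_request_cast(areq);   /* the request wholly contains the base */
        edesc = ahash_request_ctx(req);   /* edesc embedded via the tfm's reqsize */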

When scatter-gather buffers are needed, we can assume a slower-path
and make them do their own allocations, since their sizes vary
depending on each request. Of course, a pointer to those
allocations would need to be retained somewhere in the request
handle.

Only potential problem is getting the crypto API to set the GFP_DMA
flag in the allocation request, but presumably a
CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

Kim

2015-03-04 00:40:50

by Kim Phillips

Subject: Re: [PATCH v2 3/5] crypto: talitos: Fix off-by-one and use all hardware slots

On Tue, 3 Mar 2015 08:21:35 -0500
Martin Hicks <[email protected]> wrote:

> The submission count was off by one.
>
> Signed-off-by: Martin Hicks <[email protected]>
> ---
sadly, this directly contradicts:

commit 4b24ea971a93f5d0bec34bf7bfd0939f70cfaae6
Author: Vishnu Suresh <[email protected]>
Date: Mon Oct 20 21:06:18 2008 +0800

crypto: talitos - Preempt overflow interrupts off-by-one fix

My guess is your request submission pattern differs from that of
Vishnu's (probably IPSec and/or tcrypt), or later h/w versions have
gotten better about dealing with channel near-overflow conditions.
Either way, I'd prefer we not do this: it might break others, and
I'm guessing doesn't improve performance _that_ much?

If it does, we could risk it and restrict it to SEC versions 3.3 and
above maybe? Not sure what to do here exactly, barring digging up
an old 2.x SEC and testing.

Kim

p.s. I checked, Vishnu isn't with Freescale anymore, so I can't
cc him.

2015-03-04 14:46:20

by Martin Hicks

Subject: Re: [PATCH v2 3/5] crypto: talitos: Fix off-by-one and use all hardware slots

Ok, I'm fine dropping this patch. I'm sure it doesn't affect
performance in a measurable way.

mh

On Tue, Mar 3, 2015 at 7:35 PM, Kim Phillips <[email protected]> wrote:
> On Tue, 3 Mar 2015 08:21:35 -0500
> Martin Hicks <[email protected]> wrote:
>
>> The submission count was off by one.
>>
>> Signed-off-by: Martin Hicks <[email protected]>
>> ---
> sadly, this directly contradicts:
>
> commit 4b24ea971a93f5d0bec34bf7bfd0939f70cfaae6
> Author: Vishnu Suresh <[email protected]>
> Date: Mon Oct 20 21:06:18 2008 +0800
>
> crypto: talitos - Preempt overflow interrupts off-by-one fix
>
> My guess is your request submission pattern differs from that of
> Vishnu's (probably IPSec and/or tcrypt), or later h/w versions have
> gotten better about dealing with channel near-overflow conditions.
> Either way, I'd prefer we not do this: it might break others, and
> I'm guessing doesn't improve performance _that_ much?
>
> If it does, we could risk it and restrict it to SEC versions 3.3 and
> above maybe? Not sure what to do here exactly, barring digging up
> and old 2.x SEC and testing.
>
> Kim
>
> p.s. I checked, Vishnu isn't with Freescale anymore, so I can't
> cc him.



--
Martin Hicks P.Eng. | [email protected]
Bork Consulting Inc. | +1 (613) 266-2296

2015-03-05 09:35:37

by Horia Geantă

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On 3/4/2015 2:23 AM, Kim Phillips wrote:
> On Tue, 3 Mar 2015 08:21:37 -0500
> Martin Hicks <[email protected]> wrote:
>
>> @@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
>> edesc->dma_len,
>> DMA_BIDIRECTIONAL);
>> edesc->req.desc = &edesc->desc;
>> + /* A copy of the crypto_async_request to use the crypto_queue backlog */
>> + memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));
>
> this seems backward, or, at least can be done more efficiently IMO:
> talitos_cra_init should set the tfm's reqsize so the rest of
> the driver can wholly embed its talitos_edesc (which should also
> wholly encapsulate its talitos_request (i.e., not via a pointer))
> into the crypto API's request handle allocation. This
> would absorb and eliminate the talitos_edesc kmalloc and frees, the
> above memcpy, and replace the container_of after the
> crypto_dequeue_request with an offset_of, right?
>
> When scatter-gather buffers are needed, we can assume a slower-path
> and make them do their own allocations, since their sizes vary
> depending on each request. Of course, a pointer to those
> allocations would need to be retained somewhere in the request
> handle.

Unfortunately the talitos_edesc structure size is variable most of the
time. Its exact size can only be established at "request time", not at
"tfm init time".

Fixed size would be sizeof(talitos_edesc).
Below are factors that influence the variable part, i.e. link_tbl in
talitos_edesc:
- whether any assoc / src / dst data is scattered
- icv_stashing (case when ICV checking is done in SW)

Still, we'd be better off with:
-crypto API allocates request + request context (i.e.
sizeof(talitos_edesc) + any alignment required)
-talitos driver allocates variable part of talitos_edesc (if needed)

instead of:
-crypto API allocates request
-talitos driver allocates talitos_edesc (fixed + variable)
-memcopy of the req.base (crypto_async_request) into talitos_edesc

both in terms of performance and readability.

At first look, the driver wouldn't change that much:
-talitos_cra_init() callback would have to set tfm.reqsize to
sizeof(talitos_edesc) + padding and also add the CRYPTO_TFM_REQ_DMA
indication in tfm.crt_flags
-talitos_edesc_alloc() logic would be pretty much the same, but would
allocate memory only for the link_tbl

I'm willing to do these changes if needed.
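
As a rough, untested sketch (ahash case only; CRYPTO_TFM_REQ_DMA being the
proposed new flag), the init callback would become something like:

        static int talitos_cra_init_ahash(struct crypto_tfm *tfm)
        {
                struct talitos_ctx *ctx = crypto_tfm_ctx(tfm);

                talitos_cra_init(tfm);

                ctx->keylen = 0;
                /* reserve the fixed part of talitos_edesc in the request ctx */
                crypto_ahash_set_reqsize(__crypto_ahash_cast(tfm),
                                         sizeof(struct talitos_edesc) +
                                         sizeof(struct talitos_ahash_req_ctx));
                tfm->crt_flags |= CRYPTO_TFM_REQ_DMA;

                return 0;
        }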

>
> Only potential problem is getting the crypto API to set the GFP_DMA
> flag in the allocation request, but presumably a
> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

Right. And this flag would apply only to request __ctx[].

Herbert, would this be an acceptable addition to crypto API?

Thanks,
Horia

2015-03-06 00:11:59

by Kim Phillips

Subject: Re: [PATCH v2 1/5] crypto: talitos: Simplify per-channel initialization

On Tue, 3 Mar 2015 08:21:33 -0500
Martin Hicks <[email protected]> wrote:

> There were multiple loops in a row, for each separate step of the
> initialization of the channels. Simplify to a single loop.
>
> Signed-off-by: Martin Hicks <[email protected]>
> ---

Acked-by: Kim Phillips <[email protected]>

Kim

2015-03-06 00:12:44

by Kim Phillips

Subject: Re: [PATCH v2 2/5] crypto: talitos: Remove MD5_BLOCK_SIZE

On Tue, 3 Mar 2015 08:21:34 -0500
Martin Hicks <[email protected]> wrote:

> This is properly defined in the md5 header file.
>
> Signed-off-by: Martin Hicks <[email protected]>
> ---

Acked-by: Kim Phillips <[email protected]>

Kim

2015-03-06 00:40:22

by Kim Phillips

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On Thu, 5 Mar 2015 11:35:23 +0200
Horia Geantă <[email protected]> wrote:

> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> > On Tue, 3 Mar 2015 08:21:37 -0500
> > Martin Hicks <[email protected]> wrote:
> >
> >> @@ -1170,6 +1237,8 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
> >> edesc->dma_len,
> >> DMA_BIDIRECTIONAL);
> >> edesc->req.desc = &edesc->desc;
> >> + /* A copy of the crypto_async_request to use the crypto_queue backlog */
> >> + memcpy(&edesc->req.base, areq, sizeof(struct crypto_async_request));
> >
> > this seems backward, or, at least can be done more efficiently IMO:
> > talitos_cra_init should set the tfm's reqsize so the rest of
> > the driver can wholly embed its talitos_edesc (which should also
> > wholly encapsulate its talitos_request (i.e., not via a pointer))
> > into the crypto API's request handle allocation. This
> > would absorb and eliminate the talitos_edesc kmalloc and frees, the
> > above memcpy, and replace the container_of after the
> > crypto_dequeue_request with an offset_of, right?
> >
> > When scatter-gather buffers are needed, we can assume a slower-path
> > and make them do their own allocations, since their sizes vary
> > depending on each request. Of course, a pointer to those
> > allocations would need to be retained somewhere in the request
> > handle.
>
> Unfortunately talitos_edesc structure size is most of the times
> variable. Its exact size can only be established at "request time", and
> not at "tfm init time".

yes, I was suggesting a common minimum should be set in cra_init.

> Fixed size would be sizeof(talitos_edesc).
> Below are factors that influence the variable part, i.e. link_tbl in
> talitos_edesc:
> - whether any assoc / src / dst data is scattered
> - icv_stashing (case when ICV checking is done in SW)

both being slow(er) paths, IMO.

> Still we'd be better with:
> -crypto API allocates request + request context (i.e.
> sizeof(talitos_edesc) + any alignment required)
> -talitos driver allocates variable part of talitos_edesc (if needed)
>
> instead of:
> -crypto API allocates request
> -talitos driver allocates talitos_edesc (fixed + variable)
> -memcopy of the req.base (crypto_async_request) into talitos_edesc
>
> both in terms of performance and readability.

indeed.

> At first look, the driver wouldn't change that much:
> -talitos_cra_init() callback would have to set tfm.reqsize to
> sizeof(talitos_edesc) + padding and also add the CRYPTO_TFM_REQ_DMA
> indication in tfm.crt_flags
> -talitos_edesc_alloc() logic would be pretty much the same, but would
> allocate memory only for the link_tbl
>
> I'm willing to do these changes if needed.

Please coordinate with Martin.

fwiw, caam could use this, too.

Kim

2015-03-06 04:48:42

by Herbert Xu

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On Thu, Mar 05, 2015 at 11:35:23AM +0200, Horia Geantă wrote:
>
> > Only potential problem is getting the crypto API to set the GFP_DMA
> > flag in the allocation request, but presumably a
> > CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>
> Right. And this flag would apply only to request __ctx[].
>
> Herbert, would this be an acceptable addition to crypto API?

How would such a flag work?

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-03-06 12:02:33

by Herbert Xu

Subject: Re: [PATCH v2 2/5] crypto: talitos: Remove MD5_BLOCK_SIZE

Kim Phillips <[email protected]> wrote:
> On Tue, 3 Mar 2015 08:21:34 -0500
> Martin Hicks <[email protected]> wrote:
>
>> This is properly defined in the md5 header file.
>>
>> Signed-off-by: Martin Hicks <[email protected]>
>> ---
>
> Acked-by: Kim Phillips <[email protected]>

Patches 1-2 applied. Thanks.
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-03-09 12:08:40

by Horia Geantă

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On 3/6/2015 6:48 AM, Herbert Xu wrote:
> On Thu, Mar 05, 2015 at 11:35:23AM +0200, Horia Geantă wrote:
>>
>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>> flag in the allocation request, but presumably a
>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>
>> Right. And this flag would apply only to request __ctx[].
>>
>> Herbert, would this be an acceptable addition to crypto API?
>
> How would such a flag work?

Hm, I thought that GFP_DMA memory could be allocated only for request
private ctx. This is obviously not the case.

*_request_alloc(tfm, gfp) crypto API functions would do:
if (crypto_tfm_get_flags(tfm) & CRYPTO_TFM_REQ_DMA)
gfp |= GFP_DMA;
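
E.g. for aead, something along the lines of (sketch based on the current
aead_request_alloc() helper; CRYPTO_TFM_REQ_DMA being the proposed flag):

        static inline struct aead_request *aead_request_alloc(
                struct crypto_aead *tfm, gfp_t gfp)
        {
                struct aead_request *req;

                if (crypto_aead_get_flags(tfm) & CRYPTO_TFM_REQ_DMA)
                        gfp |= GFP_DMA;

                req = kmalloc(sizeof(*req) + crypto_aead_reqsize(tfm), gfp);

                if (likely(req))
                        aead_request_set_tfm(req, tfm);

                return req;
        }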

Thanks,
Horia

2015-03-16 10:03:08

by Horia Geantă

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On 3/4/2015 2:23 AM, Kim Phillips wrote:
> Only potential problem is getting the crypto API to set the GFP_DMA
> flag in the allocation request, but presumably a
> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.

Seems there are quite a few places that do not use the
{aead,ablkcipher,ahash}_request_alloc() API to allocate crypto requests.
Among them, IPsec and dm-crypt.
I've looked at the code and I don't think it can be converted to use
the crypto API.

This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
places. Some of the maintainers do not agree, as you've seen.

An alternative would be for talitos to use the page allocator to get 1 or
2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
= 8 kB), dma_map_page the area and manage it internally for talitos_desc
hw descriptors.
What do you think?
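
Something along these lines at probe time (rough sketch; the pool field
names are made up):

        priv->desc_pages = alloc_pages(GFP_KERNEL | GFP_DMA,
                                       get_order(priv->num_channels *
                                                 priv->fifo_len *
                                                 sizeof(struct talitos_desc)));
        priv->desc_pages_dma = dma_map_page(dev, priv->desc_pages, 0,
                                            priv->num_channels * priv->fifo_len *
                                            sizeof(struct talitos_desc),
                                            DMA_BIDIRECTIONAL);
        /* hand out 64B talitos_desc slots from this area instead of
         * DMA-mapping each descriptor on the fast path */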

Thanks,
Horia

2015-03-17 00:19:35

by Kim Phillips

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On Mon, 16 Mar 2015 12:02:51 +0200
Horia Geantă <[email protected]> wrote:

> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> > Only potential problem is getting the crypto API to set the GFP_DMA
> > flag in the allocation request, but presumably a
> > CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>
> Seems there are quite a few places that do not use the
> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
> Among them, IPsec and dm-crypt.
> I've looked at the code and I don't think it can be converted to use
> crypto API.

why not?

> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
> places. Some of the maintainers do not agree, as you've seen.

would modifying the crypto API to either have a different
*_request_alloc() API, and/or adding calls to negotiate the GFP mask
between crypto users and drivers, e.g., get/set_gfp_mask, work?

> An alternative would be for talitos to use the page allocator to get 1 /
> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
> hw descriptors.
> What do you think?

There's a comment in esp_alloc_tmp(): "Use spare space in skb for
this where possible," which is ideally where we'd want to be (esp.
because that memory could already be DMA-able). Your above
suggestion would be in the opposite direction of that.

Kim

2015-03-17 17:58:55

by Horia Geantă

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On 3/17/2015 2:19 AM, Kim Phillips wrote:
> On Mon, 16 Mar 2015 12:02:51 +0200
> Horia Geantă <[email protected]> wrote:
>
>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>> flag in the allocation request, but presumably a
>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>
>> Seems there are quite a few places that do not use the
>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
>> Among them, IPsec and dm-crypt.
>> I've looked at the code and I don't think it can be converted to use
>> crypto API.
>
> why not?

It would imply having 2 memory allocations, one for crypto request and
the other for the rest of the data bundled with the request (for IPsec
that would be ESN + space for IV + sg entries for authenticated-only
data and sk_buff extension, if needed).

Trying to have a single allocation by making ESN, IV etc. part of the
request private context requires modifying tfm.reqsize on the fly.
This won't work without adding some kind of locking for the tfm.

>
>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>> places. Some of the maintainers do not agree, as you've seen.
>
> would modifying the crypto API to either have a different
> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
> between crypto users and drivers, e.g., get/set_gfp_mask, work?

I think what DaveM asked for was the change to be transparent.

Besides converting to *_request_alloc(), it seems that all other options
require some extra awareness from the user.
Could you elaborate on the idea above?

>
>> An alternative would be for talitos to use the page allocator to get 1 /
>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>> hw descriptors.
>> What do you think?
>
> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
> this where possible," which is ideally where we'd want to be (esp.

Ok, I'll check that. But note the "where possible" - finding room in the
skb to avoid the allocation won't always be the case, and then we're
back to square one.

> because that memory could already be DMA-able). Your above
> suggestion would be in the opposite direction of that.

The proposal:
-removes dma (un)mapping on the fast path
-avoids requesting dma mappable memory for more than it's actually
needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
only its private context)
-for caam it has the added benefit of speeding the below search for the
offending descriptor in the SW ring from O(n) to O(1):
for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
        sw_idx = (tail + i) & (JOBR_DEPTH - 1);

        if (jrp->outring[hw_idx].desc ==
            jrp->entinfo[sw_idx].desc_addr_dma)
                break; /* found */
}
(drivers/crypto/caam/jr.c - caam_dequeue)

Horia



2015-03-17 22:08:42

by Kim Phillips

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On Tue, 17 Mar 2015 19:58:55 +0200
Horia Geantă <[email protected]> wrote:

> On 3/17/2015 2:19 AM, Kim Phillips wrote:
> > On Mon, 16 Mar 2015 12:02:51 +0200
> > Horia Geantă <[email protected]> wrote:
> >
> >> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> >>> Only potential problem is getting the crypto API to set the GFP_DMA
> >>> flag in the allocation request, but presumably a
> >>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
> >>
> >> Seems there are quite a few places that do not use the
> >> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
> >> Among them, IPsec and dm-crypt.
> >> I've looked at the code and I don't think it can be converted to use
> >> crypto API.
> >
> > why not?
>
> It would imply having 2 memory allocations, one for crypto request and
> the other for the rest of the data bundled with the request (for IPsec
> that would be ESN + space for IV + sg entries for authenticated-only
> data and sk_buff extension, if needed).
>
> Trying to have a single allocation by making ESN, IV etc. part of the
> request private context requires modifying tfm.reqsize on the fly.
> This won't work without adding some kind of locking for the tfm.

can't a common minimum tfm.reqsize be co-established up front, at
least for the fast path?

> >> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
> >> places. Some of the maintainers do not agree, as you've seen.
> >
> > would modifying the crypto API to either have a different
> > *_request_alloc() API, and/or adding calls to negotiate the GFP mask
> > between crypto users and drivers, e.g., get/set_gfp_mask, work?
>
> I think what DaveM asked for was the change to be transparent.
>
> Besides converting to *_request_alloc(), seems that all other options
> require some extra awareness from the user.
> Could you elaborate on the idea above?

was merely suggesting communicating GFP flags anonymously across the
API, i.e., GFP_DMA wouldn't appear in user code.

> >> An alternative would be for talitos to use the page allocator to get 1 /
> >> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
> >> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
> >> hw descriptors.
> >> What do you think?
> >
> > There's a comment in esp_alloc_tmp(): "Use spare space in skb for
> > this where possible," which is ideally where we'd want to be (esp.
>
> Ok, I'll check that. But note the "where possible" - finding room in the
> skb to avoid the allocation won't always be the case, and then we're
> back to square one.
>
> > because that memory could already be DMA-able). Your above
> > suggestion would be in the opposite direction of that.
>
> The proposal:
> -removes dma (un)mapping on the fast path

sure, but at the expense of additional complexity.

> -avoids requesting dma mappable memory for more than it's actually
> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
> only its private context)

compared to the payload? Plus, we have plenty of DMA space these
days.

> -for caam it has the added benefit of speeding the below search for the
> offending descriptor in the SW ring from O(n) to O(1):
> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
> sw_idx = (tail + i) & (JOBR_DEPTH - 1);
>
> if (jrp->outring[hw_idx].desc ==
> jrp->entinfo[sw_idx].desc_addr_dma)
> break; /* found */
> }
> (drivers/crypto/caam/jr.c - caam_dequeue)

how? The job ring h/w will still be spitting things out
out-of-order.

Plus, like I said, it's taking the problem in the wrong direction:
we need to strive to merge the allocation and mapping with the upper
layers as much as possible.

Kim

2015-03-19 15:57:12

by Horia Geantă

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On 3/18/2015 12:03 AM, Kim Phillips wrote:
> On Tue, 17 Mar 2015 19:58:55 +0200
> Horia Geantă <[email protected]> wrote:
>
>> On 3/17/2015 2:19 AM, Kim Phillips wrote:
>>> On Mon, 16 Mar 2015 12:02:51 +0200
>>> Horia Geantă <[email protected]> wrote:
>>>
>>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
>>>>> Only potential problem is getting the crypto API to set the GFP_DMA
>>>>> flag in the allocation request, but presumably a
>>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
>>>>
>>>> Seems there are quite a few places that do not use the
>>>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
>>>> Among them, IPsec and dm-crypt.
>>>> I've looked at the code and I don't think it can be converted to use
>>>> crypto API.
>>>
>>> why not?
>>
>> It would imply having 2 memory allocations, one for crypto request and
>> the other for the rest of the data bundled with the request (for IPsec
>> that would be ESN + space for IV + sg entries for authenticated-only
>> data and sk_buff extension, if needed).
>>
>> Trying to have a single allocation by making ESN, IV etc. part of the
>> request private context requires modifying tfm.reqsize on the fly.
>> This won't work without adding some kind of locking for the tfm.
>
> can't a common minimum tfm.reqsize be co-established up front, at
> least for the fast path?

Indeed, for IPsec at tfm allocation time - esp_init_state() -
tfm.reqsize could be increased to account for what is known for a given
flow: ESN, IV and asg (S/G entries for authenticated-only data).
The layout would be:
aead request (fixed part)
private ctx of backend algorithm
seq_no_hi (if ESN)
IV
asg
sg <-- S/G table for skb_to_sgvec; how many entries is the question

Do you have a suggestion for how many S/G entries to preallocate for
representing the sk_buff data to be encrypted?
An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
Btw, currently the maximum number of fragments supported by the net stack
(MAX_SKB_FRAGS) is 16 or more.

>>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
>>>> places. Some of the maintainers do not agree, as you've seen.
>>>
>>> would modifying the crypto API to either have a different
>>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
>>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
>>
>> I think what DaveM asked for was the change to be transparent.
>>
>> Besides converting to *_request_alloc(), seems that all other options
>> require some extra awareness from the user.
>> Could you elaborate on the idea above?
>
> was merely suggesting communicating GFP flags anonymously across the
> API, i.e., GFP_DMA wouldn't appear in user code.

Meaning the user would have to call get_gfp_mask() before allocating a crypto
request - i.e. instead of kmalloc(..., GFP_ATOMIC) it would be
kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?

>>>> An alternative would be for talitos to use the page allocator to get 1 /
>>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
>>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
>>>> hw descriptors.
>>>> What do you think?
>>>
>>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
>>> this where possible," which is ideally where we'd want to be (esp.
>>
>> Ok, I'll check that. But note the "where possible" - finding room in the
>> skb to avoid the allocation won't always be the case, and then we're
>> back to square one.

So the skb cb is out of the question, being too small (48B).
Any idea what was the intention of the "TODO" - maybe to use the
tailroom in the skb data area?

>>> because that memory could already be DMA-able). Your above
>>> suggestion would be in the opposite direction of that.
>>
>> The proposal:
>> -removes dma (un)mapping on the fast path
>
> sure, but at the expense of additional complexity.

Right, there's no free lunch. But it's cheaper.

>> -avoids requesting dma mappable memory for more than it's actually
>> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
>> only its private context)
>
> compared to the payload? Plus, we have plenty of DMA space these
> days.
>
>> -for caam it has the added benefit of speeding the below search for the
>> offending descriptor in the SW ring from O(n) to O(1):
>> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
>> sw_idx = (tail + i) & (JOBR_DEPTH - 1);
>>
>> if (jrp->outring[hw_idx].desc ==
>> jrp->entinfo[sw_idx].desc_addr_dma)
>> break; /* found */
>> }
>> (drivers/crypto/caam/jr.c - caam_dequeue)
>
> how? The job ring h/w will still be spitting things out
> out-of-order.

jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
O(1):

dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
[...]
sw_idx = (jrp->outring[hw_idx].desc - desc_base) / JD_SIZE;

JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
descriptor, 3 words could be used for something else.
Basically all JDs would be filled at a 64B-aligned offset in the memory
page.

> Plus, like I said, it's taking the problem in the wrong direction:
> we need to strive to merge the allocation and mapping with the upper
> layers as much as possible.

IMHO propagating the GFP_DMA from backend crypto implementations to
crypto API users doesn't seem feasible.
It's error-prone to audit all places that allocate crypto requests without
using the *_request_alloc() API.
And even if all these places would be identified:
-in some cases there's some heavy rework involved
-more places might show up in the future and there's no way to detect them

Horia

2015-03-19 18:43:38

by Kim Phillips

Subject: Re: [PATCH v2 5/5] crypto: talitos: Add software backlog queue handling

On Thu, 19 Mar 2015 17:56:57 +0200
Horia Geantă <[email protected]> wrote:

> On 3/18/2015 12:03 AM, Kim Phillips wrote:
> > On Tue, 17 Mar 2015 19:58:55 +0200
> > Horia Geantă <[email protected]> wrote:
> >
> >> On 3/17/2015 2:19 AM, Kim Phillips wrote:
> >>> On Mon, 16 Mar 2015 12:02:51 +0200
> >>> Horia Geantă <[email protected]> wrote:
> >>>
> >>>> On 3/4/2015 2:23 AM, Kim Phillips wrote:
> >>>>> Only potential problem is getting the crypto API to set the GFP_DMA
> >>>>> flag in the allocation request, but presumably a
> >>>>> CRYPTO_TFM_REQ_DMA crt_flag can be made to handle that.
> >>>>
> >>>> Seems there are quite a few places that do not use the
> >>>> {aead,ablkcipher_ahash}_request_alloc() API to allocate crypto requests.
> >>>> Among them, IPsec and dm-crypt.
> >>>> I've looked at the code and I don't think it can be converted to use
> >>>> crypto API.
> >>>
> >>> why not?
> >>
> >> It would imply having 2 memory allocations, one for crypto request and
> >> the other for the rest of the data bundled with the request (for IPsec
> >> that would be ESN + space for IV + sg entries for authenticated-only
> >> data and sk_buff extension, if needed).
> >>
> >> Trying to have a single allocation by making ESN, IV etc. part of the
> >> request private context requires modifying tfm.reqsize on the fly.
> >> This won't work without adding some kind of locking for the tfm.
> >
> > can't a common minimum tfm.reqsize be co-established up front, at
> > least for the fast path?
>
> Indeed, for IPsec at tfm allocation time - esp_init_state() -
> tfm.reqsize could be increased to account for what is known for a given
> flow: ESN, IV and asg (S/G entries for authenticated-only data).
> The layout would be:
> aead request (fixed part)
> private ctx of backend algorithm
> seq_no_hi (if ESN)
> IV
> asg
> sg <-- S/G table for skb_to_sgvec; how many entries is the question
>
> Do you have a suggestion for how many S/G entries to preallocate for
> representing the sk_buff data to be encrypted?
> An ancient esp4.c used ESP_NUM_FAST_SG, set to 4.
> Btw, currently maximum number of fragments supported by the net stack
> (MAX_SKB_FRAGS) is 16 or more.
>
> >>>> This means that the CRYPTO_TFM_REQ_DMA would be visible to all of these
> >>>> places. Some of the maintainers do not agree, as you've seen.
> >>>
> >>> would modifying the crypto API to either have a different
> >>> *_request_alloc() API, and/or adding calls to negotiate the GFP mask
> >>> between crypto users and drivers, e.g., get/set_gfp_mask, work?
> >>
> >> I think what DaveM asked for was the change to be transparent.
> >>
> >> Besides converting to *_request_alloc(), seems that all other options
> >> require some extra awareness from the user.
> >> Could you elaborate on the idea above?
> >
> > was merely suggesting communicating GFP flags anonymously across the
> > API, i.e., GFP_DMA wouldn't appear in user code.
>
> Meaning user would have to get_gfp_mask before allocating a crypto
> request - i.e. instead of kmalloc(..., GFP_ATOMIC) to have
> kmalloc(GFP_ATOMIC | get_gfp_mask(aead))?
>
> >>>> An alternative would be for talitos to use the page allocator to get 1 /
> >>>> 2 pages at probe time (4 channels x 32 entries/channel x 64B/descriptor
> >>>> = 8 kB), dma_map_page the area and manage it internally for talitos_desc
> >>>> hw descriptors.
> >>>> What do you think?
> >>>
> >>> There's a comment in esp_alloc_tmp(): "Use spare space in skb for
> >>> this where possible," which is ideally where we'd want to be (esp.
> >>
> >> Ok, I'll check that. But note the "where possible" - finding room in the
> >> skb to avoid the allocation won't always be the case, and then we're
> >> back to square one.
>
> So the skb cb is out of the question, being too small (48B).
> Any idea what was the intention of the "TODO" - maybe to use the
> tailroom in the skb data area?
>
> >>> because that memory could already be DMA-able). Your above
> >>> suggestion would be in the opposite direction of that.
> >>
> >> The proposal:
> >> -removes dma (un)mapping on the fast path
> >
> > sure, but at the expense of additional complexity.
>
> Right, there's no free lunch. But it's cheaper.
>
> >> -avoids requesting dma mappable memory for more than it's actually
> >> needed (CRYPTO_TFM_REQ_DMA forces entire request to be mappable, not
> >> only its private context)
> >
> > compared to the payload? Plus, we have plenty of DMA space these
> > days.
> >
> >> -for caam it has the added benefit of speeding the below search for the
> >> offending descriptor in the SW ring from O(n) to O(1):
> >> for (i = 0; CIRC_CNT(head, tail + i, JOBR_DEPTH) >= 1; i++) {
> >> sw_idx = (tail + i) & (JOBR_DEPTH - 1);
> >>
> >> if (jrp->outring[hw_idx].desc ==
> >> jrp->entinfo[sw_idx].desc_addr_dma)
> >> break; /* found */
> >> }
> >> (drivers/crypto/caam/jr.c - caam_dequeue)
> >
> > how? The job ring h/w will still be spitting things out
> > out-of-order.
>
> jrp->outring[hw_idx].desc bus address can be used to find the sw_idx in
> O(1):
>
> dma_addr_t desc_base = dma_map_page(alloc_page(GFP_DMA),...);
> [...]
> sw_idx = (desc_base - jrp->outring[hw_idx].desc) / JD_SIZE;
>
> JD_SIZE would be 16 words (64B) - 13 words used for the h/w job
> descriptor, 3 words can be used for smth. else.
> Basically all JDs would be filled at a 64B-aligned offset in the memory
> page.

that assumes a linear mapping, which is a wrong assumption to make.

I also think you don't know how many times that loop above executes
in practice.

> > Plus, like I said, it's taking the problem in the wrong direction:
> > we need to strive to merge the allocation and mapping with the upper
> > layers as much as possible.
>
> IMHO propagating the GFP_DMA from backend crypto implementations to
> crypto API users doesn't seem feasable.

should be.

> It's error-prone to audit all places that allocate crypto requests w/out
> using *_request_alloc API.

why is it error-prone?

> And even if all these places would be identified:
> -in some cases there's some heavy rework involved

so?

> -more places might show up in the future and there's no way to detect them

let them worry about that.

I leave the rest for netdev.

Kim