2015-11-11 09:33:22

by Baolin Wang

Subject: [PATCH 0/2] Introduce the request handling for dm-crypt

The dm-crypt code currently implements only the 'bio-based' method to encrypt/
decrypt block data, which can handle only one bio at a time. As we know, a bio
must cover a sequential physical address range and it also has a length
limitation. This may limit big-block encryption/decryption when some hardware
supports big-block data encryption.

This patch series introduces the 'request-based' method to handle data
encryption/decryption. One request can contain multiple bios, so it can handle
big blocks of data to improve efficiency.
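To make that concrete, here is a tiny userspace sketch (all names and the per-bio size cap are illustrative, not kernel API) of why one request can describe more data than any single bio:

```c
#include <stddef.h>

/* Userspace model (hypothetical, not kernel code): a request chains
 * several bios, each covering one contiguous range of bounded length. */
#define BIO_MAX_BYTES (256 * 1024)  /* illustrative per-bio cap */

struct bio_model {
    size_t len;                /* payload bytes, at most BIO_MAX_BYTES */
    struct bio_model *bi_next; /* next bio in the same request */
};

/* A request's payload is the sum over its whole bio chain, so one
 * request can carry far more data than a single bio. */
static size_t request_payload(const struct bio_model *bio)
{
    size_t total = 0;
    for (; bio; bio = bio->bi_next)
        total += bio->len;
    return total;
}

/* Build a three-bio request and return its total payload. */
static size_t demo_three_bio_request(void)
{
    struct bio_model b2 = { BIO_MAX_BYTES, NULL };
    struct bio_model b1 = { BIO_MAX_BYTES, &b2 };
    struct bio_model b0 = { BIO_MAX_BYTES, &b1 };
    return request_payload(&b0);
}
```

Hardware that encrypts large contiguous submissions in one operation can then be fed the whole request instead of one bio-sized piece at a time.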

This patch series also refers to the 'Request based dm-crypt' submission by
Dinesh K Garg <[email protected]> and franciscofranco
<[email protected]>; the original link is:
https://github.com/major91/Zeta-Chromium-N5/commit/621d4821da04cfde383329ebdf4ce8711ec3329c

It now encrypts/decrypts successfully under read/write testing, and it shows
some efficiency improvement with hardware acceleration. But I would really
appreciate more comments on the structure and concept. Thanks for your review.

Baolin Wang (2):
block: Introduce BIO_ENDIO_FREE for bio flags
md: dm-crypt: Introduce the request handling for dm-crypt

block/blk-core.c | 6 +-
drivers/md/Kconfig | 6 +
drivers/md/dm-crypt.c | 831 ++++++++++++++++++++++++++++++++++++++++-
drivers/md/dm.c | 13 +-
include/linux/blk_types.h | 6 +
include/linux/device-mapper.h | 5 +
6 files changed, 861 insertions(+), 6 deletions(-)

--
1.7.9.5


2015-11-11 09:33:30

by Baolin Wang

Subject: [PATCH 1/2] block: Introduce BIO_ENDIO_FREE for bio flags

When we use dm-crypt to decrypt block data, it decrypts the data in endio()
when one IO is completed. In this situation we don't want the cloned bios to be
freed before endio() is called.

Thus, introduce the 'BIO_ENDIO_FREE' flag to support request handling for
dm-crypt. This flag ensures that the block layer does not complete the cloned
bios before completing the request. When the crypt endio is called,
post-processing is done and then the dm layer completes the bios (clones) and
frees them.
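A rough userspace model of the flag's effect (hypothetical names, greatly simplified from the real blk_update_request() hook this patch adds) can be sketched as:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model (hypothetical, simplified): when the request's first
 * bio carries BIO_ENDIO_FREE, the block layer leaves the bios alone so
 * the dm layer can post-process (decrypt) and free them later. */
enum { BIO_ENDIO_FREE_FLAG = 1u << 12 };

struct bio_model     { unsigned flags; };
struct request_model { struct bio_model *bio; };

/* Returns true when the block layer may complete/free the bios itself,
 * false when completion is deferred to the dm layer. */
static bool blk_may_free_bios(const struct request_model *rq)
{
    if (rq->bio && (rq->bio->flags & BIO_ENDIO_FREE_FLAG))
        return false; /* dm-crypt's endio will complete the clones */
    return true;      /* normal path: block layer completes the bios */
}

/* Two small demos: a flagged request defers, a plain one does not. */
static bool demo_flagged(void)
{
    struct bio_model b = { BIO_ENDIO_FREE_FLAG };
    struct request_model rq = { &b };
    return blk_may_free_bios(&rq);
}

static bool demo_plain(void)
{
    struct bio_model b = { 0 };
    struct request_model rq = { &b };
    return blk_may_free_bios(&rq);
}
```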

Signed-off-by: Baolin Wang <[email protected]>
---
block/blk-core.c | 6 +++++-
drivers/md/dm.c | 13 ++++++++++---
include/linux/blk_types.h | 6 ++++++
include/linux/device-mapper.h | 5 +++++
4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 60912e9..6838936 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1411,7 +1411,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
elv_completed_request(q, req);

/* this is a bio leak */
- WARN_ON(req->bio != NULL);
+ WARN_ON(req->bio != NULL && !bio_flagged(req->bio, BIO_ENDIO_FREE));

/*
* Request may not have originated from ll_rw_blk. if not,
@@ -2521,6 +2521,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
blk_account_io_completion(req, nr_bytes);

total_bytes = 0;
+
+ if (bio_flagged(req->bio, BIO_ENDIO_FREE))
+ return false;
+
while (req->bio) {
struct bio *bio = req->bio;
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 6264781..2c18a34 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1047,6 +1047,13 @@ static struct dm_rq_target_io *tio_from_request(struct request *rq)
return (rq->q->mq_ops ? blk_mq_rq_to_pdu(rq) : rq->special);
}

+struct request *dm_get_orig_rq(struct request *clone)
+{
+ struct dm_rq_target_io *tio = clone->end_io_data;
+
+ return tio->orig;
+}
+
static void rq_end_stats(struct mapped_device *md, struct request *orig)
{
if (unlikely(dm_stats_used(&md->stats))) {
@@ -1118,7 +1125,7 @@ static void free_rq_clone(struct request *clone)
* Must be called without clone's queue lock held,
* see end_clone_request() for more details.
*/
-static void dm_end_request(struct request *clone, int error)
+void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
struct dm_rq_target_io *tio = clone->end_io_data;
@@ -1311,7 +1318,7 @@ static void dm_complete_request(struct request *rq, int error)
* Target's rq_end_io() function isn't called.
* This may be used when the target's map_rq() or clone_and_map_rq() functions fail.
*/
-static void dm_kill_unmapped_request(struct request *rq, int error)
+void dm_kill_unmapped_request(struct request *rq, int error)
{
rq->cmd_flags |= REQ_FAILED;
dm_complete_request(rq, error);
@@ -1758,7 +1765,7 @@ int dm_request_based(struct mapped_device *md)
return blk_queue_stackable(md->queue);
}

-static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
+void dm_dispatch_clone_request(struct request *clone, struct request *rq)
{
int r;

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index e813013..30aee54 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -122,6 +122,12 @@ struct bio {
#define BIO_REFFED 8 /* bio has elevated ->bi_cnt */

/*
+ * Added for request-based dm, which needs to perform post-processing. This flag
+ * ensures blk_update_request() does not free the bios or the request; that is
+ * done at the dm level.
+ */
+#define BIO_ENDIO_FREE 12
+/*
* Flags starting here get preserved by bio_reset() - this includes
* BIO_POOL_IDX()
*/
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 76d23fa..f636c50 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -407,6 +407,11 @@ union map_info *dm_get_rq_mapinfo(struct request *rq);

struct queue_limits *dm_get_queue_limits(struct mapped_device *md);

+void dm_end_request(struct request *clone, int error);
+void dm_kill_unmapped_request(struct request *rq, int error);
+void dm_dispatch_clone_request(struct request *clone, struct request *rq);
+struct request *dm_get_orig_rq(struct request *clone);
+
/*
* Geometry functions.
*/
--
1.7.9.5

2015-11-11 09:33:38

by Baolin Wang

Subject: [PATCH 2/2] md: dm-crypt: Introduce the request handling for dm-crypt

Some hardware can support big-block data encryption, but the original dm-crypt
only implements the 'bio-based' method, which limits efficiency for big-block
data encryption (it handles only one bio at a time).

This patch introduces the 'request-based' method to handle big blocks; one
request can contain more than one bio at a time for dm-crypt. A config macro is
used to enable the 'request-based' method while ensuring the original code
still runs successfully.
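One piece of this patch, req_crypt_clone_sg_entry(), pre-counts how many scatterlist entries a clone will need. The merging idea can be sketched in userspace as follows (a simplified model with illustrative names; the real code also checks BIOVEC_PHYS_MERGEABLE and queue segment boundaries):

```c
#include <stddef.h>

/* Userspace sketch: adjacent chunks merge into one scatterlist entry
 * while they stay contiguous and under the max segment size. All names
 * and the size limit here are illustrative, not kernel API. */
#define MAX_SEGMENT_SIZE 4096

struct chunk { size_t start; size_t len; };  /* stand-in for a bio_vec */

static unsigned count_sg_entries(const struct chunk *c, int n)
{
    unsigned entries = 0;
    size_t seg_len = 0;
    size_t next_addr = (size_t)-1; /* no previous chunk yet */
    int i;

    for (i = 0; i < n; i++) {
        /* Start a new entry on a gap or when the segment would
         * exceed the size limit; otherwise merge into the current one. */
        if (c[i].start != next_addr ||
            seg_len + c[i].len > MAX_SEGMENT_SIZE) {
            entries++;
            seg_len = 0;
        }
        seg_len += c[i].len;
        next_addr = c[i].start + c[i].len;
    }
    return entries;
}

/* Demos: contiguous chunks merge; a gap or the size cap splits them. */
static unsigned demo_contiguous(void)
{
    struct chunk c[] = { { 0, 1024 }, { 1024, 1024 } };
    return count_sg_entries(c, 2);
}

static unsigned demo_gap(void)
{
    struct chunk c[] = { { 0, 1024 }, { 8192, 1024 } };
    return count_sg_entries(c, 2);
}

static unsigned demo_size_cap(void)
{
    struct chunk c[] = { { 0, 2048 }, { 2048, 2048 },
                         { 4096, 2048 }, { 6144, 2048 } };
    return count_sg_entries(c, 4);
}
```

Pre-counting like this lets the table be sized with sg_alloc_table() before blk_rq_map_sg() fills it.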

Signed-off-by: Baolin Wang <[email protected]>
---
drivers/md/Kconfig | 6 +
drivers/md/dm-crypt.c | 831 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 835 insertions(+), 2 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index d5415ee..aea1db0 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -266,6 +266,12 @@ config DM_CRYPT

If unsure, say N.

+config DM_REQ_CRYPT
+ bool "Crypt target support with request"
+ depends on BLK_DEV_DM
+ select CRYPTO
+ select CRYPTO_CBC
+
config DM_SNAPSHOT
tristate "Snapshot target"
depends on BLK_DEV_DM
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index d60c88d..e21a1ed15 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -28,10 +28,13 @@
#include <crypto/hash.h>
#include <crypto/md5.h>
#include <crypto/algapi.h>
+#include <linux/buffer_head.h>

#include <linux/device-mapper.h>

#define DM_MSG_PREFIX "crypt"
+#define DM_MAX_SG_LIST (1024)
+#define BIO_INLINE_VECS (4)

/*
* context holding the current state of a multi-part conversion
@@ -64,10 +67,27 @@ struct dm_crypt_io {
struct rb_node rb_node;
} CRYPTO_MINALIGN_ATTR;

+struct dm_req_crypt_io {
+ struct crypt_config *cc;
+ struct work_struct work;
+ struct request *cloned_request;
+ struct convert_context ctx;
+
+ int error;
+ atomic_t pending;
+ sector_t sector;
+ struct rb_node rb_node;
+
+ bool should_encrypt;
+ bool should_decrypt;
+};
+
struct dm_crypt_request {
struct convert_context *ctx;
struct scatterlist sg_in;
struct scatterlist sg_out;
+ struct sg_table req_sgt_in;
+ struct sg_table req_sgt_out;
sector_t iv_sector;
};

@@ -127,6 +147,10 @@ struct crypt_config {
*/
mempool_t *req_pool;
mempool_t *page_pool;
+
+ struct kmem_cache *req_crypt_io_pool;
+ mempool_t *req_io_pool;
+
struct bio_set *bs;
struct mutex bio_alloc_lock;

@@ -184,6 +208,7 @@ struct crypt_config {
static void clone_init(struct dm_crypt_io *, struct bio *);
static void kcryptd_queue_crypt(struct dm_crypt_io *io);
static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+static int req_crypt_write_work(void *data);

/*
* Use this to access cipher attributes that are the same for each CPU.
@@ -1547,6 +1572,8 @@ static void crypt_dtr(struct dm_target *ti)
mempool_destroy(cc->page_pool);
if (cc->req_pool)
mempool_destroy(cc->req_pool);
+ if (cc->req_io_pool)
+ mempool_destroy(cc->req_io_pool);

if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
cc->iv_gen_ops->dtr(cc);
@@ -1556,6 +1583,7 @@ static void crypt_dtr(struct dm_target *ti)

kzfree(cc->cipher);
kzfree(cc->cipher_string);
+ kmem_cache_destroy(cc->req_crypt_io_pool);

/* Must zero key material before freeing */
kzfree(cc);
@@ -1796,7 +1824,19 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad;
}

- cc->bs = bioset_create(MIN_IOS, 0);
+ cc->req_crypt_io_pool = KMEM_CACHE(dm_req_crypt_io, 0);
+ if (!cc->req_crypt_io_pool) {
+ ti->error = "Cannot allocate req_crypt_io_pool";
+ goto bad;
+ }
+
+ cc->req_io_pool = mempool_create_slab_pool(MIN_IOS, cc->req_crypt_io_pool);
+ if (!cc->req_io_pool) {
+ ti->error = "Cannot allocate request io mempool";
+ goto bad;
+ }
+
+ cc->bs = bioset_create(BIO_MAX_PAGES, 0);
if (!cc->bs) {
ti->error = "Cannot allocate crypt bioset";
goto bad;
@@ -1880,7 +1920,12 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
init_waitqueue_head(&cc->write_thread_wait);
cc->write_tree = RB_ROOT;

+#ifndef CONFIG_DM_REQ_CRYPT
cc->write_thread = kthread_create(dmcrypt_write, cc, "dmcrypt_write");
+#else
+ cc->write_thread = kthread_create(req_crypt_write_work,
+ cc, "req_dmcrypt_write");
+#endif
if (IS_ERR(cc->write_thread)) {
ret = PTR_ERR(cc->write_thread);
cc->write_thread = NULL;
@@ -2045,14 +2090,796 @@ static int crypt_iterate_devices(struct dm_target *ti,
return fn(ti, cc->dev, cc->start, ti->len, data);
}

+/*
+ * If bio->bi_bdev is a partition, remap the location.
+ */
+static inline void req_crypt_blk_partition_remap(struct bio *bio)
+{
+ struct block_device *bdev = bio->bi_bdev;
+
+ if (bio_sectors(bio) && bdev != bdev->bd_contains) {
+ struct hd_struct *p = bdev->bd_part;
+ /* Check for integer overflow, should never happen. */
+ if (p->start_sect > (UINT_MAX - bio->bi_iter.bi_sector))
+ return;
+
+ bio->bi_iter.bi_sector += p->start_sect;
+ bio->bi_bdev = bdev->bd_contains;
+ }
+}
+
+static void req_crypt_dispatch_io(struct dm_req_crypt_io *io)
+{
+ struct request *clone = io->cloned_request;
+ struct request *rq = dm_get_orig_rq(clone);
+
+ dm_dispatch_clone_request(clone, rq);
+}
+
+static void req_crypt_free_resource(struct dm_req_crypt_io *io)
+{
+ struct crypt_config *cc = io->cc;
+ struct ablkcipher_request *req = io->ctx.req;
+ struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
+
+ if (dmreq->req_sgt_out.orig_nents > 0)
+ sg_free_table(&dmreq->req_sgt_out);
+
+ if (dmreq->req_sgt_in.orig_nents > 0)
+ sg_free_table(&dmreq->req_sgt_in);
+
+ mempool_free(req, cc->req_pool);
+ mempool_free(io, cc->req_io_pool);
+}
+
+static void req_crypt_inc_pending(struct dm_req_crypt_io *io)
+{
+ atomic_inc(&io->pending);
+}
+
+static void req_crypt_dec_pending_encrypt(struct dm_req_crypt_io *io)
+{
+ struct request *clone = io->cloned_request;
+ int error = io->error;
+
+ atomic_dec(&io->pending);
+
+ if (error < 0) {
+ dm_kill_unmapped_request(clone, error);
+ req_crypt_free_resource(io);
+ }
+}
+
+static void req_crypt_dec_pending_decrypt(struct dm_req_crypt_io *io)
+{
+ struct request *clone = io->cloned_request;
+ int error = io->error;
+
+ atomic_dec(&io->pending);
+
+ dm_end_request(clone, error);
+ req_crypt_free_resource(io);
+}
+
+/*
+ * This callback is called by the worker queue to perform non-encrypted writes
+ * and use the dm function to complete the bios and requests.
+ */
+static void req_crypt_write_plain(struct dm_req_crypt_io *io)
+{
+ io->error = 0;
+ req_crypt_dispatch_io(io);
+}
+
+/*
+ * This callback is called by the worker queue to perform non-decrypted reads
+ * and use the dm function to complete the bios and requests.
+ */
+static void req_crypt_read_plain(struct dm_req_crypt_io *io)
+{
+ struct crypt_config *cc = io->cc;
+ struct request *clone = io->cloned_request;
+
+ dm_end_request(clone, 0);
+ mempool_free(io, cc->req_io_pool);
+}
+
+#define req_crypt_io_from_node(node) rb_entry((node), struct dm_req_crypt_io, rb_node)
+static int req_crypt_write_work(void *data)
+{
+ struct crypt_config *cc = data;
+ struct dm_req_crypt_io *io;
+
+ while (1) {
+ struct rb_root write_tree;
+ struct blk_plug plug;
+ DECLARE_WAITQUEUE(wait, current);
+
+ spin_lock_irq(&cc->write_thread_wait.lock);
+
+continue_locked:
+ if (!RB_EMPTY_ROOT(&cc->write_tree))
+ goto pop_from_list;
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ __add_wait_queue(&cc->write_thread_wait, &wait);
+
+ spin_unlock_irq(&cc->write_thread_wait.lock);
+
+ if (unlikely(kthread_should_stop())) {
+ set_task_state(current, TASK_RUNNING);
+ remove_wait_queue(&cc->write_thread_wait, &wait);
+ break;
+ }
+
+ schedule();
+
+ set_task_state(current, TASK_RUNNING);
+ spin_lock_irq(&cc->write_thread_wait.lock);
+ __remove_wait_queue(&cc->write_thread_wait, &wait);
+ goto continue_locked;
+
+pop_from_list:
+ write_tree = cc->write_tree;
+ cc->write_tree = RB_ROOT;
+ spin_unlock_irq(&cc->write_thread_wait.lock);
+
+ BUG_ON(rb_parent(write_tree.rb_node));
+
+ blk_start_plug(&plug);
+ do {
+ io = req_crypt_io_from_node(rb_first(&write_tree));
+ rb_erase(&io->rb_node, &write_tree);
+ req_crypt_dispatch_io(io);
+ } while (!RB_EMPTY_ROOT(&write_tree));
+ blk_finish_plug(&plug);
+ }
+
+ return 0;
+}
+
+static void req_crypt_write_io_submit(struct dm_req_crypt_io *io, int async)
+{
+ struct crypt_config *cc = io->cc;
+ unsigned long flags;
+ sector_t sector;
+ struct rb_node **rbp, *parent;
+
+ if (io->error < 0)
+ return;
+
+ if (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) {
+ req_crypt_dispatch_io(io);
+ return;
+ }
+
+ spin_lock_irqsave(&cc->write_thread_wait.lock, flags);
+ rbp = &cc->write_tree.rb_node;
+ parent = NULL;
+ sector = io->sector;
+
+ while (*rbp) {
+ parent = *rbp;
+ if (sector < req_crypt_io_from_node(parent)->sector)
+ rbp = &(*rbp)->rb_left;
+ else
+ rbp = &(*rbp)->rb_right;
+ }
+
+ rb_link_node(&io->rb_node, parent, rbp);
+ rb_insert_color(&io->rb_node, &cc->write_tree);
+
+ wake_up_locked(&cc->write_thread_wait);
+ spin_unlock_irqrestore(&cc->write_thread_wait.lock, flags);
+}
+
+/*
+ * Cipher completion callback; this is triggered by the Linux crypto API once
+ * the operation is done. This signals the waiting thread that the crypto
+ * operation is complete.
+ */
+static void req_crypt_cipher_complete(struct crypto_async_request *req, int err)
+{
+ struct dm_crypt_request *dmreq = req->data;
+ struct convert_context *ctx = dmreq->ctx;
+ struct dm_req_crypt_io *io =
+ container_of(ctx, struct dm_req_crypt_io, ctx);
+ struct crypt_config *cc = io->cc;
+
+ if (err == -EINPROGRESS)
+ return;
+
+ io->error = err;
+ atomic_dec(&io->ctx.cc_pending);
+ complete(&io->ctx.restart);
+
+ if (!err && cc->iv_gen_ops && cc->iv_gen_ops->post)
+ err = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
+}
+
+static int req_crypt_alloc_req(struct crypt_config *cc,
+ struct convert_context *ctx)
+{
+ /* TODO: need to reconsider and modify here */
+ unsigned int key_index = ctx->cc_sector & (cc->tfms_count - 1);
+ struct dm_crypt_request *dmreq;
+
+ ctx->req = mempool_alloc(cc->req_pool, GFP_NOIO);
+ if (!ctx->req)
+ return -ENOMEM;
+
+ dmreq = dmreq_of_req(cc, ctx->req);
+ dmreq->req_sgt_in.orig_nents = 0;
+ dmreq->req_sgt_out.orig_nents = 0;
+
+ crypto_ablkcipher_clear_flags(cc->tfms[key_index], ~0);
+ ablkcipher_request_set_tfm(ctx->req, cc->tfms[key_index]);
+
+ /*
+ * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
+ * requests if the driver's request queue is full.
+ */
+ ablkcipher_request_set_callback(ctx->req,
+ CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
+ req_crypt_cipher_complete, dmreq_of_req(cc, ctx->req));
+
+ return 0;
+}
+
+/*
+ * Free the pages that were allocated for the write operation, and also
+ * free the bvecs if there are any.
+ */
+static void req_crypt_free_pages(struct crypt_config *cc, struct request *clone)
+{
+ struct req_iterator iter;
+ struct bio_vec bvec;
+ struct bio *bio_t;
+ int nr_iovecs = 0;
+
+ rq_for_each_segment(bvec, clone, iter) {
+ if (bvec.bv_offset == 0 && bvec.bv_page)
+ mempool_free(bvec.bv_page, cc->page_pool);
+ bvec.bv_page = NULL;
+ }
+
+ __rq_for_each_bio(bio_t, clone) {
+ nr_iovecs = bio_t->bi_max_vecs;
+ if (nr_iovecs > BIO_INLINE_VECS) {
+ BIO_BUG_ON(BIO_POOL_IDX(bio_t) >= BIOVEC_NR_POOLS);
+ bvec_free(cc->bs->bvec_pool, bio_t->bi_io_vec,
+ BIO_POOL_IDX(bio_t));
+ }
+ }
+}
+
+/*
+ * Allocate the pages for write operation.
+ */
+static int req_crypt_alloc_pages(struct crypt_config *cc, struct request *clone)
+{
+ gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
+ struct page *page = NULL;
+ struct bio_vec *bvl = NULL;
+ struct bio_vec *bv = NULL;
+ struct bio *bio_t = NULL;
+ unsigned long idx = BIO_POOL_NONE;
+ struct bio_vec bvec;
+ struct bvec_iter biter;
+ int nr_iovecs = 0, i = 0, remaining_size = 0;
+
+ /*
+ * When cloning the request, the bi_vcnt and bi_max_vecs of each
+ * bio are not copied, so we set them here.
+ */
+ __rq_for_each_bio(bio_t, clone) {
+ nr_iovecs = 0;
+ bio_for_each_segment(bvec, bio_t, biter)
+ nr_iovecs++;
+ bio_t->bi_vcnt = bio_t->bi_max_vecs = nr_iovecs;
+ }
+
+ /*
+ * Cloning the original request also clones its bios, but the pages
+ * the original bios point to are not copied; the cloned bios point
+ * at the same pages. So here we need to allocate new pages for the
+ * cloned bios to hand to the crypto system.
+ */
+ __rq_for_each_bio(bio_t, clone) {
+ nr_iovecs = bio_t->bi_max_vecs;
+ if (nr_iovecs > BIO_INLINE_VECS)
+ bvl = bvec_alloc(GFP_NOIO, nr_iovecs,
+ &idx, cc->bs->bvec_pool);
+ else if (nr_iovecs)
+ bvl = bio_t->bi_inline_vecs;
+
+ if (!bvl)
+ return -ENOMEM;
+
+ memcpy(bvl, bio_t->bi_io_vec,
+ nr_iovecs * sizeof(struct bio_vec));
+ bio_t->bi_max_vecs = nr_iovecs;
+ bio_t->bi_io_vec = bvl;
+ if (idx < BIO_POOL_NONE) {
+ bio_t->bi_flags &= ~(BIO_POOL_NONE << BIO_POOL_OFFSET);
+ bio_t->bi_flags |= idx << BIO_POOL_OFFSET;
+ }
+ }
+
+ __rq_for_each_bio(bio_t, clone) {
+ bio_for_each_segment_all(bv, bio_t, i) {
+ if (bv->bv_len > remaining_size) {
+ page = NULL;
+ while (page == NULL) {
+ page = mempool_alloc(cc->page_pool,
+ gfp_mask);
+ if (!page) {
+ DMERR("%s page alloc failed",
+ __func__);
+ congestion_wait(BLK_RW_ASYNC,
+ HZ/100);
+ }
+ }
+
+ bv->bv_page = page;
+ bv->bv_offset = 0;
+ remaining_size = PAGE_SIZE - bv->bv_len;
+ if (remaining_size < 0)
+ BUG();
+ } else {
+ bv->bv_page = page;
+ bv->bv_offset = PAGE_SIZE - remaining_size;
+ remaining_size = remaining_size - bv->bv_len;
+ }
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * Check how many sg entry numbers are needed when map one request
+ * with scatterlist in advance.
+ */
+static unsigned int req_crypt_clone_sg_entry(struct request *clone)
+{
+ struct request_queue *q = clone->q;
+ struct bio_vec bvec, bvprv = { NULL };
+ struct bio *bio_t = NULL;
+ struct bvec_iter biter;
+ unsigned int nbytes, sg_length, sg_cnt = 0;
+
+ __rq_for_each_bio(bio_t, clone) {
+ sg_length = 0;
+ bio_for_each_segment(bvec, bio_t, biter) {
+ nbytes = bvec.bv_len;
+ if (sg_length + nbytes > queue_max_segment_size(q)) {
+ sg_length = 0;
+ sg_cnt++;
+ goto next;
+ }
+
+ if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec)) {
+ sg_length = 0;
+ sg_cnt++;
+ goto next;
+ }
+
+ if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bvec)) {
+ sg_length = 0;
+ sg_cnt++;
+ goto next;
+ }
+
+ sg_length += nbytes;
+next:
+ memcpy(&bvprv, &bvec, sizeof(struct bio_vec));
+ }
+ }
+
+ return sg_cnt;
+}
+
+static int req_crypt_convert_block(struct crypt_config *cc,
+ struct request *clone,
+ struct convert_context *ctx)
+{
+ struct ablkcipher_request *req = ctx->req;
+ struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
+ u8 *iv = iv_of_dmreq(cc, dmreq);
+ struct scatterlist *req_sg_in = NULL;
+ struct scatterlist *req_sg_out = NULL;
+ unsigned int total_sg_len_req_in = 0;
+ unsigned int total_sg_len_req_out = 0;
+ unsigned int total_bytes_in_req = 0;
+ unsigned int sg_in_max = 0, sg_out_max = 0;
+ int ret;
+
+ dmreq->iv_sector = ctx->cc_sector;
+ dmreq->ctx = ctx;
+ atomic_set(&ctx->cc_pending, 1);
+
+ /*
+ * Calculate how many sg entries need to be used
+ * for this clone.
+ */
+ sg_in_max = req_crypt_clone_sg_entry(clone) + 1;
+ if (sg_in_max > DM_MAX_SG_LIST || sg_in_max <= 0) {
+ DMERR("%s sg entry too large or none %d\n",
+ __func__, sg_in_max);
+ return -EINVAL;
+ } else if (sg_in_max == 2) {
+ req_sg_in = &dmreq->sg_in;
+ }
+
+ if (!req_sg_in) {
+ ret = sg_alloc_table(&dmreq->req_sgt_in,
+ sg_in_max, GFP_KERNEL);
+ if (ret) {
+ DMERR("%s sg in allocation failed\n", __func__);
+ return -ENOMEM;
+ }
+
+ req_sg_in = dmreq->req_sgt_in.sgl;
+ }
+
+ total_sg_len_req_in = blk_rq_map_sg(clone->q, clone, req_sg_in);
+ if ((total_sg_len_req_in <= 0)
+ || (total_sg_len_req_in > sg_in_max)) {
+ DMERR("%s in sg map error %d\n", __func__, total_sg_len_req_in);
+ return -EINVAL;
+ }
+
+ total_bytes_in_req = clone->__data_len;
+
+ if (rq_data_dir(clone) == READ)
+ goto set_crypt;
+
+ ret = req_crypt_alloc_pages(cc, clone);
+ if (ret < 0) {
+ DMERR("%s alloc request pages failed\n", __func__);
+ return -ENOMEM;
+ }
+
+ sg_out_max = req_crypt_clone_sg_entry(clone) + 1;
+ if (sg_out_max > DM_MAX_SG_LIST || sg_out_max <= 0) {
+ DMERR("%s sg entry too large or none %d\n",
+ __func__, sg_out_max);
+ return -EINVAL;
+ } else if (sg_out_max == 2) {
+ req_sg_out = &dmreq->sg_out;
+ }
+
+ if (!req_sg_out) {
+ ret = sg_alloc_table(&dmreq->req_sgt_out,
+ sg_out_max, GFP_KERNEL);
+ if (ret) {
+ DMERR("%s sg out allocation failed\n", __func__);
+ return -ENOMEM;
+ }
+
+ req_sg_out = dmreq->req_sgt_out.sgl;
+ }
+
+ total_sg_len_req_out = blk_rq_map_sg(clone->q, clone, req_sg_out);
+ if ((total_sg_len_req_out <= 0) ||
+ (total_sg_len_req_out > sg_out_max)) {
+ DMERR("%s out sg map error %d\n",
+ __func__, total_sg_len_req_out);
+ return -EINVAL;
+ }
+
+set_crypt:
+ if (cc->iv_gen_ops) {
+ ret = cc->iv_gen_ops->generator(cc, iv, dmreq);
+ if (ret < 0) {
+ DMERR("%s generator iv error %d\n", __func__, ret);
+ return ret;
+ }
+ }
+
+ atomic_inc(&ctx->cc_pending);
+
+ if (rq_data_dir(clone) == WRITE) {
+ ablkcipher_request_set_crypt(req, req_sg_in,
+ req_sg_out, total_bytes_in_req, iv);
+
+ ret = crypto_ablkcipher_encrypt(req);
+ } else {
+ ablkcipher_request_set_crypt(req, req_sg_in,
+ req_sg_in, total_bytes_in_req, iv);
+
+ ret = crypto_ablkcipher_decrypt(req);
+ }
+
+ if (!ret && cc->iv_gen_ops && cc->iv_gen_ops->post)
+ ret = cc->iv_gen_ops->post(cc, iv, dmreq);
+
+ return ret;
+}
+
+static void req_crypt_write_convert(struct dm_req_crypt_io *io)
+{
+ struct request *clone = io->cloned_request;
+ struct bio *bio_src = NULL;
+ struct crypt_config *cc = io->cc;
+ int crypt_finished;
+ int ret = 0, err = 0;
+
+ req_crypt_inc_pending(io);
+
+ crypt_convert_init(cc, &io->ctx, NULL, NULL, io->sector);
+ req_crypt_alloc_req(cc, &io->ctx);
+
+ ret = req_crypt_convert_block(cc, clone, &io->ctx);
+ switch (ret) {
+ case 0:
+ atomic_dec(&io->ctx.cc_pending);
+ break;
+ case -EBUSY:
+ /*
+ * Let's make this request synchronous by waiting on the
+ * in-progress completion as well.
+ */
+ case -EINPROGRESS:
+ wait_for_completion_io(&io->ctx.restart);
+ if (io->error) {
+ err = -EIO;
+ goto crypt_error;
+ }
+ break;
+ default:
+ err = -EIO;
+ atomic_dec(&io->ctx.cc_pending);
+ break;
+ }
+
+ __rq_for_each_bio(bio_src, clone)
+ blk_queue_bounce(clone->q, &bio_src);
+
+crypt_error:
+ if (err == -EIO)
+ req_crypt_free_pages(cc, clone);
+
+ if (io)
+ io->error = err;
+
+ /* Encryption was already finished, submit io now */
+ crypt_finished = atomic_dec_and_test(&io->ctx.cc_pending);
+ if (crypt_finished)
+ req_crypt_write_io_submit(io, 0);
+ else
+ io->error = -EIO;
+
+ req_crypt_dec_pending_encrypt(io);
+}
+
+static void req_crypt_read_convert(struct dm_req_crypt_io *io)
+{
+ struct crypt_config *cc = io->cc;
+ struct request *clone = io->cloned_request;
+ int ret = 0, err = 0;
+
+ req_crypt_inc_pending(io);
+
+ /* io->sector needs to be initialized */
+ crypt_convert_init(cc, &io->ctx, NULL, NULL, io->sector);
+ req_crypt_alloc_req(cc, &io->ctx);
+
+ ret = req_crypt_convert_block(cc, clone, &io->ctx);
+ switch (ret) {
+ case 0:
+ atomic_dec(&io->ctx.cc_pending);
+ break;
+ case -EBUSY:
+ /*
+ * Let's make this request synchronous by waiting on the
+ * in-progress completion as well.
+ */
+ case -EINPROGRESS:
+ wait_for_completion_io(&io->ctx.restart);
+ if (io->error)
+ err = -EIO;
+ break;
+ default:
+ err = -EIO;
+ atomic_dec(&io->ctx.cc_pending);
+ break;
+ }
+
+ if (io)
+ io->error = err;
+
+ if (!atomic_dec_and_test(&io->ctx.cc_pending))
+ DMWARN("%s decryption was not finished\n", __func__);
+
+ req_crypt_dec_pending_decrypt(io);
+}
+
+/* Work function run from the crypt workqueue. */
+static void req_crypt_work(struct work_struct *work)
+{
+ struct dm_req_crypt_io *io =
+ container_of(work, struct dm_req_crypt_io, work);
+
+ if (rq_data_dir(io->cloned_request) == WRITE) {
+ if (io->should_encrypt)
+ req_crypt_write_convert(io);
+ else
+ req_crypt_write_plain(io);
+ } else if (rq_data_dir(io->cloned_request) == READ) {
+ if (io->should_decrypt)
+ req_crypt_read_convert(io);
+ else
+ req_crypt_read_plain(io);
+ } else {
+ DMERR("%s received non-write request for clone 0x%p\n",
+ __func__, io->cloned_request);
+ }
+}
+
+static void req_crypt_queue(struct dm_req_crypt_io *io)
+{
+ struct crypt_config *cc = io->cc;
+
+ INIT_WORK(&io->work, req_crypt_work);
+ queue_work(cc->crypt_queue, &io->work);
+}
+
+static bool req_crypt_should_encrypt(struct dm_req_crypt_io *req)
+{
+ if (!req || !req->cloned_request || !req->cloned_request->bio)
+ return false;
+
+ /* Maybe there are other conditions to be considered */
+ return true;
+}
+
+static bool req_crypt_should_deccrypt(struct dm_req_crypt_io *req)
+{
+ if (!req || !req->cloned_request || !req->cloned_request->bio)
+ return false;
+
+ /* Maybe there are other conditions to be considered */
+ return true;
+}
+
+static void crypt_req_io_init(struct dm_req_crypt_io *io,
+ struct crypt_config *cc,
+ struct request *clone,
+ sector_t sector)
+{
+ io->cc = cc;
+ io->sector = sector;
+ io->cloned_request = clone;
+ io->error = 0;
+ io->ctx.req = NULL;
+ atomic_set(&io->pending, 0);
+
+ if (rq_data_dir(clone) == WRITE)
+ io->should_encrypt = req_crypt_should_encrypt(io);
+ else if (rq_data_dir(clone) == READ)
+ io->should_decrypt = req_crypt_should_deccrypt(io);
+ else
+ io->should_decrypt = 0;
+}
+
+/*
+ * This function is called with interrupts disabled.
+ * It remaps the clone for the underlying device. If it is a write
+ * request, it calls into the worker queue to encrypt the data and
+ * submits the request directly using the elevator. For a read
+ * request no pre-processing is required; the request is returned
+ * to dm once mapping is done.
+ */
+static int req_crypt_map(struct dm_target *ti, struct request *clone,
+ union map_info *map_context)
+{
+ struct crypt_config *cc = ti->private;
+ int copy_bio_sector_to_req = 0;
+ struct dm_req_crypt_io *req_io;
+ struct bio *bio_src;
+
+ if ((rq_data_dir(clone) != READ) && (rq_data_dir(clone) != WRITE)) {
+ DMERR("%s unknown request.\n", __func__);
+ return -EINVAL;
+ }
+
+ req_io = mempool_alloc(cc->req_io_pool, GFP_NOWAIT);
+ if (!req_io) {
+ DMERR("%s req io allocation failed.\n", __func__);
+ return -ENOMEM;
+ }
+
+ map_context->ptr = req_io;
+
+ /* Get the queue of the underlying original device */
+ clone->q = bdev_get_queue(cc->dev->bdev);
+ clone->rq_disk = cc->dev->bdev->bd_disk;
+
+ __rq_for_each_bio(bio_src, clone) {
+ bio_src->bi_bdev = cc->dev->bdev;
+ /*
+ * If the request is REQ_FLUSH or REQ_DISCARD, just bypass the
+ * crypt queues; the block layer will free the request's bios
+ * when completing the bypassed request.
+ */
+ if (clone->cmd_flags & REQ_DISCARD
+ || clone->cmd_flags & REQ_FLUSH)
+ continue;
+
+ bio_set_flag(bio_src, BIO_ENDIO_FREE);
+
+ /*
+ * If this device has partitions, remap block n
+ * of partition p to block n+start(p) of the disk.
+ */
+ req_crypt_blk_partition_remap(bio_src);
+ if (copy_bio_sector_to_req == 0) {
+ clone->__sector = bio_src->bi_iter.bi_sector;
+ copy_bio_sector_to_req++;
+ }
+ blk_queue_bounce(clone->q, &bio_src);
+ }
+
+ crypt_req_io_init(req_io, cc, clone,
+ dm_target_offset(ti, clone->__sector));
+
+ if (rq_data_dir(clone) == READ) {
+ return DM_MAPIO_REMAPPED;
+ } else if (rq_data_dir(clone) == WRITE) {
+ req_crypt_queue(req_io);
+ return DM_MAPIO_SUBMITTED;
+ }
+
+ return -EINVAL;
+}
+
+/*
+ * The endio function is called from ksoftirqd context (atomic).
+ * For write operations, the new pages created from the mempool
+ * are freed and returned. For read operations, decryption is
+ * required; since this is called in an atomic context, the
+ * request is sent to a worker queue to complete decryption and
+ * free the request once done.
+ */
+static int req_crypt_endio(struct dm_target *ti, struct request *clone,
+ int error, union map_info *map_context)
+{
+ struct dm_req_crypt_io *req_io = map_context->ptr;
+ struct crypt_config *cc = ti->private;
+ int ret = 0;
+
+ /* For a write request, just free the allocated pages and resources. */
+ if (rq_data_dir(clone) == WRITE) {
+ if (req_io->should_encrypt)
+ req_crypt_free_pages(cc, clone);
+ req_crypt_free_resource(req_io);
+ } else if (rq_data_dir(clone) == READ) {
+ req_io->error = error;
+ req_crypt_queue(req_io);
+ ret = DM_ENDIO_INCOMPLETE;
+ }
+
+ return ret;
+}
+
static struct target_type crypt_target = {
.name = "crypt",
.version = {1, 14, 0},
.module = THIS_MODULE,
.ctr = crypt_ctr,
.dtr = crypt_dtr,
- .map = crypt_map,
.status = crypt_status,
+#ifndef CONFIG_DM_REQ_CRYPT
+ .map = crypt_map,
+#else
+ .map_rq = req_crypt_map,
+ .rq_end_io = req_crypt_endio,
+#endif
.postsuspend = crypt_postsuspend,
.preresume = crypt_preresume,
.resume = crypt_resume,
--
1.7.9.5

2015-11-11 09:48:21

by Christoph Hellwig

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
> The dm-crypt code currently implements only the 'bio-based' method to
> encrypt/decrypt block data, which can handle only one bio at a time. As we
> know, a bio must cover a sequential physical address range and it also has a
> length limitation. This may limit big-block encryption/decryption when some
> hardware supports big-block data encryption.
>
> This patch series introduces the 'request-based' method to handle data
> encryption/decryption. One request can contain multiple bios, so it can
> handle big blocks of data to improve efficiency.

NAK for more request based stacking or DM drivers. They are a major
pain to deal with, and adding more with different requirements than
dm-multipath is not helping in actually making that one work properly.

2015-11-11 17:54:16

by Mike Snitzer

Subject: Re: [PATCH 1/2] block: Introduce BIO_ENDIO_FREE for bio flags

On Wed, Nov 11 2015 at 4:31am -0500,
Baolin Wang <[email protected]> wrote:

> When we use dm-crypt to decrypt block data, it decrypts the data in endio()
> when one IO is completed. In this situation we don't want the cloned bios to
> be freed before endio() is called.
>
> Thus, introduce the 'BIO_ENDIO_FREE' flag to support request handling for
> dm-crypt. This flag ensures that the block layer does not complete the cloned
> bios before completing the request. When the crypt endio is called,
> post-processing is done and then the dm layer completes the bios (clones) and
> frees them.

Not following why request-based DM's partial completion handling
(drivers/md/dm.c:end_clone_bio) isn't a sufficient hook -- no need to
add block complexity.

But that aside, I'm not liking the idea of a request-based dm-crypt.

> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 76d23fa..f636c50 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -407,6 +407,11 @@ union map_info *dm_get_rq_mapinfo(struct request *rq);
>
> struct queue_limits *dm_get_queue_limits(struct mapped_device *md);
>
> +void dm_end_request(struct request *clone, int error);
> +void dm_kill_unmapped_request(struct request *rq, int error);
> +void dm_dispatch_clone_request(struct request *clone, struct request *rq);
> +struct request *dm_get_orig_rq(struct request *clone);
> +
> /*
> * Geometry functions.
> */

I have no interest in seeing any request-based DM interfaces exported.

2015-11-11 18:18:18

by Mike Snitzer

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Wed, Nov 11 2015 at 4:31am -0500,
Baolin Wang <[email protected]> wrote:

> Now the dm-crypt code implements only the bio-based method to encrypt/
> decrypt block data, which can handle only one bio at a time. As we know,
> one bio must use sequential physical addresses and it also has a length
> limitation. Thus it may limit big-block encryption/decryption when some
> hardware supports big-block data encryption.
>
> This patch series introduces the request-based method to handle data
> encryption/decryption. One request can contain multiple bios, so it can
> handle big blocks of data to improve efficiency.

The duality of bio-based vs request-based code paths in DM core frankly
sucks. So the prospect of polluting dm-crypt with a similar duality is
really _not_ interesting.

Request-based DM requires more memory reserves per device than bio-based
DM. Also, you cannot stack request-based DM ontop of bio-based devices
(be them DM, MD, etc) so request-based DM's underlying storage stack
gets a lot less interesting with this change.

That said, it could be that the benefits of supporting both bio-based
and request-based DM in dm-crypt outweigh any overhead/limitations. But
you haven't given any performance data to justify this patchset.

There needs to be a _really_ compelling benefit to do this.

Also, FYI, having a big CONFIG knob to switch all of dm-crypt from
bio-based to request-based is _not_ acceptable. Both modes would need
to be supported in parallel. Could easily be that not all devices in a
system will benefit from being request-based.

Regardless, the risk of this change causing request-based DM to become
more brittle than it already is concerns me.

But I'm trying to keep an open mind... show me data that real hardware
_really_ benefits and we'll go from there. Again, it needs to be "OMG,
this is amazing!" level performance to warrant any further serious
consideration.

2015-11-12 02:15:36

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
> On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
>> Now the dm-crypt code implements only the bio-based method to encrypt/
>> decrypt block data, which can handle only one bio at a time. As we know,
>> one bio must use sequential physical addresses and it also has a length
>> limitation. Thus it may limit big-block encryption/decryption when some
>> hardware supports big-block data encryption.
>>
>> This patch series introduces the request-based method to handle data
>> encryption/decryption. One request can contain multiple bios, so it can
>> handle big blocks of data to improve efficiency.
>
> NAK for more request based stacking or DM drivers. They are a major
> pain to deal with, and adding more with different requirements then
> dm-multipath is not helping in actually making that one work properly.

But many vendors now supply hardware engines to handle
encryption/decryption. The hardware really needs a big block to show
its performance with the request-based approach. Another thing is that
the request-based approach is used by many vendors (Qualcomm,
Spreadtrum and so on) to improve their performance, and there is a real
performance requirement here (I can show the performance results later).

I don't think you need to worry about whether either one can work
properly; we will remove the bio-based code in the future if the
request-based code is accepted and proven effective. Thanks for your
comment.

--
Baolin.wang
Best Regards

2015-11-12 02:36:37

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 02:18, Mike Snitzer <[email protected]> wrote:
> On Wed, Nov 11 2015 at 4:31am -0500,
> Baolin Wang <[email protected]> wrote:
>
>> Now the dm-crypt code implements only the bio-based method to encrypt/
>> decrypt block data, which can handle only one bio at a time. As we know,
>> one bio must use sequential physical addresses and it also has a length
>> limitation. Thus it may limit big-block encryption/decryption when some
>> hardware supports big-block data encryption.
>>
>> This patch series introduces the request-based method to handle data
>> encryption/decryption. One request can contain multiple bios, so it can
>> handle big blocks of data to improve efficiency.
>
> The duality of bio-based vs request-based code paths in DM core frankly
> sucks. So the prospect of polluting dm-crypt with a similar duality is
> really _not_ interesting.
>

That's right. But we will not introduce that duality, because we will
remove the bio-based code in dm-crypt if the request-based code is
accepted.

> Request-based DM requires more memory reserves per device than bio-based
> DM. Also, you cannot stack request-based DM ontop of bio-based devices
> (be them DM, MD, etc) so request-based DM's underlying storage stack
> gets a lot less interesting with this change.
>

Yes, request-based DM requires more memory than bio-based, but not too
much more. And the request-based approach brings a big performance improvement.

> That said, it could be that the benefits of supporting both bio-based
> and request-based DM in dm-crypt outweigh any overhead/limitations. But
> you haven't given any performance data to justify this patchset.
>

Like I said above, we plan to remove the bio-based code, which does not
support hardware-engine encryption well. And I'll show you the
performance data to demonstrate that the request-based approach performs well.

> There needs to be a _really_ compelling benefit to do this.
>
> Also, FYI, having a big CONFIG knob to switch all of dm-crypt from
> bio-based to request-based is _not_ acceptable. Both modes would need
> to be supported in parallel. Could easily be that not all devices in a
> system will benefit from being request-based.
>

OK. The CONFIG option is not suitable here. I'll remove it and just
enable the request-based path.

> Regardless, the risk of this change causing request-based DM to become
> more brittle than it already is concerns me.
>
> But I'm trying to keep an open mind... show me data that real hardware
> _really_ benefits and we'll go from there. Again, it needs to be "OMG,
> this is amazing!" level performance to warrant any further serious
> consideration.

OK. I'll show you the performance data. Thanks for your comments.



--
Baolin.wang
Best Regards

2015-11-12 04:05:34

by Baolin Wang

Subject: Re: [PATCH 1/2] block: Introduce BIO_ENDIO_FREE for bio flags

On 12 November 2015 at 01:54, Mike Snitzer <[email protected]> wrote:
> On Wed, Nov 11 2015 at 4:31am -0500,
> Baolin Wang <[email protected]> wrote:
>
>> When we use dm-crypt to decrypt block data, it decrypts the block data
>> in endio() when one IO completes. In this situation we don't want the
>> cloned bios to be freed before endio() is called.
>>
>> Thus introduce the 'BIO_ENDIO_FREE' flag to support request handling
>> for dm-crypt; this flag ensures that the block layer does not complete
>> the cloned bios before completing the request. When the crypt endio is
>> called, post-processing is done and then the dm layer completes the
>> bios (clones) and frees them.
>
> Not following why request-based DM's partial completion handling
> (drivers/md/dm.c:end_clone_bio) isn't a sufficient hook -- no need to
> add block complexity.
>

Sorry for the lack of explanation. dm-crypt will decrypt block data in
the end_io() callback when one request completes, so we don't want the
bios of this request to be freed before the end_io() callback runs.
Thus we introduce a flag to indicate that this request's bios will be
freed in the dm layer, not in the block layer.

> But that aside, I'm not liking the idea of a request-based dm-crypt.
>
>> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
>> index 76d23fa..f636c50 100644
>> --- a/include/linux/device-mapper.h
>> +++ b/include/linux/device-mapper.h
>> @@ -407,6 +407,11 @@ union map_info *dm_get_rq_mapinfo(struct request *rq);
>>
>> struct queue_limits *dm_get_queue_limits(struct mapped_device *md);
>>
>> +void dm_end_request(struct request *clone, int error);
>> +void dm_kill_unmapped_request(struct request *rq, int error);
>> +void dm_dispatch_clone_request(struct request *clone, struct request *rq);
>> +struct request *dm_get_orig_rq(struct request *clone);
>> +
>> /*
>> * Geometry functions.
>> */
>
> I have no interest in seeing any request-based DM interfaces exported.

OK.


--
Baolin.wang
Best Regards

2015-11-12 08:20:45

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 02:18, Mike Snitzer <[email protected]> wrote:
> On Wed, Nov 11 2015 at 4:31am -0500,
> Baolin Wang <[email protected]> wrote:
>
>> Now the dm-crypt code implements only the bio-based method to encrypt/
>> decrypt block data, which can handle only one bio at a time. As we know,
>> one bio must use sequential physical addresses and it also has a length
>> limitation. Thus it may limit big-block encryption/decryption when some
>> hardware supports big-block data encryption.
>>
>> This patch series introduces the request-based method to handle data
>> encryption/decryption. One request can contain multiple bios, so it can
>> handle big blocks of data to improve efficiency.
>
> The duality of bio-based vs request-based code paths in DM core frankly
> sucks. So the prospect of polluting dm-crypt with a similar duality is
> really _not_ interesting.
>
> Request-based DM requires more memory reserves per device than bio-based
> DM. Also, you cannot stack request-based DM ontop of bio-based devices
> (be them DM, MD, etc) so request-based DM's underlying storage stack
> gets a lot less interesting with this change.
>
> That said, it could be that the benefits of supporting both bio-based
> and request-based DM in dm-crypt outweigh any overhead/limitations. But
> you haven't given any performance data to justify this patchset.
>
> There needs to be a _really_ compelling benefit to do this.
>
> Also, FYI, having a big CONFIG knob to switch all of dm-crypt from
> bio-based to request-based is _not_ acceptable. Both modes would need
> to be supported in parallel. Could easily be that not all devices in a
> system will benefit from being request-based.
>
> Regardless, the risk of this change causing request-based DM to become
> more brittle than it already is concerns me.
>
> But I'm trying to keep an open mind... show me data that real hardware
> _really_ benefits and we'll go from there. Again, it needs to be "OMG,
> this is amazing!" level performance to warrant any further serious
> consideration.

Thanks for your suggestions. But let me explain it again. Many vendors
now supply encryption hardware (such as an AES engine) to accelerate
encryption/decryption by handling a big block at one time. So if we
want the hardware engine to deliver its best performance, the size of
the block handled at one time needs to be expanded.

But bio-based dm-crypt can handle only one bio at a time; one bio has a
size limitation, and one bio's size cannot make the hardware engine
reach its best performance. So we want to introduce request-based
dm-crypt. With the request-based approach, sequential bios can be
merged into one request, expanding the IO size into a big block that
the hardware engine handles at one time. With hardware acceleration,
this improves the encryption/decryption speed.

I think three questions need to be clarified.

1. Are there ways of enhancing the dm-crypt bio-based target to overcome this?
The issue is the size limitation of one bio; its size cannot meet the
hardware requirement. But one request can reach a big block size by
merging multiple bios. So I think the request is the best choice.

2. Would any sort of bio aggregation mechanism help?
The block layer already combines sequential bios into a request
automatically. A bio aggregation mechanism would be similar, so why
reimplement it?

3. Performance data
This is just a simple dd test result; a formal report will follow. But
even from this simple test, we can see the improvement.
Hardware environment:
Board: BeagleBone Black
Processor: AM335x 1GHz ARM Cortex-A8
RAM: 512M
Cipher: cbc(aes) with the AES hardware engine

(1) bio-based dm-crypt with hardware acceleration:
read 64M command: dd if=/dev/dm-0 of=/dev/null bs=512k count=128 iflag=direct
67108864 bytes (67 MB) copied, 11.6592 s, 5.8 MB/s
67108864 bytes (67 MB) copied, 11.6391 s, 5.8 MB/s
67108864 bytes (67 MB) copied, 11.6296 s, 5.8 MB/s

(2) request-based dm-crypt with hardware acceleration:
read 64M command: dd if=/dev/dm-0 of=/dev/null bs=512k count=128 iflag=direct
67108864 bytes (67 MB) copied, 5.16586 s, 13.0 MB/s
67108864 bytes (67 MB) copied, 5.19338 s, 12.9 MB/s
67108864 bytes (67 MB) copied, 5.19169 s, 12.9 MB/s

(3) bio-based dm-crypt with hardware acceleration:
write 64M command: dd if=/dev/zero of=/dev/dm-0 bs=512k count=128 iflag=direct
67108864 bytes (67 MB) copied, 13.6852 s, 4.9 MB/s
67108864 bytes (67 MB) copied, 14.0873 s, 4.8 MB/s
67108864 bytes (67 MB) copied, 13.6649 s, 4.9 MB/s

(4) request-based dm-crypt with hardware acceleration:
write 64M command: dd if=/dev/zero of=/dev/dm-0 bs=512k count=128 iflag=direct
67108864 bytes (67 MB) copied, 7.27832 s, 9.2 MB/s
67108864 bytes (67 MB) copied, 7.29051 s, 9.2 MB/s
67108864 bytes (67 MB) copied, 7.28318 s, 9.2 MB/s

From these simple results, we can see roughly a 2x improvement in
encryption performance.



--
Baolin.wang
Best Regards

2015-11-12 09:05:49

by Christoph Hellwig

Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On Wed, Nov 11, 2015 at 01:18:13PM -0500, Mike Snitzer wrote:
> But I'm trying to keep an open mind... show me data that real hardware
> _really_ benefits and we'll go from there. Again, it needs to be "OMG,
> this is amazing!" level performance to warrant any further serious
> consideration.

Also the numbers should be on 4.3+ with our arbitrarily sized bios,
and a file system that isn't stupid and actually submits bios (xfs,
btrfs for example).

2015-11-12 09:06:44

by Christoph Hellwig

Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu, Nov 12, 2015 at 10:36:34AM +0800, Baolin Wang wrote:
> That's right. But we'll not introduce the duality things, cause we
> will remove the bio based things in dm-crypt if the request based
> things are accepted.

No, you will NOT remove the bio-based path. That would break all kinds
of perfectly valid setups.

2015-11-12 09:17:40

by Jan Kara

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu 12-11-15 10:15:32, Baolin Wang wrote:
> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
> >> Now the dm-crypt code implements only the bio-based method to encrypt/
> >> decrypt block data, which can handle only one bio at a time. As we know,
> >> one bio must use sequential physical addresses and it also has a length
> >> limitation. Thus it may limit big-block encryption/decryption when some
> >> hardware supports big-block data encryption.
> >>
> >> This patch series introduces the request-based method to handle data
> >> encryption/decryption. One request can contain multiple bios, so it can
> >> handle big blocks of data to improve efficiency.
> >
> > NAK for more request based stacking or DM drivers. They are a major
> > pain to deal with, and adding more with different requirements then
> > dm-multipath is not helping in actually making that one work properly.
>
> But many vendors now supply hardware engines to handle
> encryption/decryption. The hardware really needs a big block to show
> its performance with the request-based approach. Another thing is that
> the request-based approach is used by many vendors (Qualcomm,
> Spreadtrum and so on) to improve their performance, and there is a real
> performance requirement here (I can show the performance results later).

So you've mentioned several times that hardware needs big blocks. How
big do those blocks need to be? Ideally, can you give some numbers on
how the throughput of the encryption hw grows with the block size?

Because as Mike had said there are downsides to having request based
dm-crypt as well. E.g. if you want to have encrypted raid5 volume then
you'd rather want to put encryption on top of raid5 (easier management,
larger sequential blocks to encrypt, ...) but you cannot do that when
dm-crypt would be request based. So modifying bio-based dm-crypt to form
larger chunks for encryption HW would be superior in this regard.

You mentioned that you use requests because of size limitations on bios - I
had a look and current struct bio can easily describe 1MB requests (that's
assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
struct bio_vec. Is that not enough?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-12 09:41:08

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
> On Thu 12-11-15 10:15:32, Baolin Wang wrote:
>> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
>> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
>> >> Now the dm-crypt code implements only the bio-based method to encrypt/
>> >> decrypt block data, which can handle only one bio at a time. As we know,
>> >> one bio must use sequential physical addresses and it also has a length
>> >> limitation. Thus it may limit big-block encryption/decryption when some
>> >> hardware supports big-block data encryption.
>> >>
>> >> This patch series introduces the request-based method to handle data
>> >> encryption/decryption. One request can contain multiple bios, so it can
>> >> handle big blocks of data to improve efficiency.
>> >
>> > NAK for more request based stacking or DM drivers. They are a major
>> > pain to deal with, and adding more with different requirements then
>> > dm-multipath is not helping in actually making that one work properly.
>>
>> But many vendors now supply hardware engines to handle
>> encryption/decryption. The hardware really needs a big block to show
>> its performance with the request-based approach. Another thing is that
>> the request-based approach is used by many vendors (Qualcomm,
>> Spreadtrum and so on) to improve their performance, and there is a real
>> performance requirement here (I can show the performance results later).
>
> So you've mentioned several times that hardware needs big blocks. How
> big do those blocks need to be? Ideally, can you give some numbers on
> how the throughput of the encryption hw grows with the block size?

It depends on the hardware design. My BeagleBone Black board's AES
engine can handle 1M at one time, which is not big. As far as I know,
some other AES engines can handle 16M of data at one time or more.

>
> Because as Mike had said there are downsides to having request based
> dm-crypt as well. E.g. if you want to have encrypted raid5 volume then
> you'd rather want to put encryption on top of raid5 (easier management,
> larger sequential blocks to encrypt, ...) but you cannot do that when
> dm-crypt would be request based. So modifying bio-based dm-crypt to form
> larger chunks for encryption HW would be superior in this regard.
>

Make sense.

> You mentioned that you use requests because of size limitations on bios - I
> had a look and current struct bio can easily describe 1MB requests (that's
> assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
> struct bio_vec. Is that not enough?

Usually one bio does not use the full 1M; it may be 1k/2k/8k or some
other small chunk. But a request can combine sequential small bios
into a big block, so it is better than a bio at least.

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR



--
Baolin.wang
Best Regards

2015-11-12 10:04:43

by Mark Brown

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu, Nov 12, 2015 at 04:20:41PM +0800, Baolin Wang wrote:

> 3. perforamence data
> It is just a simple dd test result, and will provide the formal report
> in future. But from the simple test, we can see the improvement.

It's probably also worth pointing out that Qualcomm have been shipping
an out-of-tree implementation of this as a separate module in their BSP
(originally written by Dinesh Garg, who's on this thread):

https://android.googlesource.com/kernel/msm/+/android-msm-dory-3.10-kitkat-wear/drivers/md/dm-req-crypt.c

Android now wants to encrypt phones and tablets by default and has been
seeing substantial performance hits as a result; we can try to get
people to share performance data from production-ish systems but it
might be difficult.



2015-11-12 11:06:23

by Jan Kara

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu 12-11-15 17:40:59, Baolin Wang wrote:
> On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
> > On Thu 12-11-15 10:15:32, Baolin Wang wrote:
> >> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
> >> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
> >> >> Now the dm-crypt code implements only the bio-based method to encrypt/
> >> >> decrypt block data, which can handle only one bio at a time. As we know,
> >> >> one bio must use sequential physical addresses and it also has a length
> >> >> limitation. Thus it may limit big-block encryption/decryption when some
> >> >> hardware supports big-block data encryption.
> >> >>
> >> >> This patch series introduces the request-based method to handle data
> >> >> encryption/decryption. One request can contain multiple bios, so it can
> >> >> handle big blocks of data to improve efficiency.
> >> >
> >> > NAK for more request based stacking or DM drivers. They are a major
> >> > pain to deal with, and adding more with different requirements then
> >> > dm-multipath is not helping in actually making that one work properly.
> >>
> >> But many vendors now supply hardware engines to handle
> >> encryption/decryption. The hardware really needs a big block to show
> >> its performance with the request-based approach. Another thing is that
> >> the request-based approach is used by many vendors (Qualcomm,
> >> Spreadtrum and so on) to improve their performance, and there is a real
> >> performance requirement here (I can show the performance results later).
> >
> > So you've mentioned several times that hardware needs big blocks. How
> > big do those blocks need to be? Ideally, can you give some numbers on
> > how the throughput of the encryption hw grows with the block size?
>
> It depends on the hardware design. My BeagleBone Black board's AES
> engine can handle 1M at one time, which is not big. As far as I know,
> some other AES engines can handle 16M of data at one time or more.

Well, one question is "can handle" and another question is how big a
gain in throughput it will bring compared to, say, 1M chunks. I suppose
there's some constant overhead to issue a request to the crypto hw, and
by the time it is encrypting 1M it may be that this overhead is well
amortized by the cost of the encryption itself, which is in principle
linear in the size of the block. That's why I'd like to get an idea of
the real numbers...

> > You mentioned that you use requests because of size limitations on bios - I
> > had a look and current struct bio can easily describe 1MB requests (that's
> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
> > struct bio_vec. Is that not enough?
>
> Usually one bio does not use the full 1M; it may be 1k/2k/8k or some
> other small chunk. But a request can combine sequential small bios
> into a big block, so it is better than a bio at least.

As Christoph mentions 4.3 should be better in submitting larger bios. Did
you check it?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-12 11:46:32

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
> On Thu 12-11-15 17:40:59, Baolin Wang wrote:
>> On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
>> > On Thu 12-11-15 10:15:32, Baolin Wang wrote:
>> >> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
>> >> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
>> >> >> Now the dm-crypt code implements only the bio-based method to encrypt/
>> >> >> decrypt block data, which can handle only one bio at a time. As we know,
>> >> >> one bio must use sequential physical addresses and it also has a length
>> >> >> limitation. Thus it may limit big-block encryption/decryption when some
>> >> >> hardware supports big-block data encryption.
>> >> >>
>> >> >> This patch series introduces the request-based method to handle data
>> >> >> encryption/decryption. One request can contain multiple bios, so it can
>> >> >> handle big blocks of data to improve efficiency.
>> >> >
>> >> > NAK for more request based stacking or DM drivers. They are a major
>> >> > pain to deal with, and adding more with different requirements then
>> >> > dm-multipath is not helping in actually making that one work properly.
>> >>
>> >> But many vendors now supply hardware engines to handle
>> >> encryption/decryption. The hardware really needs a big block to show
>> >> its performance with the request-based approach. Another thing is that
>> >> the request-based approach is used by many vendors (Qualcomm,
>> >> Spreadtrum and so on) to improve their performance, and there is a real
>> >> performance requirement here (I can show the performance results later).
>> >
>> > So you've mentioned several times that hardware needs big blocks. How
>> > big do those blocks need to be? Ideally, can you give some numbers on
>> > how the throughput of the encryption hw grows with the block size?
>>
>> It depends on the hardware design. My BeagleBone Black board's AES
>> engine can handle 1M at one time, which is not big. As far as I know,
>> some other AES engines can handle 16M of data at one time or more.
>
> Well, one question is "can handle" and another question is how big a
> gain in throughput it will bring compared to, say, 1M chunks. I suppose
> there's some constant overhead to issue a request to the crypto hw, and
> by the time it is encrypting 1M it may be that this overhead is well
> amortized by the cost of the encryption itself, which is in principle
> linear in the size of the block. That's why I'd like to get an idea of
> the real numbers...

Please correct me if I misunderstood your point. Let's suppose the AES
engine can handle 16M at one time. If the data size is less than 16M,
the engine can handle it in one pass. But if the data size is 20M (more
than 16M), the engine driver will split it into 16M and 4M pieces. I
cannot give exact numbers, but I think the engine prefers big chunks
to small chunks, which is the hardware engine's advantage.

>
>> > You mentioned that you use requests because of size limitations on bios - I
>> > had a look and current struct bio can easily describe 1MB requests (that's
>> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
>> > struct bio_vec. Is that not enough?
>>
>> Usually one bio does not use the full 1M; it may be 1k/2k/8k or some
>> other small chunk. But a request can combine sequential small bios
>> into a big block, so it is better than a bio at least.
>
> As Christoph mentions 4.3 should be better in submitting larger bios. Did
> you check it?

I'm sorry, I didn't check it. What is the size limit of one bio on 4.3?
But I think it is better to choose the bigger unit (one request) for
crypto, which suits the hardware engine.

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR



--
Baolin.wang
Best Regards

2015-11-12 12:24:09

by Jan Kara

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu 12-11-15 19:46:26, Baolin Wang wrote:
> On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
> > On Thu 12-11-15 17:40:59, Baolin Wang wrote:
> >> On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
> >> > On Thu 12-11-15 10:15:32, Baolin Wang wrote:
> >> >> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
> >> >> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
> >> >> >> Now the dm-crypt code implements only the bio-based method to encrypt/
> >> >> >> decrypt block data, which can handle only one bio at a time. As we know,
> >> >> >> one bio must use sequential physical addresses and it also has a length
> >> >> >> limitation. Thus it may limit big-block encryption/decryption when some
> >> >> >> hardware supports big-block data encryption.
> >> >> >>
> >> >> >> This patch series introduces the request-based method to handle data
> >> >> >> encryption/decryption. One request can contain multiple bios, so it can
> >> >> >> handle big blocks of data to improve efficiency.
> >> >> >
> >> >> > NAK for more request based stacking or DM drivers. They are a major
> >> >> > pain to deal with, and adding more with different requirements then
> >> >> > dm-multipath is not helping in actually making that one work properly.
> >> >>
> >> >> But many vendors now supply hardware engines to handle
> >> >> encryption/decryption. The hardware really needs a big block to show
> >> >> its performance with the request-based approach. Another thing is that
> >> >> the request-based approach is used by many vendors (Qualcomm,
> >> >> Spreadtrum and so on) to improve their performance, and there is a real
> >> >> performance requirement here (I can show the performance results later).
> >> >
> >> > So you've mentioned several times that hardware needs big blocks. How
> >> > big do those blocks need to be? Ideally, can you give some numbers on
> >> > how the throughput of the encryption hw grows with the block size?
> >>
> >> It depends on the hardware design. My beaglebone black board's AES
> >> engine can handle 1M at one time which is not big. As I know some
> >> other AES engine can handle 16M data at one time or more.
> >
> > Well, one question is "can handle" and other question is how big gain in
> > throughput it will bring compared to say 1M chunks. I suppose there's some
> > constant overhead to issue a request to the crypto hw and by the time it is
> > encrypting 1M it may be that this overhead is well amortized by the cost of
> > the encryption itself which is in principle linear in the size of the
> > block. That's why I'd like to get idea of the real numbers...
>
> Please correct me if I misunderstood your point. Let's suppose the AES
> engine can handle 16M at one time. If we give the size of data is less
> than 16M, the engine can handle it at one time. But if the data size
> is 20M (more than 16M), the engine driver will split the data with 16M
> and 4M to deal with. I can not say how many numbers, but I think the
> engine is like to big chunks than small chunks which is the hardware
> engine's advantage.

No, I meant something different. I meant that if HW can encrypt 1M in say
1.05 ms and it can encrypt 16M in 16.05 ms, then although using 16 M blocks
gives you some advantage it becomes diminishingly small.

> >> > You mentioned that you use requests because of size limitations on bios - I
> >> > had a look and current struct bio can easily describe 1MB requests (that's
> >> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
> >> > struct bio_vec. Is that not enough?
> >>
> >> Usually one bio does not always use the full 1M, maybe some 1k/2k/8k
> >> or some other small chunks. But request can combine some sequential
> >> small bios to be a big block and it is better than bio at least.
> >
> > As Christoph mentions 4.3 should be better in submitting larger bios. Did
> > you check it?
>
> I'm sorry I didn't check it. What's the limitation of one bio on 4.3?

On 4.3 it is 1 MB (which should be enough because requests are limited to
512 KB by default anyway). Previously the maximum bio size depended on the
queue parameters such as max number of segments etc.
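
For reference, the 1 MB figure follows from simple arithmetic, assuming the
64-bit struct bio_vec layout of 16 bytes (page pointer plus 32-bit length
and offset):

```shell
# One 4 KB page of bio_vecs, 16 bytes each on 64-bit, addresses
# 256 pages of 4 KB data: 1 MB per bio.
PAGE_SIZE=4096
BIO_VEC_SIZE=16
echo $(( (PAGE_SIZE / BIO_VEC_SIZE) * PAGE_SIZE ))   # prints 1048576
```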

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-12 12:51:13

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 20:24, Jan Kara <[email protected]> wrote:
> On Thu 12-11-15 19:46:26, Baolin Wang wrote:
>> On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
>> > On Thu 12-11-15 17:40:59, Baolin Wang wrote:
>> >> On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
>> >> > On Thu 12-11-15 10:15:32, Baolin Wang wrote:
>> >> >> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
>> >> >> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
>> >> >> >> Now the dm-crypt code only implemented the 'based-bio' method to encrypt/
>> >> >> >> decrypt block data, which can only hanle one bio at one time. As we know,
>> >> >> >> one bio must use the sequential physical address and it also has a limitation
>> >> >> >> of length. Thus it may limit the big block encyrtion/decryption when some
>> >> >> >> hardware support the big block data encryption.
>> >> >> >>
>> >> >> >> This patch series introduc the 'based-request' method to handle the data
>> >> >> >> encryption/decryption. One request can contain multiple bios, so it can
>> >> >> >> handle big block data to improve the efficiency.
>> >> >> >
>> >> >> > NAK for more request based stacking or DM drivers. They are a major
>> >> >> > pain to deal with, and adding more with different requirements then
>> >> >> > dm-multipath is not helping in actually making that one work properly.
>> >> >>
>> >> >> But now many vendors supply the hardware engine to handle the
>> >> >> encyrtion/decryption. The hardware really need a big block to indicate
>> >> >> its performance with request based things. Another thing is now the
>> >> >> request based things is used by many vendors (Qualcomm, Spreadtrum and
>> >> >> so on) to improve their performance and there's a real performance
>> >> >> requirement here (I can show the performance result later).
>> >> >
>> >> > So you've mentioned several times that hardware needs big blocks. How big
>> >> > those blocks need to be? Ideally, can you give some numbers on how the
>> >> > throughput of the encryption hw grows with the block size?
>> >>
>> >> It depends on the hardware design. My beaglebone black board's AES
>> >> engine can handle 1M at one time which is not big. As I know some
>> >> other AES engine can handle 16M data at one time or more.
>> >
>> > Well, one question is "can handle" and other question is how big gain in
>> > throughput it will bring compared to say 1M chunks. I suppose there's some
>> > constant overhead to issue a request to the crypto hw and by the time it is
>> > encrypting 1M it may be that this overhead is well amortized by the cost of
>> > the encryption itself which is in principle linear in the size of the
>> > block. That's why I'd like to get idea of the real numbers...
>>
>> Please correct me if I misunderstood your point. Let's suppose the AES
>> engine can handle 16M at one time. If we give the size of data is less
>> than 16M, the engine can handle it at one time. But if the data size
>> is 20M (more than 16M), the engine driver will split the data with 16M
>> and 4M to deal with. I can not say how many numbers, but I think the
>> engine is like to big chunks than small chunks which is the hardware
>> engine's advantage.
>
> No, I meant something different. I meant that if HW can encrypt 1M in say
> 1.05 ms and it can encrypt 16M in 16.05 ms, then although using 16 M blocks
> gives you some advantage it becomes diminishingly small.
>

But if it encrypts 16M in 1M chunks one by one, it will take much more than
16.05 ms (we should also consider that the SW submits the bios one by one).

>> >> > You mentioned that you use requests because of size limitations on bios - I
>> >> > had a look and current struct bio can easily describe 1MB requests (that's
>> >> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
>> >> > struct bio_vec. Is that not enough?
>> >>
>> >> Usually one bio does not always use the full 1M, maybe some 1k/2k/8k
>> >> or some other small chunks. But request can combine some sequential
>> >> small bios to be a big block and it is better than bio at least.
>> >
>> > As Christoph mentions 4.3 should be better in submitting larger bios. Did
>> > you check it?
>>
>> I'm sorry I didn't check it. What's the limitation of one bio on 4.3?
>
> On 4.3 it is 1 MB (which should be enough because requests are limited to
> 512 KB by default anyway). Previously the maximum bio size depended on the
> queue parameters such as max number of segments etc.
>

But it may not be enough for a HW engine which can handle perhaps 10M/20M
at one time.


> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR



--
Baolin.wang
Best Regards

2015-11-12 12:58:24

by Arnd Bergmann

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thursday 12 November 2015 20:51:10 Baolin Wang wrote:
> On 12 November 2015 at 20:24, Jan Kara <[email protected]> wrote:
> > On Thu 12-11-15 19:46:26, Baolin Wang wrote:
> >> On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
> >> > Well, one question is "can handle" and other question is how big gain in
> >> > throughput it will bring compared to say 1M chunks. I suppose there's some
> >> > constant overhead to issue a request to the crypto hw and by the time it is
> >> > encrypting 1M it may be that this overhead is well amortized by the cost of
> >> > the encryption itself which is in principle linear in the size of the
> >> > block. That's why I'd like to get idea of the real numbers...
> >>
> >> Please correct me if I misunderstood your point. Let's suppose the AES
> >> engine can handle 16M at one time. If we give the size of data is less
> >> than 16M, the engine can handle it at one time. But if the data size
> >> is 20M (more than 16M), the engine driver will split the data with 16M
> >> and 4M to deal with. I can not say how many numbers, but I think the
> >> engine is like to big chunks than small chunks which is the hardware
> >> engine's advantage.
> >
> > No, I meant something different. I meant that if HW can encrypt 1M in say
> > 1.05 ms and it can encrypt 16M in 16.05 ms, then although using 16 M blocks
> > gives you some advantage it becomes diminishingly small.
> >
>
> But if it encrypts 16M with 1M one by one, it will be much more than
> 16.05ms (should be consider the SW submits bio one by one).

The example that Jan gave was meant to illustrate the case where it's not
much more than 16.05ms, just slightly more.

The point is that we need real numbers to show at what size we stop
getting significant returns from increased block sizes.

> >> >> > You mentioned that you use requests because of size limitations on bios - I
> >> >> > had a look and current struct bio can easily describe 1MB requests (that's
> >> >> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
> >> >> > struct bio_vec. Is that not enough?
> >> >>
> >> >> Usually one bio does not always use the full 1M, maybe some 1k/2k/8k
> >> >> or some other small chunks. But request can combine some sequential
> >> >> small bios to be a big block and it is better than bio at least.
> >> >
> >> > As Christoph mentions 4.3 should be better in submitting larger bios. Did
> >> > you check it?
> >>
> >> I'm sorry I didn't check it. What's the limitation of one bio on 4.3?
> >
> > On 4.3 it is 1 MB (which should be enough because requests are limited to
> > 512 KB by default anyway). Previously the maximum bio size depended on the
> > queue parameters such as max number of segments etc.
>
> But it maybe not enough for HW engine which can handle maybe 10M/20M
> at one time.

Given that you have already done measurements, can you find out how much
you lose in overall performance with your existing patch if you artificially
limit the maximum size to sizes like 256kb, 1MB, 4MB, ...?

Arnd

2015-11-12 12:59:08

by Jan Kara

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu 12-11-15 20:51:10, Baolin Wang wrote:
> On 12 November 2015 at 20:24, Jan Kara <[email protected]> wrote:
> > On Thu 12-11-15 19:46:26, Baolin Wang wrote:
> >> On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
> >> > On Thu 12-11-15 17:40:59, Baolin Wang wrote:
> >> >> On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
> >> >> > On Thu 12-11-15 10:15:32, Baolin Wang wrote:
> >> >> >> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
> >> >> >> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
> >> >> >> >> Now the dm-crypt code only implemented the 'based-bio' method to encrypt/
> >> >> >> >> decrypt block data, which can only hanle one bio at one time. As we know,
> >> >> >> >> one bio must use the sequential physical address and it also has a limitation
> >> >> >> >> of length. Thus it may limit the big block encyrtion/decryption when some
> >> >> >> >> hardware support the big block data encryption.
> >> >> >> >>
> >> >> >> >> This patch series introduc the 'based-request' method to handle the data
> >> >> >> >> encryption/decryption. One request can contain multiple bios, so it can
> >> >> >> >> handle big block data to improve the efficiency.
> >> >> >> >
> >> >> >> > NAK for more request based stacking or DM drivers. They are a major
> >> >> >> > pain to deal with, and adding more with different requirements then
> >> >> >> > dm-multipath is not helping in actually making that one work properly.
> >> >> >>
> >> >> >> But now many vendors supply the hardware engine to handle the
> >> >> >> encyrtion/decryption. The hardware really need a big block to indicate
> >> >> >> its performance with request based things. Another thing is now the
> >> >> >> request based things is used by many vendors (Qualcomm, Spreadtrum and
> >> >> >> so on) to improve their performance and there's a real performance
> >> >> >> requirement here (I can show the performance result later).
> >> >> >
> >> >> > So you've mentioned several times that hardware needs big blocks. How big
> >> >> > those blocks need to be? Ideally, can you give some numbers on how the
> >> >> > throughput of the encryption hw grows with the block size?
> >> >>
> >> >> It depends on the hardware design. My beaglebone black board's AES
> >> >> engine can handle 1M at one time which is not big. As I know some
> >> >> other AES engine can handle 16M data at one time or more.
> >> >
> >> > Well, one question is "can handle" and other question is how big gain in
> >> > throughput it will bring compared to say 1M chunks. I suppose there's some
> >> > constant overhead to issue a request to the crypto hw and by the time it is
> >> > encrypting 1M it may be that this overhead is well amortized by the cost of
> >> > the encryption itself which is in principle linear in the size of the
> >> > block. That's why I'd like to get idea of the real numbers...
> >>
> >> Please correct me if I misunderstood your point. Let's suppose the AES
> >> engine can handle 16M at one time. If we give the size of data is less
> >> than 16M, the engine can handle it at one time. But if the data size
> >> is 20M (more than 16M), the engine driver will split the data with 16M
> >> and 4M to deal with. I can not say how many numbers, but I think the
> >> engine is like to big chunks than small chunks which is the hardware
> >> engine's advantage.
> >
> > No, I meant something different. I meant that if HW can encrypt 1M in say
> > 1.05 ms and it can encrypt 16M in 16.05 ms, then although using 16 M blocks
> > gives you some advantage it becomes diminishingly small.
> >
>
> But if it encrypts 16M with 1M one by one, it will be much more than
> 16.05ms (should be consider the SW submits bio one by one).

Really? In my example, it would take 16.8 ms if we encrypted 16M in 1M
chunks and 16.05 ms if done in one chunk. That is a difference for which I
would not be willing to bend over backwards. Now these numbers are
completely made up and that's why I wanted to see the real numbers...
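
To spell that arithmetic out (using the same made-up figures: 0.05 ms of
fixed setup per request plus 1 ms per MB of data, illustrations rather than
measurements):

```shell
# Fixed per-request setup cost plus linear encryption cost; the
# numbers are the illustrative ones from the discussion, not real.
awk 'BEGIN {
    setup = 0.05; rate = 1.0            # ms per request, ms per MB
    printf "16 x 1M chunks: %.2f ms\n", 16 * setup + 16 * rate
    printf " 1 x 16M chunk: %.2f ms\n",  1 * setup + 16 * rate
}'
```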

> >> >> > You mentioned that you use requests because of size limitations on bios - I
> >> >> > had a look and current struct bio can easily describe 1MB requests (that's
> >> >> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
> >> >> > struct bio_vec. Is that not enough?
> >> >>
> >> >> Usually one bio does not always use the full 1M, maybe some 1k/2k/8k
> >> >> or some other small chunks. But request can combine some sequential
> >> >> small bios to be a big block and it is better than bio at least.
> >> >
> >> > As Christoph mentions 4.3 should be better in submitting larger bios. Did
> >> > you check it?
> >>
> >> I'm sorry I didn't check it. What's the limitation of one bio on 4.3?
> >
> > On 4.3 it is 1 MB (which should be enough because requests are limited to
> > 512 KB by default anyway). Previously the maximum bio size depended on the
> > queue parameters such as max number of segments etc.
>
> But it maybe not enough for HW engine which can handle maybe 10M/20M
> at one time.

Currently, you would not be able to create larger than 512K / 1M chunks
even with request based dm-crypt, since requests have limits on the amount
of data they can carry as well... So this is a kind of abstract discussion.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-12 15:03:22

by Mark Brown

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu, Nov 12, 2015 at 01:57:27PM +0100, Arnd Bergmann wrote:
> On Thursday 12 November 2015 20:51:10 Baolin Wang wrote:

> > But it maybe not enough for HW engine which can handle maybe 10M/20M
> > at one time.

> Given that you have already done measurements, can you find out how much
> you lose in overall performance with your existing patch if you artificially
> limit the maximum size to sizes like 256kb, 1MB, 4MB, ...?

It's probably also worth looking at the impact on CPU utilisation as
well as throughput in your benchmarking since the system will often not
be idle when it's doing a lot of I/O - I know you've done some
measurements in that area before, including them when looking at block
sizes might be interesting.



2015-11-12 15:26:31

by Jens Axboe

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 11/12/2015 03:04 AM, Mark Brown wrote:
> On Thu, Nov 12, 2015 at 04:20:41PM +0800, Baolin Wang wrote:
>
>> 3. perforamence data
>> It is just a simple dd test result, and will provide the formal report
>> in future. But from the simple test, we can see the improvement.
>
> It's probably also worth pointing out that Qualcomm have been shipping
> an out of tree implementation of this as a separate module in their BSP
> (originally written by Danesh Garg who's on this thread):
>
> https://android.googlesource.com/kernel/msm/+/android-msm-dory-3.10-kitkat-wear/drivers/md/dm-req-crypt.c
>
> Android now wants to encrypt phones and tablets by default and have been
> seeing substantial performance hits as a result, we can try to get
> people to share performance data from productionish systems but it might
> be difficult.

Well, shame on them for developing out-of-tree, looks like they are
reaping all the benefits of that.

Guys, we need some numbers, enough with the hand waving. There's no
point discussing this further until we know how much of a difference it
makes to handle X MB chunks instead of Y MB chunks. As was previously
stated, unless there's a _substantial_ performance benefit, this
patchset isn't going anywhere.

If there is a huge benefit, we can look into ways of making it actually
work. That may not even be a request interface, it could just be proper
utilization of plugging for in-dm bio merging.

--
Jens Axboe

2015-11-13 02:05:32

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 20:59, Jan Kara <[email protected]> wrote:
> On Thu 12-11-15 20:51:10, Baolin Wang wrote:
>> On 12 November 2015 at 20:24, Jan Kara <[email protected]> wrote:
>> > On Thu 12-11-15 19:46:26, Baolin Wang wrote:
>> >> On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
>> >> > On Thu 12-11-15 17:40:59, Baolin Wang wrote:
>> >> >> On 12 November 2015 at 17:17, Jan Kara <[email protected]> wrote:
>> >> >> > On Thu 12-11-15 10:15:32, Baolin Wang wrote:
>> >> >> >> On 11 November 2015 at 17:48, Christoph Hellwig <[email protected]> wrote:
>> >> >> >> > On Wed, Nov 11, 2015 at 05:31:43PM +0800, Baolin Wang wrote:
>> >> >> >> >> Now the dm-crypt code only implemented the 'based-bio' method to encrypt/
>> >> >> >> >> decrypt block data, which can only hanle one bio at one time. As we know,
>> >> >> >> >> one bio must use the sequential physical address and it also has a limitation
>> >> >> >> >> of length. Thus it may limit the big block encyrtion/decryption when some
>> >> >> >> >> hardware support the big block data encryption.
>> >> >> >> >>
>> >> >> >> >> This patch series introduc the 'based-request' method to handle the data
>> >> >> >> >> encryption/decryption. One request can contain multiple bios, so it can
>> >> >> >> >> handle big block data to improve the efficiency.
>> >> >> >> >
>> >> >> >> > NAK for more request based stacking or DM drivers. They are a major
>> >> >> >> > pain to deal with, and adding more with different requirements then
>> >> >> >> > dm-multipath is not helping in actually making that one work properly.
>> >> >> >>
>> >> >> >> But now many vendors supply the hardware engine to handle the
>> >> >> >> encyrtion/decryption. The hardware really need a big block to indicate
>> >> >> >> its performance with request based things. Another thing is now the
>> >> >> >> request based things is used by many vendors (Qualcomm, Spreadtrum and
>> >> >> >> so on) to improve their performance and there's a real performance
>> >> >> >> requirement here (I can show the performance result later).
>> >> >> >
>> >> >> > So you've mentioned several times that hardware needs big blocks. How big
>> >> >> > those blocks need to be? Ideally, can you give some numbers on how the
>> >> >> > throughput of the encryption hw grows with the block size?
>> >> >>
>> >> >> It depends on the hardware design. My beaglebone black board's AES
>> >> >> engine can handle 1M at one time which is not big. As I know some
>> >> >> other AES engine can handle 16M data at one time or more.
>> >> >
>> >> > Well, one question is "can handle" and other question is how big gain in
>> >> > throughput it will bring compared to say 1M chunks. I suppose there's some
>> >> > constant overhead to issue a request to the crypto hw and by the time it is
>> >> > encrypting 1M it may be that this overhead is well amortized by the cost of
>> >> > the encryption itself which is in principle linear in the size of the
>> >> > block. That's why I'd like to get idea of the real numbers...
>> >>
>> >> Please correct me if I misunderstood your point. Let's suppose the AES
>> >> engine can handle 16M at one time. If we give the size of data is less
>> >> than 16M, the engine can handle it at one time. But if the data size
>> >> is 20M (more than 16M), the engine driver will split the data with 16M
>> >> and 4M to deal with. I can not say how many numbers, but I think the
>> >> engine is like to big chunks than small chunks which is the hardware
>> >> engine's advantage.
>> >
>> > No, I meant something different. I meant that if HW can encrypt 1M in say
>> > 1.05 ms and it can encrypt 16M in 16.05 ms, then although using 16 M blocks
>> > gives you some advantage it becomes diminishingly small.
>> >
>>
>> But if it encrypts 16M with 1M one by one, it will be much more than
>> 16.05ms (should be consider the SW submits bio one by one).
>
> Really? In my example, it would take 16.8 ms if we encrypted 16M in 1M
> chunks and 16.05 ms if done in one chunk. That is a difference for which I
> would not be willing to bend over backwards. Now these numbers are
> completely made up and that's why I wanted to see the real numbers...
>

Well, I did a simple test with dd reading. Since my engine's limit is 1M:
(1) The times below are for handling 1M at one time:
1048576 bytes (1.0 MB) copied, 0.0841235 s, 12.5 MB/s
1048576 bytes (1.0 MB) copied, 0.0836294 s, 12.5 MB/s
1048576 bytes (1.0 MB) copied, 0.0836526 s, 12.5 MB/s

(2) These handle 64K at one time, 16 times:
1048576 bytes (1.0 MB) copied, 0.0937223 s, 11.2 MB/s
1048576 bytes (1.0 MB) copied, 0.097205 s, 10.8 MB/s
1048576 bytes (1.0 MB) copied, 0.0935884 s, 11.2 MB/s

There is a difference on the 10 ms level here; imagine how much larger it
becomes when the hardware engine's throughput is bigger than that. But like
Jens said, we can measure it with the performance data. Thanks.
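
(For reference, a dd comparison of this kind can be driven along these
lines; the paths here are stand-ins, with a plain file replacing the real
dm-crypt device:)

```shell
# Sketch of the dd comparison; substitute the real dm-crypt device
# (e.g. a /dev/mapper/... node) for the stand-in file used here.
DEV=${TMPDIR:-/tmp}/dd-test.bin
dd if=/dev/zero of="$DEV" bs=1M count=1 2>/dev/null   # 1M of test data

# (1) handle 1M at one time
dd if="$DEV" of=/dev/null bs=1M count=1

# (2) handle 64K at one time, 16 times
dd if="$DEV" of=/dev/null bs=64k count=16

rm -f "$DEV"
```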

>> >> >> > You mentioned that you use requests because of size limitations on bios - I
>> >> >> > had a look and current struct bio can easily describe 1MB requests (that's
>> >> >> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
>> >> >> > struct bio_vec. Is that not enough?
>> >> >>
>> >> >> Usually one bio does not always use the full 1M, maybe some 1k/2k/8k
>> >> >> or some other small chunks. But request can combine some sequential
>> >> >> small bios to be a big block and it is better than bio at least.
>> >> >
>> >> > As Christoph mentions 4.3 should be better in submitting larger bios. Did
>> >> > you check it?
>> >>
>> >> I'm sorry I didn't check it. What's the limitation of one bio on 4.3?
>> >
>> > On 4.3 it is 1 MB (which should be enough because requests are limited to
>> > 512 KB by default anyway). Previously the maximum bio size depended on the
>> > queue parameters such as max number of segments etc.
>>
>> But it maybe not enough for HW engine which can handle maybe 10M/20M
>> at one time.
>
> Currently, you would not be able to create larger than 512K / 1M chunks
> even with request based dm-crypt since requests have limits on number of
> data they can carry as well... So this is kind of abstract discussion.
>

OK. But in that case I think the default limitation for the DM device
should be changed.

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR



--
Baolin.wang
Best Regards

2015-11-13 02:07:27

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 23:26, Jens Axboe <[email protected]> wrote:
> On 11/12/2015 03:04 AM, Mark Brown wrote:
>>
>> On Thu, Nov 12, 2015 at 04:20:41PM +0800, Baolin Wang wrote:
>>
>>> 3. perforamence data
>>> It is just a simple dd test result, and will provide the formal report
>>> in future. But from the simple test, we can see the improvement.
>>
>>
>> It's probably also worth pointing out that Qualcomm have been shipping
>> an out of tree implementation of this as a separate module in their BSP
>> (originally written by Danesh Garg who's on this thread):
>>
>>
>> https://android.googlesource.com/kernel/msm/+/android-msm-dory-3.10-kitkat-wear/drivers/md/dm-req-crypt.c
>>
>> Android now wants to encrypt phones and tablets by default and have been
>> seeing substantial performance hits as a result, we can try to get
>> people to share performance data from productionish systems but it might
>> be difficult.
>
>
> Well, shame on them for developing out-of-tree, looks like they are reaping
> all the benefits of that.
>
> Guys, we need some numbers, enough with the hand waving. There's no point
> discussing this further until we know how much of a difference it makes to
> handle X MB chunks instead of Y MB chunks. As was previously stated, unless
> there's a _substantial_ performance benefit, this patchset isn't going
> anywhere.

That's fair enough, and we will provide performance data to evaluate
the patchset.

>
> If there is a huge benefit, we can look into ways of making it actually
> work. That may not even be a request interface, it could just be proper
> utilization of plugging for in-dm bio merging.

Makes sense. Thanks.

>
> --
> Jens Axboe
>



--
Baolin.wang
Best Regards

2015-11-13 03:25:40

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 20:57, Arnd Bergmann <[email protected]> wrote:
> On Thursday 12 November 2015 20:51:10 Baolin Wang wrote:
>> On 12 November 2015 at 20:24, Jan Kara <[email protected]> wrote:
>> > On Thu 12-11-15 19:46:26, Baolin Wang wrote:
>> >> On 12 November 2015 at 19:06, Jan Kara <[email protected]> wrote:
>> >> > Well, one question is "can handle" and other question is how big gain in
>> >> > throughput it will bring compared to say 1M chunks. I suppose there's some
>> >> > constant overhead to issue a request to the crypto hw and by the time it is
>> >> > encrypting 1M it may be that this overhead is well amortized by the cost of
>> >> > the encryption itself which is in principle linear in the size of the
>> >> > block. That's why I'd like to get idea of the real numbers...
>> >>
>> >> Please correct me if I misunderstood your point. Let's suppose the AES
>> >> engine can handle 16M at one time. If we give the size of data is less
>> >> than 16M, the engine can handle it at one time. But if the data size
>> >> is 20M (more than 16M), the engine driver will split the data with 16M
>> >> and 4M to deal with. I can not say how many numbers, but I think the
>> >> engine is like to big chunks than small chunks which is the hardware
>> >> engine's advantage.
>> >
>> > No, I meant something different. I meant that if HW can encrypt 1M in say
>> > 1.05 ms and it can encrypt 16M in 16.05 ms, then although using 16 M blocks
>> > gives you some advantage it becomes diminishingly small.
>> >
>>
>> But if it encrypts 16M with 1M one by one, it will be much more than
>> 16.05ms (should be consider the SW submits bio one by one).
>
> The example that Jan gave was meant to illustrate the case where it's not
> much more than 16.05ms, just slightly more.
>
> The point is that we need real numbers to show at what size we stop
> getting significant returns from increased block sizes.
>

Got it. Thanks.

>> >> >> > You mentioned that you use requests because of size limitations on bios - I
>> >> >> > had a look and current struct bio can easily describe 1MB requests (that's
>> >> >> > assuming 64-bit architecture, 4KB pages) when we have 1 page worth of
>> >> >> > struct bio_vec. Is that not enough?
>> >> >>
>> >> >> Usually one bio does not always use the full 1M, maybe some 1k/2k/8k
>> >> >> or some other small chunks. But request can combine some sequential
>> >> >> small bios to be a big block and it is better than bio at least.
>> >> >
>> >> > As Christoph mentions 4.3 should be better in submitting larger bios. Did
>> >> > you check it?
>> >>
>> >> I'm sorry I didn't check it. What's the limitation of one bio on 4.3?
>> >
>> > On 4.3 it is 1 MB (which should be enough because requests are limited to
>> > 512 KB by default anyway). Previously the maximum bio size depended on the
>> > queue parameters such as max number of segments etc.
>>
>> But it maybe not enough for HW engine which can handle maybe 10M/20M
>> at one time.
>
> Given that you have already done measurements, can you find out how much
> you lose in overall performance with your existing patch if you artificially
> limit the maximum size to sizes like 256kb, 1MB, 4MB, ...?
>

Because my board's AES engine limit is 1M, I just did a simple dd test
with small chunks. The results are in my last email.

> Arnd



--
Baolin.wang
Best Regards

2015-11-13 03:27:32

by Baolin Wang

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 12 November 2015 at 23:02, Mark Brown <[email protected]> wrote:
> On Thu, Nov 12, 2015 at 01:57:27PM +0100, Arnd Bergmann wrote:
>> On Thursday 12 November 2015 20:51:10 Baolin Wang wrote:
>
>> > But it maybe not enough for HW engine which can handle maybe 10M/20M
>> > at one time.
>
>> Given that you have already done measurements, can you find out how much
>> you lose in overall performance with your existing patch if you artificially
>> limit the maximum size to sizes like 256kb, 1MB, 4MB, ...?
>
> It's probably also worth looking at the impact on CPU utilisation as
> well as throughput in your benchmarking since the system will often not
> be idle when it's doing a lot of I/O - I know you've done some
> measurements in that area before, including them when looking at block
> sizes might be interesting.

Makes sense.



--
Baolin.wang
Best Regards

2015-11-13 09:06:25

by Arnd Bergmann

Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Friday 13 November 2015 10:05:28 Baolin Wang wrote:
>
> Well, I did a simple test with dd reading, cause my engine limitation is 1M,
> (1) so the time like below when handle 1M at one time.
> 1048576 bytes (1.0 MB) copied, 0.0841235 s, 12.5 MB/s
> 1048576 bytes (1.0 MB) copied, 0.0836294 s, 12.5 MB/s
> 1048576 bytes (1.0 MB) copied, 0.0836526 s, 12.5 MB/s
>
> (2) These handle 64K at one time * 16 times
> 1048576 bytes (1.0 MB) copied, 0.0937223 s, 11.2 MB/s
> 1048576 bytes (1.0 MB) copied, 0.097205 s, 10.8 MB/s
> 1048576 bytes (1.0 MB) copied, 0.0935884 s, 11.2 MB/s
>
> Here is a 10ms level difference, try to image if the hardware engine's
> throughput is bigger than that. But like Jens said, we can measure it
> by the performance data.

The absolute numbers look really low. Does this include writing to
a hard drive? That would certainly make the difference appear
less significant.

Could you try backing this with a ram disk backing for comparison,
and also use 'time dd' to show the CPU utilization for all cases?
For completeness, including cpu-only performance might also help
put this into perspective.

Arnd

2015-11-13 11:37:25

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On 13 November 2015 at 17:05, Arnd Bergmann <[email protected]> wrote:
> On Friday 13 November 2015 10:05:28 Baolin Wang wrote:
>>
>> Well, I did a simple test with dd reading, cause my engine limitation is 1M,
>> (1) so the time like below when handle 1M at one time.
>> 1048576 bytes (1.0 MB) copied, 0.0841235 s, 12.5 MB/s
>> 1048576 bytes (1.0 MB) copied, 0.0836294 s, 12.5 MB/s
>> 1048576 bytes (1.0 MB) copied, 0.0836526 s, 12.5 MB/s
>>
>> (2) These handle 64K at one time * 16 times
>> 1048576 bytes (1.0 MB) copied, 0.0937223 s, 11.2 MB/s
>> 1048576 bytes (1.0 MB) copied, 0.097205 s, 10.8 MB/s
>> 1048576 bytes (1.0 MB) copied, 0.0935884 s, 11.2 MB/s
>>
>> Here is a 10ms level difference, try to image if the hardware engine's
>> throughput is bigger than that. But like Jens said, we can measure it
>> by the performance data.
>
> The absolute numbers look really low. Does this include writing to
> a hard drive? That would certainly make the difference appear
> less significant.
>

OK, I'll supply the complete performance data so we can measure it.

> Could you try backing this with a ram disk backing for comparison,
> and also use 'time dd' to show the CPU utilization for all cases?
> For completeness, including cpu-only performance might also help
> put this into perspective.
>

OK, I'll try. Thanks.

> Arnd



--
Baolin.wang
Best Regards

2015-11-13 11:52:06

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

On Thu, Nov 12, 2015 at 08:26:26AM -0700, Jens Axboe wrote:
> On 11/12/2015 03:04 AM, Mark Brown wrote:

> >Android now wants to encrypt phones and tablets by default and have been
> >seeing substantial performance hits as a result, we can try to get
> >people to share performance data from productionish systems but it might
> >be difficult.

> Well, shame on them for developing out-of-tree, looks like they are reaping
> all the benefits of that.

> Guys, we need some numbers, enough with the hand waving. There's no point
> discussing this further until we know how much of a difference it makes to
> handle X MB chunks instead of Y MB chunks. As was previously stated, unless
> there's a _substantial_ performance benefit, this patchset isn't going
> anywhere.

Yeah, what I'm saying here is that there will be issues getting the numbers
from relevant production systems - we are most likely to be looking at
proxies which are hopefully reasonably representative but there's likely
to be more divergence than you'd see just running benchmark workloads on
similar systems to those used in production.



2015-12-02 12:46:57

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

Hi All,

These are the benchmarks for request based dm-crypt. Please check it.

I. Environment
1. Hardware configuration
Board: Beaglebone black
Processor: Am335x 1GHz ARM Cortex-A8
RAM:512M
SD card:8G
Kernel version:4.4-rc1

2. Encryption method
(1) Use cbc(aes) cipher to encrypt the block device with dmsetup tool
dmsetup create dm-0 --table "0 `blockdev --getsize /dev/mmcblk0p1`
crypt aes-cbc-plain:sha256 babebabebabebabebabebabebabebabe 0
/dev/mmcblk0p1 0"

(2) Enable the AES engine by config
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_OMAP_AES=y

(3) Limitation
We wanted to run the tests on a ramdisk rather than on a slow medium
(the SD card) first, but a ramdisk can't be mapped with the dmsetup
tool here: a ramdisk is not a request-stackable device, so it cannot
be used for request-based dm.

II. Result summary
1. Results table
---------------------------------------------------------------------
| Test                 | size | bio       | request    | % change |
---------------------------------------------------------------------
| dd sequential read   | 1G   | 5.6 MB/s  | 11.3 MB/s  | +101.8%  |
---------------------------------------------------------------------
| dd sequential write  | 1G   | 4.2 MB/s  | 6.8 MB/s   | +61.9%   |
---------------------------------------------------------------------
| fio sequential read  | 1G   | 5336 KB/s | 10928 KB/s | +104.8%  |
---------------------------------------------------------------------
| fio sequential write | 1G   | 4049 KB/s | 6574 KB/s  | +62.4%   |
---------------------------------------------------------------------

2. Summary
The results from the dd and fio tools basically agree, so they should
give a fair picture of how the request-based optimization affects IO
performance.

Read speed benefits most: it at least doubles when the request-based
optimization is enabled.

Write speed also improves clearly, by roughly 60% with the
request-based optimization. For random writes, however, there is
little difference, limited by the hardware's slow random access.

III. dd test procedure
dd copies data at a low level, operating directly on the raw devices.
It gives good basic coverage but isn't very realistic, and it only
exercises sequential IO. It does, however, let us read/write the raw
devices without filesystem cache effects. Results below:

1. Sequential read:
(1) Sequential read 1G with bio based:
time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 192.091 s, 5.6 MB/s
real 3m12.112s
user 0m0.070s
sys 0m3.820s

(2) Sequential read 1G with request based:
time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
real 1m34.908s
user 0m0.030s
sys 0m4.000s

(3) Sequential read 1G without encryption:
time dd if=/dev/mmcblk0p1 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 58.49 s, 18.4 MB/s
real 0m58.505s
user 0m0.040s
sys 0m3.050s

2. Sequential write:
(1) Sequential write 1G with bio based:
time dd if=/dev/zero of=/dev/dm-0 bs=64K count=16384 oflag=direct
1073741824 bytes (1.1 GB) copied, 253.477 s, 4.2 MB/s
real 4m13.497s
user 0m0.130s
sys 0m3.990s

(2) Sequential write 1G with request based:
time dd if=/dev/zero of=/dev/dm-0 bs=64K count=16384 oflag=direct
1073741824 bytes (1.1 GB) copied, 157.396 s, 6.8 MB/s
real 2m37.414s
user 0m0.130s
sys 0m4.190s

(3) Sequential write 1G without encryption:
time dd if=/dev/zero of=/dev/mmcblk0p1 bs=64K count=16384 oflag=direct
1073741824 bytes (1.1 GB) copied, 120.452 s, 8.9 MB/s
real 2m0.471s
user 0m0.050s
sys 0m3.820s

3. Summary:
The sequential read/write speeds with the bio-based code are 5.6 MB/s
and 4.2 MB/s; with the request-based code they rise to 11.3 MB/s and
6.8 MB/s. So request-based handling makes a big difference for
sequential IO: read and write speeds increase by 101.8% and 61.9%.
The 'sys' times also show a clear difference with the request-based
optimizations.

IV. fio test procedure
We set the block size to 64K and run a command like:
fio --filename=/dev/dm-0 --direct=1 --iodepth=1 --rw=read --bs=64K
--size=1G --group_reporting --numjobs=1 --name=test_read

1. Sequential read 1G with bio based:
READ: io=1024.0MB, aggrb=5336KB/s, minb=5336KB/s, maxb=5336KB/s,
mint=196494msec, maxt=196494msec

2. Sequential write 1G with bio based:
WRITE: io=1024.0MB, aggrb=4049KB/s, minb=4049KB/s, maxb=4049KB/s,
mint=258954msec, maxt=258954msec

3. Sequential read 1G with request based:
READ: io=1024.0MB, aggrb=10928KB/s, minb=10928KB/s, maxb=10928KB/s,
mint=95947msec, maxt=95947msec

4. Sequential write 1G with request based:
WRITE: io=1024.0MB, aggrb=6574KB/s, minb=6574KB/s, maxb=6574KB/s,
mint=159493msec, maxt=159493msec

5. Summary:
(1) read:
Sequential read speed improves a lot with the request-based code: it
is 104.8% higher when request-based handling is enabled for dm-crypt.
A read can't be truly random once we fix the block size, so random
read speeds aren't listed here, though they also show big
improvements.

(2) write:
Sequential write speed improves with the request-based code too, by
about 62.4%. Random writes are hard to measure meaningfully on an SD
card, because any random write smaller than the underlying block size
eventually causes long IO latencies, which hides the improvement.

V. IO block size test
We also varied the block size from 4K to 1M (most IO block sizes in
practice are much smaller than 1M) to see how the block size
influences request-based dm-crypt.

1. Sequential read 1G
(1) block size = 4k
time dd if=/dev/dm-0 of=/dev/null bs=4k count=262144 iflag=direct
1073741824 bytes (1.1 GB) copied, 310.598 s, 3.5 MB/s
real 5m10.614s
user 0m0.610s
sys 0m36.040s

(2) block size = 64k
1073741824 bytes (1.1 GB) copied, 95.0489 s, 11.3 MB/s
real 1m35.071s
user 0m0.040s
sys 0m4.030s

(3) block size = 256k
1073741824 bytes (1.1 GB) copied, 84.3311 s, 12.7 MB/s
real 1m24.347s
user 0m0.050s
sys 0m1.950s

(4) block size = 1M
1073741824 bytes (1.1 GB) copied, 80.8778 s, 13.3 MB/s
real 1m20.893s
user 0m0.010s
sys 0m1.390s

2. Sequential write 1G
(1) block size = 4k
time dd if=/dev/zero of=/dev/dm-0 bs=4K count=262144 oflag=direct
1073741824 bytes (1.1 GB) copied, 629.656 s, 1.7 MB/s
real 10m29.671s
user 0m0.790s
sys 0m33.550s

(2) block size = 64k
1073741824 bytes (1.1 GB) copied, 155.697 s, 6.9 MB/s
real 2m35.713s
user 0m0.040s
sys 0m4.110s

(3) block size = 256k
1073741824 bytes (1.1 GB) copied, 143.682 s, 7.5 MB/s
real 2m23.698s
user 0m0.040s
sys 0m2.500s

(4) block size = 1M
1073741824 bytes (1.1 GB) copied, 140.654 s, 7.6 MB/s
real 2m20.670s
user 0m0.040s
sys 0m2.090s

3. Summary
With the request-based code, sequential bios/requests can be merged
into one request, expanding the IO size so that the hardware engine
handles a big block at one time. With hardware acceleration this
improves the encryption/decryption speed, so the engine performs best
with big block sizes.

From the data we can also see that the read/write speed increases as
the block size grows, which means small sizes don't help much. Above
64K, however, the speed no longer gains a corresponding benefit; I
think the bottleneck then moves into the crypto layer, which keeps
bigger bios from seeing a similar benefit. This needs more
investigation.

On 13 November 2015 at 19:51, Mark Brown <[email protected]> wrote:
> On Thu, Nov 12, 2015 at 08:26:26AM -0700, Jens Axboe wrote:
>> On 11/12/2015 03:04 AM, Mark Brown wrote:
>
>> >Android now wants to encrypt phones and tablets by default and have been
>> >seeing substantial performance hits as a result, we can try to get
>> >people to share performance data from productionish systems but it might
>> >be difficult.
>
>> Well, shame on them for developing out-of-tree, looks like they are reaping
>> all the benefits of that.
>
>> Guys, we need some numbers, enough with the hand waving. There's no point
>> discussing this further until we know how much of a difference it makes to
>> handle X MB chunks instead of Y MB chunks. As was previously stated, unless
>> there's a _substantial_ performance benefit, this patchset isn't going
>> anywhere.
>
> Yeah, what I'm saying here is that there will issues getting the numbers
> from relevant production systems - we are most likely to be looking at
> proxies which are hopefully reasonably representative but there's likely
> to be more divergence than you'd see just running benchmark workloads on
> similar systems to those used in production.



--
Baolin.wang
Best Regards

2015-12-02 19:57:04

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
> These are the benchmarks for request based dm-crypt. Please check it.

Now please put request-based dm-crypt completely to one side and focus
just on the existing bio-based code. Why is it slower and what can be
adjusted to improve this?

People aren't going to take a request-based solution seriously until
you can explain in full detail *why* bio-based is slower AND why it's
impossible to improve its performance.

> For request based things, some sequential bios/requests can merged
> into one request to expand the IO size to be a big block handled by
> hardware engine at one time.

Bio-based also merges I/O so that does not provide justification.
Investigate in much more detail the actual merging and scheduling
involved in the cases you need to optimise. See if blktrace gives you
any clues, or add your own instrumentation. You could even look at some
of the patches we've had in the list archives for optimising bio-based
crypt in different situations.

Alasdair

2015-12-03 02:56:24

by Baolin Wang

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
> On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
>> These are the benchmarks for request based dm-crypt. Please check it.
>
> Now please put request-based dm-crypt completely to one side and focus
> just on the existing bio-based code. Why is it slower and what can be
> adjusted to improve this?
>

OK. I think I have found something worth pointing out.
1. From the IO block size test in the performance report, the
request-based code does not get a proportional gain just from
expanding the IO size. In dm-crypt, the data buffer of one request is
mapped into scatterlists, and all the scatterlists of the request are
sent to the encryption engine to encrypt or decrypt. I found that a
small number of scatterlists, each with a bigger length, improves the
encryption speed and lets the engine perform at its best. But a big
IO size does not necessarily mean bigger scatterlists (it may mean
many scatterlists with small lengths), which I think is why we do not
get the corresponding performance just by expanding the IO size.

2. Why is bio based slower?
Given point 1, the crypto engine clearly prefers bigger scatterlists.
But the bio-based code sends only one scatterlist (whose length is
always 1 << SECTOR_SHIFT = 512) to the crypto engine at a time. That
means for a 1M bio, the bio-based code issues 2048 separate calls
(each with a single 512-byte scatterlist) to the crypto engine, which
is time-consuming and inefficient for the engine. The request-based
code, by contrast, can map the whole request with many scatterlists
(not just one) and send all of them to the crypto engine at once,
which improves performance. Is that right?

Another possible optimization is to expand the scatterlist entry
count for the bio-based code.

> People aren't going to take a request-based solution seriously until
> you can explain in full detail *why* bio-based is slower AND why it's
> impossible to improve its performance.
>
>> For request based things, some sequential bios/requests can merged
>> into one request to expand the IO size to be a big block handled by
>> hardware engine at one time.
>
> Bio-based also merges I/O so that does not provide justification.
> Investigate in much more detail the actual merging and scheduling
> involved in the cases you need to optimise. See if blktrace gives you
> any clues, or add your own instrumentation. You could even look at some
> of the patches we've had in the list archives for optimising bio-based
> crypt in different situations.
>
> Alasdair
>



--
Baolin.wang
Best Regards

2015-12-03 10:36:51

by Baolin Wang

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On 3 December 2015 at 10:56, Baolin Wang <[email protected]> wrote:
> On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
>> On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
>>> These are the benchmarks for request based dm-crypt. Please check it.
>>
>> Now please put request-based dm-crypt completely to one side and focus
>> just on the existing bio-based code. Why is it slower and what can be
>> adjusted to improve this?
>>
>
> OK. I think I find something need to be point out.
> 1. From the IO block size test in the performance report, for the
> request based, we can find it can not get the corresponding
> performance if we just expand the IO size. Because In dm crypt, it
> will map the data buffer of one request with scatterlists, and send
> all scatterlists of one request to the encryption engine to encrypt or
> decrypt. I found if the scatterlist list number is small and each
> scatterlist length is bigger, it will improve the encryption speed,
> that helps the engine palys best performance. But a big IO size does
> not mean bigger scatterlists (maybe many scatterlists with small
> length), that's why we can not get the corresponding performance if we
> just expand the IO size I think.
>
> 2. Why bio based is slower?
> If you understand 1, you can obviously understand the crypto engine
> likes bigger scatterlists to improve the performance. But for bio
> based, it only send one scatterlist (the scatterlist's length is
> always '1 << SECTOR_SHIFT' = 512) to the crypto engine at one time. It
> means if the bio size is 1M, the bio based will send 2048 times (evey
> time the only one scatterlist length is 512 bytes) to crypto engine to
> handle, which is more time-consuming and ineffective for the crypto
> engine. But for request based, it can map the whole request with many
> scatterlists (not just one scatterlist), and send all the scatterlists
> to the crypto engine which can improve the performance, is it right?
>
> Another optimization solution I think is we can expand the scatterlist
> entry number for bio based.
>

I did some testing of my assumption about expanding the scatterlist
entry count for the bio-based code. After modifying the bio-based
code to support multiple scatterlists, it reaches the same
performance as the request-based code.

1. Sequential read 1G, bio based with expanded scatterlist entries:
time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 94.5458 s, 11.4 MB/s
real 1m34.562s
user 0m0.030s
sys 0m3.850s

2. Sequential read 1G with request based:
time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
real 1m34.908s
user 0m0.030s
sys 0m4.000s

From the data, the bio-based code can reach the same performance as
the request-based code. So if people still don't like the
request-based approach, I think we can optimize the bio-based code by
expanding the number of scatterlists. Thanks.

>>
>> Alasdair
>>
>
>
>
> --
> Baolin.wang
> Best Regards



--
Baolin.wang
Best Regards

2015-12-03 11:07:18

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On 3 December 2015 at 11:36, Baolin Wang wrote:
> On 3 December 2015 at 10:56, Baolin Wang <[email protected]> wrote:
>> On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
>>> On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
>>>> These are the benchmarks for request based dm-crypt. Please check it.
>>>
>>> Now please put request-based dm-crypt completely to one side and focus
>>> just on the existing bio-based code. Why is it slower and what can be
>>> adjusted to improve this?
>>>
>>
>> OK. I think I find something need to be point out.
>> 1. From the IO block size test in the performance report, for the
>> request based, we can find it can not get the corresponding
>> performance if we just expand the IO size. Because In dm crypt, it
>> will map the data buffer of one request with scatterlists, and send
>> all scatterlists of one request to the encryption engine to encrypt or
>> decrypt. I found if the scatterlist list number is small and each
>> scatterlist length is bigger, it will improve the encryption speed,
>> that helps the engine palys best performance. But a big IO size does
>> not mean bigger scatterlists (maybe many scatterlists with small
>> length), that's why we can not get the corresponding performance if we
>> just expand the IO size I think.
>>
>> 2. Why bio based is slower?
>> If you understand 1, you can obviously understand the crypto engine
>> likes bigger scatterlists to improve the performance. But for bio
>> based, it only send one scatterlist (the scatterlist's length is
>> always '1 << SECTOR_SHIFT' = 512) to the crypto engine at one time. It
>> means if the bio size is 1M, the bio based will send 2048 times (evey
>> time the only one scatterlist length is 512 bytes) to crypto engine to
>> handle, which is more time-consuming and ineffective for the crypto
>> engine. But for request based, it can map the whole request with many
>> scatterlists (not just one scatterlist), and send all the scatterlists
>> to the crypto engine which can improve the performance, is it right?
>>
>> Another optimization solution I think is we can expand the scatterlist
>> entry number for bio based.
>>
>
> I did some testing about my assumption of expanding the scatterlist
> entry number for bio based. I did some modification for the bio based
> to support multiple scatterlists, then it will get the same
> performance as the request based things.
>
> 1. bio based with expanding the scatterlist entry
> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
> 1073741824 bytes (1.1 GB) copied, 94.5458 s, 11.4 MB/s
> real 1m34.562s
> user 0m0.030s
> sys 0m3.850s
>
> 2. Sequential read 1G with requset based:
> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
> 1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
> real 1m34.908s
> user 0m0.030s
> sys 0m4.000s
>
> From the data, we can find the bio based also can get the same
> performance as the request based. So if someone still don't like the
> request based things, I think we can optimize the bio based by
> expanding the scatterlists number. Thanks.
>


Hi

Do you see any performance impact if you use with cryptsetup options:

--perf-same_cpu_crypt
--perf-submit_from_crypt_cpus

with your regular unpatched kernel.

Zdenek

2015-12-03 11:27:18

by Baolin Wang

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On 3 December 2015 at 19:07, Zdenek Kabelac <[email protected]> wrote:
> Dne 3.12.2015 v 11:36 Baolin Wang napsal(a):
>
>> On 3 December 2015 at 10:56, Baolin Wang <[email protected]> wrote:
>>>
>>> On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
>>>>
>>>> On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
>>>>>
>>>>> These are the benchmarks for request based dm-crypt. Please check it.
>>>>
>>>>
>>>> Now please put request-based dm-crypt completely to one side and focus
>>>> just on the existing bio-based code. Why is it slower and what can be
>>>> adjusted to improve this?
>>>>
>>>
>>> OK. I think I find something need to be point out.
>>> 1. From the IO block size test in the performance report, for the
>>> request based, we can find it can not get the corresponding
>>> performance if we just expand the IO size. Because In dm crypt, it
>>> will map the data buffer of one request with scatterlists, and send
>>> all scatterlists of one request to the encryption engine to encrypt or
>>> decrypt. I found if the scatterlist list number is small and each
>>> scatterlist length is bigger, it will improve the encryption speed,
>>> that helps the engine palys best performance. But a big IO size does
>>> not mean bigger scatterlists (maybe many scatterlists with small
>>> length), that's why we can not get the corresponding performance if we
>>> just expand the IO size I think.
>>>
>>> 2. Why bio based is slower?
>>> If you understand 1, you can obviously understand the crypto engine
>>> likes bigger scatterlists to improve the performance. But for bio
>>> based, it only send one scatterlist (the scatterlist's length is
>>> always '1 << SECTOR_SHIFT' = 512) to the crypto engine at one time. It
>>> means if the bio size is 1M, the bio based will send 2048 times (evey
>>> time the only one scatterlist length is 512 bytes) to crypto engine to
>>> handle, which is more time-consuming and ineffective for the crypto
>>> engine. But for request based, it can map the whole request with many
>>> scatterlists (not just one scatterlist), and send all the scatterlists
>>> to the crypto engine which can improve the performance, is it right?
>>>
>>> Another optimization solution I think is we can expand the scatterlist
>>> entry number for bio based.
>>>
>>
>> I did some testing about my assumption of expanding the scatterlist
>> entry number for bio based. I did some modification for the bio based
>> to support multiple scatterlists, then it will get the same
>> performance as the request based things.
>>
>> 1. bio based with expanding the scatterlist entry
>> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
>> 1073741824 bytes (1.1 GB) copied, 94.5458 s, 11.4 MB/s
>> real 1m34.562s
>> user 0m0.030s
>> sys 0m3.850s
>>
>> 2. Sequential read 1G with requset based:
>> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
>> 1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
>> real 1m34.908s
>> user 0m0.030s
>> sys 0m4.000s
>>
>> From the data, we can find the bio based also can get the same
>> performance as the request based. So if someone still don't like the
>> request based things, I think we can optimize the bio based by
>> expanding the scatterlists number. Thanks.
>>
>
>
> Hi
>
> Do you see any performance impact if you use with cryptsetup options:
>
> --perf-same_cpu_crypt
> --perf-submit_from_crypt_cpus
>
> with your regular unpatched kernel.

I did not see any performance impact with those options, because my
board has only one CPU.

>
> Zdenek
>



--
Baolin.wang
Best Regards

2015-12-03 15:47:23

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt



On Thu, 3 Dec 2015, Baolin Wang wrote:

> On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
> > On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
> >> These are the benchmarks for request based dm-crypt. Please check it.
> >
> > Now please put request-based dm-crypt completely to one side and focus
> > just on the existing bio-based code. Why is it slower and what can be
> > adjusted to improve this?
> >
>
> OK. I think I find something need to be point out.
> 1. From the IO block size test in the performance report, for the
> request based, we can find it can not get the corresponding
> performance if we just expand the IO size. Because In dm crypt, it
> will map the data buffer of one request with scatterlists, and send
> all scatterlists of one request to the encryption engine to encrypt or
> decrypt. I found if the scatterlist list number is small and each
> scatterlist length is bigger, it will improve the encryption speed,

This optimization is only applicable to XTS mode. XTS has its weaknesses
and it is not recommended for encryption of more than 1TB of data
( http://grouper.ieee.org/groups/1619/email/msg02357.html )

You can optimize bio-based dm-crypt as well (use larger encryption chunk
than 512 bytes when the mode is XTS).

The most commonly used mode, aes-cbc-essiv:sha256, can't be optimized
that way. You have to encrypt and decrypt sector by sector because
every sector has a different IV.

Mikulas


> that helps the engine palys best performance. But a big IO size does
> not mean bigger scatterlists (maybe many scatterlists with small
> length), that's why we can not get the corresponding performance if we
> just expand the IO size I think.
>
> 2. Why bio based is slower?
> If you understand 1, you can obviously understand the crypto engine
> likes bigger scatterlists to improve the performance. But for bio
> based, it only send one scatterlist (the scatterlist's length is
> always '1 << SECTOR_SHIFT' = 512) to the crypto engine at one time. It
> means if the bio size is 1M, the bio based will send 2048 times (evey
> time the only one scatterlist length is 512 bytes) to crypto engine to
> handle, which is more time-consuming and ineffective for the crypto
> engine. But for request based, it can map the whole request with many
> scatterlists (not just one scatterlist), and send all the scatterlists
> to the crypto engine which can improve the performance, is it right?
>
> Another optimization solution I think is we can expand the scatterlist
> entry number for bio based.

2015-12-03 15:49:33

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt



On Thu, 3 Dec 2015, Baolin Wang wrote:

> On 3 December 2015 at 10:56, Baolin Wang <[email protected]> wrote:
> > On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
> >> On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
> >>> These are the benchmarks for request based dm-crypt. Please check it.
> >>
> >> Now please put request-based dm-crypt completely to one side and focus
> >> just on the existing bio-based code. Why is it slower and what can be
> >> adjusted to improve this?
> >>
> >
> > OK. I think I find something need to be point out.
> > 1. From the IO block size test in the performance report, for the
> > request based, we can find it can not get the corresponding
> > performance if we just expand the IO size. Because In dm crypt, it
> > will map the data buffer of one request with scatterlists, and send
> > all scatterlists of one request to the encryption engine to encrypt or
> > decrypt. I found if the scatterlist list number is small and each
> > scatterlist length is bigger, it will improve the encryption speed,
> > that helps the engine palys best performance. But a big IO size does
> > not mean bigger scatterlists (maybe many scatterlists with small
> > length), that's why we can not get the corresponding performance if we
> > just expand the IO size I think.
> >
> > 2. Why bio based is slower?
> > If you understand 1, you can obviously understand the crypto engine
> > likes bigger scatterlists to improve the performance. But for bio
> > based, it only send one scatterlist (the scatterlist's length is
> > always '1 << SECTOR_SHIFT' = 512) to the crypto engine at one time. It
> > means if the bio size is 1M, the bio based will send 2048 times (evey
> > time the only one scatterlist length is 512 bytes) to crypto engine to
> > handle, which is more time-consuming and ineffective for the crypto
> > engine. But for request based, it can map the whole request with many
> > scatterlists (not just one scatterlist), and send all the scatterlists
> > to the crypto engine which can improve the performance, is it right?
> >
> > Another optimization solution I think is we can expand the scatterlist
> > entry number for bio based.
> >
>
> I did some testing of my assumption about expanding the number of
> scatterlist entries for the bio-based code. With a modification to
> the bio-based code to support multiple scatterlists, it achieves the
> same performance as the request-based code.
>
> 1. Sequential read 1G, bio based with expanded scatterlist entries:
> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
> 1073741824 bytes (1.1 GB) copied, 94.5458 s, 11.4 MB/s
> real 1m34.562s
> user 0m0.030s
> sys 0m3.850s
>
> 2. Sequential read 1G with request based:
> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
> 1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
> real 1m34.908s
> user 0m0.030s
> sys 0m4.000s

Measuring the system time this way is completely wrong because it doesn't
account for the time spent in kernel threads.

Mikulas

2015-12-04 04:57:22

by Baolin Wang

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On 3 December 2015 at 23:47, Mikulas Patocka <[email protected]> wrote:
>
>
> On Thu, 3 Dec 2015, Baolin Wang wrote:
>
>> On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
>> > On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
>> >> These are the benchmarks for request based dm-crypt. Please check it.
>> >
>> > Now please put request-based dm-crypt completely to one side and focus
>> > just on the existing bio-based code. Why is it slower and what can be
>> > adjusted to improve this?
>> >
>>
>> OK. I think I have found something that needs to be pointed out.
>> 1. From the IO block size test in the performance report, the
>> request-based approach cannot deliver a corresponding performance
>> gain if we just expand the IO size. This is because dm-crypt maps
>> the data buffer of one request with scatterlists and sends all the
>> scatterlists of that request to the encryption engine to encrypt or
>> decrypt. I found that when the number of scatterlists is small and
>> each scatterlist is longer, the encryption speed improves,
>
> This optimization is only applicable to XTS mode. XTS has its weaknesses
> and it is not recommended for encryption of more than 1TB of data
> ( http://grouper.ieee.org/groups/1619/email/msg02357.html )
>
> You can optimize bio-based dm-crypt as well (use larger encryption chunk
> than 512 bytes when the mode is XTS).
>
> The most commonly used mode aes-cbc-essiv:sha256 can't be optimized that
> way. You have to do encryption and decryption sector by sector because
> every sector has different IV.

Makes sense. We'll optimize bio-based dm-crypt for the XTS mode, and
investigate what can be done for the non-XTS modes.

>
> Mikulas
>



--
Baolin.wang
Best Regards

2015-12-04 04:58:57

by Baolin Wang

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH 0/2] Introduce the request handling for dm-crypt

On 3 December 2015 at 23:49, Mikulas Patocka <[email protected]> wrote:
>
>
> On Thu, 3 Dec 2015, Baolin Wang wrote:
>
>> On 3 December 2015 at 10:56, Baolin Wang <[email protected]> wrote:
>> > On 3 December 2015 at 03:56, Alasdair G Kergon <[email protected]> wrote:
>> >> On Wed, Dec 02, 2015 at 08:46:54PM +0800, Baolin Wang wrote:
>> >>> These are the benchmarks for request based dm-crypt. Please check it.
>> >>
>> >> Now please put request-based dm-crypt completely to one side and focus
>> >> just on the existing bio-based code. Why is it slower and what can be
>> >> adjusted to improve this?
>> >>
>> >
>> > OK. I think I have found something that needs to be pointed out.
>> > 1. From the IO block size test in the performance report, the
>> > request-based approach cannot deliver a corresponding performance
>> > gain if we just expand the IO size. This is because dm-crypt maps
>> > the data buffer of one request with scatterlists and sends all the
>> > scatterlists of that request to the encryption engine to encrypt or
>> > decrypt. I found that when the number of scatterlists is small and
>> > each scatterlist is longer, the encryption speed improves, which
>> > lets the engine deliver its best performance. But a big IO size
>> > does not mean bigger scatterlists (it may mean many scatterlists
>> > with small lengths), which is why I think we cannot get the
>> > corresponding performance just by expanding the IO size.
>> >
>> > 2. Why is bio-based slower?
>> > If you follow point 1, it is clear that the crypto engine prefers
>> > bigger scatterlists. But the bio-based code sends only one
>> > scatterlist (whose length is always '1 << SECTOR_SHIFT' = 512
>> > bytes) to the crypto engine at a time. That means that for a 1M
>> > bio, the bio-based code calls the crypto engine 2048 times (each
>> > time with a single 512-byte scatterlist), which is time-consuming
>> > and inefficient for the engine. The request-based code, by
>> > contrast, can map the whole request with many scatterlists (not
>> > just one) and send all of them to the crypto engine at once, which
>> > improves performance. Is that right?
>> >
>> > Another possible optimization is to expand the number of
>> > scatterlist entries used by the bio-based code.
>> >
>>
>> I did some testing of my assumption about expanding the number of
>> scatterlist entries for the bio-based code. With a modification to
>> the bio-based code to support multiple scatterlists, it achieves
>> the same performance as the request-based code.
>>
>> 1. Sequential read 1G, bio based with expanded scatterlist entries:
>> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
>> 1073741824 bytes (1.1 GB) copied, 94.5458 s, 11.4 MB/s
>> real 1m34.562s
>> user 0m0.030s
>> sys 0m3.850s
>>
>> 2. Sequential read 1G with request based:
>> time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
>> 1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
>> real 1m34.908s
>> user 0m0.030s
>> sys 0m4.000s
>
> Measuring the system time this way is completely wrong because it doesn't
> account for the time spent in kernel threads.
>

OK. Thanks for your suggestions.

> Mikulas



--
Baolin.wang
Best Regards