From: Ming Lei
To: Jens Axboe, linux-kernel@vger.kernel.org, Dave Kleikamp
Cc: Zach Brown, Christoph Hellwig, Maxim Patlasov, Andrew Morton, Alexander Viro, Tejun Heo, Dave Chinner, Ming Lei
Subject: [PATCH v8 6/6] block: loop: support DIO & AIO
Date: Thu, 30 Jul 2015 07:36:24 -0400
Message-Id: <1438256184-23645-7-git-send-email-ming.lei@canonical.com>
In-Reply-To: <1438256184-23645-1-git-send-email-ming.lei@canonical.com>
References: <1438256184-23645-1-git-send-email-ming.lei@canonical.com>

Using direct I/O and AIO to read and write the loop device's backing
file has about three advantages:

1) double caching is avoided, so memory usage decreases a lot

2) unlike user space direct I/O, there is no cost of pinning pages

3) context switches are avoided while still obtaining good throughput
	- with buffered file reads, top random I/O throughput is often
	obtained only if the requests are submitted concurrently from
	lots of tasks; sequential I/O, however, mostly hits the page
	cache, so concurrent submission often introduces unnecessary
	context switches and can't improve throughput much. There was
	a discussion[1] about using non-blocking I/O to mitigate this
	problem for applications.
	- with direct I/O and AIO, concurrent submission can be avoided
	without hurting random read throughput

My fio test results follow:

1. 16 jobs fio test inside ext4 file system over loop block
1) How to run
	- linux kernel: 4.1.0-rc2-next-20150506 with the patchset
	- the loop block device is backed by one image file on an HDD
	- fio: psync I/O engine, 16 jobs, size 400M, ext4 over loop block
	- test result: IOPS from fio output

2) Throughput result (IOPS):
	-------------------------------------------------------------
	test cases      |randread   |read   |randwrite  |write  |
	-------------------------------------------------------------
	base            |240        |8705   |3763       |20914
	-------------------------------------------------------------
	base+loop aio   |242        |9258   |4577       |21451
	-------------------------------------------------------------

3) context switch
	- context switches decreased by ~16% with loop aio for randread,
	and by ~33% for read

4) memory usage
	- after these four tests with loop aio: ~10% of memory is in use
	- after these four tests without loop aio: more than 55% of
	memory is in use

2. single job fio test inside ext4 file system over loop block (for Maxim Patlasov)
1) How to run
	- linux kernel: 4.1.0-rc2-next-20150506 with the patchset
	- the loop block device is backed by one image file on an HDD
	- fio: psync I/O engine, 1 job, size 4000M, ext4 over loop block
	- test result: IOPS from fio output

2) Throughput result (IOPS):
	-------------------------------------------------------------
	test cases      |randread   |read   |randwrite  |write  |
	-------------------------------------------------------------
	base            |109        |21180  |4192       |22782
	-------------------------------------------------------------
	base+loop aio   |114        |21018  |5404       |22670
	-------------------------------------------------------------

3) context switch
	- context switches decreased by ~10% with loop aio for randread,
	and by ~50% for read

4) memory usage
	- after these four tests with loop aio: ~10% of memory is in use
	- after these four tests without loop aio: more than 55% of
	memory is in use

Both the 'context switch' and 'memory usage' data were collected with sar.

[1] https://lwn.net/Articles/612483/
[2] sar graph when running fio over loop without the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-nonaio.pdf

[3] sar graph when running fio over loop with the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-aio.pdf

[4] sar graph when running fio over loop without the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-nonaio-1job.pdf

[5] sar graph when running fio over loop with the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-aio-1job.pdf

Signed-off-by: Ming Lei
---
 drivers/block/loop.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/block/loop.h |   3 ++
 2 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 133e4c7..36f930a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -198,6 +198,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	blk_mq_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
 	lo->dio_align = dio_align;
+	lo->last_use_dio = use_dio;
 	if (use_dio)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
 	else
@@ -434,6 +435,102 @@ static int lo_req_flush(struct loop_device *lo, struct request *rq)
 	return ret;
 }
 
+static inline void handle_partial_read(struct loop_cmd *cmd, long bytes)
+{
+	if (bytes < 0 || (cmd->rq->cmd_flags & REQ_WRITE))
+		return;
+
+	if (unlikely(bytes < blk_rq_bytes(cmd->rq))) {
+		struct bio *bio = cmd->rq->bio;
+
+		bio_advance(bio, bytes);
+		zero_fill_bio(bio);
+	}
+}
+
+static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
+	struct request *rq = cmd->rq;
+
+	handle_partial_read(cmd, ret);
+
+	if (ret > 0)
+		ret = 0;
+	else if (ret < 0)
+		ret = -EIO;
+
+	rq->errors = ret;
+	blk_mq_complete_request(rq);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
+		     loff_t pos, bool rw)
+{
+	struct iov_iter iter;
+	struct bio_vec *bvec;
+	struct bio *bio = cmd->rq->bio;
+	struct file *file = lo->lo_backing_file;
+	int ret;
+
+	/* nomerge for loop request queue */
+	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
+
+	bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
+		      bio_segments(bio), blk_rq_bytes(cmd->rq));
+
+	cmd->iocb.ki_pos = pos;
+	cmd->iocb.ki_filp = file;
+	cmd->iocb.ki_complete = lo_rw_aio_complete;
+	cmd->iocb.ki_flags = IOCB_DIRECT;
+
+	if (rw == WRITE)
+		ret = file->f_op->write_iter(&cmd->iocb, &iter);
+	else
+		ret = file->f_op->read_iter(&cmd->iocb, &iter);
+
+	if (ret != -EIOCBQUEUED)
+		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
+	return 0;
+}
+
+
+static inline int lo_rw_simple(struct loop_device *lo,
+		struct request *rq, loff_t pos, bool rw)
+{
+	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+
+	/*
+	 * When working at direct I/O, under very unusual cases,
+	 * such as unaligned direct I/O from application and
+	 * access to loop block device with 'unaligned' offset & size,
+	 * we have to fallback to non-dio mode.
+	 *
+	 * During the switch between dio and non-dio, page cache
+	 * has to be flushed to the backing file.
+	 */
+	if (unlikely(lo->use_dio && lo->last_use_dio != cmd->use_aio))
+		vfs_fsync(lo->lo_backing_file, 0);
+
+	if (cmd->use_aio)
+		return lo_rw_aio(lo, cmd, pos, rw);
+
+	/*
+	 * lo_write_simple and lo_read_simple should have been covered
+	 * by io submit style function like lo_rw_aio(), one blocker
+	 * is that lo_read_simple() need to call flush_dcache_page after
+	 * the page is written from kernel, and it isn't easy to handle
+	 * this in io submit style function which submits all segments
+	 * of the req at one time. And direct read IO doesn't need to
+	 * run flush_dcache_page().
+	 */
+	if (rw == WRITE)
+		return lo_write_simple(lo, rq, pos);
+	else
+		return lo_read_simple(lo, rq, pos);
+}
+
 static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
 	loff_t pos;
@@ -449,13 +546,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 		else if (lo->transfer)
 			ret = lo_write_transfer(lo, rq, pos);
 		else
-			ret = lo_write_simple(lo, rq, pos);
+			ret = lo_rw_simple(lo, rq, pos, WRITE);
 
 	} else {
 		if (lo->transfer)
 			ret = lo_read_transfer(lo, rq, pos);
 		else
-			ret = lo_read_simple(lo, rq, pos);
+			ret = lo_rw_simple(lo, rq, pos, READ);
 	}
 
 	return ret;
@@ -1543,6 +1640,13 @@ int loop_unregister_transfer(int number)
 EXPORT_SYMBOL(loop_register_transfer);
 EXPORT_SYMBOL(loop_unregister_transfer);
 
+static inline bool req_dio_aligned(struct loop_device *lo,
+		const struct request *rq)
+{
+	return !((blk_rq_pos(rq) << 9) & lo->dio_align) &&
+		!(blk_rq_bytes(rq) & lo->dio_align);
+}
+
 static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
@@ -1554,6 +1658,13 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (lo->lo_state != Lo_bound)
 		return -EIO;
 
+	if (lo->use_dio && !lo->transfer &&
+			req_dio_aligned(lo, bd->rq) &&
+			!(cmd->rq->cmd_flags & (REQ_FLUSH | REQ_DISCARD)))
+		cmd->use_aio = true;
+	else
+		cmd->use_aio = false;
+
 	queue_kthread_work(&lo->worker, &cmd->work);
 
 	return BLK_MQ_RQ_QUEUE_OK;
@@ -1569,11 +1680,14 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 		goto failed;
 
 	ret = do_req_filebacked(lo, cmd->rq);
+	lo->last_use_dio = cmd->use_aio;
 
  failed:
 	if (ret)
 		cmd->rq->errors = -EIO;
-	blk_mq_complete_request(cmd->rq);
+	/* complete non-aio request */
+	if (!cmd->use_aio || ret)
+		blk_mq_complete_request(cmd->rq);
 }
 
 static void loop_queue_work(struct kthread_work *work)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 63f8e14..f0ba26d 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -60,6 +60,7 @@ struct loop_device {
 	struct task_struct	*worker_task;
 	unsigned		dio_align;
 	bool			use_dio;
+	bool			last_use_dio;
 
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
@@ -70,6 +71,8 @@ struct loop_cmd {
 	struct kthread_work work;
 	struct request *rq;
 	struct list_head list;
+	bool use_aio;	/* use AIO interface to handle I/O */
+	struct kiocb iocb;
 };
 
 /* Support for loadable transfer modules */
-- 
1.9.1
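
For readers who want to try the dispatch rule outside the kernel, here is a minimal user-space sketch of the decision the patch makes in loop_queue_rq() and lo_rw_simple(). It is not part of the patch. The struct and function names (sketch_req, sketch_lo, sketch_dio_aligned, sketch_use_aio, sketch_needs_fsync) are hypothetical stand-ins, and it assumes dio_align holds a mask (block size minus one), which is how the patch uses it with '&'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the request fields the patch inspects. */
struct sketch_req {
	uint64_t sector;	/* start sector in 512-byte units, like blk_rq_pos() */
	uint32_t bytes;		/* request length in bytes, like blk_rq_bytes() */
	bool flush_or_discard;	/* REQ_FLUSH or REQ_DISCARD is set */
};

/* Hypothetical stand-in for the loop_device state used by the decision. */
struct sketch_lo {
	bool use_dio;		/* direct I/O is enabled on the backing file */
	bool has_transfer;	/* a transfer function is registered */
	uint32_t dio_align;	/* ASSUMPTION: a mask, i.e. block size minus one */
};

/* Mirrors req_dio_aligned(): byte offset and length must both be aligned. */
static bool sketch_dio_aligned(const struct sketch_lo *lo,
			       const struct sketch_req *rq)
{
	/* The mask form only works for power-of-two block sizes. */
	assert((lo->dio_align & (lo->dio_align + 1)) == 0);

	return !((rq->sector << 9) & lo->dio_align) &&
	       !(rq->bytes & lo->dio_align);
}

/* Mirrors the use_aio decision made in loop_queue_rq(). */
static bool sketch_use_aio(const struct sketch_lo *lo,
			   const struct sketch_req *rq)
{
	return lo->use_dio && !lo->has_transfer &&
	       sketch_dio_aligned(lo, rq) && !rq->flush_or_discard;
}

/* Mirrors the lo_rw_simple() check: flush the page cache on a mode switch. */
static bool sketch_needs_fsync(const struct sketch_lo *lo,
			       bool last_use_dio, bool use_aio)
{
	return lo->use_dio && last_use_dio != use_aio;
}
```

With a 4096-byte block size (dio_align = 4095), a 4096-byte request starting at sector 8 is aligned and would take the AIO path, while the same request starting at sector 1 is not and would fall back to the buffered path, which in turn triggers the fsync on the switch.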