2014-01-10 02:40:11

by Matthew Wilcox

Subject: [PATCH 0/6] Page I/O

This patch set implements pageio as I described in my talk at
Linux.Conf.AU. It's for review more than application; I think
benchmarking is going to be required to see if it's a win. We've done
some benchmarking with an earlier version of the patch and a Chatham card,
and it's a win for us.

The fundamental point of these patches is that we *can* do I/O without
allocating a BIO (or request, or ...) and so we can end up doing fun
things like swapping out a page without allocating any memory.

Possibly it would be interesting to do sub-page I/Os (ie change the
rw_page prototype to take a 'start' and 'length' instead of requiring the
I/O to be the entire page), but the question then arises of what the
'done' callback should be.
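
In outline, the whole interface is one new method in block_device_operations
plus two thin wrappers around it (a simplified sketch of what patch 1 adds;
the diffs below are authoritative):

	/* driver hook: do the I/O for one page cache page, no bio required */
	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);

	/* callers; both return -EOPNOTSUPP if the driver has no rw_page */
	int bdev_read_page(struct block_device *bdev, sector_t sector,
			struct page *page);
	int bdev_write_page(struct block_device *bdev, sector_t sector,
			struct page *page, struct writeback_control *wbc);

	/* the driver completes the page itself, see patch 2 */
	void page_endio(struct page *page, int rw, int success);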

Keith Busch (1):
NVMe: Add support for rw_page

Matthew Wilcox (5):
Add bdev_read_page() and bdev_write_page()
Factor page_endio() out of mpage_end_io()
swap: Use bdev_read_page() / bdev_write_page()
brd: Add support for rw_page
virtio_blk: Add rw_page implementation

drivers/block/brd.c | 10 +++
drivers/block/nvme-core.c | 129 ++++++++++++++++++++++++++++++++++++---------
drivers/block/virtio_blk.c | 44 +++++++++++++++
fs/block_dev.c | 34 +++++++++++
fs/mpage.c | 83 +++++++++++++++-------------
include/linux/blkdev.h | 4 +
include/linux/pagemap.h | 2
mm/filemap.c | 25 ++++++++
mm/page_io.c | 23 +++++++-
9 files changed, 288 insertions(+), 66 deletions(-)


2014-01-10 02:40:20

by Matthew Wilcox

Subject: [PATCH 2/6] Factor page_endio() out of mpage_end_io()

page_endio() takes care of updating all the appropriate page flags once I/O
to a page has finished.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/mpage.c | 17 +----------------
include/linux/pagemap.h | 2 ++
mm/filemap.c | 25 +++++++++++++++++++++++++
3 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index b8552ae..9c21dc0 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -51,22 +51,7 @@ static void mpage_end_io(struct bio *bio, int err)

if (--bvec >= bio->bi_io_vec)
prefetchw(&bvec->bv_page->flags);
- if (bio_data_dir(bio) == READ) {
- if (uptodate) {
- SetPageUptodate(page);
- } else {
- ClearPageUptodate(page);
- SetPageError(page);
- }
- unlock_page(page);
- } else { /* bio_data_dir(bio) == WRITE */
- if (!uptodate) {
- SetPageError(page);
- if (page->mapping)
- set_bit(AS_EIO, &page->mapping->flags);
- }
- end_page_writeback(page);
- }
+ page_endio(page, bio_data_dir(bio), uptodate);
} while (bvec >= bio->bi_io_vec);
bio_put(bio);
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..2271f19 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -416,6 +416,8 @@ static inline void wait_on_page_writeback(struct page *page)
extern void end_page_writeback(struct page *page);
void wait_for_stable_page(struct page *page);

+void page_endio(struct page *page, int rw, int success);
+
/*
* Add an arbitrary waiter to a page's wait queue
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..763753d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -631,6 +631,31 @@ void end_page_writeback(struct page *page)
}
EXPORT_SYMBOL(end_page_writeback);

+/*
+ * After completing I/O on a page, call this routine to update the page
+ * flags appropriately
+ */
+void page_endio(struct page *page, int rw, int success)
+{
+ if (rw == READ) {
+ if (success) {
+ SetPageUptodate(page);
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ }
+ unlock_page(page);
+ } else { /* rw == WRITE */
+ if (!success) {
+ SetPageError(page);
+ if (page->mapping)
+ set_bit(AS_EIO, &page->mapping->flags);
+ }
+ end_page_writeback(page);
+ }
+}
+EXPORT_SYMBOL_GPL(page_endio);
+
/**
* __lock_page - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
--
1.8.5.2

2014-01-10 02:40:16

by Matthew Wilcox

Subject: [PATCH 4/6] brd: Add support for rw_page

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/block/brd.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index d91f1a5..320d041 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -360,6 +360,15 @@ out:
bio_endio(bio, err);
}

+static int brd_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ struct brd_device *brd = bdev->bd_disk->private_data;
+ int err = brd_do_bvec(brd, page, PAGE_CACHE_SIZE, 0, rw, sector);
+ page_endio(page, rw & WRITE, err);
+ return err;
+}
+
#ifdef CONFIG_BLK_DEV_XIP
static int brd_direct_access(struct block_device *bdev, sector_t sector,
void **kaddr, unsigned long *pfn)
@@ -419,6 +428,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,

static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
+ .rw_page = brd_rw_page,
.ioctl = brd_ioctl,
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
--
1.8.5.2

2014-01-10 02:41:32

by Matthew Wilcox

Subject: [PATCH 5/6] virtio_blk: Add rw_page implementation

This is an implementation of the new rw_page interface. It primarily
demonstrates a low-impact, low-reward method for adding support to a
driver. I don't expect it to improve performance; in fact I expect it
to worsen performance slightly since there are still memory allocations
in the I/O path.

Please don't apply this patch.

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/block/virtio_blk.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 6a680d4..4c1c9ad 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -110,11 +110,22 @@ static int __virtblk_add_req(struct virtqueue *vq,
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}

+#define rw_to_dmad(rw) (((rw) == WRITE) ? DMA_TO_DEVICE : DMA_FROM_DEVICE)
+
static inline void virtblk_request_done(struct virtblk_req *vbr)
{
struct request *req = vbr->req;
int error = virtblk_result(vbr);

+ if ((unsigned long)req & 1) {
+ struct virtio_blk *vblk = (void *)((unsigned long)req & ~1UL);
+ int rw = (vbr->out_hdr.type & VIRTIO_BLK_T_OUT) ? WRITE : READ;
+ struct page *page = sg_page(vbr->sg);
+ dma_unmap_sg(&vblk->vdev->dev, vbr->sg, 1, rw_to_dmad(rw));
+ page_endio(page, rw, error);
+ return;
+ }
+
if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
req->resid_len = vbr->in_hdr.residual;
req->sense_len = vbr->in_hdr.sense_len;
@@ -239,6 +250,38 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
return err;
}

+static int virtblk_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ unsigned long flags;
+ struct virtio_blk *vblk = bdev->bd_disk->private_data;
+ struct virtblk_req *vbr;
+ int err;
+
+ vbr = kmalloc(sizeof(*vbr) + sizeof(struct scatterlist), GFP_NOFS);
+ if (!vbr)
+ return -ENOMEM;
+
+ vbr->req = (void *)((unsigned long)vblk | 1);
+ vbr->out_hdr.type = (rw == WRITE) ? VIRTIO_BLK_T_OUT : VIRTIO_BLK_T_IN;
+ vbr->out_hdr.sector = sector;
+ vbr->out_hdr.ioprio = 0;
+
+ sg_init_table(vbr->sg, 1);
+ sg_set_page(vbr->sg, page, PAGE_CACHE_SIZE, 0);
+ sg_mark_end(vbr->sg);
+ if (dma_map_sg(&vblk->vdev->dev, vbr->sg, 1, rw_to_dmad(rw))) {
+ kfree(vbr);
+ return -ENOMEM;
+ }
+
+ spin_lock_irqsave(&vblk->vq_lock, flags);
+ err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, 1);
+ virtqueue_kick(vblk->vq);
+ spin_unlock_irqrestore(&vblk->vq_lock, flags);
+ return err;
+}
+
static int virtblk_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long data)
{
@@ -278,6 +321,7 @@ static int virtblk_getgeo(struct block_device *bd, struct hd_geometry *geo)
}

static const struct block_device_operations virtblk_fops = {
+ .rw_page = virtblk_rw_page,
.ioctl = virtblk_ioctl,
.owner = THIS_MODULE,
.getgeo = virtblk_getgeo,
--
1.8.5.2

2014-01-10 02:41:29

by Matthew Wilcox

Subject: [PATCH 6/6] NVMe: Add support for rw_page

From: Keith Busch <[email protected]>

This demonstrates the full potential of rw_page in a real device driver.
By adding a dma_addr_t to the preallocated per-command data structure, we
can avoid doing any memory allocation in the rw_page path. For example,
that lets us swap without allocating any memory.

Also, this is against the version of the driver in the development tree,
not upstream. So it won't apply.

Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/block/nvme-core.c | 129 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 105 insertions(+), 24 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b59a93a..3af7f73 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -117,12 +117,13 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
}

-typedef void (*nvme_completion_fn)(struct nvme_dev *, void *,
+typedef void (*nvme_completion_fn)(struct nvme_dev *, void *, dma_addr_t,
struct nvme_completion *);

struct nvme_cmd_info {
nvme_completion_fn fn;
void *ctx;
+ dma_addr_t dma;
unsigned long timeout;
int aborted;
};
@@ -152,7 +153,7 @@ static unsigned nvme_queue_extra(int depth)
* May be called with local interrupts disabled and the q_lock held,
* or with interrupts enabled and no locks held.
*/
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
+static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx, dma_addr_t dma,
nvme_completion_fn handler, unsigned timeout)
{
int depth = nvmeq->q_depth - 1;
@@ -167,17 +168,18 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,

info[cmdid].fn = handler;
info[cmdid].ctx = ctx;
+ info[cmdid].dma = dma;
info[cmdid].timeout = jiffies + timeout;
info[cmdid].aborted = 0;
return cmdid;
}

-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
+static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx, dma_addr_t dma,
nvme_completion_fn handler, unsigned timeout)
{
int cmdid;
wait_event_killable(nvmeq->sq_full,
- (cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
+ (cmdid = alloc_cmdid(nvmeq, ctx, dma, handler, timeout)) >= 0);
return (cmdid < 0) ? -EINTR : cmdid;
}

@@ -189,7 +191,7 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
#define CMD_CTX_FLUSH (0x318 + CMD_CTX_BASE)
#define CMD_CTX_ABORT (0x31C + CMD_CTX_BASE)

-static void special_completion(struct nvme_dev *dev, void *ctx,
+static void special_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
if (ctx == CMD_CTX_CANCELLED)
@@ -216,7 +218,7 @@ static void special_completion(struct nvme_dev *dev, void *ctx,
dev_warn(&dev->pci_dev->dev, "Unknown special completion %p\n", ctx);
}

-static void async_completion(struct nvme_dev *dev, void *ctx,
+static void async_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
struct async_cmd_info *cmdinfo = ctx;
@@ -228,7 +230,7 @@ static void async_completion(struct nvme_dev *dev, void *ctx,
/*
* Called with local interrupts disabled and the q_lock held. May not sleep.
*/
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid, dma_addr_t *dmap,
nvme_completion_fn *fn)
{
void *ctx;
@@ -240,6 +242,8 @@ static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
}
if (fn)
*fn = info[cmdid].fn;
+ if (dmap)
+ *dmap = info[cmdid].dma;
ctx = info[cmdid].ctx;
info[cmdid].fn = special_completion;
info[cmdid].ctx = CMD_CTX_COMPLETED;
@@ -248,13 +252,15 @@ static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
return ctx;
}

-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid, dma_addr_t *dmap,
nvme_completion_fn *fn)
{
void *ctx;
struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
if (fn)
*fn = info[cmdid].fn;
+ if (dmap)
+ *dmap = info[cmdid].dma;
ctx = info[cmdid].ctx;
info[cmdid].fn = special_completion;
info[cmdid].ctx = CMD_CTX_CANCELLED;
@@ -370,7 +376,7 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
part_stat_unlock();
}

-static void bio_completion(struct nvme_dev *dev, void *ctx,
+static void bio_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
struct nvme_iod *iod = ctx;
@@ -674,7 +680,7 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,

int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns)
{
- int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH,
+ int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH, 0,
special_completion, NVME_IO_TIMEOUT);
if (unlikely(cmdid < 0))
return cmdid;
@@ -709,7 +715,7 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
iod->private = bio;

result = -EBUSY;
- cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
+ cmdid = alloc_cmdid(nvmeq, iod, 0, bio_completion, NVME_IO_TIMEOUT);
if (unlikely(cmdid < 0))
goto free_iod;

@@ -765,7 +771,7 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
return 0;

free_cmdid:
- free_cmdid(nvmeq, cmdid, NULL);
+ free_cmdid(nvmeq, cmdid, NULL, NULL);
free_iod:
nvme_free_iod(nvmeq->dev, iod);
nomem:
@@ -781,6 +787,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)

for (;;) {
void *ctx;
+ dma_addr_t dma;
nvme_completion_fn fn;
struct nvme_completion cqe = nvmeq->cqes[head];
if ((le16_to_cpu(cqe.status) & 1) != phase)
@@ -791,8 +798,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
phase = !phase;
}

- ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
- fn(nvmeq->dev, ctx, &cqe);
+ ctx = free_cmdid(nvmeq, cqe.command_id, &dma, &fn);
+ fn(nvmeq->dev, ctx, dma, &cqe);
}

/* If the controller ignores the cq head doorbell and continuously
@@ -862,7 +869,7 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
{
spin_lock_irq(&nvmeq->q_lock);
- cancel_cmdid(nvmeq, cmdid, NULL);
+ cancel_cmdid(nvmeq, cmdid, NULL, NULL);
spin_unlock_irq(&nvmeq->q_lock);
}

@@ -872,7 +879,7 @@ struct sync_cmd_info {
int status;
};

-static void sync_completion(struct nvme_dev *dev, void *ctx,
+static void sync_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
struct sync_cmd_info *cmdinfo = ctx;
@@ -894,7 +901,7 @@ int nvme_submit_sync_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
cmdinfo.task = current;
cmdinfo.status = -EINTR;

- cmdid = alloc_cmdid_killable(nvmeq, &cmdinfo, sync_completion,
+ cmdid = alloc_cmdid_killable(nvmeq, &cmdinfo, 0, sync_completion,
timeout);
if (cmdid < 0)
return cmdid;
@@ -919,9 +926,8 @@ int nvme_submit_async_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
struct async_cmd_info *cmdinfo,
unsigned timeout)
{
- int cmdid;
-
- cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
+ int cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, 0, async_completion,
+ timeout);
if (cmdid < 0)
return cmdid;
cmdinfo->status = -EINTR;
@@ -1081,8 +1087,8 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
if (!dev->abort_limit)
return;

- a_cmdid = alloc_cmdid(dev->queues[0], CMD_CTX_ABORT, special_completion,
- ADMIN_TIMEOUT);
+ a_cmdid = alloc_cmdid(dev->queues[0], CMD_CTX_ABORT, 0,
+ special_completion, ADMIN_TIMEOUT);
if (a_cmdid < 0)
return;

@@ -1115,6 +1121,7 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)

for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
void *ctx;
+ dma_addr_t dma;
nvme_completion_fn fn;
static struct nvme_completion cqe = {
.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
@@ -1130,8 +1137,8 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
}
dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
nvmeq->qid);
- ctx = cancel_cmdid(nvmeq, cmdid, &fn);
- fn(nvmeq->dev, ctx, &cqe);
+ ctx = cancel_cmdid(nvmeq, cmdid, &dma, &fn);
+ fn(nvmeq->dev, ctx, dma, &cqe);
}
}

@@ -1617,6 +1624,79 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
return status;
}

+static void pgrd_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
+ struct nvme_completion *cqe)
+{
+ struct page *page = ctx;
+ u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+ dma_unmap_page(&dev->pci_dev->dev, dma,
+ PAGE_CACHE_SIZE, DMA_FROM_DEVICE);
+ page_endio(page, READ, status == NVME_SC_SUCCESS);
+}
+
+static void pgwr_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
+ struct nvme_completion *cqe)
+{
+ struct page *page = ctx;
+ u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+ dma_unmap_page(&dev->pci_dev->dev, dma, PAGE_CACHE_SIZE, DMA_TO_DEVICE);
+ page_endio(page, WRITE, status == NVME_SC_SUCCESS);
+}
+
+static const enum dma_data_direction nvme_to_direction[] = {
+ DMA_NONE, DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL
+};
+
+static int nvme_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ struct nvme_ns *ns = bdev->bd_disk->private_data;
+ u8 op = (rw & WRITE) ? nvme_cmd_write : nvme_cmd_read;
+ nvme_completion_fn fn = (rw & WRITE) ? pgwr_completion :
+ pgrd_completion;
+ dma_addr_t dma;
+ int cmdid;
+ struct nvme_command *cmd;
+ enum dma_data_direction dma_dir = nvme_to_direction[op & 3];
+ struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
+ dma = dma_map_page(nvmeq->q_dmadev, page, 0, PAGE_CACHE_SIZE, dma_dir);
+
+ if (rw == WRITE)
+ cmdid = alloc_cmdid(nvmeq, page, dma, fn, NVME_IO_TIMEOUT);
+ else
+ cmdid = alloc_cmdid_killable(nvmeq, page, dma, fn,
+ NVME_IO_TIMEOUT);
+ if (unlikely(cmdid < 0)) {
+ dma_unmap_page(nvmeq->q_dmadev, dma, PAGE_CACHE_SIZE,
+ DMA_FROM_DEVICE);
+ put_nvmeq(nvmeq);
+ return -EBUSY;
+ }
+
+ spin_lock_irq(&nvmeq->q_lock);
+ cmd = &nvmeq->sq_cmds[nvmeq->sq_tail];
+ memset(cmd, 0, sizeof(*cmd));
+
+ cmd->rw.opcode = op;
+ cmd->rw.command_id = cmdid;
+ cmd->rw.nsid = cpu_to_le32(ns->ns_id);
+ cmd->rw.slba = cpu_to_le64(nvme_block_nr(ns, sector));
+ cmd->rw.length = cpu_to_le16((PAGE_CACHE_SIZE >> ns->lba_shift) - 1);
+ cmd->rw.prp1 = cpu_to_le64(dma);
+
+ if (++nvmeq->sq_tail == nvmeq->q_depth)
+ nvmeq->sq_tail = 0;
+ writel(nvmeq->sq_tail, nvmeq->q_db);
+
+ put_nvmeq(nvmeq);
+ nvme_process_cq(nvmeq);
+ spin_unlock_irq(&nvmeq->q_lock);
+
+ return 0;
+}
+
static int nvme_user_admin_cmd(struct nvme_dev *dev,
struct nvme_admin_cmd __user *ucmd)
{
@@ -1714,6 +1794,7 @@ static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,

static const struct block_device_operations nvme_fops = {
.owner = THIS_MODULE,
+ .rw_page = nvme_rw_page,
.ioctl = nvme_ioctl,
.compat_ioctl = nvme_compat_ioctl,
};
--
1.8.5.2

2014-01-10 02:42:03

by Matthew Wilcox

Subject: [PATCH 3/6] swap: Use bdev_read_page() / bdev_write_page()

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/page_io.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 8c79a47..559697e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -248,11 +248,16 @@ out:
return ret;
}

+static sector_t swap_page_sector(struct page *page)
+{
+ return (sector_t)__page_file_index(page) << (PAGE_CACHE_SHIFT - 9);
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
void (*end_write_func)(struct bio *, int))
{
struct bio *bio;
- int ret = 0, rw = WRITE;
+ int ret, rw = WRITE;
struct swap_info_struct *sis = page_swap_info(page);

if (sis->flags & SWP_FILE) {
@@ -297,6 +302,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return ret;
}

+ ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
+ if (!ret) {
+ count_vm_event(PSWPOUT);
+ return 0;
+ }
+
+ ret = 0;
bio = get_swap_bio(GFP_NOIO, page, end_write_func);
if (bio == NULL) {
set_page_dirty(page);
@@ -317,7 +329,7 @@ out:
int swap_readpage(struct page *page)
{
struct bio *bio;
- int ret = 0;
+ int ret;
struct swap_info_struct *sis = page_swap_info(page);

VM_BUG_ON(!PageLocked(page));
@@ -338,6 +350,13 @@ int swap_readpage(struct page *page)
return ret;
}

+ ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
+ if (!ret) {
+ count_vm_event(PSWPIN);
+ return 0;
+ }
+
+ ret = 0;
bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
if (bio == NULL) {
unlock_page(page);
--
1.8.5.2

2014-01-10 02:42:25

by Matthew Wilcox

Subject: [PATCH 1/6] Add bdev_read_page() and bdev_write_page()

A block device driver may choose to provide a rw_page operation.
It will be called when the filesystem is attempting to do page-sized
I/O to page cache pages (ie not for direct I/O). This does preclude
I/Os that are larger than page size, so this may only be a performance
gain for some devices.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/block_dev.c | 34 ++++++++++++++++++++++++++
fs/mpage.c | 66 ++++++++++++++++++++++++++++++++------------------
include/linux/blkdev.h | 4 +++
3 files changed, 80 insertions(+), 24 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1e86823..660557c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -363,6 +363,40 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
}
EXPORT_SYMBOL(blkdev_fsync);

+int bdev_read_page(struct block_device *bdev, sector_t sector,
+ struct page *page)
+{
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ if (!ops->rw_page)
+ return -EOPNOTSUPP;
+ return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+}
+EXPORT_SYMBOL_GPL(bdev_read_page);
+
+int bdev_write_page(struct block_device *bdev, sector_t sector,
+ struct page *page, struct writeback_control *wbc)
+{
+ int result;
+ int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ if (!ops->rw_page)
+ return -EOPNOTSUPP;
+ result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
+ if (result) {
+ if (PageSwapCache(page)) {
+ set_page_dirty(page);
+ ClearPageReclaim(page);
+ end_page_writeback(page);
+ } else {
+ redirty_page_for_writepage(wbc, page);
+ result = 0;
+ }
+ }
+ unlock_page(page);
+ return result;
+}
+EXPORT_SYMBOL_GPL(bdev_write_page);
+
/*
* pseudo-fs
*/
diff --git a/fs/mpage.c b/fs/mpage.c
index 0face1c..b8552ae 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -286,6 +286,11 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,

alloc_new:
if (bio == NULL) {
+ if (first_hole == blocks_per_page) {
+ if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
+ page))
+ goto out;
+ }
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
min_t(int, nr_pages, bio_get_nr_vecs(bdev)),
GFP_KERNEL);
@@ -440,6 +445,35 @@ struct mpage_data {
unsigned use_writepage;
};

+/*
+ * We have our BIO, so we can now mark the buffers clean. Make
+ * sure to only clean buffers which we know we'll be writing.
+ */
+static void clean_buffers(struct page *page, unsigned first_unmapped)
+{
+ unsigned buffer_counter = 0;
+ struct buffer_head *bh, *head;
+ if (!page_has_buffers(page))
+ return;
+ head = page_buffers(page);
+ bh = head;
+
+ do {
+ if (buffer_counter++ == first_unmapped)
+ break;
+ clear_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ /*
+ * we cannot drop the bh if the page is not uptodate or a concurrent
+ * readpage would fail to serialize with the bh and it would read from
+ * disk before we reach the platter.
+ */
+ if (buffer_heads_over_limit && PageUptodate(page))
+ try_to_free_buffers(page);
+}
+
static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
void *data)
{
@@ -575,6 +609,13 @@ page_is_mapped:

alloc_new:
if (bio == NULL) {
+ if (first_unmapped == blocks_per_page) {
+ if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
+ page, wbc)) {
+ clean_buffers(page, first_unmapped);
+ goto out;
+ }
+ }
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
@@ -592,30 +633,7 @@ alloc_new:
goto alloc_new;
}

- /*
- * OK, we have our BIO, so we can now mark the buffers clean. Make
- * sure to only clean buffers which we know we'll be writing.
- */
- if (page_has_buffers(page)) {
- struct buffer_head *head = page_buffers(page);
- struct buffer_head *bh = head;
- unsigned buffer_counter = 0;
-
- do {
- if (buffer_counter++ == first_unmapped)
- break;
- clear_buffer_dirty(bh);
- bh = bh->b_this_page;
- } while (bh != head);
-
- /*
- * we cannot drop the bh if the page is not uptodate
- * or a concurrent readpage would fail to serialize with the bh
- * and it would read from disk before we reach the platter.
- */
- if (buffer_heads_over_limit && PageUptodate(page))
- try_to_free_buffers(page);
- }
+ clean_buffers(page, first_unmapped);

BUG_ON(PageWriteback(page));
set_page_writeback(page);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b135d4..50dc979 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1564,6 +1564,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
struct block_device_operations {
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
+ int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t,
@@ -1582,6 +1583,9 @@ struct block_device_operations {

extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
unsigned long);
+extern int bdev_read_page(struct block_device *, sector_t, struct page *);
+extern int bdev_write_page(struct block_device *, sector_t, struct page *,
+ struct writeback_control *);
#else /* CONFIG_BLOCK */
/*
* stubs for when the block layer is configured out
--
1.8.5.2

2014-01-10 15:24:18

by Jeff Moyer

Subject: Re: [PATCH 0/6] Page I/O

Matthew Wilcox <[email protected]> writes:

> This patch set implements pageio as I described in my talk at
> Linux.Conf.AU. It's for review more than application, I think
> benchmarking is going to be required to see if it's a win. We've done
> some benchmarking with an earlier version of the patch and a Chatham card,
> and it's a win for us.
>
> The fundamental point of these patches is that we *can* do I/O without
> allocating a BIO (or request, or ...) and so we can end up doing fun
> things like swapping out a page without allocating any memory.
>
> Possibly it would be interesting to do sub-page I/Os (ie change the
> rw_page prototype to take a 'start' and 'length' instead of requiring the
> I/O to be the entire page), but the problem then arises about what the
> 'done' callback should be.

For those of us who were not fortunate enough to attend your talk, would you
mind providing some background, like why you went down this path in the
first place, and maybe what benchmarks you ran where you found it "a
win"?

Another code path making an end-run around the block layer is
interesting, but may keep cgroup I/O throttling from working properly,
for example.

Cheers,
Jeff

2014-01-10 16:17:06

by Matthew Wilcox

Subject: Re: [PATCH 0/6] Page I/O

On Fri, Jan 10, 2014 at 10:24:07AM -0500, Jeff Moyer wrote:
> Matthew Wilcox <[email protected]> writes:
> > This patch set implements pageio as I described in my talk at
> > Linux.Conf.AU. It's for review more than application, I think
> > benchmarking is going to be required to see if it's a win. We've done
> > some benchmarking with an earlier version of the patch and a Chatham card,
> > and it's a win for us.
> >
> > The fundamental point of these patches is that we *can* do I/O without
> > allocating a BIO (or request, or ...) and so we can end up doing fun
> > things like swapping out a page without allocating any memory.
> >
> > Possibly it would be interesting to do sub-page I/Os (ie change the
> > rw_page prototype to take a 'start' and 'length' instead of requiring the
> > I/O to be the entire page), but the problem then arises about what the
> > 'done' callback should be.
>
> For those of us who were not fortunate enough to attend your talk, would
> mind providing some background, like why you went down this path in the
> first place, and maybe what benchmarks you ran where you found it "a
> win?"

Swapping is the real reason. Everything else is cherries. One of my
colleagues was comparing swapping performance between different interfaces
for fast storage (eg NV-DIMMs). I must admit to not actually knowing
what his results *were*. We did a bunch of custom hacks to get his
performance numbers up, and on the plane here I finally wrestled those
custom hacks into an interface that worked for any block device.

> Another code path making an end-run around the block layer is
> interesting, but may keep cgroup I/O throttling from working properly,
> for example.

Definitely something that should be taken into consideration, although
I would *starts handwaving massively* think that we could fit cgroup
throttling into bdev_{read,write}_page, by returning an error which causes
the caller to fall back to the bio path, which ends up doing ... whatever
cgroup throttling would have done if the pageio path didn't exist.
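
The swap paths in patch 3 already fall back that way when bdev_write_page()
returns -EOPNOTSUPP; a sketch of the existing code in __swap_writepage():

	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
	if (!ret) {
		count_vm_event(PSWPOUT);
		return 0;
	}
	/* any non-zero return falls through to the usual get_swap_bio() path */

so a throttling decision taken inside bdev_{read,write}_page() could surface
as just another non-zero return, and the existing bio path would then behave
exactly as it does today.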

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2014-01-15 00:04:23

by Dave Chinner

Subject: Re: [PATCH 0/6] Page I/O

On Fri, Jan 10, 2014 at 10:24:07AM -0500, Jeff Moyer wrote:
> Matthew Wilcox <[email protected]> writes:
>
> > This patch set implements pageio as I described in my talk at
> > Linux.Conf.AU. It's for review more than application, I think
> > benchmarking is going to be required to see if it's a win. We've done
> > some benchmarking with an earlier version of the patch and a Chatham card,
> > and it's a win for us.
> >
> > The fundamental point of these patches is that we *can* do I/O without
> > allocating a BIO (or request, or ...) and so we can end up doing fun
> > things like swapping out a page without allocating any memory.
> >
> > Possibly it would be interesting to do sub-page I/Os (ie change the
> > rw_page prototype to take a 'start' and 'length' instead of requiring the
> > I/O to be the entire page), but the problem then arises about what the
> > 'done' callback should be.
>
> For those of us who were not fortunate enough to attend your talk, would
> mind providing some background, like why you went down this path in the
> first place, and maybe what benchmarks you ran where you found it "a
> win?"

No need to attend - the LCA A/V team live streamed it over the
intertubes and had the recording up on the mirror within 24 hours:

http://mirror.linux.org.au/pub/linux.conf.au/2014/Thursday/239-Further_adventures_in_non-volatile_memory_-_Matthew_Wilcox.mp4

> Another code path making an end-run around the block layer is
> interesting, but may keep cgroup I/O throttling from working properly,
> for example.

Well, this is really aimed at CPU-cache-coherent, DRAM-speed devices,
so I think the per-IO overhead of throttling would be prohibitive
for such devices. IMO, those devices will spend more CPU time in the
IO path than doing the IO (which is likely to be the CPU doing a
memcpy!), so IO rates will be more effectively controlled by
restricting CPU time rather than adding extra overhead into the
block layer to account for each individual IO....

Cheers,

Dave.
--
Dave Chinner
[email protected]