2014-02-26 23:40:07

by Kent Overstreet

Subject: Make generic_make_request() handle arbitrary size bios

Jens, the following patches are for 3.15. This builds off of immutable biovecs,
and this series in turn enables the DIO rewrite I've been working on off and on
for ages and ages; it will also let us delete merge_bvec_fn piecemeal
(altogether, merge_bvec_fn is 1500 lines of code we'll be able to delete).

This will also get us performance gains (that I've measured myself, benchmarking
on a p320h) once we push the work of splitting too-big bios all the way down to
the drivers, and I expect bigger gains for stacking block drivers due to just
processing larger bios.

I've got a patch I'll send separately for the mtip32xx driver to make it take
arbitrary size bios; it's pretty small.

Also a few related small refactoring patches.

This is on top of your for-3.15/core branch. If you want to pull these patches
from my tree, here it is:

The following changes since commit c46fff2a3b29794b35d717b5680a27f31a6a6bc0:

smp: Rename __smp_call_function_single() to smp_call_function_single_async() (2014-02-24 14:47:15 -0800)

are available in the git repository at:

git://evilpiepirate.org/~kent/linux-bcache.git for-jens

for you to fetch changes up to 089f8de5c42a121f351ef9d240d66e1128fa0ea2:

iov_iter: Kill written arg to iov_iter_init() (2014-02-26 15:17:36 -0800)

----------------------------------------------------------------
Kent Overstreet (9):
block: Make generic_make_request handle arbitrary sized bios
block: Gut bio_add_page()
blk-lib.c: generic_make_request() handles large bios now
bcache: generic_make_request() handles large bios now
btrfs: generic_make_request() handles arbitrary size bios now
btrfs: Convert to bio_for_each_segment()
iov_iter: Move iov_iter to uio.h
iov_iter: Kill iov_iter_single_seg_count()
iov_iter: Kill written arg to iov_iter_init()

block/blk-core.c | 19 ++-
block/blk-lib.c | 175 +++++-----------------------
block/blk-merge.c | 150 ++++++++++++++++++++++--
block/blk-mq.c | 2 +
drivers/block/drbd/drbd_req.c | 2 +
drivers/block/mtip32xx/mtip32xx.c | 6 +-
drivers/block/nvme-core.c | 2 +
drivers/block/pktcdvd.c | 6 +-
drivers/block/ps3vram.c | 2 +
drivers/block/rsxx/dev.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 2 +
drivers/md/bcache/bcache.h | 18 ---
drivers/md/bcache/io.c | 100 +---------------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +--
drivers/md/bcache/super.c | 32 +----
drivers/md/bcache/util.h | 5 +-
drivers/md/bcache/writeback.c | 4 +-
drivers/md/dm.c | 2 +
drivers/md/md.c | 2 +
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
fs/bio.c | 137 +++++++++-------------
fs/btrfs/extent_io.c | 12 +-
fs/btrfs/file-item.c | 59 ++++------
fs/btrfs/file.c | 8 +-
fs/btrfs/inode.c | 22 ++--
fs/btrfs/volumes.c | 73 ------------
fs/ceph/file.c | 7 +-
fs/cifs/file.c | 4 +-
fs/fuse/file.c | 17 +--
include/linux/blkdev.h | 3 +
include/linux/fs.h | 32 -----
include/linux/uio.h | 46 ++++++++
mm/filemap.c | 19 +--
37 files changed, 399 insertions(+), 599 deletions(-)


2014-02-26 23:40:29

by Kent Overstreet

Subject: [PATCH 5/9] btrfs: generic_make_request() handles arbitrary size bios now

So there's no need for btrfs to break up bios for device limits anymore.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: [email protected]
---
fs/btrfs/volumes.c | 73 ------------------------------------------------------
1 file changed, 73 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bab0b84d8f..3ce1a8aed6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5395,34 +5395,6 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
&device->work);
}

-static int bio_size_ok(struct block_device *bdev, struct bio *bio,
- sector_t sector)
-{
- struct bio_vec *prev;
- struct request_queue *q = bdev_get_queue(bdev);
- unsigned int max_sectors = queue_max_sectors(q);
- struct bvec_merge_data bvm = {
- .bi_bdev = bdev,
- .bi_sector = sector,
- .bi_rw = bio->bi_rw,
- };
-
- if (WARN_ON(bio->bi_vcnt == 0))
- return 1;
-
- prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
- if (bio_sectors(bio) > max_sectors)
- return 0;
-
- if (!q->merge_bvec_fn)
- return 1;
-
- bvm.bi_size = bio->bi_iter.bi_size - prev->bv_len;
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len)
- return 0;
- return 1;
-}
-
static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
struct bio *bio, u64 physical, int dev_nr,
int rw, int async)
@@ -5453,38 +5425,6 @@ static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
btrfsic_submit_bio(rw, bio);
}

-static int breakup_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
- struct bio *first_bio, struct btrfs_device *dev,
- int dev_nr, int rw, int async)
-{
- struct bio_vec *bvec = first_bio->bi_io_vec;
- struct bio *bio;
- int nr_vecs = bio_get_nr_vecs(dev->bdev);
- u64 physical = bbio->stripes[dev_nr].physical;
-
-again:
- bio = btrfs_bio_alloc(dev->bdev, physical >> 9, nr_vecs, GFP_NOFS);
- if (!bio)
- return -ENOMEM;
-
- while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
- if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset) < bvec->bv_len) {
- u64 len = bio->bi_iter.bi_size;
-
- atomic_inc(&bbio->stripes_pending);
- submit_stripe_bio(root, bbio, bio, physical, dev_nr,
- rw, async);
- physical += len;
- goto again;
- }
- bvec++;
- }
-
- submit_stripe_bio(root, bbio, bio, physical, dev_nr, rw, async);
- return 0;
-}
-
static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
{
atomic_inc(&bbio->error);
@@ -5553,19 +5493,6 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
continue;
}

- /*
- * Check and see if we're ok with this bio based on it's size
- * and offset with the given device.
- */
- if (!bio_size_ok(dev->bdev, first_bio,
- bbio->stripes[dev_nr].physical >> 9)) {
- ret = breakup_stripe_bio(root, bbio, first_bio, dev,
- dev_nr, rw, async_submit);
- BUG_ON(ret);
- dev_nr++;
- continue;
- }
-
if (dev_nr < total_devs - 1) {
bio = btrfs_bio_clone(first_bio, GFP_NOFS);
BUG_ON(!bio); /* -ENOMEM */
--
1.9.0

2014-02-26 23:40:52

by Kent Overstreet

Subject: [PATCH 9/9] iov_iter: Kill written arg to iov_iter_init()

Most callers pass written == 0, so this gets rid of a usually needless call to
iov_iter_advance() inside iov_iter_init(); the few callers that resume after a
partial write now advance the iterator explicitly.
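
As a sketch of the conversion (the hunks below are authoritative), a caller
that resumes after a partial write goes from:

	iov_iter_init(&i, iov, nr_segs, count, written);

to:

	/* count is now the total length, including the already-written part */
	iov_iter_init(&i, iov, nr_segs, count);
	iov_iter_advance(&i, written);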

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: [email protected]
Cc: Steve French <[email protected]>
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Cc: [email protected]
Cc: Sage Weil <[email protected]>
Cc: [email protected]
---
fs/btrfs/file.c | 8 +++++---
fs/ceph/file.c | 7 ++++---
fs/cifs/file.c | 4 ++--
fs/fuse/file.c | 11 ++++++-----
include/linux/uio.h | 9 +++------
mm/filemap.c | 4 +++-
6 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f..18b2e127e8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1652,8 +1652,9 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb,
return written;

pos += written;
- count -= written;
- iov_iter_init(&i, iov, nr_segs, count, written);
+ iov_iter_init(&i, iov, nr_segs, count);
+ iov_iter_advance(&i, written);
+
written_buffered = __btrfs_buffered_write(file, &i, pos);
if (written_buffered < 0) {
err = written_buffered;
@@ -1768,7 +1769,8 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
} else {
struct iov_iter i;

- iov_iter_init(&i, iov, nr_segs, count, num_written);
+ iov_iter_init(&i, iov, nr_segs, count + num_written);
+ iov_iter_advance(&i, num_written);

num_written = __btrfs_buffered_write(file, &i, pos);
if (num_written > 0)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index dfd2ce3419..163ae0fe20 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -580,7 +580,7 @@ ceph_sync_direct_write(struct kiocb *iocb, const struct iovec *iov,
CEPH_OSD_FLAG_ONDISK |
CEPH_OSD_FLAG_WRITE;

- iov_iter_init(&i, iov, nr_segs, count, 0);
+ iov_iter_init(&i, iov, nr_segs, count);

while (iov_iter_count(&i) > 0) {
void __user *data = i.iov->iov_base + i.iov_offset;
@@ -701,7 +701,7 @@ static ssize_t ceph_sync_write(struct kiocb *iocb, const struct iovec *iov,
CEPH_OSD_FLAG_WRITE |
CEPH_OSD_FLAG_ACK;

- iov_iter_init(&i, iov, nr_segs, count, 0);
+ iov_iter_init(&i, iov, nr_segs, count);

while ((len = iov_iter_count(&i)) > 0) {
size_t left;
@@ -834,7 +834,8 @@ again:
goto out;
}

- iov_iter_init(&i, iov, nr_segs, len, read);
+ iov_iter_init(&i, iov, nr_segs, len);
+ iov_iter_advance(&i, read);

/* hmm, this isn't really async... */
ret = ceph_sync_read(iocb, &i, &checkeof);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f..8a03eeb95e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2424,7 +2424,7 @@ cifs_iovec_write(struct file *file, const struct iovec *iov,
else
pid = current->tgid;

- iov_iter_init(&it, iov, nr_segs, len, 0);
+ iov_iter_init(&it, iov, nr_segs, len);
do {
size_t save_len;

@@ -2729,7 +2729,7 @@ cifs_readdata_to_iov(struct cifs_readdata *rdata, const struct iovec *iov,
unsigned int i;

/* set up iov_iter and advance to the correct offset */
- iov_iter_init(&ii, iov, nr_segs, iov_length(iov, nr_segs), 0);
+ iov_iter_init(&ii, iov, nr_segs, iov_length(iov, nr_segs));
iov_iter_advance(&ii, pos);

*copied = 0;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b4c9a14e70..67542da685 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1152,9 +1152,10 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
goto out;

pos += written;
- count -= written;

- iov_iter_init(&i, iov, nr_segs, count, written);
+ iov_iter_init(&i, iov, nr_segs, count);
+ iov_iter_advance(&i, written);
+
written_buffered = fuse_perform_write(file, mapping, &i, pos);
if (written_buffered < 0) {
err = written_buffered;
@@ -1174,7 +1175,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
written += written_buffered;
iocb->ki_pos = pos + written_buffered;
} else {
- iov_iter_init(&i, iov, nr_segs, count, 0);
+ iov_iter_init(&i, iov, nr_segs, count);
written = fuse_perform_write(file, mapping, &i, pos);
if (written >= 0)
iocb->ki_pos = pos + written;
@@ -1300,7 +1301,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
struct fuse_req *req;
struct iov_iter ii;

- iov_iter_init(&ii, iov, nr_segs, count, 0);
+ iov_iter_init(&ii, iov, nr_segs, count);

if (io->async)
req = fuse_get_req_for_background(fc, fuse_iter_npages(&ii));
@@ -2188,7 +2189,7 @@ static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
if (!bytes)
return 0;

- iov_iter_init(&ii, iov, nr_segs, bytes, 0);
+ iov_iter_init(&ii, iov, nr_segs, bytes);

while (iov_iter_count(&ii)) {
struct page *page = pages[page_idx++];
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2b99e0d2a1..d653feeeb1 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -67,16 +67,13 @@ size_t iov_iter_copy_from_user(struct page *page,
void iov_iter_advance(struct iov_iter *i, size_t bytes);
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);

-static inline void iov_iter_init(struct iov_iter *i,
- const struct iovec *iov, unsigned long nr_segs,
- size_t count, size_t written)
+static inline void iov_iter_init(struct iov_iter *i, const struct iovec *iov,
+ unsigned long nr_segs, size_t count)
{
i->iov = iov;
i->nr_segs = nr_segs;
i->iov_offset = 0;
- i->count = count + written;
-
- iov_iter_advance(i, written);
+ i->count = count;
}

static inline size_t iov_iter_count(struct iov_iter *i)
diff --git a/mm/filemap.c b/mm/filemap.c
index 8e509b3225..e756a0cf21 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2385,7 +2385,9 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
ssize_t status;
struct iov_iter i;

- iov_iter_init(&i, iov, nr_segs, count, written);
+ iov_iter_init(&i, iov, nr_segs, count + written);
+ iov_iter_advance(&i, written);
+
status = generic_perform_write(file, &i, pos);

if (likely(status >= 0)) {
--
1.9.0

2014-02-26 23:40:26

by Kent Overstreet

Subject: [PATCH 6/9] btrfs: Convert to bio_for_each_segment()

This is going to be important for future (hopeful) block layer refactoring, and
using the standard primitives makes the code easier to audit.
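
A sketch of the difference (process() is just a stand-in): the old open-coded
pattern walks the bvec array directly,

	struct bio_vec *bvec = bio->bi_io_vec;
	int i;

	for (i = 0; i < bio->bi_vcnt; i++, bvec++)
		process(bvec->bv_page, bvec->bv_len, bvec->bv_offset);

while the standard primitive iterates from bi_iter over bio_vec copies, so it
also does the right thing for bios that have been cloned or split:

	struct bio_vec bvec;
	struct bvec_iter iter;

	bio_for_each_segment(bvec, bio, iter)
		process(bvec.bv_page, bvec.bv_len, bvec.bv_offset);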

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: [email protected]
---
fs/btrfs/extent_io.c | 12 ++++++++---
fs/btrfs/file-item.c | 59 ++++++++++++++++++++--------------------------------
fs/btrfs/inode.c | 22 +++++++-------------
3 files changed, 39 insertions(+), 54 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 85bbd01f12..0a84847123 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2632,12 +2632,18 @@ static int __must_check submit_one_bio(int rw, struct bio *bio,
int mirror_num, unsigned long bio_flags)
{
int ret = 0;
- struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
- struct page *page = bvec->bv_page;
+ struct bio_vec bvec = { 0 };
+ struct bvec_iter iter;
+ struct page *page;
struct extent_io_tree *tree = bio->bi_private;
u64 start;

- start = page_offset(page) + bvec->bv_offset;
+ bio_for_each_segment(bvec, bio, iter)
+ if (bio_iter_last(bvec, iter))
+ break;
+
+ page = bvec.bv_page;
+ start = page_offset(page) + bvec.bv_offset;

bio->bi_private = NULL;

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 127555b29f..c41642ea69 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -162,7 +162,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
struct inode *inode, struct bio *bio,
u64 logical_offset, u32 *dst, int dio)
{
- struct bio_vec *bvec = bio->bi_io_vec;
+ struct bvec_iter iter = bio->bi_iter;
struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
struct btrfs_csum_item *item = NULL;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
@@ -171,10 +171,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 offset = 0;
u64 item_start_offset = 0;
u64 item_last_offset = 0;
- u64 disk_bytenr;
u32 diff;
int nblocks;
- int bio_index = 0;
int count;
u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);

@@ -204,8 +202,6 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (bio->bi_iter.bi_size > PAGE_CACHE_SIZE * 8)
path->reada = 2;

- WARN_ON(bio->bi_vcnt <= 0);
-
/*
* the free space stuff is only read when it hasn't been
* updated in the current transaction. So, we can safely
@@ -217,12 +213,13 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
path->skip_locking = 1;
}

- disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
- while (bio_index < bio->bi_vcnt) {
+ while (iter.bi_size) {
+ u64 disk_bytenr = (u64)iter.bi_sector << 9;
+ struct bio_vec bvec = bio_iter_iovec(bio, iter);
if (!dio)
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec.bv_page) + bvec.bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
(u32 *)csum, nblocks);
if (count)
@@ -243,7 +240,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
- offset + bvec->bv_len - 1,
+ offset + bvec.bv_len - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {
btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,12 +278,9 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
found:
csum += count * csum_size;
nblocks -= count;
- while (count--) {
- disk_bytenr += bvec->bv_len;
- offset += bvec->bv_len;
- bio_index++;
- bvec++;
- }
+ bio_advance_iter(bio, &iter,
+ count << inode->i_sb->s_blocksize_bits);
+ offset += count << inode->i_sb->s_blocksize_bits;
}
btrfs_free_path(path);
return 0;
@@ -439,14 +433,12 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct btrfs_ordered_sum *sums;
struct btrfs_ordered_extent *ordered;
char *data;
- struct bio_vec *bvec = bio->bi_io_vec;
- int bio_index = 0;
+ struct bio_vec bvec;
+ struct bvec_iter iter;
int index;
- unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;

- WARN_ON(bio->bi_vcnt <= 0);
sums = kzalloc(btrfs_ordered_sum_size(root, bio->bi_iter.bi_size),
GFP_NOFS);
if (!sums)
@@ -458,53 +450,46 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (contig)
offset = file_start;
else
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bio_page(bio)) + bio_offset(bio);

ordered = btrfs_lookup_ordered_extent(inode, offset);
BUG_ON(!ordered); /* Logic error */
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;

- while (bio_index < bio->bi_vcnt) {
+ bio_for_each_segment(bvec, bio, iter) {
if (!contig)
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec.bv_page) + bvec.bv_offset;

if (offset >= ordered->file_offset + ordered->len ||
offset < ordered->file_offset) {
- unsigned long bytes_left;
sums->len = this_sum_bytes;
this_sum_bytes = 0;
btrfs_add_ordered_sum(inode, ordered, sums);
btrfs_put_ordered_extent(ordered);

- bytes_left = bio->bi_iter.bi_size - total_bytes;
-
- sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
+ sums = kzalloc(btrfs_ordered_sum_size(root, iter.bi_size),
GFP_NOFS);
BUG_ON(!sums); /* -ENOMEM */
- sums->len = bytes_left;
+ sums->len = iter.bi_size;
ordered = btrfs_lookup_ordered_extent(inode, offset);
BUG_ON(!ordered); /* Logic error */
- sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
- total_bytes;
+ sums->bytenr = ((u64)iter.bi_sector) << 9;
index = 0;
}

- data = kmap_atomic(bvec->bv_page);
+ data = kmap_atomic(bvec.bv_page);
sums->sums[index] = ~(u32)0;
- sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
+ sums->sums[index] = btrfs_csum_data(data + bvec.bv_offset,
sums->sums[index],
- bvec->bv_len);
+ bvec.bv_len);
kunmap_atomic(data);
btrfs_csum_final(sums->sums[index],
(char *)(sums->sums + index));

- bio_index++;
index++;
- total_bytes += bvec->bv_len;
- this_sum_bytes += bvec->bv_len;
- offset += bvec->bv_len;
- bvec++;
+ offset += bvec.bv_len;
+ this_sum_bytes += bvec.bv_len;
}
this_sum_bytes = 0;
btrfs_add_ordered_sum(inode, ordered, sums);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d3d4448629..2475908a7b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7161,12 +7161,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct bio *bio;
struct bio *orig_bio = dip->orig_bio;
- struct bio_vec *bvec = orig_bio->bi_io_vec;
+ struct bio_vec bvec;
+ struct bvec_iter iter;
u64 start_sector = orig_bio->bi_iter.bi_sector;
u64 file_offset = dip->logical_offset;
- u64 submit_len = 0;
u64 map_length;
- int nr_pages = 0;
int ret = 0;
int async_submit = 0;

@@ -7197,10 +7196,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
bio->bi_end_io = btrfs_end_dio_bio;
atomic_inc(&dip->pending_bios);

- while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
- if (unlikely(map_length < submit_len + bvec->bv_len ||
- bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset) < bvec->bv_len)) {
+ bio_for_each_segment(bvec, orig_bio, iter) {
+ if (unlikely(map_length < bio->bi_iter.bi_size + bvec.bv_len ||
+ bio_add_page(bio, bvec.bv_page, bvec.bv_len,
+ bvec.bv_offset) < bvec.bv_len)) {
+ unsigned submit_len = bio->bi_iter.bi_size;
+
/*
* inc the count before we submit the bio so
* we know the end IO handler won't happen before
@@ -7220,9 +7221,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
start_sector += submit_len >> 9;
file_offset += submit_len;

- submit_len = 0;
- nr_pages = 0;
-
bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
start_sector, GFP_NOFS);
if (!bio)
@@ -7238,10 +7236,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
bio_put(bio);
goto out_err;
}
- } else {
- submit_len += bvec->bv_len;
- nr_pages++;
- bvec++;
}
}

--
1.9.0

2014-02-26 23:41:34

by Kent Overstreet

Subject: [PATCH 8/9] iov_iter: Kill iov_iter_single_seg_count()

The new iov_iter_iovec() is a more general replacement.
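
Everywhere we wanted the size of just the current segment, the conversion is
mechanical (sketch of the fuse hunks below):

	/* before */
	bytes = min(bytes, iov_iter_single_seg_count(ii));

	/*
	 * after - iov_iter_iovec() returns the current iovec, with its
	 * length clamped to the bytes remaining in the iterator
	 */
	bytes = min(bytes, iov_iter_iovec(ii).iov_len);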

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
---
fs/fuse/file.c | 6 +++---
include/linux/uio.h | 1 -
mm/filemap.c | 15 +--------------
3 files changed, 4 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3..b4c9a14e70 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1013,7 +1013,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
if (!tmp) {
unlock_page(page);
page_cache_release(page);
- bytes = min(bytes, iov_iter_single_seg_count(ii));
+ bytes = min(bytes, iov_iter_iovec(ii).iov_len);
goto again;
}

@@ -1204,7 +1204,7 @@ static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii)
static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
size_t max_size)
{
- return min(iov_iter_single_seg_count(ii), max_size);
+ return min(iov_iter_iovec(ii).iov_len, max_size);
}

static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii,
@@ -1278,7 +1278,7 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
while (iov_iter_count(&ii) && npages < FUSE_MAX_PAGES_PER_REQ) {
unsigned long user_addr = fuse_get_user_addr(&ii);
unsigned offset = user_addr & ~PAGE_MASK;
- size_t frag_size = iov_iter_single_seg_count(&ii);
+ size_t frag_size = iov_iter_iovec(&ii).iov_len;

npages += (frag_size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
iov_iter_advance(&ii, frag_size);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 347d70ce09..2b99e0d2a1 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -66,7 +66,6 @@ size_t iov_iter_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
-size_t iov_iter_single_seg_count(const struct iov_iter *i);

static inline void iov_iter_init(struct iov_iter *i,
const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac54..8e509b3225 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2073,19 +2073,6 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
EXPORT_SYMBOL(iov_iter_fault_in_readable);

/*
- * Return the count of just the current iov_iter segment.
- */
-size_t iov_iter_single_seg_count(const struct iov_iter *i)
-{
- const struct iovec *iov = i->iov;
- if (i->nr_segs == 1)
- return i->count;
- else
- return min(i->count, iov->iov_len - i->iov_offset);
-}
-EXPORT_SYMBOL(iov_iter_single_seg_count);
-
-/*
* Performs necessary checks before doing a write
*
* Can adjust writing position or amount of bytes to write.
@@ -2373,7 +2360,7 @@ again:
* once without a pagefault.
*/
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
- iov_iter_single_seg_count(i));
+ iov_iter_iovec(i).iov_len);
goto again;
}
pos += copied;
--
1.9.0

2014-02-26 23:44:59

by Kent Overstreet

Subject: [PATCH 7/9] iov_iter: Move iov_iter to uio.h

Going to be consolidating all the iov_iter code in one place, and fs.h is way
too big. This also adds a new helper, iov_iter_iovec(), which returns the
iovec for the iterator's current position.
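
Together with the new iov_for_each() macro, consumers can be written like this
sketch (uiov, nr_segs, count and consume() are stand-ins):

	struct iov_iter start, iter;
	struct iovec iov;

	iov_iter_init(&start, uiov, nr_segs, count, 0);

	iov_for_each(iov, iter, start)
		consume(iov.iov_base, iov.iov_len);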

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/fs.h | 32 --------------------------------
include/linux/uio.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 60829565e5..e42211f044 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -290,38 +290,6 @@ struct page;
struct address_space;
struct writeback_control;

-struct iov_iter {
- const struct iovec *iov;
- unsigned long nr_segs;
- size_t iov_offset;
- size_t count;
-};
-
-size_t iov_iter_copy_from_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_from_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
-size_t iov_iter_single_seg_count(const struct iov_iter *i);
-
-static inline void iov_iter_init(struct iov_iter *i,
- const struct iovec *iov, unsigned long nr_segs,
- size_t count, size_t written)
-{
- i->iov = iov;
- i->nr_segs = nr_segs;
- i->iov_offset = 0;
- i->count = count + written;
-
- iov_iter_advance(i, written);
-}
-
-static inline size_t iov_iter_count(struct iov_iter *i)
-{
- return i->count;
-}
-
/*
* "descriptor" for what we're up to with a read.
* This allows us to use the same read code yet
diff --git a/include/linux/uio.h b/include/linux/uio.h
index c55ce243cc..347d70ce09 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -9,14 +9,23 @@
#ifndef __LINUX_UIO_H
#define __LINUX_UIO_H

+#include <linux/kernel.h>
#include <uapi/linux/uio.h>

+struct page;

struct kvec {
void *iov_base; /* and that should *never* hold a userland pointer */
size_t iov_len;
};

+struct iov_iter {
+ const struct iovec *iov;
+ unsigned long nr_segs;
+ size_t iov_offset;
+ size_t count;
+};
+
/*
* Total number of bytes covered by an iovec.
*
@@ -34,8 +43,49 @@ static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs)
return ret;
}

+static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
+{
+ return (struct iovec) {
+ .iov_base = iter->iov->iov_base + iter->iov_offset,
+ .iov_len = min(iter->count,
+ iter->iov->iov_len - iter->iov_offset),
+ };
+}
+
+#define iov_for_each(iov, iter, start) \
+ for (iter = (start); \
+ (iter).count && \
+ ((iov = iov_iter_iovec(&(iter))), 1); \
+ iov_iter_advance(&(iter), (iov).iov_len))
+
unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to);

+size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+size_t iov_iter_single_seg_count(const struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t count, size_t written)
+{
+ i->iov = iov;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count + written;
+
+ iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+ return i->count;
+}
+
int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len);
+
#endif
--
1.9.0

2014-02-26 23:40:21

by Kent Overstreet

Subject: [PATCH 2/9] block: Gut bio_add_page()

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
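
With the queue limit checks and merge_bvec_fn calls gone, bio_add_page() can
only fail when the bvec array is full (bi_vcnt == bi_max_vecs) or the bio is a
clone, so callers reduce to a sketch like this (pages/nr_pages are stand-ins):

	int i;

	for (i = 0; i < nr_pages; i++)
		if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0)) {
			/* bvec array full - submit this bio and carry on
			 * with a fresh one */
			break;
		}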

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
---
fs/bio.c | 137 ++++++++++++++++++++++++++-------------------------------------
1 file changed, 57 insertions(+), 80 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index b2dd42ed9e..8985cc784d 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -694,9 +694,23 @@ int bio_get_nr_vecs(struct block_device *bdev)
}
EXPORT_SYMBOL(bio_get_nr_vecs);

-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+/**
+ * bio_add_pc_page - attempt to add page to bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by REQ_PC bios.
+ */
+int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
+ *page, unsigned int len, unsigned int offset)
{
int retried_segments = 0;
struct bio_vec *bvec;
@@ -707,7 +721,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (unlikely(bio_flagged(bio, BIO_CLONED)))
return 0;

- if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
+ if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
return 0;

/*
@@ -720,28 +734,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page

if (page == prev->bv_page &&
offset == prev->bv_offset + prev->bv_len) {
- unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
-
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- /* prev_bvec is already charged in
- bi_size, discharge it in order to
- simulate merging updated prev_bvec
- as new bvec. */
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = bio->bi_iter.bi_size -
- prev_bv_len,
- .bi_rw = bio->bi_rw,
- };
-
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
- prev->bv_len -= len;
- return 0;
- }
- }
-
goto done;
}
}
@@ -772,31 +765,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
bvec->bv_len = len;
bvec->bv_offset = offset;

- /*
- * if queue has other restrictions (eg varying max sector size
- * depending on offset), it can specify a merge_bvec_fn in the
- * queue to get further control
- */
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = bio->bi_iter.bi_size,
- .bi_rw = bio->bi_rw,
- };
-
- /*
- * merge_bvec_fn() returns number of bytes it can accept
- * at this offset
- */
- if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
- bvec->bv_page = NULL;
- bvec->bv_len = 0;
- bvec->bv_offset = 0;
- return 0;
- }
- }
-
/* If we may be able to merge these biovecs, force a recount */
if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
bio->bi_flags &= ~(1 << BIO_SEG_VALID);
@@ -807,28 +775,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
bio->bi_iter.bi_size += len;
return len;
}
-
-/**
- * bio_add_pc_page - attempt to add page to bio
- * @q: the target queue
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
- *
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
- *
- * This should only be used by REQ_PC bios.
- */
-int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
- unsigned int len, unsigned int offset)
-{
- return __bio_add_page(q, bio, page, len, offset,
- queue_max_hw_sectors(q));
-}
EXPORT_SYMBOL(bio_add_pc_page);

/**
@@ -838,16 +784,47 @@ EXPORT_SYMBOL(bio_add_pc_page);
* @len: vec entry length
* @offset: vec entry offset
*
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
+ * Attempt to add a page to the bio_vec maplist. This will only fail if
+ * bio->bi_vcnt == bio->bi_max_vecs.
*/
-int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
- unsigned int offset)
+int bio_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- return __bio_add_page(q, bio, page, len, offset, queue_max_sectors(q));
+ struct bio_vec *bv;
+
+ /*
+ * cloned bio must not modify vec list
+ */
+ if (unlikely(bio_flagged(bio, BIO_CLONED)))
+ return 0;
+
+ /*
+ * For filesystems with a blocksize smaller than the pagesize
+ * we will often be called with the same page as last time and
+ * a consecutive offset. Optimize this special case.
+ */
+ if (bio->bi_vcnt > 0) {
+ bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ if (page == bv->bv_page &&
+ offset == bv->bv_offset + bv->bv_len) {
+ bv->bv_len += len;
+ goto done;
+ }
+ }
+
+ if (bio->bi_vcnt >= bio->bi_max_vecs)
+ return 0;
+
+ bv = &bio->bi_io_vec[bio->bi_vcnt];
+ bv->bv_page = page;
+ bv->bv_len = len;
+ bv->bv_offset = offset;
+
+ bio->bi_vcnt++;
+done:
+ bio->bi_iter.bi_size += len;
+ return len;
}
EXPORT_SYMBOL(bio_add_page);

--
1.9.0

2014-02-26 23:45:53

by Kent Overstreet

Subject: [PATCH 3/9] blk-lib.c: generic_make_request() handles large bios now

generic_make_request() will now do for us what the code in blk-lib.c was doing
manually, with the bio_batch stuff - we still need some looping in case we're
trying to discard/zeroout more than around a gigabyte, but when we can submit
that much at a time, doing the submissions in parallel really shouldn't matter.
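
Each helper now reduces to the same loop: build a bio that's nothing but a
size, let generic_make_request() split it against the queue limits, and wait
(sketch of the discard case from the patch below):

	while (nr_sects) {
		bio = bio_alloc(gfp_mask, 1);
		if (!bio)
			return -ENOMEM;

		bio->bi_bdev = bdev;
		bio->bi_iter.bi_sector = sector;
		bio->bi_iter.bi_size = min_t(sector_t, nr_sects, 1 << 20) << 9;

		sector += bio_sectors(bio);
		nr_sects -= bio_sectors(bio);

		ret = submit_bio_wait(type, bio);
		if (ret)
			break;
	}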

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-lib.c | 175 ++++++++++----------------------------------------------
1 file changed, 30 insertions(+), 145 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 97a733cf3d..117467c2ff 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -9,23 +9,6 @@

#include "blk.h"

-struct bio_batch {
- atomic_t done;
- unsigned long flags;
- struct completion *wait;
-};
-
-static void bio_batch_end_io(struct bio *bio, int err)
-{
- struct bio_batch *bb = bio->bi_private;
-
- if (err && (err != -EOPNOTSUPP))
- clear_bit(BIO_UPTODATE, &bb->flags);
- if (atomic_dec_and_test(&bb->done))
- complete(bb->wait);
- bio_put(bio);
-}
-
/**
* blkdev_issue_discard - queue a discard
* @bdev: blockdev to issue discard for
@@ -40,15 +23,10 @@ static void bio_batch_end_io(struct bio *bio, int err)
int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
{
- DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
int type = REQ_WRITE | REQ_DISCARD;
- unsigned int max_discard_sectors, granularity;
- int alignment;
- struct bio_batch bb;
struct bio *bio;
int ret = 0;
- struct blk_plug plug;

if (!q)
return -ENXIO;
@@ -56,69 +34,27 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
if (!blk_queue_discard(q))
return -EOPNOTSUPP;

- /* Zero-sector (unknown) and one-sector granularities are the same. */
- granularity = max(q->limits.discard_granularity >> 9, 1U);
- alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
- /*
- * Ensure that max_discard_sectors is of the proper
- * granularity, so that requests stay aligned after a split.
- */
- max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
- max_discard_sectors -= max_discard_sectors % granularity;
- if (unlikely(!max_discard_sectors)) {
- /* Avoid infinite loop below. Being cautious never hurts. */
- return -EOPNOTSUPP;
- }
-
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secdiscard(q))
return -EOPNOTSUPP;
type |= REQ_SECURE;
}

- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;
-
- blk_start_plug(&plug);
while (nr_sects) {
- unsigned int req_sects;
- sector_t end_sect, tmp;
-
bio = bio_alloc(gfp_mask, 1);
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
-
- req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
+ if (!bio)
+ return -ENOMEM;

- /*
- * If splitting a request, and the next starting sector would be
- * misaligned, stop the discard at the previous aligned sector.
- */
- end_sect = sector + req_sects;
- tmp = end_sect;
- if (req_sects < nr_sects &&
- sector_div(tmp, granularity) != alignment) {
- end_sect = end_sect - alignment;
- sector_div(end_sect, granularity);
- end_sect = end_sect * granularity + alignment;
- req_sects = end_sect - sector;
- }
-
- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = bio_batch_end_io;
bio->bi_bdev = bdev;
- bio->bi_private = &bb;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_iter.bi_size = min_t(sector_t, nr_sects, 1 << 20) << 9;

- bio->bi_iter.bi_size = req_sects << 9;
- nr_sects -= req_sects;
- sector = end_sect;
+ sector += bio_sectors(bio);
+ nr_sects -= bio_sectors(bio);

- atomic_inc(&bb.done);
- submit_bio(type, bio);
+ ret = submit_bio_wait(type, bio);
+ if (ret)
+ break;

/*
* We can loop for a long time in here, if someone does
@@ -128,14 +64,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
*/
cond_resched();
}
- blk_finish_plug(&plug);
-
- /* Wait for bios in-flight */
- if (!atomic_dec_and_test(&bb.done))
- wait_for_completion_io(&wait);
-
- if (!test_bit(BIO_UPTODATE, &bb.flags))
- ret = -EIO;

return ret;
}
@@ -156,61 +84,37 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask,
struct page *page)
{
- DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
- unsigned int max_write_same_sectors;
- struct bio_batch bb;
struct bio *bio;
int ret = 0;

if (!q)
return -ENXIO;

- max_write_same_sectors = q->limits.max_write_same_sectors;
-
- if (max_write_same_sectors == 0)
+ if (!q->limits.max_write_same_sectors)
return -EOPNOTSUPP;

- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;
-
while (nr_sects) {
bio = bio_alloc(gfp_mask, 1);
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
+ if (!bio)
+ return -ENOMEM;

- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = bio_batch_end_io;
bio->bi_bdev = bdev;
- bio->bi_private = &bb;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_iter.bi_size = min_t(sector_t, nr_sects, 1 << 20) << 9;
bio->bi_vcnt = 1;
bio->bi_io_vec->bv_page = page;
bio->bi_io_vec->bv_offset = 0;
bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);

- if (nr_sects > max_write_same_sectors) {
- bio->bi_iter.bi_size = max_write_same_sectors << 9;
- nr_sects -= max_write_same_sectors;
- sector += max_write_same_sectors;
- } else {
- bio->bi_iter.bi_size = nr_sects << 9;
- nr_sects = 0;
- }
+ sector += bio_sectors(bio);
+ nr_sects -= bio_sectors(bio);

- atomic_inc(&bb.done);
- submit_bio(REQ_WRITE | REQ_WRITE_SAME, bio);
+ ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
+ if (ret)
+ break;
}

- /* Wait for bios in-flight */
- if (!atomic_dec_and_test(&bb.done))
- wait_for_completion_io(&wait);
-
- if (!test_bit(BIO_UPTODATE, &bb.flags))
- ret = -ENOTSUPP;
-
return ret;
}
EXPORT_SYMBOL(blkdev_issue_write_same);
@@ -225,33 +129,22 @@ EXPORT_SYMBOL(blkdev_issue_write_same);
* Description:
* Generate and issue number of bios with zerofiled pages.
*/
-
int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
- sector_t nr_sects, gfp_t gfp_mask)
+ sector_t nr_sects, gfp_t gfp_mask)
{
- int ret;
+ int ret = 0;
struct bio *bio;
- struct bio_batch bb;
unsigned int sz;
- DECLARE_COMPLETION_ONSTACK(wait);
-
- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;

- ret = 0;
- while (nr_sects != 0) {
+ while (nr_sects) {
bio = bio_alloc(gfp_mask,
- min(nr_sects, (sector_t)BIO_MAX_PAGES));
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
+ min(nr_sects / (PAGE_SIZE >> 9),
+ (sector_t)BIO_MAX_PAGES));
+ if (!bio)
+ return -ENOMEM;

bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
- bio->bi_end_io = bio_batch_end_io;
- bio->bi_private = &bb;

while (nr_sects != 0) {
sz = min((sector_t) PAGE_SIZE >> 9 , nr_sects);
@@ -261,18 +154,11 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
if (ret < (sz << 9))
break;
}
- ret = 0;
- atomic_inc(&bb.done);
- submit_bio(WRITE, bio);
- }
-
- /* Wait for bios in-flight */
- if (!atomic_dec_and_test(&bb.done))
- wait_for_completion_io(&wait);

- if (!test_bit(BIO_UPTODATE, &bb.flags))
- /* One of bios in the batch was completed with error.*/
- ret = -EIO;
+ ret = submit_bio_wait(WRITE, bio);
+ if (ret)
+ break;
+ }

return ret;
}
@@ -287,7 +173,6 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
* Description:
* Generate and issue number of bios with zerofiled pages.
*/
-
int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask)
{
--
1.9.0

2014-02-26 23:45:52

by Kent Overstreet

Subject: [PATCH 4/9] bcache: generic_make_request() handles large bios now

So we get to delete our hacky workaround.

Signed-off-by: Kent Overstreet <[email protected]>
---
drivers/md/bcache/bcache.h | 18 --------
drivers/md/bcache/io.c | 100 +-----------------------------------------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +++----
drivers/md/bcache/super.c | 32 +-------------
drivers/md/bcache/util.h | 5 ++-
drivers/md/bcache/writeback.c | 4 +-
7 files changed, 18 insertions(+), 161 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index a4c7306ff4..d669947650 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -245,19 +245,6 @@ struct keybuf {
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};

-struct bio_split_pool {
- struct bio_set *bio_split;
- mempool_t *bio_split_hook;
-};
-
-struct bio_split_hook {
- struct closure cl;
- struct bio_split_pool *p;
- struct bio *bio;
- bio_end_io_t *bi_end_io;
- void *bi_private;
-};
-
struct bcache_device {
struct closure cl;

@@ -290,8 +277,6 @@ struct bcache_device {
int (*cache_miss)(struct btree *, struct search *,
struct bio *, unsigned);
int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
- struct bio_split_pool bio_split_hook;
};

struct io {
@@ -467,8 +452,6 @@ struct cache {
atomic_long_t meta_sectors_written;
atomic_long_t btree_sectors_written;
atomic_long_t sectors_written;
-
- struct bio_split_pool bio_split_hook;
};

struct gc_stat {
@@ -893,7 +876,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
void bch_bbio_free(struct bio *, struct cache_set *);
struct bio *bch_bbio_alloc(struct cache_set *);

-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
void __bch_submit_bbio(struct bio *, struct cache_set *);
void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);

diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa82d..86a0bb8712 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,104 +11,6 @@

#include <linux/blkdev.h>

-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- struct bio_vec bv;
- struct bvec_iter iter;
- unsigned ret = 0, seg = 0;
-
- if (bio->bi_rw & REQ_DISCARD)
- return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
- bio_for_each_segment(bv, bio, iter) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = ret << 9,
- .bi_rw = bio->bi_rw,
- };
-
- if (seg == min_t(unsigned, BIO_MAX_PAGES,
- queue_max_segments(q)))
- break;
-
- if (q->merge_bvec_fn &&
- q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
- break;
-
- seg++;
- ret += bv.bv_len >> 9;
- }
-
- ret = min(ret, queue_max_sectors(q));
-
- WARN_ON(!ret);
- ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
- return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- s->bio->bi_end_io = s->bi_end_io;
- s->bio->bi_private = s->bi_private;
- bio_endio_nodec(s->bio, 0);
-
- closure_debug_destroy(&s->cl);
- mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
- struct closure *cl = bio->bi_private;
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- if (error)
- clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
- bio_put(bio);
- closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
- struct bio_split_hook *s;
- struct bio *n;
-
- if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
- goto submit;
-
- if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
- goto submit;
-
- s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
- closure_init(&s->cl, NULL);
-
- s->bio = bio;
- s->p = p;
- s->bi_end_io = bio->bi_end_io;
- s->bi_private = bio->bi_private;
- bio_get(bio);
-
- do {
- n = bio_next_split(bio, bch_bio_max_sectors(bio),
- GFP_NOIO, s->p->bio_split);
-
- n->bi_end_io = bch_bio_submit_split_endio;
- n->bi_private = &s->cl;
-
- closure_get(&s->cl);
- generic_make_request(n);
- } while (n != bio);
-
- continue_at(&s->cl, bch_bio_submit_split_done, NULL);
-submit:
- generic_make_request(bio);
-}
-
/* Bios with headers */

void bch_bbio_free(struct bio *bio, struct cache_set *c)
@@ -138,7 +40,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
bio->bi_bdev = PTR_CACHE(c, &b->key, 0)->bdev;

b->submit_time_us = local_clock_us();
- closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
+ closure_bio_submit(bio, bio->bi_private);
}

void bch_submit_bbio(struct bio *bio, struct cache_set *c,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 18039affc3..3246497c1a 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -60,7 +60,7 @@ reread: left = ca->sb.bucket_size - offset;
bio->bi_private = &cl;
bch_bio_map(bio, data);

- closure_bio_submit(bio, &cl, ca);
+ closure_bio_submit(bio, &cl);
closure_sync(&cl);

/* This function could be simpler now since we no longer write
@@ -642,7 +642,7 @@ static void journal_write_unlocked(struct closure *cl)
spin_unlock(&c->journal.lock);

while ((bio = bio_list_pop(&list)))
- closure_bio_submit(bio, cl, c->cache[0]);
+ closure_bio_submit(bio, cl);

continue_at(cl, journal_write_done, NULL);
}
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 5d5d031cf3..29f77f9da2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -873,7 +873,7 @@ static void cached_dev_read_error(struct closure *cl)

/* XXX: invalidate cache */

- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}

continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -996,7 +996,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
s->cache_miss = miss;
s->iop.bio = cache_bio;
bio_get(cache_bio);
- closure_bio_submit(cache_bio, &s->cl, s->d);
+ closure_bio_submit(cache_bio, &s->cl);

return ret;
out_put:
@@ -1004,7 +1004,7 @@ out_put:
out_submit:
miss->bi_end_io = request_endio;
miss->bi_private = &s->cl;
- closure_bio_submit(miss, &s->cl, s->d);
+ closure_bio_submit(miss, &s->cl);
return ret;
}

@@ -1069,7 +1069,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)

if (!(bio->bi_rw & REQ_DISCARD) ||
blk_queue_discard(bdev_get_queue(dc->bdev)))
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
} else if (s->iop.writeback) {
bch_writeback_add(dc);
s->iop.bio = bio;
@@ -1084,12 +1084,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
flush->bi_end_io = request_endio;
flush->bi_private = cl;

- closure_bio_submit(flush, cl, s->d);
+ closure_bio_submit(flush, cl);
}
} else {
s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);

- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}

closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -1105,7 +1105,7 @@ static void cached_dev_nodata(struct closure *cl)
bch_journal_meta(s->iop.c, cl);

/* If it's a flush, we send the flush to the backing device too */
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);

continue_at(cl, cached_dev_bio_complete, NULL);
}
@@ -1152,7 +1152,7 @@ static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
!blk_queue_discard(bdev_get_queue(dc->bdev)))
bio_endio(bio, 0);
else
- bch_generic_make_request(bio, &d->bio_split_hook);
+ generic_make_request(bio);
}
}

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 24a3a1546c..d4e328386f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,29 +59,6 @@ struct workqueue_struct *bcache_wq;

#define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE)

-static void bio_split_pool_free(struct bio_split_pool *p)
-{
- if (p->bio_split_hook)
- mempool_destroy(p->bio_split_hook);
-
- if (p->bio_split)
- bioset_free(p->bio_split);
-}
-
-static int bio_split_pool_init(struct bio_split_pool *p)
-{
- p->bio_split = bioset_create(4, 0);
- if (!p->bio_split)
- return -ENOMEM;
-
- p->bio_split_hook = mempool_create_kmalloc_pool(4,
- sizeof(struct bio_split_hook));
- if (!p->bio_split_hook)
- return -ENOMEM;
-
- return 0;
-}
-
/* Superblock */

static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
@@ -537,7 +514,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, unsigned long rw)
bio->bi_private = ca;
bch_bio_map(bio, ca->disk_buckets);

- closure_bio_submit(bio, &ca->prio, ca);
+ closure_bio_submit(bio, &ca->prio);
closure_sync(cl);
}

@@ -763,7 +740,6 @@ static void bcache_device_free(struct bcache_device *d)
put_disk(d->disk);
}

- bio_split_pool_free(&d->bio_split_hook);
if (d->bio_split)
bioset_free(d->bio_split);
if (is_vmalloc_addr(d->full_dirty_stripes))
@@ -816,7 +792,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
return minor;

if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
- bio_split_pool_init(&d->bio_split_hook) ||
!(d->disk = alloc_disk(1))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1775,8 +1750,6 @@ void bch_cache_release(struct kobject *kobj)
if (ca->set)
ca->set->cache[ca->sb.nr_this_dev] = NULL;

- bio_split_pool_free(&ca->bio_split_hook);
-
free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
kfree(ca->prio_buckets);
vfree(ca->buckets);
@@ -1825,8 +1798,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
ca->sb.nbuckets)) ||
!(ca->prio_buckets = kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
2, GFP_KERNEL)) ||
- !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)) ||
- bio_split_pool_init(&ca->bio_split_hook))
+ !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)))
return -ENOMEM;

ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index ac7d0d1f70..9ac17568f8 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@

#include <linux/blkdev.h>
#include <linux/errno.h>
+#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/ratelimit.h>
@@ -576,10 +577,10 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
return bdev->bd_inode->i_size >> 9;
}

-#define closure_bio_submit(bio, cl, dev) \
+#define closure_bio_submit(bio, cl) \
do { \
closure_get(cl); \
- bch_generic_make_request(bio, &(dev)->bio_split_hook); \
+ generic_make_request(bio); \
} while (0)

uint64_t bch_crc64_update(uint64_t, const void *, size_t);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f4300e4c01..a01d97bc2e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -188,7 +188,7 @@ static void write_dirty(struct closure *cl)
io->bio.bi_bdev = io->dc->bdev;
io->bio.bi_end_io = dirty_endio;

- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);

continue_at(cl, write_dirty_finish, system_wq);
}
@@ -208,7 +208,7 @@ static void read_dirty_submit(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);

- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);

continue_at(cl, write_dirty, system_wq);
}
--
1.9.0

2014-02-26 23:46:37

by Kent Overstreet

Subject: [PATCH 1/9] block: Make generic_make_request handle arbitrary sized bios

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to blk_queue_bounce();
this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be
concerned with bouncing affecting segment merging.
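
For a driver, the change is a one-liner at the top of its make_request_fn -
roughly (example_make_request() is a stand-in for the drivers touched below):

	static void example_make_request(struct request_queue *q, struct bio *bio)
	{
		blk_queue_split(q, &bio, q->bio_split);

		/*
		 * ... the rest of the driver's submission path, now
		 * guaranteed a bio that fits the queue's limits ...
		 */
	}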

Some make_request_fns were simple enough to audit and verify they don't
need blk_queue_split() calls. The skipped ones are:

* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left for future
patches.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: [email protected]
Cc: Lars Ellenberg <[email protected]>
Cc: [email protected]
Cc: Asai Thambi S P <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: [email protected]
Cc: Jiri Kosina <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Jim Paris <[email protected]>
Cc: Joshua Morris <[email protected]>
Cc: Philip Kelleher <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peng Tao <[email protected]>
---
block/blk-core.c | 19 ++--
block/blk-merge.c | 150 ++++++++++++++++++++++++++--
block/blk-mq.c | 2 +
drivers/block/drbd/drbd_req.c | 2 +
drivers/block/mtip32xx/mtip32xx.c | 6 +-
drivers/block/nvme-core.c | 2 +
drivers/block/pktcdvd.c | 6 +-
drivers/block/ps3vram.c | 2 +
drivers/block/rsxx/dev.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 2 +
drivers/md/dm.c | 2 +
drivers/md/md.c | 2 +
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
include/linux/blkdev.h | 3 +
17 files changed, 185 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 853f927492..d3b0782ec3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -581,6 +581,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_c;

+ q->bio_split = bioset_create(4, 0);
+ if (!q->bio_split)
+ goto fail_id;
+
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
q->backing_dev_info.state = 0;
@@ -590,7 +594,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

err = bdi_init(&q->backing_dev_info);
if (err)
- goto fail_id;
+ goto fail_split;

setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
@@ -635,6 +639,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

fail_bdi:
bdi_destroy(&q->backing_dev_info);
+fail_split:
+ bioset_free(q->bio_split);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_c:
@@ -1501,6 +1507,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
struct request *req;
unsigned int request_count = 0;

+ blk_queue_split(q, &bio, q->bio_split);
+
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
@@ -1723,15 +1731,6 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}

- if (likely(bio_is_rw(bio) &&
- nr_sectors > queue_max_hw_sectors(q))) {
- printk(KERN_ERR "bio too big device %s (%u > %u)\n",
- bdevname(bio->bi_bdev, b),
- bio_sectors(bio),
- queue_max_hw_sectors(q));
- goto end_io;
- }
-
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6c583f9c5b..0afbe3f1c2 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,11 +9,149 @@

#include "blk.h"

+static struct bio *blk_bio_discard_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ unsigned int max_discard_sectors, granularity;
+ int alignment;
+ sector_t tmp;
+ unsigned split_sectors;
+
+ /* Zero-sector (unknown) and one-sector granularities are the same. */
+ granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+ max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+ max_discard_sectors -= max_discard_sectors % granularity;
+
+ if (unlikely(!max_discard_sectors)) {
+ /* XXX: warn */
+ return NULL;
+ }
+
+ if (bio_sectors(bio) <= max_discard_sectors)
+ return NULL;
+
+ split_sectors = max_discard_sectors;
+
+ /*
+ * If the next starting sector would be misaligned, stop the discard at
+ * the previous aligned sector.
+ */
+ alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+ tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+ tmp = sector_div(tmp, granularity);
+
+ if (split_sectors > tmp)
+ split_sectors -= tmp;
+
+ return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ if (!q->limits.max_write_same_sectors)
+ return NULL;
+
+ if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+ return NULL;
+
+ return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ struct bio *split;
+ struct bio_vec bv, bvprv;
+ struct bvec_iter iter;
+ unsigned seg_size = 0, nsegs = 0;
+ int prev = 0;
+
+ struct bvec_merge_data bvm = {
+ .bi_bdev = bio->bi_bdev,
+ .bi_sector = bio->bi_iter.bi_sector,
+ .bi_size = 0,
+ .bi_rw = bio->bi_rw,
+ };
+
+ bio_for_each_segment(bv, bio, iter) {
+ if (q->merge_bvec_fn &&
+ q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+ goto split;
+
+ bvm.bi_size += bv.bv_len;
+
+ if (bvm.bi_size >> 9 > queue_max_sectors(q))
+ goto split;
+
+ if (prev && blk_queue_cluster(q)) {
+ if (seg_size + bv.bv_len > queue_max_segment_size(q))
+ goto new_segment;
+ if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+ goto new_segment;
+ if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+ goto new_segment;
+
+ seg_size += bv.bv_len;
+ bvprv = bv;
+ prev = 1;
+ continue;
+ }
+new_segment:
+ if (nsegs == queue_max_segments(q))
+ goto split;
+
+ nsegs++;
+ bvprv = bv;
+ prev = 1;
+ seg_size = bv.bv_len;
+ }
+
+ return NULL;
+split:
+ split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+ split->bi_iter.bi_size -= iter.bi_size;
+ bio->bi_iter = iter;
+
+ if (bio_integrity(bio)) {
+ bio_integrity_advance(bio, split->bi_iter.bi_size);
+ bio_integrity_trim(split, 0, bio_sectors(split));
+ }
+
+ return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+ struct bio_set *bs)
+{
+ struct bio *split;
+
+ if ((*bio)->bi_rw & REQ_DISCARD)
+ split = blk_bio_discard_split(q, *bio, bs);
+ else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+ split = blk_bio_write_same_split(q, *bio, bs);
+ else
+ split = blk_bio_segment_split(q, *bio, bs);
+
+ if (split) {
+ bio_chain(split, *bio);
+ generic_make_request(*bio);
+ *bio = split;
+ }
+}
+EXPORT_SYMBOL(blk_queue_split);
+
static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio *bio)
{
struct bio_vec bv, bvprv = { NULL };
- int cluster, high, highprv = 1;
+ int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
struct bvec_iter iter;
@@ -37,13 +175,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_segment(bv, bio, iter) {
- /*
- * the trick here is making sure that a high page is
- * never considered part of another segment, since that
- * might change with the bounce page.
- */
- high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
- if (!high && !highprv && cluster) {
+ if (prev && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
@@ -63,8 +195,8 @@ new_segment:

nr_phys_segs++;
bvprv = bv;
+ prev = 1;
seg_size = bv.bv_len;
- highprv = high;
}
bbio = bio;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6468a715a0..7893e254d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -915,6 +915,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
return;
}

+ blk_queue_split(q, &bio, q->bio_split);
+
if (use_plug && blk_attempt_plug_merge(q, bio, &request_count))
return;

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 104a040f24..941a69c50c 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1275,6 +1275,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;
unsigned long start_time;

+ blk_queue_split(q, &bio, q->bio_split);
+
start_time = jiffies;

/*
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 516026954b..df733ca685 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4033,6 +4033,10 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
int nents = 0;
int tag = 0, unaligned = 0;

+ blk_queue_bounce(queue, &bio);
+
+ blk_queue_split(queue, &bio, queue->bio_split);
+
if (unlikely(dd->dd_flag & MTIP_DDF_STOP_IO)) {
if (unlikely(test_bit(MTIP_DDF_REMOVE_PENDING_BIT,
&dd->dd_flag))) {
@@ -4082,8 +4086,6 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)

sg = mtip_hw_get_scatterlist(dd, &tag, unaligned);
if (likely(sg != NULL)) {
- blk_queue_bounce(queue, &bio);
-
if (unlikely((bio)->bi_vcnt > MTIP_MAX_SG)) {
dev_warn(&dd->pdev->dev,
"Maximum number of SGL entries exceeded\n");
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 51824d1f23..e4376b9613 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -737,6 +737,8 @@ static void nvme_make_request(struct request_queue *q, struct bio *bio)
struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
int result = -EBUSY;

+ blk_queue_split(q, &bio, q->bio_split);
+
if (!nvmeq) {
put_nvmeq(NULL);
bio_endio(bio, -EIO);
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index a2af73db18..a37acf722b 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2444,6 +2444,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
char b[BDEVNAME_SIZE];
struct bio *split;

+ blk_queue_bounce(q, &bio);
+
+ blk_queue_split(q, &bio, q->bio_split);
+
pd = q->queuedata;
if (!pd) {
pr_err("%s incorrect request queue\n",
@@ -2474,8 +2478,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
goto end_io;
}

- blk_queue_bounce(q, &bio);
-
do {
sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index ef45cfb98f..a995972961 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -603,6 +603,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
int busy;

+ blk_queue_split(q, &bio, q->bio_split);
+
dev_dbg(&dev->core, "%s\n", __func__);

spin_lock_irq(&priv->lock);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 2839d37e5a..ff074a3cd4 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -169,6 +169,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
struct rsxx_bio_meta *bio_meta;
int st = -EINVAL;

+ blk_queue_split(q, &bio, q->bio_split);
+
might_sleep();

if (!card)
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5bf0..13d577cfbc 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
(unsigned long long)bio->bi_iter.bi_sector,
bio->bi_iter.bi_size);

+ blk_queue_split(q, &bio, q->bio_split);
+
spin_lock_irq(&card->lock);
*card->biotail = bio;
bio->bi_next = NULL;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 011e55d820..ecf9daa01c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -733,6 +733,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
{
struct zram *zram = queue->queuedata;

+ blk_queue_split(queue, &bio, queue->bio_split);
+
down_read(&zram->init_lock);
if (unlikely(!zram->init_done))
goto error;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c53b09b9a..97f70420f2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1500,6 +1500,8 @@ static void dm_request(struct request_queue *q, struct bio *bio)
{
struct mapped_device *md = q->queuedata;

+ blk_queue_split(q, &bio, q->bio_split);
+
if (dm_request_based(md))
blk_queue_bio(q, bio);
else
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4ad5cc4e63..1421bc3f7b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -256,6 +256,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
int cpu;
unsigned int sectors;

+ blk_queue_split(q, &bio, q->bio_split);
+
if (mddev == NULL || mddev->pers == NULL
|| !mddev->ready) {
bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index ebf41e228e..db33cd3e4c 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -815,6 +815,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
unsigned long source_addr;
unsigned long bytes_done;

+ blk_queue_split(q, &bio, q->bio_split);
+
bytes_done = 0;
dev_info = bio->bi_bdev->bd_disk->private_data;
if (dev_info == NULL)
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 6969d39f1e..f03c103f13 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
unsigned long page_addr;
unsigned long bytes;

+ blk_queue_split(q, &bio, q->bio_split);
+
if ((bio->bi_iter.bi_sector & 7) != 0 ||
(bio->bi_iter.bi_size & 4095) != 0)
/* Request is not page-aligned. */
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index 0718905ade..a3f6dc930b 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -344,6 +344,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
int rw = bio_rw(old_bio);
int inactive;

+ blk_queue_split(q, &old_bio, q->bio_split);
+
if (!lo)
goto err;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1e1fa3f93d..99e9955c4d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -470,6 +470,7 @@ struct request_queue {
wait_queue_head_t mq_freeze_wq;
struct percpu_counter mq_usage_counter;
struct list_head all_q_node;
+ struct bio_set *bio_split;
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
@@ -781,6 +782,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+ struct bio_set *);
extern void blk_recount_segments(struct request_queue *, struct bio *);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
--
1.9.0

2014-02-27 17:23:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/9] block: Make generic_make_request handle arbitrary sized bios

On Wed, Feb 26, 2014 at 03:39:49PM -0800, Kent Overstreet wrote:
> We do this by adding calls to blk_queue_split() to the various
> make_request functions that need it - a few can already handle arbitrary
> size bios. Note that we add the call _after_ any call to blk_queue_bounce();
> this means that blk_queue_split() and blk_recalc_rq_segments() don't need to
> be concerned with bouncing affecting segment merging.

> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index 51824d1f23..e4376b9613 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
> @@ -737,6 +737,8 @@ static void nvme_make_request(struct request_queue *q, struct bio *bio)
> struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> int result = -EBUSY;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (!nvmeq) {
> put_nvmeq(NULL);
> bio_endio(bio, -EIO);

I'd suggest that we do:

- struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
+ struct nvme_queue *nvmeq;
int result = -EBUSY;

+ blk_queue_split(q, &bio, q->bio_split);
+
+ nvmeq = get_nvmeq(ns->dev);
if (!nvmeq) {

so that we're running the blk_queue_split() code outside the get_cpu()
call.

Now, the NVMe driver has its own rules about when BIOs have to be split.
Right now, that's way down inside the nvme_map_bio() call when we're
walking the bio to compose the scatterlist. Should we instead have an
nvme_bio_split() routine that is called instead of blk_queue_split(),
so we can simplify nvme_map_bio() since it'll know that it's working
with bios that don't have to be split?

In fact, I think it would have little NVMe-specific code in it at that point,
so we could name __blk_bios_map_sg() better, export it to drivers and
call it from nvme_map_bio(), which I think would make everybody happier.
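
A rough sketch of that shape (illustrative only - it assumes the
existing blk_bio_map_sg() helper, or a suitably renamed
__blk_bios_map_sg(), and glosses over error handling):

	/* in nvme_map_bio(): let the block layer build the scatterlist */
	iod->nents = blk_bio_map_sg(q, bio, iod->sg);
	if (!dma_map_sg(dev, iod->sg, iod->nents, dma_dir))
		return -ENOMEM;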

> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index a2af73db18..a37acf722b 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2444,6 +2444,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
> char b[BDEVNAME_SIZE];
> struct bio *split;
>
> + blk_queue_bounce(q, &bio);
> +
> + blk_queue_split(q, &bio, q->bio_split);
> +
> pd = q->queuedata;
> if (!pd) {
> pr_err("%s incorrect request queue\n",
> @@ -2474,8 +2478,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
> goto end_io;
> }
>
> - blk_queue_bounce(q, &bio);
> -
> do {
> sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
> sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
> diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
> index ef45cfb98f..a995972961 100644
> --- a/drivers/block/ps3vram.c
> +++ b/drivers/block/ps3vram.c
> @@ -603,6 +603,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
> struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
> int busy;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> dev_dbg(&dev->core, "%s\n", __func__);
>
> spin_lock_irq(&priv->lock);
> diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
> index 2839d37e5a..ff074a3cd4 100644
> --- a/drivers/block/rsxx/dev.c
> +++ b/drivers/block/rsxx/dev.c
> @@ -169,6 +169,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
> struct rsxx_bio_meta *bio_meta;
> int st = -EINVAL;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> might_sleep();
>
> if (!card)
> diff --git a/drivers/block/umem.c b/drivers/block/umem.c
> index 4cf81b5bf0..13d577cfbc 100644
> --- a/drivers/block/umem.c
> +++ b/drivers/block/umem.c
> @@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
> (unsigned long long)bio->bi_iter.bi_sector,
> bio->bi_iter.bi_size);
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> spin_lock_irq(&card->lock);
> *card->biotail = bio;
> bio->bi_next = NULL;
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 011e55d820..ecf9daa01c 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -733,6 +733,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
> {
> struct zram *zram = queue->queuedata;
>
> + blk_queue_split(queue, &bio, queue->bio_split);
> +
> down_read(&zram->init_lock);
> if (unlikely(!zram->init_done))
> goto error;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 8c53b09b9a..97f70420f2 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1500,6 +1500,8 @@ static void dm_request(struct request_queue *q, struct bio *bio)
> {
> struct mapped_device *md = q->queuedata;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (dm_request_based(md))
> blk_queue_bio(q, bio);
> else
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4ad5cc4e63..1421bc3f7b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -256,6 +256,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
> int cpu;
> unsigned int sectors;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (mddev == NULL || mddev->pers == NULL
> || !mddev->ready) {
> bio_io_error(bio);
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index ebf41e228e..db33cd3e4c 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -815,6 +815,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
> unsigned long source_addr;
> unsigned long bytes_done;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> bytes_done = 0;
> dev_info = bio->bi_bdev->bd_disk->private_data;
> if (dev_info == NULL)
> diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
> index 6969d39f1e..f03c103f13 100644
> --- a/drivers/s390/block/xpram.c
> +++ b/drivers/s390/block/xpram.c
> @@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
> unsigned long page_addr;
> unsigned long bytes;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if ((bio->bi_iter.bi_sector & 7) != 0 ||
> (bio->bi_iter.bi_size & 4095) != 0)
> /* Request is not page-aligned. */
> diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
> index 0718905ade..a3f6dc930b 100644
> --- a/drivers/staging/lustre/lustre/llite/lloop.c
> +++ b/drivers/staging/lustre/lustre/llite/lloop.c
> @@ -344,6 +344,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
> int rw = bio_rw(old_bio);
> int inactive;
>
> + blk_queue_split(q, &old_bio, q->bio_split);
> +
> if (!lo)
> goto err;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1e1fa3f93d..99e9955c4d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -470,6 +470,7 @@ struct request_queue {
> wait_queue_head_t mq_freeze_wq;
> struct percpu_counter mq_usage_counter;
> struct list_head all_q_node;
> + struct bio_set *bio_split;
> };
>
> #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
> @@ -781,6 +782,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
> extern int blk_insert_cloned_request(struct request_queue *q,
> struct request *rq);
> extern void blk_delay_queue(struct request_queue *, unsigned long);
> +extern void blk_queue_split(struct request_queue *, struct bio **,
> + struct bio_set *);
> extern void blk_recount_segments(struct request_queue *, struct bio *);
> extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
> extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
> --
> 1.9.0

2014-02-27 21:25:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 1/9] block: Make generic_make_request handle arbitrary sized bios

On Thu, Feb 27, 2014 at 12:22:54PM -0500, Matthew Wilcox wrote:
> On Wed, Feb 26, 2014 at 03:39:49PM -0800, Kent Overstreet wrote:
> > We do this by adding calls to blk_queue_split() to the various
> > make_request functions that need it - a few can already handle arbitrary
> > size bios. Note that we add the call _after_ any call to blk_queue_bounce();
> > this means that blk_queue_split() and blk_recalc_rq_segments() don't need to
> > be concerned with bouncing affecting segment merging.
>
> > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > index 51824d1f23..e4376b9613 100644
> > --- a/drivers/block/nvme-core.c
> > +++ b/drivers/block/nvme-core.c
> > @@ -737,6 +737,8 @@ static void nvme_make_request(struct request_queue *q, struct bio *bio)
> > struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> > int result = -EBUSY;
> >
> > + blk_queue_split(q, &bio, q->bio_split);
> > +
> > if (!nvmeq) {
> > put_nvmeq(NULL);
> > bio_endio(bio, -EIO);
>
> I'd suggest that we do:
>
> - struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> + struct nvme_queue *nvmeq;
> int result = -EBUSY;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> + nvmeq = get_nvmeq(ns->dev);
> if (!nvmeq) {
>
> so that we're running the blk_queue_split() code outside the get_cpu()
> call.

Whoops, that's definitely a bug - blk_queue_split() allocates the split
bio with GFP_NOIO, so it can sleep, which isn't allowed while get_cpu()
has preemption disabled.

> Now, the NVMe driver has its own rules about when BIOs have to be split.
> Right now, that's way down inside the nvme_map_bio() call when we're
> walking the bio to compose the scatterlist. Should we instead have an
> nvme_bio_split() routine that is called instead of blk_queue_split(),
> so we can simplify nvme_map_bio() since it'll know that it's working
> with bios that don't have to be split?
>
> In fact, I think it would have little NVMe-specific code in it at that point,
> so we could name __blk_bios_map_sg() better, export it to drivers and
> call it from nvme_map_bio(), which I think would make everybody happier.

Yes, definitely - and by doing it there we shouldn't even have to split
the bios; we can just process them incrementally. I can write a patch
for it later if you want to test it.

2014-02-28 23:32:38

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 1/9] block: Make generic_make_request handle arbitrary sized bios

On Thu, Feb 27, 2014 at 12:22:54PM -0500, Matthew Wilcox wrote:
> On Wed, Feb 26, 2014 at 03:39:49PM -0800, Kent Overstreet wrote:
> > We do this by adding calls to blk_queue_split() to the various
> > make_request functions that need it - a few can already handle arbitrary
> > size bios. Note that we add the call _after_ any call to blk_queue_bounce();
> > this means that blk_queue_split() and blk_recalc_rq_segments() don't need to
> > be concerned with bouncing affecting segment merging.
>
> > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > index 51824d1f23..e4376b9613 100644
> > --- a/drivers/block/nvme-core.c
> > +++ b/drivers/block/nvme-core.c
> > @@ -737,6 +737,8 @@ static void nvme_make_request(struct request_queue *q, struct bio *bio)
> > struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> > int result = -EBUSY;
> >
> > + blk_queue_split(q, &bio, q->bio_split);
> > +
> > if (!nvmeq) {
> > put_nvmeq(NULL);
> > bio_endio(bio, -EIO);
>
> I'd suggest that we do:
>
> - struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
> + struct nvme_queue *nvmeq;
> int result = -EBUSY;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> + nvmeq = get_nvmeq(ns->dev);
> if (!nvmeq) {
>
> so that we're running the blk_queue_split() code outside the get_cpu()
> call.
>
> Now, the NVMe driver has its own rules about when BIOs have to be split.
> Right now, that's way down inside the nvme_map_bio() call when we're
> walking the bio to compose the scatterlist. Should we instead have an
> nvme_bio_split() routine that is called instead of blk_queue_split(),
> so we can simplify nvme_map_bio() since it'll know that it's working
> with bios that don't have to be split?
>
> In fact, I think it would have little NVMe-specific code in it at that point,
> so we could name __blk_bios_map_sg() better, export it to drivers and
> call it from nvme_map_bio(), which I think would make everybody happier.

Actually, reading nvme_map_bio() (it's changed since I last looked at
it), it looks like nvme should already be able to handle arbitrary size
bios?

I do intend to rework blk_bio_map_sg() (or add a new helper?) to
incrementally map as much of a bio as will fit in the provided
scatterlist, but it looks like nvme has some odd restrictions where it's
using BIOVEC_PHYS_MERGEABLE()/BIOVEC_NOT_VIRT_MERGEABLE(), so I'm not
sure it's worth trying to have it use the generic code.

However, we don't need an explicit split here: if the sg list fills up
(i.e. at the places nvme_split_and_submit() is currently called), we can
just mark the bio as partially completed (set bio->bi_iter = iter, i.e.
use the iterator you passed to bio_for_each_segment()), then increment
bi_remaining (which just counts completions, i.e. bio_endio() calls
before the bio is really completed) and resubmit the original bio. No
need to allocate a split bio or loop over the bio again in bio_split().
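
A minimal sketch of that trick (hypothetical helper; relies on the
atomic_t bi_remaining that immutable biovecs added to struct bio):

	static void nvme_resubmit_remainder(struct bio *bio,
					    struct bvec_iter iter)
	{
		/*
		 * iter points just past the last segment we mapped; make
		 * the bio describe only the unprocessed remainder.
		 */
		bio->bi_iter = iter;

		/* one more bio_endio() must happen before final completion */
		atomic_inc(&bio->bi_remaining);

		/* requeue the remainder - no split bio ever gets allocated */
		generic_make_request(bio);
	}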