2015-05-22 18:19:15

by Ming Lin

Subject: [PATCH v4 00/11] simplify block layer based on immutable biovecs

v4:
- rebase on top of 4.1-rc4
- use BIO_POOL_SIZE instead of number 4 for bioset_create()
- call blk_queue_split() in blk_mq_make_request()
- call blk_queue_split() in zram_make_request()
- add patch "block: remove bio_get_nr_vecs()"
- remove split code in blkdev_issue_discard()
- drop patch "md/raid10: make sync_request_write() call bio_copy_data()".
  NeilBrown queued it.
- drop patch "block: allow __blk_queue_bounce() to handle bios larger than BIO_MAX_PAGES".
  Will send it separately.

v3:
- rebase on top of 4.1-rc2
- support for QUEUE_FLAG_SG_GAPS
- update commit logs of patch 2&4
- split bio for chunk_aligned_read

v2: https://lkml.org/lkml/2015/4/28/28
v1: https://lkml.org/lkml/2014/12/22/128

This is the 4th attempt at simplifying the block layer based on immutable
biovecs. Immutable biovecs, implemented by Kent Overstreet, have been
available in mainline since v3.14. The original goal was to make
generic_make_request() accept arbitrarily sized bios, pushing the
splitting down to the drivers or wherever it's required. See also the
past discussions [1] [2] [3].

This will bring not only performance improvements, but also a great
reduction in code complexity all over the block layer. The performance
gain is possible because bio_add_page() no longer has to check
unnecessary conditions such as queue limits or whether biovecs are
mergeable; those checks are delegated to the driver level. Kent has
already benchmarked the impact of this with fio on a Micron p320h,
which showed a clearly positive impact.

Moreover, this patchset allows a lot of code to be deleted, mainly
through the removal of merge_bvec_fn() callbacks. Handling bio merging
consistently has always been a delicate issue for stacking block drivers
(e.g. md and bcache). This simplification helps every individual block
driver avoid such issues.

Patches are against 4.1-rc4. These are also available in my git repo at:

https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
git://git.kernel.org/pub/scm/linux/kernel/git/mlin/linux.git block-generic-req

This patchset is a prerequisite for other upcoming patchsets, e.g.
multipage biovecs, rewriting plugging, or rewriting direct-IO, which are
excluded this time. That means this patchset alone should not bring any
regression to end users.

Comments are welcome.
Ming

[1] https://lkml.org/lkml/2014/11/23/263
[2] https://lkml.org/lkml/2013/11/25/732
[3] https://lkml.org/lkml/2014/2/26/618

Dongsu Park (1):
Documentation: update notes in biovecs about arbitrarily sized bios

Kent Overstreet (8):
block: make generic_make_request handle arbitrarily sized bios
block: simplify bio_add_page()
bcache: remove driver private bio splitting code
btrfs: remove bio splitting and merge_bvec_fn() calls
md/raid5: get rid of bio_fits_rdev()
block: kill merge_bvec_fn() completely
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: remove bio_get_nr_vecs()

Ming Lin (2):
block: remove split code in blkdev_issue_discard
md/raid5: split bio for chunk_aligned_read

Documentation/block/biovecs.txt | 10 +-
block/bio.c | 152 ++++++++++------------------
block/blk-core.c | 19 ++--
block/blk-lib.c | 73 +++----------
block/blk-merge.c | 148 +++++++++++++++++++++++++--
block/blk-mq.c | 4 +
block/blk-settings.c | 22 ----
drivers/block/drbd/drbd_int.h | 1 -
drivers/block/drbd/drbd_main.c | 1 -
drivers/block/drbd/drbd_req.c | 37 +------
drivers/block/pktcdvd.c | 27 +----
drivers/block/ps3vram.c | 2 +
drivers/block/rbd.c | 47 ---------
drivers/block/rsxx/dev.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 2 +
drivers/md/bcache/bcache.h | 18 ----
drivers/md/bcache/io.c | 100 +-----------------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +--
drivers/md/bcache/super.c | 32 +-----
drivers/md/bcache/util.h | 5 +-
drivers/md/bcache/writeback.c | 4 +-
drivers/md/dm-cache-target.c | 21 ----
drivers/md/dm-crypt.c | 16 ---
drivers/md/dm-era-target.c | 15 ---
drivers/md/dm-flakey.c | 16 ---
drivers/md/dm-io.c | 2 +-
drivers/md/dm-linear.c | 16 ---
drivers/md/dm-log-writes.c | 16 ---
drivers/md/dm-snap.c | 15 ---
drivers/md/dm-stripe.c | 21 ----
drivers/md/dm-table.c | 8 --
drivers/md/dm-thin.c | 31 ------
drivers/md/dm-verity.c | 16 ---
drivers/md/dm.c | 122 +---------------------
drivers/md/dm.h | 2 -
drivers/md/linear.c | 43 --------
drivers/md/md.c | 28 +----
drivers/md/md.h | 12 ---
drivers/md/multipath.c | 21 ----
drivers/md/raid0.c | 56 ----------
drivers/md/raid0.h | 2 -
drivers/md/raid1.c | 58 +----------
drivers/md/raid10.c | 121 +---------------------
drivers/md/raid5.c | 92 ++++++-----------
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
fs/btrfs/compression.c | 5 +-
fs/btrfs/extent_io.c | 9 +-
fs/btrfs/inode.c | 3 +-
fs/btrfs/scrub.c | 18 +---
fs/btrfs/volumes.c | 72 -------------
fs/buffer.c | 7 +-
fs/direct-io.c | 2 +-
fs/ext4/page-io.c | 3 +-
fs/ext4/readpage.c | 2 +-
fs/gfs2/lops.c | 9 +-
fs/jfs/jfs_logmgr.c | 14 +--
fs/logfs/dev_bdev.c | 4 +-
fs/mpage.c | 4 +-
fs/nilfs2/segbuf.c | 2 +-
fs/xfs/xfs_aops.c | 3 +-
include/linux/bio.h | 1 -
include/linux/blkdev.h | 13 +--
include/linux/device-mapper.h | 4 -
mm/page_io.c | 8 +-
68 files changed, 336 insertions(+), 1331 deletions(-)


2015-05-22 18:19:25

by Ming Lin

Subject: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

From: Kent Overstreet <[email protected]>

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: [email protected]
Cc: Lars Ellenberg <[email protected]>
Cc: [email protected]
Cc: Jiri Kosina <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Jim Paris <[email protected]>
Cc: Joshua Morris <[email protected]>
Cc: Philip Kelleher <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Oleg Drokin <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
block/blk-core.c | 19 ++--
block/blk-merge.c | 159 ++++++++++++++++++++++++++--
block/blk-mq.c | 4 +
drivers/block/drbd/drbd_req.c | 2 +
drivers/block/pktcdvd.c | 6 +-
drivers/block/ps3vram.c | 2 +
drivers/block/rsxx/dev.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 2 +
drivers/md/dm.c | 2 +
drivers/md/md.c | 2 +
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
include/linux/blkdev.h | 3 +
15 files changed, 189 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7871603..fbbb337 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -619,6 +619,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;

+ q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+ if (!q->bio_split)
+ goto fail_id;
+
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
q->backing_dev_info.state = 0;
@@ -628,7 +632,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

err = bdi_init(&q->backing_dev_info);
if (err)
- goto fail_id;
+ goto fail_split;

setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
@@ -670,6 +674,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

fail_bdi:
bdi_destroy(&q->backing_dev_info);
+fail_split:
+ bioset_free(q->bio_split);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_q:
@@ -1586,6 +1592,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
struct request *req;
unsigned int request_count = 0;

+ blk_queue_split(q, &bio, q->bio_split);
+
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
@@ -1809,15 +1817,6 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}

- if (likely(bio_is_rw(bio) &&
- nr_sectors > queue_max_hw_sectors(q))) {
- printk(KERN_ERR "bio too big device %s (%u > %u)\n",
- bdevname(bio->bi_bdev, b),
- bio_sectors(bio),
- queue_max_hw_sectors(q));
- goto end_io;
- }
-
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fd3fee8..dc14255 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,12 +9,158 @@

#include "blk.h"

+static struct bio *blk_bio_discard_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ unsigned int max_discard_sectors, granularity;
+ int alignment;
+ sector_t tmp;
+ unsigned split_sectors;
+
+ /* Zero-sector (unknown) and one-sector granularities are the same. */
+ granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+ max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+ max_discard_sectors -= max_discard_sectors % granularity;
+
+ if (unlikely(!max_discard_sectors)) {
+ /* XXX: warn */
+ return NULL;
+ }
+
+ if (bio_sectors(bio) <= max_discard_sectors)
+ return NULL;
+
+ split_sectors = max_discard_sectors;
+
+ /*
+ * If the next starting sector would be misaligned, stop the discard at
+ * the previous aligned sector.
+ */
+ alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+ tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+ tmp = sector_div(tmp, granularity);
+
+ if (split_sectors > tmp)
+ split_sectors -= tmp;
+
+ return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ if (!q->limits.max_write_same_sectors)
+ return NULL;
+
+ if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+ return NULL;
+
+ return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ struct bio *split;
+ struct bio_vec bv, bvprv;
+ struct bvec_iter iter;
+ unsigned seg_size = 0, nsegs = 0;
+ int prev = 0;
+
+ struct bvec_merge_data bvm = {
+ .bi_bdev = bio->bi_bdev,
+ .bi_sector = bio->bi_iter.bi_sector,
+ .bi_size = 0,
+ .bi_rw = bio->bi_rw,
+ };
+
+ bio_for_each_segment(bv, bio, iter) {
+ if (q->merge_bvec_fn &&
+ q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+ goto split;
+
+ bvm.bi_size += bv.bv_len;
+
+ if (bvm.bi_size >> 9 > queue_max_sectors(q))
+ goto split;
+
+ /*
+ * If the queue doesn't support SG gaps and adding this
+ * offset would create a gap, disallow it.
+ */
+ if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) &&
+ prev && bvec_gap_to_prev(&bvprv, bv.bv_offset))
+ goto split;
+
+ if (prev && blk_queue_cluster(q)) {
+ if (seg_size + bv.bv_len > queue_max_segment_size(q))
+ goto new_segment;
+ if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+ goto new_segment;
+ if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+ goto new_segment;
+
+ seg_size += bv.bv_len;
+ bvprv = bv;
+ prev = 1;
+ continue;
+ }
+new_segment:
+ if (nsegs == queue_max_segments(q))
+ goto split;
+
+ nsegs++;
+ bvprv = bv;
+ prev = 1;
+ seg_size = bv.bv_len;
+ }
+
+ return NULL;
+split:
+ split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+ split->bi_iter.bi_size -= iter.bi_size;
+ bio->bi_iter = iter;
+
+ if (bio_integrity(bio)) {
+ bio_integrity_advance(bio, split->bi_iter.bi_size);
+ bio_integrity_trim(split, 0, bio_sectors(split));
+ }
+
+ return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+ struct bio_set *bs)
+{
+ struct bio *split;
+
+ if ((*bio)->bi_rw & REQ_DISCARD)
+ split = blk_bio_discard_split(q, *bio, bs);
+ else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+ split = blk_bio_write_same_split(q, *bio, bs);
+ else
+ split = blk_bio_segment_split(q, *bio, q->bio_split);
+
+ if (split) {
+ bio_chain(split, *bio);
+ generic_make_request(*bio);
+ *bio = split;
+ }
+}
+EXPORT_SYMBOL(blk_queue_split);
+
static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio *bio,
bool no_sg_merge)
{
struct bio_vec bv, bvprv = { NULL };
- int cluster, high, highprv = 1;
+ int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
struct bvec_iter iter;
@@ -36,7 +182,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
cluster = blk_queue_cluster(q);
seg_size = 0;
nr_phys_segs = 0;
- high = 0;
for_each_bio(bio) {
bio_for_each_segment(bv, bio, iter) {
/*
@@ -46,13 +191,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (no_sg_merge)
goto new_segment;

- /*
- * the trick here is making sure that a high page is
- * never considered part of another segment, since
- * that might change with the bounce page.
- */
- high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
- if (!high && !highprv && cluster) {
+ if (prev && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
@@ -72,8 +211,8 @@ new_segment:

nr_phys_segs++;
bvprv = bv;
+ prev = 1;
seg_size = bv.bv_len;
- highprv = high;
}
bbio = bio;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e68b71b..e7fae76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1256,6 +1256,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
return;
}

+ blk_queue_split(q, &bio, q->bio_split);
+
rq = blk_mq_map_request(q, bio, &data);
if (unlikely(!rq))
return;
@@ -1339,6 +1341,8 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
return;
}

+ blk_queue_split(q, &bio, q->bio_split);
+
if (use_plug && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count))
return;
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3907202..a6265bc 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1497,6 +1497,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
struct drbd_device *device = (struct drbd_device *) q->queuedata;
unsigned long start_jif;

+ blk_queue_split(q, &bio, q->bio_split);
+
start_jif = jiffies;

/*
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628da..ea10bd9 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2446,6 +2446,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
char b[BDEVNAME_SIZE];
struct bio *split;

+ blk_queue_bounce(q, &bio);
+
+ blk_queue_split(q, &bio, q->bio_split);
+
pd = q->queuedata;
if (!pd) {
pr_err("%s incorrect request queue\n",
@@ -2476,8 +2480,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
goto end_io;
}

- blk_queue_bounce(q, &bio);
-
do {
sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index ef45cfb..e32e799 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -605,6 +605,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)

dev_dbg(&dev->core, "%s\n", __func__);

+ blk_queue_split(q, &bio, q->bio_split);
+
spin_lock_irq(&priv->lock);
busy = !bio_list_empty(&priv->list);
bio_list_add(&priv->list, bio);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index ac8c62c..50ef199 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -148,6 +148,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
struct rsxx_bio_meta *bio_meta;
int st = -EINVAL;

+ blk_queue_split(q, &bio, q->bio_split);
+
might_sleep();

if (!card)
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5..13d577c 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
(unsigned long long)bio->bi_iter.bi_sector,
bio->bi_iter.bi_size);

+ blk_queue_split(q, &bio, q->bio_split);
+
spin_lock_irq(&card->lock);
*card->biotail = bio;
bio->bi_next = NULL;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8dcbced..36a004e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -981,6 +981,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
if (unlikely(!zram_meta_get(zram)))
goto error;

+ blk_queue_split(queue, &bio, queue->bio_split);
+
if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
atomic64_inc(&zram->stats.invalid_io);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a930b72..34f6063 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1784,6 +1784,8 @@ static void dm_make_request(struct request_queue *q, struct bio *bio)

map = dm_get_live_table(md, &srcu_idx);

+ blk_queue_split(q, &bio, q->bio_split);
+
generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);

/* if we're suspended, we have to queue this io for later */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 593a024..046b3c9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -257,6 +257,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
unsigned int sectors;
int cpu;

+ blk_queue_split(q, &bio, q->bio_split);
+
if (mddev == NULL || mddev->pers == NULL
|| !mddev->ready) {
bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da21281..267ca3a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -826,6 +826,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
unsigned long source_addr;
unsigned long bytes_done;

+ blk_queue_split(q, &bio, q->bio_split);
+
bytes_done = 0;
dev_info = bio->bi_bdev->bd_disk->private_data;
if (dev_info == NULL)
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e939..1305ed3 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
unsigned long page_addr;
unsigned long bytes;

+ blk_queue_split(q, &bio, q->bio_split);
+
if ((bio->bi_iter.bi_sector & 7) != 0 ||
(bio->bi_iter.bi_size & 4095) != 0)
/* Request is not page-aligned. */
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index 413a840..a8645a9 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -340,6 +340,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
int rw = bio_rw(old_bio);
int inactive;

+ blk_queue_split(q, &old_bio, q->bio_split);
+
if (!lo)
goto err;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..93b81a2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -488,6 +488,7 @@ struct request_queue {

struct blk_mq_tag_set *tag_set;
struct list_head tag_set_list;
+ struct bio_set *bio_split;
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
@@ -812,6 +813,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+ struct bio_set *);
extern void blk_recount_segments(struct request_queue *, struct bio *);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
--
1.9.1

2015-05-22 18:19:34

by Ming Lin

Subject: [PATCH v4 02/11] block: simplify bio_add_page()

From: Kent Overstreet <[email protected]>

Since generic_make_request() can now handle arbitrarily sized bios, all we
have to do is make sure the bvec array doesn't overflow.
__bio_add_page() no longer needs to call ->merge_bvec_fn(), so we can
get rid of the unnecessary code paths.

Removing the call to ->merge_bvec_fn() is also fine, as no driver that
implements support for BLOCK_PC commands even has a ->merge_bvec_fn()
method.

Cc: Christoph Hellwig <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
[dpark: rebase and resolve merge conflicts, change a couple of comments,
make bio_add_page() warn once upon a cloned bio.]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
block/bio.c | 135 +++++++++++++++++++++++++-----------------------------------
1 file changed, 55 insertions(+), 80 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f66a4ea..ae31cdb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -699,9 +699,23 @@ int bio_get_nr_vecs(struct block_device *bdev)
}
EXPORT_SYMBOL(bio_get_nr_vecs);

-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+/**
+ * bio_add_pc_page - attempt to add page to bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by REQ_PC bios.
+ */
+int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
+ *page, unsigned int len, unsigned int offset)
{
int retried_segments = 0;
struct bio_vec *bvec;
@@ -712,7 +726,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (unlikely(bio_flagged(bio, BIO_CLONED)))
return 0;

- if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
+ if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
return 0;

/*
@@ -725,28 +739,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page

if (page == prev->bv_page &&
offset == prev->bv_offset + prev->bv_len) {
- unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
-
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- /* prev_bvec is already charged in
- bi_size, discharge it in order to
- simulate merging updated prev_bvec
- as new bvec. */
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = bio->bi_iter.bi_size -
- prev_bv_len,
- .bi_rw = bio->bi_rw,
- };
-
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
- prev->bv_len -= len;
- return 0;
- }
- }
-
bio->bi_iter.bi_size += len;
goto done;
}
@@ -789,27 +782,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
blk_recount_segments(q, bio);
}

- /*
- * if queue has other restrictions (eg varying max sector size
- * depending on offset), it can specify a merge_bvec_fn in the
- * queue to get further control
- */
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = bio->bi_iter.bi_size - len,
- .bi_rw = bio->bi_rw,
- };
-
- /*
- * merge_bvec_fn() returns number of bytes it can accept
- * at this offset
- */
- if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len)
- goto failed;
- }
-
/* If we may be able to merge these biovecs, force a recount */
if (bio->bi_vcnt > 1 && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
bio->bi_flags &= ~(1 << BIO_SEG_VALID);
@@ -826,28 +798,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
blk_recount_segments(q, bio);
return 0;
}
-
-/**
- * bio_add_pc_page - attempt to add page to bio
- * @q: the target queue
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
- *
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
- *
- * This should only be used by REQ_PC bios.
- */
-int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
- unsigned int len, unsigned int offset)
-{
- return __bio_add_page(q, bio, page, len, offset,
- queue_max_hw_sectors(q));
-}
EXPORT_SYMBOL(bio_add_pc_page);

/**
@@ -857,22 +807,47 @@ EXPORT_SYMBOL(bio_add_pc_page);
* @len: vec entry length
* @offset: vec entry offset
*
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
+ * Attempt to add a page to the bio_vec maplist. This will only fail
+ * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
*/
-int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
- unsigned int offset)
+int bio_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- unsigned int max_sectors;
+ struct bio_vec *bv;

- max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
- if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
- max_sectors = len >> 9;
+ /*
+ * cloned bio must not modify vec list
+ */
+ if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
+ return 0;

- return __bio_add_page(q, bio, page, len, offset, max_sectors);
+ /*
+ * For filesystems with a blocksize smaller than the pagesize
+ * we will often be called with the same page as last time and
+ * a consecutive offset. Optimize this special case.
+ */
+ if (bio->bi_vcnt > 0) {
+ bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ if (page == bv->bv_page &&
+ offset == bv->bv_offset + bv->bv_len) {
+ bv->bv_len += len;
+ goto done;
+ }
+ }
+
+ if (bio->bi_vcnt >= bio->bi_max_vecs)
+ return 0;
+
+ bv = &bio->bi_io_vec[bio->bi_vcnt];
+ bv->bv_page = page;
+ bv->bv_len = len;
+ bv->bv_offset = offset;
+
+ bio->bi_vcnt++;
+done:
+ bio->bi_iter.bi_size += len;
+ return len;
}
EXPORT_SYMBOL(bio_add_page);

--
1.9.1

2015-05-22 18:23:18

by Ming Lin

Subject: [PATCH v4 03/11] bcache: remove driver private bio splitting code

From: Kent Overstreet <[email protected]>

The bcache driver has always accepted arbitrarily large bios and split
them internally. Now that every driver must accept arbitrarily large
bios this code isn't necessary anymore.

Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
drivers/md/bcache/bcache.h | 18 --------
drivers/md/bcache/io.c | 100 +-----------------------------------------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +++----
drivers/md/bcache/super.c | 32 +-------------
drivers/md/bcache/util.h | 5 ++-
drivers/md/bcache/writeback.c | 4 +-
7 files changed, 18 insertions(+), 161 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 04f7bc2..6b420a5 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@ struct keybuf {
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};

-struct bio_split_pool {
- struct bio_set *bio_split;
- mempool_t *bio_split_hook;
-};
-
-struct bio_split_hook {
- struct closure cl;
- struct bio_split_pool *p;
- struct bio *bio;
- bio_end_io_t *bi_end_io;
- void *bi_private;
-};
-
struct bcache_device {
struct closure cl;

@@ -288,8 +275,6 @@ struct bcache_device {
int (*cache_miss)(struct btree *, struct search *,
struct bio *, unsigned);
int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
- struct bio_split_pool bio_split_hook;
};

struct io {
@@ -454,8 +439,6 @@ struct cache {
atomic_long_t meta_sectors_written;
atomic_long_t btree_sectors_written;
atomic_long_t sectors_written;
-
- struct bio_split_pool bio_split_hook;
};

struct gc_stat {
@@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
void bch_bbio_free(struct bio *, struct cache_set *);
struct bio *bch_bbio_alloc(struct cache_set *);

-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
void __bch_submit_bbio(struct bio *, struct cache_set *);
void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);

diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..86a0bb8 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,104 +11,6 @@

#include <linux/blkdev.h>

-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- struct bio_vec bv;
- struct bvec_iter iter;
- unsigned ret = 0, seg = 0;
-
- if (bio->bi_rw & REQ_DISCARD)
- return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
- bio_for_each_segment(bv, bio, iter) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = ret << 9,
- .bi_rw = bio->bi_rw,
- };
-
- if (seg == min_t(unsigned, BIO_MAX_PAGES,
- queue_max_segments(q)))
- break;
-
- if (q->merge_bvec_fn &&
- q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
- break;
-
- seg++;
- ret += bv.bv_len >> 9;
- }
-
- ret = min(ret, queue_max_sectors(q));
-
- WARN_ON(!ret);
- ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
- return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- s->bio->bi_end_io = s->bi_end_io;
- s->bio->bi_private = s->bi_private;
- bio_endio_nodec(s->bio, 0);
-
- closure_debug_destroy(&s->cl);
- mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
- struct closure *cl = bio->bi_private;
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- if (error)
- clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
- bio_put(bio);
- closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
- struct bio_split_hook *s;
- struct bio *n;
-
- if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
- goto submit;
-
- if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
- goto submit;
-
- s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
- closure_init(&s->cl, NULL);
-
- s->bio = bio;
- s->p = p;
- s->bi_end_io = bio->bi_end_io;
- s->bi_private = bio->bi_private;
- bio_get(bio);
-
- do {
- n = bio_next_split(bio, bch_bio_max_sectors(bio),
- GFP_NOIO, s->p->bio_split);
-
- n->bi_end_io = bch_bio_submit_split_endio;
- n->bi_private = &s->cl;
-
- closure_get(&s->cl);
- generic_make_request(n);
- } while (n != bio);
-
- continue_at(&s->cl, bch_bio_submit_split_done, NULL);
-submit:
- generic_make_request(bio);
-}
-
/* Bios with headers */

void bch_bbio_free(struct bio *bio, struct cache_set *c)
@@ -138,7 +40,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
bio->bi_bdev = PTR_CACHE(c, &b->key, 0)->bdev;

b->submit_time_us = local_clock_us();
- closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
+ closure_bio_submit(bio, bio->bi_private);
}

void bch_submit_bbio(struct bio *bio, struct cache_set *c,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index fe080ad..af47e6c 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -61,7 +61,7 @@ reread: left = ca->sb.bucket_size - offset;
bio->bi_private = &cl;
bch_bio_map(bio, data);

- closure_bio_submit(bio, &cl, ca);
+ closure_bio_submit(bio, &cl);
closure_sync(&cl);

/* This function could be simpler now since we no longer write
@@ -646,7 +646,7 @@ static void journal_write_unlocked(struct closure *cl)
spin_unlock(&c->journal.lock);

while ((bio = bio_list_pop(&list)))
- closure_bio_submit(bio, cl, c->cache[0]);
+ closure_bio_submit(bio, cl);

continue_at(cl, journal_write_done, NULL);
}
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab43fad..89500e0 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -710,7 +710,7 @@ static void cached_dev_read_error(struct closure *cl)

/* XXX: invalidate cache */

- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}

continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -833,7 +833,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
s->cache_miss = miss;
s->iop.bio = cache_bio;
bio_get(cache_bio);
- closure_bio_submit(cache_bio, &s->cl, s->d);
+ closure_bio_submit(cache_bio, &s->cl);

return ret;
out_put:
@@ -841,7 +841,7 @@ out_put:
out_submit:
miss->bi_end_io = request_endio;
miss->bi_private = &s->cl;
- closure_bio_submit(miss, &s->cl, s->d);
+ closure_bio_submit(miss, &s->cl);
return ret;
}

@@ -906,7 +906,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)

if (!(bio->bi_rw & REQ_DISCARD) ||
blk_queue_discard(bdev_get_queue(dc->bdev)))
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
} else if (s->iop.writeback) {
bch_writeback_add(dc);
s->iop.bio = bio;
@@ -921,12 +921,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
flush->bi_end_io = request_endio;
flush->bi_private = cl;

- closure_bio_submit(flush, cl, s->d);
+ closure_bio_submit(flush, cl);
}
} else {
s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);

- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}

closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -942,7 +942,7 @@ static void cached_dev_nodata(struct closure *cl)
bch_journal_meta(s->iop.c, cl);

/* If it's a flush, we send the flush to the backing device too */
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);

continue_at(cl, cached_dev_bio_complete, NULL);
}
@@ -986,7 +986,7 @@ static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
!blk_queue_discard(bdev_get_queue(dc->bdev)))
bio_endio(bio, 0);
else
- bch_generic_make_request(bio, &d->bio_split_hook);
+ generic_make_request(bio);
}
}

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4dd2bb7..a542b58 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,29 +59,6 @@ struct workqueue_struct *bcache_wq;

#define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE)

-static void bio_split_pool_free(struct bio_split_pool *p)
-{
- if (p->bio_split_hook)
- mempool_destroy(p->bio_split_hook);
-
- if (p->bio_split)
- bioset_free(p->bio_split);
-}
-
-static int bio_split_pool_init(struct bio_split_pool *p)
-{
- p->bio_split = bioset_create(4, 0);
- if (!p->bio_split)
- return -ENOMEM;
-
- p->bio_split_hook = mempool_create_kmalloc_pool(4,
- sizeof(struct bio_split_hook));
- if (!p->bio_split_hook)
- return -ENOMEM;
-
- return 0;
-}
-
/* Superblock */

static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
@@ -537,7 +514,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, unsigned long rw)
bio->bi_private = ca;
bch_bio_map(bio, ca->disk_buckets);

- closure_bio_submit(bio, &ca->prio, ca);
+ closure_bio_submit(bio, &ca->prio);
closure_sync(cl);
}

@@ -757,7 +734,6 @@ static void bcache_device_free(struct bcache_device *d)
put_disk(d->disk);
}

- bio_split_pool_free(&d->bio_split_hook);
if (d->bio_split)
bioset_free(d->bio_split);
if (is_vmalloc_addr(d->full_dirty_stripes))
@@ -810,7 +786,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
return minor;

if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
- bio_split_pool_init(&d->bio_split_hook) ||
!(d->disk = alloc_disk(1))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1799,8 +1774,6 @@ void bch_cache_release(struct kobject *kobj)
ca->set->cache[ca->sb.nr_this_dev] = NULL;
}

- bio_split_pool_free(&ca->bio_split_hook);
-
free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
kfree(ca->prio_buckets);
vfree(ca->buckets);
@@ -1845,8 +1818,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
ca->sb.nbuckets)) ||
!(ca->prio_buckets = kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
2, GFP_KERNEL)) ||
- !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)) ||
- bio_split_pool_init(&ca->bio_split_hook))
+ !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)))
return -ENOMEM;

ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 98df757..e3dee05 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@

#include <linux/blkdev.h>
#include <linux/errno.h>
+#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/ratelimit.h>
@@ -576,10 +577,10 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
return bdev->bd_inode->i_size >> 9;
}

-#define closure_bio_submit(bio, cl, dev) \
+#define closure_bio_submit(bio, cl) \
do { \
closure_get(cl); \
- bch_generic_make_request(bio, &(dev)->bio_split_hook); \
+ generic_make_request(bio); \
} while (0)

uint64_t bch_crc64_update(uint64_t, const void *, size_t);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..ca38362 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -188,7 +188,7 @@ static void write_dirty(struct closure *cl)
io->bio.bi_bdev = io->dc->bdev;
io->bio.bi_end_io = dirty_endio;

- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);

continue_at(cl, write_dirty_finish, system_wq);
}
@@ -208,7 +208,7 @@ static void read_dirty_submit(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);

- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);

continue_at(cl, write_dirty, system_wq);
}
--
1.9.1

2015-05-22 18:19:38

by Ming Lin

Subject: [PATCH v4 04/11] btrfs: remove bio splitting and merge_bvec_fn() calls

From: Kent Overstreet <[email protected]>

Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn(). That is no longer
necessary, because generic_make_request() is now able to handle
arbitrarily sized bios. So clean up the now-unnecessary code paths.

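To illustrate the point (a userspace sketch, not kernel code; the function name is invented for this example): once the generic layer splits bios, a driver never sees a bio larger than the queue limit — an oversized submission simply arrives as several bios, and bio_add_page() no longer needs to check anything.

```c
#include <assert.h>

/* Hypothetical illustration: how many bios reach the driver when the
 * generic layer splits an oversized bio against queue_max_sectors.
 * This is just ceil(bio_sectors / max_sectors). */
unsigned needed_splits(unsigned bio_sectors, unsigned max_sectors)
{
	return (bio_sectors + max_sectors - 1) / max_sectors;
}
```

For example, a 10 MiB bio (20480 sectors) against a 1280-sector limit arrives as 16 bios.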
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
fs/btrfs/volumes.c | 72 ------------------------------------------------------
1 file changed, 72 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 96aebf3..c616aba 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5715,34 +5715,6 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
&device->work);
}

-static int bio_size_ok(struct block_device *bdev, struct bio *bio,
- sector_t sector)
-{
- struct bio_vec *prev;
- struct request_queue *q = bdev_get_queue(bdev);
- unsigned int max_sectors = queue_max_sectors(q);
- struct bvec_merge_data bvm = {
- .bi_bdev = bdev,
- .bi_sector = sector,
- .bi_rw = bio->bi_rw,
- };
-
- if (WARN_ON(bio->bi_vcnt == 0))
- return 1;
-
- prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
- if (bio_sectors(bio) > max_sectors)
- return 0;
-
- if (!q->merge_bvec_fn)
- return 1;
-
- bvm.bi_size = bio->bi_iter.bi_size - prev->bv_len;
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len)
- return 0;
- return 1;
-}
-
static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
struct bio *bio, u64 physical, int dev_nr,
int rw, int async)
@@ -5776,38 +5748,6 @@ static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
btrfsic_submit_bio(rw, bio);
}

-static int breakup_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
- struct bio *first_bio, struct btrfs_device *dev,
- int dev_nr, int rw, int async)
-{
- struct bio_vec *bvec = first_bio->bi_io_vec;
- struct bio *bio;
- int nr_vecs = bio_get_nr_vecs(dev->bdev);
- u64 physical = bbio->stripes[dev_nr].physical;
-
-again:
- bio = btrfs_bio_alloc(dev->bdev, physical >> 9, nr_vecs, GFP_NOFS);
- if (!bio)
- return -ENOMEM;
-
- while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
- if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset) < bvec->bv_len) {
- u64 len = bio->bi_iter.bi_size;
-
- atomic_inc(&bbio->stripes_pending);
- submit_stripe_bio(root, bbio, bio, physical, dev_nr,
- rw, async);
- physical += len;
- goto again;
- }
- bvec++;
- }
-
- submit_stripe_bio(root, bbio, bio, physical, dev_nr, rw, async);
- return 0;
-}
-
static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
{
atomic_inc(&bbio->error);
@@ -5882,18 +5822,6 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
continue;
}

- /*
- * Check and see if we're ok with this bio based on it's size
- * and offset with the given device.
- */
- if (!bio_size_ok(dev->bdev, first_bio,
- bbio->stripes[dev_nr].physical >> 9)) {
- ret = breakup_stripe_bio(root, bbio, first_bio, dev,
- dev_nr, rw, async_submit);
- BUG_ON(ret);
- continue;
- }
-
if (dev_nr < total_devs - 1) {
bio = btrfs_bio_clone(first_bio, GFP_NOFS);
BUG_ON(!bio); /* -ENOMEM */
--
1.9.1

2015-05-22 18:23:20

by Ming Lin

Subject: [PATCH v4 05/11] block: remove split code in blkdev_issue_discard

The split code in blkdev_issue_discard() can go away now
that any driver that cares does the split.

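For reference, the arithmetic being deleted is the granularity round-down that kept each split request aligned. A userspace sketch of that removed logic (illustrative only, mirroring the `max_discard_sectors -= max_discard_sectors % granularity` line in the old code):

```c
#include <assert.h>

/* Sketch of the removed alignment step: clamp the per-bio discard size
 * down to a multiple of the discard granularity, so that requests stay
 * aligned after a split. A zero granularity means one sector. */
unsigned aligned_max_discard(unsigned max_discard_sectors,
			     unsigned granularity)
{
	if (granularity < 1)
		granularity = 1;
	return max_discard_sectors - (max_discard_sectors % granularity);
}
```

With splitting pushed down to the driver level, this clamp (and the loop around it) is no longer blkdev_issue_discard()'s problem.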
Signed-off-by: Ming Lin <[email protected]>
---
block/blk-lib.c | 73 +++++++++++----------------------------------------------
1 file changed, 14 insertions(+), 59 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..3bf3c4a 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -43,34 +43,17 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
int type = REQ_WRITE | REQ_DISCARD;
- unsigned int max_discard_sectors, granularity;
- int alignment;
struct bio_batch bb;
struct bio *bio;
int ret = 0;
struct blk_plug plug;

- if (!q)
+ if (!q || !nr_sects)
return -ENXIO;

if (!blk_queue_discard(q))
return -EOPNOTSUPP;

- /* Zero-sector (unknown) and one-sector granularities are the same. */
- granularity = max(q->limits.discard_granularity >> 9, 1U);
- alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
- /*
- * Ensure that max_discard_sectors is of the proper
- * granularity, so that requests stay aligned after a split.
- */
- max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
- max_discard_sectors -= max_discard_sectors % granularity;
- if (unlikely(!max_discard_sectors)) {
- /* Avoid infinite loop below. Being cautious never hurts. */
- return -EOPNOTSUPP;
- }
-
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secdiscard(q))
return -EOPNOTSUPP;
@@ -82,52 +65,24 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
bb.wait = &wait;

blk_start_plug(&plug);
- while (nr_sects) {
- unsigned int req_sects;
- sector_t end_sect, tmp;

- bio = bio_alloc(gfp_mask, 1);
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
+ bio = bio_alloc(gfp_mask, 1);
+ if (!bio) {
+ ret = -ENOMEM;
+ goto out;
+ }

- req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
-
- /*
- * If splitting a request, and the next starting sector would be
- * misaligned, stop the discard at the previous aligned sector.
- */
- end_sect = sector + req_sects;
- tmp = end_sect;
- if (req_sects < nr_sects &&
- sector_div(tmp, granularity) != alignment) {
- end_sect = end_sect - alignment;
- sector_div(end_sect, granularity);
- end_sect = end_sect * granularity + alignment;
- req_sects = end_sect - sector;
- }
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_end_io = bio_batch_end_io;
+ bio->bi_bdev = bdev;
+ bio->bi_private = &bb;

- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = bio_batch_end_io;
- bio->bi_bdev = bdev;
- bio->bi_private = &bb;
+ bio->bi_iter.bi_size = nr_sects << 9;

- bio->bi_iter.bi_size = req_sects << 9;
- nr_sects -= req_sects;
- sector = end_sect;
+ atomic_inc(&bb.done);
+ submit_bio(type, bio);

- atomic_inc(&bb.done);
- submit_bio(type, bio);
-
- /*
- * We can loop for a long time in here, if someone does
- * full device discards (like mkfs). Be nice and allow
- * us to schedule out to avoid softlocking if preempt
- * is disabled.
- */
- cond_resched();
- }
+out:
blk_finish_plug(&plug);

/* Wait for bios in-flight */
--
1.9.1

2015-05-22 18:22:43

by Ming Lin

Subject: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

From: Kent Overstreet <[email protected]>

Remove bio_fits_rdev() completely, because ->merge_bvec_fn() is now
gone. There's no point in calling bio_fits_rdev() only to ensure an
aligned read from the rdev.

Cc: Neil Brown <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
drivers/md/raid5.c | 23 +----------------------
1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1ba97fd..b303ded 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4743,25 +4743,6 @@ static void raid5_align_endio(struct bio *bi, int error)
add_bio_to_retry(raid_bi, conf);
}

-static int bio_fits_rdev(struct bio *bi)
-{
- struct request_queue *q = bdev_get_queue(bi->bi_bdev);
-
- if (bio_sectors(bi) > queue_max_sectors(q))
- return 0;
- blk_recount_segments(q, bi);
- if (bi->bi_phys_segments > queue_max_segments(q))
- return 0;
-
- if (q->merge_bvec_fn)
- /* it's too hard to apply the merge_bvec_fn at this stage,
- * just just give up
- */
- return 0;
-
- return 1;
-}
-
static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
{
struct r5conf *conf = mddev->private;
@@ -4815,11 +4796,9 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
align_bi->bi_bdev = rdev->bdev;
__clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);

- if (!bio_fits_rdev(align_bi) ||
- is_badblock(rdev, align_bi->bi_iter.bi_sector,
+ if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
bio_sectors(align_bi),
&first_bad, &bad_sectors)) {
- /* too big in some way, or has a known bad block */
bio_put(align_bi);
rdev_dec_pending(rdev, mddev);
return 0;
--
1.9.1

2015-05-22 18:22:13

by Ming Lin

Subject: [PATCH v4 07/11] md/raid5: split bio for chunk_aligned_read

If a read request fits entirely within a chunk, it will be passed directly to the
underlying device (provided it hasn't failed, of course). If it doesn't fit,
the slightly less efficient path that uses the stripe_cache is taken.
Requests that reach the stripe cache are always completely split up as
necessary.

So with RAID5, ripping out the merge_bvec_fn doesn't stop it from working,
but could cause it to take the less efficient path more often.

All that is needed to manage this is for chunk_aligned_read() to do some bio
splitting, much like the RAID0 code does.

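The boundary arithmetic used for this split can be sketched in isolation (userspace illustration, not the patch itself; the function name is invented). Since chunk_sectors is a power of two, the distance from a sector to the end of its chunk is a mask-and-subtract:

```c
#include <assert.h>

/* How many sectors remain from 'sector' to the end of its chunk,
 * assuming chunk_sects is a power of two (as in raid5). A bio longer
 * than this must be split at the chunk boundary. */
unsigned sectors_to_chunk_end(unsigned long long sector,
			      unsigned chunk_sects)
{
	return chunk_sects - (unsigned)(sector & (chunk_sects - 1));
}
```

If bio_sectors(raid_bio) exceeds this value, the bio is split at the boundary and the remainder is resubmitted, exactly the do/while loop in the new chunk_aligned_read() below.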
Cc: Neil Brown <[email protected]>
Cc: [email protected]
Acked-by: NeilBrown <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
drivers/md/raid5.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b303ded..b6c6ace 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4743,7 +4743,7 @@ static void raid5_align_endio(struct bio *bi, int error)
add_bio_to_retry(raid_bi, conf);
}

-static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
+static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
{
struct r5conf *conf = mddev->private;
int dd_idx;
@@ -4752,7 +4752,7 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
sector_t end_sector;

if (!in_chunk_boundary(mddev, raid_bio)) {
- pr_debug("chunk_aligned_read : non aligned\n");
+ pr_debug("%s: non aligned\n", __func__);
return 0;
}
/*
@@ -4827,6 +4827,31 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
}
}

+static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
+{
+ struct bio *split;
+
+ do {
+ sector_t sector = raid_bio->bi_iter.bi_sector;
+ unsigned chunk_sects = mddev->chunk_sectors;
+ unsigned sectors = chunk_sects - (sector & (chunk_sects-1));
+
+ if (sectors < bio_sectors(raid_bio)) {
+ split = bio_split(raid_bio, sectors, GFP_NOIO, fs_bio_set);
+ bio_chain(split, raid_bio);
+ } else
+ split = raid_bio;
+
+ if (!raid5_read_one_chunk(mddev, split)) {
+ if (split != raid_bio)
+ generic_make_request(raid_bio);
+ return split;
+ }
+ } while (split != raid_bio);
+
+ return NULL;
+}
+
/* __get_priority_stripe - get the next stripe to process
*
* Full stripe writes are allowed to pass preread active stripes up until
@@ -5104,9 +5129,11 @@ static void make_request(struct mddev *mddev, struct bio * bi)
* data on failed drives.
*/
if (rw == READ && mddev->degraded == 0 &&
- mddev->reshape_position == MaxSector &&
- chunk_aligned_read(mddev,bi))
- return;
+ mddev->reshape_position == MaxSector) {
+ bi = chunk_aligned_read(mddev, bi);
+ if (!bi)
+ return;
+ }

if (unlikely(bi->bi_rw & REQ_DISCARD)) {
make_discard_request(mddev, bi);
--
1.9.1

2015-05-22 18:20:51

by Ming Lin

Subject: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

From: Kent Overstreet <[email protected]>

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

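What replaces the callback is plain accumulation: a userspace sketch (illustrative, not the kernel code) of the simplified loop in blk_bio_segment_split(), which now just sums per-segment sector counts and splits once queue_max_sectors is exceeded — no per-bvec driver callback involved:

```c
#include <assert.h>

/* Return the index of the first segment that would push the running
 * sector total past max_sectors (i.e. where to split), or nsegs if the
 * whole bio fits. Mirrors the sectors-accumulation check that replaces
 * merge_bvec_fn in blk_bio_segment_split(). */
unsigned split_point(const unsigned *seg_sectors, unsigned nsegs,
		     unsigned max_sectors)
{
	unsigned i, sectors = 0;

	for (i = 0; i < nsegs; i++) {
		sectors += seg_sectors[i];
		if (sectors > max_sectors)
			return i;	/* split before segment i */
	}
	return nsegs;			/* no split needed */
}
```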
Cc: Jens Axboe <[email protected]>
Cc: Lars Ellenberg <[email protected]>
Cc: [email protected]
Cc: Jiri Kosina <[email protected]>
Cc: Yehuda Sadeh <[email protected]>
Cc: Sage Weil <[email protected]>
Cc: Alex Elder <[email protected]>
Cc: [email protected]
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: [email protected]
Cc: Neil Brown <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
block/blk-merge.c | 17 +-----
block/blk-settings.c | 22 --------
drivers/block/drbd/drbd_int.h | 1 -
drivers/block/drbd/drbd_main.c | 1 -
drivers/block/drbd/drbd_req.c | 35 ------------
drivers/block/pktcdvd.c | 21 -------
drivers/block/rbd.c | 47 ----------------
drivers/md/dm-cache-target.c | 21 -------
drivers/md/dm-crypt.c | 16 ------
drivers/md/dm-era-target.c | 15 -----
drivers/md/dm-flakey.c | 16 ------
drivers/md/dm-linear.c | 16 ------
drivers/md/dm-log-writes.c | 16 ------
drivers/md/dm-snap.c | 15 -----
drivers/md/dm-stripe.c | 21 -------
drivers/md/dm-table.c | 8 ---
drivers/md/dm-thin.c | 31 -----------
drivers/md/dm-verity.c | 16 ------
drivers/md/dm.c | 120 +---------------------------------------
drivers/md/dm.h | 2 -
drivers/md/linear.c | 43 ---------------
drivers/md/md.c | 26 ---------
drivers/md/md.h | 12 ----
drivers/md/multipath.c | 21 -------
drivers/md/raid0.c | 56 -------------------
drivers/md/raid0.h | 2 -
drivers/md/raid1.c | 58 +-------------------
drivers/md/raid10.c | 121 +----------------------------------------
drivers/md/raid5.c | 32 -----------
include/linux/blkdev.h | 10 ----
include/linux/device-mapper.h | 4 --
31 files changed, 9 insertions(+), 833 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index dc14255..25cafb8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -69,24 +69,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *split;
struct bio_vec bv, bvprv;
struct bvec_iter iter;
- unsigned seg_size = 0, nsegs = 0;
+ unsigned seg_size = 0, nsegs = 0, sectors = 0;
int prev = 0;

- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = 0,
- .bi_rw = bio->bi_rw,
- };
-
bio_for_each_segment(bv, bio, iter) {
- if (q->merge_bvec_fn &&
- q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
- goto split;
-
- bvm.bi_size += bv.bv_len;
+ sectors += bv.bv_len >> 9;

- if (bvm.bi_size >> 9 > queue_max_sectors(q))
+ if (sectors > queue_max_sectors(q))
goto split;

/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 12600bf..e90d477 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -53,28 +53,6 @@ void blk_queue_unprep_rq(struct request_queue *q, unprep_rq_fn *ufn)
}
EXPORT_SYMBOL(blk_queue_unprep_rq);

-/**
- * blk_queue_merge_bvec - set a merge_bvec function for queue
- * @q: queue
- * @mbfn: merge_bvec_fn
- *
- * Usually queues have static limitations on the max sectors or segments that
- * we can put in a request. Stacking drivers may have some settings that
- * are dynamic, and thus we have to query the queue whether it is ok to
- * add a new bio_vec to a bio at a given offset or not. If the block device
- * has such limitations, it needs to register a merge_bvec_fn to control
- * the size of bio's sent to it. Note that a block device *must* allow a
- * single page to be added to an empty bio. The block device driver may want
- * to use the bio_split() function to deal with these bio's. By default
- * no merge_bvec_fn is defined for a queue, and only the fixed limits are
- * honored.
- */
-void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
-{
- q->merge_bvec_fn = mbfn;
-}
-EXPORT_SYMBOL(blk_queue_merge_bvec);
-
void blk_queue_softirq_done(struct request_queue *q, softirq_done_fn *fn)
{
q->softirq_done_fn = fn;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b905e98..63ce2b0 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1449,7 +1449,6 @@ extern void do_submit(struct work_struct *ws);
extern void __drbd_make_request(struct drbd_device *, struct bio *, unsigned long);
extern void drbd_make_request(struct request_queue *q, struct bio *bio);
extern int drbd_read_remote(struct drbd_device *device, struct drbd_request *req);
-extern int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec);
extern int is_valid_ar_handle(struct drbd_request *, sector_t);


diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 81fde9e..771e68c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2774,7 +2774,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
- blk_queue_merge_bvec(q, drbd_merge_bvec);
q->queue_lock = &resource->req_lock;

device->md_io.page = alloc_page(GFP_KERNEL);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index a6265bc..7523f00 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1510,41 +1510,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
__drbd_make_request(device, bio, start_jif);
}

-/* This is called by bio_add_page().
- *
- * q->max_hw_sectors and other global limits are already enforced there.
- *
- * We need to call down to our lower level device,
- * in case it has special restrictions.
- *
- * We also may need to enforce configured max-bio-bvecs limits.
- *
- * As long as the BIO is empty we have to allow at least one bvec,
- * regardless of size and offset, so no need to ask lower levels.
- */
-int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
-{
- struct drbd_device *device = (struct drbd_device *) q->queuedata;
- unsigned int bio_size = bvm->bi_size;
- int limit = DRBD_MAX_BIO_SIZE;
- int backing_limit;
-
- if (bio_size && get_ldev(device)) {
- unsigned int max_hw_sectors = queue_max_hw_sectors(q);
- struct request_queue * const b =
- device->ldev->backing_bdev->bd_disk->queue;
- if (b->merge_bvec_fn) {
- bvm->bi_bdev = device->ldev->backing_bdev;
- backing_limit = b->merge_bvec_fn(b, bvm, bvec);
- limit = min(limit, backing_limit);
- }
- put_ldev(device);
- if ((limit >> 9) > max_hw_sectors)
- limit = max_hw_sectors << 9;
- }
- return limit;
-}
-
void request_timer_fn(unsigned long data)
{
struct drbd_device *device = (struct drbd_device *) data;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index ea10bd9..85eac23 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2505,26 +2505,6 @@ end_io:



-static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
- struct bio_vec *bvec)
-{
- struct pktcdvd_device *pd = q->queuedata;
- sector_t zone = get_zone(bmd->bi_sector, pd);
- int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
- int remaining = (pd->settings.size << 9) - used;
- int remaining2;
-
- /*
- * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
- * boundary, pkt_make_request() will split the bio.
- */
- remaining2 = PAGE_SIZE - bmd->bi_size;
- remaining = max(remaining, remaining2);
-
- BUG_ON(remaining < 0);
- return remaining;
-}
-
static void pkt_init_queue(struct pktcdvd_device *pd)
{
struct request_queue *q = pd->disk->queue;
@@ -2532,7 +2512,6 @@ static void pkt_init_queue(struct pktcdvd_device *pd)
blk_queue_make_request(q, pkt_make_request);
blk_queue_logical_block_size(q, CD_FRAMESIZE);
blk_queue_max_hw_sectors(q, PACKET_MAX_SECTORS);
- blk_queue_merge_bvec(q, pkt_merge_bvec);
q->queuedata = pd;
}

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ec6c5c6..f50edb3 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
return BLK_MQ_RQ_QUEUE_OK;
}

-/*
- * a queue callback. Makes sure that we don't create a bio that spans across
- * multiple osd objects. One exception would be with a single page bios,
- * which we handle later at bio_chain_clone_range()
- */
-static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
- struct bio_vec *bvec)
-{
- struct rbd_device *rbd_dev = q->queuedata;
- sector_t sector_offset;
- sector_t sectors_per_obj;
- sector_t obj_sector_offset;
- int ret;
-
- /*
- * Find how far into its rbd object the partition-relative
- * bio start sector is to offset relative to the enclosing
- * device.
- */
- sector_offset = get_start_sect(bmd->bi_bdev) + bmd->bi_sector;
- sectors_per_obj = 1 << (rbd_dev->header.obj_order - SECTOR_SHIFT);
- obj_sector_offset = sector_offset & (sectors_per_obj - 1);
-
- /*
- * Compute the number of bytes from that offset to the end
- * of the object. Account for what's already used by the bio.
- */
- ret = (int) (sectors_per_obj - obj_sector_offset) << SECTOR_SHIFT;
- if (ret > bmd->bi_size)
- ret -= bmd->bi_size;
- else
- ret = 0;
-
- /*
- * Don't send back more than was asked for. And if the bio
- * was empty, let the whole thing through because: "Note
- * that a block device *must* allow a single page to be
- * added to an empty bio."
- */
- rbd_assert(bvec->bv_len <= PAGE_SIZE);
- if (ret > (int) bvec->bv_len || !bmd->bi_size)
- ret = (int) bvec->bv_len;
-
- return ret;
-}
-
static void rbd_free_disk(struct rbd_device *rbd_dev)
{
struct gendisk *disk = rbd_dev->disk;
@@ -3784,7 +3738,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
q->limits.max_discard_sectors = segment_size / SECTOR_SIZE;
q->limits.discard_zeroes_data = 1;

- blk_queue_merge_bvec(q, rbd_merge_bvec);
disk->queue = q;

q->queuedata = rbd_dev;
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 7755af3..2e47e35 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -3289,26 +3289,6 @@ static int cache_iterate_devices(struct dm_target *ti,
return r;
}

-/*
- * We assume I/O is going to the origin (which is the volume
- * more likely to have restrictions e.g. by being striped).
- * (Looking up the exact location of the data would be expensive
- * and could always be out of date by the time the bio is submitted.)
- */
-static int cache_bvec_merge(struct dm_target *ti,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct cache *cache = ti->private;
- struct request_queue *q = bdev_get_queue(cache->origin_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = cache->origin_dev->bdev;
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
{
/*
@@ -3352,7 +3332,6 @@ static struct target_type cache_target = {
.status = cache_status,
.message = cache_message,
.iterate_devices = cache_iterate_devices,
- .merge = cache_bvec_merge,
.io_hints = cache_io_hints,
};

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 5503e43..d13f330 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2017,21 +2017,6 @@ error:
return -EINVAL;
}

-static int crypt_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct crypt_config *cc = ti->private;
- struct request_queue *q = bdev_get_queue(cc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = cc->dev->bdev;
- bvm->bi_sector = cc->start + dm_target_offset(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int crypt_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -2052,7 +2037,6 @@ static struct target_type crypt_target = {
.preresume = crypt_preresume,
.resume = crypt_resume,
.message = crypt_message,
- .merge = crypt_merge,
.iterate_devices = crypt_iterate_devices,
};

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index ad913cd..0119ebf 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1673,20 +1673,6 @@ static int era_iterate_devices(struct dm_target *ti,
return fn(ti, era->origin_dev, 0, get_dev_size(era->origin_dev), data);
}

-static int era_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct era *era = ti->private;
- struct request_queue *q = bdev_get_queue(era->origin_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = era->origin_dev->bdev;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static void era_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct era *era = ti->private;
@@ -1717,7 +1703,6 @@ static struct target_type era_target = {
.status = era_status,
.message = era_message,
.iterate_devices = era_iterate_devices,
- .merge = era_merge,
.io_hints = era_io_hints
};

diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index b257e46..d955b3e 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -387,21 +387,6 @@ static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long ar
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

-static int flakey_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct flakey_c *fc = ti->private;
- struct request_queue *q = bdev_get_queue(fc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = fc->dev->bdev;
- bvm->bi_sector = flakey_map_sector(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
{
struct flakey_c *fc = ti->private;
@@ -419,7 +404,6 @@ static struct target_type flakey_target = {
.end_io = flakey_end_io,
.status = flakey_status,
.ioctl = flakey_ioctl,
- .merge = flakey_merge,
.iterate_devices = flakey_iterate_devices,
};

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 53e848c..7dd5fc8 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -130,21 +130,6 @@ static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

-static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct linear_c *lc = ti->private;
- struct request_queue *q = bdev_get_queue(lc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = lc->dev->bdev;
- bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int linear_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -162,7 +147,6 @@ static struct target_type linear_target = {
.map = linear_map,
.status = linear_status,
.ioctl = linear_ioctl,
- .merge = linear_merge,
.iterate_devices = linear_iterate_devices,
};

diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 93e0844..4325808 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -728,21 +728,6 @@ static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd,
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

-static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct log_writes_c *lc = ti->private;
- struct request_queue *q = bdev_get_queue(lc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = lc->dev->bdev;
- bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int log_writes_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn,
void *data)
@@ -796,7 +781,6 @@ static struct target_type log_writes_target = {
.end_io = normal_end_io,
.status = log_writes_status,
.ioctl = log_writes_ioctl,
- .merge = log_writes_merge,
.message = log_writes_message,
.iterate_devices = log_writes_iterate_devices,
.io_hints = log_writes_io_hints,
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index f83a0f3..274cbec 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2331,20 +2331,6 @@ static void origin_status(struct dm_target *ti, status_type_t type,
}
}

-static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct dm_origin *o = ti->private;
- struct request_queue *q = bdev_get_queue(o->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = o->dev->bdev;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int origin_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -2363,7 +2349,6 @@ static struct target_type origin_target = {
.resume = origin_resume,
.postsuspend = origin_postsuspend,
.status = origin_status,
- .merge = origin_merge,
.iterate_devices = origin_iterate_devices,
};

diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index f8b37d4..09bb2fe 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -412,26 +412,6 @@ static void stripe_io_hints(struct dm_target *ti,
blk_limits_io_opt(limits, chunk_size * sc->stripes);
}

-static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct stripe_c *sc = ti->private;
- sector_t bvm_sector = bvm->bi_sector;
- uint32_t stripe;
- struct request_queue *q;
-
- stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector);
-
- q = bdev_get_queue(sc->stripe[stripe].dev->bdev);
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = sc->stripe[stripe].dev->bdev;
- bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static struct target_type stripe_target = {
.name = "striped",
.version = {1, 5, 1},
@@ -443,7 +423,6 @@ static struct target_type stripe_target = {
.status = stripe_status,
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
- .merge = stripe_merge,
};

int __init dm_stripe_init(void)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d9b00b8..19c9b01 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -440,14 +440,6 @@ static int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
q->limits.alignment_offset,
(unsigned long long) start << SECTOR_SHIFT);

- /*
- * Check if merge fn is supported.
- * If not we'll force DM to use PAGE_SIZE or
- * smaller I/O, just to be safe.
- */
- if (dm_queue_merge_is_compulsory(q) && !ti->type->merge)
- blk_limits_max_hw_sectors(limits,
- (unsigned int) (PAGE_SIZE >> 9));
return 0;
}

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 921aafd..03552fe 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3562,20 +3562,6 @@ static int pool_iterate_devices(struct dm_target *ti,
return fn(ti, pt->data_dev, 0, ti->len, data);
}

-static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct pool_c *pt = ti->private;
- struct request_queue *q = bdev_get_queue(pt->data_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = pt->data_dev->bdev;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits)
{
struct pool *pool = pt->pool;
@@ -3667,7 +3653,6 @@ static struct target_type pool_target = {
.resume = pool_resume,
.message = pool_message,
.status = pool_status,
- .merge = pool_merge,
.iterate_devices = pool_iterate_devices,
.io_hints = pool_io_hints,
};
@@ -3992,21 +3977,6 @@ err:
DMEMIT("Error");
}

-static int thin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct thin_c *tc = ti->private;
- struct request_queue *q = bdev_get_queue(tc->pool_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = tc->pool_dev->bdev;
- bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int thin_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -4041,7 +4011,6 @@ static struct target_type thin_target = {
.presuspend = thin_presuspend,
.postsuspend = thin_postsuspend,
.status = thin_status,
- .merge = thin_merge,
.iterate_devices = thin_iterate_devices,
};

diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 66616db..3b85460 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -648,21 +648,6 @@ static int verity_ioctl(struct dm_target *ti, unsigned cmd,
cmd, arg);
}

-static int verity_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct dm_verity *v = ti->private;
- struct request_queue *q = bdev_get_queue(v->data_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = v->data_dev->bdev;
- bvm->bi_sector = verity_map_sector(v, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int verity_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -995,7 +980,6 @@ static struct target_type verity_target = {
.map = verity_map,
.status = verity_status,
.ioctl = verity_ioctl,
- .merge = verity_merge,
.iterate_devices = verity_iterate_devices,
.io_hints = verity_io_hints,
};
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 34f6063..f732a7a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -121,9 +121,8 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_MERGE_IS_OPTIONAL 6
-#define DMF_DEFERRED_REMOVE 7
-#define DMF_SUSPENDED_INTERNALLY 8
+#define DMF_DEFERRED_REMOVE 6
+#define DMF_SUSPENDED_INTERNALLY 7

/*
* A dummy definition to make RCU happy.
@@ -1717,60 +1716,6 @@ static void __split_and_process_bio(struct mapped_device *md,
* CRUD END
*---------------------------------------------------------------*/

-static int dm_merge_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mapped_device *md = q->queuedata;
- struct dm_table *map = dm_get_live_table_fast(md);
- struct dm_target *ti;
- sector_t max_sectors;
- int max_size = 0;
-
- if (unlikely(!map))
- goto out;
-
- ti = dm_table_find_target(map, bvm->bi_sector);
- if (!dm_target_is_valid(ti))
- goto out;
-
- /*
- * Find maximum amount of I/O that won't need splitting
- */
- max_sectors = min(max_io_len(bvm->bi_sector, ti),
- (sector_t) queue_max_sectors(q));
- max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
- if (unlikely(max_size < 0)) /* this shouldn't _ever_ happen */
- max_size = 0;
-
- /*
- * merge_bvec_fn() returns number of bytes
- * it can accept at this offset
- * max is precomputed maximal io size
- */
- if (max_size && ti->type->merge)
- max_size = ti->type->merge(ti, bvm, biovec, max_size);
- /*
- * If the target doesn't support merge method and some of the devices
- * provided their merge_bvec method (we know this by looking for the
- * max_hw_sectors that dm_set_device_limits may set), then we can't
- * allow bios with multiple vector entries. So always set max_size
- * to 0, and the code below allows just one page.
- */
- else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
- max_size = 0;
-
-out:
- dm_put_live_table_fast(md);
- /*
- * Always allow an entire first page
- */
- if (max_size <= biovec->bv_len && !(bvm->bi_size >> SECTOR_SHIFT))
- max_size = biovec->bv_len;
-
- return max_size;
-}
-
/*
* The request function that just remaps the bio built up by
* dm_merge_bvec.
@@ -2477,59 +2422,6 @@ static void __set_size(struct mapped_device *md, sector_t size)
}

/*
- * Return 1 if the queue has a compulsory merge_bvec_fn function.
- *
- * If this function returns 0, then the device is either a non-dm
- * device without a merge_bvec_fn, or it is a dm device that is
- * able to split any bios it receives that are too big.
- */
-int dm_queue_merge_is_compulsory(struct request_queue *q)
-{
- struct mapped_device *dev_md;
-
- if (!q->merge_bvec_fn)
- return 0;
-
- if (q->make_request_fn == dm_make_request) {
- dev_md = q->queuedata;
- if (test_bit(DMF_MERGE_IS_OPTIONAL, &dev_md->flags))
- return 0;
- }
-
- return 1;
-}
-
-static int dm_device_merge_is_compulsory(struct dm_target *ti,
- struct dm_dev *dev, sector_t start,
- sector_t len, void *data)
-{
- struct block_device *bdev = dev->bdev;
- struct request_queue *q = bdev_get_queue(bdev);
-
- return dm_queue_merge_is_compulsory(q);
-}
-
-/*
- * Return 1 if it is acceptable to ignore merge_bvec_fn based
- * on the properties of the underlying devices.
- */
-static int dm_table_merge_is_optional(struct dm_table *table)
-{
- unsigned i = 0;
- struct dm_target *ti;
-
- while (i < dm_table_get_num_targets(table)) {
- ti = dm_table_get_target(table, i++);
-
- if (ti->type->iterate_devices &&
- ti->type->iterate_devices(ti, dm_device_merge_is_compulsory, NULL))
- return 0;
- }
-
- return 1;
-}
-
-/*
* Returns old map, which caller must destroy.
*/
static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
@@ -2538,7 +2430,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct dm_table *old_map;
struct request_queue *q = md->queue;
sector_t size;
- int merge_is_optional;

size = dm_table_get_size(t);

@@ -2564,17 +2455,11 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,

__bind_mempools(md, t);

- merge_is_optional = dm_table_merge_is_optional(t);
-
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, t);
md->immutable_target_type = dm_table_get_immutable_target_type(t);

dm_table_set_restrictions(t, q, limits);
- if (merge_is_optional)
- set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
- else
- clear_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
if (old_map)
dm_sync_table(md);

@@ -2852,7 +2737,6 @@ int dm_setup_md_queue(struct mapped_device *md)
case DM_TYPE_BIO_BASED:
dm_init_old_md_queue(md);
blk_queue_make_request(md->queue, dm_make_request);
- blk_queue_merge_bvec(md->queue, dm_merge_bvec);
break;
}

diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6123c2b..7d61cca 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -77,8 +77,6 @@ bool dm_table_mq_request_based(struct dm_table *t);
void dm_table_free_md_mempools(struct dm_table *t);
struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);

-int dm_queue_merge_is_compulsory(struct request_queue *q);
-
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
void dm_set_md_type(struct mapped_device *md, unsigned type);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index fa7d577..8721ef9 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -52,48 +52,6 @@ static inline struct dev_info *which_dev(struct mddev *mddev, sector_t sector)
return conf->disks + lo;
}

-/**
- * linear_mergeable_bvec -- tell bio layer if two requests can be merged
- * @q: request queue
- * @bvm: properties of new bio
- * @biovec: the request that could be merged to it.
- *
- * Return amount of bytes we can take at this offset
- */
-static int linear_mergeable_bvec(struct mddev *mddev,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct dev_info *dev0;
- unsigned long maxsectors, bio_sectors = bvm->bi_size >> 9;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int maxbytes = biovec->bv_len;
- struct request_queue *subq;
-
- dev0 = which_dev(mddev, sector);
- maxsectors = dev0->end_sector - sector;
- subq = bdev_get_queue(dev0->rdev->bdev);
- if (subq->merge_bvec_fn) {
- bvm->bi_bdev = dev0->rdev->bdev;
- bvm->bi_sector -= dev0->end_sector - dev0->rdev->sectors;
- maxbytes = min(maxbytes, subq->merge_bvec_fn(subq, bvm,
- biovec));
- }
-
- if (maxsectors < bio_sectors)
- maxsectors = 0;
- else
- maxsectors -= bio_sectors;
-
- if (maxsectors <= (PAGE_SIZE >> 9 ) && bio_sectors == 0)
- return maxbytes;
-
- if (maxsectors > (maxbytes >> 9))
- return maxbytes;
- else
- return maxsectors << 9;
-}
-
static int linear_congested(struct mddev *mddev, int bits)
{
struct linear_conf *conf;
@@ -338,7 +296,6 @@ static struct md_personality linear_personality =
.size = linear_size,
.quiesce = linear_quiesce,
.congested = linear_congested,
- .mergeable_bvec = linear_mergeable_bvec,
};

static int __init linear_init (void)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 046b3c9..f101981 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -352,29 +352,6 @@ static int md_congested(void *data, int bits)
return mddev_congested(mddev, bits);
}

-static int md_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mddev *mddev = q->queuedata;
- int ret;
- rcu_read_lock();
- if (mddev->suspended) {
- /* Must always allow one vec */
- if (bvm->bi_size == 0)
- ret = biovec->bv_len;
- else
- ret = 0;
- } else {
- struct md_personality *pers = mddev->pers;
- if (pers && pers->mergeable_bvec)
- ret = pers->mergeable_bvec(mddev, bvm, biovec);
- else
- ret = biovec->bv_len;
- }
- rcu_read_unlock();
- return ret;
-}
/*
* Generic flush handling for md
*/
@@ -5165,7 +5142,6 @@ int md_run(struct mddev *mddev)
if (mddev->queue) {
mddev->queue->backing_dev_info.congested_data = mddev;
mddev->queue->backing_dev_info.congested_fn = md_congested;
- blk_queue_merge_bvec(mddev->queue, md_mergeable_bvec);
}
if (pers->sync_request) {
if (mddev->kobj.sd &&
@@ -5293,7 +5269,6 @@ static void md_clean(struct mddev *mddev)
mddev->changed = 0;
mddev->degraded = 0;
mddev->safemode = 0;
- mddev->merge_check_needed = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
mddev->bitmap_info.default_space = 0;
@@ -5489,7 +5464,6 @@ static int do_md_stop(struct mddev *mddev, int mode,

__md_stop_writes(mddev);
__md_stop(mddev);
- mddev->queue->merge_bvec_fn = NULL;
mddev->queue->backing_dev_info.congested_fn = NULL;

/* tell userspace to handle 'inactive' */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4046a6c..cf7141a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -133,10 +133,6 @@ enum flag_bits {
Bitmap_sync, /* ..actually, not quite In_sync. Need a
* bitmap-based recovery to get fully in sync
*/
- Unmerged, /* device is being added to array and should
- * be considerred for bvec_merge_fn but not
- * yet for actual IO
- */
WriteMostly, /* Avoid reading if at all possible */
AutoDetected, /* added by auto-detect */
Blocked, /* An error occurred but has not yet
@@ -373,10 +369,6 @@ struct mddev {
int degraded; /* whether md should consider
* adding a spare
*/
- int merge_check_needed; /* at least one
- * member device
- * has a
- * merge_bvec_fn */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -531,10 +523,6 @@ struct md_personality
/* congested implements bdi.congested_fn().
* Will not be called while array is 'suspended' */
int (*congested)(struct mddev *mddev, int bits);
- /* mergeable_bvec is use to implement ->merge_bvec_fn */
- int (*mergeable_bvec)(struct mddev *mddev,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec);
};

struct md_sysfs_entry {
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index ac3ede2..7ee27fb 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -257,18 +257,6 @@ static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);

- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, so limit ->max_segments to one, lying
- * within a single page.
- * (Note: it is very unlikely that a device with
- * merge_bvec_fn will be involved in multipath.)
- */
- if (q->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
-
spin_lock_irq(&conf->device_lock);
mddev->degraded--;
rdev->raid_disk = path;
@@ -432,15 +420,6 @@ static int multipath_run (struct mddev *mddev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);

- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, not that we ever expect a device with
- * a merge_bvec_fn to be involved in multipath */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
-
if (!test_bit(Faulty, &rdev->flags))
working_disks++;
}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 6a68ef5..1440bd4 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -192,9 +192,6 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
disk_stack_limits(mddev->gendisk, rdev1->bdev,
rdev1->data_offset << 9);

- if (rdev1->bdev->bd_disk->queue->merge_bvec_fn)
- conf->has_merge_bvec = 1;
-
if (!smallest || (rdev1->sectors < smallest->sectors))
smallest = rdev1;
cnt++;
@@ -351,58 +348,6 @@ static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
+ sector_div(sector, zone->nb_dev)];
}

-/**
- * raid0_mergeable_bvec -- tell bio layer if two requests can be merged
- * @mddev: the md device
- * @bvm: properties of new bio
- * @biovec: the request that could be merged to it.
- *
- * Return amount of bytes we can accept at this offset
- */
-static int raid0_mergeable_bvec(struct mddev *mddev,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct r0conf *conf = mddev->private;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- sector_t sector_offset = sector;
- int max;
- unsigned int chunk_sectors = mddev->chunk_sectors;
- unsigned int bio_sectors = bvm->bi_size >> 9;
- struct strip_zone *zone;
- struct md_rdev *rdev;
- struct request_queue *subq;
-
- if (is_power_of_2(chunk_sectors))
- max = (chunk_sectors - ((sector & (chunk_sectors-1))
- + bio_sectors)) << 9;
- else
- max = (chunk_sectors - (sector_div(sector, chunk_sectors)
- + bio_sectors)) << 9;
- if (max < 0)
- max = 0; /* bio_add cannot handle a negative return */
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- if (max < biovec->bv_len)
- /* too small already, no need to check further */
- return max;
- if (!conf->has_merge_bvec)
- return max;
-
- /* May need to check subordinate device */
- sector = sector_offset;
- zone = find_zone(mddev->private, &sector_offset);
- rdev = map_sector(mddev, zone, sector, &sector_offset);
- subq = bdev_get_queue(rdev->bdev);
- if (subq->merge_bvec_fn) {
- bvm->bi_bdev = rdev->bdev;
- bvm->bi_sector = sector_offset + zone->dev_start +
- rdev->data_offset;
- return min(max, subq->merge_bvec_fn(subq, bvm, biovec));
- } else
- return max;
-}
-
static sector_t raid0_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
sector_t array_sectors = 0;
@@ -725,7 +670,6 @@ static struct md_personality raid0_personality=
.takeover = raid0_takeover,
.quiesce = raid0_quiesce,
.congested = raid0_congested,
- .mergeable_bvec = raid0_mergeable_bvec,
};

static int __init raid0_init (void)
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 05539d9..7127a62 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -12,8 +12,6 @@ struct r0conf {
struct md_rdev **devlist; /* lists of rdevs, pointed to
* by strip_zone->dev */
int nr_strip_zones;
- int has_merge_bvec; /* at least one member has
- * a merge_bvec_fn */
};

#endif
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9157a29..478878f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -557,7 +557,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
|| rdev == NULL
- || test_bit(Unmerged, &rdev->flags)
|| test_bit(Faulty, &rdev->flags))
continue;
if (!test_bit(In_sync, &rdev->flags) &&
@@ -708,38 +707,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
return best_disk;
}

-static int raid1_mergeable_bvec(struct mddev *mddev,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct r1conf *conf = mddev->private;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int max = biovec->bv_len;
-
- if (mddev->merge_check_needed) {
- int disk;
- rcu_read_lock();
- for (disk = 0; disk < conf->raid_disks * 2; disk++) {
- struct md_rdev *rdev = rcu_dereference(
- conf->mirrors[disk].rdev);
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- struct request_queue *q =
- bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn) {
- bvm->bi_sector = sector +
- rdev->data_offset;
- bvm->bi_bdev = rdev->bdev;
- max = min(max, q->merge_bvec_fn(
- q, bvm, biovec));
- }
- }
- }
- rcu_read_unlock();
- }
- return max;
-
-}
-
static int raid1_congested(struct mddev *mddev, int bits)
{
struct r1conf *conf = mddev->private;
@@ -1268,8 +1235,7 @@ read_again:
break;
}
r1_bio->bios[i] = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags)
- || test_bit(Unmerged, &rdev->flags)) {
+ if (!rdev || test_bit(Faulty, &rdev->flags)) {
if (i < conf->raid_disks)
set_bit(R1BIO_Degraded, &r1_bio->state);
continue;
@@ -1614,7 +1580,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
struct raid1_info *p;
int first = 0;
int last = conf->raid_disks - 1;
- struct request_queue *q = bdev_get_queue(rdev->bdev);

if (mddev->recovery_disabled == conf->recovery_disabled)
return -EBUSY;
@@ -1622,11 +1587,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

- if (q->merge_bvec_fn) {
- set_bit(Unmerged, &rdev->flags);
- mddev->merge_check_needed = 1;
- }
-
for (mirror = first; mirror <= last; mirror++) {
p = conf->mirrors+mirror;
if (!p->rdev) {
@@ -1658,19 +1618,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
break;
}
}
- if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
- /* Some requests might not have seen this new
- * merge_bvec_fn. We must wait for them to complete
- * before merging the device fully.
- * First we make sure any code which has tested
- * our function has submitted the request, then
- * we wait for all outstanding requests to complete.
- */
- synchronize_sched();
- freeze_array(conf, 0);
- unfreeze_array(conf);
- clear_bit(Unmerged, &rdev->flags);
- }
md_integrity_add_rdev(rdev, mddev);
if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
@@ -2807,8 +2754,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
goto abort;
disk->rdev = rdev;
q = bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn)
- mddev->merge_check_needed = 1;

disk->head_position = 0;
disk->seq_start = MaxSector;
@@ -3173,7 +3118,6 @@ static struct md_personality raid1_personality =
.quiesce = raid1_quiesce,
.takeover = raid1_takeover,
.congested = raid1_congested,
- .mergeable_bvec = raid1_mergeable_bvec,
};

static int __init raid_init(void)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e793ab6..a46c402 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -672,93 +672,6 @@ static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
return (vchunk << geo->chunk_shift) + offset;
}

-/**
- * raid10_mergeable_bvec -- tell bio layer if a two requests can be merged
- * @mddev: the md device
- * @bvm: properties of new bio
- * @biovec: the request that could be merged to it.
- *
- * Return amount of bytes we can accept at this offset
- * This requires checking for end-of-chunk if near_copies != raid_disks,
- * and for subordinate merge_bvec_fns if merge_check_needed.
- */
-static int raid10_mergeable_bvec(struct mddev *mddev,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct r10conf *conf = mddev->private;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int max;
- unsigned int chunk_sectors;
- unsigned int bio_sectors = bvm->bi_size >> 9;
- struct geom *geo = &conf->geo;
-
- chunk_sectors = (conf->geo.chunk_mask & conf->prev.chunk_mask) + 1;
- if (conf->reshape_progress != MaxSector &&
- ((sector >= conf->reshape_progress) !=
- conf->mddev->reshape_backwards))
- geo = &conf->prev;
-
- if (geo->near_copies < geo->raid_disks) {
- max = (chunk_sectors - ((sector & (chunk_sectors - 1))
- + bio_sectors)) << 9;
- if (max < 0)
- /* bio_add cannot handle a negative return */
- max = 0;
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- } else
- max = biovec->bv_len;
-
- if (mddev->merge_check_needed) {
- struct {
- struct r10bio r10_bio;
- struct r10dev devs[conf->copies];
- } on_stack;
- struct r10bio *r10_bio = &on_stack.r10_bio;
- int s;
- if (conf->reshape_progress != MaxSector) {
- /* Cannot give any guidance during reshape */
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- return 0;
- }
- r10_bio->sector = sector;
- raid10_find_phys(conf, r10_bio);
- rcu_read_lock();
- for (s = 0; s < conf->copies; s++) {
- int disk = r10_bio->devs[s].devnum;
- struct md_rdev *rdev = rcu_dereference(
- conf->mirrors[disk].rdev);
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- struct request_queue *q =
- bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn) {
- bvm->bi_sector = r10_bio->devs[s].addr
- + rdev->data_offset;
- bvm->bi_bdev = rdev->bdev;
- max = min(max, q->merge_bvec_fn(
- q, bvm, biovec));
- }
- }
- rdev = rcu_dereference(conf->mirrors[disk].replacement);
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- struct request_queue *q =
- bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn) {
- bvm->bi_sector = r10_bio->devs[s].addr
- + rdev->data_offset;
- bvm->bi_bdev = rdev->bdev;
- max = min(max, q->merge_bvec_fn(
- q, bvm, biovec));
- }
- }
- }
- rcu_read_unlock();
- }
- return max;
-}
-
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -821,12 +734,10 @@ retry:
disk = r10_bio->devs[slot].devnum;
rdev = rcu_dereference(conf->mirrors[disk].replacement);
if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags) ||
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (rdev == NULL ||
- test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags))
+ test_bit(Faulty, &rdev->flags))
continue;
if (!test_bit(In_sync, &rdev->flags) &&
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
@@ -1326,11 +1237,9 @@ retry_write:
blocked_rdev = rrdev;
break;
}
- if (rdev && (test_bit(Faulty, &rdev->flags)
- || test_bit(Unmerged, &rdev->flags)))
+ if (rdev && (test_bit(Faulty, &rdev->flags)))
rdev = NULL;
- if (rrdev && (test_bit(Faulty, &rrdev->flags)
- || test_bit(Unmerged, &rrdev->flags)))
+ if (rrdev && (test_bit(Faulty, &rrdev->flags)))
rrdev = NULL;

r10_bio->devs[i].bio = NULL;
@@ -1777,7 +1686,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
int mirror;
int first = 0;
int last = conf->geo.raid_disks - 1;
- struct request_queue *q = bdev_get_queue(rdev->bdev);

if (mddev->recovery_cp < MaxSector)
/* only hot-add to in-sync arrays, as recovery is
@@ -1790,11 +1698,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

- if (q->merge_bvec_fn) {
- set_bit(Unmerged, &rdev->flags);
- mddev->merge_check_needed = 1;
- }
-
if (rdev->saved_raid_disk >= first &&
conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
mirror = rdev->saved_raid_disk;
@@ -1833,19 +1736,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
rcu_assign_pointer(p->rdev, rdev);
break;
}
- if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
- /* Some requests might not have seen this new
- * merge_bvec_fn. We must wait for them to complete
- * before merging the device fully.
- * First we make sure any code which has tested
- * our function has submitted the request, then
- * we wait for all outstanding requests to complete.
- */
- synchronize_sched();
- freeze_array(conf, 0);
- unfreeze_array(conf);
- clear_bit(Unmerged, &rdev->flags);
- }
md_integrity_add_rdev(rdev, mddev);
if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
@@ -2404,7 +2294,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (rdev &&
- !test_bit(Unmerged, &rdev->flags) &&
test_bit(In_sync, &rdev->flags) &&
is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
&first_bad, &bad_sectors) == 0) {
@@ -2458,7 +2347,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (!rdev ||
- test_bit(Unmerged, &rdev->flags) ||
!test_bit(In_sync, &rdev->flags))
continue;

@@ -3652,8 +3540,6 @@ static int run(struct mddev *mddev)
disk->rdev = rdev;
}
q = bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn)
- mddev->merge_check_needed = 1;
diff = (rdev->new_data_offset - rdev->data_offset);
if (!mddev->reshape_backwards)
diff = -diff;
@@ -4706,7 +4592,6 @@ static struct md_personality raid10_personality =
.start_reshape = raid10_start_reshape,
.finish_reshape = raid10_finish_reshape,
.congested = raid10_congested,
- .mergeable_bvec = raid10_mergeable_bvec,
};

static int __init raid_init(void)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b6c6ace..18d2b23 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4625,35 +4625,6 @@ static int raid5_congested(struct mddev *mddev, int bits)
return 0;
}

-/* We want read requests to align with chunks where possible,
- * but write requests don't need to.
- */
-static int raid5_mergeable_bvec(struct mddev *mddev,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int max;
- unsigned int chunk_sectors = mddev->chunk_sectors;
- unsigned int bio_sectors = bvm->bi_size >> 9;
-
- /*
- * always allow writes to be mergeable, read as well if array
- * is degraded as we'll go through stripe cache anyway.
- */
- if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
- return biovec->bv_len;
-
- if (mddev->new_chunk_sectors < mddev->chunk_sectors)
- chunk_sectors = mddev->new_chunk_sectors;
- max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
- if (max < 0) max = 0;
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- else
- return max;
-}
-
static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
{
sector_t sector = bio->bi_iter.bi_sector + get_start_sect(bio->bi_bdev);
@@ -7722,7 +7693,6 @@ static struct md_personality raid6_personality =
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
.congested = raid5_congested,
- .mergeable_bvec = raid5_mergeable_bvec,
};
static struct md_personality raid5_personality =
{
@@ -7746,7 +7716,6 @@ static struct md_personality raid5_personality =
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
.congested = raid5_congested,
- .mergeable_bvec = raid5_mergeable_bvec,
};

static struct md_personality raid4_personality =
@@ -7771,7 +7740,6 @@ static struct md_personality raid4_personality =
.quiesce = raid5_quiesce,
.takeover = raid4_takeover,
.congested = raid5_congested,
- .mergeable_bvec = raid5_mergeable_bvec,
};

static int __init raid5_init(void)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 93b81a2..6927b76 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -239,14 +239,6 @@ typedef int (prep_rq_fn) (struct request_queue *, struct request *);
typedef void (unprep_rq_fn) (struct request_queue *, struct request *);

struct bio_vec;
-struct bvec_merge_data {
- struct block_device *bi_bdev;
- sector_t bi_sector;
- unsigned bi_size;
- unsigned long bi_rw;
-};
-typedef int (merge_bvec_fn) (struct request_queue *, struct bvec_merge_data *,
- struct bio_vec *);
typedef void (softirq_done_fn)(struct request *);
typedef int (dma_drain_needed_fn)(struct request *);
typedef int (lld_busy_fn) (struct request_queue *q);
@@ -331,7 +323,6 @@ struct request_queue {
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
- merge_bvec_fn *merge_bvec_fn;
softirq_done_fn *softirq_done_fn;
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
@@ -1041,7 +1032,6 @@ extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
-extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
extern void blk_queue_dma_alignment(struct request_queue *, int);
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 51cc1de..76d23fa 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -82,9 +82,6 @@ typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
typedef int (*dm_ioctl_fn) (struct dm_target *ti, unsigned int cmd,
unsigned long arg);

-typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size);
-
/*
* These iteration functions are typically used to check (and combine)
* properties of underlying devices.
@@ -160,7 +157,6 @@ struct target_type {
dm_status_fn status;
dm_message_fn message;
dm_ioctl_fn ioctl;
- dm_merge_fn merge;
dm_busy_fn busy;
dm_iterate_devices_fn iterate_devices;
dm_io_hints_fn io_hints;
--
1.9.1

2015-05-22 18:20:46

by Ming Lin

Subject: [PATCH v4 09/11] fs: use helper bio_add_page() instead of open coding on bi_io_vec

From: Kent Overstreet <[email protected]>

Call the pre-defined helper bio_add_page() instead of open-coding the
setup of bi_io_vec[]. This makes some parts of the filesystems and
mm/page_io.c simpler than before.

Acked-by: Dave Kleikamp <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
fs/buffer.c | 7 ++-----
fs/jfs/jfs_logmgr.c | 14 ++++----------
mm/page_io.c | 8 +++-----
3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c7a5602..d9f00b6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3022,12 +3022,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)

bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
- bio->bi_io_vec[0].bv_page = bh->b_page;
- bio->bi_io_vec[0].bv_len = bh->b_size;
- bio->bi_io_vec[0].bv_offset = bh_offset(bh);

- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = bh->b_size;
+ bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
+ BUG_ON(bio->bi_iter.bi_size != bh->b_size);

bio->bi_end_io = end_bio_bh_io_sync;
bio->bi_private = bh;
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dc..46fae06 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,12 +1999,9 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)

bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
- bio->bi_io_vec[0].bv_page = bp->l_page;
- bio->bi_io_vec[0].bv_len = LOGPSIZE;
- bio->bi_io_vec[0].bv_offset = bp->l_offset;

- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = LOGPSIZE;
+ bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+ BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);

bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
@@ -2145,12 +2142,9 @@ static void lbmStartIO(struct lbuf * bp)
bio = bio_alloc(GFP_NOFS, 1);
bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
- bio->bi_io_vec[0].bv_page = bp->l_page;
- bio->bi_io_vec[0].bv_len = LOGPSIZE;
- bio->bi_io_vec[0].bv_offset = bp->l_offset;

- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = LOGPSIZE;
+ bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+ BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);

bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
diff --git a/mm/page_io.c b/mm/page_io.c
index 6424869..9fb8a0d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,12 +33,10 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
if (bio) {
bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
- bio->bi_io_vec[0].bv_page = page;
- bio->bi_io_vec[0].bv_len = PAGE_SIZE;
- bio->bi_io_vec[0].bv_offset = 0;
- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = PAGE_SIZE;
bio->bi_end_io = end_io;
+
+ bio_add_page(bio, page, PAGE_SIZE, 0);
+ BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
}
return bio;
}
--
1.9.1

2015-05-22 18:19:46

by Ming Lin

Subject: [PATCH v4 10/11] block: remove bio_get_nr_vecs()

From: Kent Overstreet <[email protected]>

We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Signed-off-by: Kent Overstreet <[email protected]>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
block/bio.c | 23 -----------------------
drivers/md/dm-io.c | 2 +-
fs/btrfs/compression.c | 5 +----
fs/btrfs/extent_io.c | 9 ++-------
fs/btrfs/inode.c | 3 +--
fs/btrfs/scrub.c | 18 ++----------------
fs/direct-io.c | 2 +-
fs/ext4/page-io.c | 3 +--
fs/ext4/readpage.c | 2 +-
fs/gfs2/lops.c | 9 +--------
fs/logfs/dev_bdev.c | 4 ++--
fs/mpage.c | 4 ++--
fs/nilfs2/segbuf.c | 2 +-
fs/xfs/xfs_aops.c | 3 +--
include/linux/bio.h | 1 -
15 files changed, 17 insertions(+), 73 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ae31cdb..f59b647 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -677,29 +677,6 @@ integrity_clone:
EXPORT_SYMBOL(bio_clone_bioset);

/**
- * bio_get_nr_vecs - return approx number of vecs
- * @bdev: I/O target
- *
- * Return the approximate number of pages we can send to this target.
- * There's no guarantee that you will be able to fit this number of pages
- * into a bio, it does not account for dynamic restrictions that vary
- * on offset.
- */
-int bio_get_nr_vecs(struct block_device *bdev)
-{
- struct request_queue *q = bdev_get_queue(bdev);
- int nr_pages;
-
- nr_pages = min_t(unsigned,
- queue_max_segments(q),
- queue_max_sectors(q) / (PAGE_SIZE >> 9) + 1);
-
- return min_t(unsigned, nr_pages, BIO_MAX_PAGES);
-
-}
-EXPORT_SYMBOL(bio_get_nr_vecs);
-
-/**
* bio_add_pc_page - attempt to add page to bio
* @q: the target queue
* @bio: destination bio
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 74adcd2..7d64272 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -314,7 +314,7 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
if ((rw & REQ_DISCARD) || (rw & REQ_WRITE_SAME))
num_bvecs = 1;
else
- num_bvecs = min_t(int, bio_get_nr_vecs(where->bdev),
+ num_bvecs = min_t(int, BIO_MAX_PAGES,
dm_sector_div_up(remaining, (PAGE_SIZE >> SECTOR_SHIFT)));

bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ce62324..449c752 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -97,10 +97,7 @@ static inline int compressed_bio_size(struct btrfs_root *root,
static struct bio *compressed_bio_alloc(struct block_device *bdev,
u64 first_byte, gfp_t gfp_flags)
{
- int nr_vecs;
-
- nr_vecs = bio_get_nr_vecs(bdev);
- return btrfs_bio_alloc(bdev, first_byte >> 9, nr_vecs, gfp_flags);
+ return btrfs_bio_alloc(bdev, first_byte >> 9, BIO_MAX_PAGES, gfp_flags);
}

static int check_compressed_csum(struct inode *inode,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c32d226..07ecf2e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2799,9 +2799,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
{
int ret = 0;
struct bio *bio;
- int nr;
int contig = 0;
- int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED;
int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE);

@@ -2826,12 +2824,9 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
return 0;
}
}
- if (this_compressed)
- nr = BIO_MAX_PAGES;
- else
- nr = bio_get_nr_vecs(bdev);

- bio = btrfs_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH);
+ bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
+ GFP_NOFS | __GFP_HIGH);
if (!bio)
return -ENOMEM;

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8bb0136..eff9129 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7943,8 +7943,7 @@ out:
static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev,
u64 first_sector, gfp_t gfp_flags)
{
- int nr_vecs = bio_get_nr_vecs(bdev);
- return btrfs_bio_alloc(bdev, first_sector, nr_vecs, gfp_flags);
+ return btrfs_bio_alloc(bdev, first_sector, BIO_MAX_PAGES, gfp_flags);
}

static inline int btrfs_lookup_and_bind_dio_csum(struct btrfs_root *root,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ab58115..00b828a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -454,27 +454,14 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
struct scrub_ctx *sctx;
int i;
struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
- int pages_per_rd_bio;
int ret;

- /*
- * the setting of pages_per_rd_bio is correct for scrub but might
- * be wrong for the dev_replace code where we might read from
- * different devices in the initial huge bios. However, that
- * code is able to correctly handle the case when adding a page
- * to a bio fails.
- */
- if (dev->bdev)
- pages_per_rd_bio = min_t(int, SCRUB_PAGES_PER_RD_BIO,
- bio_get_nr_vecs(dev->bdev));
- else
- pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
if (!sctx)
goto nomem;
atomic_set(&sctx->refs, 1);
sctx->is_dev_replace = is_dev_replace;
- sctx->pages_per_rd_bio = pages_per_rd_bio;
+ sctx->pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
sctx->curr = -1;
sctx->dev_root = dev->dev_root;
for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
@@ -3875,8 +3862,7 @@ static int scrub_setup_wr_ctx(struct scrub_ctx *sctx,
return 0;

WARN_ON(!dev->bdev);
- wr_ctx->pages_per_wr_bio = min_t(int, SCRUB_PAGES_PER_WR_BIO,
- bio_get_nr_vecs(dev->bdev));
+ wr_ctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
wr_ctx->tgtdev = dev;
atomic_set(&wr_ctx->flush_all_writes, 0);
return 0;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 745d234..89baebe 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -653,7 +653,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
if (ret)
goto out;
sector = start_sector << (sdio->blkbits - 9);
- nr_pages = min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev));
+ nr_pages = min(sdio->pages_in_io, BIO_MAX_PAGES);
BUG_ON(nr_pages <= 0);
dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
sdio->boundary = 0;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 5765f88..000682e 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -376,10 +376,9 @@ void ext4_io_submit_init(struct ext4_io_submit *io,
static int io_submit_init_bio(struct ext4_io_submit *io,
struct buffer_head *bh)
{
- int nvecs = bio_get_nr_vecs(bh->b_bdev);
struct bio *bio;

- bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
+ bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
if (!bio)
return -ENOMEM;
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 171b9ac..de1f96a 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -284,7 +284,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
goto set_error_page;
}
bio = bio_alloc(GFP_KERNEL,
- min_t(int, nr_pages, bio_get_nr_vecs(bdev)));
+ min_t(int, nr_pages, BIO_MAX_PAGES));
if (!bio) {
if (ctx)
ext4_release_crypto_ctx(ctx);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2c1ae86..64d3116 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -261,18 +261,11 @@ void gfs2_log_flush_bio(struct gfs2_sbd *sdp, int rw)
static struct bio *gfs2_log_alloc_bio(struct gfs2_sbd *sdp, u64 blkno)
{
struct super_block *sb = sdp->sd_vfs;
- unsigned nrvecs = bio_get_nr_vecs(sb->s_bdev);
struct bio *bio;

BUG_ON(sdp->sd_log_bio);

- while (1) {
- bio = bio_alloc(GFP_NOIO, nrvecs);
- if (likely(bio))
- break;
- nrvecs = max(nrvecs/2, 1U);
- }
-
+ bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
bio->bi_iter.bi_sector = blkno * (sb->s_blocksize >> 9);
bio->bi_bdev = sb->s_bdev;
bio->bi_end_io = gfs2_end_log_write;
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 76279e1..fbb5f95 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -83,7 +83,7 @@ static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
unsigned int max_pages;
int i;

- max_pages = min(nr_pages, (size_t) bio_get_nr_vecs(super->s_bdev));
+ max_pages = min(nr_pages, BIO_MAX_PAGES);

bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio);
@@ -175,7 +175,7 @@ static int do_erase(struct super_block *sb, u64 ofs, pgoff_t index,
unsigned int max_pages;
int i;

- max_pages = min(nr_pages, (size_t) bio_get_nr_vecs(super->s_bdev));
+ max_pages = min(nr_pages, BIO_MAX_PAGES);

bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio);
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..0e7762b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -277,7 +277,7 @@ alloc_new:
goto out;
}
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
- min_t(int, nr_pages, bio_get_nr_vecs(bdev)),
+ min_t(int, nr_pages, BIO_MAX_PAGES),
GFP_KERNEL);
if (bio == NULL)
goto confused;
@@ -602,7 +602,7 @@ alloc_new:
}
}
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
- bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
+ BIO_MAX_PAGES, GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
goto confused;
}
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index dc3a9efd..9be5a79 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -427,7 +427,7 @@ static void nilfs_segbuf_prepare_write(struct nilfs_segment_buffer *segbuf,
{
wi->bio = NULL;
wi->rest_blocks = segbuf->sb_sum.nblocks;
- wi->max_pages = bio_get_nr_vecs(wi->nilfs->ns_bdev);
+ wi->max_pages = BIO_MAX_PAGES;
wi->nr_vecs = min(wi->max_pages, wi->rest_blocks);
wi->start = wi->end = 0;
wi->blocknr = segbuf->sb_pseg_start;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a56960d..6339fe7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -383,8 +383,7 @@ STATIC struct bio *
xfs_alloc_ioend_bio(
struct buffer_head *bh)
{
- int nvecs = bio_get_nr_vecs(bh->b_bdev);
- struct bio *bio = bio_alloc(GFP_NOIO, nvecs);
+ struct bio *bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);

ASSERT(bio->bi_private == NULL);
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127..015422a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -427,7 +427,6 @@ void bio_chain(struct bio *, struct bio *);
extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
-extern int bio_get_nr_vecs(struct block_device *);
struct rq_map_data;
extern struct bio *bio_map_user_iov(struct request_queue *,
const struct iov_iter *, gfp_t);
--
1.9.1

2015-05-22 18:20:02

by Ming Lin

Subject: [PATCH v4 11/11] Documentation: update notes in biovecs about arbitrarily sized bios

From: Dongsu Park <[email protected]>

Update block/biovecs.txt with a note on the effects that arbitrarily
sized bios have on the block layer.
Also fix a trivial typo: bio_iter_iovec.

Cc: Christoph Hellwig <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Ming Lin <[email protected]>
---
Documentation/block/biovecs.txt | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 74a32ad..2568958 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -24,7 +24,7 @@ particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.

* Driver code should no longer refer to biovecs directly; we now have
- bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
+ bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
constructed from the raw biovecs but taking into account bi_bvec_done and
bi_size.

@@ -109,3 +109,11 @@ Other implications:
over all the biovecs in the new bio - which is silly as it's not needed.

So, don't use bi_vcnt anymore.
+
+ * The current interface allows the block layer to split bios as needed, so we
+ could eliminate a lot of complexity particularly in stacked drivers. Code
+ that creates bios can then create whatever size bios are convenient, and
+ more importantly stacked drivers don't have to deal with both their own bio
+ size limitations and the limitations of the underlying devices. Thus
+ there's no need to define ->merge_bvec_fn() callbacks for individual block
+ drivers.
--
1.9.1

2015-05-23 14:15:25

by Christoph Hellwig

Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

On Fri, May 22, 2015 at 11:18:32AM -0700, Ming Lin wrote:
> This will bring not only performance improvements, but also a great amount
> of reduction in code complexity all over the block layer. Performance gain
> is possible due to the fact that bio_add_page() does not have to check
> unnecessary conditions such as queue limits or if biovecs are mergeable.
> Those will be delegated to the driver level. Kent already said that he
> actually benchmarked the impact of this with fio on a micron p320h, which
> showed a definite positive impact.

We'll need some actual numbers. I actually like these changes a lot
and don't even need a performance justification for this fundamentally
better model, but I'd really prefer to avoid any large scale regressions.
I don't really expect them, but for code this fundamental we'll just
need some benchmarks.

Except for that these changes look good, and the previous version
passed my tests fine, so with some benchmarks you'll have my ACK.

I'd love to see this go into 4.2, but for that we'll need Jens
approval and a merge into for-next very soon.

2015-05-24 07:37:36

by Ming Lin

Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

On Sat, May 23, 2015 at 7:15 AM, Christoph Hellwig <[email protected]> wrote:
> On Fri, May 22, 2015 at 11:18:32AM -0700, Ming Lin wrote:
>> This will bring not only performance improvements, but also a great amount
>> of reduction in code complexity all over the block layer. Performance gain
>> is possible due to the fact that bio_add_page() does not have to check
>> unnecessary conditions such as queue limits or if biovecs are mergeable.
>> Those will be delegated to the driver level. Kent already said that he
>> actually benchmarked the impact of this with fio on a micron p320h, which
>> showed a definite positive impact.
>
> We'll need some actual numbers. I actually like these changes a lot
> and don't even need a performance justification for this fundamentally
> better model, but I'd really prefer to avoid any large scale regressions.
> I don't really expect them, but for code this fundamental we'll just
> need some benchmarks.
>
> Except for that these changes look good, and the previous version
> passed my tests fine, so with some benchmarks you'll have my ACK.

I'll test it on a 2 sockets server with 10 NVMe drives on Monday.
I'm going to run fio tests:
1. raw NVMe drives direct IO read/write
2. ext4 read/write

Let me know if you have other tests that I can run.
Thanks.

>
> I'd love to see this go into 4.2, but for that we'll need Jens
> approval and a merge into for-next very soon.

2015-05-25 05:46:46

by NeilBrown

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Fri, 22 May 2015 11:18:33 -0700 Ming Lin <[email protected]> wrote:

> From: Kent Overstreet <[email protected]>
>
> The way the block layer is currently written, it goes to great lengths
> to avoid having to split bios; upper layer code (such as bio_add_page())
> checks what the underlying device can handle and tries to always create
> bios that don't need to be split.
>
> But this approach becomes unwieldy and eventually breaks down with
> stacked devices and devices with dynamic limits, and it adds a lot of
> complexity. If the block layer could split bios as needed, we could
> eliminate a lot of complexity elsewhere - particularly in stacked
> drivers. Code that creates bios can then create whatever size bios are
> convenient, and more importantly stacked drivers don't have to deal with
> both their own bio size limitations and the limitations of the
> (potentially multiple) devices underneath them. In the future this will
> let us delete merge_bvec_fn and a bunch of other code.
>
> We do this by adding calls to blk_queue_split() to the various
> make_request functions that need it - a few can already handle arbitrary
> size bios. Note that we add the call _after_ any call to
> blk_queue_bounce(); this means that blk_queue_split() and
> blk_recalc_rq_segments() don't need to be concerned with bouncing
> affecting segment merging.
>
> Some make_request_fn() callbacks were simple enough to audit and verify
> they don't need blk_queue_split() calls. The skipped ones are:
>
> * nfhd_make_request (arch/m68k/emu/nfblock.c)
> * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
> * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
> * brd_make_request (ramdisk - drivers/block/brd.c)
> * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
> * loop_make_request
> * null_queue_bio
> * bcache's make_request fns
>
> Some others are almost certainly safe to remove now, but will be left
> for future patches.
>
> Cc: Jens Axboe <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Ming Lei <[email protected]>
> Cc: Neil Brown <[email protected]>
> Cc: Alasdair Kergon <[email protected]>
> Cc: Mike Snitzer <[email protected]>
> Cc: [email protected]
> Cc: Lars Ellenberg <[email protected]>
> Cc: [email protected]
> Cc: Jiri Kosina <[email protected]>
> Cc: Geoff Levand <[email protected]>
> Cc: Jim Paris <[email protected]>
> Cc: Joshua Morris <[email protected]>
> Cc: Philip Kelleher <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Nitin Gupta <[email protected]>
> Cc: Oleg Drokin <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
> Signed-off-by: Dongsu Park <[email protected]>
> Signed-off-by: Ming Lin <[email protected]>


Acked-by: NeilBrown <[email protected]>

For the 'md/md.c' bits.

Thanks,
NeilBrown


> ---
> block/blk-core.c | 19 ++--
> block/blk-merge.c | 159 ++++++++++++++++++++++++++--
> block/blk-mq.c | 4 +
> drivers/block/drbd/drbd_req.c | 2 +
> drivers/block/pktcdvd.c | 6 +-
> drivers/block/ps3vram.c | 2 +
> drivers/block/rsxx/dev.c | 2 +
> drivers/block/umem.c | 2 +
> drivers/block/zram/zram_drv.c | 2 +
> drivers/md/dm.c | 2 +
> drivers/md/md.c | 2 +
> drivers/s390/block/dcssblk.c | 2 +
> drivers/s390/block/xpram.c | 2 +
> drivers/staging/lustre/lustre/llite/lloop.c | 2 +
> include/linux/blkdev.h | 3 +
> 15 files changed, 189 insertions(+), 22 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7871603..fbbb337 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -619,6 +619,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
> if (q->id < 0)
> goto fail_q;
>
> + q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
> + if (!q->bio_split)
> + goto fail_id;
> +
> q->backing_dev_info.ra_pages =
> (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> q->backing_dev_info.state = 0;
> @@ -628,7 +632,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>
> err = bdi_init(&q->backing_dev_info);
> if (err)
> - goto fail_id;
> + goto fail_split;
>
> setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
> laptop_mode_timer_fn, (unsigned long) q);
> @@ -670,6 +674,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>
> fail_bdi:
> bdi_destroy(&q->backing_dev_info);
> +fail_split:
> + bioset_free(q->bio_split);
> fail_id:
> ida_simple_remove(&blk_queue_ida, q->id);
> fail_q:
> @@ -1586,6 +1592,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
> struct request *req;
> unsigned int request_count = 0;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> /*
> * low level driver can indicate that it wants pages above a
> * certain limit bounced to low memory (ie for highmem, or even
> @@ -1809,15 +1817,6 @@ generic_make_request_checks(struct bio *bio)
> goto end_io;
> }
>
> - if (likely(bio_is_rw(bio) &&
> - nr_sectors > queue_max_hw_sectors(q))) {
> - printk(KERN_ERR "bio too big device %s (%u > %u)\n",
> - bdevname(bio->bi_bdev, b),
> - bio_sectors(bio),
> - queue_max_hw_sectors(q));
> - goto end_io;
> - }
> -
> part = bio->bi_bdev->bd_part;
> if (should_fail_request(part, bio->bi_iter.bi_size) ||
> should_fail_request(&part_to_disk(part)->part0,
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index fd3fee8..dc14255 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -9,12 +9,158 @@
>
> #include "blk.h"
>
> +static struct bio *blk_bio_discard_split(struct request_queue *q,
> + struct bio *bio,
> + struct bio_set *bs)
> +{
> + unsigned int max_discard_sectors, granularity;
> + int alignment;
> + sector_t tmp;
> + unsigned split_sectors;
> +
> + /* Zero-sector (unknown) and one-sector granularities are the same. */
> + granularity = max(q->limits.discard_granularity >> 9, 1U);
> +
> + max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
> + max_discard_sectors -= max_discard_sectors % granularity;
> +
> + if (unlikely(!max_discard_sectors)) {
> + /* XXX: warn */
> + return NULL;
> + }
> +
> + if (bio_sectors(bio) <= max_discard_sectors)
> + return NULL;
> +
> + split_sectors = max_discard_sectors;
> +
> + /*
> + * If the next starting sector would be misaligned, stop the discard at
> + * the previous aligned sector.
> + */
> + alignment = (q->limits.discard_alignment >> 9) % granularity;
> +
> + tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
> + tmp = sector_div(tmp, granularity);
> +
> + if (split_sectors > tmp)
> + split_sectors -= tmp;
> +
> + return bio_split(bio, split_sectors, GFP_NOIO, bs);
> +}
> +
> +static struct bio *blk_bio_write_same_split(struct request_queue *q,
> + struct bio *bio,
> + struct bio_set *bs)
> +{
> + if (!q->limits.max_write_same_sectors)
> + return NULL;
> +
> + if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
> + return NULL;
> +
> + return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
> +}
> +
> +static struct bio *blk_bio_segment_split(struct request_queue *q,
> + struct bio *bio,
> + struct bio_set *bs)
> +{
> + struct bio *split;
> + struct bio_vec bv, bvprv;
> + struct bvec_iter iter;
> + unsigned seg_size = 0, nsegs = 0;
> + int prev = 0;
> +
> + struct bvec_merge_data bvm = {
> + .bi_bdev = bio->bi_bdev,
> + .bi_sector = bio->bi_iter.bi_sector,
> + .bi_size = 0,
> + .bi_rw = bio->bi_rw,
> + };
> +
> + bio_for_each_segment(bv, bio, iter) {
> + if (q->merge_bvec_fn &&
> + q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
> + goto split;
> +
> + bvm.bi_size += bv.bv_len;
> +
> + if (bvm.bi_size >> 9 > queue_max_sectors(q))
> + goto split;
> +
> + /*
> + * If the queue doesn't support SG gaps and adding this
> + * offset would create a gap, disallow it.
> + */
> + if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) &&
> + prev && bvec_gap_to_prev(&bvprv, bv.bv_offset))
> + goto split;
> +
> + if (prev && blk_queue_cluster(q)) {
> + if (seg_size + bv.bv_len > queue_max_segment_size(q))
> + goto new_segment;
> + if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
> + goto new_segment;
> + if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
> + goto new_segment;
> +
> + seg_size += bv.bv_len;
> + bvprv = bv;
> + prev = 1;
> + continue;
> + }
> +new_segment:
> + if (nsegs == queue_max_segments(q))
> + goto split;
> +
> + nsegs++;
> + bvprv = bv;
> + prev = 1;
> + seg_size = bv.bv_len;
> + }
> +
> + return NULL;
> +split:
> + split = bio_clone_bioset(bio, GFP_NOIO, bs);
> +
> + split->bi_iter.bi_size -= iter.bi_size;
> + bio->bi_iter = iter;
> +
> + if (bio_integrity(bio)) {
> + bio_integrity_advance(bio, split->bi_iter.bi_size);
> + bio_integrity_trim(split, 0, bio_sectors(split));
> + }
> +
> + return split;
> +}
> +
> +void blk_queue_split(struct request_queue *q, struct bio **bio,
> + struct bio_set *bs)
> +{
> + struct bio *split;
> +
> + if ((*bio)->bi_rw & REQ_DISCARD)
> + split = blk_bio_discard_split(q, *bio, bs);
> + else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> + split = blk_bio_write_same_split(q, *bio, bs);
> + else
> + split = blk_bio_segment_split(q, *bio, q->bio_split);
> +
> + if (split) {
> + bio_chain(split, *bio);
> + generic_make_request(*bio);
> + *bio = split;
> + }
> +}
> +EXPORT_SYMBOL(blk_queue_split);
> +
> static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> struct bio *bio,
> bool no_sg_merge)
> {
> struct bio_vec bv, bvprv = { NULL };
> - int cluster, high, highprv = 1;
> + int cluster, prev = 0;
> unsigned int seg_size, nr_phys_segs;
> struct bio *fbio, *bbio;
> struct bvec_iter iter;
> @@ -36,7 +182,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> cluster = blk_queue_cluster(q);
> seg_size = 0;
> nr_phys_segs = 0;
> - high = 0;
> for_each_bio(bio) {
> bio_for_each_segment(bv, bio, iter) {
> /*
> @@ -46,13 +191,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> if (no_sg_merge)
> goto new_segment;
>
> - /*
> - * the trick here is making sure that a high page is
> - * never considered part of another segment, since
> - * that might change with the bounce page.
> - */
> - high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
> - if (!high && !highprv && cluster) {
> + if (prev && cluster) {
> if (seg_size + bv.bv_len
> > queue_max_segment_size(q))
> goto new_segment;
> @@ -72,8 +211,8 @@ new_segment:
>
> nr_phys_segs++;
> bvprv = bv;
> + prev = 1;
> seg_size = bv.bv_len;
> - highprv = high;
> }
> bbio = bio;
> }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e68b71b..e7fae76 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1256,6 +1256,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> return;
> }
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> rq = blk_mq_map_request(q, bio, &data);
> if (unlikely(!rq))
> return;
> @@ -1339,6 +1341,8 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
> return;
> }
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (use_plug && !blk_queue_nomerges(q) &&
> blk_attempt_plug_merge(q, bio, &request_count))
> return;
> diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
> index 3907202..a6265bc 100644
> --- a/drivers/block/drbd/drbd_req.c
> +++ b/drivers/block/drbd/drbd_req.c
> @@ -1497,6 +1497,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
> struct drbd_device *device = (struct drbd_device *) q->queuedata;
> unsigned long start_jif;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> start_jif = jiffies;
>
> /*
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index 09e628da..ea10bd9 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2446,6 +2446,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
> char b[BDEVNAME_SIZE];
> struct bio *split;
>
> + blk_queue_bounce(q, &bio);
> +
> + blk_queue_split(q, &bio, q->bio_split);
> +
> pd = q->queuedata;
> if (!pd) {
> pr_err("%s incorrect request queue\n",
> @@ -2476,8 +2480,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
> goto end_io;
> }
>
> - blk_queue_bounce(q, &bio);
> -
> do {
> sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
> sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
> diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
> index ef45cfb..e32e799 100644
> --- a/drivers/block/ps3vram.c
> +++ b/drivers/block/ps3vram.c
> @@ -605,6 +605,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
>
> dev_dbg(&dev->core, "%s\n", __func__);
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> spin_lock_irq(&priv->lock);
> busy = !bio_list_empty(&priv->list);
> bio_list_add(&priv->list, bio);
> diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
> index ac8c62c..50ef199 100644
> --- a/drivers/block/rsxx/dev.c
> +++ b/drivers/block/rsxx/dev.c
> @@ -148,6 +148,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
> struct rsxx_bio_meta *bio_meta;
> int st = -EINVAL;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> might_sleep();
>
> if (!card)
> diff --git a/drivers/block/umem.c b/drivers/block/umem.c
> index 4cf81b5..13d577c 100644
> --- a/drivers/block/umem.c
> +++ b/drivers/block/umem.c
> @@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
> (unsigned long long)bio->bi_iter.bi_sector,
> bio->bi_iter.bi_size);
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> spin_lock_irq(&card->lock);
> *card->biotail = bio;
> bio->bi_next = NULL;
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 8dcbced..36a004e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -981,6 +981,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
> if (unlikely(!zram_meta_get(zram)))
> goto error;
>
> + blk_queue_split(queue, &bio, queue->bio_split);
> +
> if (!valid_io_request(zram, bio->bi_iter.bi_sector,
> bio->bi_iter.bi_size)) {
> atomic64_inc(&zram->stats.invalid_io);
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index a930b72..34f6063 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1784,6 +1784,8 @@ static void dm_make_request(struct request_queue *q, struct bio *bio)
>
> map = dm_get_live_table(md, &srcu_idx);
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
>
> /* if we're suspended, we have to queue this io for later */
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 593a024..046b3c9 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -257,6 +257,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
> unsigned int sectors;
> int cpu;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (mddev == NULL || mddev->pers == NULL
> || !mddev->ready) {
> bio_io_error(bio);
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index da21281..267ca3a 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -826,6 +826,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
> unsigned long source_addr;
> unsigned long bytes_done;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> bytes_done = 0;
> dev_info = bio->bi_bdev->bd_disk->private_data;
> if (dev_info == NULL)
> diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
> index 7d4e939..1305ed3 100644
> --- a/drivers/s390/block/xpram.c
> +++ b/drivers/s390/block/xpram.c
> @@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
> unsigned long page_addr;
> unsigned long bytes;
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if ((bio->bi_iter.bi_sector & 7) != 0 ||
> (bio->bi_iter.bi_size & 4095) != 0)
> /* Request is not page-aligned. */
> diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
> index 413a840..a8645a9 100644
> --- a/drivers/staging/lustre/lustre/llite/lloop.c
> +++ b/drivers/staging/lustre/lustre/llite/lloop.c
> @@ -340,6 +340,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
> int rw = bio_rw(old_bio);
> int inactive;
>
> + blk_queue_split(q, &old_bio, q->bio_split);
> +
> if (!lo)
> goto err;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..93b81a2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -488,6 +488,7 @@ struct request_queue {
>
> struct blk_mq_tag_set *tag_set;
> struct list_head tag_set_list;
> + struct bio_set *bio_split;
> };
>
> #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
> @@ -812,6 +813,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
> extern int blk_insert_cloned_request(struct request_queue *q,
> struct request *rq);
> extern void blk_delay_queue(struct request_queue *, unsigned long);
> +extern void blk_queue_split(struct request_queue *, struct bio **,
> + struct bio_set *);
> extern void blk_recount_segments(struct request_queue *, struct bio *);
> extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
> extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,


Attachments:
(No filename) (811.00 B)
OpenPGP digital signature

2015-05-25 05:48:42

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Fri, 22 May 2015 11:18:38 -0700 Ming Lin <[email protected]> wrote:

> From: Kent Overstreet <[email protected]>
>
> Remove bio_fits_rdev() completely, because ->merge_bvec_fn() has now
> gone. There's no point in calling bio_fits_rdev() just to ensure an
> aligned read from the rdev.

Surely this patch should come *before*
[PATCH v4 07/11] md/raid5: split bio for chunk_aligned_read

and the comment says ->merge_bvec_fn() has gone, but that isn't until
[PATCH v4 08/11] block: kill merge_bvec_fn() completely


If those issues are resolved, then

Acked-by: NeilBrown <[email protected]>

Thanks,
NeilBrown


>
> Cc: Neil Brown <[email protected]>
> Cc: [email protected]
> Signed-off-by: Kent Overstreet <[email protected]>
> [dpark: add more description in commit message]
> Signed-off-by: Dongsu Park <[email protected]>
> Signed-off-by: Ming Lin <[email protected]>
> ---
> drivers/md/raid5.c | 23 +----------------------
> 1 file changed, 1 insertion(+), 22 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 1ba97fd..b303ded 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4743,25 +4743,6 @@ static void raid5_align_endio(struct bio *bi, int error)
> add_bio_to_retry(raid_bi, conf);
> }
>
> -static int bio_fits_rdev(struct bio *bi)
> -{
> - struct request_queue *q = bdev_get_queue(bi->bi_bdev);
> -
> - if (bio_sectors(bi) > queue_max_sectors(q))
> - return 0;
> - blk_recount_segments(q, bi);
> - if (bi->bi_phys_segments > queue_max_segments(q))
> - return 0;
> -
> - if (q->merge_bvec_fn)
> - /* it's too hard to apply the merge_bvec_fn at this stage,
> - * just just give up
> - */
> - return 0;
> -
> - return 1;
> -}
> -
> static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
> {
> struct r5conf *conf = mddev->private;
> @@ -4815,11 +4796,9 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
> align_bi->bi_bdev = rdev->bdev;
> __clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);
>
> - if (!bio_fits_rdev(align_bi) ||
> - is_badblock(rdev, align_bi->bi_iter.bi_sector,
> + if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
> bio_sectors(align_bi),
> &first_bad, &bad_sectors)) {
> - /* too big in some way, or has a known bad block */
> bio_put(align_bi);
> rdev_dec_pending(rdev, mddev);
> return 0;



2015-05-25 05:49:44

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

On Fri, 22 May 2015 11:18:40 -0700 Ming Lin <[email protected]> wrote:

> From: Kent Overstreet <[email protected]>
>
> As generic_make_request() is now able to handle arbitrarily sized bios,
> it's no longer necessary for each individual block driver to define its
> own ->merge_bvec_fn() callback. Remove every invocation completely.
>
> Cc: Jens Axboe <[email protected]>
> Cc: Lars Ellenberg <[email protected]>
> Cc: [email protected]
> Cc: Jiri Kosina <[email protected]>
> Cc: Yehuda Sadeh <[email protected]>
> Cc: Sage Weil <[email protected]>
> Cc: Alex Elder <[email protected]>
> Cc: [email protected]
> Cc: Alasdair Kergon <[email protected]>
> Cc: Mike Snitzer <[email protected]>
> Cc: [email protected]
> Cc: Neil Brown <[email protected]>
> Cc: [email protected]
> Cc: Christoph Hellwig <[email protected]>
> Cc: "Martin K. Petersen" <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
> dm-era-target, and resolve merge conflicts]
> Signed-off-by: Dongsu Park <[email protected]>
> Signed-off-by: Ming Lin <[email protected]>

Acked-by: NeilBrown <[email protected]> (for the 'md' bits)

Very happy to see this happening!

Thanks,
NeilBrown


> ---
> block/blk-merge.c | 17 +-----
> block/blk-settings.c | 22 --------
> drivers/block/drbd/drbd_int.h | 1 -
> drivers/block/drbd/drbd_main.c | 1 -
> drivers/block/drbd/drbd_req.c | 35 ------------
> drivers/block/pktcdvd.c | 21 -------
> drivers/block/rbd.c | 47 ----------------
> drivers/md/dm-cache-target.c | 21 -------
> drivers/md/dm-crypt.c | 16 ------
> drivers/md/dm-era-target.c | 15 -----
> drivers/md/dm-flakey.c | 16 ------
> drivers/md/dm-linear.c | 16 ------
> drivers/md/dm-log-writes.c | 16 ------
> drivers/md/dm-snap.c | 15 -----
> drivers/md/dm-stripe.c | 21 -------
> drivers/md/dm-table.c | 8 ---
> drivers/md/dm-thin.c | 31 -----------
> drivers/md/dm-verity.c | 16 ------
> drivers/md/dm.c | 120 +---------------------------------------
> drivers/md/dm.h | 2 -
> drivers/md/linear.c | 43 ---------------
> drivers/md/md.c | 26 ---------
> drivers/md/md.h | 12 ----
> drivers/md/multipath.c | 21 -------
> drivers/md/raid0.c | 56 -------------------
> drivers/md/raid0.h | 2 -
> drivers/md/raid1.c | 58 +-------------------
> drivers/md/raid10.c | 121 +----------------------------------------
> drivers/md/raid5.c | 32 -----------
> include/linux/blkdev.h | 10 ----
> include/linux/device-mapper.h | 4 --
> 31 files changed, 9 insertions(+), 833 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index dc14255..25cafb8 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -69,24 +69,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> struct bio *split;
> struct bio_vec bv, bvprv;
> struct bvec_iter iter;
> - unsigned seg_size = 0, nsegs = 0;
> + unsigned seg_size = 0, nsegs = 0, sectors = 0;
> int prev = 0;
>
> - struct bvec_merge_data bvm = {
> - .bi_bdev = bio->bi_bdev,
> - .bi_sector = bio->bi_iter.bi_sector,
> - .bi_size = 0,
> - .bi_rw = bio->bi_rw,
> - };
> -
> bio_for_each_segment(bv, bio, iter) {
> - if (q->merge_bvec_fn &&
> - q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
> - goto split;
> -
> - bvm.bi_size += bv.bv_len;
> + sectors += bv.bv_len >> 9;
>
> - if (bvm.bi_size >> 9 > queue_max_sectors(q))
> + if (sectors > queue_max_sectors(q))
> goto split;
>
> /*
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 12600bf..e90d477 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -53,28 +53,6 @@ void blk_queue_unprep_rq(struct request_queue *q, unprep_rq_fn *ufn)
> }
> EXPORT_SYMBOL(blk_queue_unprep_rq);
>
> -/**
> - * blk_queue_merge_bvec - set a merge_bvec function for queue
> - * @q: queue
> - * @mbfn: merge_bvec_fn
> - *
> - * Usually queues have static limitations on the max sectors or segments that
> - * we can put in a request. Stacking drivers may have some settings that
> - * are dynamic, and thus we have to query the queue whether it is ok to
> - * add a new bio_vec to a bio at a given offset or not. If the block device
> - * has such limitations, it needs to register a merge_bvec_fn to control
> - * the size of bio's sent to it. Note that a block device *must* allow a
> - * single page to be added to an empty bio. The block device driver may want
> - * to use the bio_split() function to deal with these bio's. By default
> - * no merge_bvec_fn is defined for a queue, and only the fixed limits are
> - * honored.
> - */
> -void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
> -{
> - q->merge_bvec_fn = mbfn;
> -}
> -EXPORT_SYMBOL(blk_queue_merge_bvec);
> -
> void blk_queue_softirq_done(struct request_queue *q, softirq_done_fn *fn)
> {
> q->softirq_done_fn = fn;
> diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
> index b905e98..63ce2b0 100644
> --- a/drivers/block/drbd/drbd_int.h
> +++ b/drivers/block/drbd/drbd_int.h
> @@ -1449,7 +1449,6 @@ extern void do_submit(struct work_struct *ws);
> extern void __drbd_make_request(struct drbd_device *, struct bio *, unsigned long);
> extern void drbd_make_request(struct request_queue *q, struct bio *bio);
> extern int drbd_read_remote(struct drbd_device *device, struct drbd_request *req);
> -extern int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec);
> extern int is_valid_ar_handle(struct drbd_request *, sector_t);
>
>
> diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
> index 81fde9e..771e68c 100644
> --- a/drivers/block/drbd/drbd_main.c
> +++ b/drivers/block/drbd/drbd_main.c
> @@ -2774,7 +2774,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
> This triggers a max_bio_size message upon first attach or connect */
> blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
> blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
> - blk_queue_merge_bvec(q, drbd_merge_bvec);
> q->queue_lock = &resource->req_lock;
>
> device->md_io.page = alloc_page(GFP_KERNEL);
> diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
> index a6265bc..7523f00 100644
> --- a/drivers/block/drbd/drbd_req.c
> +++ b/drivers/block/drbd/drbd_req.c
> @@ -1510,41 +1510,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
> __drbd_make_request(device, bio, start_jif);
> }
>
> -/* This is called by bio_add_page().
> - *
> - * q->max_hw_sectors and other global limits are already enforced there.
> - *
> - * We need to call down to our lower level device,
> - * in case it has special restrictions.
> - *
> - * We also may need to enforce configured max-bio-bvecs limits.
> - *
> - * As long as the BIO is empty we have to allow at least one bvec,
> - * regardless of size and offset, so no need to ask lower levels.
> - */
> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
> -{
> - struct drbd_device *device = (struct drbd_device *) q->queuedata;
> - unsigned int bio_size = bvm->bi_size;
> - int limit = DRBD_MAX_BIO_SIZE;
> - int backing_limit;
> -
> - if (bio_size && get_ldev(device)) {
> - unsigned int max_hw_sectors = queue_max_hw_sectors(q);
> - struct request_queue * const b =
> - device->ldev->backing_bdev->bd_disk->queue;
> - if (b->merge_bvec_fn) {
> - bvm->bi_bdev = device->ldev->backing_bdev;
> - backing_limit = b->merge_bvec_fn(b, bvm, bvec);
> - limit = min(limit, backing_limit);
> - }
> - put_ldev(device);
> - if ((limit >> 9) > max_hw_sectors)
> - limit = max_hw_sectors << 9;
> - }
> - return limit;
> -}
> -
> void request_timer_fn(unsigned long data)
> {
> struct drbd_device *device = (struct drbd_device *) data;
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index ea10bd9..85eac23 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2505,26 +2505,6 @@ end_io:
>
>
>
> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> - struct bio_vec *bvec)
> -{
> - struct pktcdvd_device *pd = q->queuedata;
> - sector_t zone = get_zone(bmd->bi_sector, pd);
> - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
> - int remaining = (pd->settings.size << 9) - used;
> - int remaining2;
> -
> - /*
> - * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
> - * boundary, pkt_make_request() will split the bio.
> - */
> - remaining2 = PAGE_SIZE - bmd->bi_size;
> - remaining = max(remaining, remaining2);
> -
> - BUG_ON(remaining < 0);
> - return remaining;
> -}
> -
> static void pkt_init_queue(struct pktcdvd_device *pd)
> {
> struct request_queue *q = pd->disk->queue;
> @@ -2532,7 +2512,6 @@ static void pkt_init_queue(struct pktcdvd_device *pd)
> blk_queue_make_request(q, pkt_make_request);
> blk_queue_logical_block_size(q, CD_FRAMESIZE);
> blk_queue_max_hw_sectors(q, PACKET_MAX_SECTORS);
> - blk_queue_merge_bvec(q, pkt_merge_bvec);
> q->queuedata = pd;
> }
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ec6c5c6..f50edb3 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> return BLK_MQ_RQ_QUEUE_OK;
> }
>
> -/*
> - * a queue callback. Makes sure that we don't create a bio that spans across
> - * multiple osd objects. One exception would be with a single page bios,
> - * which we handle later at bio_chain_clone_range()
> - */
> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> - struct bio_vec *bvec)
> -{
> - struct rbd_device *rbd_dev = q->queuedata;
> - sector_t sector_offset;
> - sector_t sectors_per_obj;
> - sector_t obj_sector_offset;
> - int ret;
> -
> - /*
> - * Find how far into its rbd object the partition-relative
> - * bio start sector is to offset relative to the enclosing
> - * device.
> - */
> - sector_offset = get_start_sect(bmd->bi_bdev) + bmd->bi_sector;
> - sectors_per_obj = 1 << (rbd_dev->header.obj_order - SECTOR_SHIFT);
> - obj_sector_offset = sector_offset & (sectors_per_obj - 1);
> -
> - /*
> - * Compute the number of bytes from that offset to the end
> - * of the object. Account for what's already used by the bio.
> - */
> - ret = (int) (sectors_per_obj - obj_sector_offset) << SECTOR_SHIFT;
> - if (ret > bmd->bi_size)
> - ret -= bmd->bi_size;
> - else
> - ret = 0;
> -
> - /*
> - * Don't send back more than was asked for. And if the bio
> - * was empty, let the whole thing through because: "Note
> - * that a block device *must* allow a single page to be
> - * added to an empty bio."
> - */
> - rbd_assert(bvec->bv_len <= PAGE_SIZE);
> - if (ret > (int) bvec->bv_len || !bmd->bi_size)
> - ret = (int) bvec->bv_len;
> -
> - return ret;
> -}
> -
> static void rbd_free_disk(struct rbd_device *rbd_dev)
> {
> struct gendisk *disk = rbd_dev->disk;
> @@ -3784,7 +3738,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
> q->limits.max_discard_sectors = segment_size / SECTOR_SIZE;
> q->limits.discard_zeroes_data = 1;
>
> - blk_queue_merge_bvec(q, rbd_merge_bvec);
> disk->queue = q;
>
> q->queuedata = rbd_dev;
> diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
> index 7755af3..2e47e35 100644
> --- a/drivers/md/dm-cache-target.c
> +++ b/drivers/md/dm-cache-target.c
> @@ -3289,26 +3289,6 @@ static int cache_iterate_devices(struct dm_target *ti,
> return r;
> }
>
> -/*
> - * We assume I/O is going to the origin (which is the volume
> - * more likely to have restrictions e.g. by being striped).
> - * (Looking up the exact location of the data would be expensive
> - * and could always be out of date by the time the bio is submitted.)
> - */
> -static int cache_bvec_merge(struct dm_target *ti,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct cache *cache = ti->private;
> - struct request_queue *q = bdev_get_queue(cache->origin_dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = cache->origin_dev->bdev;
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
> {
> /*
> @@ -3352,7 +3332,6 @@ static struct target_type cache_target = {
> .status = cache_status,
> .message = cache_message,
> .iterate_devices = cache_iterate_devices,
> - .merge = cache_bvec_merge,
> .io_hints = cache_io_hints,
> };
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 5503e43..d13f330 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2017,21 +2017,6 @@ error:
> return -EINVAL;
> }
>
> -static int crypt_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct crypt_config *cc = ti->private;
> - struct request_queue *q = bdev_get_queue(cc->dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = cc->dev->bdev;
> - bvm->bi_sector = cc->start + dm_target_offset(ti, bvm->bi_sector);
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int crypt_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -2052,7 +2037,6 @@ static struct target_type crypt_target = {
> .preresume = crypt_preresume,
> .resume = crypt_resume,
> .message = crypt_message,
> - .merge = crypt_merge,
> .iterate_devices = crypt_iterate_devices,
> };
>
> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
> index ad913cd..0119ebf 100644
> --- a/drivers/md/dm-era-target.c
> +++ b/drivers/md/dm-era-target.c
> @@ -1673,20 +1673,6 @@ static int era_iterate_devices(struct dm_target *ti,
> return fn(ti, era->origin_dev, 0, get_dev_size(era->origin_dev), data);
> }
>
> -static int era_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct era *era = ti->private;
> - struct request_queue *q = bdev_get_queue(era->origin_dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = era->origin_dev->bdev;
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static void era_io_hints(struct dm_target *ti, struct queue_limits *limits)
> {
> struct era *era = ti->private;
> @@ -1717,7 +1703,6 @@ static struct target_type era_target = {
> .status = era_status,
> .message = era_message,
> .iterate_devices = era_iterate_devices,
> - .merge = era_merge,
> .io_hints = era_io_hints
> };
>
> diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
> index b257e46..d955b3e 100644
> --- a/drivers/md/dm-flakey.c
> +++ b/drivers/md/dm-flakey.c
> @@ -387,21 +387,6 @@ static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long ar
> return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
> }
>
> -static int flakey_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct flakey_c *fc = ti->private;
> - struct request_queue *q = bdev_get_queue(fc->dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = fc->dev->bdev;
> - bvm->bi_sector = flakey_map_sector(ti, bvm->bi_sector);
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
> {
> struct flakey_c *fc = ti->private;
> @@ -419,7 +404,6 @@ static struct target_type flakey_target = {
> .end_io = flakey_end_io,
> .status = flakey_status,
> .ioctl = flakey_ioctl,
> - .merge = flakey_merge,
> .iterate_devices = flakey_iterate_devices,
> };
>
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 53e848c..7dd5fc8 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -130,21 +130,6 @@ static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
> return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
> }
>
> -static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct linear_c *lc = ti->private;
> - struct request_queue *q = bdev_get_queue(lc->dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = lc->dev->bdev;
> - bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int linear_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -162,7 +147,6 @@ static struct target_type linear_target = {
> .map = linear_map,
> .status = linear_status,
> .ioctl = linear_ioctl,
> - .merge = linear_merge,
> .iterate_devices = linear_iterate_devices,
> };
>
> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
> index 93e0844..4325808 100644
> --- a/drivers/md/dm-log-writes.c
> +++ b/drivers/md/dm-log-writes.c
> @@ -728,21 +728,6 @@ static int log_writes_ioctl(struct dm_target *ti, unsigned int cmd,
> return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
> }
>
> -static int log_writes_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct log_writes_c *lc = ti->private;
> - struct request_queue *q = bdev_get_queue(lc->dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = lc->dev->bdev;
> - bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int log_writes_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn,
> void *data)
> @@ -796,7 +781,6 @@ static struct target_type log_writes_target = {
> .end_io = normal_end_io,
> .status = log_writes_status,
> .ioctl = log_writes_ioctl,
> - .merge = log_writes_merge,
> .message = log_writes_message,
> .iterate_devices = log_writes_iterate_devices,
> .io_hints = log_writes_io_hints,
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index f83a0f3..274cbec 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -2331,20 +2331,6 @@ static void origin_status(struct dm_target *ti, status_type_t type,
> }
> }
>
> -static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct dm_origin *o = ti->private;
> - struct request_queue *q = bdev_get_queue(o->dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = o->dev->bdev;
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int origin_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -2363,7 +2349,6 @@ static struct target_type origin_target = {
> .resume = origin_resume,
> .postsuspend = origin_postsuspend,
> .status = origin_status,
> - .merge = origin_merge,
> .iterate_devices = origin_iterate_devices,
> };
>
> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
> index f8b37d4..09bb2fe 100644
> --- a/drivers/md/dm-stripe.c
> +++ b/drivers/md/dm-stripe.c
> @@ -412,26 +412,6 @@ static void stripe_io_hints(struct dm_target *ti,
> blk_limits_io_opt(limits, chunk_size * sc->stripes);
> }
>
> -static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct stripe_c *sc = ti->private;
> - sector_t bvm_sector = bvm->bi_sector;
> - uint32_t stripe;
> - struct request_queue *q;
> -
> - stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector);
> -
> - q = bdev_get_queue(sc->stripe[stripe].dev->bdev);
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = sc->stripe[stripe].dev->bdev;
> - bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector;
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static struct target_type stripe_target = {
> .name = "striped",
> .version = {1, 5, 1},
> @@ -443,7 +423,6 @@ static struct target_type stripe_target = {
> .status = stripe_status,
> .iterate_devices = stripe_iterate_devices,
> .io_hints = stripe_io_hints,
> - .merge = stripe_merge,
> };
>
> int __init dm_stripe_init(void)
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index d9b00b8..19c9b01 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -440,14 +440,6 @@ static int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
> q->limits.alignment_offset,
> (unsigned long long) start << SECTOR_SHIFT);
>
> - /*
> - * Check if merge fn is supported.
> - * If not we'll force DM to use PAGE_SIZE or
> - * smaller I/O, just to be safe.
> - */
> - if (dm_queue_merge_is_compulsory(q) && !ti->type->merge)
> - blk_limits_max_hw_sectors(limits,
> - (unsigned int) (PAGE_SIZE >> 9));
> return 0;
> }
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 921aafd..03552fe 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -3562,20 +3562,6 @@ static int pool_iterate_devices(struct dm_target *ti,
> return fn(ti, pt->data_dev, 0, ti->len, data);
> }
>
> -static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct pool_c *pt = ti->private;
> - struct request_queue *q = bdev_get_queue(pt->data_dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = pt->data_dev->bdev;
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits)
> {
> struct pool *pool = pt->pool;
> @@ -3667,7 +3653,6 @@ static struct target_type pool_target = {
> .resume = pool_resume,
> .message = pool_message,
> .status = pool_status,
> - .merge = pool_merge,
> .iterate_devices = pool_iterate_devices,
> .io_hints = pool_io_hints,
> };
> @@ -3992,21 +3977,6 @@ err:
> DMEMIT("Error");
> }
>
> -static int thin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct thin_c *tc = ti->private;
> - struct request_queue *q = bdev_get_queue(tc->pool_dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = tc->pool_dev->bdev;
> - bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int thin_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -4041,7 +4011,6 @@ static struct target_type thin_target = {
> .presuspend = thin_presuspend,
> .postsuspend = thin_postsuspend,
> .status = thin_status,
> - .merge = thin_merge,
> .iterate_devices = thin_iterate_devices,
> };
>
> diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
> index 66616db..3b85460 100644
> --- a/drivers/md/dm-verity.c
> +++ b/drivers/md/dm-verity.c
> @@ -648,21 +648,6 @@ static int verity_ioctl(struct dm_target *ti, unsigned cmd,
> cmd, arg);
> }
>
> -static int verity_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{
> - struct dm_verity *v = ti->private;
> - struct request_queue *q = bdev_get_queue(v->data_dev->bdev);
> -
> - if (!q->merge_bvec_fn)
> - return max_size;
> -
> - bvm->bi_bdev = v->data_dev->bdev;
> - bvm->bi_sector = verity_map_sector(v, bvm->bi_sector);
> -
> - return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
> -}
> -
> static int verity_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -995,7 +980,6 @@ static struct target_type verity_target = {
> .map = verity_map,
> .status = verity_status,
> .ioctl = verity_ioctl,
> - .merge = verity_merge,
> .iterate_devices = verity_iterate_devices,
> .io_hints = verity_io_hints,
> };
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 34f6063..f732a7a 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -121,9 +121,8 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
> #define DMF_FREEING 3
> #define DMF_DELETING 4
> #define DMF_NOFLUSH_SUSPENDING 5
> -#define DMF_MERGE_IS_OPTIONAL 6
> -#define DMF_DEFERRED_REMOVE 7
> -#define DMF_SUSPENDED_INTERNALLY 8
> +#define DMF_DEFERRED_REMOVE 6
> +#define DMF_SUSPENDED_INTERNALLY 7
>
> /*
> * A dummy definition to make RCU happy.
> @@ -1717,60 +1716,6 @@ static void __split_and_process_bio(struct mapped_device *md,
> * CRUD END
> *---------------------------------------------------------------*/
>
> -static int dm_merge_bvec(struct request_queue *q,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - struct mapped_device *md = q->queuedata;
> - struct dm_table *map = dm_get_live_table_fast(md);
> - struct dm_target *ti;
> - sector_t max_sectors;
> - int max_size = 0;
> -
> - if (unlikely(!map))
> - goto out;
> -
> - ti = dm_table_find_target(map, bvm->bi_sector);
> - if (!dm_target_is_valid(ti))
> - goto out;
> -
> - /*
> - * Find maximum amount of I/O that won't need splitting
> - */
> - max_sectors = min(max_io_len(bvm->bi_sector, ti),
> - (sector_t) queue_max_sectors(q));
> - max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
> - if (unlikely(max_size < 0)) /* this shouldn't _ever_ happen */
> - max_size = 0;
> -
> - /*
> - * merge_bvec_fn() returns number of bytes
> - * it can accept at this offset
> - * max is precomputed maximal io size
> - */
> - if (max_size && ti->type->merge)
> - max_size = ti->type->merge(ti, bvm, biovec, max_size);
> - /*
> - * If the target doesn't support merge method and some of the devices
> - * provided their merge_bvec method (we know this by looking for the
> - * max_hw_sectors that dm_set_device_limits may set), then we can't
> - * allow bios with multiple vector entries. So always set max_size
> - * to 0, and the code below allows just one page.
> - */
> - else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
> - max_size = 0;
> -
> -out:
> - dm_put_live_table_fast(md);
> - /*
> - * Always allow an entire first page
> - */
> - if (max_size <= biovec->bv_len && !(bvm->bi_size >> SECTOR_SHIFT))
> - max_size = biovec->bv_len;
> -
> - return max_size;
> -}
> -
> /*
> * The request function that just remaps the bio built up by
> * dm_merge_bvec.
> @@ -2477,59 +2422,6 @@ static void __set_size(struct mapped_device *md, sector_t size)
> }
>
> /*
> - * Return 1 if the queue has a compulsory merge_bvec_fn function.
> - *
> - * If this function returns 0, then the device is either a non-dm
> - * device without a merge_bvec_fn, or it is a dm device that is
> - * able to split any bios it receives that are too big.
> - */
> -int dm_queue_merge_is_compulsory(struct request_queue *q)
> -{
> - struct mapped_device *dev_md;
> -
> - if (!q->merge_bvec_fn)
> - return 0;
> -
> - if (q->make_request_fn == dm_make_request) {
> - dev_md = q->queuedata;
> - if (test_bit(DMF_MERGE_IS_OPTIONAL, &dev_md->flags))
> - return 0;
> - }
> -
> - return 1;
> -}
> -
> -static int dm_device_merge_is_compulsory(struct dm_target *ti,
> - struct dm_dev *dev, sector_t start,
> - sector_t len, void *data)
> -{
> - struct block_device *bdev = dev->bdev;
> - struct request_queue *q = bdev_get_queue(bdev);
> -
> - return dm_queue_merge_is_compulsory(q);
> -}
> -
> -/*
> - * Return 1 if it is acceptable to ignore merge_bvec_fn based
> - * on the properties of the underlying devices.
> - */
> -static int dm_table_merge_is_optional(struct dm_table *table)
> -{
> - unsigned i = 0;
> - struct dm_target *ti;
> -
> - while (i < dm_table_get_num_targets(table)) {
> - ti = dm_table_get_target(table, i++);
> -
> - if (ti->type->iterate_devices &&
> - ti->type->iterate_devices(ti, dm_device_merge_is_compulsory, NULL))
> - return 0;
> - }
> -
> - return 1;
> -}
> -
> -/*
> * Returns old map, which caller must destroy.
> */
> static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
> @@ -2538,7 +2430,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
> struct dm_table *old_map;
> struct request_queue *q = md->queue;
> sector_t size;
> - int merge_is_optional;
>
> size = dm_table_get_size(t);
>
> @@ -2564,17 +2455,11 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
>
> __bind_mempools(md, t);
>
> - merge_is_optional = dm_table_merge_is_optional(t);
> -
> old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
> rcu_assign_pointer(md->map, t);
> md->immutable_target_type = dm_table_get_immutable_target_type(t);
>
> dm_table_set_restrictions(t, q, limits);
> - if (merge_is_optional)
> - set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
> - else
> - clear_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
> if (old_map)
> dm_sync_table(md);
>
> @@ -2852,7 +2737,6 @@ int dm_setup_md_queue(struct mapped_device *md)
> case DM_TYPE_BIO_BASED:
> dm_init_old_md_queue(md);
> blk_queue_make_request(md->queue, dm_make_request);
> - blk_queue_merge_bvec(md->queue, dm_merge_bvec);
> break;
> }
>
> diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> index 6123c2b..7d61cca 100644
> --- a/drivers/md/dm.h
> +++ b/drivers/md/dm.h
> @@ -77,8 +77,6 @@ bool dm_table_mq_request_based(struct dm_table *t);
> void dm_table_free_md_mempools(struct dm_table *t);
> struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
>
> -int dm_queue_merge_is_compulsory(struct request_queue *q);
> -
> void dm_lock_md_type(struct mapped_device *md);
> void dm_unlock_md_type(struct mapped_device *md);
> void dm_set_md_type(struct mapped_device *md, unsigned type);
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index fa7d577..8721ef9 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -52,48 +52,6 @@ static inline struct dev_info *which_dev(struct mddev *mddev, sector_t sector)
> return conf->disks + lo;
> }
>
> -/**
> - * linear_mergeable_bvec -- tell bio layer if two requests can be merged
> - * @q: request queue
> - * @bvm: properties of new bio
> - * @biovec: the request that could be merged to it.
> - *
> - * Return amount of bytes we can take at this offset
> - */
> -static int linear_mergeable_bvec(struct mddev *mddev,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - struct dev_info *dev0;
> - unsigned long maxsectors, bio_sectors = bvm->bi_size >> 9;
> - sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
> - int maxbytes = biovec->bv_len;
> - struct request_queue *subq;
> -
> - dev0 = which_dev(mddev, sector);
> - maxsectors = dev0->end_sector - sector;
> - subq = bdev_get_queue(dev0->rdev->bdev);
> - if (subq->merge_bvec_fn) {
> - bvm->bi_bdev = dev0->rdev->bdev;
> - bvm->bi_sector -= dev0->end_sector - dev0->rdev->sectors;
> - maxbytes = min(maxbytes, subq->merge_bvec_fn(subq, bvm,
> - biovec));
> - }
> -
> - if (maxsectors < bio_sectors)
> - maxsectors = 0;
> - else
> - maxsectors -= bio_sectors;
> -
> - if (maxsectors <= (PAGE_SIZE >> 9 ) && bio_sectors == 0)
> - return maxbytes;
> -
> - if (maxsectors > (maxbytes >> 9))
> - return maxbytes;
> - else
> - return maxsectors << 9;
> -}
> -
> static int linear_congested(struct mddev *mddev, int bits)
> {
> struct linear_conf *conf;
> @@ -338,7 +296,6 @@ static struct md_personality linear_personality =
> .size = linear_size,
> .quiesce = linear_quiesce,
> .congested = linear_congested,
> - .mergeable_bvec = linear_mergeable_bvec,
> };
>
> static int __init linear_init (void)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 046b3c9..f101981 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -352,29 +352,6 @@ static int md_congested(void *data, int bits)
> return mddev_congested(mddev, bits);
> }
>
> -static int md_mergeable_bvec(struct request_queue *q,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - struct mddev *mddev = q->queuedata;
> - int ret;
> - rcu_read_lock();
> - if (mddev->suspended) {
> - /* Must always allow one vec */
> - if (bvm->bi_size == 0)
> - ret = biovec->bv_len;
> - else
> - ret = 0;
> - } else {
> - struct md_personality *pers = mddev->pers;
> - if (pers && pers->mergeable_bvec)
> - ret = pers->mergeable_bvec(mddev, bvm, biovec);
> - else
> - ret = biovec->bv_len;
> - }
> - rcu_read_unlock();
> - return ret;
> -}
> /*
> * Generic flush handling for md
> */
> @@ -5165,7 +5142,6 @@ int md_run(struct mddev *mddev)
> if (mddev->queue) {
> mddev->queue->backing_dev_info.congested_data = mddev;
> mddev->queue->backing_dev_info.congested_fn = md_congested;
> - blk_queue_merge_bvec(mddev->queue, md_mergeable_bvec);
> }
> if (pers->sync_request) {
> if (mddev->kobj.sd &&
> @@ -5293,7 +5269,6 @@ static void md_clean(struct mddev *mddev)
> mddev->changed = 0;
> mddev->degraded = 0;
> mddev->safemode = 0;
> - mddev->merge_check_needed = 0;
> mddev->bitmap_info.offset = 0;
> mddev->bitmap_info.default_offset = 0;
> mddev->bitmap_info.default_space = 0;
> @@ -5489,7 +5464,6 @@ static int do_md_stop(struct mddev *mddev, int mode,
>
> __md_stop_writes(mddev);
> __md_stop(mddev);
> - mddev->queue->merge_bvec_fn = NULL;
> mddev->queue->backing_dev_info.congested_fn = NULL;
>
> /* tell userspace to handle 'inactive' */
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 4046a6c..cf7141a 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -133,10 +133,6 @@ enum flag_bits {
> Bitmap_sync, /* ..actually, not quite In_sync. Need a
> * bitmap-based recovery to get fully in sync
> */
> - Unmerged, /* device is being added to array and should
> - * be considerred for bvec_merge_fn but not
> - * yet for actual IO
> - */
> WriteMostly, /* Avoid reading if at all possible */
> AutoDetected, /* added by auto-detect */
> Blocked, /* An error occurred but has not yet
> @@ -373,10 +369,6 @@ struct mddev {
> int degraded; /* whether md should consider
> * adding a spare
> */
> - int merge_check_needed; /* at least one
> - * member device
> - * has a
> - * merge_bvec_fn */
>
> atomic_t recovery_active; /* blocks scheduled, but not written */
> wait_queue_head_t recovery_wait;
> @@ -531,10 +523,6 @@ struct md_personality
> /* congested implements bdi.congested_fn().
> * Will not be called while array is 'suspended' */
> int (*congested)(struct mddev *mddev, int bits);
> - /* mergeable_bvec is use to implement ->merge_bvec_fn */
> - int (*mergeable_bvec)(struct mddev *mddev,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec);
> };
>
> struct md_sysfs_entry {
> diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
> index ac3ede2..7ee27fb 100644
> --- a/drivers/md/multipath.c
> +++ b/drivers/md/multipath.c
> @@ -257,18 +257,6 @@ static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> disk_stack_limits(mddev->gendisk, rdev->bdev,
> rdev->data_offset << 9);
>
> - /* as we don't honour merge_bvec_fn, we must never risk
> - * violating it, so limit ->max_segments to one, lying
> - * within a single page.
> - * (Note: it is very unlikely that a device with
> - * merge_bvec_fn will be involved in multipath.)
> - */
> - if (q->merge_bvec_fn) {
> - blk_queue_max_segments(mddev->queue, 1);
> - blk_queue_segment_boundary(mddev->queue,
> - PAGE_CACHE_SIZE - 1);
> - }
> -
> spin_lock_irq(&conf->device_lock);
> mddev->degraded--;
> rdev->raid_disk = path;
> @@ -432,15 +420,6 @@ static int multipath_run (struct mddev *mddev)
> disk_stack_limits(mddev->gendisk, rdev->bdev,
> rdev->data_offset << 9);
>
> - /* as we don't honour merge_bvec_fn, we must never risk
> - * violating it, not that we ever expect a device with
> - * a merge_bvec_fn to be involved in multipath */
> - if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
> - blk_queue_max_segments(mddev->queue, 1);
> - blk_queue_segment_boundary(mddev->queue,
> - PAGE_CACHE_SIZE - 1);
> - }
> -
> if (!test_bit(Faulty, &rdev->flags))
> working_disks++;
> }
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 6a68ef5..1440bd4 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -192,9 +192,6 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
> disk_stack_limits(mddev->gendisk, rdev1->bdev,
> rdev1->data_offset << 9);
>
> - if (rdev1->bdev->bd_disk->queue->merge_bvec_fn)
> - conf->has_merge_bvec = 1;
> -
> if (!smallest || (rdev1->sectors < smallest->sectors))
> smallest = rdev1;
> cnt++;
> @@ -351,58 +348,6 @@ static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
> + sector_div(sector, zone->nb_dev)];
> }
>
> -/**
> - * raid0_mergeable_bvec -- tell bio layer if two requests can be merged
> - * @mddev: the md device
> - * @bvm: properties of new bio
> - * @biovec: the request that could be merged to it.
> - *
> - * Return amount of bytes we can accept at this offset
> - */
> -static int raid0_mergeable_bvec(struct mddev *mddev,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - struct r0conf *conf = mddev->private;
> - sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
> - sector_t sector_offset = sector;
> - int max;
> - unsigned int chunk_sectors = mddev->chunk_sectors;
> - unsigned int bio_sectors = bvm->bi_size >> 9;
> - struct strip_zone *zone;
> - struct md_rdev *rdev;
> - struct request_queue *subq;
> -
> - if (is_power_of_2(chunk_sectors))
> - max = (chunk_sectors - ((sector & (chunk_sectors-1))
> - + bio_sectors)) << 9;
> - else
> - max = (chunk_sectors - (sector_div(sector, chunk_sectors)
> - + bio_sectors)) << 9;
> - if (max < 0)
> - max = 0; /* bio_add cannot handle a negative return */
> - if (max <= biovec->bv_len && bio_sectors == 0)
> - return biovec->bv_len;
> - if (max < biovec->bv_len)
> - /* too small already, no need to check further */
> - return max;
> - if (!conf->has_merge_bvec)
> - return max;
> -
> - /* May need to check subordinate device */
> - sector = sector_offset;
> - zone = find_zone(mddev->private, &sector_offset);
> - rdev = map_sector(mddev, zone, sector, &sector_offset);
> - subq = bdev_get_queue(rdev->bdev);
> - if (subq->merge_bvec_fn) {
> - bvm->bi_bdev = rdev->bdev;
> - bvm->bi_sector = sector_offset + zone->dev_start +
> - rdev->data_offset;
> - return min(max, subq->merge_bvec_fn(subq, bvm, biovec));
> - } else
> - return max;
> -}
> -
> static sector_t raid0_size(struct mddev *mddev, sector_t sectors, int raid_disks)
> {
> sector_t array_sectors = 0;
> @@ -725,7 +670,6 @@ static struct md_personality raid0_personality=
> .takeover = raid0_takeover,
> .quiesce = raid0_quiesce,
> .congested = raid0_congested,
> - .mergeable_bvec = raid0_mergeable_bvec,
> };
>
> static int __init raid0_init (void)
> diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
> index 05539d9..7127a62 100644
> --- a/drivers/md/raid0.h
> +++ b/drivers/md/raid0.h
> @@ -12,8 +12,6 @@ struct r0conf {
> struct md_rdev **devlist; /* lists of rdevs, pointed to
> * by strip_zone->dev */
> int nr_strip_zones;
> - int has_merge_bvec; /* at least one member has
> - * a merge_bvec_fn */
> };
>
> #endif
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 9157a29..478878f 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -557,7 +557,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> rdev = rcu_dereference(conf->mirrors[disk].rdev);
> if (r1_bio->bios[disk] == IO_BLOCKED
> || rdev == NULL
> - || test_bit(Unmerged, &rdev->flags)
> || test_bit(Faulty, &rdev->flags))
> continue;
> if (!test_bit(In_sync, &rdev->flags) &&
> @@ -708,38 +707,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> return best_disk;
> }
>
> -static int raid1_mergeable_bvec(struct mddev *mddev,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - struct r1conf *conf = mddev->private;
> - sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
> - int max = biovec->bv_len;
> -
> - if (mddev->merge_check_needed) {
> - int disk;
> - rcu_read_lock();
> - for (disk = 0; disk < conf->raid_disks * 2; disk++) {
> - struct md_rdev *rdev = rcu_dereference(
> - conf->mirrors[disk].rdev);
> - if (rdev && !test_bit(Faulty, &rdev->flags)) {
> - struct request_queue *q =
> - bdev_get_queue(rdev->bdev);
> - if (q->merge_bvec_fn) {
> - bvm->bi_sector = sector +
> - rdev->data_offset;
> - bvm->bi_bdev = rdev->bdev;
> - max = min(max, q->merge_bvec_fn(
> - q, bvm, biovec));
> - }
> - }
> - }
> - rcu_read_unlock();
> - }
> - return max;
> -
> -}
> -
> static int raid1_congested(struct mddev *mddev, int bits)
> {
> struct r1conf *conf = mddev->private;
> @@ -1268,8 +1235,7 @@ read_again:
> break;
> }
> r1_bio->bios[i] = NULL;
> - if (!rdev || test_bit(Faulty, &rdev->flags)
> - || test_bit(Unmerged, &rdev->flags)) {
> + if (!rdev || test_bit(Faulty, &rdev->flags)) {
> if (i < conf->raid_disks)
> set_bit(R1BIO_Degraded, &r1_bio->state);
> continue;
> @@ -1614,7 +1580,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> struct raid1_info *p;
> int first = 0;
> int last = conf->raid_disks - 1;
> - struct request_queue *q = bdev_get_queue(rdev->bdev);
>
> if (mddev->recovery_disabled == conf->recovery_disabled)
> return -EBUSY;
> @@ -1622,11 +1587,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> if (rdev->raid_disk >= 0)
> first = last = rdev->raid_disk;
>
> - if (q->merge_bvec_fn) {
> - set_bit(Unmerged, &rdev->flags);
> - mddev->merge_check_needed = 1;
> - }
> -
> for (mirror = first; mirror <= last; mirror++) {
> p = conf->mirrors+mirror;
> if (!p->rdev) {
> @@ -1658,19 +1618,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> break;
> }
> }
> - if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
> - /* Some requests might not have seen this new
> - * merge_bvec_fn. We must wait for them to complete
> - * before merging the device fully.
> - * First we make sure any code which has tested
> - * our function has submitted the request, then
> - * we wait for all outstanding requests to complete.
> - */
> - synchronize_sched();
> - freeze_array(conf, 0);
> - unfreeze_array(conf);
> - clear_bit(Unmerged, &rdev->flags);
> - }
> md_integrity_add_rdev(rdev, mddev);
> if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
> queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
> @@ -2807,8 +2754,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
> goto abort;
> disk->rdev = rdev;
> q = bdev_get_queue(rdev->bdev);
> - if (q->merge_bvec_fn)
> - mddev->merge_check_needed = 1;
>
> disk->head_position = 0;
> disk->seq_start = MaxSector;
> @@ -3173,7 +3118,6 @@ static struct md_personality raid1_personality =
> .quiesce = raid1_quiesce,
> .takeover = raid1_takeover,
> .congested = raid1_congested,
> - .mergeable_bvec = raid1_mergeable_bvec,
> };
>
> static int __init raid_init(void)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index e793ab6..a46c402 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -672,93 +672,6 @@ static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
> return (vchunk << geo->chunk_shift) + offset;
> }
>
> -/**
> - * raid10_mergeable_bvec -- tell bio layer if a two requests can be merged
> - * @mddev: the md device
> - * @bvm: properties of new bio
> - * @biovec: the request that could be merged to it.
> - *
> - * Return amount of bytes we can accept at this offset
> - * This requires checking for end-of-chunk if near_copies != raid_disks,
> - * and for subordinate merge_bvec_fns if merge_check_needed.
> - */
> -static int raid10_mergeable_bvec(struct mddev *mddev,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - struct r10conf *conf = mddev->private;
> - sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
> - int max;
> - unsigned int chunk_sectors;
> - unsigned int bio_sectors = bvm->bi_size >> 9;
> - struct geom *geo = &conf->geo;
> -
> - chunk_sectors = (conf->geo.chunk_mask & conf->prev.chunk_mask) + 1;
> - if (conf->reshape_progress != MaxSector &&
> - ((sector >= conf->reshape_progress) !=
> - conf->mddev->reshape_backwards))
> - geo = &conf->prev;
> -
> - if (geo->near_copies < geo->raid_disks) {
> - max = (chunk_sectors - ((sector & (chunk_sectors - 1))
> - + bio_sectors)) << 9;
> - if (max < 0)
> - /* bio_add cannot handle a negative return */
> - max = 0;
> - if (max <= biovec->bv_len && bio_sectors == 0)
> - return biovec->bv_len;
> - } else
> - max = biovec->bv_len;
> -
> - if (mddev->merge_check_needed) {
> - struct {
> - struct r10bio r10_bio;
> - struct r10dev devs[conf->copies];
> - } on_stack;
> - struct r10bio *r10_bio = &on_stack.r10_bio;
> - int s;
> - if (conf->reshape_progress != MaxSector) {
> - /* Cannot give any guidance during reshape */
> - if (max <= biovec->bv_len && bio_sectors == 0)
> - return biovec->bv_len;
> - return 0;
> - }
> - r10_bio->sector = sector;
> - raid10_find_phys(conf, r10_bio);
> - rcu_read_lock();
> - for (s = 0; s < conf->copies; s++) {
> - int disk = r10_bio->devs[s].devnum;
> - struct md_rdev *rdev = rcu_dereference(
> - conf->mirrors[disk].rdev);
> - if (rdev && !test_bit(Faulty, &rdev->flags)) {
> - struct request_queue *q =
> - bdev_get_queue(rdev->bdev);
> - if (q->merge_bvec_fn) {
> - bvm->bi_sector = r10_bio->devs[s].addr
> - + rdev->data_offset;
> - bvm->bi_bdev = rdev->bdev;
> - max = min(max, q->merge_bvec_fn(
> - q, bvm, biovec));
> - }
> - }
> - rdev = rcu_dereference(conf->mirrors[disk].replacement);
> - if (rdev && !test_bit(Faulty, &rdev->flags)) {
> - struct request_queue *q =
> - bdev_get_queue(rdev->bdev);
> - if (q->merge_bvec_fn) {
> - bvm->bi_sector = r10_bio->devs[s].addr
> - + rdev->data_offset;
> - bvm->bi_bdev = rdev->bdev;
> - max = min(max, q->merge_bvec_fn(
> - q, bvm, biovec));
> - }
> - }
> - }
> - rcu_read_unlock();
> - }
> - return max;
> -}
> -
> /*
> * This routine returns the disk from which the requested read should
> * be done. There is a per-array 'next expected sequential IO' sector
> @@ -821,12 +734,10 @@ retry:
> disk = r10_bio->devs[slot].devnum;
> rdev = rcu_dereference(conf->mirrors[disk].replacement);
> if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
> - test_bit(Unmerged, &rdev->flags) ||
> r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
> rdev = rcu_dereference(conf->mirrors[disk].rdev);
> if (rdev == NULL ||
> - test_bit(Faulty, &rdev->flags) ||
> - test_bit(Unmerged, &rdev->flags))
> + test_bit(Faulty, &rdev->flags))
> continue;
> if (!test_bit(In_sync, &rdev->flags) &&
> r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
> @@ -1326,11 +1237,9 @@ retry_write:
> blocked_rdev = rrdev;
> break;
> }
> - if (rdev && (test_bit(Faulty, &rdev->flags)
> - || test_bit(Unmerged, &rdev->flags)))
> + if (rdev && (test_bit(Faulty, &rdev->flags)))
> rdev = NULL;
> - if (rrdev && (test_bit(Faulty, &rrdev->flags)
> - || test_bit(Unmerged, &rrdev->flags)))
> + if (rrdev && (test_bit(Faulty, &rrdev->flags)))
> rrdev = NULL;
>
> r10_bio->devs[i].bio = NULL;
> @@ -1777,7 +1686,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> int mirror;
> int first = 0;
> int last = conf->geo.raid_disks - 1;
> - struct request_queue *q = bdev_get_queue(rdev->bdev);
>
> if (mddev->recovery_cp < MaxSector)
> /* only hot-add to in-sync arrays, as recovery is
> @@ -1790,11 +1698,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> if (rdev->raid_disk >= 0)
> first = last = rdev->raid_disk;
>
> - if (q->merge_bvec_fn) {
> - set_bit(Unmerged, &rdev->flags);
> - mddev->merge_check_needed = 1;
> - }
> -
> if (rdev->saved_raid_disk >= first &&
> conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
> mirror = rdev->saved_raid_disk;
> @@ -1833,19 +1736,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> rcu_assign_pointer(p->rdev, rdev);
> break;
> }
> - if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
> - /* Some requests might not have seen this new
> - * merge_bvec_fn. We must wait for them to complete
> - * before merging the device fully.
> - * First we make sure any code which has tested
> - * our function has submitted the request, then
> - * we wait for all outstanding requests to complete.
> - */
> - synchronize_sched();
> - freeze_array(conf, 0);
> - unfreeze_array(conf);
> - clear_bit(Unmerged, &rdev->flags);
> - }
> md_integrity_add_rdev(rdev, mddev);
> if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
> queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
> @@ -2404,7 +2294,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
> d = r10_bio->devs[sl].devnum;
> rdev = rcu_dereference(conf->mirrors[d].rdev);
> if (rdev &&
> - !test_bit(Unmerged, &rdev->flags) &&
> test_bit(In_sync, &rdev->flags) &&
> is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
> &first_bad, &bad_sectors) == 0) {
> @@ -2458,7 +2347,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
> d = r10_bio->devs[sl].devnum;
> rdev = rcu_dereference(conf->mirrors[d].rdev);
> if (!rdev ||
> - test_bit(Unmerged, &rdev->flags) ||
> !test_bit(In_sync, &rdev->flags))
> continue;
>
> @@ -3652,8 +3540,6 @@ static int run(struct mddev *mddev)
> disk->rdev = rdev;
> }
> q = bdev_get_queue(rdev->bdev);
> - if (q->merge_bvec_fn)
> - mddev->merge_check_needed = 1;
> diff = (rdev->new_data_offset - rdev->data_offset);
> if (!mddev->reshape_backwards)
> diff = -diff;
> @@ -4706,7 +4592,6 @@ static struct md_personality raid10_personality =
> .start_reshape = raid10_start_reshape,
> .finish_reshape = raid10_finish_reshape,
> .congested = raid10_congested,
> - .mergeable_bvec = raid10_mergeable_bvec,
> };
>
> static int __init raid_init(void)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index b6c6ace..18d2b23 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4625,35 +4625,6 @@ static int raid5_congested(struct mddev *mddev, int bits)
> return 0;
> }
>
> -/* We want read requests to align with chunks where possible,
> - * but write requests don't need to.
> - */
> -static int raid5_mergeable_bvec(struct mddev *mddev,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec)
> -{
> - sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
> - int max;
> - unsigned int chunk_sectors = mddev->chunk_sectors;
> - unsigned int bio_sectors = bvm->bi_size >> 9;
> -
> - /*
> - * always allow writes to be mergeable, read as well if array
> - * is degraded as we'll go through stripe cache anyway.
> - */
> - if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
> - return biovec->bv_len;
> -
> - if (mddev->new_chunk_sectors < mddev->chunk_sectors)
> - chunk_sectors = mddev->new_chunk_sectors;
> - max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
> - if (max < 0) max = 0;
> - if (max <= biovec->bv_len && bio_sectors == 0)
> - return biovec->bv_len;
> - else
> - return max;
> -}
> -
> static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
> {
> sector_t sector = bio->bi_iter.bi_sector + get_start_sect(bio->bi_bdev);
> @@ -7722,7 +7693,6 @@ static struct md_personality raid6_personality =
> .quiesce = raid5_quiesce,
> .takeover = raid6_takeover,
> .congested = raid5_congested,
> - .mergeable_bvec = raid5_mergeable_bvec,
> };
> static struct md_personality raid5_personality =
> {
> @@ -7746,7 +7716,6 @@ static struct md_personality raid5_personality =
> .quiesce = raid5_quiesce,
> .takeover = raid5_takeover,
> .congested = raid5_congested,
> - .mergeable_bvec = raid5_mergeable_bvec,
> };
>
> static struct md_personality raid4_personality =
> @@ -7771,7 +7740,6 @@ static struct md_personality raid4_personality =
> .quiesce = raid5_quiesce,
> .takeover = raid4_takeover,
> .congested = raid5_congested,
> - .mergeable_bvec = raid5_mergeable_bvec,
> };
>
> static int __init raid5_init(void)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 93b81a2..6927b76 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -239,14 +239,6 @@ typedef int (prep_rq_fn) (struct request_queue *, struct request *);
> typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
>
> struct bio_vec;
> -struct bvec_merge_data {
> - struct block_device *bi_bdev;
> - sector_t bi_sector;
> - unsigned bi_size;
> - unsigned long bi_rw;
> -};
> -typedef int (merge_bvec_fn) (struct request_queue *, struct bvec_merge_data *,
> - struct bio_vec *);
> typedef void (softirq_done_fn)(struct request *);
> typedef int (dma_drain_needed_fn)(struct request *);
> typedef int (lld_busy_fn) (struct request_queue *q);
> @@ -331,7 +323,6 @@ struct request_queue {
> make_request_fn *make_request_fn;
> prep_rq_fn *prep_rq_fn;
> unprep_rq_fn *unprep_rq_fn;
> - merge_bvec_fn *merge_bvec_fn;
> softirq_done_fn *softirq_done_fn;
> rq_timed_out_fn *rq_timed_out_fn;
> dma_drain_needed_fn *dma_drain_needed;
> @@ -1041,7 +1032,6 @@ extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
> extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
> extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
> extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
> -extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
> extern void blk_queue_dma_alignment(struct request_queue *, int);
> extern void blk_queue_update_dma_alignment(struct request_queue *, int);
> extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 51cc1de..76d23fa 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -82,9 +82,6 @@ typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
> typedef int (*dm_ioctl_fn) (struct dm_target *ti, unsigned int cmd,
> unsigned long arg);
>
> -typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size);
> -
> /*
> * These iteration functions are typically used to check (and combine)
> * properties of underlying devices.
> @@ -160,7 +157,6 @@ struct target_type {
> dm_status_fn status;
> dm_message_fn message;
> dm_ioctl_fn ioctl;
> - dm_merge_fn merge;
> dm_busy_fn busy;
> dm_iterate_devices_fn iterate_devices;
> dm_io_hints_fn io_hints;



2015-05-25 07:03:24

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Sun, May 24, 2015 at 10:48 PM, NeilBrown <[email protected]> wrote:
> On Fri, 22 May 2015 11:18:38 -0700 Ming Lin <[email protected]> wrote:
>
>> From: Kent Overstreet <[email protected]>
>>
>> Remove bio_fits_rdev() completely, because ->merge_bvec_fn() has now
>> gone. There's no point in calling bio_fits_rdev() only for ensuring
>> aligned read from rdev.
>
> Surely this patch should come *before*
> [PATCH v4 07/11] md/raid5: split bio for chunk_aligned_read

PATCH 6, then PATCH 7, isn't it already *before*?

>
> and the comment says ->merge_bvec_fn() has gone, but that isn't until
> [PATCH v4 08/11] block: kill merge_bvec_fn() completely
>
>
> If those issues are resolved, then

How about this?

PATCH 6: md/raid5: split bio for chunk_aligned_read
PATCH 7: block: kill merge_bvec_fn() completely
PATCH 8: md/raid5: get rid of bio_fits_rdev()

Thanks.

>
> Acked-by: NeilBrown <[email protected]>
>
> Thanks,
> NeilBrown
>
>
>>
>> Cc: Neil Brown <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Kent Overstreet <[email protected]>
>> [dpark: add more description in commit message]
>> Signed-off-by: Dongsu Park <[email protected]>
>> Signed-off-by: Ming Lin <[email protected]>
>> ---
>> drivers/md/raid5.c | 23 +----------------------
>> 1 file changed, 1 insertion(+), 22 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 1ba97fd..b303ded 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -4743,25 +4743,6 @@ static void raid5_align_endio(struct bio *bi, int error)
>> add_bio_to_retry(raid_bi, conf);
>> }
>>
>> -static int bio_fits_rdev(struct bio *bi)
>> -{
>> - struct request_queue *q = bdev_get_queue(bi->bi_bdev);
>> -
>> - if (bio_sectors(bi) > queue_max_sectors(q))
>> - return 0;
>> - blk_recount_segments(q, bi);
>> - if (bi->bi_phys_segments > queue_max_segments(q))
>> - return 0;
>> -
>> - if (q->merge_bvec_fn)
>> - /* it's too hard to apply the merge_bvec_fn at this stage,
>> - * just just give up
>> - */
>> - return 0;
>> -
>> - return 1;
>> -}
>> -
>> static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
>> {
>> struct r5conf *conf = mddev->private;
>> @@ -4815,11 +4796,9 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
>> align_bi->bi_bdev = rdev->bdev;
>> __clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);
>>
>> - if (!bio_fits_rdev(align_bi) ||
>> - is_badblock(rdev, align_bi->bi_iter.bi_sector,
>> + if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
>> bio_sectors(align_bi),
>> &first_bad, &bad_sectors)) {
>> - /* too big in some way, or has a known bad block */
>> bio_put(align_bi);
>> rdev_dec_pending(rdev, mddev);
>> return 0;
>

2015-05-25 07:54:28

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Mon, 25 May 2015 00:03:20 -0700 Ming Lin <[email protected]> wrote:

> On Sun, May 24, 2015 at 10:48 PM, NeilBrown <[email protected]> wrote:
> > On Fri, 22 May 2015 11:18:38 -0700 Ming Lin <[email protected]> wrote:
> >
> >> From: Kent Overstreet <[email protected]>
> >>
> >> Remove bio_fits_rdev() completely, because ->merge_bvec_fn() has now
> >> gone. There's no point in calling bio_fits_rdev() only for ensuring
> >> aligned read from rdev.
> >
> > Surely this patch should come *before*
> > [PATCH v4 07/11] md/raid5: split bio for chunk_aligned_read
>
> PATCH 6, then PATCH 7, isn't it already *before*?

Did I write that? I guess I did :-(
I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
chunk_aligned_read().

Sorry.

>
> >
> > and the comment says ->merge_bvec_fn() has gone, but that isn't until
> > [PATCH v4 08/11] block: kill merge_bvec_fn() completely
> >
> >
> > If those issues are resolved, then
>
> How about this?
>
> PATCH 6: md/raid5: split bio for chunk_aligned_read
> PATCH 7: block: kill merge_bvec_fn() completely
> PATCH 8: md/raid5: get rid of bio_fits_rdev()

Yes for "get rid of bio_fits_rdev()" after "split bio for chunk_aligned_read".

For the other issue, you could do as you suggest, or you could just change
the comment.
Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
performed by blk_queue_split() in md_make_request().

Up to you.

Thanks,
NeilBrown


>
> Thanks.
>
> >
> > Acked-by: NeilBrown <[email protected]>
> >
> > Thanks,
> > NeilBrown
> >
> >
> >>
> >> Cc: Neil Brown <[email protected]>
> >> Cc: [email protected]
> >> Signed-off-by: Kent Overstreet <[email protected]>
> >> [dpark: add more description in commit message]
> >> Signed-off-by: Dongsu Park <[email protected]>
> >> Signed-off-by: Ming Lin <[email protected]>
> >> ---
> >> drivers/md/raid5.c | 23 +----------------------
> >> 1 file changed, 1 insertion(+), 22 deletions(-)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index 1ba97fd..b303ded 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -4743,25 +4743,6 @@ static void raid5_align_endio(struct bio *bi, int error)
> >> add_bio_to_retry(raid_bi, conf);
> >> }
> >>
> >> -static int bio_fits_rdev(struct bio *bi)
> >> -{
> >> - struct request_queue *q = bdev_get_queue(bi->bi_bdev);
> >> -
> >> - if (bio_sectors(bi) > queue_max_sectors(q))
> >> - return 0;
> >> - blk_recount_segments(q, bi);
> >> - if (bi->bi_phys_segments > queue_max_segments(q))
> >> - return 0;
> >> -
> >> - if (q->merge_bvec_fn)
> >> - /* it's too hard to apply the merge_bvec_fn at this stage,
> >> - * just just give up
> >> - */
> >> - return 0;
> >> -
> >> - return 1;
> >> -}
> >> -
> >> static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
> >> {
> >> struct r5conf *conf = mddev->private;
> >> @@ -4815,11 +4796,9 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
> >> align_bi->bi_bdev = rdev->bdev;
> >> __clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);
> >>
> >> - if (!bio_fits_rdev(align_bi) ||
> >> - is_badblock(rdev, align_bi->bi_iter.bi_sector,
> >> + if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
> >> bio_sectors(align_bi),
> >> &first_bad, &bad_sectors)) {
> >> - /* too big in some way, or has a known bad block */
> >> bio_put(align_bi);
> >> rdev_dec_pending(rdev, mddev);
> >> return 0;
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/



2015-05-25 13:51:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

On Sun, May 24, 2015 at 12:37:32AM -0700, Ming Lin wrote:
> > Except for that these changes looks good, and the previous version
> > passed my tests fine, so with some benchmarks you'll have my ACK.
>
> I'll test it on a 2 sockets server with 10 NVMe drives on Monday.
> I'm going to run fio tests:
> 1. raw NVMe drives direct IO read/write
> 2. ext4 read/write
>
> Let me know if you have other tests that I can run.

That sounds like a good start, but the most important tests would be
those that will cause a lot of splits with the new code.

E.g. some old ATA devices using the piix driver, some crappy USB
device that just allows 64 sector transfers. Or maybe it's better
to just simulate the case by dropping max_sectors to ease some pain :)

The other case is DM/MD stripes or RAID5/6 with small stripe sizes.

2015-05-25 14:04:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
> From: Kent Overstreet <[email protected]>
>
> As generic_make_request() is now able to handle arbitrarily sized bios,
> it's no longer necessary for each individual block driver to define its
> own ->merge_bvec_fn() callback. Remove every invocation completely.

It might be good to replace patch 1 and this one by a patch per driver
to remove the merge_bvec_fn instance and add the blk_queue_split call
for all those drivers that actually had a ->merge_bvec_fn. As some
of them were non-trivial attention from the maintainers would be helpful,
and a patch per driver might help with that.

> -/* This is called by bio_add_page().
> - *
> - * q->max_hw_sectors and other global limits are already enforced there.
> - *
> - * We need to call down to our lower level device,
> - * in case it has special restrictions.
> - *
> - * We also may need to enforce configured max-bio-bvecs limits.
> - *
> - * As long as the BIO is empty we have to allow at least one bvec,
> - * regardless of size and offset, so no need to ask lower levels.
> - */
> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)


This just checks the lower device, so it looks obviously fine.

> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> - struct bio_vec *bvec)
> -{
> - struct pktcdvd_device *pd = q->queuedata;
> - sector_t zone = get_zone(bmd->bi_sector, pd);
> - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
> - int remaining = (pd->settings.size << 9) - used;
> - int remaining2;
> -
> - /*
> - * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
> - * boundary, pkt_make_request() will split the bio.
> - */
> - remaining2 = PAGE_SIZE - bmd->bi_size;
> - remaining = max(remaining, remaining2);
> -
> - BUG_ON(remaining < 0);
> - return remaining;
> -}

As mentioned in the comment pkt_make_request will split the bio so pkt
looks fine.

> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ec6c5c6..f50edb3 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> return BLK_MQ_RQ_QUEUE_OK;
> }
>
> -/*
> - * a queue callback. Makes sure that we don't create a bio that spans across
> - * multiple osd objects. One exception would be with a single page bios,
> - * which we handle later at bio_chain_clone_range()
> - */
> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> - struct bio_vec *bvec)

It seems rbd handles requests spanning objects just fine, so I don't
really understand why rbd_merge_bvec even exists. Getting some form
of ACK from the ceph folks would be useful.

> -/*
> - * We assume I/O is going to the origin (which is the volume
> - * more likely to have restrictions e.g. by being striped).
> - * (Looking up the exact location of the data would be expensive
> - * and could always be out of date by the time the bio is submitted.)
> - */
> -static int cache_bvec_merge(struct dm_target *ti,
> - struct bvec_merge_data *bvm,
> - struct bio_vec *biovec, int max_size)
> -{

DM seems to have the most complex merge functions of all drivers, so
I'd really love to see an ACK from Mike.

2015-05-25 14:17:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Mon, May 25, 2015 at 05:54:14PM +1000, NeilBrown wrote:
> Did I write that? I guess I did :-(
> I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
> chunk_aligned_read().

I suspect the whole series could use some reordering.

patch 1:

add ->bio_split and blk_queue_split

patch 2..n:

one for each non-trivial driver that implements ->merge_bvec_fn to
remove it and instead split bios in ->make_request. The md patch
to do the right thing in chunk_aligned_read goes into the general
md patch here. The bcache patch also goes into this series.

patch n+1:

- add blk_queue_split calls for remaining trivial drivers

patch n+2:

- remove ->merge_bvec_fn and checking of max_sectors for all
drivers, simplify bio_add_page

patch n+3:

- remove splitting in blkdev_issue_discard

patch n+4:

- remove bio_fits_rdev

patch n+5:

- remove bio_get_nr_vecs

patch n+6:

- use bio_add_page

patch n+7:

- update documentation

2015-05-25 15:02:36

by Ilya Dryomov

[permalink] [raw]
Subject: Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig <[email protected]> wrote:
> On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
>> From: Kent Overstreet <[email protected]>
>>
>> As generic_make_request() is now able to handle arbitrarily sized bios,
>> it's no longer necessary for each individual block driver to define its
>> own ->merge_bvec_fn() callback. Remove every invocation completely.
>
> It might be good to replace patch 1 and this one by a patch per driver
> to remove the merge_bvec_fn instance and add the blk_queue_split call
> for all those drivers that actually had a ->merge_bvec_fn. As some
> of them were non-trivial attention from the maintainers would be helpful,
> and a patch per driver might help with that.
>
>> -/* This is called by bio_add_page().
>> - *
>> - * q->max_hw_sectors and other global limits are already enforced there.
>> - *
>> - * We need to call down to our lower level device,
>> - * in case it has special restrictions.
>> - *
>> - * We also may need to enforce configured max-bio-bvecs limits.
>> - *
>> - * As long as the BIO is empty we have to allow at least one bvec,
>> - * regardless of size and offset, so no need to ask lower levels.
>> - */
>> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
>
>
> This just checks the lower device, so it looks obviously fine.
>
>> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
>> - struct bio_vec *bvec)
>> -{
>> - struct pktcdvd_device *pd = q->queuedata;
>> - sector_t zone = get_zone(bmd->bi_sector, pd);
>> - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
>> - int remaining = (pd->settings.size << 9) - used;
>> - int remaining2;
>> -
>> - /*
>> - * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
>> - * boundary, pkt_make_request() will split the bio.
>> - */
>> - remaining2 = PAGE_SIZE - bmd->bi_size;
>> - remaining = max(remaining, remaining2);
>> -
>> - BUG_ON(remaining < 0);
>> - return remaining;
>> -}
>
> As mentioned in the comment pkt_make_request will split the bio so pkt
> looks fine.
>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index ec6c5c6..f50edb3 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
>> return BLK_MQ_RQ_QUEUE_OK;
>> }
>>
>> -/*
>> - * a queue callback. Makes sure that we don't create a bio that spans across
>> - * multiple osd objects. One exception would be with a single page bios,
>> - * which we handle later at bio_chain_clone_range()
>> - */
>> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
>> - struct bio_vec *bvec)
>
> It seems rbd handles requests spanning objects just fine, so I don't
> really understand why rbd_merge_bvec even exists. Getting some form
> of ACK from the ceph folks would be useful.

I'm not Alex, but yeah, we have all the clone/split machinery and so we
can handle a spanning case just fine. I think rbd_merge_bvec() exists
to make sure we don't have to do that unless it's really necessary -
like when a single page gets submitted at an inconvenient offset.

I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
rbd_init_disk() but I haven't had a chance to play with it yet. In any
case, we should be fine with getting rid of rbd_merge_bvec(). If this
ends up a per-driver patchset, I can make rbd_merge_bvec() ->
blk_queue_chunk_sectors() a single patch and push it through
ceph-client.git.

Thanks,

Ilya

2015-05-25 15:08:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
> I'm not Alex, but yeah, we have all the clone/split machinery and so we
> can handle a spanning case just fine. I think rbd_merge_bvec() exists
> to make sure we don't have to do that unless it's really necessary -
> like when a single page gets submitted at an inconvenient offset.
>
> I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
> rbd_init_disk() but I haven't had a chance to play with it yet. In any
> case, we should be fine with getting rid of rbd_merge_bvec(). If this
> ends up a per-driver patchset, I can make rbd_merge_bvec() ->
> blk_queue_chunk_sectors() a single patch and push it through
> ceph-client.git.

Hmm, looks like the new blk_queue_split_bio ignores the chunk_sectors
value, another thing that needs updating. I forgot how many weird
merging hacks we had to add for nvme..

While I'd like to see per-driver patches we'd still need to merge
them together through the block tree. Note that with this series
there won't be any benefit of using blk_queue_chunk_sectors over just
doing the split in rbd. Maybe we can even remove it again and do
that work in the drivers in the future.

2015-05-25 15:19:36

by Ilya Dryomov

[permalink] [raw]
Subject: Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

On Mon, May 25, 2015 at 6:08 PM, Christoph Hellwig <[email protected]> wrote:
> On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
>> I'm not Alex, but yeah, we have all the clone/split machinery and so we
>> can handle a spanning case just fine. I think rbd_merge_bvec() exists
>> to make sure we don't have to do that unless it's really necessary -
>> like when a single page gets submitted at an inconvenient offset.
>>
>> I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
>> rbd_init_disk() but I haven't had a chance to play with it yet. In any
>> case, we should be fine with getting rid of rbd_merge_bvec(). If this
>> ends up a per-driver patchset, I can make rbd_merge_bvec() ->
>> blk_queue_chunk_sectors() a single patch and push it through
>> ceph-client.git.
>
> Hmm, looks like the new blk_queue_split_bio ignore the chunk_sectors
> value, another thing that needs updating. I forgot how many weird
> merging hacks we had to add for nvme..
>
> While I'd like to see per-driver patches we'd still need to merge
> them together through the block tree. Note that with this series
> there won't be any benefit of using blk_queue_chunk_sectors over just
> doing the split in rbd. Maybe we can even remove it again and do
> that work in the drivers in the future.

OK, I'll drop it, especially if it's potentially on its way out. With
the fancy striping support, which I'll hopefully get to sometime, the
striping pattern will become much more complicated anyway, so relying
on rbd doing bio splitting is right in the long run as well.

Thanks,

Ilya

2015-05-25 15:35:29

by Alex Elder

[permalink] [raw]
Subject: Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

On 05/25/2015 10:02 AM, Ilya Dryomov wrote:
> On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig <[email protected]> wrote:
>> On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
>>> From: Kent Overstreet <[email protected]>
>>>
>>> As generic_make_request() is now able to handle arbitrarily sized bios,
>>> it's no longer necessary for each individual block driver to define its
>>> own ->merge_bvec_fn() callback. Remove every invocation completely.
>>
>> It might be good to replace patch 1 and this one by a patch per driver
>> to remove the merge_bvec_fn instance and add the blk_queue_split call
>> for all those drivers that actually had a ->merge_bvec_fn. As some
>> of them were non-trivial attention from the maintainers would be helpful,
>> and a patch per driver might help with that.
>>
>>> -/* This is called by bio_add_page().
>>> - *
>>> - * q->max_hw_sectors and other global limits are already enforced there.
>>> - *
>>> - * We need to call down to our lower level device,
>>> - * in case it has special restrictions.
>>> - *
>>> - * We also may need to enforce configured max-bio-bvecs limits.
>>> - *
>>> - * As long as the BIO is empty we have to allow at least one bvec,
>>> - * regardless of size and offset, so no need to ask lower levels.
>>> - */
>>> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
>>
>>
>> This just checks the lower device, so it looks obviously fine.
>>
>>> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
>>> - struct bio_vec *bvec)
>>> -{
>>> - struct pktcdvd_device *pd = q->queuedata;
>>> - sector_t zone = get_zone(bmd->bi_sector, pd);
>>> - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
>>> - int remaining = (pd->settings.size << 9) - used;
>>> - int remaining2;
>>> -
>>> - /*
>>> - * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
>>> - * boundary, pkt_make_request() will split the bio.
>>> - */
>>> - remaining2 = PAGE_SIZE - bmd->bi_size;
>>> - remaining = max(remaining, remaining2);
>>> -
>>> - BUG_ON(remaining < 0);
>>> - return remaining;
>>> -}
>>
>> As mentioned in the comment pkt_make_request will split the bio so pkt
>> looks fine.
>>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index ec6c5c6..f50edb3 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> return BLK_MQ_RQ_QUEUE_OK;
>>> }
>>>
>>> -/*
>>> - * a queue callback. Makes sure that we don't create a bio that spans across
>>> - * multiple osd objects. One exception would be with a single page bios,
>>> - * which we handle later at bio_chain_clone_range()
>>> - */
>>> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
>>> - struct bio_vec *bvec)
>>
>> It seems rbd handles requests spanning objects just fine, so I don't
>> really understand why rbd_merge_bvec even exists. Getting some form
>> of ACK from the ceph folks would be useful.
>
> I'm not Alex, but yeah, we have all the clone/split machinery and so we
> can handle a spanning case just fine. I think rbd_merge_bvec() exists
> to make sure we don't have to do that unless it's really necessary -
> like when a single page gets submitted at an inconvenient offset.

I am Alex. This is something I never removed. I haven't
looked at it closely now, but it seems to me that after I
created a function that split stuff properly up *before*
the BIO layer got to it (which has since been replaced by
code related to Kent's immutable BIO work), there has been
no need for this function. Removing this was on a long-ago
to-do list--but I didn't want to do it without spending some
time ensuring it wouldn't break anything.

If you want me to work through it in more detail so I can
give a more certain response, let me know and I will do so.

-Alex

> I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
> rbd_init_disk() but I haven't had a chance to play with it yet. In any
> case, we should be fine with getting rid of rbd_merge_bvec(). If this
> ends up a per-driver patchset, I can make rbd_merge_bvec() ->
> blk_queue_chunk_sectors() a single patch and push it through
> ceph-client.git.
>
> Thanks,
>
> Ilya
>

2015-05-26 14:33:52

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Mon, May 25, 2015 at 7:17 AM, Christoph Hellwig <[email protected]> wrote:
> On Mon, May 25, 2015 at 05:54:14PM +1000, NeilBrown wrote:
>> Did I write that? I guess I did :-(
>> I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
>> chunk_aligned_read().
>
> I suspect the whole series could use some reordering.

Nice reordering.
I'll do this.

Thanks.

>
> patch 1:
>
> add ->bio_split and blk_queue_split
>
> patch 2..n:
>
> one for each non-trivial driver that implements ->merge_bvec_fn to
> remove it and instead split bios in ->make_request. The md patch
> to do the right thing in chunk_aligned_read goes into the general
> md patch here. The bcache patch also goes into this series.
>
> patch n+1:
>
> - add blk_queue_split calls for remaining trivial drivers
>
> patch n+2:
>
> - remove ->merge_bvec_fn and checking of max_sectors for all
> drivers, simplify bio_add_page
>
> patch n+3:
>
> - remove splitting in blkdev_issue_discard
>
> patch n+4:
>
> - remove bio_fits_rdev
>
> patch n+5:
>
> - remove bio_get_nr_vecs
>
> patch n+6:
>
> - use bio_add_page
>
> patch n+7:
>
> - update documentation

2015-05-26 14:36:31

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Fri, May 22 2015 at 2:18pm -0400,
Ming Lin <[email protected]> wrote:

> From: Kent Overstreet <[email protected]>
>
> The way the block layer is currently written, it goes to great lengths
> to avoid having to split bios; upper layer code (such as bio_add_page())
> checks what the underlying device can handle and tries to always create
> bios that don't need to be split.
>
> But this approach becomes unwieldy and eventually breaks down with
> stacked devices and devices with dynamic limits, and it adds a lot of
> complexity. If the block layer could split bios as needed, we could
> eliminate a lot of complexity elsewhere - particularly in stacked
> drivers. Code that creates bios can then create whatever size bios are
> convenient, and more importantly stacked drivers don't have to deal with
> both their own bio size limitations and the limitations of the
> (potentially multiple) devices underneath them. In the future this will
> let us delete merge_bvec_fn and a bunch of other code.

This series doesn't take any steps to train upper layers
(e.g. filesystems) to size their bios larger (which is defined as
"whatever size bios are convenient" above).

bio_add_page(), and merge_bvec_fn, served as the means for upper layers
(and direct IO) to build up optimally sized bios. Without a replacement
(that I can see anyway) how is this patchset making forward progress
(getting Acks, etc)!?

I like the idea of reduced complexity associated with these late bio
splitting changes I'm just not seeing how this is ready given there are
no upper layer changes that speak to building larger bios..

What am I missing?

Please advise, thanks!
Mike

2015-05-26 15:02:21

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <[email protected]> wrote:
> On Fri, May 22 2015 at 2:18pm -0400,
> Ming Lin <[email protected]> wrote:
>
>> From: Kent Overstreet <[email protected]>
>>
>> The way the block layer is currently written, it goes to great lengths
>> to avoid having to split bios; upper layer code (such as bio_add_page())
>> checks what the underlying device can handle and tries to always create
>> bios that don't need to be split.
>>
>> But this approach becomes unwieldy and eventually breaks down with
>> stacked devices and devices with dynamic limits, and it adds a lot of
>> complexity. If the block layer could split bios as needed, we could
>> eliminate a lot of complexity elsewhere - particularly in stacked
>> drivers. Code that creates bios can then create whatever size bios are
>> convenient, and more importantly stacked drivers don't have to deal with
>> both their own bio size limitations and the limitations of the
>> (potentially multiple) devices underneath them. In the future this will
>> let us delete merge_bvec_fn and a bunch of other code.
>
> This series doesn't take any steps to train upper layers
> (e.g. filesystems) to size their bios larger (which is defined as
> "whatever size bios are convenient" above).
>
> bio_add_page(), and merge_bvec_fn, served as the means for upper layers
> (and direct IO) to build up optimally sized bios. Without a replacement
> (that I can see anyway) how is this patchset making forward progress
> (getting Acks, etc)!?
>
> I like the idea of reduced complexity associated with these late bio
> splitting changes I'm just not seeing how this is ready given there are
> no upper layer changes that speak to building larger bios..
>
> What am I missing?

See: [PATCH v4 02/11] block: simplify bio_add_page()
https://lkml.org/lkml/2015/5/22/754

Now bio_add_page() can build larger bios.
And blk_queue_split() can split the bios in ->make_request() if needed.

Thanks.

>
> Please advise, thanks!
> Mike

2015-05-26 15:35:18

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Tue, May 26, 2015 at 08:02:08AM -0700, Ming Lin wrote:
> Now bio_add_page() can build larger bios.
> And blk_queue_split() can split the bios in ->make_request() if needed.

But why not try to make the bio the right size in the first place so you
don't have to incur the performance impact of splitting?

What performance testing have you yet done to demonstrate the *actual* impact
of this patchset in situations where merge_bvec_fn is currently a net benefit?

Alasdair

2015-05-26 16:05:34

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Tue, May 26 2015 at 11:02am -0400,
Ming Lin <[email protected]> wrote:

> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <[email protected]> wrote:
> > On Fri, May 22 2015 at 2:18pm -0400,
> > Ming Lin <[email protected]> wrote:
> >
> >> From: Kent Overstreet <[email protected]>
> >>
> >> The way the block layer is currently written, it goes to great lengths
> >> to avoid having to split bios; upper layer code (such as bio_add_page())
> >> checks what the underlying device can handle and tries to always create
> >> bios that don't need to be split.
> >>
> >> But this approach becomes unwieldy and eventually breaks down with
> >> stacked devices and devices with dynamic limits, and it adds a lot of
> >> complexity. If the block layer could split bios as needed, we could
> >> eliminate a lot of complexity elsewhere - particularly in stacked
> >> drivers. Code that creates bios can then create whatever size bios are
> >> convenient, and more importantly stacked drivers don't have to deal with
> >> both their own bio size limitations and the limitations of the
> >> (potentially multiple) devices underneath them. In the future this will
> >> let us delete merge_bvec_fn and a bunch of other code.
> >
> > This series doesn't take any steps to train upper layers
> > (e.g. filesystems) to size their bios larger (which is defined as
> > "whatever size bios are convenient" above).
> >
> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers
> > (and direct IO) to build up optimally sized bios. Without a replacement
> > (that I can see anyway) how is this patchset making forward progress
> > (getting Acks, etc)!?
> >
> > I like the idea of reduced complexity associated with these late bio
> > splitting changes I'm just not seeing how this is ready given there are
> > no upper layer changes that speak to building larger bios..
> >
> > What am I missing?
>
> See: [PATCH v4 02/11] block: simplify bio_add_page()
> https://lkml.org/lkml/2015/5/22/754
>
> Now bio_add_page() can build larger bios.
> And blk_queue_split() can split the bios in ->make_request() if needed.

That'll result in quite large bios and always needing splitting.

As Alasdair asked: please provide some performance data that justifies
these changes. E.g use a setup like: XFS on a DM striped target. We
can iterate on more complex setups once we have established some basic
tests.

If you're just punting to reviewers to do the testing for you that isn't
going to instill _any_ confidence in me for this patchset as a suitable
replacement relative to performance.

2015-05-26 17:18:08

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Tue, May 26, 2015 at 9:04 AM, Mike Snitzer <[email protected]> wrote:
> On Tue, May 26 2015 at 11:02am -0400,
> Ming Lin <[email protected]> wrote:
>
>> On Tue, May 26, 2015 at 7:36 AM, Mike Snitzer <[email protected]> wrote:
>> > On Fri, May 22 2015 at 2:18pm -0400,
>> > Ming Lin <[email protected]> wrote:
>> >
>> >> From: Kent Overstreet <[email protected]>
>> >>
>> >> The way the block layer is currently written, it goes to great lengths
>> >> to avoid having to split bios; upper layer code (such as bio_add_page())
>> >> checks what the underlying device can handle and tries to always create
>> >> bios that don't need to be split.
>> >>
>> >> But this approach becomes unwieldy and eventually breaks down with
>> >> stacked devices and devices with dynamic limits, and it adds a lot of
>> >> complexity. If the block layer could split bios as needed, we could
>> >> eliminate a lot of complexity elsewhere - particularly in stacked
>> >> drivers. Code that creates bios can then create whatever size bios are
>> >> convenient, and more importantly stacked drivers don't have to deal with
>> >> both their own bio size limitations and the limitations of the
>> >> (potentially multiple) devices underneath them. In the future this will
>> >> let us delete merge_bvec_fn and a bunch of other code.
>> >
>> > This series doesn't take any steps to train upper layers
>> > (e.g. filesystems) to size their bios larger (which is defined as
>> > "whatever size bios are convenient" above).
>> >
>> > bio_add_page(), and merge_bvec_fn, served as the means for upper layers
>> > (and direct IO) to build up optimally sized bios. Without a replacement
>> > (that I can see anyway) how is this patchset making forward progress
>> > (getting Acks, etc)!?
>> >
>> > I like the idea of reduced complexity associated with these late bio
>> > splitting changes I'm just not seeing how this is ready given there are
>> > no upper layer changes that speak to building larger bios..
>> >
>> > What am I missing?
>>
>> See: [PATCH v4 02/11] block: simplify bio_add_page()
>> https://lkml.org/lkml/2015/5/22/754
>>
>> Now bio_add_page() can build larger bios.
>> And blk_queue_split() can split the bios in ->make_request() if needed.
>
> That'll result in quite large bios and always needing splitting.
>
> As Alasdair asked: please provide some performance data that justifies
> these changes. E.g use a setup like: XFS on a DM striped target. We
> can iterate on more complex setups once we have established some basic
> tests.

I'll test XFS on DM and also what Christoph suggested:
https://lkml.org/lkml/2015/5/25/226

>
> If you're just punting to reviewers to do the testing for you that isn't
> going to instill _any_ confidence in me for this patchset as a suitable
> replacement relative to performance.

Kent's Direct IO rewrite patch depends on this series.
https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-dio-rewrite

I did test the dio patch on a 2-socket (48 logical CPUs) server and
saw a 40% improvement with 48 null_blk devices.
Here is the fio data of 4k read.

4.1-rc2
----------
Test 1: bw=50509MB/s, iops=12930K
Test 2: bw=49745MB/s, iops=12735K
Test 3: bw=50297MB/s, iops=12876K
Average: bw=50183MB/s, iops=12847K

4.1-rc2-dio-rewrite
------------------------
Test 1: bw=70269MB/s, iops=17989K
Test 2: bw=70097MB/s, iops=17945K
Test 3: bw=70907MB/s, iops=18152K
Average: bw=70424MB/s, iops=18028K

2015-05-26 22:32:44

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Tue, May 26, 2015 at 7:33 AM, Ming Lin <[email protected]> wrote:
> On Mon, May 25, 2015 at 7:17 AM, Christoph Hellwig <[email protected]> wrote:
>> On Mon, May 25, 2015 at 05:54:14PM +1000, NeilBrown wrote:
>>> Did I write that? I guess I did :-(
>>> I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
>>> chunk_aligned_read().
>>
>> I suspect the whole series could use some reordering.
>
> Nice reordering.
> I'll do this.

Here is the reordering.
https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req

I'll post it if you are OK.

[PATCH 01/15] block: add blk_queue_split()
[PATCH 02/15] md: remove ->merge_bvec_fn
[PATCH 03/15] dm: remove merge functions
[PATCH 04/15] drbd: remove ->merge_bvec_fn
[PATCH 05/15] pktcdvd: remove ->merge_bvec_fn
[PATCH 06/15] rbd: remove ->merge_bvec_fn
[PATCH 07/15] bcache: remove driver private bio splitting code
[PATCH 08/15] btrfs: remove bio splitting and merge_bvec_fn() calls
[PATCH 09/15] block: call blk_queue_split() in make_request functions
[PATCH 10/15] block: kill ->merge_bvec_fn and simplify bio_add_page
[PATCH 11/15] block: remove split code in blkdev_issue_discard
[PATCH 12/15] md/raid5: get rid of bio_fits_rdev()
[PATCH 13/15] block: remove bio_get_nr_vecs()
[PATCH 14/15] fs: use helper bio_add_page() instead of open coding on
[PATCH 15/15] Documentation: update notes in biovecs about

>
> Thanks.
>
>>
>> patch 1:
>>
>> add ->bio_split and blk_queue_split
>>
>> patch 2..n:
>>
>> one for each non-trivial driver that implements ->merge_bvec_fn to
>> remove it and instead split bios in ->make_request. The md patch
>> to do the right thing in chunk_aligned_read goes into the general
>> md patch here. The bcache patch also goes into this series.
>>
>> patch n+1:
>>
>> - add blk_queue_split calls for remaining trivial drivers
>>
>> patch n+2:
>>
>> - remove ->merge_bvec_fn and checking of max_sectors for all
>> drivers, simplify bio_add_page
>>
>> patch n+2:
>>
>> - remove splitting in blkdev_issue_discard
>>
>> patch n+3
>>
>> - remove bio_fits_rdev
>>
>> patch n+4
>>
>> - remove bio_get_nr_vecs
>>
>> patch n+4
>>
>> - use bio_add_page
>>
>> patch n+5
>>
>> - update documentation

2015-05-26 23:03:24

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Tue, 26 May 2015 15:32:38 -0700 Ming Lin <[email protected]> wrote:

> On Tue, May 26, 2015 at 7:33 AM, Ming Lin <[email protected]> wrote:
> > On Mon, May 25, 2015 at 7:17 AM, Christoph Hellwig <[email protected]> wrote:
> >> On Mon, May 25, 2015 at 05:54:14PM +1000, NeilBrown wrote:
> >>> Did I write that? I guess I did :-(
> >>> I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
> >>> chunk_aligned_read().
> >>
> >> I suspect the whole series could use some reordering.
> >
> > Nice reordering.
> > I'll do this.
>
> Here is the reordering.
> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
>
> I'll post it if you are OK.
>
> [PATCH 01/15] block: add blk_queue_split()
> [PATCH 02/15] md: remove ->merge_bvec_fn
> [PATCH 03/15] dm: remove merge functions
> [PATCH 04/15] drbd: remove ->merge_bvec_fn
> [PATCH 05/15] pktcdvd: remove ->merge_bvec_fn
> [PATCH 06/15] rbd: remove ->merge_bvec_fn
> [PATCH 07/15] bcache: remove driver private bio splitting code
> [PATCH 08/15] btrfs: remove bio splitting and merge_bvec_fn() calls
> [PATCH 09/15] block: call blk_queue_split() in make_request functions
> [PATCH 10/15] block: kill ->merge_bvec_fn and simplify bio_add_page
> [PATCH 11/15] block: remove split code in blkdev_issue_discard
> [PATCH 12/15] md/raid5: get rid of bio_fits_rdev()
> [PATCH 13/15] block: remove bio_get_nr_vecs()
> [PATCH 14/15] fs: use helper bio_add_page() instead of open coding on
> [PATCH 15/15] Documentation: update notes in biovecs about

The changes to dm.c and dm.h should be in the "dm:" patch, not "md:".

But I don't think the sequence is right.

You cannot remove ->merge_bvec_fn for *any* stacked device until *all* devices
make use of blk_queue_split() (or otherwise handle arbitrarily large bios).

I think it would be easiest to:
- add blk_queue_split() and call it from common code before ->make_request_fn
is called. This ensures all devices can accept arbitrarily large bios.
- driver-by-driver, remove merge_bvec_fn and make sure that the driver can cope
with arbitrary bios itself, calling blk_queue_split in the make_request
function only if needed
- finally remove the call to blk_queue_split from the common code.

Does that make sense to others?

Thanks,
NeilBrown

>
> >
> > Thanks.
> >
> >>
> >> patch 1:
> >>
> >> add ->bio_split and blk_queue_split
> >>
> >> patch 2..n:
> >>
> >> one for each non-trivial driver that implements ->merge_bvec_fn to
> >> remove it and instead split bios in ->make_request. The md patch
> >> to do the right thing in chunk_aligned_read goes into the general
> >> md patch here. The bcache patch also goes into this series.
> >>
> >> patch n+1:
> >>
> >> - add blk_queue_split calls for remaining trivial drivers
> >>
> >> patch n+2:
> >>
> >> - remove ->merge_bvec_fn and checking of max_sectors for all
> >> drivers, simplify bio_add_page
> >>
> >> patch n+2:
> >>
> >> - remove splitting in blkdev_issue_discard
> >>
> >> patch n+3
> >>
> >> - remove bio_fits_rdev
> >>
> >> patch n+4
> >>
> >> - remove bio_get_nr_vecs
> >>
> >> patch n+4
> >>
> >> - use bio_add_page
> >>
> >> patch n+5
> >>
> >> - update documentation


Attachments:
(No filename) (811.00 B)
OpenPGP digital signature

2015-05-26 23:08:30

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Tue, 26 May 2015 16:34:14 +0100 Alasdair G Kergon <[email protected]> wrote:

> On Tue, May 26, 2015 at 08:02:08AM -0700, Ming Lin wrote:
> > Now bio_add_page() can build larger bios.
> > And blk_queue_split() can split the bios in ->make_request() if needed.
>
> But why not try to make the bio the right size in the first place so you
> don't have to incur the performance impact of splitting?

Because we don't know what the "right" size is. And the "right" size can
change when array reconfiguration happens.

Splitting has to happen somewhere, if only in bio_add_page() where it decides to
create a new bio rather than add another page to the current one. So moving
the split to a different level of the stack shouldn't necessarily change the
performance profile.

Obviously testing is important to confirm that.

NeilBrown

>
> What performance testing have you yet done to demonstrate the *actual* impact
> of this patchset in situations where merge_bvec_fn is currently a net benefit?
>
> Alasdair
>



2015-05-26 23:42:41

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Tue, May 26, 2015 at 4:03 PM, NeilBrown <[email protected]> wrote:
> On Tue, 26 May 2015 15:32:38 -0700 Ming Lin <[email protected]> wrote:
>
>> On Tue, May 26, 2015 at 7:33 AM, Ming Lin <[email protected]> wrote:
>> > On Mon, May 25, 2015 at 7:17 AM, Christoph Hellwig <[email protected]> wrote:
>> >> On Mon, May 25, 2015 at 05:54:14PM +1000, NeilBrown wrote:
>> >>> Did I write that? I guess I did :-(
>> >>> I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
>> >>> chunk_aligned_read().
>> >>
>> >> I suspect the whole series could use some reordering.
>> >
>> > Nice reordering.
>> > I'll do this.
>>
>> Here is the reordering.
>> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
>>
>> I'll post it if you are OK.
>>
>> [PATCH 01/15] block: add blk_queue_split()
>> [PATCH 02/15] md: remove ->merge_bvec_fn
>> [PATCH 03/15] dm: remove merge functions
>> [PATCH 04/15] drbd: remove ->merge_bvec_fn
>> [PATCH 05/15] pktcdvd: remove ->merge_bvec_fn
>> [PATCH 06/15] rbd: remove ->merge_bvec_fn
>> [PATCH 07/15] bcache: remove driver private bio splitting code
>> [PATCH 08/15] btrfs: remove bio splitting and merge_bvec_fn() calls
>> [PATCH 09/15] block: call blk_queue_split() in make_request functions
>> [PATCH 10/15] block: kill ->merge_bvec_fn and simplify bio_add_page
>> [PATCH 11/15] block: remove split code in blkdev_issue_discard
>> [PATCH 12/15] md/raid5: get rid of bio_fits_rdev()
>> [PATCH 13/15] block: remove bio_get_nr_vecs()
>> [PATCH 14/15] fs: use helper bio_add_page() instead of open coding on
>> [PATCH 15/15] Documentation: update notes in biovecs about
>
> The changes to dm.c and dm.h should be in the "dm:" patch, not "md:".

Will move it.

>
> But I don't think the sequence is right.
>
> You cannot remove ->merge_bvec_fn for *any* stacked device until *all* devices
> make use of blk_queue_split() (or otherwise handle arbitrarily large bios).
>
> I think it would be easiest to:
> - add blk_queue_split() and call it from common code before ->make_request_fn
> is called. This ensures all devices can accept arbitrarily large bios.

For "common code", do you mean "generic_make_request()"?

diff --git a/block/blk-core.c b/block/blk-core.c
index fbbb337..bb6455b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1942,6 +1942,7 @@ void generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		blk_queue_split(q, &bio, q->bio_split);
 		q->make_request_fn(q, bio);
 
 		bio = bio_list_pop(current->bio_list);

> - driver-by-driver, remove merge_bvec_fn and make sure that the driver can cope
> with arbitrary bios itself, calling blk_queue_split in the make_request
> function only if needed
> - finally remove the call to blk_queue_split from the common code.
>
> Does that make sense to others?
>
> Thanks,
> NeilBrown
>
>>
>> >
>> > Thanks.
>> >
>> >>
>> >> patch 1:
>> >>
>> >> add ->bio_split and blk_queue_split
>> >>
>> >> patch 2..n:
>> >>
>> >> one for each non-trivial driver that implements ->merge_bvec_fn to
>> >> remove it and instead split bios in ->make_request. The md patch
>> >> to do the right thing in chunk_aligned_read goes into the general
>> >> md patch here. The bcache patch also goes into this series.
>> >>
>> >> patch n+1:
>> >>
>> >> - add blk_queue_split calls for remaining trivial drivers
>> >>
>> >> patch n+2:
>> >>
>> >> - remove ->merge_bvec_fn and checking of max_sectors a for all
>> >> drivers, simplify bio_add_page
>> >>
>> >> patch n+2:
>> >>
>> >> - remove splitting in blkdev_issue_discard
>> >>
>> >> patch n+3
>> >>
>> >> - remove bio_fits_rdev
>> >>
>> >> patch n+4
>> >>
>> >> - remove bio_get_nr_vecs
>> >>
>> >> patch n+4
>> >>
>> >> - use bio_add_page
>> >>
>> >> patch n+5
>> >>
>> >> - update documentation
>

2015-05-27 00:38:49

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v4 06/11] md/raid5: get rid of bio_fits_rdev()

On Tue, 26 May 2015 16:42:35 -0700 Ming Lin <[email protected]> wrote:

> On Tue, May 26, 2015 at 4:03 PM, NeilBrown <[email protected]> wrote:
> > On Tue, 26 May 2015 15:32:38 -0700 Ming Lin <[email protected]> wrote:
> >
> >> On Tue, May 26, 2015 at 7:33 AM, Ming Lin <[email protected]> wrote:
> >> > On Mon, May 25, 2015 at 7:17 AM, Christoph Hellwig <[email protected]> wrote:
> >> >> On Mon, May 25, 2015 at 05:54:14PM +1000, NeilBrown wrote:
> >> >>> Did I write that? I guess I did :-(
> >> >>> I meant *after*. Don't get rid of bio_fits_rdev until split_bio is in
> >> >>> chunk_aligned_read().
> >> >>
> >> >> I suspect the whole series could use some reordering.
> >> >
> >> > Nice reordering.
> >> > I'll do this.
> >>
> >> Here is the reordering.
> >> https://git.kernel.org/cgit/linux/kernel/git/mlin/linux.git/log/?h=block-generic-req
> >>
> >> I'll post it if you are OK.
> >>
> >> [PATCH 01/15] block: add blk_queue_split()
> >> [PATCH 02/15] md: remove ->merge_bvec_fn
> >> [PATCH 03/15] dm: remove merge functions
> >> [PATCH 04/15] drbd: remove ->merge_bvec_fn
> >> [PATCH 05/15] pktcdvd: remove ->merge_bvec_fn
> >> [PATCH 06/15] rbd: remove ->merge_bvec_fn
> >> [PATCH 07/15] bcache: remove driver private bio splitting code
> >> [PATCH 08/15] btrfs: remove bio splitting and merge_bvec_fn() calls
> >> [PATCH 09/15] block: call blk_queue_split() in make_request functions
> >> [PATCH 10/15] block: kill ->merge_bvec_fn and simplify bio_add_page
> >> [PATCH 11/15] block: remove split code in blkdev_issue_discard
> >> [PATCH 12/15] md/raid5: get rid of bio_fits_rdev()
> >> [PATCH 13/15] block: remove bio_get_nr_vecs()
> >> [PATCH 14/15] fs: use helper bio_add_page() instead of open coding on
> >> [PATCH 15/15] Documentation: update notes in biovecs about
> >
> > The changes to dm.c and dm.h should be in the "dm:" patch, not "md:".
>
> Will move it.
>
> >
> > But I don't think the sequence is right.
> >
> > You cannot remove ->merge_bvec_fn for *any* stacked device until *all* devices
> > make use of blk_queue_split() (or otherwise handle arbitrarily large bios).
> >
> > I think it would be easiest to:
> > - add blk_queue_split() and call it from common code before ->make_request_fn
> > is called. This ensures all devices can accept arbitrarily large bios.
>
> For "common code", do you mean "generic_make_request()"?
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fbbb337..bb6455b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1942,6 +1942,7 @@ void generic_make_request(struct bio *bio)
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
> +		blk_queue_split(q, &bio, q->bio_split);
>  		q->make_request_fn(q, bio);
>  
>  		bio = bio_list_pop(current->bio_list);

Yes, that is what I mean (assuming that is the only place that calls
->make_request_fn).

Thanks,
NeilBrown




2015-05-27 00:41:39

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Wed, May 27, 2015 at 09:06:40AM +1000, Neil Brown wrote:
> Because we don't know what the "right" size is. And the "right" size can
> change when array reconfiguration happens.

In certain configurations today, device-mapper does report back a sensible
maximum bio size smaller than would otherwise be used and thereby avoids
retrospective splitting. (In tests, the overhead of the duplicate calculation
was found to be negligible so we never restructured the code to optimise it away.)

> Splitting has to happen somewhere, if only in bio_add_page() where it decides to
> create a new bio rather than add another page to the current one. So moving
> the split to a different level of the stack shouldn't necessarily change the
> performance profile.

It does sometimes make a significant difference to device-mapper stacks.
DM only uses it for performance reasons - it can already split bios when
it needs to. I tried to remove merge_bvec_fn from DM several years ago but
couldn't because of the adverse performance impact of lots of splitting activity.

The overall cost of splitting ought to be less in many (but not necessarily
all) cases now as a result of all these patches, so exactly where the best
balance lies now needs to be reassessed empirically. It is hard to reach
conclusions theoretically because of the complex interplay between the various
factors at different levels.

Alasdair

2015-06-01 06:02:44

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> > Does it make sense?
>
> To stripe across devices with different characteristics?
>
> Some suggestions.
>
> Prepare 3 kernels.
> O - Old kernel.
> M - Old kernel with merge_bvec_fn disabled.
> N - New kernel.
>
> You're trying to search for counter-examples to the hypothesis that
> "Kernel N always outperforms Kernel O". Then if you find any, trying
> to show either that the performance impediment is small enough that
> it doesn't matter or that the cases are sufficiently rare or obscure
> that they may be ignored because of the greater benefits of N in much more
> common cases.
>
> (1) You're looking to set up configurations where kernel O performs noticeably
> better than M. Then you're comparing the performance of O and N in those
> situations.
>
> (2) You're looking at other sensible configurations where O and M have
> similar performance, and comparing that with the performance of N.

I didn't find case (1).

But the important thing for this series is to simplify the block layer
based on immutable biovecs. I don't expect a performance improvement.

Here are the change statistics:

"68 files changed, 336 insertions(+), 1331 deletions(-)"

I ran the 3 test cases below to make sure the series didn't bring any regressions.
Test environment: 2 NVMe drives on a 2-socket server.
Each case ran for 30 minutes.

1) btrfs raid0

mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
mount /dev/nvme0n1 /mnt

Then run 8K read.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=4
rw=read

[job1]
bs=8K
directory=/mnt
size=1G

2) ext4 on MD raid5

mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt

fio script same as btrfs test

3) xfs on DM striped target

pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
mount /dev/striped_vol_group/striped_logical_volume /mnt

fio script same as btrfs test

------

Results:

4.1-rc4 4.1-rc4-patched
btrfs 1818.6MB/s 1874.1MB/s
ext4 717307KB/s 714030KB/s
xfs 1396.6MB/s 1398.6MB/s

2015-06-01 06:15:18

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

On Sat, May 23, 2015 at 7:15 AM, Christoph Hellwig <[email protected]> wrote:
> On Fri, May 22, 2015 at 11:18:32AM -0700, Ming Lin wrote:
>> This will bring not only performance improvements, but also a great amount
>> of reduction in code complexity all over the block layer. Performance gain
>> is possible due to the fact that bio_add_page() does not have to check
>> unnecesary conditions such as queue limits or if biovecs are mergeable.
>> Those will be delegated to the driver level. Kent already said that he
>> actually benchmarked the impact of this with fio on a micron p320h, which
>> showed definitely a positive impact.
>
> We'll need some actual numbers. I actually like these changes a lot
> and don't even need a performance justification for this fundamentally
> better model, but I'd really prefer to avoid any large scale regressions.
> I don't really expect them, but for code this fundamental we'll just
> need some benchmarks.
>
> Except for that, these changes look good, and the previous version
> passed my tests fine, so with some benchmarks you'll have my ACK.

Can I have your ACK with these numbers?
https://lkml.org/lkml/2015/6/1/38

>
> I'd love to see this go into 4.2, but for that we'll need Jens
> approval and a merge into for-next very soon.

2015-06-02 20:59:21

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Sun, May 31, 2015 at 11:02 PM, Ming Lin <[email protected]> wrote:
> On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
>> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
>> > Does it make sense?
>>
>> To stripe across devices with different characteristics?
>>
>> Some suggestions.
>>
>> Prepare 3 kernels.
>> O - Old kernel.
>> M - Old kernel with merge_bvec_fn disabled.
>> N - New kernel.
>>
>> You're trying to search for counter-examples to the hypothesis that
>> "Kernel N always outperforms Kernel O". Then if you find any, trying
>> to show either that the performance impediment is small enough that
>> it doesn't matter or that the cases are sufficiently rare or obscure
>> that they may be ignored because of the greater benefits of N in much more
>> common cases.
>>
>> (1) You're looking to set up configurations where kernel O performs noticeably
>> better than M. Then you're comparing the performance of O and N in those
>> situations.
>>
>> (2) You're looking at other sensible configurations where O and M have
>> similar performance, and comparing that with the performance of N.
>
> I didn't find case (1).
>
> But the important thing for this series is to simplify the block layer
> based on immutable biovecs. I don't expect a performance improvement.
>
> Here are the change statistics:
>
> "68 files changed, 336 insertions(+), 1331 deletions(-)"
>
> I ran the 3 test cases below to make sure the series didn't bring any regressions.
> Test environment: 2 NVMe drives on a 2-socket server.
> Each case ran for 30 minutes.
>
> 1) btrfs raid0
>
> mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> mount /dev/nvme0n1 /mnt
>
> Then run 8K read.
>
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=4
> rw=read
>
> [job1]
> bs=8K
> directory=/mnt
> size=1G
>
> 2) ext4 on MD raid5
>
> mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> mkfs.ext4 /dev/md0
> mount /dev/md0 /mnt
>
> fio script same as btrfs test
>
> 3) xfs on DM striped target
>
> pvcreate /dev/nvme0n1 /dev/nvme1n1
> vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> mount /dev/striped_vol_group/striped_logical_volume /mnt
>
> fio script same as btrfs test
>
> ------
>
> Results:
>
> 4.1-rc4 4.1-rc4-patched
> btrfs 1818.6MB/s 1874.1MB/s
> ext4 717307KB/s 714030KB/s
> xfs 1396.6MB/s 1398.6MB/s

Hi Alasdair & Mike,

Would you like these numbers?
I'd like to address your concerns to move forward.

Thanks.

2015-06-03 06:57:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

On Sun, May 31, 2015 at 11:15:09PM -0700, Ming Lin wrote:
> > Except for that, these changes look good, and the previous version
> > passed my tests fine, so with some benchmarks you'll have my ACK.
>
> Can I have your ACK with these numbers?
> https://lkml.org/lkml/2015/6/1/38

Looks good to me. Still like to see consensus from the DM folks.

2015-06-03 13:28:36

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

Christoph Hellwig <[email protected]> writes:

> On Sun, May 31, 2015 at 11:15:09PM -0700, Ming Lin wrote:
>> > Except for that, these changes look good, and the previous version
>> > passed my tests fine, so with some benchmarks you'll have my ACK.
>>
>> Can I have your ACK with these numbers?
>> https://lkml.org/lkml/2015/6/1/38
>
> Looks good to me. Still like to see consensus from the DM folks.

Ming, did you look into the increased stack usage reported by Huang
Ying?

-Jeff

2015-06-03 17:18:31

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 00/11] simplify block layer based on immutable biovecs

On Wed, Jun 3, 2015 at 6:28 AM, Jeff Moyer <[email protected]> wrote:
> Christoph Hellwig <[email protected]> writes:
>
>> On Sun, May 31, 2015 at 11:15:09PM -0700, Ming Lin wrote:
>>> > Except for that, these changes look good, and the previous version
>>> > passed my tests fine, so with some benchmarks you'll have my ACK.
>>>
>>> Can I have your ACK with these numbers?
>>> https://lkml.org/lkml/2015/6/1/38
>>
>> Looks good to me. Still like to see consensus from the DM folks.
>
> Ming, did you look into the increased stack usage reported by Huang
> Ying?

Yes, I'll reply to Ying's email.

>
> -Jeff

2015-06-04 21:06:23

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Tue, Jun 02 2015 at 4:59pm -0400,
Ming Lin <[email protected]> wrote:

> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <[email protected]> wrote:
> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> >> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> >> > Does it make sense?
> >>
> >> To stripe across devices with different characteristics?
> >>
> >> Some suggestions.
> >>
> >> Prepare 3 kernels.
> >> O - Old kernel.
> >> M - Old kernel with merge_bvec_fn disabled.
> >> N - New kernel.
> >>
> >> You're trying to search for counter-examples to the hypothesis that
> >> "Kernel N always outperforms Kernel O". Then if you find any, trying
> >> to show either that the performance impediment is small enough that
> >> it doesn't matter or that the cases are sufficiently rare or obscure
> >> that they may be ignored because of the greater benefits of N in much more
> >> common cases.
> >>
> >> (1) You're looking to set up configurations where kernel O performs noticeably
> >> better than M. Then you're comparing the performance of O and N in those
> >> situations.
> >>
> >> (2) You're looking at other sensible configurations where O and M have
> >> similar performance, and comparing that with the performance of N.
> >
> > I didn't find case (1).
> >
> > But the important thing for this series is to simplify block layer
> > based on immutable biovecs. I don't expect performance improvement.

No, simplifying isn't the important thing. Any change to remove the
merge_bvec callbacks needs to not introduce performance regressions on
enterprise systems with large RAID arrays, etc.

It is fine if there isn't a performance improvement but I really don't
think the limited testing you've done on a relatively small storage
configuration has come even close to showing these changes don't
introduce performance regressions.

> > Here is the changes statistics.
> >
> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
> >
> > I ran the 3 test cases below to make sure it didn't bring any regressions.
> > Test environment: 2 NVMe drives on 2 sockets server.
> > Each case run for 30 minutes.
> >
> > 1) btrfs raid0
> >
> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> > mount /dev/nvme0n1 /mnt
> >
> > Then run 8K read.
> >
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=4
> > rw=read
> >
> > [job1]
> > bs=8K
> > directory=/mnt
> > size=1G
> >
> > 2) ext4 on MD raid5
> >
> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> > mkfs.ext4 /dev/md0
> > mount /dev/md0 /mnt
> >
> > fio script same as btrfs test
> >
> > 3) xfs on DM striped target
> >
> > pvcreate /dev/nvme0n1 /dev/nvme1n1
> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> > mount /dev/striped_vol_group/striped_logical_volume /mnt
> >
> > fio script same as btrfs test
> >
> > ------
> >
> > Results:
> >
> > 4.1-rc4 4.1-rc4-patched
> > btrfs 1818.6MB/s 1874.1MB/s
> > ext4 717307KB/s 714030KB/s
> > xfs 1396.6MB/s 1398.6MB/s
>
> Hi Alasdair & Mike,
>
> Would you like these numbers?
> I'd like to address your concerns to move forward.

I really don't see that these NVMe results prove much.

We need to test on large HW raid setups like a Netapp filer (or even
local SAS drives connected via some SAS controller). Like an 8+2 drive
RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
devices is also useful. It is larger RAID setups that will be more
sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
size boundaries.

There are tradeoffs between creating a really large bio and creating a
properly sized bio from the start. And yes, to one of neilb's original
points, limits do change and we suck at restacking limits... so what was
once properly sized may no longer be, but that is a relatively rare
occurrence. Late splitting does do away with the limits stacking
disconnect. And in general I like the idea of removing all the
merge_bvec code. I just don't think I can confidently Ack such a
wholesale switch at this point with such limited performance analysis.
If we (the DM/lvm team at Red Hat) are being painted into a corner of
having to provide our own testing that meets our definition of
"thorough" then we'll need time to carry out those tests. But I'd hate
to hold up everyone because DM is not in agreement on this change...

So taking a step back, why can't we introduce late bio splitting in a
phased approach?

1: introduce late bio splitting to block core BUT still keep established
merge_bvec infrastructure
2: establish a way for upper layers to skip merge_bvec if they'd like to
do so (e.g. block-core exposes a 'use_late_bio_splitting' or
something for userspace or upper layers to set, can also have a
Kconfig that enables this feature by default)
3: we gain confidence in late bio-splitting and then carry on with the
removal of merge_bvec et al (could be incrementally done on a
per-driver basis, e.g. DM, MD, btrfs, etc, etc).

Mike

2015-06-04 22:21:32

by Ming Lin

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <[email protected]> wrote:
> On Tue, Jun 02 2015 at 4:59pm -0400,
> Ming Lin <[email protected]> wrote:
>
>> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <[email protected]> wrote:
>> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
>> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
>> >> > Here are fio results of XFS on a DM stripped target with 2 SSDs + 1 HDD.
>> >> > Does it make sense?
>> >>
>> >> To stripe across devices with different characteristics?
>> >>
>> >> Some suggestions.
>> >>
>> >> Prepare 3 kernels.
>> >> O - Old kernel.
>> >> M - Old kernel with merge_bvec_fn disabled.
>> >> N - New kernel.
>> >>
>> >> You're trying to search for counter-examples to the hypothesis that
>> >> "Kernel N always outperforms Kernel O". Then if you find any, trying
>> >> to show either that the performance impediment is small enough that
>> >> it doesn't matter or that the cases are sufficiently rare or obscure
>> >> that they may be ignored because of the greater benefits of N in much more
>> >> common cases.
>> >>
>> >> (1) You're looking to set up configurations where kernel O performs noticeably
>> >> better than M. Then you're comparing the performance of O and N in those
>> >> situations.
>> >>
>> >> (2) You're looking at other sensible configurations where O and M have
>> >> similar performance, and comparing that with the performance of N.
>> >
>> > I didn't find case (1).
>> >
>> > But the important thing for this series is to simplify block layer
>> > based on immutable biovecs. I don't expect performance improvement.
>
> No simplifying isn't the important thing. Any change to remove the
> merge_bvec callbacks needs to not introduce performance regressions on
> enterprise systems with large RAID arrays, etc.
>
> It is fine if there isn't a performance improvement but I really don't
> think the limited testing you've done on a relatively small storage
> configuration has come even close to showing these changes don't
> introduce performance regressions.
>
>> > Here is the changes statistics.
>> >
>> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
>> >
>> > I run below 3 test cases to make sure it didn't bring any regressions.
>> > Test environment: 2 NVMe drives on 2 sockets server.
>> > Each case run for 30 minutes.
>> >
>> > 2) btrfs radi0
>> >
>> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
>> > mount /dev/nvme0n1 /mnt
>> >
>> > Then run 8K read.
>> >
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=4
>> > rw=read
>> >
>> > [job1]
>> > bs=8K
>> > directory=/mnt
>> > size=1G
>> >
>> > 2) ext4 on MD raid5
>> >
>> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
>> > mkfs.ext4 /dev/md0
>> > mount /dev/md0 /mnt
>> >
>> > fio script same as btrfs test
>> >
>> > 3) xfs on DM stripped target
>> >
>> > pvcreate /dev/nvme0n1 /dev/nvme1n1
>> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
>> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
>> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
>> > mount /dev/striped_vol_group/striped_logical_volume /mnt
>> >
>> > fio script same as btrfs test
>> >
>> > ------
>> >
>> > Results:
>> >
>> > 4.1-rc4 4.1-rc4-patched
>> > btrfs 1818.6MB/s 1874.1MB/s
>> > ext4 717307KB/s 714030KB/s
>> > xfs 1396.6MB/s 1398.6MB/s
>>
>> Hi Alasdair & Mike,
>>
>> Would you like these numbers?
>> I'd like to address your concerns to move forward.
>
> I really don't see that these NVMe results prove much.
>
> We need to test on large HW raid setups like a Netapp filer (or even
> local SAS drives connected via some SAS controller). Like a 8+2 drive
> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
> devices is also useful. It is larger RAID setups that will be more
> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> size boundaries.

I'll test it on a large HW RAID setup.

Here is a HW RAID5 setup with 19 278G HDDs on a Dell R730xd (2 sockets/48
logical cpus/264G mem).
http://minggr.net/pub/20150604/hw_raid5.jpg

The stripe size is 64K.

I'm going to test ext4/btrfs/xfs on it,
with "bs" set to 1216k (64K * 19 = 1216k)
and 48 jobs.

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
rw=read

[job1]
bs=1216K
directory=/mnt
size=1G

Or do you have other suggestions of what tests I should run?

Thanks.

2015-06-05 00:06:42

by Mike Snitzer

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Thu, Jun 04 2015 at 6:21pm -0400,
Ming Lin <[email protected]> wrote:

> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <[email protected]> wrote:
> >
> > We need to test on large HW raid setups like a Netapp filer (or even
> > local SAS drives connected via some SAS controller). Like a 8+2 drive
> > RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
> > devices is also useful. It is larger RAID setups that will be more
> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> > size boundaries.
>
> I'll test it on large HW raid setup.
>
> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
> logical cpus/264G mem).
> http://minggr.net/pub/20150604/hw_raid5.jpg
>
> The stripe size is 64K.
>
> I'm going to test ext4/btrfs/xfs on it.
> "bs" set to 1216k(64K * 19 = 1216k)
> and run 48 jobs.

Definitely an odd blocksize (though 1280K full stripe is pretty common
for 10+2 HW RAID6 w/ 128K chunk size).
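The arithmetic behind that is simply full-stripe size = data disks x chunk size; a quick sketch (assuming "10+2" means 10 data disks plus 2 parity disks, per the notation above):

```python
def full_stripe_kb(data_disks: int, chunk_kb: int) -> int:
    """Full-stripe size in KB: one chunk per data disk (parity disks excluded)."""
    return data_disks * chunk_kb

# 10+2 HW RAID6 with a 128K chunk size -> 1280K full stripe, as noted above.
print(full_stripe_kb(10, 128))  # -> 1280
```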

> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> rw=read
>
> [job1]
> bs=1216K
> directory=/mnt
> size=1G

How does time_based relate to size=1G? It'll rewrite the same 1 gig
file repeatedly?

> Or do you have other suggestions of what tests I should run?

You're welcome to run this job but I'll also check with others here to
see what fio jobs we used in the recent past when assessing performance
of the dm-crypt parallelization changes.

Also, a lot of care needs to be taken to eliminate jitter in the system
while the test is running. We got a lot of good insight from Bart Van
Assche on that and put it to practice. I'll see if we can (re)summarize
that too.

Mike

2015-06-05 05:21:32

by Ming Lin

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Thu, Jun 4, 2015 at 5:06 PM, Mike Snitzer <[email protected]> wrote:
> On Thu, Jun 04 2015 at 6:21pm -0400,
> Ming Lin <[email protected]> wrote:
>
>> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer <[email protected]> wrote:
>> >
>> > We need to test on large HW raid setups like a Netapp filer (or even
>> > local SAS drives connected via some SAS controller). Like a 8+2 drive
>> > RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
>> > devices is also useful. It is larger RAID setups that will be more
>> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> > size boundaries.
>>
>> I'll test it on large HW raid setup.
>>
>> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
>> logical cpus/264G mem).
>> http://minggr.net/pub/20150604/hw_raid5.jpg
>>
>> The stripe size is 64K.
>>
>> I'm going to test ext4/btrfs/xfs on it.
>> "bs" set to 1216k(64K * 19 = 1216k)
>> and run 48 jobs.
>
> Definitely an odd blocksize (though 1280K full stripe is pretty common
> for 10+2 HW RAID6 w/ 128K chunk size).

I can change it to a 10-HDD HW RAID6 w/ 128K chunk size, then use bs=1280K.

>
>> [global]
>> ioengine=libaio
>> iodepth=64
>> direct=1
>> runtime=1800
>> time_based
>> group_reporting
>> numjobs=48
>> rw=read
>>
>> [job1]
>> bs=1216K
>> directory=/mnt
>> size=1G
>
> How does time_based relate to size=1G? It'll rewrite the same 1 gig
> file repeatedly?

The above job file is for reads.
For writes, I think so.
Does it make sense as a performance test?

>
>> Or do you have other suggestions of what tests I should run?
>
> You're welcome to run this job but I'll also check with others here to
> see what fio jobs we used in the recent past when assessing performance
> of the dm-crypt parallelization changes.

That's very helpful.

>
> Also, a lot of care needs to be taken to eliminate jitter in the system
> while the test is running. We got a lot of good insight from Bart Van
> Assche on that and put it to practice. I'll see if we can (re)summarize
> that too.

Very helpful too.

Thanks.

>
> Mike

2015-06-09 06:09:40

by Ming Lin

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> We need to test on large HW raid setups like a Netapp filer (or even
> local SAS drives connected via some SAS controller). Like a 8+2 drive
> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
> devices is also useful. It is larger RAID setups that will be more
> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> size boundaries.

Here are test results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe targets.
Each case ran for 0.5 hours, so it took 36 hours to finish all the tests on the 4.1-rc4 and 4.1-rc4-patched kernels.

No performance regressions were introduced.

Test server: Dell R730xd (2 sockets/48 logical cpus/264G memory)
HW RAID6/MD RAID6/DM stripe targets were configured with 10 HDDs, each 280G
Stripe size 64k and 128k were tested.

devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
spare_devs="/dev/sdl /dev/sdm"
stripe_size=64 (or 128)

MD RAID6 was created by:
mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size

DM stripe target was created by:
pvcreate $devs
vgcreate striped_vol_group $devs
lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

Here is an example of fio script for stripe size 128k:
[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
gtod_reduce=0
norandommap
write_iops_log=fs

[job1]
bs=1280K
directory=/mnt
size=5G
rw=read

All results here: http://minggr.net/pub/20150608/fio_results/

Results summary:

1. HW RAID6: stripe size 64k
4.1-rc4 4.1-rc4-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 821.23 812.20 -1.09%
xfs write: 753.16 754.42 +0.16%
ext4 read: 827.80 834.82 +0.84%
ext4 write: 783.08 777.58 -0.70%
btrfs read: 859.26 871.68 +1.44%
btrfs write: 815.63 844.40 +3.52%

2. HW RAID6: stripe size 128k
4.1-rc4 4.1-rc4-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 948.27 979.11 +3.25%
xfs write: 820.78 819.94 -0.10%
ext4 read: 978.35 997.92 +2.00%
ext4 write: 853.51 847.97 -0.64%
btrfs read: 1013.1 1015.6 +0.24%
btrfs write: 854.43 850.42 -0.46%

3. MD RAID6: stripe size 64k
4.1-rc4 4.1-rc4-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 847.34 869.43 +2.60%
xfs write: 198.67 199.03 +0.18%
ext4 read: 763.89 767.79 +0.51%
ext4 write: 281.44 282.83 +0.49%
btrfs read: 756.02 743.69 -1.63%
btrfs write: 268.37 265.93 -0.90%

4. MD RAID6: stripe size 128k
4.1-rc4 4.1-rc4-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 993.04 1014.1 +2.12%
xfs write: 293.06 298.95 +2.00%
ext4 read: 1019.6 1020.9 +0.12%
ext4 write: 371.51 371.47 -0.01%
btrfs read: 1000.4 1020.8 +2.03%
btrfs write: 241.08 246.77 +2.36%

5. DM: stripe size 64k
4.1-rc4 4.1-rc4-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 1084.4 1080.1 -0.39%
xfs write: 1071.1 1063.4 -0.71%
ext4 read: 991.54 1003.7 +1.22%
ext4 write: 1069.7 1052.2 -1.63%
btrfs read: 1076.1 1082.1 +0.55%
btrfs write: 968.98 965.07 -0.40%

6. DM: stripe size 128k
4.1-rc4 4.1-rc4-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 1020.4 1066.1 +4.47%
xfs write: 1058.2 1066.6 +0.79%
ext4 read: 990.72 988.19 -0.25%
ext4 write: 1050.4 1070.2 +1.88%
btrfs read: 1080.9 1074.7 -0.57%
btrfs write: 975.10 972.76 -0.23%
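As an aside for anyone re-deriving these deltas: the percentages above appear to be truncated (not rounded) to two decimals; a small sketch that reproduces a few of the table-1 values from the raw throughputs:

```python
import math

def delta_pct(base: float, patched: float) -> float:
    """Percent change from base to patched, truncated to two decimals."""
    return math.trunc((patched - base) / base * 100 * 100) / 100

# HW RAID6, stripe size 64k (table 1 above)
print(delta_pct(821.23, 812.20))  # xfs read    -> -1.09
print(delta_pct(815.63, 844.40))  # btrfs write -> 3.52
```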




2015-06-10 21:20:53

by Ming Lin

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[email protected]> wrote:
> On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> We need to test on large HW raid setups like a Netapp filer (or even
>> local SAS drives connected via some SAS controller). Like a 8+2 drive
>> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
>> devices is also useful. It is larger RAID setups that will be more
>> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> size boundaries.
>
> Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>
> No performance regressions were introduced.
>
> Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> Stripe size 64k and 128k were tested.
>
> devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> spare_devs="/dev/sdl /dev/sdm"
> stripe_size=64 (or 128)
>
> MD RAID6 was created by:
> mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>
> DM stripe target was created by:
> pvcreate $devs
> vgcreate striped_vol_group $devs
> lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> Here is an example of fio script for stripe size 128k:
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> gtod_reduce=0
> norandommap
> write_iops_log=fs
>
> [job1]
> bs=1280K
> directory=/mnt
> size=5G
> rw=read
>
> All results here: http://minggr.net/pub/20150608/fio_results/
>
> Results summary:
>
> 1. HW RAID6: stripe size 64k
> 4.1-rc4 4.1-rc4-patched
> ------- ---------------
> (MB/s) (MB/s)
> xfs read: 821.23 812.20 -1.09%
> xfs write: 753.16 754.42 +0.16%
> ext4 read: 827.80 834.82 +0.84%
> ext4 write: 783.08 777.58 -0.70%
> btrfs read: 859.26 871.68 +1.44%
> btrfs write: 815.63 844.40 +3.52%
>
> 2. HW RAID6: stripe size 128k
> 4.1-rc4 4.1-rc4-patched
> ------- ---------------
> (MB/s) (MB/s)
> xfs read: 948.27 979.11 +3.25%
> xfs write: 820.78 819.94 -0.10%
> ext4 read: 978.35 997.92 +2.00%
> ext4 write: 853.51 847.97 -0.64%
> btrfs read: 1013.1 1015.6 +0.24%
> btrfs write: 854.43 850.42 -0.46%
>
> 3. MD RAID6: stripe size 64k
> 4.1-rc4 4.1-rc4-patched
> ------- ---------------
> (MB/s) (MB/s)
> xfs read: 847.34 869.43 +2.60%
> xfs write: 198.67 199.03 +0.18%
> ext4 read: 763.89 767.79 +0.51%
> ext4 write: 281.44 282.83 +0.49%
> btrfs read: 756.02 743.69 -1.63%
> btrfs write: 268.37 265.93 -0.90%
>
> 4. MD RAID6: stripe size 128k
> 4.1-rc4 4.1-rc4-patched
> ------- ---------------
> (MB/s) (MB/s)
> xfs read: 993.04 1014.1 +2.12%
> xfs write: 293.06 298.95 +2.00%
> ext4 read: 1019.6 1020.9 +0.12%
> ext4 write: 371.51 371.47 -0.01%
> btrfs read: 1000.4 1020.8 +2.03%
> btrfs write: 241.08 246.77 +2.36%
>
> 5. DM: stripe size 64k
> 4.1-rc4 4.1-rc4-patched
> ------- ---------------
> (MB/s) (MB/s)
> xfs read: 1084.4 1080.1 -0.39%
> xfs write: 1071.1 1063.4 -0.71%
> ext4 read: 991.54 1003.7 +1.22%
> ext4 write: 1069.7 1052.2 -1.63%
> btrfs read: 1076.1 1082.1 +0.55%
> btrfs write: 968.98 965.07 -0.40%
>
> 6. DM: stripe size 128k
> 4.1-rc4 4.1-rc4-patched
> ------- ---------------
> (MB/s) (MB/s)
> xfs read: 1020.4 1066.1 +4.47%
> xfs write: 1058.2 1066.6 +0.79%
> ext4 read: 990.72 988.19 -0.25%
> ext4 write: 1050.4 1070.2 +1.88%
> btrfs read: 1080.9 1074.7 -0.57%
> btrfs write: 975.10 972.76 -0.23%

Hi Mike,

How about these numbers?

I'm also happy to run other fio jobs your team used.

Thanks.

2015-06-10 21:46:21

by Mike Snitzer

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Wed, Jun 10 2015 at 5:20pm -0400,
Ming Lin <[email protected]> wrote:

> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[email protected]> wrote:
> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> >> We need to test on large HW raid setups like a Netapp filer (or even
> >> local SAS drives connected via some SAS controller). Like a 8+2 drive
> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
> >> devices is also useful. It is larger RAID setups that will be more
> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> >> size boundaries.
> >
> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
> >
> > No performance regressions were introduced.
> >
> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
> > Stripe size 64k and 128k were tested.
> >
> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> > spare_devs="/dev/sdl /dev/sdm"
> > stripe_size=64 (or 128)
> >
> > MD RAID6 was created by:
> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
> >
> > DM stripe target was created by:
> > pvcreate $devs
> > vgcreate striped_vol_group $devs
> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group

DM had a regression relative to merge_bvec that wasn't fixed until
recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
casting bug in dm_merge_bvec()"). It was introduced in 4.1.

So your 4.1-rc4 DM stripe testing may have effectively been with
merge_bvec disabled.

> > Here is an example of fio script for stripe size 128k:
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=48
> > gtod_reduce=0
> > norandommap
> > write_iops_log=fs
> >
> > [job1]
> > bs=1280K
> > directory=/mnt
> > size=5G
> > rw=read
> >
> > All results here: http://minggr.net/pub/20150608/fio_results/
> >
> > Results summary:
> >
> > 1. HW RAID6: stripe size 64k
> > 4.1-rc4 4.1-rc4-patched
> > ------- ---------------
> > (MB/s) (MB/s)
> > xfs read: 821.23 812.20 -1.09%
> > xfs write: 753.16 754.42 +0.16%
> > ext4 read: 827.80 834.82 +0.84%
> > ext4 write: 783.08 777.58 -0.70%
> > btrfs read: 859.26 871.68 +1.44%
> > btrfs write: 815.63 844.40 +3.52%
> >
> > 2. HW RAID6: stripe size 128k
> > 4.1-rc4 4.1-rc4-patched
> > ------- ---------------
> > (MB/s) (MB/s)
> > xfs read: 948.27 979.11 +3.25%
> > xfs write: 820.78 819.94 -0.10%
> > ext4 read: 978.35 997.92 +2.00%
> > ext4 write: 853.51 847.97 -0.64%
> > btrfs read: 1013.1 1015.6 +0.24%
> > btrfs write: 854.43 850.42 -0.46%
> >
> > 3. MD RAID6: stripe size 64k
> > 4.1-rc4 4.1-rc4-patched
> > ------- ---------------
> > (MB/s) (MB/s)
> > xfs read: 847.34 869.43 +2.60%
> > xfs write: 198.67 199.03 +0.18%
> > ext4 read: 763.89 767.79 +0.51%
> > ext4 write: 281.44 282.83 +0.49%
> > btrfs read: 756.02 743.69 -1.63%
> > btrfs write: 268.37 265.93 -0.90%
> >
> > 4. MD RAID6: stripe size 128k
> > 4.1-rc4 4.1-rc4-patched
> > ------- ---------------
> > (MB/s) (MB/s)
> > xfs read: 993.04 1014.1 +2.12%
> > xfs write: 293.06 298.95 +2.00%
> > ext4 read: 1019.6 1020.9 +0.12%
> > ext4 write: 371.51 371.47 -0.01%
> > btrfs read: 1000.4 1020.8 +2.03%
> > btrfs write: 241.08 246.77 +2.36%
> >
> > 5. DM: stripe size 64k
> > 4.1-rc4 4.1-rc4-patched
> > ------- ---------------
> > (MB/s) (MB/s)
> > xfs read: 1084.4 1080.1 -0.39%
> > xfs write: 1071.1 1063.4 -0.71%
> > ext4 read: 991.54 1003.7 +1.22%
> > ext4 write: 1069.7 1052.2 -1.63%
> > btrfs read: 1076.1 1082.1 +0.55%
> > btrfs write: 968.98 965.07 -0.40%
> >
> > 6. DM: stripe size 128k
> > 4.1-rc4 4.1-rc4-patched
> > ------- ---------------
> > (MB/s) (MB/s)
> > xfs read: 1020.4 1066.1 +4.47%
> > xfs write: 1058.2 1066.6 +0.79%
> > ext4 read: 990.72 988.19 -0.25%
> > ext4 write: 1050.4 1070.2 +1.88%
> > btrfs read: 1080.9 1074.7 -0.57%
> > btrfs write: 975.10 972.76 -0.23%
>
> Hi Mike,
>
> How about these numbers?

Looks fairly good. I'm just not sure the workload is going to test the
code paths in question like we'd hope. I'll have to set aside some time
to think through scenarios to test.

My concern still remains that at some point in the future we'll regret
not having merge_bvec but it'll be too late. That is just my own FUD at
this point...

> I'm also happy to run other fio jobs your team used.

I've been busy getting DM changes for the 4.2 merge window finalized.
As such I haven't connected with others on the team to discuss this
issue.

I'll see if we can make time in the next 2 days. But I also have
RHEL-specific kernel deadlines I'm coming up against.

Seems late to be staging this extensive a change for 4.2... are you
pushing for this code to land in the 4.2 merge window? Or do we have
time to work this further and target the 4.3 merge?

Mike

2015-06-10 22:06:24

by Ming Lin

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <[email protected]> wrote:
> On Wed, Jun 10 2015 at 5:20pm -0400,
> Ming Lin <[email protected]> wrote:
>
>> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[email protected]> wrote:
>> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> >> We need to test on large HW raid setups like a Netapp filer (or even
>> >> local SAS drives connected via some SAS controller). Like a 8+2 drive
>> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
>> >> devices is also useful. It is larger RAID setups that will be more
>> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> >> size boundaries.
>> >
>> > Here are tests results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe target.
>> > Each case run 0.5 hour, so it took 36 hours to finish all the tests on 4.1-rc4 and 4.1-rc4-patched kernels.
>> >
>> > No performance regressions were introduced.
>> >
>> > Test server: Dell R730xd(2 sockets/48 logical cpus/264G memory)
>> > HW RAID6/MD RAID6/DM stripe target were configured with 10 HDDs, each 280G
>> > Stripe size 64k and 128k were tested.
>> >
>> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
>> > spare_devs="/dev/sdl /dev/sdm"
>> > stripe_size=64 (or 128)
>> >
>> > MD RAID6 was created by:
>> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>> >
>> > DM stripe target was created by:
>> > pvcreate $devs
>> > vgcreate striped_vol_group $devs
>> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> DM had a regression relative to merge_bvec that wasn't fixed until
> recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> casting bug in dm_merge_bvec()"). It was introduced in 4.1.
>
> So your 4.1-rc4 DM stripe testing may have effectively been with
> merge_bvec disabled.

I'll rebase it to the latest Linus tree and re-run the DM stripe testing.

>
>> > Here is an example of fio script for stripe size 128k:
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=48
>> > gtod_reduce=0
>> > norandommap
>> > write_iops_log=fs
>> >
>> > [job1]
>> > bs=1280K
>> > directory=/mnt
>> > size=5G
>> > rw=read
>> >
>> > All results here: http://minggr.net/pub/20150608/fio_results/
>> >
>> > Results summary:
>> >
>> > 1. HW RAID6: stripe size 64k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 821.23 812.20 -1.09%
>> > xfs write: 753.16 754.42 +0.16%
>> > ext4 read: 827.80 834.82 +0.84%
>> > ext4 write: 783.08 777.58 -0.70%
>> > btrfs read: 859.26 871.68 +1.44%
>> > btrfs write: 815.63 844.40 +3.52%
>> >
>> > 2. HW RAID6: stripe size 128k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 948.27 979.11 +3.25%
>> > xfs write: 820.78 819.94 -0.10%
>> > ext4 read: 978.35 997.92 +2.00%
>> > ext4 write: 853.51 847.97 -0.64%
>> > btrfs read: 1013.1 1015.6 +0.24%
>> > btrfs write: 854.43 850.42 -0.46%
>> >
>> > 3. MD RAID6: stripe size 64k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 847.34 869.43 +2.60%
>> > xfs write: 198.67 199.03 +0.18%
>> > ext4 read: 763.89 767.79 +0.51%
>> > ext4 write: 281.44 282.83 +0.49%
>> > btrfs read: 756.02 743.69 -1.63%
>> > btrfs write: 268.37 265.93 -0.90%
>> >
>> > 4. MD RAID6: stripe size 128k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 993.04 1014.1 +2.12%
>> > xfs write: 293.06 298.95 +2.00%
>> > ext4 read: 1019.6 1020.9 +0.12%
>> > ext4 write: 371.51 371.47 -0.01%
>> > btrfs read: 1000.4 1020.8 +2.03%
>> > btrfs write: 241.08 246.77 +2.36%
>> >
>> > 5. DM: stripe size 64k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 1084.4 1080.1 -0.39%
>> > xfs write: 1071.1 1063.4 -0.71%
>> > ext4 read: 991.54 1003.7 +1.22%
>> > ext4 write: 1069.7 1052.2 -1.63%
>> > btrfs read: 1076.1 1082.1 +0.55%
>> > btrfs write: 968.98 965.07 -0.40%
>> >
>> > 6. DM: stripe size 128k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 1020.4 1066.1 +4.47%
>> > xfs write: 1058.2 1066.6 +0.79%
>> > ext4 read: 990.72 988.19 -0.25%
>> > ext4 write: 1050.4 1070.2 +1.88%
>> > btrfs read: 1080.9 1074.7 -0.57%
>> > btrfs write: 975.10 972.76 -0.23%
>>
>> Hi Mike,
>>
>> How about these numbers?
>
> Looks fairly good. I just am not sure the workload is going to test the
> code paths in question like we'd hope. I'll have to set aside some time

How about adding some counters to record, for example, how many times
->merge_bvec is called in the old kernel and how many times bio splitting is
called in the patched kernel?
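The counting itself is trivial bookkeeping; a userspace sketch of the idea (the MAX_SECTORS value and the "bio_split" counter name are illustrative only, not kernel symbols):

```python
from collections import Counter

# Model of the suggested instrumentation: count how many late splits a
# given bio size would need against a queue limit.
MAX_SECTORS = 2560  # e.g. 1280K expressed in 512-byte sectors (illustrative)

stats = Counter()

def submit_bio(sectors: int) -> None:
    """Split oversized I/O against MAX_SECTORS, bumping a counter per split."""
    while sectors > MAX_SECTORS:
        stats["bio_split"] += 1
        sectors -= MAX_SECTORS

for size in (2560, 2561, 3 * 2560 + 1):
    submit_bio(size)

print(stats["bio_split"])  # -> 4 (0 + 1 + 3 splits)
```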

> to think through scenarios to test.

Great.

>
> My concern still remains that at some point it the future we'll regret
> not having merge_bvec but it'll be too late. That is just my own FUD at
> this point...
>
>> I'm also happy to run other fio jobs your team used.
>
> I've been busy getting DM changes for the 4.2 merge window finalized.
> As such I haven't connected with others on the team to discuss this
> issue.
>
> I'll see if we can make time in the next 2 days. But I also have
> RHEL-specific kernel deadlines I'm coming up against.
>
> Seems late to be staging this extensive a change for 4.2... are you
> pushing for this code to land in the 4.2 merge window? Or do we have
> time to work this further and target the 4.3 merge?

I'm OK with targeting the 4.3 merge window.
But I hope we can get it into the linux-next tree ASAP for wider testing.

>
> Mike

2015-06-12 05:49:09

by Ming Lin

Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Wed, 2015-06-10 at 15:06 -0700, Ming Lin wrote:
> On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <[email protected]> wrote:
> > On Wed, Jun 10 2015 at 5:20pm -0400,
> > Ming Lin <[email protected]> wrote:
> >
> >> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[email protected]> wrote:
> >> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
> >> >> We need to test on large HW raid setups like a Netapp filer (or even
>> >> local SAS drives connected via some SAS controller). Like an 8+2 drive
> >> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
> >> >> devices is also useful. It is larger RAID setups that will be more
> >> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
> >> >> size boundaries.
> >> >
>> > Here are test results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe targets.
>> > Each case ran for 0.5 hours, so it took 36 hours to finish all the tests on the 4.1-rc4 and 4.1-rc4-patched kernels.
> >> >
> >> > No performance regressions were introduced.
> >> >
>> > Test server: Dell R730xd (2 sockets/48 logical CPUs/264G memory)
>> > The HW RAID6, MD RAID6, and DM stripe targets were each configured with 10 HDDs, 280G each
>> > Stripe sizes of 64k and 128k were tested.
> >> >
> >> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
> >> > spare_devs="/dev/sdl /dev/sdm"
> >> > stripe_size=64 (or 128)
> >> >
> >> > MD RAID6 was created by:
> >> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
> >> >
> >> > DM stripe target was created by:
> >> > pvcreate $devs
> >> > vgcreate striped_vol_group $devs
> >> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
> >
> > DM had a regression relative to merge_bvec that wasn't fixed until
> > recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> > casting bug in dm_merge_bvec()"). It was introduced in 4.1.
> >
> > So your 4.1-rc4 DM stripe testing may have effectively been with
> > merge_bvec disabled.
>
> I'll rebase it onto the latest Linus tree and re-run the DM stripe testing.

Here are the results for 4.1-rc7. They also look good.

5. DM: stripe size 64k
4.1-rc7 4.1-rc7-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 784.0 783.5 -0.06%
xfs write: 751.8 768.8 +2.26%
ext4 read: 837.0 832.3 -0.56%
ext4 write: 806.8 814.3 +0.92%
btrfs read: 787.5 786.1 -0.17%
btrfs write: 722.8 718.7 -0.56%


6. DM: stripe size 128k
4.1-rc7 4.1-rc7-patched
------- ---------------
(MB/s) (MB/s)
xfs read: 1045.5 1068.8 +2.22%
xfs write: 1058.9 1052.7 -0.58%
ext4 read: 1001.8 1020.7 +1.88%
ext4 write: 1049.9 1053.7 +0.36%
btrfs read: 1082.8 1084.8 +0.18%
btrfs write: 948.15 948.74 +0.06%
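For reference, the delta column in these tables is (patched - base) / base; a tiny awk helper reproduces it (the sample values below are from the xfs write row of the 64k table):

```shell
# Percent change between baseline and patched throughput, as shown in
# the delta column of the tables above.
delta() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.2f%%\n", (b - a) / a * 100 }'; }

delta 751.8 768.8   # xfs write, 64k stripe: prints +2.26%
```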

2015-06-18 05:27:53

by Ming Lin

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

On Wed, Jun 10, 2015 at 2:46 PM, Mike Snitzer <[email protected]> wrote:
> On Wed, Jun 10 2015 at 5:20pm -0400,
> Ming Lin <[email protected]> wrote:
>
>> On Mon, Jun 8, 2015 at 11:09 PM, Ming Lin <[email protected]> wrote:
>> > On Thu, 2015-06-04 at 17:06 -0400, Mike Snitzer wrote:
>> >> We need to test on large HW raid setups like a Netapp filer (or even
>> >> local SAS drives connected via some SAS controller). Like an 8+2 drive
>> >> RAID6 or 8+1 RAID5 setup. Testing with MD raid on JBOD setups with 8
>> >> devices is also useful. It is larger RAID setups that will be more
>> >> sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> >> size boundaries.
>> >
>> > Here are test results of xfs/ext4/btrfs read/write on HW RAID6/MD RAID6/DM stripe targets.
>> > Each case ran for 0.5 hours, so it took 36 hours to finish all the tests on the 4.1-rc4 and 4.1-rc4-patched kernels.
>> >
>> > No performance regressions were introduced.
>> >
>> > Test server: Dell R730xd (2 sockets/48 logical CPUs/264G memory)
>> > The HW RAID6, MD RAID6, and DM stripe targets were each configured with 10 HDDs, 280G each
>> > Stripe sizes of 64k and 128k were tested.
>> >
>> > devs="/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk"
>> > spare_devs="/dev/sdl /dev/sdm"
>> > stripe_size=64 (or 128)
>> >
>> > MD RAID6 was created by:
>> > mdadm --create --verbose /dev/md0 --level=6 --raid-devices=10 $devs --spare-devices=2 $spare_devs -c $stripe_size
>> >
>> > DM stripe target was created by:
>> > pvcreate $devs
>> > vgcreate striped_vol_group $devs
>> > lvcreate -i10 -I${stripe_size} -L2T -nstriped_logical_volume striped_vol_group
>
> DM had a regression relative to merge_bvec that wasn't fixed until
> recently (it wasn't in 4.1-rc4), see commit 1c220c69ce0 ("dm: fix
> casting bug in dm_merge_bvec()"). It was introduced in 4.1.
>
> So your 4.1-rc4 DM stripe testing may have effectively been with
> merge_bvec disabled.
>
>> > Here is an example of fio script for stripe size 128k:
>> > [global]
>> > ioengine=libaio
>> > iodepth=64
>> > direct=1
>> > runtime=1800
>> > time_based
>> > group_reporting
>> > numjobs=48
>> > gtod_reduce=0
>> > norandommap
>> > write_iops_log=fs
>> >
>> > [job1]
>> > bs=1280K
>> > directory=/mnt
>> > size=5G
>> > rw=read
>> >
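Note on the job above: for the DM stripe target, bs=1280K works out to exactly one full stripe (10 stripes x 128k chunk), which is presumably why it was chosen; a quick check:

```shell
# Full-stripe size of the 10-way DM stripe target at a 128k chunk;
# matches the fio bs=1280K in the job file above.
stripes=10
chunk_kb=128
echo "bs=$(( stripes * chunk_kb ))K"
```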
>> > All results here: http://minggr.net/pub/20150608/fio_results/
>> >
>> > Results summary:
>> >
>> > 1. HW RAID6: stripe size 64k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 821.23 812.20 -1.09%
>> > xfs write: 753.16 754.42 +0.16%
>> > ext4 read: 827.80 834.82 +0.84%
>> > ext4 write: 783.08 777.58 -0.70%
>> > btrfs read: 859.26 871.68 +1.44%
>> > btrfs write: 815.63 844.40 +3.52%
>> >
>> > 2. HW RAID6: stripe size 128k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 948.27 979.11 +3.25%
>> > xfs write: 820.78 819.94 -0.10%
>> > ext4 read: 978.35 997.92 +2.00%
>> > ext4 write: 853.51 847.97 -0.64%
>> > btrfs read: 1013.1 1015.6 +0.24%
>> > btrfs write: 854.43 850.42 -0.46%
>> >
>> > 3. MD RAID6: stripe size 64k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 847.34 869.43 +2.60%
>> > xfs write: 198.67 199.03 +0.18%
>> > ext4 read: 763.89 767.79 +0.51%
>> > ext4 write: 281.44 282.83 +0.49%
>> > btrfs read: 756.02 743.69 -1.63%
>> > btrfs write: 268.37 265.93 -0.90%
>> >
>> > 4. MD RAID6: stripe size 128k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 993.04 1014.1 +2.12%
>> > xfs write: 293.06 298.95 +2.00%
>> > ext4 read: 1019.6 1020.9 +0.12%
>> > ext4 write: 371.51 371.47 -0.01%
>> > btrfs read: 1000.4 1020.8 +2.03%
>> > btrfs write: 241.08 246.77 +2.36%
>> >
>> > 5. DM: stripe size 64k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 1084.4 1080.1 -0.39%
>> > xfs write: 1071.1 1063.4 -0.71%
>> > ext4 read: 991.54 1003.7 +1.22%
>> > ext4 write: 1069.7 1052.2 -1.63%
>> > btrfs read: 1076.1 1082.1 +0.55%
>> > btrfs write: 968.98 965.07 -0.40%
>> >
>> > 6. DM: stripe size 128k
>> > 4.1-rc4 4.1-rc4-patched
>> > ------- ---------------
>> > (MB/s) (MB/s)
>> > xfs read: 1020.4 1066.1 +4.47%
>> > xfs write: 1058.2 1066.6 +0.79%
>> > ext4 read: 990.72 988.19 -0.25%
>> > ext4 write: 1050.4 1070.2 +1.88%
>> > btrfs read: 1080.9 1074.7 -0.57%
>> > btrfs write: 975.10 972.76 -0.23%
>>
>> Hi Mike,
>>
>> How about these numbers?
>
> Looks fairly good. I'm just not sure the workload is going to test the
> code paths in question like we'd hope. I'll have to set aside some time
> to think through scenarios to test.

Hi Mike,

Will you get a chance to think about it?

Thanks.

>
> My concern still remains that at some point in the future we'll regret
> not having merge_bvec but it'll be too late. That is just my own FUD at
> this point...
>
>> I'm also happy to run other fio jobs your team used.
>
> I've been busy getting DM changes for the 4.2 merge window finalized.
> As such I haven't connected with others on the team to discuss this
> issue.
>
> I'll see if we can make time in the next 2 days. But I also have
> RHEL-specific kernel deadlines I'm coming up against.
>
> Seems late to be staging this extensive a change for 4.2... are you
> pushing for this code to land in the 4.2 merge window? Or do we have
> time to work this further and target the 4.3 merge?
>
> Mike