2014-12-22 11:49:24

by Dongsu Park

Subject: [RFC PATCH 00/17] simplify block layer based on immutable biovecs

This is a first attempt at simplifying the block layer based on immutable
biovecs. Immutable biovecs, implemented by Kent Overstreet, have been
available in mainline since v3.14. Their original goal was to make
generic_make_request() accept arbitrarily sized bios and to push the
splitting down to the drivers or wherever it is required. See also the
past discussions [1] [2] [3].

This will bring not only performance improvements, but also a large
reduction in code complexity all over the block layer. The performance
gain is possible because bio_add_page() no longer has to check
unnecessary conditions such as queue limits or whether biovecs are
mergeable. Those checks are delegated to the driver level. Kent has
said that he benchmarked the impact of this with fio on a Micron P320h,
and it showed a clearly positive impact.

Moreover, this patchset allows a lot of code to be deleted, mainly
because of the removal of the merge_bvec_fn() callbacks. Handling bio
merging consistently has always been a delicate issue for stacking block
drivers (e.g. md and bcache). This simplification will help every
individual block driver avoid such issues.

- Patch 01/17 lets generic_make_request() handle arbitrarily sized bios,
by making make_request functions call blk_queue_split().
- Patch 02/17 simplifies __bio_add_page() to avoid calling
->merge_bvec_fn().
- Patch 03/17 changes how discard, write_same, and zeroout are issued.
- Patch 04/17 gets rid of workarounds in bcache.
- Patches 05-06/17 remove unnecessary code in btrfs, making use of
immutable biovecs.
- Patches 07-10/17 refactor the block layer to use the new
iov_iter interface.
- Patch 11/17 allows queue_bounce to handle bios with > BIO_MAX_PAGES.
- Patches 12-13/17 refactor and clean up MD-RAID.
- Patch 14/17 removes ->merge_bvec_fn() completely, which affects a lot
of block drivers, such as Ceph RBD, DRBD, device mapper, MD, etc.
- Patches 15-16/17 refactor and clean up filesystems to match the
new APIs, such as immutable biovecs.
- Patch 17/17 updates the documentation about biovecs.

Patches are against 3.19-rc1. These are also available in my git repo at:

https://github.com/dongsupark/linux.git block-generic-req

This patchset is a prerequisite for follow-up patchsets, e.g.
multipage biovecs, rewriting plugging, or rewriting direct-IO, which are
excluded this time. Because of that, this patchset should not bring any
regression to end-users. I have already tested it with xfstests multiple
times. On the other hand, the multipage biovecs part is currently in heavy
development, with help from Kent and Ming Lin. Those experimental patches
are also available on other branches in my git tree. Once they are done,
I'm going to post them for review as well.

Comments are welcome.
Dongsu

[1] https://lkml.org/lkml/2014/11/23/263
[2] https://lkml.org/lkml/2013/11/25/732
[3] https://lkml.org/lkml/2014/2/26/618

Dongsu Park (1):
Documentation: update notes in biovecs about arbitrarily sized bios

Kent Overstreet (16):
block: make generic_make_request handle arbitrarily sized bios
block: simplify bio_add_page()
block: simplify issueing discard, write_same, zeroout
bcache: clean up hacks around bio_split_pool
btrfs: remove bio splitting and merge_bvec_fn() calls
btrfs: make use of immutable biovecs
block: replace sg_iovec with iov_iter
block: refactor __bio_copy_iov()
block: refactor iov_count_pages() from bio_{copy,map}_user_iov()
block: refactor bio_get_user_pages() from __bio_map_user_iov()
block: allow __blk_queue_bounce() to handle bios larger than
BIO_MAX_PAGES
md/raid10: make sync_request_write() call bio_copy_data()
md/raid5: get rid of bio_fits_rdev()
block: kill merge_bvec_fn() completely
fs: use helper bio_add_page() instead of open coding on bi_io_vec
fs: convert buffer head etc. to use immutable biovecs API.

Documentation/block/biovecs.txt | 17 +-
block/bio.c | 430 ++++++++++++----------------
block/blk-core.c | 19 +-
block/blk-lib.c | 173 ++---------
block/blk-map.c | 27 +-
block/blk-merge.c | 140 ++++++++-
block/blk-mq.c | 2 +
block/blk-settings.c | 22 --
block/bounce.c | 60 +++-
block/scsi_ioctl.c | 19 +-
drivers/block/drbd/drbd_int.h | 1 -
drivers/block/drbd/drbd_main.c | 1 -
drivers/block/drbd/drbd_req.c | 37 +--
drivers/block/pktcdvd.c | 27 +-
drivers/block/ps3vram.c | 2 +
drivers/block/rbd.c | 47 ---
drivers/block/rsxx/dev.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 2 +
drivers/md/bcache/bcache.h | 18 --
drivers/md/bcache/io.c | 100 +------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +-
drivers/md/bcache/super.c | 32 +--
drivers/md/bcache/util.h | 5 +-
drivers/md/bcache/writeback.c | 4 +-
drivers/md/dm-cache-target.c | 21 --
drivers/md/dm-crypt.c | 16 --
drivers/md/dm-era-target.c | 15 -
drivers/md/dm-flakey.c | 16 --
drivers/md/dm-linear.c | 16 --
drivers/md/dm-snap.c | 15 -
drivers/md/dm-stripe.c | 21 --
drivers/md/dm-table.c | 8 -
drivers/md/dm-thin.c | 31 --
drivers/md/dm-verity.c | 16 --
drivers/md/dm.c | 122 +-------
drivers/md/dm.h | 2 -
drivers/md/linear.c | 46 ---
drivers/md/md.c | 4 +-
drivers/md/md.h | 8 -
drivers/md/multipath.c | 21 --
drivers/md/raid0.c | 57 ----
drivers/md/raid0.h | 2 -
drivers/md/raid1.c | 59 +---
drivers/md/raid10.c | 142 +--------
drivers/md/raid5.c | 51 +---
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/scsi/sg.c | 15 +-
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
fs/btrfs/check-integrity.c | 22 +-
fs/btrfs/extent_io.c | 12 +-
fs/btrfs/file-item.c | 61 ++--
fs/btrfs/inode.c | 22 +-
fs/btrfs/volumes.c | 73 -----
fs/buffer.c | 11 +-
fs/jfs/jfs_logmgr.c | 14 +-
include/linux/bio.h | 10 +-
include/linux/blkdev.h | 17 +-
include/linux/device-mapper.h | 4 -
include/linux/uio.h | 2 +
kernel/power/block_io.c | 23 +-
lib/iovec.c | 30 ++
mm/page_io.c | 8 +-
65 files changed, 628 insertions(+), 1600 deletions(-)

--
2.1.0


2014-12-22 11:49:38

by Dongsu Park

Subject: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

From: Kent Overstreet <[email protected]>

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.
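
The split-and-resubmit flow described above can be modeled in user
space. This is only a rough sketch, not the kernel code: toy_bio,
toy_queue_split() and toy_make_request() are hypothetical names, and a
single max_sectors limit stands in for the real queue limits. The front
piece is what the driver keeps working on, while the remainder goes
through the submission path again, much like the generic_make_request()
call on the remainder in blk_queue_split().

```c
#include <assert.h>

/* Hypothetical stand-in for struct bio: a contiguous run of sectors. */
struct toy_bio {
	unsigned long long sector;	/* first sector */
	unsigned int nr_sectors;	/* length in sectors */
};

/*
 * Modeled on the shape of blk_queue_split(): if *bio is larger than
 * max_sectors, shrink it to the front piece, describe the remainder in
 * *rest and return 1. Otherwise leave it untouched and return 0.
 */
int toy_queue_split(struct toy_bio *bio, unsigned int max_sectors,
		    struct toy_bio *rest)
{
	assert(max_sectors > 0);

	if (bio->nr_sectors <= max_sectors)
		return 0;

	rest->sector = bio->sector + max_sectors;
	rest->nr_sectors = bio->nr_sectors - max_sectors;
	bio->nr_sectors = max_sectors;
	return 1;
}

/*
 * Stand-in for a make_request function that calls the split helper
 * first. Returns how many pieces ended up being handed to the driver.
 */
unsigned int toy_make_request(unsigned long long sector,
			      unsigned int nr_sectors,
			      unsigned int max_sectors)
{
	struct toy_bio bio = { sector, nr_sectors };
	struct toy_bio rest;
	unsigned int pieces = 0;

	for (;;) {
		int was_split = toy_queue_split(&bio, max_sectors, &rest);

		pieces++;	/* the front piece is handled by the driver */
		if (!was_split)
			break;
		bio = rest;	/* the remainder is submitted again */
	}
	return pieces;
}
```

With a limit of 4 sectors, toy_make_request(0, 10, 4) hands three pieces
to the driver (4 + 4 + 2 sectors). The caller that built the 10-sector
bio never had to know about the limit.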

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Ming Lin <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Jiri Kosina <[email protected]>
Cc: Geoff Levand <[email protected]>
Cc: Jim Paris <[email protected]>
Cc: Joshua Morris <[email protected]>
Cc: Philip Kelleher <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Oleg Drokin <[email protected]>
Cc: Andreas Dilger <[email protected]>
---
block/blk-core.c | 19 ++--
block/blk-merge.c | 151 ++++++++++++++++++++++++++--
block/blk-mq.c | 2 +
drivers/block/drbd/drbd_req.c | 2 +
drivers/block/pktcdvd.c | 6 +-
drivers/block/ps3vram.c | 2 +
drivers/block/rsxx/dev.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 2 +
drivers/md/dm.c | 2 +
drivers/md/md.c | 2 +
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
include/linux/blkdev.h | 3 +
15 files changed, 179 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 30f6153..e86ad75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -585,6 +585,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;

+ q->bio_split = bioset_create(4, 0);
+ if (!q->bio_split)
+ goto fail_id;
+
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
q->backing_dev_info.state = 0;
@@ -594,7 +598,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

err = bdi_init(&q->backing_dev_info);
if (err)
- goto fail_id;
+ goto fail_split;

setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
laptop_mode_timer_fn, (unsigned long) q);
@@ -636,6 +640,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

fail_bdi:
bdi_destroy(&q->backing_dev_info);
+fail_split:
+ bioset_free(q->bio_split);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_q:
@@ -1552,6 +1558,8 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
struct request *req;
unsigned int request_count = 0;

+ blk_queue_split(q, &bio, q->bio_split);
+
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
@@ -1775,15 +1783,6 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}

- if (likely(bio_is_rw(bio) &&
- nr_sectors > queue_max_hw_sectors(q))) {
- printk(KERN_ERR "bio too big device %s (%u > %u)\n",
- bdevname(bio->bi_bdev, b),
- bio_sectors(bio),
- queue_max_hw_sectors(q));
- goto end_io;
- }
-
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 89b97b5..3bc2068 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,12 +9,150 @@

#include "blk.h"

+static struct bio *blk_bio_discard_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ unsigned int max_discard_sectors, granularity;
+ int alignment;
+ sector_t tmp;
+ unsigned split_sectors;
+
+ /* Zero-sector (unknown) and one-sector granularities are the same. */
+ granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+ max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+ max_discard_sectors -= max_discard_sectors % granularity;
+
+ if (unlikely(!max_discard_sectors)) {
+ /* XXX: warn */
+ return NULL;
+ }
+
+ if (bio_sectors(bio) <= max_discard_sectors)
+ return NULL;
+
+ split_sectors = max_discard_sectors;
+
+ /*
+ * If the next starting sector would be misaligned, stop the discard at
+ * the previous aligned sector.
+ */
+ alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+ tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+ tmp = sector_div(tmp, granularity);
+
+ if (split_sectors > tmp)
+ split_sectors -= tmp;
+
+ return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ if (!q->limits.max_write_same_sectors)
+ return NULL;
+
+ if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+ return NULL;
+
+ return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+ struct bio *bio,
+ struct bio_set *bs)
+{
+ struct bio *split;
+ struct bio_vec bv = { 0 }, bvprv = { 0 };
+ struct bvec_iter iter;
+ unsigned seg_size = 0, nsegs = 0;
+ int prev = 0;
+
+ struct bvec_merge_data bvm = {
+ .bi_bdev = bio->bi_bdev,
+ .bi_sector = bio->bi_iter.bi_sector,
+ .bi_size = 0,
+ .bi_rw = bio->bi_rw,
+ };
+
+ bio_for_each_segment(bv, bio, iter) {
+ if (q->merge_bvec_fn &&
+ q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+ goto split;
+
+ bvm.bi_size += bv.bv_len;
+
+ if (bvm.bi_size >> 9 > queue_max_sectors(q))
+ goto split;
+
+ if (prev && blk_queue_cluster(q)) {
+ if (seg_size + bv.bv_len > queue_max_segment_size(q))
+ goto new_segment;
+ if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+ goto new_segment;
+ if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+ goto new_segment;
+
+ seg_size += bv.bv_len;
+ bvprv = bv;
+ prev = 1;
+ continue;
+ }
+new_segment:
+ if (nsegs == queue_max_segments(q))
+ goto split;
+
+ nsegs++;
+ bvprv = bv;
+ prev = 1;
+ seg_size = bv.bv_len;
+ }
+
+ return NULL;
+split:
+ split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+ split->bi_iter.bi_size -= iter.bi_size;
+ bio->bi_iter = iter;
+
+ if (bio_integrity(bio)) {
+ bio_integrity_advance(bio, split->bi_iter.bi_size);
+ bio_integrity_trim(split, 0, bio_sectors(split));
+ }
+
+ return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+ struct bio_set *bs)
+{
+ struct bio *split;
+
+ if ((*bio)->bi_rw & REQ_DISCARD)
+ split = blk_bio_discard_split(q, *bio, bs);
+ else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+ split = blk_bio_write_same_split(q, *bio, bs);
+ else
+ split = blk_bio_segment_split(q, *bio, q->bio_split);
+
+ if (split) {
+ bio_chain(split, *bio);
+ generic_make_request(*bio);
+ *bio = split;
+ }
+}
+EXPORT_SYMBOL(blk_queue_split);
+
static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio *bio,
bool no_sg_merge)
{
struct bio_vec bv, bvprv = { NULL };
- int cluster, high, highprv = 1;
+ int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
struct bvec_iter iter;
@@ -36,7 +174,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
cluster = blk_queue_cluster(q);
seg_size = 0;
nr_phys_segs = 0;
- high = 0;
for_each_bio(bio) {
bio_for_each_segment(bv, bio, iter) {
/*
@@ -46,13 +183,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (no_sg_merge)
goto new_segment;

- /*
- * the trick here is making sure that a high page is
- * never considered part of another segment, since
- * that might change with the bounce page.
- */
- high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
- if (!high && !highprv && cluster) {
+ if (prev && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
@@ -72,8 +203,8 @@ new_segment:

nr_phys_segs++;
bvprv = bv;
+ prev = 1;
seg_size = bv.bv_len;
- highprv = high;
}
bbio = bio;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index da1ab56..20b3ddb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1259,6 +1259,8 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
return;
}

+ blk_queue_split(q, &bio, q->bio_split);
+
if (use_plug && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count))
return;
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 34f2f0b..dee706d 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1496,6 +1496,8 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
struct drbd_device *device = (struct drbd_device *) q->queuedata;
unsigned long start_jif;

+ blk_queue_split(q, &bio, q->bio_split);
+
start_jif = jiffies;

/*
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628da..ea10bd9 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2446,6 +2446,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
char b[BDEVNAME_SIZE];
struct bio *split;

+ blk_queue_bounce(q, &bio);
+
+ blk_queue_split(q, &bio, q->bio_split);
+
pd = q->queuedata;
if (!pd) {
pr_err("%s incorrect request queue\n",
@@ -2476,8 +2480,6 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
goto end_io;
}

- blk_queue_bounce(q, &bio);
-
do {
sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
sector_t last_zone = get_zone(bio_end_sector(bio) - 1, pd);
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index ef45cfb..a995972 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -603,6 +603,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
int busy;

+ blk_queue_split(q, &bio, q->bio_split);
+
dev_dbg(&dev->core, "%s\n", __func__);

spin_lock_irq(&priv->lock);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index ac8c62c..50ef199 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -148,6 +148,8 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
struct rsxx_bio_meta *bio_meta;
int st = -EINVAL;

+ blk_queue_split(q, &bio, q->bio_split);
+
might_sleep();

if (!card)
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5..13d577c 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -531,6 +531,8 @@ static void mm_make_request(struct request_queue *q, struct bio *bio)
(unsigned long long)bio->bi_iter.bi_sector,
bio->bi_iter.bi_size);

+ blk_queue_split(q, &bio, q->bio_split);
+
spin_lock_irq(&card->lock);
*card->biotail = bio;
bio->bi_next = NULL;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bd8bda3..19526d0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -909,6 +909,8 @@ static void zram_make_request(struct request_queue *queue, struct bio *bio)
{
struct zram *zram = queue->queuedata;

+ blk_queue_split(queue, &bio, queue->bio_split);
+
down_read(&zram->init_lock);
if (unlikely(!init_done(zram)))
goto error;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4c06585..5ce28a4 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1680,6 +1680,8 @@ static void dm_request(struct request_queue *q, struct bio *bio)
{
struct mapped_device *md = q->queuedata;

+ blk_queue_split(q, &bio, q->bio_split);
+
if (dm_request_based(md))
blk_queue_bio(q, bio);
else
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 709755f..48234eb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -249,6 +249,8 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
struct mddev *mddev = q->queuedata;
unsigned int sectors;

+ blk_queue_split(q, &bio, q->bio_split);
+
if (mddev == NULL || mddev->pers == NULL
|| !mddev->ready) {
bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index b550c8c..658bb7e 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -815,6 +815,8 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
unsigned long source_addr;
unsigned long bytes_done;

+ blk_queue_split(q, &bio, q->bio_split);
+
bytes_done = 0;
dev_info = bio->bi_bdev->bd_disk->private_data;
if (dev_info == NULL)
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e939..1305ed3 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,6 +190,8 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
unsigned long page_addr;
unsigned long bytes;

+ blk_queue_split(q, &bio, q->bio_split);
+
if ((bio->bi_iter.bi_sector & 7) != 0 ||
(bio->bi_iter.bi_size & 4095) != 0)
/* Request is not page-aligned. */
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index 0312488..fc85916 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -341,6 +341,8 @@ static void loop_make_request(struct request_queue *q, struct bio *old_bio)
int rw = bio_rw(old_bio);
int inactive;

+ blk_queue_split(q, &old_bio, q->bio_split);
+
if (!lo)
goto err;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 92f4b4b..191ee4b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -484,6 +484,7 @@ struct request_queue {

struct blk_mq_tag_set *tag_set;
struct list_head tag_set_list;
+ struct bio_set *bio_split;
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
@@ -807,6 +808,8 @@ extern void blk_rq_unprep_clone(struct request *rq);
extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+ struct bio_set *);
extern void blk_recount_segments(struct request_queue *, struct bio *);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
--
2.1.0

2014-12-22 11:49:43

by Dongsu Park

Subject: [RFC PATCH 02/17] block: simplify bio_add_page()

From: Kent Overstreet <[email protected]>

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
__bio_add_page() no longer needs to call ->merge_bvec_fn(), so we
can get rid of the unnecessary code paths.
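
As a rough illustration of what is left once the queue-limit checks are
gone, here is a user-space model of the simplified add path. The types
toy_bvec and toy_bio and the function toy_bio_add_page() are
hypothetical; the model leaves out the BIO_CLONED check and keeps only
the same-page merge and the "array full" failure.

```c
#include <assert.h>

/* Hypothetical stand-in for struct bio_vec. */
struct toy_bvec {
	const void *page;
	unsigned int len;
	unsigned int offset;
};

/* Hypothetical stand-in for the parts of struct bio used here. */
struct toy_bio {
	struct toy_bvec *vecs;		/* caller-provided array */
	unsigned int vcnt;		/* entries in use */
	unsigned int max_vecs;		/* capacity of vecs[] */
	unsigned int size;		/* total bytes described */
};

/*
 * Returns len on success and 0 if the vec array is full. No device
 * limits are consulted; oversized bios are split later.
 */
unsigned int toy_bio_add_page(struct toy_bio *bio, const void *page,
			      unsigned int len, unsigned int offset)
{
	struct toy_bvec *bv;

	assert(bio->vcnt <= bio->max_vecs);

	/* Same page, directly following the last entry: just extend it. */
	if (bio->vcnt > 0) {
		bv = &bio->vecs[bio->vcnt - 1];
		if (page == bv->page && offset == bv->offset + bv->len) {
			bv->len += len;
			bio->size += len;
			return len;
		}
	}

	/* The only remaining failure: no free slot in the array. */
	if (bio->vcnt >= bio->max_vecs)
		return 0;

	bv = &bio->vecs[bio->vcnt++];
	bv->page = page;
	bv->len = len;
	bv->offset = offset;
	bio->size += len;
	return len;
}
```

Adding two consecutive 512-byte blocks from the same page consumes one
slot, so a filesystem with a block size below the page size does not
burn an entry per block.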

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: rebase and resolve merge conflicts]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Ming Lin <[email protected]>
---
block/bio.c | 135 +++++++++++++++++++++++++-----------------------------------
1 file changed, 55 insertions(+), 80 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 471d738..955bc57 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -700,9 +700,23 @@ int bio_get_nr_vecs(struct block_device *bdev)
}
EXPORT_SYMBOL(bio_get_nr_vecs);

-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+/**
+ * bio_add_pc_page - attempt to add page to bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by REQ_PC bios.
+ */
+int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
+ *page, unsigned int len, unsigned int offset)
{
int retried_segments = 0;
struct bio_vec *bvec;
@@ -713,7 +727,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (unlikely(bio_flagged(bio, BIO_CLONED)))
return 0;

- if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
+ if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
return 0;

/*
@@ -726,28 +740,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page

if (page == prev->bv_page &&
offset == prev->bv_offset + prev->bv_len) {
- unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
-
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- /* prev_bvec is already charged in
- bi_size, discharge it in order to
- simulate merging updated prev_bvec
- as new bvec. */
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = bio->bi_iter.bi_size -
- prev_bv_len,
- .bi_rw = bio->bi_rw,
- };
-
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
- prev->bv_len -= len;
- return 0;
- }
- }
-
bio->bi_iter.bi_size += len;
goto done;
}
@@ -790,27 +783,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
blk_recount_segments(q, bio);
}

- /*
- * if queue has other restrictions (eg varying max sector size
- * depending on offset), it can specify a merge_bvec_fn in the
- * queue to get further control
- */
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = bio->bi_iter.bi_size - len,
- .bi_rw = bio->bi_rw,
- };
-
- /*
- * merge_bvec_fn() returns number of bytes it can accept
- * at this offset
- */
- if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len)
- goto failed;
- }
-
/* If we may be able to merge these biovecs, force a recount */
if (bio->bi_vcnt > 1 && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
bio->bi_flags &= ~(1 << BIO_SEG_VALID);
@@ -827,28 +799,6 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
blk_recount_segments(q, bio);
return 0;
}
-
-/**
- * bio_add_pc_page - attempt to add page to bio
- * @q: the target queue
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
- *
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
- *
- * This should only be used by REQ_PC bios.
- */
-int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
- unsigned int len, unsigned int offset)
-{
- return __bio_add_page(q, bio, page, len, offset,
- queue_max_hw_sectors(q));
-}
EXPORT_SYMBOL(bio_add_pc_page);

/**
@@ -858,22 +808,47 @@ EXPORT_SYMBOL(bio_add_pc_page);
* @len: vec entry length
* @offset: vec entry offset
*
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
+ * Attempt to add a page to the bio_vec maplist. This will only fail if
+ * bio->bi_vcnt == bio->bi_max_vecs.
*/
-int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
- unsigned int offset)
+int bio_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- unsigned int max_sectors;
+ struct bio_vec *bv;
+
+ /*
+ * cloned bio must not modify vec list
+ */
+ if (unlikely(bio_flagged(bio, BIO_CLONED)))
+ return 0;
+
+ /*
+ * For filesystems with a blocksize smaller than the pagesize
+ * we will often be called with the same page as last time and
+ * a consecutive offset. Optimize this special case.
+ */
+ if (bio->bi_vcnt > 0) {
+ bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ if (page == bv->bv_page &&
+ offset == bv->bv_offset + bv->bv_len) {
+ bv->bv_len += len;
+ goto done;
+ }
+ }

- max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
- if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
- max_sectors = len >> 9;
+ if (bio->bi_vcnt >= bio->bi_max_vecs)
+ return 0;

- return __bio_add_page(q, bio, page, len, offset, max_sectors);
+ bv = &bio->bi_io_vec[bio->bi_vcnt];
+ bv->bv_page = page;
+ bv->bv_len = len;
+ bv->bv_offset = offset;
+
+ bio->bi_vcnt++;
+done:
+ bio->bi_iter.bi_size += len;
+ return len;
}
EXPORT_SYMBOL(bio_add_page);

--
2.1.0

2014-12-22 11:49:53

by Dongsu Park

Subject: [RFC PATCH 05/17] btrfs: remove bio splitting and merge_bvec_fn() calls

From: Kent Overstreet <[email protected]>

Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: [email protected]
---
fs/btrfs/volumes.c | 73 ------------------------------------------------------
1 file changed, 73 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 50c5a87..c627bf8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5691,34 +5691,6 @@ static noinline void btrfs_schedule_bio(struct btrfs_root *root,
&device->work);
}

-static int bio_size_ok(struct block_device *bdev, struct bio *bio,
- sector_t sector)
-{
- struct bio_vec *prev;
- struct request_queue *q = bdev_get_queue(bdev);
- unsigned int max_sectors = queue_max_sectors(q);
- struct bvec_merge_data bvm = {
- .bi_bdev = bdev,
- .bi_sector = sector,
- .bi_rw = bio->bi_rw,
- };
-
- if (WARN_ON(bio->bi_vcnt == 0))
- return 1;
-
- prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
- if (bio_sectors(bio) > max_sectors)
- return 0;
-
- if (!q->merge_bvec_fn)
- return 1;
-
- bvm.bi_size = bio->bi_iter.bi_size - prev->bv_len;
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len)
- return 0;
- return 1;
-}
-
static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
struct bio *bio, u64 physical, int dev_nr,
int rw, int async)
@@ -5752,38 +5724,6 @@ static void submit_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
btrfsic_submit_bio(rw, bio);
}

-static int breakup_stripe_bio(struct btrfs_root *root, struct btrfs_bio *bbio,
- struct bio *first_bio, struct btrfs_device *dev,
- int dev_nr, int rw, int async)
-{
- struct bio_vec *bvec = first_bio->bi_io_vec;
- struct bio *bio;
- int nr_vecs = bio_get_nr_vecs(dev->bdev);
- u64 physical = bbio->stripes[dev_nr].physical;
-
-again:
- bio = btrfs_bio_alloc(dev->bdev, physical >> 9, nr_vecs, GFP_NOFS);
- if (!bio)
- return -ENOMEM;
-
- while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
- if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset) < bvec->bv_len) {
- u64 len = bio->bi_iter.bi_size;
-
- atomic_inc(&bbio->stripes_pending);
- submit_stripe_bio(root, bbio, bio, physical, dev_nr,
- rw, async);
- physical += len;
- goto again;
- }
- bvec++;
- }
-
- submit_stripe_bio(root, bbio, bio, physical, dev_nr, rw, async);
- return 0;
-}
-
static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
{
atomic_inc(&bbio->error);
@@ -5862,19 +5802,6 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
continue;
}

- /*
- * Check and see if we're ok with this bio based on it's size
- * and offset with the given device.
- */
- if (!bio_size_ok(dev->bdev, first_bio,
- bbio->stripes[dev_nr].physical >> 9)) {
- ret = breakup_stripe_bio(root, bbio, first_bio, dev,
- dev_nr, rw, async_submit);
- BUG_ON(ret);
- dev_nr++;
- continue;
- }
-
if (dev_nr < total_devs - 1) {
bio = btrfs_bio_clone(first_bio, GFP_NOFS);
BUG_ON(!bio); /* -ENOMEM */
--
2.1.0

2014-12-22 11:49:49

by Dongsu Park

Subject: [RFC PATCH 03/17] block: simplify issueing discard, write_same, zeroout

From: Kent Overstreet <[email protected]>

Simplify the special cases for issuing discard, write_same, and zeroout
by replacing the bio_batch completions with submit_bio_wait(). This
conversion is possible because generic_make_request() now does for us
what the code in blk-lib.c was doing manually with the bio_batch
machinery. We still need some looping in case we're trying to
discard/zeroout more than a single bio is allowed to carry (the new code
caps discard and write_same bios at 1 << 20 sectors, i.e. 512 MiB), but
when we can submit that much at a time, doing the submissions in
parallel really shouldn't matter.
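
For readers who want to see the shape of the simplified loop outside the
kernel, here is a minimal userspace sketch. It is not the code from the
patch: issue_in_chunks(), submit_fn and sum_sectors() are hypothetical
names, and the callback merely stands in for submit_bio_wait(). Only the
1 << 20 sector cap and the "stop at the first error" behaviour are taken
from the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-bio cap used by the patch for discard/write_same: 1 << 20 sectors. */
#define MAX_BIO_SECTORS (1ULL << 20)

/* Hypothetical stand-in for submit_bio_wait(): returns 0 on success. */
typedef int (*submit_fn)(uint64_t sector, uint64_t nr_sects, void *ctx);

/*
 * Issue the range [sector, sector + nr_sects) as a sequence of chunks,
 * each at most MAX_BIO_SECTORS long, waiting for every chunk before
 * sending the next one. Returns 0, or the first error reported by
 * submit(); *nr_chunks counts the chunks that completed successfully.
 */
int issue_in_chunks(uint64_t sector, uint64_t nr_sects,
		    submit_fn submit, void *ctx, size_t *nr_chunks)
{
	int ret = 0;

	assert(submit != NULL && nr_chunks != NULL);

	*nr_chunks = 0;
	while (nr_sects) {
		uint64_t chunk = nr_sects < MAX_BIO_SECTORS ?
				 nr_sects : MAX_BIO_SECTORS;

		ret = submit(sector, chunk, ctx);
		if (ret)
			break;

		sector += chunk;
		nr_sects -= chunk;
		(*nr_chunks)++;
	}

	return ret;
}

/* Example callback: adds the number of sectors it was given to *ctx. */
int sum_sectors(uint64_t sector, uint64_t nr_sects, void *ctx)
{
	(void)sector;
	*(uint64_t *)ctx += nr_sects;
	return 0;
}
```

With sum_sectors() as the callback, a request for (1 << 21) + 8 sectors
is issued as three chunks: two full ones and an 8-sector tail. There is
no completion counter and no shared error flag any more, because each
chunk is waited for before the next one is built.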

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-lib.c | 173 ++++++++++----------------------------------------------
1 file changed, 29 insertions(+), 144 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 8411be3..deef044 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -9,23 +9,6 @@

#include "blk.h"

-struct bio_batch {
- atomic_t done;
- unsigned long flags;
- struct completion *wait;
-};
-
-static void bio_batch_end_io(struct bio *bio, int err)
-{
- struct bio_batch *bb = bio->bi_private;
-
- if (err && (err != -EOPNOTSUPP))
- clear_bit(BIO_UPTODATE, &bb->flags);
- if (atomic_dec_and_test(&bb->done))
- complete(bb->wait);
- bio_put(bio);
-}
-
/**
* blkdev_issue_discard - queue a discard
* @bdev: blockdev to issue discard for
@@ -40,15 +23,10 @@ static void bio_batch_end_io(struct bio *bio, int err)
int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
{
- DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
int type = REQ_WRITE | REQ_DISCARD;
- unsigned int max_discard_sectors, granularity;
- int alignment;
- struct bio_batch bb;
struct bio *bio;
int ret = 0;
- struct blk_plug plug;

if (!q)
return -ENXIO;
@@ -56,69 +34,27 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
if (!blk_queue_discard(q))
return -EOPNOTSUPP;

- /* Zero-sector (unknown) and one-sector granularities are the same. */
- granularity = max(q->limits.discard_granularity >> 9, 1U);
- alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
- /*
- * Ensure that max_discard_sectors is of the proper
- * granularity, so that requests stay aligned after a split.
- */
- max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
- max_discard_sectors -= max_discard_sectors % granularity;
- if (unlikely(!max_discard_sectors)) {
- /* Avoid infinite loop below. Being cautious never hurts. */
- return -EOPNOTSUPP;
- }
-
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secdiscard(q))
return -EOPNOTSUPP;
type |= REQ_SECURE;
}

- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;
-
- blk_start_plug(&plug);
while (nr_sects) {
- unsigned int req_sects;
- sector_t end_sect, tmp;
-
bio = bio_alloc(gfp_mask, 1);
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
-
- req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
+ if (!bio)
+ return -ENOMEM;

- /*
- * If splitting a request, and the next starting sector would be
- * misaligned, stop the discard at the previous aligned sector.
- */
- end_sect = sector + req_sects;
- tmp = end_sect;
- if (req_sects < nr_sects &&
- sector_div(tmp, granularity) != alignment) {
- end_sect = end_sect - alignment;
- sector_div(end_sect, granularity);
- end_sect = end_sect * granularity + alignment;
- req_sects = end_sect - sector;
- }
-
- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = bio_batch_end_io;
bio->bi_bdev = bdev;
- bio->bi_private = &bb;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_iter.bi_size = min_t(sector_t, nr_sects, 1 << 20) << 9;

- bio->bi_iter.bi_size = req_sects << 9;
- nr_sects -= req_sects;
- sector = end_sect;
+ sector += bio_sectors(bio);
+ nr_sects -= bio_sectors(bio);

- atomic_inc(&bb.done);
- submit_bio(type, bio);
+ ret = submit_bio_wait(type, bio);
+ if (ret)
+ break;

/*
* We can loop for a long time in here, if someone does
@@ -128,14 +64,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
*/
cond_resched();
}
- blk_finish_plug(&plug);
-
- /* Wait for bios in-flight */
- if (!atomic_dec_and_test(&bb.done))
- wait_for_completion_io(&wait);
-
- if (!test_bit(BIO_UPTODATE, &bb.flags))
- ret = -EIO;

return ret;
}
@@ -156,61 +84,37 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask,
struct page *page)
{
- DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
- unsigned int max_write_same_sectors;
- struct bio_batch bb;
struct bio *bio;
int ret = 0;

if (!q)
return -ENXIO;

- max_write_same_sectors = q->limits.max_write_same_sectors;
-
- if (max_write_same_sectors == 0)
+ if (!q->limits.max_write_same_sectors)
return -EOPNOTSUPP;

- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;
-
while (nr_sects) {
bio = bio_alloc(gfp_mask, 1);
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
+ if (!bio)
+ return -ENOMEM;

- bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = bio_batch_end_io;
bio->bi_bdev = bdev;
- bio->bi_private = &bb;
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_iter.bi_size = min_t(sector_t, nr_sects, 1 << 20) << 9;
bio->bi_vcnt = 1;
bio->bi_io_vec->bv_page = page;
bio->bi_io_vec->bv_offset = 0;
bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);

- if (nr_sects > max_write_same_sectors) {
- bio->bi_iter.bi_size = max_write_same_sectors << 9;
- nr_sects -= max_write_same_sectors;
- sector += max_write_same_sectors;
- } else {
- bio->bi_iter.bi_size = nr_sects << 9;
- nr_sects = 0;
- }
+ sector += bio_sectors(bio);
+ nr_sects -= bio_sectors(bio);

- atomic_inc(&bb.done);
- submit_bio(REQ_WRITE | REQ_WRITE_SAME, bio);
+ ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
+ if (ret)
+ break;
}

- /* Wait for bios in-flight */
- if (!atomic_dec_and_test(&bb.done))
- wait_for_completion_io(&wait);
-
- if (!test_bit(BIO_UPTODATE, &bb.flags))
- ret = -ENOTSUPP;
-
return ret;
}
EXPORT_SYMBOL(blkdev_issue_write_same);
@@ -225,33 +129,22 @@ EXPORT_SYMBOL(blkdev_issue_write_same);
* Description:
* Generate and issue number of bios with zerofiled pages.
*/
-
static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask)
{
- int ret;
+ int ret = 0;
struct bio *bio;
- struct bio_batch bb;
unsigned int sz;
- DECLARE_COMPLETION_ONSTACK(wait);
-
- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;

- ret = 0;
- while (nr_sects != 0) {
+ while (nr_sects) {
bio = bio_alloc(gfp_mask,
- min(nr_sects, (sector_t)BIO_MAX_PAGES));
- if (!bio) {
- ret = -ENOMEM;
- break;
- }
+ min(nr_sects / (PAGE_SIZE >> 9),
+ (sector_t)BIO_MAX_PAGES));
+ if (!bio)
+ return -ENOMEM;

bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
- bio->bi_end_io = bio_batch_end_io;
- bio->bi_private = &bb;

while (nr_sects != 0) {
sz = min((sector_t) PAGE_SIZE >> 9 , nr_sects);
@@ -261,18 +154,11 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
if (ret < (sz << 9))
break;
}
- ret = 0;
- atomic_inc(&bb.done);
- submit_bio(WRITE, bio);
- }
-
- /* Wait for bios in-flight */
- if (!atomic_dec_and_test(&bb.done))
- wait_for_completion_io(&wait);

- if (!test_bit(BIO_UPTODATE, &bb.flags))
- /* One of bios in the batch was completed with error.*/
- ret = -EIO;
+ ret = submit_bio_wait(WRITE, bio);
+ if (ret)
+ break;
+ }

return ret;
}
@@ -287,7 +173,6 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
* Description:
* Generate and issue number of bios with zerofiled pages.
*/
-
int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask)
{
--
2.1.0

2014-12-22 11:49:58

by Dongsu Park

Subject: [RFC PATCH 06/17] btrfs: make use of immutable biovecs

From: Kent Overstreet <[email protected]>

Make use of the new API for immutable biovecs instead of iterating over
bi_io_vec[] manually, as was done before. That means, e.g., calling
bio_for_each_segment() with a struct bio_vec and a struct bvec_iter
passed by value, and using bio_advance_iter() to move to the next range
of the biovec.

This is going to be important for future block layer refactoring, and
using the standard primitives makes the code easier to audit.
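
The idea of walking a segment array through a by-value iterator, without
ever touching the array itself, can be modelled in userspace roughly as
follows. This is a simplified sketch, not kernel code: struct seg and
struct seg_iter are hypothetical stand-ins for struct bio_vec and struct
bvec_iter (no pages, no sector), and seg_iter_cur() and
seg_iter_advance() only imitate what bio_iter_iovec() and
bio_advance_iter() do. The sketch assumes non-empty segments and an
iterator whose size never exceeds what the array holds.

```c
#include <assert.h>
#include <stddef.h>

/* Models struct bio_vec: one contiguous piece of data. */
struct seg {
	const char *base;
	unsigned int len;
};

/* Models struct bvec_iter: bytes left, current index, bytes done in it. */
struct seg_iter {
	unsigned int size;
	unsigned int idx;
	unsigned int done;
};

/*
 * Current segment as seen through the iterator: the unconsumed part of
 * v[it.idx], clipped to it.size. Only valid while it.size != 0.
 */
struct seg seg_iter_cur(const struct seg *v, struct seg_iter it)
{
	unsigned int left = v[it.idx].len - it.done;
	struct seg s;

	s.base = v[it.idx].base + it.done;
	s.len = left < it.size ? left : it.size;
	return s;
}

/* Advance the iterator by 'bytes'; the segment array is never modified. */
void seg_iter_advance(const struct seg *v, struct seg_iter *it,
		      unsigned int bytes)
{
	assert(bytes <= it->size);

	while (bytes) {
		unsigned int left = v[it->idx].len - it->done;
		unsigned int step = bytes < left ? bytes : left;

		bytes -= step;
		it->size -= step;
		it->done += step;
		if (it->done == v[it->idx].len) {
			it->idx++;
			it->done = 0;
		}
	}
}

/*
 * Visit every remaining segment, in the spirit of bio_for_each_segment().
 * The iterator is taken by value, so the caller's copy is left alone.
 * Returns the number of segments visited and stores the bytes covered.
 */
unsigned int seg_iter_walk(const struct seg *v, struct seg_iter it,
			   unsigned int *bytes)
{
	unsigned int n = 0;

	*bytes = 0;
	while (it.size) {
		struct seg s = seg_iter_cur(v, it);

		*bytes += s.len;
		n++;
		seg_iter_advance(v, &it, s.len);
	}
	return n;
}
```

Because all position state lives in the iterator, a caller can advance
by an arbitrary byte count (here, a number of checksummed blocks) and
resume in the middle of a segment, which is what the converted
__btrfs_lookup_bio_sums() relies on.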

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: apply this conversion also in check-integrity.c, and add more
description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: [email protected]
---
fs/btrfs/check-integrity.c | 22 ++++++++++-------
fs/btrfs/extent_io.c | 12 ++++++---
fs/btrfs/file-item.c | 61 +++++++++++++++++-----------------------------
fs/btrfs/inode.c | 22 ++++++-----------
4 files changed, 53 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d897ef8..74ce4a2 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2963,6 +2963,9 @@ int btrfsic_submit_bh(int rw, struct buffer_head *bh)
static void __btrfsic_submit_bio(int rw, struct bio *bio)
{
struct btrfsic_dev_state *dev_state;
+ struct bio_vec bvec = { 0 };
+ struct bvec_iter iter = bio->bi_iter;
+ struct page *page;

if (!btrfsic_is_initialized)
return;
@@ -2979,7 +2982,7 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
int bio_is_patched;
char **mapped_datav;

- dev_bytenr = 512 * bio->bi_iter.bi_sector;
+ dev_bytenr = 512 * iter.bi_sector;
bio_is_patched = 0;
if (dev_state->state->print_mask &
BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH)
@@ -2987,7 +2990,7 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
"submit_bio(rw=0x%x, bi_vcnt=%u,"
" bi_sector=%llu (bytenr %llu), bi_bdev=%p)\n",
rw, bio->bi_vcnt,
- (unsigned long long)bio->bi_iter.bi_sector,
+ (unsigned long long)iter.bi_sector,
dev_bytenr, bio->bi_bdev);

mapped_datav = kmalloc(sizeof(*mapped_datav) * bio->bi_vcnt,
@@ -2995,13 +2998,14 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
if (!mapped_datav)
goto leave;
cur_bytenr = dev_bytenr;
- for (i = 0; i < bio->bi_vcnt; i++) {
- BUG_ON(bio->bi_io_vec[i].bv_len != PAGE_CACHE_SIZE);
- mapped_datav[i] = kmap(bio->bi_io_vec[i].bv_page);
+
+ bio_for_each_segment(bvec, bio, iter) {
+ BUG_ON(bvec.bv_len != PAGE_CACHE_SIZE);
+ mapped_datav[i] = kmap(bvec.bv_page);
if (!mapped_datav[i]) {
while (i > 0) {
i--;
- kunmap(bio->bi_io_vec[i].bv_page);
+ kunmap(bvec.bv_page);
}
kfree(mapped_datav);
goto leave;
@@ -3011,8 +3015,8 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
printk(KERN_INFO
"#%u: bytenr=%llu, len=%u, offset=%u\n",
i, cur_bytenr, bio->bi_io_vec[i].bv_len,
- bio->bi_io_vec[i].bv_offset);
- cur_bytenr += bio->bi_io_vec[i].bv_len;
+ bvec.bv_offset);
+ cur_bytenr += bvec.bv_len;
}
btrfsic_process_written_block(dev_state, dev_bytenr,
mapped_datav, bio->bi_vcnt,
@@ -3020,7 +3024,7 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
NULL, rw);
while (i > 0) {
i--;
- kunmap(bio->bi_io_vec[i].bv_page);
+ kunmap(bvec.bv_page);
}
kfree(mapped_datav);
} else if (NULL != dev_state && (rw & REQ_FLUSH)) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ebabd2..038b242 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2749,12 +2749,18 @@ static int __must_check submit_one_bio(int rw, struct bio *bio,
int mirror_num, unsigned long bio_flags)
{
int ret = 0;
- struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
- struct page *page = bvec->bv_page;
+ struct bio_vec bvec = { 0 };
+ struct bvec_iter iter;
+ struct page *page;
struct extent_io_tree *tree = bio->bi_private;
u64 start;

- start = page_offset(page) + bvec->bv_offset;
+ bio_for_each_segment(bvec, bio, iter)
+ if (bio_iter_last(bvec, iter))
+ break;
+
+ page = bvec.bv_page;
+ start = page_offset(page) + bvec.bv_offset;

bio->bi_private = NULL;

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 84a2d18..7816cb8 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -162,7 +162,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
struct inode *inode, struct bio *bio,
u64 logical_offset, u32 *dst, int dio)
{
- struct bio_vec *bvec = bio->bi_io_vec;
+ struct bvec_iter iter = bio->bi_iter;
struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
struct btrfs_csum_item *item = NULL;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
@@ -171,10 +171,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 offset = 0;
u64 item_start_offset = 0;
u64 item_last_offset = 0;
- u64 disk_bytenr;
u32 diff;
int nblocks;
- int bio_index = 0;
int count;
u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);

@@ -204,8 +202,6 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (bio->bi_iter.bi_size > PAGE_CACHE_SIZE * 8)
path->reada = 2;

- WARN_ON(bio->bi_vcnt <= 0);
-
/*
* the free space stuff is only read when it hasn't been
* updated in the current transaction. So, we can safely
@@ -217,12 +213,13 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
path->skip_locking = 1;
}

- disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
- while (bio_index < bio->bi_vcnt) {
+ while (iter.bi_size) {
+ u64 disk_bytenr = (u64)iter.bi_sector << 9;
+ struct bio_vec bvec = bio_iter_iovec(bio, iter);
if (!dio)
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec.bv_page) + bvec.bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
(u32 *)csum, nblocks);
if (count)
@@ -243,7 +240,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
- offset + bvec->bv_len - 1,
+ offset + bvec.bv_len - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {
btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,12 +278,9 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
found:
csum += count * csum_size;
nblocks -= count;
- bio_index += count;
- while (count--) {
- disk_bytenr += bvec->bv_len;
- offset += bvec->bv_len;
- bvec++;
- }
+ bio_advance_iter(bio, &iter,
+ count << inode->i_sb->s_blocksize_bits);
+ offset += count << inode->i_sb->s_blocksize_bits;
}
btrfs_free_path(path);
return 0;
@@ -429,14 +423,12 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct btrfs_ordered_sum *sums;
struct btrfs_ordered_extent *ordered;
char *data;
- struct bio_vec *bvec = bio->bi_io_vec;
- int bio_index = 0;
+ struct bio_vec bvec;
+ struct bvec_iter iter;
int index;
- unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;

- WARN_ON(bio->bi_vcnt <= 0);
sums = kzalloc(btrfs_ordered_sum_size(root, bio->bi_iter.bi_size),
GFP_NOFS);
if (!sums)
@@ -448,53 +440,46 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (contig)
offset = file_start;
else
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bio_page(bio)) + bio_offset(bio);

ordered = btrfs_lookup_ordered_extent(inode, offset);
BUG_ON(!ordered); /* Logic error */
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;

- while (bio_index < bio->bi_vcnt) {
+ bio_for_each_segment(bvec, bio, iter) {
if (!contig)
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec.bv_page) + bvec.bv_offset;

if (offset >= ordered->file_offset + ordered->len ||
offset < ordered->file_offset) {
- unsigned long bytes_left;
sums->len = this_sum_bytes;
this_sum_bytes = 0;
btrfs_add_ordered_sum(inode, ordered, sums);
btrfs_put_ordered_extent(ordered);

- bytes_left = bio->bi_iter.bi_size - total_bytes;
-
- sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
- GFP_NOFS);
+ sums = kzalloc(btrfs_ordered_sum_size(root,
+ iter.bi_size), GFP_NOFS);
BUG_ON(!sums); /* -ENOMEM */
- sums->len = bytes_left;
+ sums->len = iter.bi_size;
ordered = btrfs_lookup_ordered_extent(inode, offset);
BUG_ON(!ordered); /* Logic error */
- sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
- total_bytes;
+ sums->bytenr = ((u64)iter.bi_sector) << 9;
index = 0;
}

- data = kmap_atomic(bvec->bv_page);
+ data = kmap_atomic(bvec.bv_page);
sums->sums[index] = ~(u32)0;
- sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
+ sums->sums[index] = btrfs_csum_data(data + bvec.bv_offset,
sums->sums[index],
- bvec->bv_len);
+ bvec.bv_len);
kunmap_atomic(data);
btrfs_csum_final(sums->sums[index],
(char *)(sums->sums + index));

- bio_index++;
index++;
- total_bytes += bvec->bv_len;
- this_sum_bytes += bvec->bv_len;
- offset += bvec->bv_len;
- bvec++;
+ offset += bvec.bv_len;
+ this_sum_bytes += bvec.bv_len;
}
this_sum_bytes = 0;
btrfs_add_ordered_sum(inode, ordered, sums);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e687bb0..9c513d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7784,12 +7784,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct bio *bio;
struct bio *orig_bio = dip->orig_bio;
- struct bio_vec *bvec = orig_bio->bi_io_vec;
+ struct bio_vec bvec;
+ struct bvec_iter iter;
u64 start_sector = orig_bio->bi_iter.bi_sector;
u64 file_offset = dip->logical_offset;
- u64 submit_len = 0;
u64 map_length;
- int nr_pages = 0;
int ret;
int async_submit = 0;

@@ -7821,10 +7820,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
btrfs_io_bio(bio)->logical = file_offset;
atomic_inc(&dip->pending_bios);

- while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
- if (map_length < submit_len + bvec->bv_len ||
- bio_add_page(bio, bvec->bv_page, bvec->bv_len,
- bvec->bv_offset) < bvec->bv_len) {
+ bio_for_each_segment(bvec, orig_bio, iter) {
+ if (map_length < bio->bi_iter.bi_size + bvec.bv_len ||
+ bio_add_page(bio, bvec.bv_page, bvec.bv_len,
+ bvec.bv_offset) < bvec.bv_len) {
+ unsigned submit_len = bio->bi_iter.bi_size;
+
/*
* inc the count before we submit the bio so
* we know the end IO handler won't happen before
@@ -7844,9 +7845,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
start_sector += submit_len >> 9;
file_offset += submit_len;

- submit_len = 0;
- nr_pages = 0;
-
bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
start_sector, GFP_NOFS);
if (!bio)
@@ -7863,10 +7861,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
bio_put(bio);
goto out_err;
}
- } else {
- submit_len += bvec->bv_len;
- nr_pages++;
- bvec++;
}
}

--
2.1.0

2014-12-22 11:50:03

by Dongsu Park

Subject: [RFC PATCH 07/17] block: replace sg_iovec with iov_iter

From: Kent Overstreet <[email protected]>

Make use of the new interface provided by iov_iter, which is backed by a
scatter-gather list of iovecs, instead of the old interface based on
sg_iovec. Also use iov_iter_advance() instead of manual iteration.

This commit should contain only literal replacements, without
functional changes.
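
To illustrate what replacing the index/offset bookkeeping with an
iterator buys, here is a small userspace model of the copy loop. It is a
sketch under stated assumptions, not the kernel implementation: struct
uiter and the uiter_*() helpers are hypothetical and only mimic the
roles of struct iov_iter, iov_iter_iovec() and iov_iter_advance();
struct iovec is the ordinary POSIX one from <sys/uio.h>, and memcpy()
stands in for copy_to_user().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical userspace model of an iov_iter (three fields only). */
struct uiter {
	const struct iovec *iov;	/* current segment */
	size_t iov_offset;		/* bytes already consumed in *iov */
	size_t count;			/* total bytes left to transfer */
};

/* 'count' may be smaller than the sum of the segment lengths. */
void uiter_init(struct uiter *it, const struct iovec *iov, size_t count)
{
	it->iov = iov;
	it->iov_offset = 0;
	it->count = count;
}

/* Unconsumed part of the current segment, clipped to 'count'. */
struct iovec uiter_iovec(const struct uiter *it)
{
	size_t left = it->iov->iov_len - it->iov_offset;
	struct iovec v;

	v.iov_base = (char *)it->iov->iov_base + it->iov_offset;
	v.iov_len = left < it->count ? left : it->count;
	return v;
}

/* Consume 'bytes' and step over segments that are fully used up. */
void uiter_advance(struct uiter *it, size_t bytes)
{
	assert(bytes <= it->count);

	it->count -= bytes;
	it->iov_offset += bytes;
	while (it->count && it->iov_offset >= it->iov->iov_len) {
		it->iov_offset -= it->iov->iov_len;
		it->iov++;
	}
}

/*
 * Copy up to 'len' bytes from 'src' into the segments described by the
 * iterator. Returns the number of bytes actually copied, which is the
 * smaller of 'len' and the iterator's remaining count.
 */
size_t copy_to_uiter(struct uiter *it, const void *src, size_t len)
{
	const char *p = src;
	size_t copied = 0;

	while (len && it->count) {
		struct iovec v = uiter_iovec(it);
		size_t bytes = len < v.iov_len ? len : v.iov_len;

		memcpy(v.iov_base, p, bytes);
		p += bytes;
		len -= bytes;
		copied += bytes;
		uiter_advance(it, bytes);
	}
	return copied;
}
```

Initialising the iterator with a count shorter than the segments add up
to has the same effect as shortening the vector by hand: the copy stops
after count bytes. That is how the sg_io() and sg_start_req() hunks
below can drop their iov_shorten() calls and pass the smaller of the two
lengths to iov_iter_init() instead.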

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
---
block/bio.c | 114 +++++++++++++++++++++++++------------------------
block/blk-map.c | 27 ++++++------
block/scsi_ioctl.c | 19 +++------
drivers/scsi/sg.c | 15 +++----
include/linux/bio.h | 8 ++--
include/linux/blkdev.h | 4 +-
6 files changed, 90 insertions(+), 97 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 955bc57..4731c4a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -997,18 +997,17 @@ void bio_copy_data(struct bio *dst, struct bio *src)
EXPORT_SYMBOL(bio_copy_data);

struct bio_map_data {
- int nr_sgvecs;
int is_our_pages;
- struct sg_iovec sgvecs[];
+ struct iov_iter iter;
+ struct iovec sgvecs[];
};

static void bio_set_map_data(struct bio_map_data *bmd, struct bio *bio,
- const struct sg_iovec *iov, int iov_count,
- int is_our_pages)
+ const struct iov_iter *iter, int is_our_pages)
{
- memcpy(bmd->sgvecs, iov, sizeof(struct sg_iovec) * iov_count);
- bmd->nr_sgvecs = iov_count;
bmd->is_our_pages = is_our_pages;
+ bmd->iter = *iter;
+ memcpy(bmd->sgvecs, iter->iov, sizeof(struct iovec) * iter->nr_segs);
bio->bi_private = bmd;
}

@@ -1022,33 +1021,30 @@ static struct bio_map_data *bio_alloc_map_data(unsigned int iov_count,
sizeof(struct sg_iovec) * iov_count, gfp_mask);
}

-static int __bio_copy_iov(struct bio *bio, const struct sg_iovec *iov, int iov_count,
+static int __bio_copy_iov(struct bio *bio, const struct iov_iter *iter,
int to_user, int from_user, int do_free_page)
{
int ret = 0, i;
struct bio_vec *bvec;
- int iov_idx = 0;
- unsigned int iov_off = 0;
+ struct iov_iter iov_iter = *iter;

bio_for_each_segment_all(bvec, bio, i) {
char *bv_addr = page_address(bvec->bv_page);
unsigned int bv_len = bvec->bv_len;

- while (bv_len && iov_idx < iov_count) {
- unsigned int bytes;
- char __user *iov_addr;
-
- bytes = min_t(unsigned int,
- iov[iov_idx].iov_len - iov_off, bv_len);
- iov_addr = iov[iov_idx].iov_base + iov_off;
+ while (bv_len && iov_iter.count) {
+ struct iovec iov = iov_iter_iovec(&iov_iter);
+ unsigned int bytes = min_t(unsigned int, bv_len,
+ iov.iov_len);

if (!ret) {
if (to_user)
- ret = copy_to_user(iov_addr, bv_addr,
- bytes);
+ ret = copy_to_user(iov.iov_base,
+ bv_addr, bytes);

if (from_user)
- ret = copy_from_user(bv_addr, iov_addr,
+ ret = copy_from_user(bv_addr,
+ iov.iov_base,
bytes);

if (ret)
@@ -1057,13 +1053,7 @@ static int __bio_copy_iov(struct bio *bio, const struct sg_iovec *iov, int iov_c

bv_len -= bytes;
bv_addr += bytes;
- iov_addr += bytes;
- iov_off += bytes;
-
- if (iov[iov_idx].iov_len == iov_off) {
- iov_idx++;
- iov_off = 0;
- }
+ iov_iter_advance(&iov_iter, bytes);
}

if (do_free_page)
@@ -1092,7 +1082,7 @@ int bio_uncopy_user(struct bio *bio)
* don't copy into a random user address space, just free.
*/
if (current->mm)
- ret = __bio_copy_iov(bio, bmd->sgvecs, bmd->nr_sgvecs,
+ ret = __bio_copy_iov(bio, &bmd->iter,
bio_data_dir(bio) == READ,
0, bmd->is_our_pages);
else if (bmd->is_our_pages)
@@ -1120,7 +1110,7 @@ EXPORT_SYMBOL(bio_uncopy_user);
*/
struct bio *bio_copy_user_iov(struct request_queue *q,
struct rq_map_data *map_data,
- const struct sg_iovec *iov, int iov_count,
+ const struct iov_iter *iter,
int write_to_vm, gfp_t gfp_mask)
{
struct bio_map_data *bmd;
@@ -1129,16 +1119,17 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
struct bio *bio;
int i, ret;
int nr_pages = 0;
- unsigned int len = 0;
+ unsigned int len;
unsigned int offset = map_data ? map_data->offset & ~PAGE_MASK : 0;

- for (i = 0; i < iov_count; i++) {
+ for (i = 0; i < iter->nr_segs; i++) {
unsigned long uaddr;
unsigned long end;
unsigned long start;

- uaddr = (unsigned long)iov[i].iov_base;
- end = (uaddr + iov[i].iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ uaddr = (unsigned long) iter->iov[i].iov_base;
+ end = (uaddr + iter->iov[i].iov_len + PAGE_SIZE - 1)
+ >> PAGE_SHIFT;
start = uaddr >> PAGE_SHIFT;

/*
@@ -1148,13 +1139,12 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
return ERR_PTR(-EINVAL);

nr_pages += end - start;
- len += iov[i].iov_len;
}

if (offset)
nr_pages++;

- bmd = bio_alloc_map_data(iov_count, gfp_mask);
+ bmd = bio_alloc_map_data(iter->nr_segs, gfp_mask);
if (!bmd)
return ERR_PTR(-ENOMEM);

@@ -1171,7 +1161,12 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
if (map_data) {
nr_pages = 1 << map_data->page_order;
i = map_data->offset / PAGE_SIZE;
+ } else {
+ i = 0;
}
+
+ len = iter->count;
+
while (len) {
unsigned int bytes = PAGE_SIZE;

@@ -1213,12 +1208,12 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
*/
if ((!write_to_vm && (!map_data || !map_data->null_mapped)) ||
(map_data && map_data->from_user)) {
- ret = __bio_copy_iov(bio, iov, iov_count, 0, 1, 0);
+ ret = __bio_copy_iov(bio, iter, 0, 1, 0);
if (ret)
goto cleanup;
}

- bio_set_map_data(bmd, bio, iov, iov_count, map_data ? 0 : 1);
+ bio_set_map_data(bmd, bio, iter, map_data ? 0 : 1);
return bio;
cleanup:
if (!map_data)
@@ -1248,30 +1243,35 @@ struct bio *bio_copy_user(struct request_queue *q, struct rq_map_data *map_data,
unsigned long uaddr, unsigned int len,
int write_to_vm, gfp_t gfp_mask)
{
- struct sg_iovec iov;
+ struct iovec iov;
+ struct iov_iter i;

- iov.iov_base = (void __user *)uaddr;
+ iov.iov_base = (void __user *) uaddr;
iov.iov_len = len;

- return bio_copy_user_iov(q, map_data, &iov, 1, write_to_vm, gfp_mask);
+ iov_iter_init(&i, write_to_vm ? WRITE : READ, &iov, 1, len);
+
+ return bio_copy_user_iov(q, map_data, &i, write_to_vm, gfp_mask);
}
EXPORT_SYMBOL(bio_copy_user);

static struct bio *__bio_map_user_iov(struct request_queue *q,
struct block_device *bdev,
- const struct sg_iovec *iov, int iov_count,
+ const struct iov_iter *iter,
int write_to_vm, gfp_t gfp_mask)
{
- int i, j;
+ int j;
int nr_pages = 0;
struct page **pages;
struct bio *bio;
int cur_page = 0;
int ret, offset;
+ struct iov_iter i;
+ struct iovec iov;

- for (i = 0; i < iov_count; i++) {
- unsigned long uaddr = (unsigned long)iov[i].iov_base;
- unsigned long len = iov[i].iov_len;
+ iov_for_each(iov, i, *iter) {
+ unsigned long uaddr = (unsigned long) iov.iov_base;
+ unsigned long len = iov.iov_len;
unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
unsigned long start = uaddr >> PAGE_SHIFT;

@@ -1301,9 +1301,9 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
if (!pages)
goto out;

- for (i = 0; i < iov_count; i++) {
- unsigned long uaddr = (unsigned long)iov[i].iov_base;
- unsigned long len = iov[i].iov_len;
+ iov_for_each(iov, i, *iter) {
+ unsigned long uaddr = (unsigned long) iov.iov_base;
+ unsigned long len = iov.iov_len;
unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
unsigned long start = uaddr >> PAGE_SHIFT;
const int local_nr_pages = end - start;
@@ -1358,10 +1358,10 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
return bio;

out_unmap:
- for (i = 0; i < nr_pages; i++) {
- if(!pages[i])
+ for (j = 0; j < nr_pages; j++) {
+ if (!pages[j])
break;
- page_cache_release(pages[i]);
+ page_cache_release(pages[j]);
}
out:
kfree(pages);
@@ -1385,12 +1385,15 @@ struct bio *bio_map_user(struct request_queue *q, struct block_device *bdev,
unsigned long uaddr, unsigned int len, int write_to_vm,
gfp_t gfp_mask)
{
- struct sg_iovec iov;
+ struct iovec iov;
+ struct iov_iter i;

- iov.iov_base = (void __user *)uaddr;
+ iov.iov_base = (void __user *) uaddr;
iov.iov_len = len;

- return bio_map_user_iov(q, bdev, &iov, 1, write_to_vm, gfp_mask);
+ iov_iter_init(&i, write_to_vm ? WRITE : READ, &iov, 1, len);
+
+ return bio_map_user_iov(q, bdev, &i, write_to_vm, gfp_mask);
}
EXPORT_SYMBOL(bio_map_user);

@@ -1407,13 +1410,12 @@ EXPORT_SYMBOL(bio_map_user);
* device. Returns an error pointer in case of error.
*/
struct bio *bio_map_user_iov(struct request_queue *q, struct block_device *bdev,
- const struct sg_iovec *iov, int iov_count,
+ const struct iov_iter *iter,
int write_to_vm, gfp_t gfp_mask)
{
struct bio *bio;

- bio = __bio_map_user_iov(q, bdev, iov, iov_count, write_to_vm,
- gfp_mask);
+ bio = __bio_map_user_iov(q, bdev, iter, write_to_vm, gfp_mask);
if (IS_ERR(bio))
return bio;

diff --git a/block/blk-map.c b/block/blk-map.c
index f890d43..496af28 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -5,7 +5,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
-#include <scsi/sg.h> /* for struct sg_iovec */
+#include <linux/uio.h>

#include "blk.h"

@@ -187,20 +187,22 @@ EXPORT_SYMBOL(blk_rq_map_user);
* unmapping.
*/
int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
- struct rq_map_data *map_data, const struct sg_iovec *iov,
- int iov_count, unsigned int len, gfp_t gfp_mask)
+ struct rq_map_data *map_data,
+ const struct iov_iter *iter, gfp_t gfp_mask)
{
struct bio *bio;
- int i, read = rq_data_dir(rq) == READ;
+ int read = rq_data_dir(rq) == READ;
int unaligned = 0;
+ struct iov_iter i;
+ struct iovec iov;

- if (!iov || iov_count <= 0)
+ if (!iter || !iter->count)
return -EINVAL;

- for (i = 0; i < iov_count; i++) {
- unsigned long uaddr = (unsigned long)iov[i].iov_base;
+ iov_for_each(iov, i, *iter) {
+ unsigned long uaddr = (unsigned long) iov.iov_base;

- if (!iov[i].iov_len)
+ if (!iov.iov_len)
return -EINVAL;

/*
@@ -210,16 +212,15 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
unaligned = 1;
}

- if (unaligned || (q->dma_pad_mask & len) || map_data)
- bio = bio_copy_user_iov(q, map_data, iov, iov_count, read,
- gfp_mask);
+ if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
+ bio = bio_copy_user_iov(q, map_data, iter, read, gfp_mask);
else
- bio = bio_map_user_iov(q, NULL, iov, iov_count, read, gfp_mask);
+ bio = bio_map_user_iov(q, NULL, iter, read, gfp_mask);

if (IS_ERR(bio))
return PTR_ERR(bio);

- if (bio->bi_iter.bi_size != len) {
+ if (bio->bi_iter.bi_size != iter->count) {
/*
* Grab an extra reference to this bio, as bio_unmap_user()
* expects to be able to drop it twice as it happens on the
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 28163fa..8c6652b 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -332,7 +332,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,

ret = 0;
if (hdr->iovec_count) {
- size_t iov_data_len;
+ struct iov_iter i;
struct iovec *iov = NULL;

ret = rw_copy_check_uvector(-1, hdr->dxferp, hdr->iovec_count,
@@ -342,20 +342,13 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
goto out_free_cdb;
}

- iov_data_len = ret;
- ret = 0;
-
/* SG_IO howto says that the shorter of the two wins */
- if (hdr->dxfer_len < iov_data_len) {
- hdr->iovec_count = iov_shorten(iov,
- hdr->iovec_count,
- hdr->dxfer_len);
- iov_data_len = hdr->dxfer_len;
- }
+ iov_iter_init(&i,
+ rq_data_dir(rq) == READ ? WRITE : READ,
+ iov, hdr->iovec_count,
+ min_t(unsigned, ret, hdr->dxfer_len));

- ret = blk_rq_map_user_iov(q, rq, NULL, (struct sg_iovec *) iov,
- hdr->iovec_count,
- iov_data_len, GFP_KERNEL);
+ ret = blk_rq_map_user_iov(q, rq, NULL, &i, GFP_KERNEL);
kfree(iov);
} else if (hdr->dxfer_len)
ret = blk_rq_map_user(q, rq, NULL, hdr->dxferp, hdr->dxfer_len,
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index b14f64c..3ce5cad 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -1734,22 +1734,19 @@ sg_start_req(Sg_request *srp, unsigned char *cmd)
}

if (iov_count) {
- int len, size = sizeof(struct sg_iovec) * iov_count;
+ int size = sizeof(struct sg_iovec) * iov_count;
struct iovec *iov;
+ struct iov_iter i;

iov = memdup_user(hp->dxferp, size);
if (IS_ERR(iov))
return PTR_ERR(iov);

- len = iov_length(iov, iov_count);
- if (hp->dxfer_len < len) {
- iov_count = iov_shorten(iov, iov_count, hp->dxfer_len);
- len = hp->dxfer_len;
- }
+ iov_iter_init(&i, rw, iov, iov_count,
+ min_t(size_t, hp->dxfer_len,
+ iov_length(iov, iov_count)));

- res = blk_rq_map_user_iov(q, rq, md, (struct sg_iovec *)iov,
- iov_count,
- len, GFP_ATOMIC);
+ res = blk_rq_map_user_iov(q, rq, md, &i, GFP_ATOMIC);
kfree(iov);
} else
res = blk_rq_map_user(q, rq, md, hp->dxferp,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index efead0b..a69f7b1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -430,11 +430,11 @@ extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
extern int bio_get_nr_vecs(struct block_device *);
extern struct bio *bio_map_user(struct request_queue *, struct block_device *,
unsigned long, unsigned int, int, gfp_t);
-struct sg_iovec;
+struct iov_iter;
struct rq_map_data;
extern struct bio *bio_map_user_iov(struct request_queue *,
struct block_device *,
- const struct sg_iovec *, int, int, gfp_t);
+ const struct iov_iter *, int, gfp_t);
extern void bio_unmap_user(struct bio *);
extern struct bio *bio_map_kern(struct request_queue *, void *, unsigned int,
gfp_t);
@@ -466,8 +466,8 @@ extern struct bio *bio_copy_user(struct request_queue *, struct rq_map_data *,
unsigned long, unsigned int, int, gfp_t);
extern struct bio *bio_copy_user_iov(struct request_queue *,
struct rq_map_data *,
- const struct sg_iovec *,
- int, int, gfp_t);
+ const struct iov_iter *,
+ int, gfp_t);
extern int bio_uncopy_user(struct bio *);
void zero_fill_bio(struct bio *bio);
extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 191ee4b..c03e37a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -853,8 +853,8 @@ extern int blk_rq_map_user(struct request_queue *, struct request *,
extern int blk_rq_unmap_user(struct bio *);
extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, unsigned int, gfp_t);
extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
- struct rq_map_data *, const struct sg_iovec *,
- int, unsigned int, gfp_t);
+ struct rq_map_data *, const struct iov_iter *,
+ gfp_t);
extern int blk_execute_rq(struct request_queue *, struct gendisk *,
struct request *, int);
extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
--
2.1.0

2014-12-22 11:50:14

by Dongsu Park

Subject: [RFC PATCH 09/17] block: refactor iov_count_pages() from bio_{copy,map}_user_iov()

From: Kent Overstreet <[email protected]>

Factor the page-counting loop shared by bio_copy_user_iov() and
__bio_map_user_iov() out into iov_count_pages(), a new helper in the
general iov_iter API, instead of open-coding the iovec iteration twice.

This commit should contain only literal replacements, without
functional changes.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: "Hans J. Koch" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Al Viro <[email protected]>
---
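Not part of the patch: a minimal userspace sketch of the page counting
that iov_count_pages() performs, for readers who want to try the
arithmetic outside the kernel. The names sketch_seg, sketch_count_pages
and SKETCH_PAGE_SIZE are hypothetical, and the 4096-byte page size is an
assumption. As in the patch, the alignment argument is treated as a mask
(queue_dma_alignment() returns one), and it is applied to both the
address and the length of each segment.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* Hypothetical stand-in for struct iovec: an address and a length. */
struct sketch_seg {
	uintptr_t base;
	size_t len;
};

/*
 * Count the pages spanned by each segment. A segment is rejected when
 * its address or length has any bit of align_mask set, or when its end
 * address wraps around.
 */
long sketch_count_pages(const struct sketch_seg *segs, size_t nr_segs,
			uintptr_t align_mask)
{
	long nr_pages = 0;
	size_t s;

	for (s = 0; s < nr_segs; s++) {
		uintptr_t uaddr = segs[s].base;
		size_t len = segs[s].len;
		size_t in_page = uaddr & (SKETCH_PAGE_SIZE - 1);

		if ((uaddr & align_mask) || (len & align_mask))
			return -EINVAL;

		/* Overflow, abort */
		if (uaddr + len < uaddr)
			return -EINVAL;

		/* Same rounding as DIV_ROUND_UP(len + offset, PAGE_SIZE) */
		nr_pages += (len + in_page + SKETCH_PAGE_SIZE - 1) /
			    SKETCH_PAGE_SIZE;
	}

	return nr_pages;
}
```

An 8-byte segment starting 4 bytes before a page boundary counts as two
pages here, which is the case the per-segment rounding exists for.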
block/bio.c | 43 ++++++-------------------------------------
include/linux/uio.h | 2 ++
lib/iovec.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 524b401..2adb68b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1110,24 +1110,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
unsigned int len;
unsigned int offset = map_data ? map_data->offset & ~PAGE_MASK : 0;

- for (i = 0; i < iter->nr_segs; i++) {
- unsigned long uaddr;
- unsigned long end;
- unsigned long start;
-
- uaddr = (unsigned long) iter->iov[i].iov_base;
- end = (uaddr + iter->iov[i].iov_len + PAGE_SIZE - 1)
- >> PAGE_SHIFT;
- start = uaddr >> PAGE_SHIFT;
-
- /*
- * Overflow, abort
- */
- if (end < start)
- return ERR_PTR(-EINVAL);
-
- nr_pages += end - start;
- }
+ nr_pages = iov_count_pages(iter, 0);
+ if (nr_pages < 0)
+ return ERR_PTR(nr_pages);

if (offset)
nr_pages++;
@@ -1257,25 +1242,9 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
struct iov_iter i;
struct iovec iov;

- iov_for_each(iov, i, *iter) {
- unsigned long uaddr = (unsigned long) iov.iov_base;
- unsigned long len = iov.iov_len;
- unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned long start = uaddr >> PAGE_SHIFT;
-
- /*
- * Overflow, abort
- */
- if (end < start)
- return ERR_PTR(-EINVAL);
-
- nr_pages += end - start;
- /*
- * buffer must be aligned to at least hardsector size for now
- */
- if (uaddr & queue_dma_alignment(q))
- return ERR_PTR(-EINVAL);
- }
+ nr_pages = iov_count_pages(iter, queue_dma_alignment(q));
+ if (nr_pages < 0)
+ return ERR_PTR(nr_pages);

if (!nr_pages)
return ERR_PTR(-EINVAL);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1c5e453..142ff1b 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -72,6 +72,8 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)

unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to);

+int iov_count_pages(const struct iov_iter *iter, unsigned align);
+
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
diff --git a/lib/iovec.c b/lib/iovec.c
index 2d99cb4..2e75086 100644
--- a/lib/iovec.c
+++ b/lib/iovec.c
@@ -1,5 +1,7 @@
#include <linux/uaccess.h>
#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
#include <linux/uio.h>

/*
@@ -85,3 +87,31 @@ int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
return 0;
}
EXPORT_SYMBOL(memcpy_fromiovecend);
+
+int iov_count_pages(const struct iov_iter *iter, unsigned align)
+{
+ struct iov_iter i = *iter;
+ int nr_pages = 0;
+
+ while (iov_iter_count(&i)) {
+ unsigned long uaddr = (unsigned long) i.iov->iov_base +
+ i.iov_offset;
+ unsigned long len = i.iov->iov_len - i.iov_offset;
+
+ if ((uaddr & align) || (len & align))
+ return -EINVAL;
+
+ /*
+ * Overflow, abort
+ */
+ if (uaddr + len < uaddr)
+ return -EINVAL;
+
+ nr_pages += DIV_ROUND_UP(len + offset_in_page(uaddr),
+ PAGE_SIZE);
+ iov_iter_advance(&i, len);
+ }
+
+ return nr_pages;
+}
+EXPORT_SYMBOL(iov_count_pages);
--
2.1.0

2014-12-22 11:50:10

by Dongsu Park

Subject: [RFC PATCH 08/17] block: refactor __bio_copy_iov()

From: Kent Overstreet <[email protected]>

Rewrite __bio_copy_iov() to use copy_page_to_iter() and
copy_page_from_iter(), as suggested by Al Viro.

This commit should contain only literal replacements, without
functional changes.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Al Viro <[email protected]>
---
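Not part of the patch: a rough userspace model of what a
copy_page_to_iter()-style helper does for the caller, namely copy one
chunk into a multi-segment destination and advance the cursor so the
next call continues where this one stopped. All names (sketch_vec,
sketch_iter, sketch_copy_to_iter) are hypothetical, plain memcpy()
stands in for the user-space copy, and faults are not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical destination segment, loosely modelled on struct iovec. */
struct sketch_vec {
	char *base;
	size_t len;
};

/* Hypothetical cursor, loosely modelled on struct iov_iter. */
struct sketch_iter {
	const struct sketch_vec *vec;	/* current destination segment */
	size_t nr_segs;			/* segments left, including current */
	size_t offset;			/* bytes already used in current segment */
	size_t count;			/* total bytes left in the iterator */
};

/*
 * Copy up to len bytes from src into the iterator and advance it.
 * Returns the number of bytes copied; the result is short only when
 * the iterator runs out of room.
 */
size_t sketch_copy_to_iter(const char *src, size_t len, struct sketch_iter *it)
{
	size_t done = 0;

	while (done < len && it->count) {
		size_t room = it->vec->len - it->offset;
		size_t n;

		if (!room) {
			/* Current segment is full: move to the next one. */
			if (it->nr_segs <= 1)
				break;
			it->vec++;
			it->nr_segs--;
			it->offset = 0;
			continue;
		}

		n = len - done;
		if (n > room)
			n = room;
		if (n > it->count)
			n = it->count;

		memcpy(it->vec->base + it->offset, src + done, n);
		done += n;
		it->offset += n;
		it->count -= n;
	}

	return done;
}
```

Because the cursor carries its own position, a caller that walks a list
of page-sized chunks only has to compare each return value with the
chunk length, which is the shape of the loop the patch ends up with.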
block/bio.c | 62 +++++++++++++++++++++++++------------------------------------
1 file changed, 25 insertions(+), 37 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4731c4a..524b401 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1022,45 +1022,34 @@ static struct bio_map_data *bio_alloc_map_data(unsigned int iov_count,
}

static int __bio_copy_iov(struct bio *bio, const struct iov_iter *iter,
- int to_user, int from_user, int do_free_page)
+ int to_iov)
{
- int ret = 0, i;
+ int i;
struct bio_vec *bvec;
struct iov_iter iov_iter = *iter;

bio_for_each_segment_all(bvec, bio, i) {
- char *bv_addr = page_address(bvec->bv_page);
- unsigned int bv_len = bvec->bv_len;
-
- while (bv_len && iov_iter.count) {
- struct iovec iov = iov_iter_iovec(&iov_iter);
- unsigned int bytes = min_t(unsigned int, bv_len,
- iov.iov_len);
-
- if (!ret) {
- if (to_user)
- ret = copy_to_user(iov.iov_base,
- bv_addr, bytes);
-
- if (from_user)
- ret = copy_from_user(bv_addr,
- iov.iov_base,
- bytes);
-
- if (ret)
- ret = -EFAULT;
- }
-
- bv_len -= bytes;
- bv_addr += bytes;
- iov_iter_advance(&iov_iter, bytes);
- }
+ ssize_t ret;
+
+ if (to_iov == WRITE)
+ ret = copy_page_to_iter(bvec->bv_page,
+ bvec->bv_offset,
+ bvec->bv_len,
+ &iov_iter);
+ else
+ ret = copy_page_from_iter(bvec->bv_page,
+ bvec->bv_offset,
+ bvec->bv_len,
+ &iov_iter);
+
+ if (!iov_iter_count(&iov_iter))
+ break;

- if (do_free_page)
- __free_page(bvec->bv_page);
+ if (ret < bvec->bv_len)
+ return -EFAULT;
}

- return ret;
+ return 0;
}

/**
@@ -1081,11 +1070,10 @@ int bio_uncopy_user(struct bio *bio)
* if we're in a workqueue, the request is orphaned, so
* don't copy into a random user address space, just free.
*/
- if (current->mm)
- ret = __bio_copy_iov(bio, &bmd->iter,
- bio_data_dir(bio) == READ,
- 0, bmd->is_our_pages);
- else if (bmd->is_our_pages)
+ if (current->mm && bio_data_dir(bio) == READ)
+ ret = __bio_copy_iov(bio, &bmd->iter, WRITE);
+
+ if (bmd->is_our_pages)
bio_for_each_segment_all(bvec, bio, i)
__free_page(bvec->bv_page);
}
@@ -1208,7 +1196,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
*/
if ((!write_to_vm && (!map_data || !map_data->null_mapped)) ||
(map_data && map_data->from_user)) {
- ret = __bio_copy_iov(bio, iter, 0, 1, 0);
+ ret = __bio_copy_iov(bio, iter, READ);
if (ret)
goto cleanup;
}
--
2.1.0

2014-12-22 11:50:30

by Dongsu Park

Subject: [RFC PATCH 10/17] block: refactor bio_get_user_pages() from __bio_map_user_iov()

From: Kent Overstreet <[email protected]>

Split part of the code in __bio_map_user_iov() out into a new function,
bio_get_user_pages(). This helper will be used by upcoming block layer
rework, especially in the direct-IO code.

Note that this relies on the recent change to make
generic_make_request() take arbitrarily sized bios - we're not using
bio_add_page() here.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
---
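Not part of the patch: a small userspace sketch of the offset and length
arithmetic that bio_get_user_pages() applies to one user buffer. Every
pinned page starts out as a full-page entry, then the first entry is
trimmed by the buffer's offset into its page, and the last entry is
trimmed if the pages cover more than the buffer. The names sketch_bvec,
sketch_fill_bvecs and SKETCH_PAGE_SIZE are hypothetical, the page size is
assumed to be 4096 bytes, and no pages are actually pinned.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* Hypothetical stand-in for the offset/length part of struct bio_vec. */
struct sketch_bvec {
	size_t offset;
	size_t len;
};

/*
 * Describe the buffer [uaddr, uaddr + len) as per-page (offset, len)
 * pairs, filling at most max_vecs entries of bv. Stores the number of
 * entries used in *nr_vecs and returns the number of bytes they cover,
 * which is less than len when max_vecs is too small.
 */
size_t sketch_fill_bvecs(uintptr_t uaddr, size_t len,
			 struct sketch_bvec *bv, size_t max_vecs,
			 size_t *nr_vecs)
{
	size_t offset = uaddr & (SKETCH_PAGE_SIZE - 1);
	size_t nr_pages = (len + offset + SKETCH_PAGE_SIZE - 1) /
			  SKETCH_PAGE_SIZE;
	size_t bytes, i;

	if (nr_pages > max_vecs)
		nr_pages = max_vecs;
	*nr_vecs = nr_pages;
	if (!nr_pages)
		return 0;

	/* Start with every page fully covered ... */
	for (i = 0; i < nr_pages; i++) {
		bv[i].offset = 0;
		bv[i].len = SKETCH_PAGE_SIZE;
	}

	/* ... trim the head of the first page ... */
	bv[0].offset += offset;
	bv[0].len -= offset;
	bytes = nr_pages * SKETCH_PAGE_SIZE - offset;

	/* ... and the tail of the last one if the pages overshoot. */
	if (bytes > len) {
		bv[nr_pages - 1].len -= bytes - len;
		bytes = len;
	}

	return bytes;
}
```

For a 5000-byte buffer that starts 256 bytes into a page this yields two
entries, (256, 3840) and (0, 1160). With room for only one entry it
covers 3840 bytes, mirroring the partial-pin case described in the
kernel-doc comment of the new helper.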
block/bio.c | 130 +++++++++++++++++++++++++++-------------------------
include/linux/bio.h | 2 +
2 files changed, 70 insertions(+), 62 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 2adb68b..470e330 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1228,19 +1228,79 @@ struct bio *bio_copy_user(struct request_queue *q, struct rq_map_data *map_data,
}
EXPORT_SYMBOL(bio_copy_user);

+/**
+ * bio_get_user_pages - pin user pages and add them to a biovec
+ * @bio: bio to add pages to
+ * @i: iov_iter describing the user memory to pin; it is advanced past
+ *     every byte that gets added to @bio
+ * @write_to_vm: bool indicating writing to pages or not
+ *
+ * Pins the pages that @i points at and appends them to @bio's bvec array. May
+ * pin only part of the requested pages - @bio need not have room for all the
+ * pages and can already have had pages added to it.
+ *
+ * Returns 0, or a negative errno if pinning failed while @bio was still empty.
+ */
+ssize_t bio_get_user_pages(struct bio *bio, struct iov_iter *i, int write_to_vm)
+{
+ while (bio->bi_vcnt < bio->bi_max_vecs && iov_iter_count(i)) {
+ struct iovec iov = iov_iter_iovec(i);
+ int ret;
+ unsigned nr_pages, bytes;
+ unsigned offset = offset_in_page(iov.iov_base);
+ struct bio_vec *bv;
+ struct page **pages;
+
+ nr_pages = min_t(size_t,
+ DIV_ROUND_UP(iov.iov_len + offset, PAGE_SIZE),
+ bio->bi_max_vecs - bio->bi_vcnt);
+
+ bv = &bio->bi_io_vec[bio->bi_vcnt];
+ pages = (void *) bv;
+
+ ret = get_user_pages_fast((unsigned long) iov.iov_base,
+ nr_pages, write_to_vm, pages);
+ if (ret < 0) {
+ if (bio->bi_vcnt)
+ return 0;
+
+ return ret;
+ }
+
+ bio->bi_vcnt += ret;
+ bytes = ret * PAGE_SIZE - offset;
+
+ while (ret--) {
+ bv[ret].bv_page = pages[ret];
+ bv[ret].bv_len = PAGE_SIZE;
+ bv[ret].bv_offset = 0;
+ }
+
+ bv[0].bv_offset += offset;
+ bv[0].bv_len -= offset;
+
+ if (bytes > iov.iov_len) {
+ bio->bi_io_vec[bio->bi_vcnt - 1].bv_len -=
+ bytes - iov.iov_len;
+ bytes = iov.iov_len;
+ }
+
+ bio->bi_iter.bi_size += bytes;
+ iov_iter_advance(i, bytes);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(bio_get_user_pages);
+
static struct bio *__bio_map_user_iov(struct request_queue *q,
struct block_device *bdev,
const struct iov_iter *iter,
int write_to_vm, gfp_t gfp_mask)
{
- int j;
+ ssize_t ret;
int nr_pages = 0;
- struct page **pages;
struct bio *bio;
- int cur_page = 0;
- int ret, offset;
- struct iov_iter i;
- struct iovec iov;

nr_pages = iov_count_pages(iter, queue_dma_alignment(q));
if (nr_pages < 0)
@@ -1253,57 +1313,10 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
if (!bio)
return ERR_PTR(-ENOMEM);

- ret = -ENOMEM;
- pages = kcalloc(nr_pages, sizeof(struct page *), gfp_mask);
- if (!pages)
+ ret = bio_get_user_pages(bio, (struct iov_iter *)iter, write_to_vm);
+ if (ret < 0)
goto out;

- iov_for_each(iov, i, *iter) {
- unsigned long uaddr = (unsigned long) iov.iov_base;
- unsigned long len = iov.iov_len;
- unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned long start = uaddr >> PAGE_SHIFT;
- const int local_nr_pages = end - start;
- const int page_limit = cur_page + local_nr_pages;
-
- ret = get_user_pages_fast(uaddr, local_nr_pages,
- write_to_vm, &pages[cur_page]);
- if (ret < local_nr_pages) {
- ret = -EFAULT;
- goto out_unmap;
- }
-
- offset = uaddr & ~PAGE_MASK;
- for (j = cur_page; j < page_limit; j++) {
- unsigned int bytes = PAGE_SIZE - offset;
-
- if (len <= 0)
- break;
-
- if (bytes > len)
- bytes = len;
-
- /*
- * sorry...
- */
- if (bio_add_pc_page(q, bio, pages[j], bytes, offset) <
- bytes)
- break;
-
- len -= bytes;
- offset = 0;
- }
-
- cur_page = j;
- /*
- * release the pages we didn't map into the bio, if any
- */
- while (j < page_limit)
- page_cache_release(pages[j++]);
- }
-
- kfree(pages);
-
/*
* set data direction, and check if mapped pages need bouncing
*/
@@ -1314,14 +1327,7 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
bio->bi_flags |= (1 << BIO_USER_MAPPED);
return bio;

- out_unmap:
- for (j = 0; j < nr_pages; j++) {
- if (!pages[j])
- break;
- page_cache_release(pages[j]);
- }
out:
- kfree(pages);
bio_put(bio);
return ERR_PTR(ret);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a69f7b1..c80131a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -428,6 +428,8 @@ extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
extern int bio_get_nr_vecs(struct block_device *);
+struct iov_iter;
+extern ssize_t bio_get_user_pages(struct bio *, struct iov_iter *, int);
extern struct bio *bio_map_user(struct request_queue *, struct block_device *,
unsigned long, unsigned int, int, gfp_t);
struct iov_iter;
--
2.1.0

2014-12-22 11:50:37

by Dongsu Park

Subject: [RFC PATCH 13/17] md/raid5: get rid of bio_fits_rdev()

From: Kent Overstreet <[email protected]>

Remove bio_fits_rdev() completely, because ->merge_bvec_fn() is now
gone. There is no point in calling bio_fits_rdev() only to ensure
aligned reads from the rdev.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: [email protected]
---
drivers/md/raid5.c | 23 +----------------------
1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c1b0d52..40e464c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4218,25 +4218,6 @@ static void raid5_align_endio(struct bio *bi, int error)
add_bio_to_retry(raid_bi, conf);
}

-static int bio_fits_rdev(struct bio *bi)
-{
- struct request_queue *q = bdev_get_queue(bi->bi_bdev);
-
- if (bio_sectors(bi) > queue_max_sectors(q))
- return 0;
- blk_recount_segments(q, bi);
- if (bi->bi_phys_segments > queue_max_segments(q))
- return 0;
-
- if (q->merge_bvec_fn)
- /* it's too hard to apply the merge_bvec_fn at this stage,
- * just just give up
- */
- return 0;
-
- return 1;
-}
-
static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
{
struct r5conf *conf = mddev->private;
@@ -4290,11 +4271,9 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
align_bi->bi_bdev = rdev->bdev;
__clear_bit(BIO_SEG_VALID, &align_bi->bi_flags);

- if (!bio_fits_rdev(align_bi) ||
- is_badblock(rdev, align_bi->bi_iter.bi_sector,
+ if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
bio_sectors(align_bi),
&first_bad, &bad_sectors)) {
- /* too big in some way, or has a known bad block */
bio_put(align_bi);
rdev_dec_pending(rdev, mddev);
return 0;
--
2.1.0

2014-12-22 11:50:52

by Dongsu Park

Subject: [RFC PATCH 14/17] block: kill merge_bvec_fn() completely

From: Kent Overstreet <[email protected]>

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove the callback and all of its users.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Jiri Kosina <[email protected]>
Cc: Yehuda Sadeh <[email protected]>
Cc: Sage Weil <[email protected]>
Cc: Alex Elder <[email protected]>
Cc: [email protected]
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: [email protected]
Cc: Neil Brown <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
---
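Not part of the patch: a minimal userspace sketch of the size check that
remains in blk_bio_segment_split() once the ->merge_bvec_fn() call is
gone. The loop keeps a running sector count and stops at the first
segment that pushes it past the queue's static max_sectors limit. The
function name sketch_split_index and its array-based interface are
hypothetical; the real function also applies the segment-count and
clustering checks that are outside this hunk.

```c
#include <stddef.h>

/*
 * Return the index of the first segment that pushes the running total
 * past max_sectors, or nr_segs if all segments fit. Segment lengths are
 * given in bytes and converted to 512-byte sectors, as in the patch.
 */
size_t sketch_split_index(const unsigned *seg_bytes, size_t nr_segs,
			  unsigned max_sectors)
{
	unsigned sectors = 0;
	size_t i;

	for (i = 0; i < nr_segs; i++) {
		sectors += seg_bytes[i] >> 9;

		if (sectors > max_sectors)
			return i;
	}

	return nr_segs;
}
```

With three 4096-byte segments and a limit of 16 sectors, the third
segment (index 2) is where the limit is exceeded. No per-driver callback
is consulted at any point, which is what lets the stacking drivers below
drop their merge functions.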
block/blk-merge.c | 17 +-----
block/blk-settings.c | 22 --------
drivers/block/drbd/drbd_int.h | 1 -
drivers/block/drbd/drbd_main.c | 1 -
drivers/block/drbd/drbd_req.c | 35 ------------
drivers/block/pktcdvd.c | 21 -------
drivers/block/rbd.c | 47 ----------------
drivers/md/dm-cache-target.c | 21 -------
drivers/md/dm-crypt.c | 16 ------
drivers/md/dm-era-target.c | 15 -----
drivers/md/dm-flakey.c | 16 ------
drivers/md/dm-linear.c | 16 ------
drivers/md/dm-snap.c | 15 -----
drivers/md/dm-stripe.c | 21 -------
drivers/md/dm-table.c | 8 ---
drivers/md/dm-thin.c | 31 -----------
drivers/md/dm-verity.c | 16 ------
drivers/md/dm.c | 120 +---------------------------------------
drivers/md/dm.h | 2 -
drivers/md/linear.c | 46 ----------------
drivers/md/md.c | 2 -
drivers/md/md.h | 8 ---
drivers/md/multipath.c | 21 -------
drivers/md/raid0.c | 57 -------------------
drivers/md/raid0.h | 2 -
drivers/md/raid1.c | 59 +-------------------
drivers/md/raid10.c | 122 +----------------------------------------
drivers/md/raid5.c | 28 ----------
include/linux/blkdev.h | 10 ----
include/linux/device-mapper.h | 4 --
30 files changed, 9 insertions(+), 791 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3bc2068..8cd7a83 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -69,24 +69,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *split;
struct bio_vec bv = { 0 }, bvprv = { 0 };
struct bvec_iter iter;
- unsigned seg_size = 0, nsegs = 0;
+ unsigned seg_size = 0, nsegs = 0, sectors = 0;
int prev = 0;

- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = 0,
- .bi_rw = bio->bi_rw,
- };
-
bio_for_each_segment(bv, bio, iter) {
- if (q->merge_bvec_fn &&
- q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
- goto split;
-
- bvm.bi_size += bv.bv_len;
+ sectors += bv.bv_len >> 9;

- if (bvm.bi_size >> 9 > queue_max_sectors(q))
+ if (sectors > queue_max_sectors(q))
goto split;

if (prev && blk_queue_cluster(q)) {
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6ed2cbe..463a10a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -53,28 +53,6 @@ void blk_queue_unprep_rq(struct request_queue *q, unprep_rq_fn *ufn)
}
EXPORT_SYMBOL(blk_queue_unprep_rq);

-/**
- * blk_queue_merge_bvec - set a merge_bvec function for queue
- * @q: queue
- * @mbfn: merge_bvec_fn
- *
- * Usually queues have static limitations on the max sectors or segments that
- * we can put in a request. Stacking drivers may have some settings that
- * are dynamic, and thus we have to query the queue whether it is ok to
- * add a new bio_vec to a bio at a given offset or not. If the block device
- * has such limitations, it needs to register a merge_bvec_fn to control
- * the size of bio's sent to it. Note that a block device *must* allow a
- * single page to be added to an empty bio. The block device driver may want
- * to use the bio_split() function to deal with these bio's. By default
- * no merge_bvec_fn is defined for a queue, and only the fixed limits are
- * honored.
- */
-void blk_queue_merge_bvec(struct request_queue *q, merge_bvec_fn *mbfn)
-{
- q->merge_bvec_fn = mbfn;
-}
-EXPORT_SYMBOL(blk_queue_merge_bvec);
-
void blk_queue_softirq_done(struct request_queue *q, softirq_done_fn *fn)
{
q->softirq_done_fn = fn;
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b905e98..63ce2b0 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1449,7 +1449,6 @@ extern void do_submit(struct work_struct *ws);
extern void __drbd_make_request(struct drbd_device *, struct bio *, unsigned long);
extern void drbd_make_request(struct request_queue *q, struct bio *bio);
extern int drbd_read_remote(struct drbd_device *device, struct drbd_request *req);
-extern int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec);
extern int is_valid_ar_handle(struct drbd_request *, sector_t);


diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 1fc8342..f49f53e 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2775,7 +2775,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
- blk_queue_merge_bvec(q, drbd_merge_bvec);
q->queue_lock = &resource->req_lock;

device->md_io.page = alloc_page(GFP_KERNEL);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index dee706d..b57d30b 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1509,41 +1509,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
__drbd_make_request(device, bio, start_jif);
}

-/* This is called by bio_add_page().
- *
- * q->max_hw_sectors and other global limits are already enforced there.
- *
- * We need to call down to our lower level device,
- * in case it has special restrictions.
- *
- * We also may need to enforce configured max-bio-bvecs limits.
- *
- * As long as the BIO is empty we have to allow at least one bvec,
- * regardless of size and offset, so no need to ask lower levels.
- */
-int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)
-{
- struct drbd_device *device = (struct drbd_device *) q->queuedata;
- unsigned int bio_size = bvm->bi_size;
- int limit = DRBD_MAX_BIO_SIZE;
- int backing_limit;
-
- if (bio_size && get_ldev(device)) {
- unsigned int max_hw_sectors = queue_max_hw_sectors(q);
- struct request_queue * const b =
- device->ldev->backing_bdev->bd_disk->queue;
- if (b->merge_bvec_fn) {
- bvm->bi_bdev = device->ldev->backing_bdev;
- backing_limit = b->merge_bvec_fn(b, bvm, bvec);
- limit = min(limit, backing_limit);
- }
- put_ldev(device);
- if ((limit >> 9) > max_hw_sectors)
- limit = max_hw_sectors << 9;
- }
- return limit;
-}
-
void request_timer_fn(unsigned long data)
{
struct drbd_device *device = (struct drbd_device *) data;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index ea10bd9..85eac23 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2505,26 +2505,6 @@ end_io:



-static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
- struct bio_vec *bvec)
-{
- struct pktcdvd_device *pd = q->queuedata;
- sector_t zone = get_zone(bmd->bi_sector, pd);
- int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
- int remaining = (pd->settings.size << 9) - used;
- int remaining2;
-
- /*
- * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
- * boundary, pkt_make_request() will split the bio.
- */
- remaining2 = PAGE_SIZE - bmd->bi_size;
- remaining = max(remaining, remaining2);
-
- BUG_ON(remaining < 0);
- return remaining;
-}
-
static void pkt_init_queue(struct pktcdvd_device *pd)
{
struct request_queue *q = pd->disk->queue;
@@ -2532,7 +2512,6 @@ static void pkt_init_queue(struct pktcdvd_device *pd)
blk_queue_make_request(q, pkt_make_request);
blk_queue_logical_block_size(q, CD_FRAMESIZE);
blk_queue_max_hw_sectors(q, PACKET_MAX_SECTORS);
- blk_queue_merge_bvec(q, pkt_merge_bvec);
q->queuedata = pd;
}

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3ec85df..0c0e2d0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3459,52 +3459,6 @@ static void rbd_request_fn(struct request_queue *q)
queue_work(rbd_wq, &rbd_dev->rq_work);
}

-/*
- * a queue callback. Makes sure that we don't create a bio that spans across
- * multiple osd objects. One exception would be with a single page bios,
- * which we handle later at bio_chain_clone_range()
- */
-static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
- struct bio_vec *bvec)
-{
- struct rbd_device *rbd_dev = q->queuedata;
- sector_t sector_offset;
- sector_t sectors_per_obj;
- sector_t obj_sector_offset;
- int ret;
-
- /*
- * Find how far into its rbd object the partition-relative
- * bio start sector is to offset relative to the enclosing
- * device.
- */
- sector_offset = get_start_sect(bmd->bi_bdev) + bmd->bi_sector;
- sectors_per_obj = 1 << (rbd_dev->header.obj_order - SECTOR_SHIFT);
- obj_sector_offset = sector_offset & (sectors_per_obj - 1);
-
- /*
- * Compute the number of bytes from that offset to the end
- * of the object. Account for what's already used by the bio.
- */
- ret = (int) (sectors_per_obj - obj_sector_offset) << SECTOR_SHIFT;
- if (ret > bmd->bi_size)
- ret -= bmd->bi_size;
- else
- ret = 0;
-
- /*
- * Don't send back more than was asked for. And if the bio
- * was empty, let the whole thing through because: "Note
- * that a block device *must* allow a single page to be
- * added to an empty bio."
- */
- rbd_assert(bvec->bv_len <= PAGE_SIZE);
- if (ret > (int) bvec->bv_len || !bmd->bi_size)
- ret = (int) bvec->bv_len;
-
- return ret;
-}
-
static void rbd_free_disk(struct rbd_device *rbd_dev)
{
struct gendisk *disk = rbd_dev->disk;
@@ -3771,7 +3725,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
q->limits.max_discard_sectors = segment_size / SECTOR_SIZE;
q->limits.discard_zeroes_data = 1;

- blk_queue_merge_bvec(q, rbd_merge_bvec);
disk->queue = q;

q->queuedata = rbd_dev;
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 1e96d78..72fd9d3 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -3277,26 +3277,6 @@ static int cache_iterate_devices(struct dm_target *ti,
return r;
}

-/*
- * We assume I/O is going to the origin (which is the volume
- * more likely to have restrictions e.g. by being striped).
- * (Looking up the exact location of the data would be expensive
- * and could always be out of date by the time the bio is submitted.)
- */
-static int cache_bvec_merge(struct dm_target *ti,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct cache *cache = ti->private;
- struct request_queue *q = bdev_get_queue(cache->origin_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = cache->origin_dev->bdev;
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
{
/*
@@ -3340,7 +3320,6 @@ static struct target_type cache_target = {
.status = cache_status,
.message = cache_message,
.iterate_devices = cache_iterate_devices,
- .merge = cache_bvec_merge,
.io_hints = cache_io_hints,
};

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 08981be..723c176 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1951,21 +1951,6 @@ error:
return -EINVAL;
}

-static int crypt_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct crypt_config *cc = ti->private;
- struct request_queue *q = bdev_get_queue(cc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = cc->dev->bdev;
- bvm->bi_sector = cc->start + dm_target_offset(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int crypt_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -1986,7 +1971,6 @@ static struct target_type crypt_target = {
.preresume = crypt_preresume,
.resume = crypt_resume,
.message = crypt_message,
- .merge = crypt_merge,
.iterate_devices = crypt_iterate_devices,
};

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index ad913cd..0119ebf 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1673,20 +1673,6 @@ static int era_iterate_devices(struct dm_target *ti,
return fn(ti, era->origin_dev, 0, get_dev_size(era->origin_dev), data);
}

-static int era_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct era *era = ti->private;
- struct request_queue *q = bdev_get_queue(era->origin_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = era->origin_dev->bdev;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static void era_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct era *era = ti->private;
@@ -1717,7 +1703,6 @@ static struct target_type era_target = {
.status = era_status,
.message = era_message,
.iterate_devices = era_iterate_devices,
- .merge = era_merge,
.io_hints = era_io_hints
};

diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index b257e46..d955b3e 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -387,21 +387,6 @@ static int flakey_ioctl(struct dm_target *ti, unsigned int cmd, unsigned long ar
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

-static int flakey_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct flakey_c *fc = ti->private;
- struct request_queue *q = bdev_get_queue(fc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = fc->dev->bdev;
- bvm->bi_sector = flakey_map_sector(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int flakey_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn, void *data)
{
struct flakey_c *fc = ti->private;
@@ -419,7 +404,6 @@ static struct target_type flakey_target = {
.end_io = flakey_end_io,
.status = flakey_status,
.ioctl = flakey_ioctl,
- .merge = flakey_merge,
.iterate_devices = flakey_iterate_devices,
};

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 53e848c..7dd5fc8 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -130,21 +130,6 @@ static int linear_ioctl(struct dm_target *ti, unsigned int cmd,
return r ? : __blkdev_driver_ioctl(dev->bdev, dev->mode, cmd, arg);
}

-static int linear_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct linear_c *lc = ti->private;
- struct request_queue *q = bdev_get_queue(lc->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = lc->dev->bdev;
- bvm->bi_sector = linear_map_sector(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int linear_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -162,7 +147,6 @@ static struct target_type linear_target = {
.map = linear_map,
.status = linear_status,
.ioctl = linear_ioctl,
- .merge = linear_merge,
.iterate_devices = linear_iterate_devices,
};

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 864b03f..2e6bb7e 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2234,20 +2234,6 @@ static void origin_status(struct dm_target *ti, status_type_t type,
}
}

-static int origin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct dm_origin *o = ti->private;
- struct request_queue *q = bdev_get_queue(o->dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = o->dev->bdev;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int origin_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -2265,7 +2251,6 @@ static struct target_type origin_target = {
.map = origin_map,
.resume = origin_resume,
.status = origin_status,
- .merge = origin_merge,
.iterate_devices = origin_iterate_devices,
};

diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index f8b37d4..09bb2fe 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -412,26 +412,6 @@ static void stripe_io_hints(struct dm_target *ti,
blk_limits_io_opt(limits, chunk_size * sc->stripes);
}

-static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct stripe_c *sc = ti->private;
- sector_t bvm_sector = bvm->bi_sector;
- uint32_t stripe;
- struct request_queue *q;
-
- stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector);
-
- q = bdev_get_queue(sc->stripe[stripe].dev->bdev);
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = sc->stripe[stripe].dev->bdev;
- bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static struct target_type stripe_target = {
.name = "striped",
.version = {1, 5, 1},
@@ -443,7 +423,6 @@ static struct target_type stripe_target = {
.status = stripe_status,
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
- .merge = stripe_merge,
};

int __init dm_stripe_init(void)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3afae9e..6c14cb4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -443,14 +443,6 @@ static int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
q->limits.alignment_offset,
(unsigned long long) start << SECTOR_SHIFT);

- /*
- * Check if merge fn is supported.
- * If not we'll force DM to use PAGE_SIZE or
- * smaller I/O, just to be safe.
- */
- if (dm_queue_merge_is_compulsory(q) && !ti->type->merge)
- blk_limits_max_hw_sectors(limits,
- (unsigned int) (PAGE_SIZE >> 9));
return 0;
}

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 8735543..8e0dd5e 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3548,20 +3548,6 @@ static int pool_iterate_devices(struct dm_target *ti,
return fn(ti, pt->data_dev, 0, ti->len, data);
}

-static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct pool_c *pt = ti->private;
- struct request_queue *q = bdev_get_queue(pt->data_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = pt->data_dev->bdev;
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits)
{
struct pool *pool = pt->pool;
@@ -3653,7 +3639,6 @@ static struct target_type pool_target = {
.resume = pool_resume,
.message = pool_message,
.status = pool_status,
- .merge = pool_merge,
.iterate_devices = pool_iterate_devices,
.io_hints = pool_io_hints,
};
@@ -3979,21 +3964,6 @@ err:
DMEMIT("Error");
}

-static int thin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct thin_c *tc = ti->private;
- struct request_queue *q = bdev_get_queue(tc->pool_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = tc->pool_dev->bdev;
- bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int thin_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -4028,7 +3998,6 @@ static struct target_type thin_target = {
.presuspend = thin_presuspend,
.postsuspend = thin_postsuspend,
.status = thin_status,
- .merge = thin_merge,
.iterate_devices = thin_iterate_devices,
};

diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 7a7bab8..25d76a8 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -564,21 +564,6 @@ static int verity_ioctl(struct dm_target *ti, unsigned cmd,
cmd, arg);
}

-static int verity_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size)
-{
- struct dm_verity *v = ti->private;
- struct request_queue *q = bdev_get_queue(v->data_dev->bdev);
-
- if (!q->merge_bvec_fn)
- return max_size;
-
- bvm->bi_bdev = v->data_dev->bdev;
- bvm->bi_sector = verity_map_sector(v, bvm->bi_sector);
-
- return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
-}
-
static int verity_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -872,7 +857,6 @@ static struct target_type verity_target = {
.map = verity_map,
.status = verity_status,
.ioctl = verity_ioctl,
- .merge = verity_merge,
.iterate_devices = verity_iterate_devices,
.io_hints = verity_io_hints,
};
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ce28a4..7cf0dde 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -116,9 +116,8 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
#define DMF_FREEING 3
#define DMF_DELETING 4
#define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_MERGE_IS_OPTIONAL 6
-#define DMF_DEFERRED_REMOVE 7
-#define DMF_SUSPENDED_INTERNALLY 8
+#define DMF_DEFERRED_REMOVE 6
+#define DMF_SUSPENDED_INTERNALLY 7

/*
* A dummy definition to make RCU happy.
@@ -1586,60 +1585,6 @@ static void __split_and_process_bio(struct mapped_device *md,
* CRUD END
*---------------------------------------------------------------*/

-static int dm_merge_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mapped_device *md = q->queuedata;
- struct dm_table *map = dm_get_live_table_fast(md);
- struct dm_target *ti;
- sector_t max_sectors;
- int max_size = 0;
-
- if (unlikely(!map))
- goto out;
-
- ti = dm_table_find_target(map, bvm->bi_sector);
- if (!dm_target_is_valid(ti))
- goto out;
-
- /*
- * Find maximum amount of I/O that won't need splitting
- */
- max_sectors = min(max_io_len(bvm->bi_sector, ti),
- (sector_t) queue_max_sectors(q));
- max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
- if (unlikely(max_size < 0)) /* this shouldn't _ever_ happen */
- max_size = 0;
-
- /*
- * merge_bvec_fn() returns number of bytes
- * it can accept at this offset
- * max is precomputed maximal io size
- */
- if (max_size && ti->type->merge)
- max_size = ti->type->merge(ti, bvm, biovec, max_size);
- /*
- * If the target doesn't support merge method and some of the devices
- * provided their merge_bvec method (we know this by looking for the
- * max_hw_sectors that dm_set_device_limits may set), then we can't
- * allow bios with multiple vector entries. So always set max_size
- * to 0, and the code below allows just one page.
- */
- else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
- max_size = 0;
-
-out:
- dm_put_live_table_fast(md);
- /*
- * Always allow an entire first page
- */
- if (max_size <= biovec->bv_len && !(bvm->bi_size >> SECTOR_SHIFT))
- max_size = biovec->bv_len;
-
- return max_size;
-}
-
/*
* The request function that just remaps the bio built up by
* dm_merge_bvec.
@@ -2030,7 +1975,6 @@ static void dm_init_md_queue(struct mapped_device *md)
md->queue->backing_dev_info.congested_data = md;
blk_queue_make_request(md->queue, dm_request);
blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
- blk_queue_merge_bvec(md->queue, dm_merge_bvec);
}

/*
@@ -2241,59 +2185,6 @@ static void __set_size(struct mapped_device *md, sector_t size)
}

/*
- * Return 1 if the queue has a compulsory merge_bvec_fn function.
- *
- * If this function returns 0, then the device is either a non-dm
- * device without a merge_bvec_fn, or it is a dm device that is
- * able to split any bios it receives that are too big.
- */
-int dm_queue_merge_is_compulsory(struct request_queue *q)
-{
- struct mapped_device *dev_md;
-
- if (!q->merge_bvec_fn)
- return 0;
-
- if (q->make_request_fn == dm_request) {
- dev_md = q->queuedata;
- if (test_bit(DMF_MERGE_IS_OPTIONAL, &dev_md->flags))
- return 0;
- }
-
- return 1;
-}
-
-static int dm_device_merge_is_compulsory(struct dm_target *ti,
- struct dm_dev *dev, sector_t start,
- sector_t len, void *data)
-{
- struct block_device *bdev = dev->bdev;
- struct request_queue *q = bdev_get_queue(bdev);
-
- return dm_queue_merge_is_compulsory(q);
-}
-
-/*
- * Return 1 if it is acceptable to ignore merge_bvec_fn based
- * on the properties of the underlying devices.
- */
-static int dm_table_merge_is_optional(struct dm_table *table)
-{
- unsigned i = 0;
- struct dm_target *ti;
-
- while (i < dm_table_get_num_targets(table)) {
- ti = dm_table_get_target(table, i++);
-
- if (ti->type->iterate_devices &&
- ti->type->iterate_devices(ti, dm_device_merge_is_compulsory, NULL))
- return 0;
- }
-
- return 1;
-}
-
-/*
* Returns old map, which caller must destroy.
*/
static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
@@ -2302,7 +2193,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct dm_table *old_map;
struct request_queue *q = md->queue;
sector_t size;
- int merge_is_optional;

size = dm_table_get_size(t);

@@ -2328,17 +2218,11 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,

__bind_mempools(md, t);

- merge_is_optional = dm_table_merge_is_optional(t);
-
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, t);
md->immutable_target_type = dm_table_get_immutable_target_type(t);

dm_table_set_restrictions(t, q, limits);
- if (merge_is_optional)
- set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
- else
- clear_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
if (old_map)
dm_sync_table(md);

diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 84b0f9e4..08f47fc 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -76,8 +76,6 @@ bool dm_table_request_based(struct dm_table *t);
void dm_table_free_md_mempools(struct dm_table *t);
struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);

-int dm_queue_merge_is_compulsory(struct request_queue *q);
-
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
void dm_set_md_type(struct mapped_device *md, unsigned type);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 64713b7..d831a5b 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -52,51 +52,6 @@ static inline struct dev_info *which_dev(struct mddev *mddev, sector_t sector)
return conf->disks + lo;
}

-/**
- * linear_mergeable_bvec -- tell bio layer if two requests can be merged
- * @q: request queue
- * @bvm: properties of new bio
- * @biovec: the request that could be merged to it.
- *
- * Return amount of bytes we can take at this offset
- */
-static int linear_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mddev *mddev = q->queuedata;
- struct dev_info *dev0;
- unsigned long maxsectors, bio_sectors = bvm->bi_size >> 9;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int maxbytes = biovec->bv_len;
- struct request_queue *subq;
-
- rcu_read_lock();
- dev0 = which_dev(mddev, sector);
- maxsectors = dev0->end_sector - sector;
- subq = bdev_get_queue(dev0->rdev->bdev);
- if (subq->merge_bvec_fn) {
- bvm->bi_bdev = dev0->rdev->bdev;
- bvm->bi_sector -= dev0->end_sector - dev0->rdev->sectors;
- maxbytes = min(maxbytes, subq->merge_bvec_fn(subq, bvm,
- biovec));
- }
- rcu_read_unlock();
-
- if (maxsectors < bio_sectors)
- maxsectors = 0;
- else
- maxsectors -= bio_sectors;
-
- if (maxsectors <= (PAGE_SIZE >> 9 ) && bio_sectors == 0)
- return maxbytes;
-
- if (maxsectors > (maxbytes >> 9))
- return maxbytes;
- else
- return maxsectors << 9;
-}
-
static int linear_congested(void *data, int bits)
{
struct mddev *mddev = data;
@@ -217,7 +172,6 @@ static int linear_run (struct mddev *mddev)
mddev->private = conf;
md_set_array_sectors(mddev, linear_size(mddev, 0, 0));

- blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec);
mddev->queue->backing_dev_info.congested_fn = linear_congested;
mddev->queue->backing_dev_info.congested_data = mddev;

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 48234eb..0e34b76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5033,7 +5033,6 @@ static void md_clean(struct mddev *mddev)
mddev->changed = 0;
mddev->degraded = 0;
mddev->safemode = 0;
- mddev->merge_check_needed = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
mddev->bitmap_info.default_space = 0;
@@ -5201,7 +5200,6 @@ static int do_md_stop(struct mddev *mddev, int mode,

__md_stop_writes(mddev);
__md_stop(mddev);
- mddev->queue->merge_bvec_fn = NULL;
mddev->queue->backing_dev_info.congested_fn = NULL;

/* tell userspace to handle 'inactive' */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 03cec5b..4932445 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,10 +132,6 @@ enum flag_bits {
Bitmap_sync, /* ..actually, not quite In_sync. Need a
* bitmap-based recovery to get fully in sync
*/
- Unmerged, /* device is being added to array and should
- * be considerred for bvec_merge_fn but not
- * yet for actual IO
- */
WriteMostly, /* Avoid reading if at all possible */
AutoDetected, /* added by auto-detect */
Blocked, /* An error occurred but has not yet
@@ -366,10 +362,6 @@ struct mddev {
int degraded; /* whether md should consider
* adding a spare
*/
- int merge_check_needed; /* at least one
- * member device
- * has a
- * merge_bvec_fn */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 399272f..2f82954 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -261,18 +261,6 @@ static int multipath_add_disk(struct mddev *mddev, struct md_rdev *rdev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);

- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, so limit ->max_segments to one, lying
- * within a single page.
- * (Note: it is very unlikely that a device with
- * merge_bvec_fn will be involved in multipath.)
- */
- if (q->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
-
spin_lock_irq(&conf->device_lock);
mddev->degraded--;
rdev->raid_disk = path;
@@ -436,15 +424,6 @@ static int multipath_run (struct mddev *mddev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);

- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, not that we ever expect a device with
- * a merge_bvec_fn to be involved in multipath */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
-
if (!test_bit(Faulty, &rdev->flags))
working_disks++;
}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ba6b85d..bc4c0b6 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -195,9 +195,6 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
disk_stack_limits(mddev->gendisk, rdev1->bdev,
rdev1->data_offset << 9);

- if (rdev1->bdev->bd_disk->queue->merge_bvec_fn)
- conf->has_merge_bvec = 1;
-
if (!smallest || (rdev1->sectors < smallest->sectors))
smallest = rdev1;
cnt++;
@@ -354,59 +351,6 @@ static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
+ sector_div(sector, zone->nb_dev)];
}

-/**
- * raid0_mergeable_bvec -- tell bio layer if two requests can be merged
- * @q: request queue
- * @bvm: properties of new bio
- * @biovec: the request that could be merged to it.
- *
- * Return amount of bytes we can accept at this offset
- */
-static int raid0_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mddev *mddev = q->queuedata;
- struct r0conf *conf = mddev->private;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- sector_t sector_offset = sector;
- int max;
- unsigned int chunk_sectors = mddev->chunk_sectors;
- unsigned int bio_sectors = bvm->bi_size >> 9;
- struct strip_zone *zone;
- struct md_rdev *rdev;
- struct request_queue *subq;
-
- if (is_power_of_2(chunk_sectors))
- max = (chunk_sectors - ((sector & (chunk_sectors-1))
- + bio_sectors)) << 9;
- else
- max = (chunk_sectors - (sector_div(sector, chunk_sectors)
- + bio_sectors)) << 9;
- if (max < 0)
- max = 0; /* bio_add cannot handle a negative return */
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- if (max < biovec->bv_len)
- /* too small already, no need to check further */
- return max;
- if (!conf->has_merge_bvec)
- return max;
-
- /* May need to check subordinate device */
- sector = sector_offset;
- zone = find_zone(mddev->private, &sector_offset);
- rdev = map_sector(mddev, zone, sector, &sector_offset);
- subq = bdev_get_queue(rdev->bdev);
- if (subq->merge_bvec_fn) {
- bvm->bi_bdev = rdev->bdev;
- bvm->bi_sector = sector_offset + zone->dev_start +
- rdev->data_offset;
- return min(max, subq->merge_bvec_fn(subq, bvm, biovec));
- } else
- return max;
-}
-
static sector_t raid0_size(struct mddev *mddev, sector_t sectors, int raid_disks)
{
sector_t array_sectors = 0;
@@ -471,7 +415,6 @@ static int raid0_run(struct mddev *mddev)
mddev->queue->backing_dev_info.ra_pages = 2* stripe;
}

- blk_queue_merge_bvec(mddev->queue, raid0_mergeable_bvec);
dump_zones(mddev);

ret = md_integrity_register(mddev);
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 05539d9..7127a62 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -12,8 +12,6 @@ struct r0conf {
struct md_rdev **devlist; /* lists of rdevs, pointed to
* by strip_zone->dev */
int nr_strip_zones;
- int has_merge_bvec; /* at least one member has
- * a merge_bvec_fn */
};

#endif
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 40b35be..c2f236c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -551,7 +551,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
|| rdev == NULL
- || test_bit(Unmerged, &rdev->flags)
|| test_bit(Faulty, &rdev->flags))
continue;
if (!test_bit(In_sync, &rdev->flags) &&
@@ -701,39 +700,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
return best_disk;
}

-static int raid1_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mddev *mddev = q->queuedata;
- struct r1conf *conf = mddev->private;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int max = biovec->bv_len;
-
- if (mddev->merge_check_needed) {
- int disk;
- rcu_read_lock();
- for (disk = 0; disk < conf->raid_disks * 2; disk++) {
- struct md_rdev *rdev = rcu_dereference(
- conf->mirrors[disk].rdev);
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- struct request_queue *q =
- bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn) {
- bvm->bi_sector = sector +
- rdev->data_offset;
- bvm->bi_bdev = rdev->bdev;
- max = min(max, q->merge_bvec_fn(
- q, bvm, biovec));
- }
- }
- }
- rcu_read_unlock();
- }
- return max;
-
-}
-
int md_raid1_congested(struct mddev *mddev, int bits)
{
struct r1conf *conf = mddev->private;
@@ -1266,8 +1232,7 @@ read_again:
break;
}
r1_bio->bios[i] = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags)
- || test_bit(Unmerged, &rdev->flags)) {
+ if (!rdev || test_bit(Faulty, &rdev->flags)) {
if (i < conf->raid_disks)
set_bit(R1BIO_Degraded, &r1_bio->state);
continue;
@@ -1611,7 +1576,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
struct raid1_info *p;
int first = 0;
int last = conf->raid_disks - 1;
- struct request_queue *q = bdev_get_queue(rdev->bdev);

if (mddev->recovery_disabled == conf->recovery_disabled)
return -EBUSY;
@@ -1619,11 +1583,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

- if (q->merge_bvec_fn) {
- set_bit(Unmerged, &rdev->flags);
- mddev->merge_check_needed = 1;
- }
-
for (mirror = first; mirror <= last; mirror++) {
p = conf->mirrors+mirror;
if (!p->rdev) {
@@ -1655,19 +1614,6 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
break;
}
}
- if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
- /* Some requests might not have seen this new
- * merge_bvec_fn. We must wait for them to complete
- * before merging the device fully.
- * First we make sure any code which has tested
- * our function has submitted the request, then
- * we wait for all outstanding requests to complete.
- */
- synchronize_sched();
- freeze_array(conf, 0);
- unfreeze_array(conf);
- clear_bit(Unmerged, &rdev->flags);
- }
md_integrity_add_rdev(rdev, mddev);
if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
@@ -2810,8 +2756,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
goto abort;
disk->rdev = rdev;
q = bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn)
- mddev->merge_check_needed = 1;

disk->head_position = 0;
disk->seq_start = MaxSector;
@@ -2957,7 +2901,6 @@ static int run(struct mddev *mddev)
if (mddev->queue) {
mddev->queue->backing_dev_info.congested_fn = raid1_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
- blk_queue_merge_bvec(mddev->queue, raid1_mergeable_bvec);

if (discard_supported)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD,
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 4a40354..08772a9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -672,94 +672,6 @@ static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
return (vchunk << geo->chunk_shift) + offset;
}

-/**
- * raid10_mergeable_bvec -- tell bio layer if a two requests can be merged
- * @q: request queue
- * @bvm: properties of new bio
- * @biovec: the request that could be merged to it.
- *
- * Return amount of bytes we can accept at this offset
- * This requires checking for end-of-chunk if near_copies != raid_disks,
- * and for subordinate merge_bvec_fns if merge_check_needed.
- */
-static int raid10_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mddev *mddev = q->queuedata;
- struct r10conf *conf = mddev->private;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int max;
- unsigned int chunk_sectors;
- unsigned int bio_sectors = bvm->bi_size >> 9;
- struct geom *geo = &conf->geo;
-
- chunk_sectors = (conf->geo.chunk_mask & conf->prev.chunk_mask) + 1;
- if (conf->reshape_progress != MaxSector &&
- ((sector >= conf->reshape_progress) !=
- conf->mddev->reshape_backwards))
- geo = &conf->prev;
-
- if (geo->near_copies < geo->raid_disks) {
- max = (chunk_sectors - ((sector & (chunk_sectors - 1))
- + bio_sectors)) << 9;
- if (max < 0)
- /* bio_add cannot handle a negative return */
- max = 0;
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- } else
- max = biovec->bv_len;
-
- if (mddev->merge_check_needed) {
- struct {
- struct r10bio r10_bio;
- struct r10dev devs[conf->copies];
- } on_stack;
- struct r10bio *r10_bio = &on_stack.r10_bio;
- int s;
- if (conf->reshape_progress != MaxSector) {
- /* Cannot give any guidance during reshape */
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- return 0;
- }
- r10_bio->sector = sector;
- raid10_find_phys(conf, r10_bio);
- rcu_read_lock();
- for (s = 0; s < conf->copies; s++) {
- int disk = r10_bio->devs[s].devnum;
- struct md_rdev *rdev = rcu_dereference(
- conf->mirrors[disk].rdev);
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- struct request_queue *q =
- bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn) {
- bvm->bi_sector = r10_bio->devs[s].addr
- + rdev->data_offset;
- bvm->bi_bdev = rdev->bdev;
- max = min(max, q->merge_bvec_fn(
- q, bvm, biovec));
- }
- }
- rdev = rcu_dereference(conf->mirrors[disk].replacement);
- if (rdev && !test_bit(Faulty, &rdev->flags)) {
- struct request_queue *q =
- bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn) {
- bvm->bi_sector = r10_bio->devs[s].addr
- + rdev->data_offset;
- bvm->bi_bdev = rdev->bdev;
- max = min(max, q->merge_bvec_fn(
- q, bvm, biovec));
- }
- }
- }
- rcu_read_unlock();
- }
- return max;
-}
-
/*
* This routine returns the disk from which the requested read should
* be done. There is a per-array 'next expected sequential IO' sector
@@ -822,12 +734,10 @@ retry:
disk = r10_bio->devs[slot].devnum;
rdev = rcu_dereference(conf->mirrors[disk].replacement);
if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags) ||
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (rdev == NULL ||
- test_bit(Faulty, &rdev->flags) ||
- test_bit(Unmerged, &rdev->flags))
+ test_bit(Faulty, &rdev->flags))
continue;
if (!test_bit(In_sync, &rdev->flags) &&
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
@@ -1336,11 +1246,9 @@ retry_write:
blocked_rdev = rrdev;
break;
}
- if (rdev && (test_bit(Faulty, &rdev->flags)
- || test_bit(Unmerged, &rdev->flags)))
+ if (rdev && (test_bit(Faulty, &rdev->flags)))
rdev = NULL;
- if (rrdev && (test_bit(Faulty, &rrdev->flags)
- || test_bit(Unmerged, &rrdev->flags)))
+ if (rrdev && (test_bit(Faulty, &rrdev->flags)))
rrdev = NULL;

r10_bio->devs[i].bio = NULL;
@@ -1787,7 +1695,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
int mirror;
int first = 0;
int last = conf->geo.raid_disks - 1;
- struct request_queue *q = bdev_get_queue(rdev->bdev);

if (mddev->recovery_cp < MaxSector)
/* only hot-add to in-sync arrays, as recovery is
@@ -1800,11 +1707,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;

- if (q->merge_bvec_fn) {
- set_bit(Unmerged, &rdev->flags);
- mddev->merge_check_needed = 1;
- }
-
if (rdev->saved_raid_disk >= first &&
conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
mirror = rdev->saved_raid_disk;
@@ -1843,19 +1745,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
rcu_assign_pointer(p->rdev, rdev);
break;
}
- if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
- /* Some requests might not have seen this new
- * merge_bvec_fn. We must wait for them to complete
- * before merging the device fully.
- * First we make sure any code which has tested
- * our function has submitted the request, then
- * we wait for all outstanding requests to complete.
- */
- synchronize_sched();
- freeze_array(conf, 0);
- unfreeze_array(conf);
- clear_bit(Unmerged, &rdev->flags);
- }
md_integrity_add_rdev(rdev, mddev);
if (mddev->queue && blk_queue_discard(bdev_get_queue(rdev->bdev)))
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue);
@@ -2404,7 +2293,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (rdev &&
- !test_bit(Unmerged, &rdev->flags) &&
test_bit(In_sync, &rdev->flags) &&
is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
&first_bad, &bad_sectors) == 0) {
@@ -2458,7 +2346,6 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (!rdev ||
- test_bit(Unmerged, &rdev->flags) ||
!test_bit(In_sync, &rdev->flags))
continue;

@@ -3657,8 +3544,6 @@ static int run(struct mddev *mddev)
disk->rdev = rdev;
}
q = bdev_get_queue(rdev->bdev);
- if (q->merge_bvec_fn)
- mddev->merge_check_needed = 1;
diff = (rdev->new_data_offset - rdev->data_offset);
if (!mddev->reshape_backwards)
diff = -diff;
@@ -3757,7 +3642,6 @@ static int run(struct mddev *mddev)
stripe /= conf->geo.near_copies;
if (mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
mddev->queue->backing_dev_info.ra_pages = 2 * stripe;
- blk_queue_merge_bvec(mddev->queue, raid10_mergeable_bvec);
}

if (md_integrity_register(mddev))
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 40e464c..6008a30 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4103,32 +4103,6 @@ static int raid5_congested(void *data, int bits)
md_raid5_congested(mddev, bits);
}

-/* We want read requests to align with chunks where possible,
- * but write requests don't need to.
- */
-static int raid5_mergeable_bvec(struct request_queue *q,
- struct bvec_merge_data *bvm,
- struct bio_vec *biovec)
-{
- struct mddev *mddev = q->queuedata;
- sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
- int max;
- unsigned int chunk_sectors = mddev->chunk_sectors;
- unsigned int bio_sectors = bvm->bi_size >> 9;
-
- if ((bvm->bi_rw & 1) == WRITE)
- return biovec->bv_len; /* always allow writes to be mergeable */
-
- if (mddev->new_chunk_sectors < mddev->chunk_sectors)
- chunk_sectors = mddev->new_chunk_sectors;
- max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
- if (max < 0) max = 0;
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- else
- return max;
-}
-
static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
{
sector_t sector = bio->bi_iter.bi_sector + get_start_sect(bio->bi_bdev);
@@ -6152,8 +6126,6 @@ static int run(struct mddev *mddev)
if (mddev->queue->backing_dev_info.ra_pages < 2 * stripe)
mddev->queue->backing_dev_info.ra_pages = 2 * stripe;

- blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
-
mddev->queue->backing_dev_info.congested_data = mddev;
mddev->queue->backing_dev_info.congested_fn = raid5_congested;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c03e37a..7a8c95c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -239,14 +239,6 @@ typedef int (prep_rq_fn) (struct request_queue *, struct request *);
typedef void (unprep_rq_fn) (struct request_queue *, struct request *);

struct bio_vec;
-struct bvec_merge_data {
- struct block_device *bi_bdev;
- sector_t bi_sector;
- unsigned bi_size;
- unsigned long bi_rw;
-};
-typedef int (merge_bvec_fn) (struct request_queue *, struct bvec_merge_data *,
- struct bio_vec *);
typedef void (softirq_done_fn)(struct request *);
typedef int (dma_drain_needed_fn)(struct request *);
typedef int (lld_busy_fn) (struct request_queue *q);
@@ -327,7 +319,6 @@ struct request_queue {
make_request_fn *make_request_fn;
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
- merge_bvec_fn *merge_bvec_fn;
softirq_done_fn *softirq_done_fn;
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
@@ -1036,7 +1027,6 @@ extern void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn);
extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
extern void blk_queue_prep_rq(struct request_queue *, prep_rq_fn *pfn);
extern void blk_queue_unprep_rq(struct request_queue *, unprep_rq_fn *ufn);
-extern void blk_queue_merge_bvec(struct request_queue *, merge_bvec_fn *);
extern void blk_queue_dma_alignment(struct request_queue *, int);
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index ca6d2acc..2f7f2df 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -77,9 +77,6 @@ typedef int (*dm_message_fn) (struct dm_target *ti, unsigned argc, char **argv);
typedef int (*dm_ioctl_fn) (struct dm_target *ti, unsigned int cmd,
unsigned long arg);

-typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
- struct bio_vec *biovec, int max_size);
-
/*
* These iteration functions are typically used to check (and combine)
* properties of underlying devices.
@@ -153,7 +150,6 @@ struct target_type {
dm_status_fn status;
dm_message_fn message;
dm_ioctl_fn ioctl;
- dm_merge_fn merge;
dm_busy_fn busy;
dm_iterate_devices_fn iterate_devices;
dm_io_hints_fn io_hints;
--
2.1.0

2014-12-22 11:50:48

by Dongsu Park

Subject: [RFC PATCH 16/17] fs: convert buffer head etc. to use immutable biovecs API.

From: Kent Overstreet <[email protected]>

Increment bio->bi_remaining instead of calling bio_get(), and call
bio_endio() instead of bio_put() upon buffer_head submission.
Also make submit() in kernel/power/block_io.c wait for the bio with
submit_bio_wait() when no bio_chain is given.

With that, code that was still using the older API is converted to
the immutable biovecs API.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/buffer.c | 4 ++--
kernel/power/block_io.c | 23 ++++++++++++++++++-----
2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 35ac0ec..78e63e3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3038,13 +3038,13 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
if (buffer_prio(bh))
rw |= REQ_PRIO;

- bio_get(bio);
+ atomic_inc(&bio->bi_remaining);
submit_bio(rw, bio);

if (bio_flagged(bio, BIO_EOPNOTSUPP))
ret = -EOPNOTSUPP;

- bio_put(bio);
+ bio_endio(bio, 0);
return ret;
}
EXPORT_SYMBOL_GPL(_submit_bh);
diff --git a/kernel/power/block_io.c b/kernel/power/block_io.c
index 9a58bc2..7206408 100644
--- a/kernel/power/block_io.c
+++ b/kernel/power/block_io.c
@@ -34,7 +34,6 @@ static int submit(int rw, struct block_device *bdev, sector_t sector,
bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
- bio->bi_end_io = end_swap_bio_read;

if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) {
printk(KERN_ERR "PM: Adding page to bio failed at %llu\n",
@@ -44,15 +43,29 @@ static int submit(int rw, struct block_device *bdev, sector_t sector,
}

lock_page(page);
- bio_get(bio);

if (bio_chain == NULL) {
- submit_bio(bio_rw, bio);
- wait_on_page_locked(page);
+ int err = submit_bio_wait(bio_rw, bio);
+
+ if (err) {
+ SetPageError(page);
+ ClearPageUptodate(page);
+ pr_alert("Read-error on swap-device (%u:%u:%llu)\n",
+ imajor(bio->bi_bdev->bd_inode),
+ iminor(bio->bi_bdev->bd_inode),
+ (unsigned long long)bio->bi_iter.bi_sector);
+ } else {
+ SetPageUptodate(page);
+ }
+
if (rw == READ)
- bio_set_pages_dirty(bio);
+ set_page_dirty_lock(page);
+ unlock_page(page);
bio_put(bio);
} else {
+ bio->bi_end_io = end_swap_bio_read;
+ bio_get(bio);
+
if (rw == READ)
get_page(page); /* These pages are freed later */
bio->bi_private = *bio_chain;
--
2.1.0
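
Note on 16/17: the commit message does not say why bi_remaining is bumped
in _submit_bh(). One possible reading is that the extra count keeps the
bio from completing (and being freed by its end_io handler) until the
submitter has finished looking at its flags. Below is a minimal user-space
sketch of that counting pattern. It is not kernel code, and every toy_*
name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct bio: only the completion counter is modeled. */
struct toy_bio {
	atomic_int remaining;	/* models bio->bi_remaining */
	int error;
	bool completed;		/* set once the end_io callback would run */
};

static void toy_bio_init(struct toy_bio *bio)
{
	atomic_init(&bio->remaining, 1);	/* one completion owed by the driver */
	bio->error = 0;
	bio->completed = false;
}

/* Models bio_endio(): completion happens only when the count hits zero. */
static void toy_bio_endio(struct toy_bio *bio, int error)
{
	if (error)
		bio->error = error;
	if (atomic_fetch_sub(&bio->remaining, 1) == 1)
		bio->completed = true;
}

/*
 * Models the submitter side: take an extra count, let the driver run,
 * inspect the bio, then drop the extra count. Returns whether the bio
 * had already completed while the submitter was still inspecting it.
 */
static bool toy_submit(struct toy_bio *bio, bool driver_completes_inline)
{
	bool early;

	atomic_fetch_add(&bio->remaining, 1);	/* submitter's hold */
	if (driver_completes_inline)
		toy_bio_endio(bio, 0);		/* driver finishes at once */
	early = bio->completed;			/* safe to look at the bio here */
	toy_bio_endio(bio, 0);			/* submitter drops its hold */
	return early;
}
```

In this model the bio never completes while the submitter still holds its
count, whether or not the driver finishes inline.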

2014-12-22 11:50:46

by Dongsu Park

[permalink] [raw]
Subject: [RFC PATCH 15/17] fs: use helper bio_add_page() instead of open coding on bi_io_vec

From: Kent Overstreet <[email protected]>

Call the existing helper bio_add_page() instead of open coding the
setup of bi_io_vec[0], bi_vcnt and bi_size. This makes some parts of
the filesystems and mm/page_io.c simpler than before.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: [email protected]
---
fs/buffer.c | 7 ++-----
fs/jfs/jfs_logmgr.c | 14 ++++----------
mm/page_io.c | 8 +++-----
3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..35ac0ec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3022,12 +3022,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)

bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
- bio->bi_io_vec[0].bv_page = bh->b_page;
- bio->bi_io_vec[0].bv_len = bh->b_size;
- bio->bi_io_vec[0].bv_offset = bh_offset(bh);

- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = bh->b_size;
+ bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
+ BUG_ON(bio->bi_iter.bi_size != bh->b_size);

bio->bi_end_io = end_bio_bh_io_sync;
bio->bi_private = bh;
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dc..46fae06 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,12 +1999,9 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)

bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
- bio->bi_io_vec[0].bv_page = bp->l_page;
- bio->bi_io_vec[0].bv_len = LOGPSIZE;
- bio->bi_io_vec[0].bv_offset = bp->l_offset;

- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = LOGPSIZE;
+ bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+ BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);

bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
@@ -2145,12 +2142,9 @@ static void lbmStartIO(struct lbuf * bp)
bio = bio_alloc(GFP_NOFS, 1);
bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
- bio->bi_io_vec[0].bv_page = bp->l_page;
- bio->bi_io_vec[0].bv_len = LOGPSIZE;
- bio->bi_io_vec[0].bv_offset = bp->l_offset;

- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = LOGPSIZE;
+ bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+ BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);

bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..8c878c7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,12 +33,10 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
if (bio) {
bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
- bio->bi_io_vec[0].bv_page = page;
- bio->bi_io_vec[0].bv_len = PAGE_SIZE;
- bio->bi_io_vec[0].bv_offset = 0;
- bio->bi_vcnt = 1;
- bio->bi_iter.bi_size = PAGE_SIZE;
bio->bi_end_io = end_io;
+
+ bio_add_page(bio, page, PAGE_SIZE, 0);
+ BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
}
return bio;
}
--
2.1.0

2014-12-22 11:52:25

by Dongsu Park

[permalink] [raw]
Subject: [RFC PATCH 17/17] Documentation: update notes in biovecs about arbitrarily sized bios

Update block/biovecs.txt so that it includes a note on what kind of
effects arbitrarily sized bios would bring to the block layer.
Also fix a trivial typo, bio_iter_iovec.

Signed-off-by: Dongsu Park <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
---
Documentation/block/biovecs.txt | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 74a32ad..339045d 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -24,7 +24,7 @@ particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.

* Driver code should no longer refer to biovecs directly; we now have
- bio_iovec() and bio_iovec_iter() macros that return literal struct biovecs,
+ bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
constructed from the raw biovecs but taking into account bi_bvec_done and
bi_size.

@@ -109,3 +109,18 @@ Other implications:
over all the biovecs in the new bio - which is silly as it's not needed.

So, don't use bi_vcnt anymore.
+
+ * As of 3.18, block layer is written based on merging biovecs. Its goal is
+ to avoid having to split bios; upper layer code such as bio_add_page()
+ checks what the underlying device can handle, and tries to always create
+ bios that don't need to be split. However, this approach has been actually
+ cumbersome and error-prone. It eventually breaks down with stacked devices
+ and devices with dynamic limits, which then adds a lot of complexity.
+
+ So its new interface allows the block layer to split bios as needed, so we
+ could eliminate a lot of complexity elsewhere - particularly in stacked
+ drivers. Code that creates bios can then create whatever size bios are
+ convenient, and more importantly stacked drivers don't have to deal with
+ both their own bio size limitations and the limitations of the underlying
+ devices. Thus there's no need to define ->merge_bvec_fn() callbacks for
+ individual block drivers.
--
2.1.0

2014-12-22 11:50:34

by Dongsu Park

[permalink] [raw]
Subject: [RFC PATCH 12/17] md/raid10: make sync_request_write() call bio_copy_data()

From: Kent Overstreet <[email protected]>

Refactor sync_request_write() of md/raid10 to use bio_copy_data()
instead of open coding bio_vec iterations.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: [email protected]
---
drivers/md/raid10.c | 20 +++++---------------
1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 32e282f..4a40354 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2107,18 +2107,11 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
tbio->bi_vcnt = vcnt;
tbio->bi_iter.bi_size = r10_bio->sectors << 9;
tbio->bi_rw = WRITE;
- tbio->bi_private = r10_bio;
tbio->bi_iter.bi_sector = r10_bio->devs[i].addr;
-
- for (j=0; j < vcnt ; j++) {
- tbio->bi_io_vec[j].bv_offset = 0;
- tbio->bi_io_vec[j].bv_len = PAGE_SIZE;
-
- memcpy(page_address(tbio->bi_io_vec[j].bv_page),
- page_address(fbio->bi_io_vec[j].bv_page),
- PAGE_SIZE);
- }
tbio->bi_end_io = end_sync_write;
+ tbio->bi_private = r10_bio;
+
+ bio_copy_data(tbio, fbio);

d = r10_bio->devs[i].devnum;
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
@@ -2134,17 +2127,14 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
* that are active
*/
for (i = 0; i < conf->copies; i++) {
- int j, d;
+ int d;

tbio = r10_bio->devs[i].repl_bio;
if (!tbio || !tbio->bi_end_io)
continue;
if (r10_bio->devs[i].bio->bi_end_io != end_sync_write
&& r10_bio->devs[i].bio != fbio)
- for (j = 0; j < vcnt; j++)
- memcpy(page_address(tbio->bi_io_vec[j].bv_page),
- page_address(fbio->bi_io_vec[j].bv_page),
- PAGE_SIZE);
+ bio_copy_data(tbio, fbio);
d = r10_bio->devs[i].devnum;
atomic_inc(&r10_bio->remaining);
md_sync_acct(conf->mirrors[d].replacement->bdev,
--
2.1.0
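
Note on 12/17: the open-coded loops removed above copy page by page and
assume both bios have identically laid out segments. A helper in the style
of bio_copy_data() walks source and destination independently, so the
segment boundaries do not have to line up. Below is a rough user-space
sketch of that idea. struct toy_seg and toy_copy_data() are made-up names
and this is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a bio_vec: a buffer and its length. */
struct toy_seg {
	unsigned char *data;
	size_t len;
};

/*
 * Copy from the src segment list into the dst segment list until either
 * list is exhausted. Returns the number of bytes copied.
 */
static size_t toy_copy_data(struct toy_seg *dst, size_t ndst,
			    const struct toy_seg *src, size_t nsrc)
{
	size_t di = 0, si = 0;		/* current segment in each list */
	size_t doff = 0, soff = 0;	/* offset inside the current segment */
	size_t total = 0;

	while (di < ndst && si < nsrc) {
		size_t dleft = dst[di].len - doff;
		size_t sleft = src[si].len - soff;
		size_t n = dleft < sleft ? dleft : sleft;

		memcpy(dst[di].data + doff, src[si].data + soff, n);
		total += n;
		doff += n;
		soff += n;

		/* Advance whichever side has used up its segment. */
		if (doff == dst[di].len) {
			di++;
			doff = 0;
		}
		if (soff == src[si].len) {
			si++;
			soff = 0;
		}
	}
	return total;
}
```

Each step copies the smaller of the two remaining segment lengths, which
is what lets a 2+6 byte source fill a 3+5 byte destination.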

2014-12-22 11:53:22

by Dongsu Park

[permalink] [raw]
Subject: [RFC PATCH 11/17] block: allow __blk_queue_bounce() to handle bios larger than BIO_MAX_PAGES

From: Kent Overstreet <[email protected]>

Allow __blk_queue_bounce() to handle bios with more than BIO_MAX_PAGES
segments. Doing that, it becomes possible to simplify the block layer
in the kernel.

The issue is that any code that clones the bio and must clone the biovec
(i.e. it can't use bio_clone_fast()) won't be able to allocate a bio with
more than BIO_MAX_PAGES - bio_alloc_bioset() always fails in that case.

Fortunately, it's easy to make __blk_queue_bounce() just process part of
the bio if necessary, using bi_remaining to count the splits and punting
the rest back to generic_make_request().

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/bounce.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index ab21ba2..689ea89 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -196,6 +196,43 @@ static int must_snapshot_stable_pages(struct request_queue *q, struct bio *bio)
}
#endif /* CONFIG_NEED_BOUNCE_POOL */

+static struct bio *bio_clone_segments(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, unsigned nsegs)
+{
+ struct bvec_iter iter;
+ struct bio_vec bv;
+ struct bio *bio;
+
+ bio = bio_alloc_bioset(gfp_mask, nsegs, bs);
+ if (!bio)
+ return NULL;
+
+ bio->bi_bdev = bio_src->bi_bdev;
+ bio->bi_rw = bio_src->bi_rw;
+ bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
+
+ bio_for_each_segment(bv, bio_src, iter) {
+ bio->bi_io_vec[bio->bi_vcnt++] = bv;
+ bio->bi_iter.bi_size += bv.bv_len;
+ if (!--nsegs)
+ break;
+ }
+
+ if (bio_integrity(bio_src)) {
+ int ret;
+
+ ret = bio_integrity_clone(bio, bio_src, gfp_mask);
+ if (ret < 0) {
+ bio_put(bio);
+ return NULL;
+ }
+ }
+
+ bio_src->bi_iter = iter;
+
+ return bio;
+}
+
static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
mempool_t *pool, int force)
{
@@ -203,17 +240,24 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
int rw = bio_data_dir(*bio_orig);
struct bio_vec *to, from;
struct bvec_iter iter;
- unsigned i;
+ int i, nsegs = 0, bounce = force;

- if (force)
- goto bounce;
- bio_for_each_segment(from, *bio_orig, iter)
+ bio_for_each_segment(from, *bio_orig, iter) {
+ nsegs++;
if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
- goto bounce;
+ bounce = 1;
+ }
+
+ if (!bounce)
+ return;

- return;
-bounce:
- bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
+ bio = bio_clone_segments(*bio_orig, GFP_NOIO, fs_bio_set,
+ min(nsegs, BIO_MAX_PAGES));
+
+ if ((*bio_orig)->bi_iter.bi_size) {
+ atomic_inc(&(*bio_orig)->bi_remaining);
+ generic_make_request(*bio_orig);
+ }

bio_for_each_segment_all(to, bio, i) {
struct page *page = to->bv_page;
--
2.1.0
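
Note on 11/17: the commit message describes processing only part of an
oversized bio and punting the rest back for another pass. Below is a
small user-space model of that loop, assuming a cap of 4 segments per
clone in place of BIO_MAX_PAGES. All names are hypothetical, and the
resubmission through generic_make_request() is modeled as a plain loop.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SEGS 4	/* stands in for BIO_MAX_PAGES in this model */

/* Toy stand-in for a bio and its iterator position. */
struct toy_bio {
	size_t first;	/* index of the first unprocessed segment */
	size_t nsegs;	/* total number of segments in the request */
};

/*
 * Take up to max segments off the front of src and advance its position,
 * in the spirit of bio_clone_segments() updating bio_src->bi_iter.
 * Returns how many segments were taken.
 */
static size_t toy_clone_segments(struct toy_bio *src, size_t max)
{
	size_t left = src->nsegs - src->first;
	size_t take = left < max ? left : max;

	src->first += take;
	return take;
}

/*
 * Process the whole request in capped pieces. Returns the number of
 * pieces and stores the total number of segments handled in *processed.
 */
static size_t toy_bounce(struct toy_bio *src, size_t *processed)
{
	size_t pieces = 0;

	*processed = 0;
	while (src->first < src->nsegs) {
		*processed += toy_clone_segments(src, TOY_MAX_SEGS);
		pieces++;
	}
	return pieces;
}
```

A 10-segment request is handled here as pieces of 4, 4 and 2 segments,
and a request that already fits under the cap goes through in one piece.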

2014-12-22 11:54:42

by Dongsu Park

[permalink] [raw]
Subject: [RFC PATCH 04/17] bcache: clean up hacks around bio_split_pool

From: Kent Overstreet <[email protected]>

bcache has carried its own workarounds, both for the bio split pool and
for submitting bios. Since generic_make_request() is now able to handle
arbitrarily sized bios, it's possible to delete those hacks.

Signed-off-by: Kent Overstreet <[email protected]>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <[email protected]>
Cc: [email protected]
---
drivers/md/bcache/bcache.h | 18 --------
drivers/md/bcache/io.c | 100 +-----------------------------------------
drivers/md/bcache/journal.c | 4 +-
drivers/md/bcache/request.c | 16 +++----
drivers/md/bcache/super.c | 32 +-------------
drivers/md/bcache/util.h | 5 ++-
drivers/md/bcache/writeback.c | 4 +-
7 files changed, 18 insertions(+), 161 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 04f7bc2..6b420a5 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -243,19 +243,6 @@ struct keybuf {
DECLARE_ARRAY_ALLOCATOR(struct keybuf_key, freelist, KEYBUF_NR);
};

-struct bio_split_pool {
- struct bio_set *bio_split;
- mempool_t *bio_split_hook;
-};
-
-struct bio_split_hook {
- struct closure cl;
- struct bio_split_pool *p;
- struct bio *bio;
- bio_end_io_t *bi_end_io;
- void *bi_private;
-};
-
struct bcache_device {
struct closure cl;

@@ -288,8 +275,6 @@ struct bcache_device {
int (*cache_miss)(struct btree *, struct search *,
struct bio *, unsigned);
int (*ioctl) (struct bcache_device *, fmode_t, unsigned, unsigned long);
-
- struct bio_split_pool bio_split_hook;
};

struct io {
@@ -454,8 +439,6 @@ struct cache {
atomic_long_t meta_sectors_written;
atomic_long_t btree_sectors_written;
atomic_long_t sectors_written;
-
- struct bio_split_pool bio_split_hook;
};

struct gc_stat {
@@ -873,7 +856,6 @@ void bch_bbio_endio(struct cache_set *, struct bio *, int, const char *);
void bch_bbio_free(struct bio *, struct cache_set *);
struct bio *bch_bbio_alloc(struct cache_set *);

-void bch_generic_make_request(struct bio *, struct bio_split_pool *);
void __bch_submit_bbio(struct bio *, struct cache_set *);
void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);

diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..86a0bb8 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -11,104 +11,6 @@

#include <linux/blkdev.h>

-static unsigned bch_bio_max_sectors(struct bio *bio)
-{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- struct bio_vec bv;
- struct bvec_iter iter;
- unsigned ret = 0, seg = 0;
-
- if (bio->bi_rw & REQ_DISCARD)
- return min(bio_sectors(bio), q->limits.max_discard_sectors);
-
- bio_for_each_segment(bv, bio, iter) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_iter.bi_sector,
- .bi_size = ret << 9,
- .bi_rw = bio->bi_rw,
- };
-
- if (seg == min_t(unsigned, BIO_MAX_PAGES,
- queue_max_segments(q)))
- break;
-
- if (q->merge_bvec_fn &&
- q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
- break;
-
- seg++;
- ret += bv.bv_len >> 9;
- }
-
- ret = min(ret, queue_max_sectors(q));
-
- WARN_ON(!ret);
- ret = max_t(int, ret, bio_iovec(bio).bv_len >> 9);
-
- return ret;
-}
-
-static void bch_bio_submit_split_done(struct closure *cl)
-{
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- s->bio->bi_end_io = s->bi_end_io;
- s->bio->bi_private = s->bi_private;
- bio_endio_nodec(s->bio, 0);
-
- closure_debug_destroy(&s->cl);
- mempool_free(s, s->p->bio_split_hook);
-}
-
-static void bch_bio_submit_split_endio(struct bio *bio, int error)
-{
- struct closure *cl = bio->bi_private;
- struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
-
- if (error)
- clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
-
- bio_put(bio);
- closure_put(cl);
-}
-
-void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
-{
- struct bio_split_hook *s;
- struct bio *n;
-
- if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
- goto submit;
-
- if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
- goto submit;
-
- s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
- closure_init(&s->cl, NULL);
-
- s->bio = bio;
- s->p = p;
- s->bi_end_io = bio->bi_end_io;
- s->bi_private = bio->bi_private;
- bio_get(bio);
-
- do {
- n = bio_next_split(bio, bch_bio_max_sectors(bio),
- GFP_NOIO, s->p->bio_split);
-
- n->bi_end_io = bch_bio_submit_split_endio;
- n->bi_private = &s->cl;
-
- closure_get(&s->cl);
- generic_make_request(n);
- } while (n != bio);
-
- continue_at(&s->cl, bch_bio_submit_split_done, NULL);
-submit:
- generic_make_request(bio);
-}
-
/* Bios with headers */

void bch_bbio_free(struct bio *bio, struct cache_set *c)
@@ -138,7 +40,7 @@ void __bch_submit_bbio(struct bio *bio, struct cache_set *c)
bio->bi_bdev = PTR_CACHE(c, &b->key, 0)->bdev;

b->submit_time_us = local_clock_us();
- closure_bio_submit(bio, bio->bi_private, PTR_CACHE(c, &b->key, 0));
+ closure_bio_submit(bio, bio->bi_private);
}

void bch_submit_bbio(struct bio *bio, struct cache_set *c,
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index fe080ad..af47e6c 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -61,7 +61,7 @@ reread: left = ca->sb.bucket_size - offset;
bio->bi_private = &cl;
bch_bio_map(bio, data);

- closure_bio_submit(bio, &cl, ca);
+ closure_bio_submit(bio, &cl);
closure_sync(&cl);

/* This function could be simpler now since we no longer write
@@ -646,7 +646,7 @@ static void journal_write_unlocked(struct closure *cl)
spin_unlock(&c->journal.lock);

while ((bio = bio_list_pop(&list)))
- closure_bio_submit(bio, cl, c->cache[0]);
+ closure_bio_submit(bio, cl);

continue_at(cl, journal_write_done, NULL);
}
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab43fad..89500e0 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -710,7 +710,7 @@ static void cached_dev_read_error(struct closure *cl)

/* XXX: invalidate cache */

- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}

continue_at(cl, cached_dev_cache_miss_done, NULL);
@@ -833,7 +833,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
s->cache_miss = miss;
s->iop.bio = cache_bio;
bio_get(cache_bio);
- closure_bio_submit(cache_bio, &s->cl, s->d);
+ closure_bio_submit(cache_bio, &s->cl);

return ret;
out_put:
@@ -841,7 +841,7 @@ out_put:
out_submit:
miss->bi_end_io = request_endio;
miss->bi_private = &s->cl;
- closure_bio_submit(miss, &s->cl, s->d);
+ closure_bio_submit(miss, &s->cl);
return ret;
}

@@ -906,7 +906,7 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)

if (!(bio->bi_rw & REQ_DISCARD) ||
blk_queue_discard(bdev_get_queue(dc->bdev)))
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
} else if (s->iop.writeback) {
bch_writeback_add(dc);
s->iop.bio = bio;
@@ -921,12 +921,12 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
flush->bi_end_io = request_endio;
flush->bi_private = cl;

- closure_bio_submit(flush, cl, s->d);
+ closure_bio_submit(flush, cl);
}
} else {
s->iop.bio = bio_clone_fast(bio, GFP_NOIO, dc->disk.bio_split);

- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);
}

closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
@@ -942,7 +942,7 @@ static void cached_dev_nodata(struct closure *cl)
bch_journal_meta(s->iop.c, cl);

/* If it's a flush, we send the flush to the backing device too */
- closure_bio_submit(bio, cl, s->d);
+ closure_bio_submit(bio, cl);

continue_at(cl, cached_dev_bio_complete, NULL);
}
@@ -986,7 +986,7 @@ static void cached_dev_make_request(struct request_queue *q, struct bio *bio)
!blk_queue_discard(bdev_get_queue(dc->bdev)))
bio_endio(bio, 0);
else
- bch_generic_make_request(bio, &d->bio_split_hook);
+ generic_make_request(bio);
}
}

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4dd2bb7..a542b58 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -59,29 +59,6 @@ struct workqueue_struct *bcache_wq;

#define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE)

-static void bio_split_pool_free(struct bio_split_pool *p)
-{
- if (p->bio_split_hook)
- mempool_destroy(p->bio_split_hook);
-
- if (p->bio_split)
- bioset_free(p->bio_split);
-}
-
-static int bio_split_pool_init(struct bio_split_pool *p)
-{
- p->bio_split = bioset_create(4, 0);
- if (!p->bio_split)
- return -ENOMEM;
-
- p->bio_split_hook = mempool_create_kmalloc_pool(4,
- sizeof(struct bio_split_hook));
- if (!p->bio_split_hook)
- return -ENOMEM;
-
- return 0;
-}
-
/* Superblock */

static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
@@ -537,7 +514,7 @@ static void prio_io(struct cache *ca, uint64_t bucket, unsigned long rw)
bio->bi_private = ca;
bch_bio_map(bio, ca->disk_buckets);

- closure_bio_submit(bio, &ca->prio, ca);
+ closure_bio_submit(bio, &ca->prio);
closure_sync(cl);
}

@@ -757,7 +734,6 @@ static void bcache_device_free(struct bcache_device *d)
put_disk(d->disk);
}

- bio_split_pool_free(&d->bio_split_hook);
if (d->bio_split)
bioset_free(d->bio_split);
if (is_vmalloc_addr(d->full_dirty_stripes))
@@ -810,7 +786,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
return minor;

if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
- bio_split_pool_init(&d->bio_split_hook) ||
!(d->disk = alloc_disk(1))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1799,8 +1774,6 @@ void bch_cache_release(struct kobject *kobj)
ca->set->cache[ca->sb.nr_this_dev] = NULL;
}

- bio_split_pool_free(&ca->bio_split_hook);
-
free_pages((unsigned long) ca->disk_buckets, ilog2(bucket_pages(ca)));
kfree(ca->prio_buckets);
vfree(ca->buckets);
@@ -1845,8 +1818,7 @@ static int cache_alloc(struct cache_sb *sb, struct cache *ca)
ca->sb.nbuckets)) ||
!(ca->prio_buckets = kzalloc(sizeof(uint64_t) * prio_buckets(ca) *
2, GFP_KERNEL)) ||
- !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)) ||
- bio_split_pool_init(&ca->bio_split_hook))
+ !(ca->disk_buckets = alloc_bucket_pages(GFP_KERNEL, ca)))
return -ENOMEM;

ca->prio_last_buckets = ca->prio_buckets + prio_buckets(ca);
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 98df757..e3dee05 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@

#include <linux/blkdev.h>
#include <linux/errno.h>
+#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/ratelimit.h>
@@ -576,10 +577,10 @@ static inline sector_t bdev_sectors(struct block_device *bdev)
return bdev->bd_inode->i_size >> 9;
}

-#define closure_bio_submit(bio, cl, dev) \
+#define closure_bio_submit(bio, cl) \
do { \
closure_get(cl); \
- bch_generic_make_request(bio, &(dev)->bio_split_hook); \
+ generic_make_request(bio); \
} while (0)

uint64_t bch_crc64_update(uint64_t, const void *, size_t);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bc..ca38362 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -188,7 +188,7 @@ static void write_dirty(struct closure *cl)
io->bio.bi_bdev = io->dc->bdev;
io->bio.bi_end_io = dirty_endio;

- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);

continue_at(cl, write_dirty_finish, system_wq);
}
@@ -208,7 +208,7 @@ static void read_dirty_submit(struct closure *cl)
{
struct dirty_io *io = container_of(cl, struct dirty_io, cl);

- closure_bio_submit(&io->bio, cl, &io->dc->disk);
+ closure_bio_submit(&io->bio, cl);

continue_at(cl, write_dirty, system_wq);
}
--
2.1.0

2014-12-22 15:22:48

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC PATCH 15/17] fs: use helper bio_add_page() instead of open coding on bi_io_vec

On 12/22/2014 05:48 AM, Dongsu Park wrote:
> From: Kent Overstreet <[email protected]>
>
> Call pre-defined helper bio_add_page() instead of open coding for
> iterating through bi_io_vec[]. Doing that, it's possible to make some
> parts in filesystems and mm/page_io.c simpler than before.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> [dpark: add more description in commit message]
> Signed-off-by: Dongsu Park <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Dave Kleikamp <[email protected]>

Acked-by: Dave Kleikamp <[email protected]>

> Cc: [email protected]
> ---
> fs/buffer.c | 7 ++-----
> fs/jfs/jfs_logmgr.c | 14 ++++----------
> mm/page_io.c | 8 +++-----
> 3 files changed, 9 insertions(+), 20 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..35ac0ec 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3022,12 +3022,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
>
> bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
> bio->bi_bdev = bh->b_bdev;
> - bio->bi_io_vec[0].bv_page = bh->b_page;
> - bio->bi_io_vec[0].bv_len = bh->b_size;
> - bio->bi_io_vec[0].bv_offset = bh_offset(bh);
>
> - bio->bi_vcnt = 1;
> - bio->bi_iter.bi_size = bh->b_size;
> + bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
> + BUG_ON(bio->bi_iter.bi_size != bh->b_size);
>
> bio->bi_end_io = end_bio_bh_io_sync;
> bio->bi_private = bh;
> diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
> index bc462dc..46fae06 100644
> --- a/fs/jfs/jfs_logmgr.c
> +++ b/fs/jfs/jfs_logmgr.c
> @@ -1999,12 +1999,9 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)
>
> bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
> bio->bi_bdev = log->bdev;
> - bio->bi_io_vec[0].bv_page = bp->l_page;
> - bio->bi_io_vec[0].bv_len = LOGPSIZE;
> - bio->bi_io_vec[0].bv_offset = bp->l_offset;
>
> - bio->bi_vcnt = 1;
> - bio->bi_iter.bi_size = LOGPSIZE;
> + bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
> + BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
>
> bio->bi_end_io = lbmIODone;
> bio->bi_private = bp;
> @@ -2145,12 +2142,9 @@ static void lbmStartIO(struct lbuf * bp)
> bio = bio_alloc(GFP_NOFS, 1);
> bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
> bio->bi_bdev = log->bdev;
> - bio->bi_io_vec[0].bv_page = bp->l_page;
> - bio->bi_io_vec[0].bv_len = LOGPSIZE;
> - bio->bi_io_vec[0].bv_offset = bp->l_offset;
>
> - bio->bi_vcnt = 1;
> - bio->bi_iter.bi_size = LOGPSIZE;
> + bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
> + BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
>
> bio->bi_end_io = lbmIODone;
> bio->bi_private = bp;
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 955db8b..8c878c7 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -33,12 +33,10 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
> if (bio) {
> bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
> bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
> - bio->bi_io_vec[0].bv_page = page;
> - bio->bi_io_vec[0].bv_len = PAGE_SIZE;
> - bio->bi_io_vec[0].bv_offset = 0;
> - bio->bi_vcnt = 1;
> - bio->bi_iter.bi_size = PAGE_SIZE;
> bio->bi_end_io = end_io;
> +
> + bio_add_page(bio, page, PAGE_SIZE, 0);
> + BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
> }
> return bio;
> }
>

2014-12-23 10:16:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

> +void blk_queue_split(struct request_queue *q, struct bio **bio,
> + struct bio_set *bs)
> +{
> + struct bio *split;
> +
> + if ((*bio)->bi_rw & REQ_DISCARD)
> + split = blk_bio_discard_split(q, *bio, bs);
> + else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> + split = blk_bio_write_same_split(q, *bio, bs);
> + else
> + split = blk_bio_segment_split(q, *bio, q->bio_split);
> +
> + if (split) {
> + bio_chain(split, *bio);
> + generic_make_request(*bio);
> + *bio = split;
> + }
> +}
> +EXPORT_SYMBOL(blk_queue_split);

I think blk_queue_split needs to explicitly skip BLOCK_PC bios. Those
are SCSI pass through ioctls that we can't split due to their opaque
nature.

2014-12-23 10:22:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 02/17] block: simplify bio_add_page()

On Mon, Dec 22, 2014 at 12:48:29PM +0100, Dongsu Park wrote:
> From: Kent Overstreet <[email protected]>
>
> Since generic_make_request() can now handle arbitrary size bios, all we
> have to do is make sure the bvec array doesn't overflow.
> __bio_add_page() doesn't need to call ->merge_bvec_fn(), where
> we can get rid of unnecessary code paths.

This needs an explanation of why removing the call to ->merge_bvec_fn
is fine for bio_add_pc_page. I guess it's because neither the target
pscsi pass through mode nor the osd code ever uses anything but simple
SCSI devices that don't even have one, but it needs to be clearly
spelled out.

> + * Attempt to add a page to the bio_vec maplist. This will only fail if
> + * bio->bi_vcnt == bio->bi_max_vecs.

It also fails on a cloned bio, although that might better be turned into
a BUG_ON().

2014-12-23 10:23:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 03/17] block: simplify issueing discard, write_same, zeroout

On Mon, Dec 22, 2014 at 12:48:30PM +0100, Dongsu Park wrote:
> From: Kent Overstreet <[email protected]>
>
> Simplify special cases for issueing discard, write_same, and zeroout,
> replacing bio_batch completions with submit_bio_wait(). This conversion
> is possible because generic_make_request() will now do for us what the
> code in blk-lib.c was doing manually, with the bio_batch stuff. So we
> still need some looping in case we're trying to discard/zeroout more
> than around a gigabyte, but when we can submit that much at a time
> doing the submissions in parallel really shouldn't matter.

Unless this makes later patches simpler I don't see a good reason
to remove this parallel submission for the gain of only about 100 fewer
lines of code.

2014-12-23 10:35:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 06/17] btrfs: make use of immutable biovecs

This seems like it could be applied without the rest of the series,
right? Might be worth getting it into the btrfs tree ASAP?

2014-12-23 10:44:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 07/17] block: replace sg_iovec with iov_iter

Does this and the next three patches really depend on the earlier ones?
Unless I'm missing something they are cleanups on their own.

It might make sense to get all these cleanups out as a preparatory
series first.

2014-12-23 10:45:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 08/17] block: refactor __bio_copy_iov()

> static int __bio_copy_iov(struct bio *bio, const struct iov_iter *iter,
> + int to_iov)
> {
> + int i;
> struct bio_vec *bvec;
> struct iov_iter iov_iter = *iter;

Why not pass the iov_iter by value?

> bio_for_each_segment_all(bvec, bio, i) {
> + ssize_t ret;
> +
> + if (to_iov == WRITE)
> + ret = copy_page_to_iter(bvec->bv_page,
> + bvec->bv_offset,
> + bvec->bv_len,
> + &iov_iter);
> + else
> + ret = copy_page_from_iter(bvec->bv_page,
> + bvec->bv_offset,
> + bvec->bv_len,
> + &iov_iter);
> +
> + if (!iov_iter_count(&iov_iter))
> + break;
>
> + if (ret < bvec->bv_len)
> + return -EFAULT;
> }
>
> + return 0;

Seems like this should be split into two functions for the read
and write cases?
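
The two review suggestions above (pass the iterator by value, and use one function per direction) can be sketched in userspace as follows. This is only an illustration: toy_seg, toy_iter, toy_copy_to_iter() and toy_copy_from_iter() are hypothetical stand-ins, not the kernel's bio_vec, iov_iter or copy_page_{to,from}_iter(), and plain memcpy() cannot fault the way a user copy can.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins, not the kernel's bio_vec / iov_iter. */
struct toy_seg {
        char   *buf;
        size_t  len;
};

struct toy_iter {
        char   *base;   /* next byte of the flat buffer */
        size_t  count;  /* bytes still available in the flat buffer */
};

/*
 * Copy segment data out to the flat buffer. The iterator is taken by
 * value, so the caller's copy is not advanced. Returns bytes copied.
 */
static size_t toy_copy_to_iter(const struct toy_seg *segs, size_t nsegs,
                               struct toy_iter iter)
{
        size_t done = 0;

        for (size_t i = 0; i < nsegs && iter.count; i++) {
                size_t n = segs[i].len < iter.count ? segs[i].len : iter.count;

                memcpy(iter.base, segs[i].buf, n);
                iter.base += n;
                iter.count -= n;
                done += n;
        }
        return done;
}

/* Opposite direction: fill the segments from the flat buffer. */
static size_t toy_copy_from_iter(struct toy_seg *segs, size_t nsegs,
                                 struct toy_iter iter)
{
        size_t done = 0;

        for (size_t i = 0; i < nsegs && iter.count; i++) {
                size_t n = segs[i].len < iter.count ? segs[i].len : iter.count;

                memcpy(segs[i].buf, iter.base, n);
                iter.base += n;
                iter.count -= n;
                done += n;
        }
        return done;
}
```

With two functions the direction is visible at the call site and no flag has to be tested per segment; with by-value passing the local copy of the iterator in the function body is no longer needed.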

2014-12-23 10:48:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 12/17] md/raid10: make sync_request_write() call bio_copy_data()

On Mon, Dec 22, 2014 at 12:48:39PM +0100, Dongsu Park wrote:
> From: Kent Overstreet <[email protected]>
>
> Refactor sync_request_write() of md/raid10 to use bio_copy_data()
> instead of open coding bio_vec iterations.

Seems like another one for the prep series?

2014-12-23 10:51:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 16/17] fs: convert buffer head etc. to use immutable biovecs API.

On Mon, Dec 22, 2014 at 12:48:43PM +0100, Dongsu Park wrote:
> From: Kent Overstreet <[email protected]>
>
> Increase bio->bi_remaining instead of calling bio_get(),
> and call bio_end() instead of bio_put() upon buffer_head submission.

Needs an explanation of why this is done.

> Also make kernel/power/block_io.c submit bios properly by checking
> whether bio_chain is available or not.

Should be a separate patch.

And of course both should go into the preparation series of cleanups.
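
For readers unfamiliar with the counter named in the quoted commit message, here is a minimal userspace sketch of "remaining"-style completion counting. It assumes the counter starts at 1, that every extra holder bumps it, and that the completion callback runs only when the last holder signals the end. toy_cbio and its helpers are hypothetical names, the kernel uses an atomic counter rather than a plain int, and the sketch illustrates the mechanism only, not the motivation the review asks for.

```c
#include <assert.h>

/* Hypothetical model of a bio with a completion counter. */
struct toy_cbio {
        int remaining;  /* models bi_remaining; assumed to start at 1 */
        int completed;  /* how many times the completion callback ran */
};

static void toy_cbio_init(struct toy_cbio *bio)
{
        bio->remaining = 1;
        bio->completed = 0;
}

/* An extra holder registers itself before submission. */
static void toy_cbio_hold(struct toy_cbio *bio)
{
        assert(bio->remaining > 0);
        bio->remaining++;
}

/* Each holder signals the end once; only the last one completes the bio. */
static void toy_cbio_endio(struct toy_cbio *bio)
{
        assert(bio->remaining > 0);
        if (--bio->remaining == 0)
                bio->completed++;
}
```

After one toy_cbio_hold(), the first toy_cbio_endio() leaves the bio pending and the second one completes it.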

2014-12-23 10:52:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 17/17] Documentation: update notes in biovecs about arbitrarily sized bios

> + * As of 3.18, block layer is written based on merging biovecs. Its goal is

I don't think this is true..

2014-12-23 11:41:45

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

Hi Christoph,

On 23.12.2014 02:16, Christoph Hellwig wrote:
> > +void blk_queue_split(struct request_queue *q, struct bio **bio,
> > +                     struct bio_set *bs)
> > +{
> > +        struct bio *split;
> > +
> > +        if ((*bio)->bi_rw & REQ_DISCARD)
> > +                split = blk_bio_discard_split(q, *bio, bs);
> > +        else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> > +                split = blk_bio_write_same_split(q, *bio, bs);
> > +        else
> > +                split = blk_bio_segment_split(q, *bio, q->bio_split);
> > +
> > +        if (split) {
> > +                bio_chain(split, *bio);
> > +                generic_make_request(*bio);
> > +                *bio = split;
> > +        }
> > +}
> > +EXPORT_SYMBOL(blk_queue_split);
>
> I think blk_queue_split needs to explicitly skip BLOCK_PC bios. Those
> are SCSI pass through ioctls that we can't split due to their opaque
> nature.

You mean, checking rq->cmd_type == REQ_TYPE_BLOCK_PC, right?

I'm wondering about how to check that in blk_queue_split().
At the moment when blk_queue_split() is called, it's even before a request
is mapped e.g. in blk_sq_make_request().
Unlike scsi drivers where it's easy to get cmd->rq, blk_queue_split()
doesn't seem to be able to get a request by blk_get_request().

Or am I missing something?

Thanks,
Dongsu

2014-12-23 11:46:47

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 02/17] block: simplify bio_add_page()

On 23.12.2014 02:22, Christoph Hellwig wrote:
> On Mon, Dec 22, 2014 at 12:48:29PM +0100, Dongsu Park wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > Since generic_make_request() can now handle arbitrary size bios, all we
> > have to do is make sure the bvec array doesn't overflow.
> > __bio_add_page() doesn't need to call ->merge_bvec_fn(), where
> > we can get rid of unnecessary code paths.
>
> This needs an explanation of why removing the call to ->merge_bvec_fn
> is fine for bio_add_pc_page. I guess it's because neither
> the target pscsi pass through mode, nor the osd code ever use anything
> but simple SCSI devices that don't even have one, but it needs to be
> clearly spelled out.

Agreed.

> > + * Attempt to add a page to the bio_vec maplist. This will only fail if
> > + * bio->bi_vcnt == bio->bi_max_vecs.
>
> It also fails on a cloned bio, although that might better be turned into
> a BUG_ON().

Agreed, I'll update both of them in the next round.

Thanks,
Dongsu
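
The two failure conditions discussed above (a full bvec array, and a cloned bio) can be modelled in userspace roughly like this. toy_vec, toy_vbio and toy_add_page() are hypothetical names; the sketch returns the added length on success and 0 on failure, and it deliberately leaves out merging a page into the previous bvec.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bio_vec and struct bio. */
struct toy_vec {
        void         *page;
        unsigned int  len;
        unsigned int  offset;
};

struct toy_vbio {
        struct toy_vec *vecs;      /* caller-provided array of max_vecs slots */
        unsigned short  vcnt;      /* slots in use */
        unsigned short  max_vecs;  /* capacity of vecs[] */
        int             cloned;    /* nonzero if vecs[] is shared with another bio */
        unsigned int    size;      /* total bytes added so far */
};

/* Returns len on success, 0 if the page could not be added. */
static unsigned int toy_add_page(struct toy_vbio *bio, void *page,
                                 unsigned int len, unsigned int offset)
{
        /* A clone does not own its vector array, so it must not grow it. */
        if (bio->cloned)
                return 0;

        /* With no queue limits to check, a full array is the only other failure. */
        if (bio->vcnt >= bio->max_vecs)
                return 0;

        bio->vecs[bio->vcnt].page = page;
        bio->vecs[bio->vcnt].len = len;
        bio->vecs[bio->vcnt].offset = offset;
        bio->vcnt++;
        bio->size += len;
        return len;
}
```

Turning the cloned case into a BUG_ON(), as suggested in the review, would replace the first early return with a hard assertion.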

2014-12-23 12:09:12

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 06/17] btrfs: make use of immutable biovecs

On 23.12.2014 02:35, Christoph Hellwig wrote:
> This seems like it could be applied without the rest of the series,
> right? Might be worth getting it into the btrfs tree ASAP?

Ah, you're right.
While patch #5 must be in this series, patch #6 does not necessarily
have to be included. This conversion should belong to the multipage
bvecs series. So I'll skip #6 in the next round.

Thanks,
Dongsu

2014-12-23 12:18:21

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 07/17] block: replace sg_iovec with iov_iter

On 23.12.2014 02:44, Christoph Hellwig wrote:
> Does this and the next three patches really depend on the earlier ones?
> Unless I'm missing something they are cleanups on their own.
>
> It might make sense to get all these cleanups out as a preparatory
> series first.

I think so too. Patches #07-10 can be split into a separate patchset.
I guess they are included just because Kent tried to follow up
suggestions in the previous discussion.
I don't mind either way. So I'll split them up.

Thanks,
Dongsu

2014-12-23 12:25:04

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 08/17] block: refactor __bio_copy_iov()

On 23.12.2014 02:45, Christoph Hellwig wrote:
> >  static int __bio_copy_iov(struct bio *bio, const struct iov_iter *iter,
> > +                          int to_iov)
> >  {
> > +        int i;
> >          struct bio_vec *bvec;
> >          struct iov_iter iov_iter = *iter;
>
> Why not pass the iov_iter by value?

Agreed.

> >          bio_for_each_segment_all(bvec, bio, i) {
> > +                ssize_t ret;
> > +
> > +                if (to_iov == WRITE)
> > +                        ret = copy_page_to_iter(bvec->bv_page,
> > +                                                bvec->bv_offset,
> > +                                                bvec->bv_len,
> > +                                                &iov_iter);
> > +                else
> > +                        ret = copy_page_from_iter(bvec->bv_page,
> > +                                                  bvec->bv_offset,
> > +                                                  bvec->bv_len,
> > +                                                  &iov_iter);
> > +
> > +                if (!iov_iter_count(&iov_iter))
> > +                        break;
> >
> > +                if (ret < bvec->bv_len)
> > +                        return -EFAULT;
> >          }
> >
> > +        return 0;
>
> Seems like this should be split into two functions for the read
> and write cases?

Agreed. I'll update it in the next round.

Thanks,
Dongsu

2014-12-23 12:31:14

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 12/17] md/raid10: make sync_request_write() call bio_copy_data()

On 23.12.2014 02:48, Christoph Hellwig wrote:
> On Mon, Dec 22, 2014 at 12:48:39PM +0100, Dongsu Park wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > Refactor sync_request_write() of md/raid10 to use bio_copy_data()
> > instead of open coding bio_vec iterations.
>
> Seems like another one for the prep series?

Right. I'll also move this to the prep series with iov_iter conversions etc.

Thanks,
Dongsu

2014-12-23 12:33:33

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 16/17] fs: convert buffer head etc. to use immutable biovecs API.

On 23.12.2014 02:51, Christoph Hellwig wrote:
> On Mon, Dec 22, 2014 at 12:48:43PM +0100, Dongsu Park wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > Increase bio->bi_remaining instead of calling bio_get(),
> > and call bio_end() instead of bio_put() upon buffer_head submission.
>
> Needs an explanation of why this is done.

Right.

> > Also make kernel/power/block_io.c submit bios properly by checking
> > whether bio_chain is available or not.
>
> Should be a separate patch.
>
> And of course both should go into the preparation series of cleanups.

Yep, agreed.
I'll split it up and also move them to the prep series.

Thanks,
Dongsu

2014-12-23 12:34:59

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 17/17] Documentation: update notes in biovecs about arbitrarily sized bios

On 23.12.2014 02:52, Christoph Hellwig wrote:
> > + * As of 3.18, block layer is written based on merging biovecs. Its goal is
>
> I don't think this is true..

Okay, I'll try to describe it more accurately in the next round.

Thanks,
Dongsu

2014-12-23 14:44:29

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 05/17] btrfs: remove bio splitting and merge_bvec_fn() calls

On Mon, Dec 22, 2014 at 6:48 AM, Dongsu Park
<[email protected]> wrote:
> From: Kent Overstreet <[email protected]>
>
> Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
> device limits as well as calling ->merge_bvec_fn() etc. That is not
> necessary any more, because generic_make_request() is now able to
> handle arbitrarily sized bios. So clean up unnecessary code paths.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> [dpark: add more description in commit message]
> Signed-off-by: Dongsu Park <[email protected]>
> Cc: Chris Mason <[email protected]>
> Cc: Josef Bacik <[email protected]>
> Cc: [email protected]
> ---

Looks good, I'll test it here too. Thanks!

Signed-off-by: Chris Mason <[email protected]>


2014-12-23 19:14:38

by Geoff Levand

[permalink] [raw]
Subject: Re: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

Hi,

On Mon, 2014-12-22 at 12:48 +0100, Dongsu Park wrote:
> From: Kent Overstreet <[email protected]>

> --- a/drivers/block/ps3vram.c
> +++ b/drivers/block/ps3vram.c
> @@ -603,6 +603,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
>          struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
>          int busy;
>
> +        blk_queue_split(q, &bio, q->bio_split);
> +
>          dev_dbg(&dev->core, "%s\n", __func__);
>
>          spin_lock_irq(&priv->lock);

The dev_dbg() here marks the entry to ps3vram_make_request(), so
we should put the blk_queue_split() call after the dev_dbg() call.

-Geoff

2014-12-24 10:37:33

by Dongsu Park

[permalink] [raw]
Subject: Re: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

On 23.12.2014 11:14, Geoff Levand wrote:
> On Mon, 2014-12-22 at 12:48 +0100, Dongsu Park wrote:
> > From: Kent Overstreet <[email protected]>
>
> > --- a/drivers/block/ps3vram.c
> > +++ b/drivers/block/ps3vram.c
> > @@ -603,6 +603,8 @@ static void ps3vram_make_request(struct request_queue *q, struct bio *bio)
> >          struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
> >          int busy;
> >
> > +        blk_queue_split(q, &bio, q->bio_split);
> > +
> >          dev_dbg(&dev->core, "%s\n", __func__);
> >
> >          spin_lock_irq(&priv->lock);
>
> The dev_dbg() here marks the entry to ps3vram_make_request(), so
> we should put the blk_queue_split() call after the dev_dbg() call.

Okay, I'll do it. Thanks for the review.

Dongsu

> -Geoff
>

2014-12-25 06:09:29

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

On Mon, Dec 22, 2014 at 7:48 PM, Dongsu Park
<[email protected]> wrote:
> From: Kent Overstreet <[email protected]>
>
> The way the block layer is currently written, it goes to great lengths
> to avoid having to split bios; upper layer code (such as bio_add_page())
> checks what the underlying device can handle and tries to always create
> bios that don't need to be split.
>
> But this approach becomes unwieldy and eventually breaks down with
> stacked devices and devices with dynamic limits, and it adds a lot of
> complexity. If the block layer could split bios as needed, we could
> eliminate a lot of complexity elsewhere - particularly in stacked
> drivers. Code that creates bios can then create whatever size bios are
> convenient, and more importantly stacked drivers don't have to deal with
> both their own bio size limitations and the limitations of the
> (potentially multiple) devices underneath them. In the future this will
> let us delete merge_bvec_fn and a bunch of other code.

It looks like a very good idea to split bios in the block layer.

>
> We do this by adding calls to blk_queue_split() to the various
> make_request functions that need it - a few can already handle arbitrary

I am wondering why the bio isn't split just before q->make_request_fn
is called in generic_make_request(). That way, drivers wouldn't need
to call blk_queue_split() at all. Is it for performance reasons, or
something else?

> size bios. Note that we add the call _after_ any call to
> blk_queue_bounce(); this means that blk_queue_split() and
> blk_recalc_rq_segments() don't need to be concerned with bouncing
> affecting segment merging.
>
> Some make_request_fn() callbacks were simple enough to audit and verify
> they don't need blk_queue_split() calls. The skipped ones are:
>
> * nfhd_make_request (arch/m68k/emu/nfblock.c)
> * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
> * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
> * brd_make_request (ramdisk - drivers/block/brd.c)
> * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
> * loop_make_request
> * null_queue_bio
> * bcache's make_request fns

I guess the above drivers have no max_sectors/max_segments
limits.

Thanks,
Ming Lei
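
The split-and-resubmit flow described in the quoted commit message can be modelled in userspace as below. This is a rough sketch under simplifying assumptions: the only limit is a hypothetical max_sectors value, toy_sbio and toy_queue are made-up types, and a loop stands in for resubmitting the remainder through generic_make_request().

```c
#include <assert.h>

/* Hypothetical toy types; the real struct bio and request_queue are far richer. */
struct toy_sbio {
        unsigned long sector;    /* start sector */
        unsigned int  nsectors;  /* length in sectors */
};

struct toy_queue {
        unsigned int max_sectors;  /* largest I/O the modelled device accepts */
};

/*
 * If *bio exceeds the limit, carve the front piece that fits into *front,
 * leave the remainder in *bio and return 1. Return 0 if no split is needed.
 */
static int toy_split(const struct toy_queue *q, struct toy_sbio *bio,
                     struct toy_sbio *front)
{
        assert(q->max_sectors > 0);
        if (bio->nsectors <= q->max_sectors)
                return 0;

        front->sector = bio->sector;
        front->nsectors = q->max_sectors;
        bio->sector += q->max_sectors;
        bio->nsectors -= q->max_sectors;
        return 1;
}

/*
 * Submit an arbitrarily sized bio. Returns how many pieces the device saw
 * and stores the size of the largest piece in *largest.
 */
static unsigned int toy_submit(const struct toy_queue *q, struct toy_sbio bio,
                               unsigned int *largest)
{
        struct toy_sbio front;
        unsigned int pieces = 0;

        *largest = 0;
        while (toy_split(q, &bio, &front)) {
                /* The front piece is what the driver would process now. */
                if (front.nsectors > *largest)
                        *largest = front.nsectors;
                pieces++;
        }
        /* Whatever is left fits within the limit and is processed last. */
        if (bio.nsectors > *largest)
                *largest = bio.nsectors;
        return pieces + 1;
}
```

The caller builds a bio of any size and never looks at the limit; only the submission path does, which is the property the series relies on for stacked drivers.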

2014-12-27 15:02:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 01/17] block: make generic_make_request handle arbitrarily sized bios

On Tue, Dec 23, 2014 at 12:41:40PM +0100, Dongsu Park wrote:
> You mean, checking rq->cmd_type == REQ_TYPE_BLOCK_PC, right?
>
> I'm wondering about how to check that in blk_queue_split().
> At the moment when blk_queue_split() is called, it's even before a request
> is mapped e.g. in blk_sq_make_request().
> Unlike scsi drivers where it's easy to get cmd->rq, blk_queue_split()
> doesn't seem to be able to get a request by blk_get_request().
>
> Or am I missing something?

You're probably missing what I didn't notice either: BLOCK_PC requests
are never sent through ->make_request. Consider my comment withdrawn
and sorry for the confusion.

2014-12-27 15:03:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 07/17] block: replace sg_iovec with iov_iter

On Tue, Dec 23, 2014 at 01:18:15PM +0100, Dongsu Park wrote:
> I think so too. Patches #07-10 can be split into a separate patchset.
> I guess they are included just because Kent tried to follow up
> suggestions in the previous discussion.
> I don't mind either way. So I'll split them up.

In case I wasn't quite clear: I'd prefer you to send the patches that
just clean up existing code and API use first, and then the bio splitting
series on top of that.