2012-08-22 17:04:32

by Kent Overstreet

Subject: [PATCH v6 00/13] Block cleanups

Buncha changes in this version.

I came up with a new, much better solution to the "bio allocation from
bio sets under generic_make_request()" problem. Previously, I was
working around this in all of my code that allocated bios; it'd mask out
__GFP_WAIT if current->bio_list != NULL, then punt to a workqueue if the
allocation failed - so as to avoid the deadlock from allocating multiple
bios from the same bio set.

This approach worked, but it required explicit handling in any
stacking driver that might split a bio an arbitrary number of times -
less than ideal.

My new patch inverts this - when we go to allocate a bio and we're
blocking another bio from being submitted, we punt the bio(s) that are
being blocked off to a workqueue to be submitted. This means we only
have to handle it in one place, in bio_alloc_bioset() - stacking block
drivers no longer have to do anything special (I got to delete some code
from bcache, and dm wasn't handling this at all before).
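The punting scheme can be sketched in miniature in userspace (all names here are made-up stand-ins, not the kernel's - just a model of "hand off everything we're blocking before we sleep"):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the rescuer idea: bios held back on current->bio_list
 * while generic_make_request() recurses are handed off wholesale
 * before a bio-set allocation is allowed to block. */
struct node { struct node *next; };

static struct node *pending;   /* current->bio_list analogue */
static struct node *rescued;   /* "rescuer workqueue" backlog analogue */

/* Called before an allocation might sleep: punt everything we would
 * otherwise be blocking, so those bios can be submitted independently
 * and eventually return entries to the mempool we're waiting on. */
static void punt_pending(void)
{
	while (pending) {
		struct node *n = pending;

		pending = n->next;
		n->next = rescued;
		rescued = n;
	}
}
```

The point of the model: after punt_pending(), nothing is left queued behind the allocating task, so the allocation can block without deadlocking against its own pending submissions.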

Other big change - I consolidated bio allocation. Now bio_alloc(),
bio_kmalloc(), bio_clone(), and bio_clone_kmalloc() are all one line
wrappers.

Incorporated a bunch of review feedback, most of which I hopefully
remembered to note in the patch descriptions this time.

Kent Overstreet (13):
block: Generalized bio pool freeing
dm: Use bioset's front_pad for dm_rq_clone_bio_info
block: Add bio_reset()
pktcdvd: Switch to bio_kmalloc()
block: Kill bi_destructor
block: Consolidate bio_alloc_bioset(), bio_kmalloc()
block: Avoid deadlocks with bio allocation by stacking drivers
block: Add an explicit bio flag for bios that own their bvec
block: Rename bio_split() -> bio_pair_split()
block: Introduce new bio_split()
block: Rework bio_pair_split()
block: Add bio_clone_bioset(), bio_clone_kmalloc()
block: Only clone bio vecs that are in use

Documentation/block/biodoc.txt | 5 -
block/blk-core.c | 10 +-
drivers/block/drbd/drbd_main.c | 13 +-
drivers/block/drbd/drbd_req.c | 18 +-
drivers/block/osdblk.c | 3 +-
drivers/block/pktcdvd.c | 73 ++-----
drivers/block/rbd.c | 8 +-
drivers/md/dm-crypt.c | 9 -
drivers/md/dm-io.c | 11 -
drivers/md/dm.c | 68 ++----
drivers/md/linear.c | 6 +-
drivers/md/md.c | 44 +---
drivers/md/raid0.c | 8 +-
drivers/md/raid10.c | 23 +-
drivers/target/target_core_iblock.c | 9 -
fs/bio-integrity.c | 45 ----
fs/bio.c | 419 +++++++++++++++++++++++-------------
fs/exofs/ore.c | 5 +-
include/linux/bio.h | 155 +++++++------
include/linux/blk_types.h | 16 +-
20 files changed, 434 insertions(+), 514 deletions(-)

--
1.7.12


2012-08-22 17:04:40

by Kent Overstreet

Subject: [PATCH v6 03/13] block: Add bio_reset()

Reusing bios is something that's been highly frowned upon in the past,
but driver code keeps doing it anyways. If it's going to happen anyways,
we should provide a generic method.

This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
was open coding it by doing a bio_init() and resetting bi_destructor.

v5: Add a define BIO_RESET_BITS, to be very explicit about what parts of
bio->bi_flags are saved.
v6: Further commenting verbosity, per Tejun
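The trick in the patch - zero everything up to a chosen member, and preserve only the flag bits at or above BIO_RESET_BITS - can be demonstrated with a toy struct (layout and names are hypothetical, but the masking and memset-to-offsetof pattern match the patch below):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct bio: fields before 'max_vecs' get reset,
 * everything from 'max_vecs' onward survives (like bio_reset()). */
struct toy_bio {
	unsigned long flags;
	unsigned int size;
	/* --- everything from here on is preserved across a reset --- */
	unsigned int max_vecs;
	int refcount;
};

#define TOY_RESET_BITS  12
#define TOY_RESET_BYTES offsetof(struct toy_bio, max_vecs)

static void toy_bio_reset(struct toy_bio *b)
{
	/* keep only the flag bits at or above TOY_RESET_BITS */
	unsigned long flags = b->flags & (~0UL << TOY_RESET_BITS);

	memset(b, 0, TOY_RESET_BYTES);
	b->flags = flags;
}
```

Because the preserved region starts at a single offsetof() boundary, adding a field to the reset region can never silently leak stale state into the preserved one.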

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
fs/bio.c | 9 +++++++++
include/linux/bio.h | 1 +
include/linux/blk_types.h | 13 +++++++++++++
3 files changed, 23 insertions(+)

diff --git a/fs/bio.c b/fs/bio.c
index 86e0534..81bdf4f 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -261,6 +261,15 @@ void bio_init(struct bio *bio)
}
EXPORT_SYMBOL(bio_init);

+void bio_reset(struct bio *bio)
+{
+ unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
+
+ memset(bio, 0, BIO_RESET_BYTES);
+ bio->bi_flags = flags|(1 << BIO_UPTODATE);
+}
+EXPORT_SYMBOL(bio_reset);
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 2643589..ba60319 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -226,6 +226,7 @@ extern void __bio_clone(struct bio *, struct bio *);
extern struct bio *bio_clone(struct bio *, gfp_t);

extern void bio_init(struct bio *);
+extern void bio_reset(struct bio *);

extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index af9dd9d..4fe4984 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -59,6 +59,10 @@ struct bio {
unsigned int bi_seg_front_size;
unsigned int bi_seg_back_size;

+ /*
+ * Everything starting with bi_max_vecs will be preserved by bio_reset()
+ */
+
unsigned int bi_max_vecs; /* max bvl_vecs we can hold */

atomic_t bi_cnt; /* pin count */
@@ -93,6 +97,8 @@ struct bio {
struct bio_vec bi_inline_vecs[0];
};

+#define BIO_RESET_BYTES offsetof(struct bio, bi_max_vecs)
+
/*
* bio flags
*/
@@ -108,6 +114,13 @@ struct bio {
#define BIO_FS_INTEGRITY 9 /* fs owns integrity data, not block layer */
#define BIO_QUIET 10 /* Make BIO Quiet */
#define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+
+/*
+ * Flags starting here get preserved by bio_reset() - this includes
+ * BIO_POOL_IDX()
+ */
+#define BIO_RESET_BITS 12
+
#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

/*
--
1.7.12

2012-08-22 17:04:44

by Kent Overstreet

Subject: [PATCH v6 02/13] dm: Use bioset's front_pad for dm_rq_clone_bio_info

Previously, dm_rq_clone_bio_info needed to be freed by the bio's
destructor to avoid a memory leak in the blk_rq_prep_clone() error path.
This gets rid of a memory allocation and means we can kill
dm_rq_bio_destructor.

v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun
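The front_pad mechanism this relies on is just container_of() arithmetic: the bio handed out by the bioset sits at the end of a larger struct, so the wrapper is recovered from the bio pointer with no extra allocation. A hypothetical userspace miniature (all names invented, calloc standing in for the bioset allocator):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mini_bio { int sector; };

/* Like dm_rq_clone_bio_info: the embedded bio must come last, and the
 * bioset's front_pad equals offsetof(struct mini_info, clone). */
struct mini_info {
	void *orig;
	struct mini_bio clone;
};

/* Stand-in for bio_alloc_bioset() on a bioset created with front_pad:
 * allocate the wrapper, hand back a pointer to the embedded bio. */
static struct mini_bio *mini_alloc(void)
{
	struct mini_info *info = calloc(1, sizeof(*info));

	return info ? &info->clone : NULL;
}
```

Any code holding the bio can get back to its per-clone state with container_of(), which is exactly how the patched dm_rq_bio_constructor() finds its dm_rq_clone_bio_info.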

Signed-off-by: Kent Overstreet <[email protected]>
CC: Alasdair Kergon <[email protected]>
---
drivers/md/dm.c | 39 +++++++++++----------------------------
1 file changed, 11 insertions(+), 28 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 0c3d6dd..5ed9779 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -86,12 +86,17 @@ struct dm_rq_target_io {
};

/*
- * For request-based dm.
- * One of these is allocated per bio.
+ * For request-based dm - the bio clones we allocate are embedded in these
+ * structs.
+ *
+ * We allocate these with bio_alloc_bioset, using the front_pad parameter when
+ * the bioset is created - this means the bio has to come at the end of the
+ * struct.
*/
struct dm_rq_clone_bio_info {
struct bio *orig;
struct dm_rq_target_io *tio;
+ struct bio clone;
};

union map_info *dm_get_mapinfo(struct bio *bio)
@@ -467,16 +472,6 @@ static void free_rq_tio(struct dm_rq_target_io *tio)
mempool_free(tio, tio->md->tio_pool);
}

-static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md)
-{
- return mempool_alloc(md->io_pool, GFP_ATOMIC);
-}
-
-static void free_bio_info(struct dm_rq_clone_bio_info *info)
-{
- mempool_free(info, info->tio->md->io_pool);
-}
-
static int md_in_flight(struct mapped_device *md)
{
return atomic_read(&md->pending[READ]) +
@@ -1460,30 +1455,17 @@ void dm_dispatch_request(struct request *rq)
}
EXPORT_SYMBOL_GPL(dm_dispatch_request);

-static void dm_rq_bio_destructor(struct bio *bio)
-{
- struct dm_rq_clone_bio_info *info = bio->bi_private;
- struct mapped_device *md = info->tio->md;
-
- free_bio_info(info);
- bio_free(bio, md->bs);
-}
-
static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
void *data)
{
struct dm_rq_target_io *tio = data;
- struct mapped_device *md = tio->md;
- struct dm_rq_clone_bio_info *info = alloc_bio_info(md);
-
- if (!info)
- return -ENOMEM;
+ struct dm_rq_clone_bio_info *info =
+ container_of(bio, struct dm_rq_clone_bio_info, clone);

info->orig = bio_orig;
info->tio = tio;
bio->bi_end_io = end_clone_bio;
bio->bi_private = info;
- bio->bi_destructor = dm_rq_bio_destructor;

return 0;
}
@@ -2718,7 +2700,8 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity)
if (!pools->tio_pool)
goto free_io_pool_and_out;

- pools->bs = bioset_create(pool_size, 0);
+ pools->bs = bioset_create(pool_size,
+ offsetof(struct dm_rq_clone_bio_info, clone));
if (!pools->bs)
goto free_tio_pool_and_out;

--
1.7.12

2012-08-22 17:04:53

by Kent Overstreet

Subject: [PATCH v6 10/13] block: Introduce new bio_split()

The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.
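The interesting part of the implementation below is the walk that locates the split point: either it lands exactly on a bvec boundary (the bvec array can be shared) or inside a bvec (one entry must be patched up on each side). A simplified, hypothetical model of that search:

```c
#include <assert.h>

struct seg { unsigned off, len; };

/*
 * Minimal model of the split-point search in the new bio_split():
 * returns how many whole segments go to the front split, and reports
 * via *mid how many bytes of the next segment (if any) must be carved
 * off. *mid == 0 means the split fell on a segment boundary.
 */
static unsigned split_point(const struct seg *v, unsigned cnt,
			    unsigned nbytes, unsigned *mid)
{
	unsigned i;

	for (i = 0; i < cnt; i++) {
		if (nbytes == 0) {	/* split on a segment boundary */
			*mid = 0;
			return i;
		}
		if (nbytes < v[i].len) {	/* split inside segment i */
			*mid = nbytes;
			return i;
		}
		nbytes -= v[i].len;
	}
	*mid = 0;
	return cnt;
}
```

In the boundary case the real code clones with zero bvecs and points at the original bi_io_vec; in the mid-segment case it allocates vcnt+1 bvecs and adjusts bv_len/bv_offset on both copies, matching the two branches in the patch.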

v5: Take out current->bio_list check and make it the caller's
responsibility, per Boaz
v6: Rename @ret to @split, change to not return original bio or copy
bi_private/bi_end_io, per Tejun

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: Martin K. Petersen <[email protected]>
---
fs/bio-integrity.c | 1 -
fs/bio.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/bio.h | 13 ++++++++
3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index e85c04b..35ee3d4 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -672,7 +672,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,

BUG_ON(bip == NULL);
BUG_ON(bi == NULL);
- BUG_ON(!bio_flagged(bio, BIO_CLONED));

nr_sectors = bio_integrity_hw_sectors(bi, sectors);
bip->bip_sector = bip->bip_sector + offset;
diff --git a/fs/bio.c b/fs/bio.c
index 2f2cf19..c079006 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1541,6 +1541,101 @@ struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
EXPORT_SYMBOL(bio_pair_split);

/**
+ * bio_split - split a bio
+ * @bio: bio to split
+ * @sectors: number of sectors to split from the front of @bio
+ * @gfp: gfp mask
+ * @bs: bio set to allocate from
+ *
+ * Allocates and returns a new bio which represents @sectors from the start of
+ * @bio, and updates @bio to represent the remaining sectors.
+ *
+ * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
+ * unchanged.
+ *
+ * The newly allocated bio will point to @bio's bi_io_vec, if the split was on a
+ * bvec boundary; it is the caller's responsibility to ensure that @bio is not
+ * freed before the split.
+ */
+struct bio *bio_split(struct bio *bio, int sectors,
+ gfp_t gfp, struct bio_set *bs)
+{
+ unsigned idx, vcnt = 0, nbytes = sectors << 9;
+ struct bio_vec *bv;
+ struct bio *split = NULL;
+
+ BUG_ON(sectors <= 0);
+ BUG_ON(sectors >= bio_sectors(bio));
+
+ trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
+ bio->bi_sector + sectors);
+
+ if (bio->bi_rw & REQ_DISCARD) {
+ split = bio_alloc_bioset(gfp, 1, bs);
+
+ split->bi_io_vec = bio_iovec(bio);
+ idx = 0;
+ goto out;
+ }
+
+ bio_for_each_segment(bv, bio, idx) {
+ vcnt = idx - bio->bi_idx;
+
+ /*
+ * The split might occur on a bvec boundary (nbytes == 0) - in
+ * that case we can reuse the bvec from the old bio.
+ *
+ * Or, if the split isn't aligned, we'll have to allocate a new
+ * bvec and patch up the two copies of the bvec we split in.
+ */
+
+ if (!nbytes) {
+ split = bio_alloc_bioset(gfp, 0, bs);
+ if (!split)
+ return NULL;
+
+ split->bi_io_vec = bio_iovec(bio);
+ split->bi_flags |= 1 << BIO_CLONED;
+ break;
+ } else if (nbytes < bv->bv_len) {
+ split = bio_alloc_bioset(gfp, ++vcnt, bs);
+ if (!split)
+ return NULL;
+
+ memcpy(split->bi_io_vec, bio_iovec(bio),
+ sizeof(struct bio_vec) * vcnt);
+
+ split->bi_io_vec[vcnt - 1].bv_len = nbytes;
+ bv->bv_offset += nbytes;
+ bv->bv_len -= nbytes;
+ break;
+ }
+
+ nbytes -= bv->bv_len;
+ }
+out:
+ split->bi_bdev = bio->bi_bdev;
+ split->bi_sector = bio->bi_sector;
+ split->bi_size = sectors << 9;
+ split->bi_rw = bio->bi_rw;
+ split->bi_vcnt = vcnt;
+ split->bi_max_vecs = vcnt;
+
+ bio->bi_sector += sectors;
+ bio->bi_size -= sectors << 9;
+ bio->bi_idx = idx;
+
+ if (bio_integrity(bio)) {
+ bio_integrity_clone(split, bio, gfp, bs);
+ bio_integrity_trim(split, 0, bio_sectors(split));
+ bio_integrity_trim(bio, bio_sectors(split), bio_sectors(bio));
+ }
+
+ return split;
+}
+EXPORT_SYMBOL_GPL(bio_split);
+
+/**
* bio_sector_offset - Find hardware sector offset in bio
* @bio: bio to inspect
* @index: bio_vec index
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 13f4188..1c3bb47 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -201,6 +201,19 @@ struct bio_pair {
atomic_t cnt;
int error;
};
+
+extern struct bio *bio_split(struct bio *bio, int sectors,
+ gfp_t gfp, struct bio_set *bs);
+
+static inline struct bio *bio_next_split(struct bio *bio, int sectors,
+ gfp_t gfp, struct bio_set *bs)
+{
+ if (sectors >= bio_sectors(bio))
+ return bio;
+
+ return bio_split(bio, sectors, gfp, bs);
+}
+
extern struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors);
extern void bio_pair_release(struct bio_pair *dbio);

--
1.7.12

2012-08-22 17:05:07

by Kent Overstreet

Subject: [PATCH v6 13/13] block: Only clone bio vecs that are in use

bcache creates large bios internally, and then splits them according to
the device requirements before it sends them down. If a lower level
device tries to clone the bio, and the original bio had more than
BIO_MAX_PAGES, the clone will fail unnecessarily.

We can fix this by only cloning the bio vecs that are actually in use.
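The "in use" portion is the range from bi_idx (segments already completed or handed off) to bi_vcnt, which is what bio_segments() counts. A toy userspace model of the changed __bio_clone() copy (struct layout and names are invented):

```c
#include <assert.h>
#include <string.h>

struct seg { unsigned len; };

/* Toy bio: 'idx' segments at the front are no longer relevant, the
 * active range is vec[idx..vcnt). */
struct toy_bio {
	struct seg vec[8];
	unsigned idx, vcnt;
};

/* Clone only the active segments, as the patched __bio_clone() does:
 * the clone starts at idx 0 with bio_segments(src) entries, yet still
 * describes exactly the same remaining I/O. */
static void toy_clone(struct toy_bio *dst, const struct toy_bio *src)
{
	unsigned n = src->vcnt - src->idx;	/* bio_segments() analogue */

	memcpy(dst->vec, src->vec + src->idx, n * sizeof(struct seg));
	dst->idx = 0;
	dst->vcnt = n;
}
```

Since the clone's segment count is bounded by what is actually outstanding rather than by the original bi_max_vecs, a split-down bio from bcache can always be cloned within BIO_MAX_PAGES.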

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Sage Weil <[email protected]>
---
drivers/block/rbd.c | 2 +-
drivers/md/dm.c | 5 ++---
fs/bio.c | 15 ++++++++++-----
3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 63e5852..5c3457f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -734,7 +734,7 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
}

while (old_chain && (total < len)) {
- tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
+ tmp = bio_kmalloc(gfpmask, bio_segments(old_chain));
if (!tmp)
goto err_out;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a3c38b9..3aeb108 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1080,11 +1080,10 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
{
struct bio *clone;

- clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
+ clone = bio_alloc_bioset(GFP_NOIO, bv_count, bs);
__bio_clone(clone, bio);
clone->bi_sector = sector;
- clone->bi_idx = idx;
- clone->bi_vcnt = idx + bv_count;
+ clone->bi_vcnt = bv_count;
clone->bi_size = to_bytes(len);
clone->bi_flags &= ~(1 << BIO_SEG_VALID);

diff --git a/fs/bio.c b/fs/bio.c
index a58c3c6..3f43e50 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -450,11 +450,16 @@ EXPORT_SYMBOL(bio_phys_segments);
* Clone a &bio. Caller will own the returned bio, but not
* the actual data it points to. Reference count of returned
* bio will be one.
+ *
+ * We don't clone the entire bvec, just the part from bi_idx to bi_vcnt
+ * (i.e. what the bio currently points to, so the new bio is still
+ * equivalent to the old bio).
*/
void __bio_clone(struct bio *bio, struct bio *bio_src)
{
- memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
- bio_src->bi_max_vecs * sizeof(struct bio_vec));
+ memcpy(bio->bi_io_vec,
+ bio_iovec(bio_src),
+ bio_segments(bio_src) * sizeof(struct bio_vec));

/*
* most users will be overriding ->bi_bdev with a new target,
@@ -463,10 +468,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
bio->bi_sector = bio_src->bi_sector;
bio->bi_bdev = bio_src->bi_bdev;
bio->bi_flags |= 1 << BIO_CLONED;
+ bio->bi_flags &= ~(1 << BIO_SEG_VALID);
bio->bi_rw = bio_src->bi_rw;
- bio->bi_vcnt = bio_src->bi_vcnt;
+ bio->bi_vcnt = bio_segments(bio_src);
bio->bi_size = bio_src->bi_size;
- bio->bi_idx = bio_src->bi_idx;
}
EXPORT_SYMBOL(__bio_clone);

@@ -483,7 +488,7 @@ struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
{
struct bio *b;

- b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
+ b = bio_alloc_bioset(gfp_mask, bio_segments(bio), bs);
if (!b)
return NULL;

--
1.7.12

2012-08-22 17:05:18

by Kent Overstreet

Subject: [PATCH v6 12/13] block: Add bio_clone_bioset(), bio_clone_kmalloc()

Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().

This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch added.

This will also help in a later patch changing how bio cloning works.

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Boaz Harrosh <[email protected]>
CC: Jeff Garzik <[email protected]>
---
block/blk-core.c | 8 +-------
drivers/block/osdblk.c | 3 +--
drivers/md/dm.c | 4 ++--
drivers/md/md.c | 20 +-------------------
fs/bio.c | 13 ++++++++-----
fs/exofs/ore.c | 5 ++---
include/linux/bio.h | 17 ++++++++++++++---
7 files changed, 29 insertions(+), 41 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f3780f5..66f0bb4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2781,16 +2781,10 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
blk_rq_init(NULL, rq);

__rq_for_each_bio(bio_src, rq_src) {
- bio = bio_alloc_bioset(gfp_mask, bio_src->bi_max_vecs, bs);
+ bio = bio_clone_bioset(bio_src, gfp_mask, bs);
if (!bio)
goto free_and_out;

- __bio_clone(bio, bio_src);
-
- if (bio_integrity(bio_src) &&
- bio_integrity_clone(bio, bio_src, gfp_mask, bs))
- goto free_and_out;
-
if (bio_ctr && bio_ctr(bio, bio_src, data))
goto free_and_out;

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 87311eb..1bbc681 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -266,11 +266,10 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
struct bio *tmp, *new_chain = NULL, *tail = NULL;

while (old_chain) {
- tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
+ tmp = bio_clone_kmalloc(old_chain, gfpmask);
if (!tmp)
goto err_out;

- __bio_clone(tmp, old_chain);
tmp->bi_bdev = NULL;
gfpmask &= ~__GFP_WAIT;
tmp->bi_next = NULL;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ed9779..a3c38b9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1124,8 +1124,8 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
* ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
* and discard, so no need for concern about wasted bvec allocations.
*/
- clone = bio_alloc_bioset(GFP_NOIO, ci->bio->bi_max_vecs, ci->md->bs);
- __bio_clone(clone, ci->bio);
+ clone = bio_clone_bioset(ci->bio, GFP_NOIO, ci->md->bs);
+
if (len) {
clone->bi_sector = ci->sector;
clone->bi_size = to_bytes(len);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b8eebe3..7a2b079 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -173,28 +173,10 @@ EXPORT_SYMBOL_GPL(bio_alloc_mddev);
struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
{
- struct bio *b;
-
if (!mddev || !mddev->bio_set)
return bio_clone(bio, gfp_mask);

- b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
- if (!b)
- return NULL;
-
- __bio_clone(b, bio);
- if (bio_integrity(bio)) {
- int ret;
-
- ret = bio_integrity_clone(b, bio, gfp_mask, mddev->bio_set);
-
- if (ret < 0) {
- bio_put(b);
- return NULL;
- }
- }
-
- return b;
+ return bio_clone_bioset(bio, gfp_mask, mddev->bio_set);
}
EXPORT_SYMBOL_GPL(bio_clone_mddev);

diff --git a/fs/bio.c b/fs/bio.c
index f50adbd..a58c3c6 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -471,16 +471,19 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
EXPORT_SYMBOL(__bio_clone);

/**
- * bio_clone - clone a bio
+ * bio_clone_bioset - clone a bio
* @bio: bio to clone
* @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
*
* Like __bio_clone, only also allocates the returned bio
*/
-struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
+ struct bio_set *bs)
{
- struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);
+ struct bio *b;

+ b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
if (!b)
return NULL;

@@ -489,7 +492,7 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
if (bio_integrity(bio)) {
int ret;

- ret = bio_integrity_clone(b, bio, gfp_mask, fs_bio_set);
+ ret = bio_integrity_clone(b, bio, gfp_mask, bs);

if (ret < 0) {
bio_put(b);
@@ -499,7 +502,7 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)

return b;
}
-EXPORT_SYMBOL(bio_clone);
+EXPORT_SYMBOL(bio_clone_bioset);

/**
* bio_get_nr_vecs - return approx number of vecs
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 1585db1..f936cb5 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -814,8 +814,8 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
struct bio *bio;

if (per_dev != master_dev) {
- bio = bio_kmalloc(GFP_KERNEL,
- master_dev->bio->bi_max_vecs);
+ bio = bio_clone_kmalloc(master_dev->bio,
+ GFP_KERNEL);
if (unlikely(!bio)) {
ORE_DBGMSG(
"Failed to allocate BIO size=%u\n",
@@ -824,7 +824,6 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
goto out;
}

- __bio_clone(bio, master_dev->bio);
bio->bi_bdev = NULL;
bio->bi_next = NULL;
per_dev->offset = master_dev->offset;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3ad3540..1725c2a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -223,6 +223,9 @@ extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
extern void bio_free(struct bio *);

+extern void __bio_clone(struct bio *, struct bio *);
+extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
+
extern struct bio_set *fs_bio_set;

static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
@@ -230,6 +233,11 @@ static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
}

+static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+{
+ return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
+}
+
#define BIO_KMALLOC_POOL NULL

static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
@@ -237,13 +245,16 @@ static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
return bio_alloc_bioset(gfp_mask, nr_iovecs, BIO_KMALLOC_POOL);
}

+static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
+{
+ return bio_clone_bioset(bio, gfp_mask, BIO_KMALLOC_POOL);
+
+}
+
extern void bio_endio(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);

-extern void __bio_clone(struct bio *, struct bio *);
-extern struct bio *bio_clone(struct bio *, gfp_t);
-
extern void bio_init(struct bio *);
extern void bio_reset(struct bio *);

--
1.7.12

2012-08-22 17:05:45

by Kent Overstreet

Subject: [PATCH v6 11/13] block: Rework bio_pair_split()

This changes bio_pair_split() to use the new bio_split() underneath,
which gets rid of the single page bio limitation. The various callers
are fixed up for the slightly different struct bio_pair, and to remove
the unnecessary checks.
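The completion chaining that bio_pair_split() provides on top of bio_split() is plain reference counting: the pair starts with one reference per half plus one for the caller, and the original bio completes only when the last reference is dropped. A minimal model (types and names are made up; error propagation via BIO_UPTODATE is omitted):

```c
#include <assert.h>

/* Toy bio_pair: cnt mirrors atomic_t bp->cnt, orig_completed records
 * whether the bio_endio(bp->orig, ...) analogue has run. */
struct pair {
	int cnt;
	int orig_completed;
};

/* Called once from each half's endio and once by the caller, like
 * bio_pair_release(); the original completes on the final drop. */
static void pair_release(struct pair *p)
{
	if (--p->cnt == 0)
		p->orig_completed = 1;	/* complete the original bio */
}
```

This is why the caller must always drop its own reference with bio_pair_release() even after submitting both halves - otherwise the original bio never completes.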

v5: Move extern declaration to proper patch, per Boaz

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Lars Ellenberg <[email protected]>
CC: Peter Osterlund <[email protected]>
CC: Sage Weil <[email protected]>
CC: Martin K. Petersen <[email protected]>
---
drivers/block/drbd/drbd_req.c | 16 +------
drivers/block/pktcdvd.c | 4 +-
drivers/block/rbd.c | 7 ++--
drivers/md/linear.c | 4 +-
drivers/md/raid0.c | 6 +--
drivers/md/raid10.c | 21 ++--------
fs/bio-integrity.c | 44 --------------------
fs/bio.c | 97 +++++++++++++++++++------------------------
include/linux/bio.h | 22 ++++------
9 files changed, 66 insertions(+), 155 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index fbb0471..7d3e662 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1122,18 +1122,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
do {
inc_ap_bio(mdev, 1);
} while (drbd_make_request_common(mdev, bio, start_time));
- return;
- }
-
- /* can this bio be split generically?
- * Maybe add our own split-arbitrary-bios function. */
- if (bio->bi_vcnt != 1 || bio->bi_idx != 0 || bio->bi_size > DRBD_MAX_BIO_SIZE) {
- /* rather error out here than BUG in bio_split */
- dev_err(DEV, "bio would need to, but cannot, be split: "
- "(vcnt=%u,idx=%u,size=%u,sector=%llu)\n",
- bio->bi_vcnt, bio->bi_idx, bio->bi_size,
- (unsigned long long)bio->bi_sector);
- bio_endio(bio, -EINVAL);
} else {
/* This bio crosses some boundary, so we have to split it. */
struct bio_pair *bp;
@@ -1160,10 +1148,10 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)

D_ASSERT(e_enr == s_enr + 1);

- while (drbd_make_request_common(mdev, &bp->bio1, start_time))
+ while (drbd_make_request_common(mdev, &bp->split, start_time))
inc_ap_bio(mdev, 1);

- while (drbd_make_request_common(mdev, &bp->bio2, start_time))
+ while (drbd_make_request_common(mdev, bio, start_time))
inc_ap_bio(mdev, 1);

dec_ap_bio(mdev);
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 18393a1..6709c1d 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2471,8 +2471,8 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
first_sectors = last_zone - bio->bi_sector;
bp = bio_pair_split(bio, first_sectors);
BUG_ON(!bp);
- pkt_make_request(q, &bp->bio1);
- pkt_make_request(q, &bp->bio2);
+ pkt_make_request(q, &bp->split);
+ pkt_make_request(q, bio);
bio_pair_release(bp);
return;
}
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1f5b483..63e5852 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -751,14 +751,13 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,

/* split the bio. We'll release it either in the next
call, or it will have to be released outside */
- bp = bio_pair_split(old_chain,
- (len - total) / SECTOR_SIZE);
+ bp = bio_pair_split(old_chain, (len - total) / SECTOR_SIZE);
if (!bp)
goto err_out;

- __bio_clone(tmp, &bp->bio1);
+ __bio_clone(tmp, &bp->split);

- *next = &bp->bio2;
+ *next = bp->orig;
} else {
__bio_clone(tmp, old_chain);
*next = old_chain->bi_next;
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index e860cb9..7c6cafd 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -316,8 +316,8 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)

bp = bio_pair_split(bio, end_sector - bio->bi_sector);

- linear_make_request(mddev, &bp->bio1);
- linear_make_request(mddev, &bp->bio2);
+ linear_make_request(mddev, &bp->split);
+ linear_make_request(mddev, bio);
bio_pair_release(bp);
return;
}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index c89c8aa..3469adf 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -520,9 +520,9 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
(chunk_sects-1)));
else
bp = bio_pair_split(bio, chunk_sects -
- sector_div(sector, chunk_sects));
- raid0_make_request(mddev, &bp->bio1);
- raid0_make_request(mddev, &bp->bio2);
+ sector_div(sector, chunk_sects));
+ raid0_make_request(mddev, &bp->split);
+ raid0_make_request(mddev, bio);
bio_pair_release(bp);
return;
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0f31ec4..9fa07c7 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1080,15 +1080,9 @@ static void make_request(struct mddev *mddev, struct bio * bio)
&& (conf->geo.near_copies < conf->geo.raid_disks
|| conf->prev.near_copies < conf->prev.raid_disks))) {
struct bio_pair *bp;
- /* Sanity check -- queue functions should prevent this happening */
- if (bio->bi_vcnt != 1 ||
- bio->bi_idx != 0)
- goto bad_map;
- /* This is a one page bio that upper layers
- * refuse to split for us, so we need to split it.
- */
+
bp = bio_pair_split(bio,
- chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
+ chunk_sects - (bio->bi_sector & (chunk_sects - 1)));

/* Each of these 'make_request' calls will call 'wait_barrier'.
* If the first succeeds but the second blocks due to the resync
@@ -1102,8 +1096,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
conf->nr_waiting++;
spin_unlock_irq(&conf->resync_lock);

- make_request(mddev, &bp->bio1);
- make_request(mddev, &bp->bio2);
+ make_request(mddev, &bp->split);
+ make_request(mddev, bio);

spin_lock_irq(&conf->resync_lock);
conf->nr_waiting--;
@@ -1112,13 +1106,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)

bio_pair_release(bp);
return;
- bad_map:
- printk("md/raid10:%s: make_request bug: can't convert block across chunks"
- " or bigger than %dk %llu %d\n", mdname(mddev), chunk_sects/2,
- (unsigned long long)bio->bi_sector, bio->bi_size >> 10);
-
- bio_io_error(bio);
- return;
}

md_write_start(mddev, bio);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 35ee3d4..08a9e12 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -681,50 +681,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
EXPORT_SYMBOL(bio_integrity_trim);

/**
- * bio_integrity_split - Split integrity metadata
- * @bio: Protected bio
- * @bp: Resulting bio_pair
- * @sectors: Offset
- *
- * Description: Splits an integrity page into a bio_pair.
- */
-void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
-{
- struct blk_integrity *bi;
- struct bio_integrity_payload *bip = bio->bi_integrity;
- unsigned int nr_sectors;
-
- if (bio_integrity(bio) == 0)
- return;
-
- bi = bdev_get_integrity(bio->bi_bdev);
- BUG_ON(bi == NULL);
- BUG_ON(bip->bip_vcnt != 1);
-
- nr_sectors = bio_integrity_hw_sectors(bi, sectors);
-
- bp->bio1.bi_integrity = &bp->bip1;
- bp->bio2.bi_integrity = &bp->bip2;
-
- bp->iv1 = bip->bip_vec[0];
- bp->iv2 = bip->bip_vec[0];
-
- bp->bip1.bip_vec[0] = bp->iv1;
- bp->bip2.bip_vec[0] = bp->iv2;
-
- bp->iv1.bv_len = sectors * bi->tuple_size;
- bp->iv2.bv_offset += sectors * bi->tuple_size;
- bp->iv2.bv_len -= sectors * bi->tuple_size;
-
- bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
- bp->bip2.bip_sector = bio->bi_integrity->bip_sector + nr_sectors;
-
- bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
- bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
-}
-EXPORT_SYMBOL(bio_integrity_split);
-
-/**
* bio_integrity_clone - Callback for cloning bios with integrity metadata
* @bio: New bio
* @bio_src: Original bio
diff --git a/fs/bio.c b/fs/bio.c
index c079006..f50adbd 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -37,7 +37,7 @@
*/
#define BIO_INLINE_VECS 4

-static mempool_t *bio_split_pool __read_mostly;
+static struct bio_set *bio_split_pool __read_mostly;

/*
* if you change this list, also change bvec_alloc or things will
@@ -1461,80 +1461,70 @@ void bio_endio(struct bio *bio, int error)
}
EXPORT_SYMBOL(bio_endio);

+/**
+ * bio_pair_release - drop refcount on a bio_pair
+ *
+ * This is called by bio_pair_split's endio function, and also must be called by
+ * the caller of bio_pair_split().
+ */
void bio_pair_release(struct bio_pair *bp)
{
if (atomic_dec_and_test(&bp->cnt)) {
- struct bio *master = bp->bio1.bi_private;
+ bp->orig->bi_end_io = bp->bi_end_io;
+ bp->orig->bi_private = bp->bi_private;

- bio_endio(master, bp->error);
- mempool_free(bp, bp->bio2.bi_private);
+ bio_endio(bp->orig, 0);
+ bio_put(&bp->split);
}
}
EXPORT_SYMBOL(bio_pair_release);

-static void bio_pair_end_1(struct bio *bi, int err)
+static void bio_pair_end(struct bio *bio, int error)
{
- struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
+ struct bio_pair *bp = bio->bi_private;

- if (err)
- bp->error = err;
-
- bio_pair_release(bp);
-}
-
-static void bio_pair_end_2(struct bio *bi, int err)
-{
- struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);
-
- if (err)
- bp->error = err;
+ if (error)
+ clear_bit(BIO_UPTODATE, &bp->orig->bi_flags);

bio_pair_release(bp);
}

-/*
- * split a bio - only worry about a bio with a single page in its iovec
+/**
+ * bio_pair_split - split a bio, and chain the completions
+ * @bio: bio to split
+ * @sectors: number of sectors to split from the front of @bio
+ *
+ * This wraps bio_split(), and puts the split and the original bio in a struct
+ * bio_pair. It also hooks into the completions so the original bio will be
+ * completed once both splits have been completed.
+ *
+ * The caller will own a refcount on the returned bio_pair, which must be
+ * dropped with bio_pair_release().
*/
-struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
+struct bio_pair *bio_pair_split(struct bio *bio, int sectors)
{
- struct bio_pair *bp = mempool_alloc(bio_split_pool, GFP_NOIO);
+ struct bio_pair *bp;
+ struct bio *split;

- if (!bp)
- return bp;
-
- trace_block_split(bdev_get_queue(bi->bi_bdev), bi,
- bi->bi_sector + first_sectors);
-
- BUG_ON(bi->bi_vcnt != 1);
- BUG_ON(bi->bi_idx != 0);
- atomic_set(&bp->cnt, 3);
- bp->error = 0;
- bp->bio1 = *bi;
- bp->bio2 = *bi;
- bp->bio2.bi_sector += first_sectors;
- bp->bio2.bi_size -= first_sectors << 9;
- bp->bio1.bi_size = first_sectors << 9;
+ split = bio_split(bio, sectors, GFP_NOIO, bio_split_pool);
+ if (!split)
+ return NULL;

- bp->bv1 = bi->bi_io_vec[0];
- bp->bv2 = bi->bi_io_vec[0];
- bp->bv2.bv_offset += first_sectors << 9;
- bp->bv2.bv_len -= first_sectors << 9;
- bp->bv1.bv_len = first_sectors << 9;
+ BUG_ON(split == bio);

- bp->bio1.bi_io_vec = &bp->bv1;
- bp->bio2.bi_io_vec = &bp->bv2;
+ bp = container_of(split, struct bio_pair, split);

- bp->bio1.bi_max_vecs = 1;
- bp->bio2.bi_max_vecs = 1;
+ atomic_set(&bp->cnt, 3);
+ bp->orig = bio;

- bp->bio1.bi_end_io = bio_pair_end_1;
- bp->bio2.bi_end_io = bio_pair_end_2;
+ bp->bi_end_io = bio->bi_end_io;
+ bp->bi_private = bio->bi_private;

- bp->bio1.bi_private = bi;
- bp->bio2.bi_private = bio_split_pool;
+ bio->bi_private = bp;
+ bio->bi_end_io = bio_pair_end;

- if (bio_integrity(bi))
- bio_integrity_split(bi, bp, first_sectors);
+ split->bi_private = bp;
+ split->bi_end_io = bio_pair_end;

return bp;
}
@@ -1856,8 +1846,7 @@ static int __init init_bio(void)
if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
panic("bio: can't create integrity pool\n");

- bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
- sizeof(struct bio_pair));
+ bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
if (!bio_split_pool)
panic("bio: can't create split pool\n");

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1c3bb47..3ad3540 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -192,14 +192,13 @@ struct bio_integrity_payload {
* in bio2.bi_private
*/
struct bio_pair {
- struct bio bio1, bio2;
- struct bio_vec bv1, bv2;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- struct bio_integrity_payload bip1, bip2;
- struct bio_vec iv1, iv2;
-#endif
- atomic_t cnt;
- int error;
+ atomic_t cnt;
+
+ bio_end_io_t *bi_end_io;
+ void *bi_private;
+
+ struct bio *orig;
+ struct bio split;
};

extern struct bio *bio_split(struct bio *bio, int sectors,
@@ -544,7 +543,6 @@ extern int bio_integrity_prep(struct bio *);
extern void bio_integrity_endio(struct bio *, int);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
-extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t, struct bio_set *);
extern int bioset_integrity_create(struct bio_set *, int);
extern void bioset_integrity_free(struct bio_set *);
@@ -588,12 +586,6 @@ static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
return 0;
}

-static inline void bio_integrity_split(struct bio *bio, struct bio_pair *bp,
- int sectors)
-{
- return;
-}
-
static inline void bio_integrity_advance(struct bio *bio,
unsigned int bytes_done)
{
--
1.7.12

2012-08-22 17:06:09

by Kent Overstreet

Subject: [PATCH v6 09/13] block: Rename bio_split() -> bio_pair_split()

This is prep work for introducing a more general bio_split().

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Lars Ellenberg <[email protected]>
CC: Peter Osterlund <[email protected]>
CC: Sage Weil <[email protected]>
---
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/pktcdvd.c | 2 +-
drivers/block/rbd.c | 3 ++-
drivers/md/linear.c | 2 +-
drivers/md/raid0.c | 6 +++---
drivers/md/raid10.c | 4 ++--
fs/bio.c | 4 ++--
include/linux/bio.h | 2 +-
8 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 910335c..fbb0471 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1149,7 +1149,7 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
const int sps = 1 << HT_SHIFT; /* sectors per slot */
const int mask = sps - 1;
const sector_t first_sectors = sps - (sect & mask);
- bp = bio_split(bio, first_sectors);
+ bp = bio_pair_split(bio, first_sectors);

/* we need to get a "reference count" (ap_bio_cnt)
* to avoid races with the disconnect/reconnect/suspend code.
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index ae55f08..18393a1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2469,7 +2469,7 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
if (last_zone != zone) {
BUG_ON(last_zone != zone + pd->settings.size);
first_sectors = last_zone - bio->bi_sector;
- bp = bio_split(bio, first_sectors);
+ bp = bio_pair_split(bio, first_sectors);
BUG_ON(!bp);
pkt_make_request(q, &bp->bio1);
pkt_make_request(q, &bp->bio2);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9917943..1f5b483 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -751,7 +751,8 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,

/* split the bio. We'll release it either in the next
call, or it will have to be released outside */
- bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
+ bp = bio_pair_split(old_chain,
+ (len - total) / SECTOR_SIZE);
if (!bp)
goto err_out;

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index fa211d8..e860cb9 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -314,7 +314,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)

rcu_read_unlock();

- bp = bio_split(bio, end_sector - bio->bi_sector);
+ bp = bio_pair_split(bio, end_sector - bio->bi_sector);

linear_make_request(mddev, &bp->bio1);
linear_make_request(mddev, &bp->bio2);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index de63a1f..c89c8aa 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -516,11 +516,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
* refuse to split for us, so we need to split it.
*/
if (likely(is_power_of_2(chunk_sects)))
- bp = bio_split(bio, chunk_sects - (sector &
+ bp = bio_pair_split(bio, chunk_sects - (sector &
(chunk_sects-1)));
else
- bp = bio_split(bio, chunk_sects -
- sector_div(sector, chunk_sects));
+ bp = bio_pair_split(bio, chunk_sects -
+ sector_div(sector, chunk_sects));
raid0_make_request(mddev, &bp->bio1);
raid0_make_request(mddev, &bp->bio2);
bio_pair_release(bp);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1c2eb38..0f31ec4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1087,8 +1087,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
/* This is a one page bio that upper layers
* refuse to split for us, so we need to split it.
*/
- bp = bio_split(bio,
- chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
+ bp = bio_pair_split(bio,
+ chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );

/* Each of these 'make_request' calls will call 'wait_barrier'.
* If the first succeeds but the second blocks due to the resync
diff --git a/fs/bio.c b/fs/bio.c
index b9c022c..2f2cf19 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1495,7 +1495,7 @@ static void bio_pair_end_2(struct bio *bi, int err)
/*
* split a bio - only worry about a bio with a single page in its iovec
*/
-struct bio_pair *bio_split(struct bio *bi, int first_sectors)
+struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
{
struct bio_pair *bp = mempool_alloc(bio_split_pool, GFP_NOIO);

@@ -1538,7 +1538,7 @@ struct bio_pair *bio_split(struct bio *bi, int first_sectors)

return bp;
}
-EXPORT_SYMBOL(bio_split);
+EXPORT_SYMBOL(bio_pair_split);

/**
* bio_sector_offset - Find hardware sector offset in bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 427d05a..13f4188 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -201,7 +201,7 @@ struct bio_pair {
atomic_t cnt;
int error;
};
-extern struct bio_pair *bio_split(struct bio *bi, int first_sectors);
+extern struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors);
extern void bio_pair_release(struct bio_pair *dbio);

extern struct bio_set *bioset_create(unsigned int, unsigned int);
--
1.7.12

2012-08-22 17:06:13

by Kent Overstreet

Subject: [PATCH v6 08/13] block: Add an explicit bio flag for bios that own their bvec

This is for the new bio splitting code. When we split a bio, if the
split occurred on a bvec boundary we reuse the bvec for the new bio. But
that means bio_free() can't free it, hence the explicit flag.

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
fs/bio.c | 4 +++-
include/linux/bio.h | 5 -----
include/linux/blk_types.h | 1 +
3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index a82a3c7..b9c022c 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -246,7 +246,7 @@ void bio_free(struct bio *bio)
return;
}

- if (bio_has_allocated_vec(bio))
+ if (bio_flagged(bio, BIO_OWNS_VEC))
bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));

if (bio_integrity(bio))
@@ -365,6 +365,8 @@ retry:
bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
if (unlikely(!bvl))
goto err_free;
+
+ bio->bi_flags |= 1 << BIO_OWNS_VEC;
} else if (nr_iovecs) {
bvl = bio->bi_inline_vecs;
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ba5b52e..427d05a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -84,11 +84,6 @@ static inline void *bio_data(struct bio *bio)
return NULL;
}

-static inline int bio_has_allocated_vec(struct bio *bio)
-{
- return bio->bi_io_vec && bio->bi_io_vec != bio->bi_inline_vecs;
-}
-
/*
* will die
*/
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 045ebe0..c1878bd 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -117,6 +117,7 @@ struct bio {
* BIO_POOL_IDX()
*/
#define BIO_RESET_BITS 12
+#define BIO_OWNS_VEC 12 /* bio_free() should free bvec */

#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

--
1.7.12

2012-08-22 17:06:37

by Kent Overstreet

Subject: [PATCH v6 07/13] block: Avoid deadlocks with bio allocation by stacking drivers

Previously, if we ever try to allocate more than once from the same bio
set while running under generic_make_request(), we risk deadlock.

This would happen if e.g. a bio ever needed to be split more than once,
and it's difficult to handle correctly in the drivers - so in practice
it wasn't being handled correctly.

This patch fixes this issue by allocating a rescuer workqueue for each
bio_set, and punting queued bios to said rescuer when necessary.

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
fs/bio.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++---
include/linux/bio.h | 75 ++++++++++++++++++++++++++++++-----------------------
2 files changed, 112 insertions(+), 37 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index aa67bf3..a82a3c7 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -279,6 +279,23 @@ void bio_reset(struct bio *bio)
}
EXPORT_SYMBOL(bio_reset);

+static void bio_alloc_rescue(struct work_struct *work)
+{
+ struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+ struct bio *bio;
+
+ while (1) {
+ spin_lock(&bs->rescue_lock);
+ bio = bio_list_pop(&bs->rescue_list);
+ spin_unlock(&bs->rescue_lock);
+
+ if (!bio)
+ break;
+
+ generic_make_request(bio);
+ }
+}
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -292,6 +309,7 @@ EXPORT_SYMBOL(bio_reset);
**/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -306,16 +324,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
p = kmalloc(sizeof(struct bio) +
nr_iovecs * sizeof(struct bio_vec),
gfp_mask);
+
front_pad = 0;
inline_vecs = nr_iovecs;
} else {
- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ /*
+ * If we're running under generic_make_request()
+ * (current->bio_list != NULL), we risk deadlock if we sleep on
+ * allocation and there's already bios on current->bio_list that
+ * were allocated from the same bio_set; they won't be submitted
+ * (and thus freed) as long as we're blocked here.
+ *
+ * To deal with this, we first try the allocation without using
+ * the mempool; if that fails, we punt all the bios on
+ * current->bio_list to a different thread and then retry the
+ * allocation with the original gfp mask.
+ */
+
+ if (current->bio_list &&
+ !bio_list_empty(current->bio_list) &&
+ (gfp_mask & __GFP_WAIT))
+ gfp_mask &= GFP_ATOMIC;
+retry:
+ if (gfp_mask & __GFP_WAIT)
+ p = mempool_alloc(bs->bio_pool, gfp_mask);
+ else
+ p = kmem_cache_alloc(bs->bio_slab, gfp_mask);
+
front_pad = bs->front_pad;
inline_vecs = BIO_INLINE_VECS;
}

if (unlikely(!p))
- return NULL;
+ goto err;

bio = p + front_pad;
bio_init(bio);
@@ -336,6 +377,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

err_free:
mempool_free(p, bs->bio_pool);
+err:
+ if (gfp_mask != saved_gfp) {
+ gfp_mask = saved_gfp;
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_merge(&bs->rescue_list, current->bio_list);
+ bio_list_init(current->bio_list);
+ spin_unlock(&bs->rescue_lock);
+
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ goto retry;
+ }
+
return NULL;
}
EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1544,6 +1598,9 @@ static void biovec_free_pools(struct bio_set *bs)

void bioset_free(struct bio_set *bs)
{
+ if (bs->rescue_workqueue)
+ destroy_workqueue(bs->rescue_workqueue);
+
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);

@@ -1579,6 +1636,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)

bs->front_pad = front_pad;

+ spin_lock_init(&bs->rescue_lock);
+ bio_list_init(&bs->rescue_list);
+ INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
if (!bs->bio_slab) {
kfree(bs);
@@ -1589,9 +1650,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
if (!bs->bio_pool)
goto bad;

- if (!biovec_create_pools(bs, pool_size))
- return bs;
+ if (biovec_create_pools(bs, pool_size))
+ goto bad;
+
+ bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+ if (!bs->rescue_workqueue)
+ goto bad;

+ return bs;
bad:
bioset_free(bs);
return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b22c22b..ba5b52e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */

-/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-#define BIO_POOL_SIZE 2
-#define BIOVEC_NR_POOLS 6
-#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
-
-struct bio_set {
- struct kmem_cache *bio_slab;
- unsigned int front_pad;
-
- mempool_t *bio_pool;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- mempool_t *bio_integrity_pool;
-#endif
- mempool_t *bvec_pool;
-};
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
-
#ifdef CONFIG_HIGHMEM
/*
* remember never ever reenable interrupts between a bvec_kmap_irq and
@@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
return bio;
}

+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
+
+struct bio_set {
+ struct kmem_cache *bio_slab;
+ unsigned int front_pad;
+
+ mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ mempool_t *bio_integrity_pool;
+#endif
+ mempool_t *bvec_pool;
+
+ /*
+ * Deadlock avoidance for stacking block drivers: see comments in
+ * bio_alloc_bioset() for details
+ */
+ spinlock_t rescue_lock;
+ struct bio_list rescue_list;
+ struct work_struct rescue_work;
+ struct workqueue_struct *rescue_workqueue;
+};
+
+struct biovec_slab {
+ int nr_vecs;
+ char *name;
+ struct kmem_cache *slab;
+};
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
--
1.7.12

2012-08-22 17:06:56

by Kent Overstreet

Subject: [PATCH v6 06/13] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

Previously, bio_kmalloc() and bio_alloc_bioset() behaved slightly
differently because there was some almost-duplicated code - this fixes
that issue.

This'll also help with some future cleanups - there are a fair number of
functions that allocate bios (e.g. bio_clone()), and now they don't have
to be duplicated for bio_alloc(), bio_alloc_bioset(), and bio_kmalloc().

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
fs/bio.c | 96 ++++++++++++++---------------------------------------
include/linux/bio.h | 18 +++++++---
2 files changed, 38 insertions(+), 76 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index fac171f..aa67bf3 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -55,8 +55,7 @@ static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
* IO code that does not need private memory pools.
*/
struct bio_set *fs_bio_set;
-
-#define BIO_KMALLOC_POOL NULL
+EXPORT_SYMBOL(fs_bio_set);

/*
* Our slab pool management
@@ -293,33 +292,43 @@ EXPORT_SYMBOL(bio_reset);
**/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ unsigned front_pad;
+ unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
struct bio_vec *bvl = NULL;
struct bio *bio;
void *p;

- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ if (nr_iovecs > UIO_MAXIOV)
+ return NULL;
+
+ if (bs == BIO_KMALLOC_POOL) {
+ p = kmalloc(sizeof(struct bio) +
+ nr_iovecs * sizeof(struct bio_vec),
+ gfp_mask);
+ front_pad = 0;
+ inline_vecs = nr_iovecs;
+ } else {
+ p = mempool_alloc(bs->bio_pool, gfp_mask);
+ front_pad = bs->front_pad;
+ inline_vecs = BIO_INLINE_VECS;
+ }
+
if (unlikely(!p))
return NULL;
- bio = p + bs->front_pad;

+ bio = p + front_pad;
bio_init(bio);
- bio->bi_pool = bs;

- if (unlikely(!nr_iovecs))
- goto out_set;
-
- if (nr_iovecs <= BIO_INLINE_VECS) {
- bvl = bio->bi_inline_vecs;
- nr_iovecs = BIO_INLINE_VECS;
- } else {
+ if (nr_iovecs > inline_vecs) {
bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
if (unlikely(!bvl))
goto err_free;
-
- nr_iovecs = bvec_nr_vecs(idx);
+ } else if (nr_iovecs) {
+ bvl = bio->bi_inline_vecs;
}
-out_set:
+
+ bio->bi_pool = bs;
bio->bi_flags |= idx << BIO_POOL_OFFSET;
bio->bi_max_vecs = nr_iovecs;
bio->bi_io_vec = bvl;
@@ -331,63 +340,6 @@ err_free:
}
EXPORT_SYMBOL(bio_alloc_bioset);

-/**
- * bio_alloc - allocate a new bio, memory pool backed
- * @gfp_mask: allocation mask to use
- * @nr_iovecs: number of iovecs
- *
- * bio_alloc will allocate a bio and associated bio_vec array that can hold
- * at least @nr_iovecs entries. Allocations will be done from the
- * fs_bio_set. Also see @bio_alloc_bioset and @bio_kmalloc.
- *
- * If %__GFP_WAIT is set, then bio_alloc will always be able to allocate
- * a bio. This is due to the mempool guarantees. To make this work, callers
- * must never allocate more than 1 bio at a time from this pool. Callers
- * that need to allocate more than 1 bio must always submit the previously
- * allocated bio for IO before attempting to allocate a new one. Failure to
- * do so can cause livelocks under memory pressure.
- *
- * RETURNS:
- * Pointer to new bio on success, NULL on failure.
- */
-struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-{
- return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-}
-EXPORT_SYMBOL(bio_alloc);
-
-/**
- * bio_kmalloc - allocate a bio for I/O using kmalloc()
- * @gfp_mask: the GFP_ mask given to the slab allocator
- * @nr_iovecs: number of iovecs to pre-allocate
- *
- * Description:
- * Allocate a new bio with @nr_iovecs bvecs. If @gfp_mask contains
- * %__GFP_WAIT, the allocation is guaranteed to succeed.
- *
- **/
-struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-{
- struct bio *bio;
-
- if (nr_iovecs > UIO_MAXIOV)
- return NULL;
-
- bio = kmalloc(sizeof(struct bio) + nr_iovecs * sizeof(struct bio_vec),
- gfp_mask);
- if (unlikely(!bio))
- return NULL;
-
- bio_init(bio);
- bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
- bio->bi_max_vecs = nr_iovecs;
- bio->bi_io_vec = bio->bi_inline_vecs;
- bio->bi_pool = BIO_KMALLOC_POOL;
-
- return bio;
-}
-EXPORT_SYMBOL(bio_kmalloc);
-
void zero_fill_bio(struct bio *bio)
{
unsigned long flags;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 393c87e..b22c22b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -212,12 +212,24 @@ extern void bio_pair_release(struct bio_pair *dbio);
extern struct bio_set *bioset_create(unsigned int, unsigned int);
extern void bioset_free(struct bio_set *);

-extern struct bio *bio_alloc(gfp_t, unsigned int);
-extern struct bio *bio_kmalloc(gfp_t, unsigned int);
extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
extern void bio_free(struct bio *);

+extern struct bio_set *fs_bio_set;
+
+static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+{
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
+}
+
+#define BIO_KMALLOC_POOL NULL
+
+static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+{
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, BIO_KMALLOC_POOL);
+}
+
extern void bio_endio(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);
@@ -305,8 +317,6 @@ struct biovec_slab {
struct kmem_cache *slab;
};

-extern struct bio_set *fs_bio_set;
-
/*
* a small number of entries is fine, not going to be performance critical.
* basically we just need to survive
--
1.7.12

2012-08-22 17:07:19

by Kent Overstreet

Subject: [PATCH v6 05/13] block: Kill bi_destructor

Now that we've got generic code for freeing bios allocated from bio
pools, this isn't needed anymore.

This also changes the semantics of bio_free() a bit - it now also frees
bios allocated by bio_kmalloc(). It's also no longer exported, as
without bi_destructor there should be no need for it to be called
anywhere else.

v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
Documentation/block/biodoc.txt | 5 -----
block/blk-core.c | 2 +-
fs/bio.c | 35 ++++++++++++++---------------------
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 3 ---
5 files changed, 16 insertions(+), 31 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index e418dc0..8df5e8e 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -465,7 +465,6 @@ struct bio {
bio_end_io_t *bi_end_io; /* bi_end_io (bio) */
atomic_t bi_cnt; /* pin count: free when it hits zero */
void *bi_private;
- bio_destructor_t *bi_destructor; /* bi_destructor (bio) */
};

With this multipage bio design:
@@ -647,10 +646,6 @@ for a non-clone bio. There are the 6 pools setup for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.

-The bi_destructor() routine takes into account the possibility of the bio
-having originated from a different source (see later discussions on
-n/w to block transfers and kvec_cb)
-
The bio_get() routine may be used to hold an extra reference on a bio prior
to i/o submission, if the bio fields are likely to be accessed after the
i/o is issued (since the bio may otherwise get freed in case i/o completion
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b4dbdf..f3780f5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2807,7 +2807,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,

free_and_out:
if (bio)
- bio_free(bio, bs);
+ bio_free(bio);
blk_rq_unprep_clone(rq);

return -ENOMEM;
diff --git a/fs/bio.c b/fs/bio.c
index 81bdf4f..fac171f 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -56,6 +56,8 @@ static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
*/
struct bio_set *fs_bio_set;

+#define BIO_KMALLOC_POOL NULL
+
/*
* Our slab pool management
*/
@@ -232,10 +234,19 @@ fallback:
return bvl;
}

-void bio_free(struct bio *bio, struct bio_set *bs)
+void bio_free(struct bio *bio)
{
+ struct bio_set *bs = bio->bi_pool;
void *p;

+ if (bs == BIO_KMALLOC_POOL) {
+ /* Bio was allocated by bio_kmalloc() */
+ if (bio_integrity(bio))
+ bio_integrity_free(bio, fs_bio_set);
+ kfree(bio);
+ return;
+ }
+
if (bio_has_allocated_vec(bio))
bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));

@@ -251,7 +262,6 @@ void bio_free(struct bio *bio, struct bio_set *bs)

mempool_free(p, bs->bio_pool);
}
-EXPORT_SYMBOL(bio_free);

void bio_init(struct bio *bio)
{
@@ -346,13 +356,6 @@ struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
}
EXPORT_SYMBOL(bio_alloc);

-static void bio_kmalloc_destructor(struct bio *bio)
-{
- if (bio_integrity(bio))
- bio_integrity_free(bio, fs_bio_set);
- kfree(bio);
-}
-
/**
* bio_kmalloc - allocate a bio for I/O using kmalloc()
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -379,7 +382,7 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
bio->bi_max_vecs = nr_iovecs;
bio->bi_io_vec = bio->bi_inline_vecs;
- bio->bi_destructor = bio_kmalloc_destructor;
+ bio->bi_pool = BIO_KMALLOC_POOL;

return bio;
}
@@ -417,17 +420,7 @@ void bio_put(struct bio *bio)
*/
if (atomic_dec_and_test(&bio->bi_cnt)) {
bio_disassociate_task(bio);
- bio->bi_next = NULL;
-
- /*
- * This if statement is temporary - bi_pool is replacing
- * bi_destructor, but bi_destructor will be taken out in another
- * patch.
- */
- if (bio->bi_pool)
- bio_free(bio, bio->bi_pool);
- else
- bio->bi_destructor(bio);
+ bio_free(bio);
}
}
EXPORT_SYMBOL(bio_put);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ba60319..393c87e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -216,7 +216,7 @@ extern struct bio *bio_alloc(gfp_t, unsigned int);
extern struct bio *bio_kmalloc(gfp_t, unsigned int);
extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
-extern void bio_free(struct bio *, struct bio_set *);
+extern void bio_free(struct bio *);

extern void bio_endio(struct bio *, int);
struct request_queue;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4fe4984..045ebe0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -84,11 +84,8 @@ struct bio {
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif

- /* If bi_pool is non NULL, bi_destructor is not called */
struct bio_set *bi_pool;

- bio_destructor_t *bi_destructor; /* destructor */
-
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
--
1.7.12

2012-08-22 17:07:22

by Kent Overstreet

Subject: [PATCH v6 04/13] pktcdvd: Switch to bio_kmalloc()

This is prep work for killing bi_destructor - previously, pktcdvd had
its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
necessitating its own bi_destructor implementation.

v5: Un-reorder some functions, to make the patch easier to review

Signed-off-by: Kent Overstreet <[email protected]>
CC: Peter Osterlund <[email protected]>
---
drivers/block/pktcdvd.c | 67 ++++++++++++-------------------------------------
1 file changed, 16 insertions(+), 51 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index ba66e44..ae55f08 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -101,6 +101,8 @@ static struct dentry *pkt_debugfs_root = NULL; /* /sys/kernel/debug/pktcdvd */
static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev);
static int pkt_remove_dev(dev_t pkt_dev);
static int pkt_seq_show(struct seq_file *m, void *p);
+static void pkt_end_io_read(struct bio *bio, int err);
+static void pkt_end_io_packet_write(struct bio *bio, int err);



@@ -522,38 +524,6 @@ static void pkt_bio_finished(struct pktcdvd_device *pd)
}
}

-static void pkt_bio_destructor(struct bio *bio)
-{
- kfree(bio->bi_io_vec);
- kfree(bio);
-}
-
-static struct bio *pkt_bio_alloc(int nr_iovecs)
-{
- struct bio_vec *bvl = NULL;
- struct bio *bio;
-
- bio = kmalloc(sizeof(struct bio), GFP_KERNEL);
- if (!bio)
- goto no_bio;
- bio_init(bio);
-
- bvl = kcalloc(nr_iovecs, sizeof(struct bio_vec), GFP_KERNEL);
- if (!bvl)
- goto no_bvl;
-
- bio->bi_max_vecs = nr_iovecs;
- bio->bi_io_vec = bvl;
- bio->bi_destructor = pkt_bio_destructor;
-
- return bio;
-
- no_bvl:
- kfree(bio);
- no_bio:
- return NULL;
-}
-
/*
* Allocate a packet_data struct
*/
@@ -567,10 +537,13 @@ static struct packet_data *pkt_alloc_packet_data(int frames)
goto no_pkt;

pkt->frames = frames;
- pkt->w_bio = pkt_bio_alloc(frames);
+ pkt->w_bio = bio_kmalloc(GFP_KERNEL, frames);
if (!pkt->w_bio)
goto no_bio;

+ pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
+ pkt->w_bio->bi_private = pkt;
+
for (i = 0; i < frames / FRAMES_PER_PAGE; i++) {
pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
if (!pkt->pages[i])
@@ -581,9 +554,12 @@ static struct packet_data *pkt_alloc_packet_data(int frames)
bio_list_init(&pkt->orig_bios);

for (i = 0; i < frames; i++) {
- struct bio *bio = pkt_bio_alloc(1);
+ struct bio *bio = bio_kmalloc(GFP_KERNEL, 1);
if (!bio)
goto no_rd_bio;
+
+ bio->bi_end_io = pkt_end_io_read;
+ bio->bi_private = pkt;
pkt->r_bios[i] = bio;
}

@@ -1111,21 +1087,15 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
* Schedule reads for missing parts of the packet.
*/
for (f = 0; f < pkt->frames; f++) {
- struct bio_vec *vec;
-
int p, offset;
+
if (written[f])
continue;
+
bio = pkt->r_bios[f];
- vec = bio->bi_io_vec;
- bio_init(bio);
- bio->bi_max_vecs = 1;
- bio->bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
- bio->bi_bdev = pd->bdev;
- bio->bi_end_io = pkt_end_io_read;
- bio->bi_private = pkt;
- bio->bi_io_vec = vec;
- bio->bi_destructor = pkt_bio_destructor;
+ bio_reset(bio);
+ bio->bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
+ bio->bi_bdev = pd->bdev;

p = (f * CD_FRAMESIZE) / PAGE_SIZE;
offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
@@ -1418,14 +1388,9 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
}

/* Start the write request */
- bio_init(pkt->w_bio);
- pkt->w_bio->bi_max_vecs = PACKET_MAX_SIZE;
+ bio_reset(pkt->w_bio);
pkt->w_bio->bi_sector = pkt->sector;
pkt->w_bio->bi_bdev = pd->bdev;
- pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
- pkt->w_bio->bi_private = pkt;
- pkt->w_bio->bi_io_vec = bvec;
- pkt->w_bio->bi_destructor = pkt_bio_destructor;
for (f = 0; f < pkt->frames; f++)
if (!bio_add_page(pkt->w_bio, bvec[f].bv_page, CD_FRAMESIZE, bvec[f].bv_offset))
BUG();
--
1.7.12

2012-08-22 17:07:53

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH v6 01/13] block: Generalized bio pool freeing

With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.

This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.

v6: Explain the temporary if statement in bio_put

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Nicholas Bellinger <[email protected]>
CC: Lars Ellenberg <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
drivers/block/drbd/drbd_main.c | 13 +------------
drivers/md/dm-crypt.c | 9 ---------
drivers/md/dm-io.c | 11 -----------
drivers/md/dm.c | 20 --------------------
drivers/md/md.c | 28 ++++------------------------
drivers/target/target_core_iblock.c | 9 ---------
fs/bio.c | 31 +++++++++++++------------------
include/linux/blk_types.h | 3 +++
8 files changed, 21 insertions(+), 103 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index dbe6135..5964459 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -161,23 +161,12 @@ static const struct block_device_operations drbd_ops = {
.release = drbd_release,
};

-static void bio_destructor_drbd(struct bio *bio)
-{
- bio_free(bio, drbd_md_io_bio_set);
-}
-
struct bio *bio_alloc_drbd(gfp_t gfp_mask)
{
- struct bio *bio;
-
if (!drbd_md_io_bio_set)
return bio_alloc(gfp_mask, 1);

- bio = bio_alloc_bioset(gfp_mask, 1, drbd_md_io_bio_set);
- if (!bio)
- return NULL;
- bio->bi_destructor = bio_destructor_drbd;
- return bio;
+ return bio_alloc_bioset(gfp_mask, 1, drbd_md_io_bio_set);
}

#ifdef __CHECKER__
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 664743d..3c0acba 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -798,14 +798,6 @@ static int crypt_convert(struct crypt_config *cc,
return 0;
}

-static void dm_crypt_bio_destructor(struct bio *bio)
-{
- struct dm_crypt_io *io = bio->bi_private;
- struct crypt_config *cc = io->cc;
-
- bio_free(bio, cc->bs);
-}
-
/*
* Generate a new unfragmented bio with the given size
* This should never violate the device limitations
@@ -974,7 +966,6 @@ static void clone_init(struct dm_crypt_io *io, struct bio *clone)
clone->bi_end_io = crypt_endio;
clone->bi_bdev = cc->dev->bdev;
clone->bi_rw = io->base_bio->bi_rw;
- clone->bi_destructor = dm_crypt_bio_destructor;
}

static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index ea5dd28..1c46f97 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -249,16 +249,6 @@ static void vm_dp_init(struct dpages *dp, void *data)
dp->context_ptr = data;
}

-static void dm_bio_destructor(struct bio *bio)
-{
- unsigned region;
- struct io *io;
-
- retrieve_io_and_region_from_bio(bio, &io, &region);
-
- bio_free(bio, io->client->bios);
-}
-
/*
* Functions for getting the pages from kernel memory.
*/
@@ -317,7 +307,6 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
bio->bi_sector = where->sector + (where->count - remaining);
bio->bi_bdev = where->bdev;
bio->bi_end_io = endio;
- bio->bi_destructor = dm_bio_destructor;
store_io_and_region_in_bio(bio, io, region);

if (rw & REQ_DISCARD) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4e09b6f..0c3d6dd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -681,11 +681,6 @@ static void clone_endio(struct bio *bio, int error)
}
}

- /*
- * Store md for cleanup instead of tio which is about to get freed.
- */
- bio->bi_private = md->bs;
-
free_tio(md, tio);
bio_put(bio);
dec_pending(io, error);
@@ -1032,11 +1027,6 @@ static void __map_bio(struct dm_target *ti, struct bio *clone,
/* error the io and bail out, or requeue it if needed */
md = tio->io->md;
dec_pending(tio->io, r);
- /*
- * Store bio_set for cleanup.
- */
- clone->bi_end_io = NULL;
- clone->bi_private = md->bs;
bio_put(clone);
free_tio(md, tio);
} else if (r) {
@@ -1055,13 +1045,6 @@ struct clone_info {
unsigned short idx;
};

-static void dm_bio_destructor(struct bio *bio)
-{
- struct bio_set *bs = bio->bi_private;
-
- bio_free(bio, bs);
-}
-
/*
* Creates a little bio that just does part of a bvec.
*/
@@ -1073,7 +1056,6 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
struct bio_vec *bv = bio->bi_io_vec + idx;

clone = bio_alloc_bioset(GFP_NOIO, 1, bs);
- clone->bi_destructor = dm_bio_destructor;
*clone->bi_io_vec = *bv;

clone->bi_sector = sector;
@@ -1105,7 +1087,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
clone->bi_vcnt = idx + bv_count;
@@ -1150,7 +1131,6 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
*/
clone = bio_alloc_bioset(GFP_NOIO, ci->bio->bi_max_vecs, ci->md->bs);
__bio_clone(clone, ci->bio);
- clone->bi_destructor = dm_bio_destructor;
if (len) {
clone->bi_sector = ci->sector;
clone->bi_size = to_bytes(len);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3f6203a..b8eebe3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -155,32 +155,17 @@ static int start_readonly;
* like bio_clone, but with a local bio set
*/

-static void mddev_bio_destructor(struct bio *bio)
-{
- struct mddev *mddev, **mddevp;
-
- mddevp = (void*)bio;
- mddev = mddevp[-1];
-
- bio_free(bio, mddev->bio_set);
-}
-
struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
struct mddev *mddev)
{
struct bio *b;
- struct mddev **mddevp;

if (!mddev || !mddev->bio_set)
return bio_alloc(gfp_mask, nr_iovecs);

- b = bio_alloc_bioset(gfp_mask, nr_iovecs,
- mddev->bio_set);
+ b = bio_alloc_bioset(gfp_mask, nr_iovecs, mddev->bio_set);
if (!b)
return NULL;
- mddevp = (void*)b;
- mddevp[-1] = mddev;
- b->bi_destructor = mddev_bio_destructor;
return b;
}
EXPORT_SYMBOL_GPL(bio_alloc_mddev);
@@ -189,18 +174,14 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
{
struct bio *b;
- struct mddev **mddevp;

if (!mddev || !mddev->bio_set)
return bio_clone(bio, gfp_mask);

- b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs,
- mddev->bio_set);
+ b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
if (!b)
return NULL;
- mddevp = (void*)b;
- mddevp[-1] = mddev;
- b->bi_destructor = mddev_bio_destructor;
+
__bio_clone(b, bio);
if (bio_integrity(bio)) {
int ret;
@@ -5006,8 +4987,7 @@ int md_run(struct mddev *mddev)
}

if (mddev->bio_set == NULL)
- mddev->bio_set = bioset_create(BIO_POOL_SIZE,
- sizeof(struct mddev *));
+ mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);

spin_lock(&pers_lock);
pers = find_pers(mddev->level, mddev->clevel);
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 76db75e..e58cd7d 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -543,14 +543,6 @@ static void iblock_complete_cmd(struct se_cmd *cmd)
kfree(ibr);
}

-static void iblock_bio_destructor(struct bio *bio)
-{
- struct se_cmd *cmd = bio->bi_private;
- struct iblock_dev *ib_dev = cmd->se_dev->dev_ptr;
-
- bio_free(bio, ib_dev->ibd_bio_set);
-}
-
static struct bio *
iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
{
@@ -572,7 +564,6 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)

bio->bi_bdev = ib_dev->ibd_bd;
bio->bi_private = cmd;
- bio->bi_destructor = iblock_bio_destructor;
bio->bi_end_io = &iblock_bio_done;
bio->bi_sector = lba;
return bio;
diff --git a/fs/bio.c b/fs/bio.c
index 5eaa70c..86e0534 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -271,10 +271,6 @@ EXPORT_SYMBOL(bio_init);
* bio_alloc_bioset will try its own mempool to satisfy the allocation.
* If %__GFP_WAIT is set then we will block on the internal pool waiting
* for a &struct bio to become free.
- *
- * Note that the caller must set ->bi_destructor on successful return
- * of a bio, to do the appropriate freeing of the bio once the reference
- * count drops to zero.
**/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
@@ -289,6 +285,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
bio = p + bs->front_pad;

bio_init(bio);
+ bio->bi_pool = bs;

if (unlikely(!nr_iovecs))
goto out_set;
@@ -315,11 +312,6 @@ err_free:
}
EXPORT_SYMBOL(bio_alloc_bioset);

-static void bio_fs_destructor(struct bio *bio)
-{
- bio_free(bio, fs_bio_set);
-}
-
/**
* bio_alloc - allocate a new bio, memory pool backed
* @gfp_mask: allocation mask to use
@@ -341,12 +333,7 @@ static void bio_fs_destructor(struct bio *bio)
*/
struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
- struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-
- if (bio)
- bio->bi_destructor = bio_fs_destructor;
-
- return bio;
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
}
EXPORT_SYMBOL(bio_alloc);

@@ -422,7 +409,16 @@ void bio_put(struct bio *bio)
if (atomic_dec_and_test(&bio->bi_cnt)) {
bio_disassociate_task(bio);
bio->bi_next = NULL;
- bio->bi_destructor(bio);
+
+ /*
+ * This if statement is temporary - bi_pool is replacing
+ * bi_destructor, but bi_destructor will be taken out in another
+ * patch.
+ */
+ if (bio->bi_pool)
+ bio_free(bio, bio->bi_pool);
+ else
+ bio->bi_destructor(bio);
}
}
EXPORT_SYMBOL(bio_put);
@@ -473,12 +469,11 @@ EXPORT_SYMBOL(__bio_clone);
*/
struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
{
- struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
+ struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);

if (!b)
return NULL;

- b->bi_destructor = bio_fs_destructor;
__bio_clone(b, bio);

if (bio_integrity(bio)) {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7b7ac9c..af9dd9d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -80,6 +80,9 @@ struct bio {
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif

+ /* If bi_pool is non NULL, bi_destructor is not called */
+ struct bio_set *bi_pool;
+
bio_destructor_t *bi_destructor; /* destructor */

/*
--
1.7.12

2012-08-22 17:13:37

by Jeff Garzik

Subject: Re: [PATCH v6 12/13] block: Add bio_clone_bioset(), bio_clone_kmalloc()

On 08/22/2012 01:04 PM, Kent Overstreet wrote:
> Previously, there was bio_clone() but it only allocated from the fs bio
> set; as a result various users were open coding it and using
> __bio_clone().
>
> This changes bio_clone() to become bio_clone_bioset(), and then we add
> bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
> the functionality the last patch added.
>
> This will also help in a later patch changing how bio cloning works.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Jens Axboe <[email protected]>
> CC: NeilBrown <[email protected]>
> CC: Alasdair Kergon <[email protected]>
> CC: Boaz Harrosh <[email protected]>
> CC: Jeff Garzik <[email protected]>
> ---
> block/blk-core.c | 8 +-------
> drivers/block/osdblk.c | 3 +--
> drivers/md/dm.c | 4 ++--
> drivers/md/md.c | 20 +-------------------
> fs/bio.c | 13 ++++++++-----
> fs/exofs/ore.c | 5 ++---
> include/linux/bio.h | 17 ++++++++++++++---
> 7 files changed, 29 insertions(+), 41 deletions(-)

Acked-by: Jeff Garzik <[email protected]>


2012-08-22 17:43:59

by Adrian Bunk

Subject: Re: [PATCH v6 08/13] block: Add an explicit bio flag for bios that own their bvec

On Wed, Aug 22, 2012 at 10:04:05AM -0700, Kent Overstreet wrote:
>...
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -117,6 +117,7 @@ struct bio {
> * BIO_POOL_IDX()
> */
> #define BIO_RESET_BITS 12
> +#define BIO_OWNS_VEC 12 /* bio_free() should free bvec */
>...

This doesn't look right.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2012-08-22 18:32:59

by Tejun Heo

Subject: Re: [PATCH v6 02/13] dm: Use bioset's front_pad for dm_rq_clone_bio_info

On Wed, Aug 22, 2012 at 10:03:59AM -0700, Kent Overstreet wrote:
> Previously, dm_rq_clone_bio_info needed to be freed by the bio's
> destructor to avoid a memory leak in the blk_rq_prep_clone() error path.
> This gets rid of a memory allocation and means we can kill
> dm_rq_bio_destructor.
>
> v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Alasdair Kergon <[email protected]>

Acked-by: Tejun Heo <[email protected]>

It would be nice if the dm people could review these patches, though.

Thanks.

--
tejun

2012-08-22 18:34:33

by Tejun Heo

Subject: Re: [PATCH v6 03/13] block: Add bio_reset()

On Wed, Aug 22, 2012 at 10:04:00AM -0700, Kent Overstreet wrote:
> Reusing bios is something that's been highly frowned upon in the past,
> but driver code keeps doing it anyways. If it's going to happen anyways,
> we should provide a generic method.
>
> This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
> was open coding it, by doing a bio_init() and resetting bi_destructor.
>
> v5: Add a define BIO_RESET_BITS, to be very explicit about what parts of
> bio->bi_flags are saved.
> v6: Further commenting verbosity, per Tejun
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Jens Axboe <[email protected]>

Given some users are converted later.

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2012-08-22 19:22:50

by Kent Overstreet

Subject: Re: [PATCH v6 08/13] block: Add an explicit bio flag for bios that own their bvec

On Wed, Aug 22, 2012 at 08:43:52PM +0300, Adrian Bunk wrote:
> On Wed, Aug 22, 2012 at 10:04:05AM -0700, Kent Overstreet wrote:
> >...
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -117,6 +117,7 @@ struct bio {
> > * BIO_POOL_IDX()
> > */
> > #define BIO_RESET_BITS 12
> > +#define BIO_OWNS_VEC 12 /* bio_free() should free bvec */
> >...
>
> This doesn't look right.

Well, the first 12 bits are reset, so bit 12 will get preserved... I
guess it's unusual to have a duplicated enum value like that but
BIO_RESET_BITS is just a marker, not a real bit.

2012-08-22 19:51:59

by Tejun Heo

Subject: Re: [PATCH v6 03/13] block: Add bio_reset()

On Wed, Aug 22, 2012 at 11:34:26AM -0700, Tejun Heo wrote:
> On Wed, Aug 22, 2012 at 10:04:00AM -0700, Kent Overstreet wrote:
> > Reusing bios is something that's been highly frowned upon in the past,
> > but driver code keeps doing it anyways. If it's going to happen anyways,
> > we should provide a generic method.
> >
> > This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
> > was open coding it, by doing a bio_init() and resetting bi_destructor.
> >
> > v5: Add a define BIO_RESET_BITS, to be very explicit about what parts of
> > bio->bi_flags are saved.
> > v6: Further commenting verbosity, per Tejun
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > CC: Jens Axboe <[email protected]>
>
> Given some users are converted later.
>
> Acked-by: Tejun Heo <[email protected]>

Noticed it was a bit fuzzy which fields get reset and which aren't, and
had an off-line discussion with Kent. The conclusion was that
bi_end_io and bi_private should be reset and bi_integrity should be
freed on bio_reset().

Thanks.

--
tejun

2012-08-22 19:55:55

by Tejun Heo

Subject: Re: [PATCH v6 04/13] pktcdvd: Switch to bio_kmalloc()

(cc'ing Jiri, hi!)

On Wed, Aug 22, 2012 at 10:04:01AM -0700, Kent Overstreet wrote:
> This is prep work for killing bi_destructor - previously, pktcdvd had
> its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
> necessitating its own bi_destructor implementation.
>
> v5: Un-reorder some functions, to make the patch easier to review
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Peter Osterlund <[email protected]>

Apart from bio_reset() not resetting bi_end_io and bi_private, this
looks fine to me but lack of testing makes me a bit hesitant to ack
it.

Peter doesn't want to be the maintainer of pktcdvd anymore. Maybe
Jiri can be tricked into taking this one too? :)

Thanks.

--
tejun

2012-08-22 20:00:39

by Tejun Heo

Subject: Re: [PATCH v6 05/13] block: Kill bi_destructor

Hello,

On Wed, Aug 22, 2012 at 10:04:02AM -0700, Kent Overstreet wrote:
> +#define BIO_KMALLOC_POOL NULL

I would much prefer just doing

if (!bs) {
/* do kmalloc/kfree thing */
} else {
/* do bioset thing */
}

NULL @bs indicating no bioset is perfectly natural and so is using
generic memory allocation in the absence of a bioset. I don't see any
value in defining BIO_KMALLOC_POOL to be NULL.

Thanks.

--
tejun

2012-08-22 20:08:08

by Adrian Bunk

Subject: Re: [PATCH v6 08/13] block: Add an explicit bio flag for bios that own their bvec

On Wed, Aug 22, 2012 at 12:22:41PM -0700, Kent Overstreet wrote:
> On Wed, Aug 22, 2012 at 08:43:52PM +0300, Adrian Bunk wrote:
> > On Wed, Aug 22, 2012 at 10:04:05AM -0700, Kent Overstreet wrote:
> > >...
> > > --- a/include/linux/blk_types.h
> > > +++ b/include/linux/blk_types.h
> > > @@ -117,6 +117,7 @@ struct bio {
> > > * BIO_POOL_IDX()
> > > */
> > > #define BIO_RESET_BITS 12
> > > +#define BIO_OWNS_VEC 12 /* bio_free() should free bvec */
> > >...
> >
> > This doesn't look right.
>
> Well, the first 12 bits are reset, so bit 12 will get preserved... I
> guess it's unusual to have a duplicated enum value like that but
> BIO_RESET_BITS is just a marker, not a real bit.

Wouldn't a BIO_RESET_MASK be better than BIO_RESET_BITS?

cu
Adrian


2012-08-22 20:17:37

by Tejun Heo

Subject: Re: [PATCH v6 06/13] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

Hello,

On Wed, Aug 22, 2012 at 10:04:03AM -0700, Kent Overstreet wrote:
> Previously, bio_kmalloc() and bio_alloc_bioset() behaved slightly
> differently because there was some almost-duplicated code - this fixes
> that issue.

What were those slight differences? Why is it safe to change the
behaviors to match each other?

> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> + unsigned front_pad;
> + unsigned inline_vecs;
> unsigned long idx = BIO_POOL_NONE;
> struct bio_vec *bvl = NULL;
> struct bio *bio;
> void *p;
>
> - p = mempool_alloc(bs->bio_pool, gfp_mask);
> + if (nr_iovecs > UIO_MAXIOV)
> + return NULL;

This test used to only happen for bio_kmalloc(). If I follow the code
I can see that UIO_MAXIOV is larger than BIOVEC_MAX_IDX, so this
doesn't really affect the capability of bioset allocs; however, given
that bioset allocation already checks against BIOVEC_MAX_IDX, I don't
see why this test is now also applied to them.

> + if (bs == BIO_KMALLOC_POOL) {
> + p = kmalloc(sizeof(struct bio) +
> + nr_iovecs * sizeof(struct bio_vec),
> + gfp_mask);
> + front_pad = 0;
> + inline_vecs = nr_iovecs;
> + } else {
> + p = mempool_alloc(bs->bio_pool, gfp_mask);
> + front_pad = bs->front_pad;
> + inline_vecs = BIO_INLINE_VECS;
> + }
> +
> if (unlikely(!p))
> return NULL;
> - bio = p + bs->front_pad;
>
> + bio = p + front_pad;
> bio_init(bio);
> - bio->bi_pool = bs;
>
> - if (unlikely(!nr_iovecs))
> - goto out_set;
> -
> - if (nr_iovecs <= BIO_INLINE_VECS) {
> - bvl = bio->bi_inline_vecs;
> - nr_iovecs = BIO_INLINE_VECS;
> - } else {
> + if (nr_iovecs > inline_vecs) {
> bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
> if (unlikely(!bvl))
> goto err_free;
> -
> - nr_iovecs = bvec_nr_vecs(idx);
> + } else if (nr_iovecs) {
> + bvl = bio->bi_inline_vecs;
> }
> -out_set:
> +
> + bio->bi_pool = bs;
> bio->bi_flags |= idx << BIO_POOL_OFFSET;
> bio->bi_max_vecs = nr_iovecs;
> bio->bi_io_vec = bvl;
> @@ -331,63 +340,6 @@ err_free:
> }
> EXPORT_SYMBOL(bio_alloc_bioset);

> +extern struct bio_set *fs_bio_set;
> +
> +static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> +{
> + return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
> +}
> +
> +#define BIO_KMALLOC_POOL NULL
> +
> +static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> +{
> + return bio_alloc_bioset(gfp_mask, nr_iovecs, BIO_KMALLOC_POOL);
> +}

And we lose the /** comments on two exported functions, and the
bio_alloc_bioset() comment doesn't explain that it now also handles the
NULL bioset case. Now that they share a common implementation, you can
update bio_alloc_bioset() and refer to it from its wrappers, but please
don't drop them like this.

Thanks.

--
tejun

2012-08-22 20:35:40

by Tejun Heo

Subject: Re: [PATCH v6 07/13] block: Avoid deadlocks with bio allocation by stacking drivers

Hello, Kent.

On Wed, Aug 22, 2012 at 10:04:04AM -0700, Kent Overstreet wrote:
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request(), we risk deadlock.
>
> This would happen if e.g. a bio ever needed to be split more than once,
> and it's difficult to handle correctly in the drivers - so in practice
> it isn't handled.
>
> This patch fixes this issue by allocating a rescuer workqueue for each
> bio_set, and punting queued bios to said rescuer when necessary:

This description needs to be several times more descriptive than it
currently is. The deadlock issue at hand is very unobvious. Please
explain it so that someone who isn't familiar with the problem can
understand it by reading it.

Also, please describe how the patch was tested.

> @@ -306,16 +324,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> p = kmalloc(sizeof(struct bio) +
> nr_iovecs * sizeof(struct bio_vec),
> gfp_mask);
> +
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> - p = mempool_alloc(bs->bio_pool, gfp_mask);
> + /*
> + * If we're running under generic_make_request()
> + * (current->bio_list != NULL), we risk deadlock if we sleep on
> + * allocation and there's already bios on current->bio_list that
> + * were allocated from the same bio_set; they won't be submitted
> + * (and thus freed) as long as we're blocked here.
> + *
> + * To deal with this, we first try the allocation without using
> + * the mempool; if that fails, we punt all the bios on
> + * current->bio_list to a different thread and then retry the
> + * allocation with the original gfp mask.
> + */

Ditto here.

> + if (current->bio_list &&
> + !bio_list_empty(current->bio_list) &&
> + (gfp_mask & __GFP_WAIT))
> + gfp_mask &= GFP_ATOMIC;

Why aren't we turning off __GFP_WAIT instead? e.g. What if the caller
has one of the NUMA flags or __GFP_COLD specified?

> +retry:
> + if (gfp_mask & __GFP_WAIT)
> + p = mempool_alloc(bs->bio_pool, gfp_mask);
> + else
> + p = kmem_cache_alloc(bs->bio_slab, gfp_mask);

Why is it necessary to avoid using the mempool if __GFP_WAIT is
already cleared?

> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -336,6 +377,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }
> +
> return NULL;
> }
> EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1544,6 +1598,9 @@ static void biovec_free_pools(struct bio_set *bs)
>
> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)
> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);
>
> @@ -1579,6 +1636,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>
> bs->front_pad = front_pad;
>
> + spin_lock_init(&bs->rescue_lock);
> + bio_list_init(&bs->rescue_list);
> + INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
> bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
> if (!bs->bio_slab) {
> kfree(bs);
> @@ -1589,9 +1650,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
> if (!bs->bio_pool)
> goto bad;
>
> - if (!biovec_create_pools(bs, pool_size))
> - return bs;
> + if (biovec_create_pools(bs, pool_size))
> + goto bad;
> +
> + bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> + if (!bs->rescue_workqueue)
> + goto bad;
>
> + return bs;
> bad:
> bioset_free(bs);
> return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b22c22b..ba5b52e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
> static inline void bio_disassociate_task(struct bio *bio) { }
> #endif /* CONFIG_BLK_CGROUP */
>
> -/*
> - * bio_set is used to allow other portions of the IO system to
> - * allocate their own private memory pools for bio and iovec structures.
> - * These memory pools in turn all allocate from the bio_slab
> - * and the bvec_slabs[].
> - */
> -#define BIO_POOL_SIZE 2
> -#define BIOVEC_NR_POOLS 6
> -#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
> -
> -struct bio_set {
> - struct kmem_cache *bio_slab;
> - unsigned int front_pad;
> -
> - mempool_t *bio_pool;
> -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> - mempool_t *bio_integrity_pool;
> -#endif
> - mempool_t *bvec_pool;
> -};
> -
> -struct biovec_slab {
> - int nr_vecs;
> - char *name;
> - struct kmem_cache *slab;
> -};

Please don't mix struct definition relocation (or any relocation,
really) with actual changes. It buries the actual changes and makes
reviewing difficult.

Thanks.

--
tejun

2012-08-22 20:46:15

by Tejun Heo

Subject: Re: [PATCH v6 10/13] block: Introduce new bio_split()

Hello, Kent.

On Wed, Aug 22, 2012 at 10:04:07AM -0700, Kent Overstreet wrote:
> @@ -672,7 +672,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
>
> BUG_ON(bip == NULL);
> BUG_ON(bi == NULL);
> - BUG_ON(!bio_flagged(bio, BIO_CLONED));
>
> nr_sectors = bio_integrity_hw_sectors(bi, sectors);
> bip->bip_sector = bip->bip_sector + offset;

It depends on what BIO_CLONED is supposed to do but maybe the right
thing to do is clearing BIO_CLONED on split bio? The split bio isn't
a clone after all.

> /**
> + * bio_split - split a bio
> + * @bio: bio to split
> + * @sectors: number of sectors to split from the front of @bio
> + * @gfp: gfp mask
> + * @bs: bio set to allocate from
> + *
> + * Allocates and returns a new bio which represents @sectors from the start of
> + * @bio, and updates @bio to represent the remaining sectors.
> + *
> + * If bio_sectors(@bio) was less than or equal to @sectors, returns @bio
> + * unchanged.

This is no longer true.

> + * The newly allocated bio will point to @bio's bi_io_vec, if the split was on a
> + * bvec boundary; it is the caller's responsibility to ensure that @bio is not
> + * freed before the split.
> + */
> +struct bio *bio_split(struct bio *bio, int sectors,
> + gfp_t gfp, struct bio_set *bs)
> +{
...
> + if (bio_integrity(bio)) {
> + bio_integrity_clone(split, bio, gfp, bs);

What happens if bio_integrity_clone() fails?

> + bio_integrity_trim(split, 0, bio_sectors(split));
> + bio_integrity_trim(bio, bio_sectors(split), bio_sectors(bio));

I complained pretty loudly about this not being mentioned in the
description of this patch or the following one and there still is no
explanation. Come on, Kent.

> +static inline struct bio *bio_next_split(struct bio *bio, int sectors,
> + gfp_t gfp, struct bio_set *bs)
> +{
> + if (sectors >= bio_sectors(bio))
> + return bio;
> +
> + return bio_split(bio, sectors, gfp, bs);
> +}

First of all, I don't think this is necessary. In addition, this
doesn't have any comment explaining what it does and is never
mentioned or justified in the patch description. Kent, you're repeating
the same mistakes even after they have been explicitly pointed out
multiple times. *Please* pay more attention.

Thanks.

--
tejun

2012-08-22 21:04:18

by Tejun Heo

Subject: Re: [PATCH v6 11/13] block: Rework bio_pair_split()

Hello, Kent.

On Wed, Aug 22, 2012 at 10:04:08AM -0700, Kent Overstreet wrote:
> This changes bio_pair_split() to use the new bio_split() underneath,
> which gets rid of the single page bio limitation. The various callers
> are fixed up for the slightly different struct bio_pair, and to remove
> the unnecessary checks.

This changes an existing API both in its interface and behavior and
there's no detailed explanation on how it's changed and what are the
implications.

> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 1f5b483..63e5852 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -751,14 +751,13 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>
> /* split the bio. We'll release it either in the next
> call, or it will have to be released outside */
> - bp = bio_pair_split(old_chain,
> - (len - total) / SECTOR_SIZE);
> + bp = bio_pair_split(old_chain, (len - total) / SECTOR_SIZE);

Probably belongs to the previous patch which renamed bio_split() to
bio_pair_split()? Another thing is, is this renaming really
necessary? If you did,

* s/bio_split()/bio_pair_split().

* introduce better and prettier bio_split() which has
different semantics.

* replace bio_pair_split() users with bio_split().

the renaming would have made sense, but you renamed an existing API,
introduced a new one, and then changed the renamed old API. Doesn't
make too much sense to me.

> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0f31ec4..9fa07c7 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1080,15 +1080,9 @@ static void make_request(struct mddev *mddev, struct bio * bio)
> && (conf->geo.near_copies < conf->geo.raid_disks
> || conf->prev.near_copies < conf->prev.raid_disks))) {
> struct bio_pair *bp;
> - /* Sanity check -- queue functions should prevent this happening */
> - if (bio->bi_vcnt != 1 ||
> - bio->bi_idx != 0)
> - goto bad_map;
> - /* This is a one page bio that upper layers
> - * refuse to split for us, so we need to split it.
> - */
> +
> bp = bio_pair_split(bio,
> - chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
> + chunk_sects - (bio->bi_sector & (chunk_sects - 1)));

I suppose this one too belongs to the previous rename patch?

> --- a/fs/bio-integrity.c
> +++ b/fs/bio-integrity.c
> @@ -681,50 +681,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
> EXPORT_SYMBOL(bio_integrity_trim);
>
> /**
> - * bio_integrity_split - Split integrity metadata
> - * @bio: Protected bio
> - * @bp: Resulting bio_pair
> - * @sectors: Offset
> - *
> - * Description: Splits an integrity page into a bio_pair.
> - */
> -void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
> -{
> - struct blk_integrity *bi;
> - struct bio_integrity_payload *bip = bio->bi_integrity;
> - unsigned int nr_sectors;
> -
> - if (bio_integrity(bio) == 0)
> - return;
> -
> - bi = bdev_get_integrity(bio->bi_bdev);
> - BUG_ON(bi == NULL);
> - BUG_ON(bip->bip_vcnt != 1);
> -
> - nr_sectors = bio_integrity_hw_sectors(bi, sectors);
> -
> - bp->bio1.bi_integrity = &bp->bip1;
> - bp->bio2.bi_integrity = &bp->bip2;
> -
> - bp->iv1 = bip->bip_vec[0];
> - bp->iv2 = bip->bip_vec[0];
> -
> - bp->bip1.bip_vec[0] = bp->iv1;
> - bp->bip2.bip_vec[0] = bp->iv2;
> -
> - bp->iv1.bv_len = sectors * bi->tuple_size;
> - bp->iv2.bv_offset += sectors * bi->tuple_size;
> - bp->iv2.bv_len -= sectors * bi->tuple_size;
> -
> - bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
> - bp->bip2.bip_sector = bio->bi_integrity->bip_sector + nr_sectors;
> -
> - bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
> - bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
> -}
> -EXPORT_SYMBOL(bio_integrity_split);

I complained about this in the last posting and in the previous patch.
Please respond. Martin, are you okay with these integrity changes?

> -static void bio_pair_end_1(struct bio *bi, int err)
> +static void bio_pair_end(struct bio *bio, int error)
> {
> - struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
> + struct bio_pair *bp = bio->bi_private;
>
> - if (err)
> - bp->error = err;
> -
> - bio_pair_release(bp);
> -}
> -
> -static void bio_pair_end_2(struct bio *bi, int err)
> -{
> - struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);
> -
> - if (err)
> - bp->error = err;
> + if (error)
> + clear_bit(BIO_UPTODATE, &bp->orig->bi_flags);
>
> bio_pair_release(bp);
> }

Why is losing error value okay here?

> @@ -1856,8 +1846,7 @@ static int __init init_bio(void)
> if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
> panic("bio: can't create integrity pool\n");
>
> - bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
> - sizeof(struct bio_pair));
> + bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
> if (!bio_split_pool)
> panic("bio: can't create split pool\n");

It would be nice to mention that using this from stacking drivers is
inherently broken. This is something which has been broken before
this patch but still. bio_split*() should always require separate
biosets.

> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1c3bb47..3ad3540 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -192,14 +192,13 @@ struct bio_integrity_payload {
> * in bio2.bi_private
> */
> struct bio_pair {
> - struct bio bio1, bio2;
> - struct bio_vec bv1, bv2;
> -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> - struct bio_integrity_payload bip1, bip2;
> - struct bio_vec iv1, iv2;
> -#endif
> - atomic_t cnt;
> - int error;
> + atomic_t cnt;
> +
> + bio_end_io_t *bi_end_io;
> + void *bi_private;
> +
> + struct bio *orig;
> + struct bio split;
> };

So, this struct is allocated as frontpad, which is a pretty unusual
thing to do. Please explain and emphasize that ->split should come
last. Also, given that it's a pair split, it would be nice to somehow
indicate that ->split is the earlier half. Before this change it was
pretty clear with ->bio1/2.

Thanks.

--
tejun

2012-08-22 21:07:48

by Tejun Heo

Subject: Re: [PATCH v6 12/13] block: Add bio_clone_bioset(), bio_clone_kmalloc()

On Wed, Aug 22, 2012 at 10:04:09AM -0700, Kent Overstreet wrote:
> Previously, there was bio_clone() but it only allocated from the fs bio
> set; as a result various users were open coding it and using
> __bio_clone().
>
> This changes bio_clone() to become bio_clone_bioset(), and then we add
> bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
> the functionality the last patch added.
>
> This will also help in a later patch changing how bio cloning works.

I'd prefer simply adding @bioset to bio_clone() so that the caller
always has to make the choice consciously. We're updating all the
callers anyway.

Thanks.

--
tejun

2012-08-22 21:10:52

by Tejun Heo

Subject: Re: [PATCH v6 13/13] block: Only clone bio vecs that are in use

Hello, Kent.

On Wed, Aug 22, 2012 at 10:04:10AM -0700, Kent Overstreet wrote:
> bcache creates large bios internally, and then splits them according to
> the device requirements before it sends them down. If a lower level
> device tries to clone the bio, and the original bio had more than
> BIO_MAX_PAGES, the clone will fail unnecessarily.
>
> We can fix this by only cloning the bio vecs that are actually in use.

I'm pretty sure I sound like a broken record by now, but

* How was this tested?

* What are the implications and possible dangers?

> @@ -463,10 +468,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
> bio->bi_sector = bio_src->bi_sector;
> bio->bi_bdev = bio_src->bi_bdev;
> bio->bi_flags |= 1 << BIO_CLONED;
> + bio->bi_flags &= ~(1 << BIO_SEG_VALID);

For the n'th time, explain please.

Thanks.

--
tejun

2012-08-22 21:27:10

by Nicholas A. Bellinger

Subject: Re: [PATCH v6 01/13] block: Generalized bio pool freeing

On Wed, 2012-08-22 at 10:03 -0700, Kent Overstreet wrote:
> With the old code, when you allocate a bio from a bio pool you have to
> implement your own destructor that knows how to find the bio pool the
> bio was originally allocated from.
>
> This adds a new field to struct bio (bi_pool) and changes
> bio_alloc_bioset() to use it. This makes various bio destructors
> unnecessary, so they're then deleted.
>
> v6: Explain the temporary if statement in bio_put
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Jens Axboe <[email protected]>
> CC: NeilBrown <[email protected]>
> CC: Alasdair Kergon <[email protected]>
> CC: Nicholas Bellinger <[email protected]>
> CC: Lars Ellenberg <[email protected]>
> Acked-by: Tejun Heo <[email protected]>
> ---
> drivers/block/drbd/drbd_main.c | 13 +------------
> drivers/md/dm-crypt.c | 9 ---------
> drivers/md/dm-io.c | 11 -----------
> drivers/md/dm.c | 20 --------------------
> drivers/md/md.c | 28 ++++------------------------
> drivers/target/target_core_iblock.c | 9 ---------
> fs/bio.c | 31 +++++++++++++------------------
> include/linux/blk_types.h | 3 +++
> 8 files changed, 21 insertions(+), 103 deletions(-)
>

<SNIP>

> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index 76db75e..e58cd7d 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -543,14 +543,6 @@ static void iblock_complete_cmd(struct se_cmd *cmd)
> kfree(ibr);
> }
>
> -static void iblock_bio_destructor(struct bio *bio)
> -{
> - struct se_cmd *cmd = bio->bi_private;
> - struct iblock_dev *ib_dev = cmd->se_dev->dev_ptr;
> -
> - bio_free(bio, ib_dev->ibd_bio_set);
> -}
> -
> static struct bio *
> iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
> {
> @@ -572,7 +564,6 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
>
> bio->bi_bdev = ib_dev->ibd_bd;
> bio->bi_private = cmd;
> - bio->bi_destructor = iblock_bio_destructor;
> bio->bi_end_io = &iblock_bio_done;
> bio->bi_sector = lba;
> return bio;

The change to drop iblock_bio_destructor() looks fine to me. For the
drivers/target/ bit, feel free to add:

Acked-by: Nicholas Bellinger <[email protected]>

Nice work Kent!

2012-08-22 21:30:22

by Vivek Goyal

Subject: Re: [PATCH v6 02/13] dm: Use bioset's front_pad for dm_rq_clone_bio_info

On Wed, Aug 22, 2012 at 10:03:59AM -0700, Kent Overstreet wrote:
> Previously, dm_rq_clone_bio_info needed to be freed by the bio's
> destructor to avoid a memory leak in the blk_rq_prep_clone() error path.
> This gets rid of a memory allocation and means we can kill
> dm_rq_bio_destructor.
>
> v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Alasdair Kergon <[email protected]>
> ---
> drivers/md/dm.c | 39 +++++++++++----------------------------
> 1 file changed, 11 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 0c3d6dd..5ed9779 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -86,12 +86,17 @@ struct dm_rq_target_io {
> };
>
> /*
> - * For request-based dm.
> - * One of these is allocated per bio.
> + * For request-based dm - the bio clones we allocate are embedded in these
> + * structs.
> + *
> + * We allocate these with bio_alloc_bioset, using the front_pad parameter when
> + * the bioset is created - this means the bio has to come at the end of the
> + * struct.
> */
> struct dm_rq_clone_bio_info {
> struct bio *orig;
> struct dm_rq_target_io *tio;
> + struct bio clone;
> };
>
> union map_info *dm_get_mapinfo(struct bio *bio)
> @@ -467,16 +472,6 @@ static void free_rq_tio(struct dm_rq_target_io *tio)
> mempool_free(tio, tio->md->tio_pool);
> }
>
> -static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md)
> -{
> - return mempool_alloc(md->io_pool, GFP_ATOMIC);
> -}
> -
> -static void free_bio_info(struct dm_rq_clone_bio_info *info)
> -{
> - mempool_free(info, info->tio->md->io_pool);
> -}
> -

With this change, do you still need "_rq_bio_info_cache" slab cache? I would
think that it can be cleaned up now?

Thanks
Vivek

2012-08-23 18:00:56

by Vivek Goyal

Subject: Re: [PATCH v6 00/13] Block cleanups

On Wed, Aug 22, 2012 at 10:03:57AM -0700, Kent Overstreet wrote:

[..]
> Kent Overstreet (13):
> block: Generalized bio pool freeing
> dm: Use bioset's front_pad for dm_rq_clone_bio_info
> block: Add bio_reset()
> pktcdvd: Switch to bio_kmalloc()
> block: Kill bi_destructor
> block: Consolidate bio_alloc_bioset(), bio_kmalloc()

This patch series says "Block Cleanups" in the subject. I think that
it can be divided in two parts. First part seems to be cleanup where
we get rid of ->bi_destructor and merge kmalloc() and allocation
from bio_set.

Second part seems to be more of enhancing bio splitting functionality
and not necessarily a cleanup. So maybe breaking this series in two
parts makes sense. First part is cleanup and second part is enhanced
bio splitting capabilities of block layer.

Thanks
Vivek

> block: Avoid deadlocks with bio allocation by stacking drivers
> block: Add an explicit bio flag for bios that own their bvec
> block: Rename bio_split() -> bio_pair_split()
> block: Introduce new bio_split()
> block: Rework bio_pair_split()
> block: Add bio_clone_bioset(), bio_clone_kmalloc()
> block: Only clone bio vecs that are in use
>
> Documentation/block/biodoc.txt | 5 -
> block/blk-core.c | 10 +-
> drivers/block/drbd/drbd_main.c | 13 +-
> drivers/block/drbd/drbd_req.c | 18 +-
> drivers/block/osdblk.c | 3 +-
> drivers/block/pktcdvd.c | 73 ++-----
> drivers/block/rbd.c | 8 +-
> drivers/md/dm-crypt.c | 9 -
> drivers/md/dm-io.c | 11 -
> drivers/md/dm.c | 68 ++----
> drivers/md/linear.c | 6 +-
> drivers/md/md.c | 44 +---
> drivers/md/raid0.c | 8 +-
> drivers/md/raid10.c | 23 +-
> drivers/target/target_core_iblock.c | 9 -
> fs/bio-integrity.c | 45 ----
> fs/bio.c | 419 +++++++++++++++++++++++-------------
> fs/exofs/ore.c | 5 +-
> include/linux/bio.h | 155 +++++++------
> include/linux/blk_types.h | 16 +-
> 20 files changed, 434 insertions(+), 514 deletions(-)
>
> --
> 1.7.12

2012-08-24 02:26:15

by Martin K. Petersen

Subject: Re: [PATCH v6 11/13] block: Rework bio_pair_split()

>>>>> "Tejun" == Tejun Heo <[email protected]> writes:

Tejun> I complained about this in the last posting and in the previous
Tejun> patch. Please respond. Martin, are you okay with these
Tejun> integrity changes?

I missed the first several iterations of all this while I was out on
vacation. I'll have to try to wrap my head around the new approach.

However, I'm not sure I like the overall approach of the new splitting.
Instead of all this cloning, slicing and dicing of bio_vecs I'd rather
we bit the bullet and had an offset + length for the vector inside each
bio. That way we could keep the bio_vec immutable and make clones more
lightweight since their vecs would always point to the parent. This also
makes it trivial to split I/Os in the stacking drivers and removes evils
in the partial completion code path. It would also allow to sever the
ties between "size of block range operated on" vs. bi_size which we need
for copy offload, discard, etc.

--
Martin K. Petersen Oracle Linux Engineering

2012-08-24 05:04:38

by Kent Overstreet

Subject: Re: [PATCH v6 06/13] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

On Wed, Aug 22, 2012 at 01:17:30PM -0700, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 22, 2012 at 10:04:03AM -0700, Kent Overstreet wrote:
> > Previously, bio_kmalloc() and bio_alloc_bioset() behaved slightly
> > differently because there was some almost-duplicated code - this fixes
> > that issue.
>
> What were those slight differences? Why is it safe to change the
> behaviors to match each other?

I explained that in another email, but I should've had all that in the
patch description. Updated patch below.

> > struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > {
> > + unsigned front_pad;
> > + unsigned inline_vecs;
> > unsigned long idx = BIO_POOL_NONE;
> > struct bio_vec *bvl = NULL;
> > struct bio *bio;
> > void *p;
> >
> > - p = mempool_alloc(bs->bio_pool, gfp_mask);
> > + if (nr_iovecs > UIO_MAXIOV)
> > + return NULL;
>
> This test used to only happen for bio_kmalloc(). If I follow the code
> I can see that UIO_MAXIOV is larger than BIOVEC_MAX_IDX, so this
> doesn't really affect the capability of bioset allocs; however, given
> that bioset allocation already checks against BIOVEC_MAX_IDX, I don't
> see why this test is now also applied to them.

Having arbitrary limits that are different for kmalloced bios and bioset
allocated bios seems _very_ sketchy to me. I tend to think that
UIO_MAXIOV check shouldn't be there at all... but if it is IMO it makes
slightly more sense for it to apply to all bio allocations.

As you mentioned it doesn't affect the behaviour of the code, but
supposing UIO_MAXIOV was less than BIO_MAX_PAGES, whatever was depending
on that check would then implicitly depend on the bios not being
allocated from a bioset. Ugly.

> And we lose /** comments on two exported functions and
> bio_alloc_bioset() comment doesn't explain that it now also handles
> NULL bioset case. Now that they share common implementation, you can
> update bio_alloc_bioset() and refer to it from its wrappers but please
> don't drop them like this.

So if I follow you, you're fine with dropping the comments from the
single line wrappers provided their information is rolled into
bio_alloc_bioset()'s documentation? That's what I should have done,
I'll do that now.


commit 3af8f6d2d6c7434f810c1919b68d8d89d948bb54
Author: Kent Overstreet <[email protected]>
Date: Thu Aug 23 21:49:23 2012 -0700

block: Consolidate bio_alloc_bioset(), bio_kmalloc()

Previously, bio_kmalloc() and bio_alloc_bioset() behaved slightly
differently because there was some almost-duplicated code - this fixes
some of that.

The important change is that previously bio_kmalloc() always set
bi_io_vec = bi_inline_vecs, even if nr_iovecs == 0 - unlike
bio_alloc_bioset(). This would cause bio_has_data() to return true; I
don't know if this resulted in any actual bugs but it was certainly
wrong.

bio_kmalloc() and bio_alloc_bioset() also have different arbitrary
limits on nr_iovecs - 1024 (UIO_MAXIOV) for bio_kmalloc(), 256
(BIO_MAX_PAGES) for bio_alloc_bioset(). This patch doesn't fix that
issue, but at least said arbitrary limits are now applied consistently.

This'll also help with some future cleanups - there are a fair number of
functions that allocate bios (e.g. bio_clone()), and now they don't have
to be duplicated for bio_alloc(), bio_alloc_bioset(), and bio_kmalloc().

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
v7: Re-add dropped comments, improve patch description

diff --git a/fs/bio.c b/fs/bio.c
index abfb135..0c8432a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -55,8 +55,7 @@ static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
* IO code that does not need private memory pools.
*/
struct bio_set *fs_bio_set;
-
-#define BIO_KMALLOC_POOL NULL
+EXPORT_SYMBOL(fs_bio_set);

/*
* Our slab pool management
@@ -290,39 +289,58 @@ EXPORT_SYMBOL(bio_reset);
* @bs: the bio_set to allocate from.
*
* Description:
- * bio_alloc_bioset will try its own mempool to satisfy the allocation.
- * If %__GFP_WAIT is set then we will block on the internal pool waiting
- * for a &struct bio to become free.
- **/
+ * If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
+ * backed by the @bs's mempool.
+ *
+ * When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
+ * able to allocate a bio. This is due to the mempool guarantees. To make this
+ * work, callers must never allocate more than 1 bio at a time from this pool.
+ * Callers that need to allocate more than 1 bio must always submit the
+ * previously allocated bio for IO before attempting to allocate a new one.
+ * Failure to do so can cause deadlocks under memory pressure.
+ *
+ * RETURNS:
+ * Pointer to new bio on success, NULL on failure.
+ */
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ unsigned front_pad;
+ unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
struct bio_vec *bvl = NULL;
struct bio *bio;
void *p;

- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ if (nr_iovecs > UIO_MAXIOV)
+ return NULL;
+
+ if (bs == BIO_KMALLOC_POOL) {
+ p = kmalloc(sizeof(struct bio) +
+ nr_iovecs * sizeof(struct bio_vec),
+ gfp_mask);
+ front_pad = 0;
+ inline_vecs = nr_iovecs;
+ } else {
+ p = mempool_alloc(bs->bio_pool, gfp_mask);
+ front_pad = bs->front_pad;
+ inline_vecs = BIO_INLINE_VECS;
+ }
+
if (unlikely(!p))
return NULL;
- bio = p + bs->front_pad;

+ bio = p + front_pad;
bio_init(bio);
- bio->bi_pool = bs;

- if (unlikely(!nr_iovecs))
- goto out_set;
-
- if (nr_iovecs <= BIO_INLINE_VECS) {
- bvl = bio->bi_inline_vecs;
- nr_iovecs = BIO_INLINE_VECS;
- } else {
+ if (nr_iovecs > inline_vecs) {
bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
if (unlikely(!bvl))
goto err_free;
-
- nr_iovecs = bvec_nr_vecs(idx);
+ } else if (nr_iovecs) {
+ bvl = bio->bi_inline_vecs;
}
-out_set:
+
+ bio->bi_pool = bs;
bio->bi_flags |= idx << BIO_POOL_OFFSET;
bio->bi_max_vecs = nr_iovecs;
bio->bi_io_vec = bvl;
@@ -334,63 +352,6 @@ err_free:
}
EXPORT_SYMBOL(bio_alloc_bioset);

-/**
- * bio_alloc - allocate a new bio, memory pool backed
- * @gfp_mask: allocation mask to use
- * @nr_iovecs: number of iovecs
- *
- * bio_alloc will allocate a bio and associated bio_vec array that can hold
- * at least @nr_iovecs entries. Allocations will be done from the
- * fs_bio_set. Also see @bio_alloc_bioset and @bio_kmalloc.
- *
- * If %__GFP_WAIT is set, then bio_alloc will always be able to allocate
- * a bio. This is due to the mempool guarantees. To make this work, callers
- * must never allocate more than 1 bio at a time from this pool. Callers
- * that need to allocate more than 1 bio must always submit the previously
- * allocated bio for IO before attempting to allocate a new one. Failure to
- * do so can cause livelocks under memory pressure.
- *
- * RETURNS:
- * Pointer to new bio on success, NULL on failure.
- */
-struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-{
- return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-}
-EXPORT_SYMBOL(bio_alloc);
-
-/**
- * bio_kmalloc - allocate a bio for I/O using kmalloc()
- * @gfp_mask: the GFP_ mask given to the slab allocator
- * @nr_iovecs: number of iovecs to pre-allocate
- *
- * Description:
- * Allocate a new bio with @nr_iovecs bvecs. If @gfp_mask contains
- * %__GFP_WAIT, the allocation is guaranteed to succeed.
- *
- **/
-struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-{
- struct bio *bio;
-
- if (nr_iovecs > UIO_MAXIOV)
- return NULL;
-
- bio = kmalloc(sizeof(struct bio) + nr_iovecs * sizeof(struct bio_vec),
- gfp_mask);
- if (unlikely(!bio))
- return NULL;
-
- bio_init(bio);
- bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
- bio->bi_max_vecs = nr_iovecs;
- bio->bi_io_vec = bio->bi_inline_vecs;
- bio->bi_pool = BIO_KMALLOC_POOL;
-
- return bio;
-}
-EXPORT_SYMBOL(bio_kmalloc);
-
void zero_fill_bio(struct bio *bio)
{
unsigned long flags;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 393c87e..b22c22b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -212,12 +212,24 @@ extern void bio_pair_release(struct bio_pair *dbio);
extern struct bio_set *bioset_create(unsigned int, unsigned int);
extern void bioset_free(struct bio_set *);

-extern struct bio *bio_alloc(gfp_t, unsigned int);
-extern struct bio *bio_kmalloc(gfp_t, unsigned int);
extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
extern void bio_free(struct bio *);

+extern struct bio_set *fs_bio_set;
+
+static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+{
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
+}
+
+#define BIO_KMALLOC_POOL NULL
+
+static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+{
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, BIO_KMALLOC_POOL);
+}
+
extern void bio_endio(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);
@@ -305,8 +317,6 @@ struct biovec_slab {
struct kmem_cache *slab;
};

-extern struct bio_set *fs_bio_set;
-
/*
* a small number of entries is fine, not going to be performance critical.
* basically we just need to survive

2012-08-24 05:10:14

by Kent Overstreet

Subject: Re: [PATCH v6 05/13] block: Kill bi_destructor

On Wed, Aug 22, 2012 at 01:00:32PM -0700, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 22, 2012 at 10:04:02AM -0700, Kent Overstreet wrote:
> > +#define BIO_KMALLOC_POOL NULL
>
> I would much prefer just doing
>
> if (!bs) {
> /* do kmalloc/kfree thing */
> } else {
> /* do bioset thing */
> }
>
> NULL @bs indicating no bioset is perfectly natural and so is using
> generic memory allocation in the absense of bioset. I don't see any
> value in defining Bio_KMALLOC_POOL to be NULL.

Eh, kind of disagree but at this point I don't care that much, I'll
change it.

2012-08-24 05:56:32

by Kent Overstreet

Subject: Re: [PATCH v6 07/13] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 22, 2012 at 01:30:03PM -0700, Tejun Heo wrote:
> This description needs to be several times more descriptive than it
> currently is. The deadlock issue at hand is very unobvious. Please
> explain it so that someone who isn't familiar with the problem can
> understand it by reading it.
>
> Also, please describe how the patch was tested.

New patch below, tell me how the new description reads?

> > @@ -306,16 +324,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > p = kmalloc(sizeof(struct bio) +
> > nr_iovecs * sizeof(struct bio_vec),
> > gfp_mask);
> > +
> > front_pad = 0;
> > inline_vecs = nr_iovecs;
> > } else {
> > - p = mempool_alloc(bs->bio_pool, gfp_mask);
> > + /*
> > + * If we're running under generic_make_request()
> > + * (current->bio_list != NULL), we risk deadlock if we sleep on
> > + * allocation and there's already bios on current->bio_list that
> > + * were allocated from the same bio_set; they won't be submitted
> > + * (and thus freed) as long as we're blocked here.
> > + *
> > + * To deal with this, we first try the allocation without using
> > + * the mempool; if that fails, we punt all the bios on
> > + * current->bio_list to a different thread and then retry the
> > + * allocation with the original gfp mask.
> > + */
>
> Ditto here.

Ok, tell me if the new version below is better.

>
> > + if (current->bio_list &&
> > + !bio_list_empty(current->bio_list) &&
> > + (gfp_mask & __GFP_WAIT))
> > + gfp_mask &= GFP_ATOMIC;
>
> Why aren't we turning off __GFP_WAIT instead? e.g. What if the caller
> has one of NUMA flags or __GFP_COLD specified?

Didn't think of that. The reason I did it that way is I wasn't sure if
just doing &= ~__GFP_WAIT would be correct, since that would leave
__GFP_IO|__GFP_FS set.

Look at how the mempool code uses the gfp flags, it's not clear to me
what happens in general if __GFP_WAIT isn't set but __GFP_IO is... but I
_think_ masking out __GFP_WAIT is sufficient and correct.

(The mempool code masks out __GFP_IO where it masks out __GFP_WAIT, but
it doesn't mask out __GFP_FS. Why would masking out __GFP_IO matter? And
if it does, why isn't __GFP_FS masked out? Inquiring minds want to know.

But at least for my purposes masking out __GFP_WAIT looks correct.

> > +retry:
> > + if (gfp_mask & __GFP_WAIT)
> > + p = mempool_alloc(bs->bio_pool, gfp_mask);
> > + else
> > + p = kmem_cache_alloc(bs->bio_slab, gfp_mask);
>
> Why is it necessary to avoid using the mempool if __GFP_WAIT is
> already cleared?

Now that I think about it, avoiding the mempool shouldn't be necessary
at all (and it's better not to since the mempool code adds various
flags, so allocating without those flags is at least inconsistent).

> Please don't mix struct definition relocation (or any relocation
> really) with actual changes. It buries the actual changes and makes
> reviewing difficult.

Make a new patch that does nothing more than reorder the definitions,
then?


commit d4351a8fc634d3ed95c27f633184f3f87ba5c754
Author: Kent Overstreet <[email protected]>
Date: Thu Aug 23 22:54:17 2012 -0700

block: Avoid deadlocks with bio allocation by stacking drivers

Previously, if we ever try to allocate more than once from the same bio
set while running under generic_make_request() (i.e. a stacking block
driver), we risk deadlock.

This is because of the code in generic_make_request() that converts
recursion to iteration; any bios we submit won't actually be submitted
(so they can complete and eventually be freed) until after we return -
this means if we allocate a second bio, we're blocking the first one
from ever being freed.

Thus if enough threads call into a stacking block driver at the same
time with bios that need multiple splits, and the bio_set's reserve gets
used up, we deadlock.

This can be worked around in the driver code - we could check if we're
running under generic_make_request(), then mask out __GFP_WAIT when we
go to allocate a bio, and if the allocation fails punt to workqueue and
retry the allocation.

But this is tricky and not a generic solution. This patch solves it for
all users by inverting the previously described technique. We allocate a
rescuer workqueue for each bio_set, and then in the allocation code if
there are bios on current->bio_list we would be blocking, we punt them
to the rescuer workqueue to be submitted.

Tested it by forcing the rescue codepath to be taken (by disabling the
first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
of arbitrary bio splitting) and verified that the rescuer was being
invoked.

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>

diff --git a/fs/bio.c b/fs/bio.c
index ed34526..f9f87d7 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -282,6 +282,23 @@ void bio_reset(struct bio *bio)
}
EXPORT_SYMBOL(bio_reset);

+static void bio_alloc_rescue(struct work_struct *work)
+{
+ struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+ struct bio *bio;
+
+ while (1) {
+ spin_lock(&bs->rescue_lock);
+ bio = bio_list_pop(&bs->rescue_list);
+ spin_unlock(&bs->rescue_lock);
+
+ if (!bio)
+ break;
+
+ generic_make_request(bio);
+ }
+}
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -304,6 +321,7 @@ EXPORT_SYMBOL(bio_reset);
*/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -318,16 +336,44 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
p = kmalloc(sizeof(struct bio) +
nr_iovecs * sizeof(struct bio_vec),
gfp_mask);
+
front_pad = 0;
inline_vecs = nr_iovecs;
} else {
+ /*
+ * generic_make_request() converts recursion to iteration; this
+ * means if we're running beneath it, any bios we allocate and
+ * submit will not be submitted (and thus freed) until after we
+ * return.
+ *
+ * This exposes us to a potential deadlock if we allocate
+ * multiple bios from the same bio_set() while running
+ * underneath generic_make_request(). If we were to allocate
+ * multiple bios (say a stacking block driver that was splitting
+ * bios), we would deadlock if we exhausted the mempool's
+ * reserve.
+ *
+ * We solve this, and guarantee forward progress, with a rescuer
+ * workqueue per bio_set. If we go to allocate and there are
+ * bios on current->bio_list, we first try the allocation
+ * without __GFP_WAIT; if that fails, we punt those bios we
+ * would be blocking to the rescuer workqueue before we retry
+ * with the original gfp_flags.
+ */
+
+ if (current->bio_list &&
+ !bio_list_empty(current->bio_list) &&
+ (gfp_mask & __GFP_WAIT))
+ gfp_mask &= ~__GFP_WAIT;
+retry:
p = mempool_alloc(bs->bio_pool, gfp_mask);
+
front_pad = bs->front_pad;
inline_vecs = BIO_INLINE_VECS;
}

if (unlikely(!p))
- return NULL;
+ goto err;

bio = p + front_pad;
bio_init(bio);
@@ -348,6 +394,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

err_free:
mempool_free(p, bs->bio_pool);
+err:
+ if (gfp_mask != saved_gfp) {
+ gfp_mask = saved_gfp;
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_merge(&bs->rescue_list, current->bio_list);
+ bio_list_init(current->bio_list);
+ spin_unlock(&bs->rescue_lock);
+
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ goto retry;
+ }
+
return NULL;
}
EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1556,6 +1615,9 @@ static void biovec_free_pools(struct bio_set *bs)

void bioset_free(struct bio_set *bs)
{
+ if (bs->rescue_workqueue)
+ destroy_workqueue(bs->rescue_workqueue);
+
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);

@@ -1591,6 +1653,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)

bs->front_pad = front_pad;

+ spin_lock_init(&bs->rescue_lock);
+ bio_list_init(&bs->rescue_list);
+ INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
if (!bs->bio_slab) {
kfree(bs);
@@ -1601,9 +1667,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
if (!bs->bio_pool)
goto bad;

- if (!biovec_create_pools(bs, pool_size))
- return bs;
+ if (biovec_create_pools(bs, pool_size))
+ goto bad;
+
+ bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+ if (!bs->rescue_workqueue)
+ goto bad;

+ return bs;
bad:
bioset_free(bs);
return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b22c22b..ba5b52e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */

-/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-#define BIO_POOL_SIZE 2
-#define BIOVEC_NR_POOLS 6
-#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
-
-struct bio_set {
- struct kmem_cache *bio_slab;
- unsigned int front_pad;
-
- mempool_t *bio_pool;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- mempool_t *bio_integrity_pool;
-#endif
- mempool_t *bvec_pool;
-};
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
-
#ifdef CONFIG_HIGHMEM
/*
* remember never ever reenable interrupts between a bvec_kmap_irq and
@@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
return bio;
}

+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
+
+struct bio_set {
+ struct kmem_cache *bio_slab;
+ unsigned int front_pad;
+
+ mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ mempool_t *bio_integrity_pool;
+#endif
+ mempool_t *bvec_pool;
+
+ /*
+ * Deadlock avoidance for stacking block drivers: see comments in
+ * bio_alloc_bioset() for details
+ */
+ spinlock_t rescue_lock;
+ struct bio_list rescue_list;
+ struct work_struct rescue_work;
+ struct workqueue_struct *rescue_workqueue;
+};
+
+struct biovec_slab {
+ int nr_vecs;
+ char *name;
+ struct kmem_cache *slab;
+};
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))

2012-08-24 06:24:55

by Kent Overstreet

Subject: Re: [PATCH v6 12/13] block: Add bio_clone_bioset(), bio_clone_kmalloc()

On Wed, Aug 22, 2012 at 02:07:40PM -0700, Tejun Heo wrote:
> On Wed, Aug 22, 2012 at 10:04:09AM -0700, Kent Overstreet wrote:
> > Previously, there was bio_clone() but it only allocated from the fs bio
> > set; as a result various users were open coding it and using
> > __bio_clone().
> >
> > This changes bio_clone() to become bio_clone_bioset(), and then we add
> > bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
> > the functionality the last patch added.
> >
> > This will also help in a later patch changing how bio cloning works.
>
> I'd prefer simply adding @bioset to bio_clone() so that the caller
> always has to make the choice consciously. We're updating all the
> callers anyway.

Possibly, but the btrfs code uses bio_clone() and there fs_bio_set may be
the correct choice (I'll have to look at what it's doing; if it's cloning
a bio that was allocated out of fs_bio_set, that would be bad...)

I would also prefer to simply drop bio_clone() so that
bio_clone_bioset() matches bio_alloc_bioset(), but regardless that'll
have to be a different patch (and I don't think I've had to update any
of the bio_clone() callers in this patch series anyways).

2012-08-24 07:05:45

by Kent Overstreet

Subject: Re: [PATCH v6 13/13] block: Only clone bio vecs that are in use

On Wed, Aug 22, 2012 at 02:10:45PM -0700, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, Aug 22, 2012 at 10:04:10AM -0700, Kent Overstreet wrote:
> > bcache creates large bios internally, and then splits them according to
> > the device requirements before it sends them down. If a lower level
> > device tries to clone the bio, and the original bio had more than
> > BIO_MAX_PAGES, the clone will fail unnecessarily.
> >
> > We can fix this by only cloning the bio vecs that are actually in use.
>
> I'm pretty sure I sound like a broken record by now, but
>
> * How was this tested?
>
> * What are the implications and possible dangers?

I've said all that on list, but I gather what you really wanted was to
have it all in the patch description. Will do.

Though actually, this patch by itself is pretty innocuous. It's my bio
splitting code that actually depends on the fields of struct bio
(bi_idx) being used consistently.

Anyways, tell me if the description below is better.

> > @@ -463,10 +468,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
> > bio->bi_sector = bio_src->bi_sector;
> > bio->bi_bdev = bio_src->bi_bdev;
> > bio->bi_flags |= 1 << BIO_CLONED;
> > + bio->bi_flags &= ~(1 << BIO_SEG_VALID);
>
> For the n'th time, explain please.

Argh, I could've sworn I dropped that part.


commit 0edda563aef9432b45f0c6a50f52590b92594560
Author: Kent Overstreet <[email protected]>
Date: Thu Aug 23 23:26:38 2012 -0700

block: Only clone bio vecs that are in use

bcache creates large bios internally, and then splits them according to
the device requirements before it sends them down. If a lower level
device tries to clone the bio, and the original bio had more than
BIO_MAX_PAGES, the clone will fail unnecessarily.

We can fix this by only cloning the bio vecs that are actually in use -
as far as the block layer is concerned the new bio is still equivalent
to the old bio.

This code should in general be safe as long as all the block layer code
uses bi_idx, bi_vcnt consistently; since bios are cloned by code that
doesn't own the original bio there's little room for issues caused by
code playing games with the original bio's bi_io_vec. One could perhaps
imagine code depending on the clone's and original bio's io vecs lining
up in a certain way, but auditing and testing haven't turned up anything.

Testing: This code has been in the bcache tree for quite a while, and has
been tested with various md layers and dm targets (including strange
things like multipath).

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Sage Weil <[email protected]>

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 63e5852..5c3457f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -734,7 +734,7 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
}

while (old_chain && (total < len)) {
- tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
+ tmp = bio_kmalloc(gfpmask, bio_segments(old_chain));
if (!tmp)
goto err_out;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a3c38b9..3aeb108 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1080,11 +1080,10 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
{
struct bio *clone;

- clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
+ clone = bio_alloc_bioset(GFP_NOIO, bv_count, bs);
__bio_clone(clone, bio);
clone->bi_sector = sector;
- clone->bi_idx = idx;
- clone->bi_vcnt = idx + bv_count;
+ clone->bi_vcnt = bv_count;
clone->bi_size = to_bytes(len);
clone->bi_flags &= ~(1 << BIO_SEG_VALID);

diff --git a/fs/bio.c b/fs/bio.c
index 895e2a6..cd80710 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -467,11 +467,16 @@ EXPORT_SYMBOL(bio_phys_segments);
* Clone a &bio. Caller will own the returned bio, but not
* the actual data it points to. Reference count of returned
* bio will be one.
+ *
+ * We don't clone the entire bvec, just the part from bi_idx to bi_vcnt
+ * (i.e. what the bio currently points to, so the new bio is still
+ * equivalent to the old bio).
*/
void __bio_clone(struct bio *bio, struct bio *bio_src)
{
- memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
- bio_src->bi_max_vecs * sizeof(struct bio_vec));
+ memcpy(bio->bi_io_vec,
+ bio_iovec(bio_src),
+ bio_segments(bio_src) * sizeof(struct bio_vec));

/*
* most users will be overriding ->bi_bdev with a new target,
@@ -481,9 +486,8 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
bio->bi_bdev = bio_src->bi_bdev;
bio->bi_flags |= 1 << BIO_CLONED;
bio->bi_rw = bio_src->bi_rw;
- bio->bi_vcnt = bio_src->bi_vcnt;
+ bio->bi_vcnt = bio_segments(bio_src);
bio->bi_size = bio_src->bi_size;
- bio->bi_idx = bio_src->bi_idx;
}
EXPORT_SYMBOL(__bio_clone);

@@ -500,7 +504,7 @@ struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
{
struct bio *b;

- b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
+ b = bio_alloc_bioset(gfp_mask, bio_segments(bio), bs);
if (!b)
return NULL;

2012-08-24 07:15:25

by Kent Overstreet

Subject: Re: [PATCH v6 02/13] dm: Use bioset's front_pad for dm_rq_clone_bio_info

On Wed, Aug 22, 2012 at 05:30:10PM -0400, Vivek Goyal wrote:
> On Wed, Aug 22, 2012 at 10:03:59AM -0700, Kent Overstreet wrote:
> > Previously, dm_rq_clone_bio_info needed to be freed by the bio's
> > destructor to avoid a memory leak in the blk_rq_prep_clone() error path.
> > This gets rid of a memory allocation and means we can kill
> > dm_rq_bio_destructor.
> >
> > v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > CC: Alasdair Kergon <[email protected]>
> > ---
> > drivers/md/dm.c | 39 +++++++++++----------------------------
> > 1 file changed, 11 insertions(+), 28 deletions(-)
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 0c3d6dd..5ed9779 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -86,12 +86,17 @@ struct dm_rq_target_io {
> > };
> >
> > /*
> > - * For request-based dm.
> > - * One of these is allocated per bio.
> > + * For request-based dm - the bio clones we allocate are embedded in these
> > + * structs.
> > + *
> > + * We allocate these with bio_alloc_bioset, using the front_pad parameter when
> > + * the bioset is created - this means the bio has to come at the end of the
> > + * struct.
> > */
> > struct dm_rq_clone_bio_info {
> > struct bio *orig;
> > struct dm_rq_target_io *tio;
> > + struct bio clone;
> > };
> >
> > union map_info *dm_get_mapinfo(struct bio *bio)
> > @@ -467,16 +472,6 @@ static void free_rq_tio(struct dm_rq_target_io *tio)
> > mempool_free(tio, tio->md->tio_pool);
> > }
> >
> > -static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md)
> > -{
> > - return mempool_alloc(md->io_pool, GFP_ATOMIC);
> > -}
> > -
> > -static void free_bio_info(struct dm_rq_clone_bio_info *info)
> > -{
> > - mempool_free(info, info->tio->md->io_pool);
> > -}
> > -
>
> With this change, do you still need "_rq_bio_info_cache" slab cache? I would
> think that it can be cleaned up now?

It looks like it, but I'm hesitant to make more extensive changes to the
dm code given that I'm unfamiliar with it and I haven't been able to
personally test the request type dm target code.

That and the way io_pool is overloaded. I see too many ways I could
screw things up.

Also it looks like the equivalent change ought to be done with struct
dm_io first (then we'd have removed all the users of io_pool), but
honestly it takes me forever to do anything in the dm code so I'd rather
leave that to someone else.

2012-08-24 10:31:28

by Kent Overstreet

Subject: Re: [PATCH v6 11/13] block: Rework bio_pair_split()

On Wed, Aug 22, 2012 at 02:04:10PM -0700, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, Aug 22, 2012 at 10:04:08AM -0700, Kent Overstreet wrote:
> > This changes bio_pair_split() to use the new bio_split() underneath,
> > which gets rid of the single page bio limitation. The various callers
> > are fixed up for the slightly different struct bio_pair, and to remove
> > the unnecessary checks.
>
> This changes an existing API both in its interface and behavior and
> there's no detailed explanation on how it's changed and what are the
> implications.

Well, there were only two changes that affected the users: The single
page restriction is gone, and the bio1 and bio2 members became split and
orig, due to the new implementation.

I did note the first in the patch description, I'll also note the
second.

> > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> > index 1f5b483..63e5852 100644
> > --- a/drivers/block/rbd.c
> > +++ b/drivers/block/rbd.c
> > @@ -751,14 +751,13 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
> >
> > /* split the bio. We'll release it either in the next
> > call, or it will have to be released outside */
> > - bp = bio_pair_split(old_chain,
> > - (len - total) / SECTOR_SIZE);
> > + bp = bio_pair_split(old_chain, (len - total) / SECTOR_SIZE);
>
> Probably belongs to the previous patch which renamed bio_split() to
> bio_pair_split()?

Yeah.

> Another thing is, is this renaming really
> necessary? If you did,
>
> * s/bio_split()/bio_pair_split().
>
> * introduce better and prettier bio_split() which has
> different semantics.
>
> * replace bio_pair_split() users with bio_split().
>
> the renaming would have made sense, but you renamed an existing API,
> intrudced a new one and then changed the renamed old API. Doesn't
> make too much sense to me.

What you describe is almost exactly what I did, minus step 3.

IMO, bio_pair_split() should probably be deprecated because of the
shared mempool issue (which is fixable) and because chaining is
dangerous with arbitrary splitting and generally not what users want.

But converting the existing users over to the new bio_split() would be
very nontrivial, and without that we'd have two different
implementations of bio splitting in the kernel. So under the
circumstances, making the old bio_split() a wrapper around the new one
makes sense, IMO.

We can deprecate bio_pair_split() and remove existing usage one at a
time later.

>
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index 0f31ec4..9fa07c7 100644
> > --- a/drivers/md/raid10.c
> > +++ b/drivers/md/raid10.c
> > @@ -1080,15 +1080,9 @@ static void make_request(struct mddev *mddev, struct bio * bio)
> > && (conf->geo.near_copies < conf->geo.raid_disks
> > || conf->prev.near_copies < conf->prev.raid_disks))) {
> > struct bio_pair *bp;
> > - /* Sanity check -- queue functions should prevent this happening */
> > - if (bio->bi_vcnt != 1 ||
> > - bio->bi_idx != 0)
> > - goto bad_map;
> > - /* This is a one page bio that upper layers
> > - * refuse to split for us, so we need to split it.
> > - */
> > +
> > bp = bio_pair_split(bio,
> > - chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
> > + chunk_sects - (bio->bi_sector & (chunk_sects - 1)));
>
> I suppose this one too belongs to the previous rename patch?

Yeah, I'll move it.

>
> > --- a/fs/bio-integrity.c
> > +++ b/fs/bio-integrity.c
> > @@ -681,50 +681,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
> > EXPORT_SYMBOL(bio_integrity_trim);
> >
> > /**
> > - * bio_integrity_split - Split integrity metadata
> > - * @bio: Protected bio
> > - * @bp: Resulting bio_pair
> > - * @sectors: Offset
> > - *
> > - * Description: Splits an integrity page into a bio_pair.
> > - */
> > -void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
> > -{
> > - struct blk_integrity *bi;
> > - struct bio_integrity_payload *bip = bio->bi_integrity;
> > - unsigned int nr_sectors;
> > -
> > - if (bio_integrity(bio) == 0)
> > - return;
> > -
> > - bi = bdev_get_integrity(bio->bi_bdev);
> > - BUG_ON(bi == NULL);
> > - BUG_ON(bip->bip_vcnt != 1);
> > -
> > - nr_sectors = bio_integrity_hw_sectors(bi, sectors);
> > -
> > - bp->bio1.bi_integrity = &bp->bip1;
> > - bp->bio2.bi_integrity = &bp->bip2;
> > -
> > - bp->iv1 = bip->bip_vec[0];
> > - bp->iv2 = bip->bip_vec[0];
> > -
> > - bp->bip1.bip_vec[0] = bp->iv1;
> > - bp->bip2.bip_vec[0] = bp->iv2;
> > -
> > - bp->iv1.bv_len = sectors * bi->tuple_size;
> > - bp->iv2.bv_offset += sectors * bi->tuple_size;
> > - bp->iv2.bv_len -= sectors * bi->tuple_size;
> > -
> > - bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
> > - bp->bip2.bip_sector = bio->bi_integrity->bip_sector + nr_sectors;
> > -
> > - bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
> > - bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
> > -}
> > -EXPORT_SYMBOL(bio_integrity_split);
>
> I complained about this in the last posting and in the previous patch.
> Please respond. Martin, are you okay with these integrity changes?

I did respond. I said more before, but in short the old
bio_integrity_split() only worked for single page bios, and thus wasn't
useful even as a starting point.

What my bio_split() does with the integrity is basically what dm does
(that's the only other place bio_integrity_trim() is used).

Sometimes I'm not sure if you read all my emails...

>
> > -static void bio_pair_end_1(struct bio *bi, int err)
> > +static void bio_pair_end(struct bio *bio, int error)
> > {
> > - struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
> > + struct bio_pair *bp = bio->bi_private;
> >
> > - if (err)
> > - bp->error = err;
> > -
> > - bio_pair_release(bp);
> > -}
> > -
> > -static void bio_pair_end_2(struct bio *bi, int err)
> > -{
> > - struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);
> > -
> > - if (err)
> > - bp->error = err;
> > + if (error)
> > + clear_bit(BIO_UPTODATE, &bp->orig->bi_flags);
> >
> > bio_pair_release(bp);
> > }
>
> Why is losing error value okay here?

I suppose there's no good reason to and the old code passed it up.
Fixing that.

>
> > @@ -1856,8 +1846,7 @@ static int __init init_bio(void)
> > if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
> > panic("bio: can't create integrity pool\n");
> >
> > - bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
> > - sizeof(struct bio_pair));
> > + bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
> > if (!bio_split_pool)
> > panic("bio: can't create split pool\n");
>
> It would be nice to mention that using this from stacking drivers is
> inherently broken. This is something which has been broken before
> this patch but still. bio_split*() should always require separate
> biosets.

Yeah. I'll make a note of it. Fixing it is going to have to come later -
but as I mentioned earlier I'd prefer to just get rid of it.

Shouldn't be _that_ hard to get rid of it; it's just going to have to
be its own patch series.

> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 1c3bb47..3ad3540 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -192,14 +192,13 @@ struct bio_integrity_payload {
> > * in bio2.bi_private
> > */
> > struct bio_pair {
> > - struct bio bio1, bio2;
> > - struct bio_vec bv1, bv2;
> > -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> > - struct bio_integrity_payload bip1, bip2;
> > - struct bio_vec iv1, iv2;
> > -#endif
> > - atomic_t cnt;
> > - int error;
> > + atomic_t cnt;
> > +
> > + bio_end_io_t *bi_end_io;
> > + void *bi_private;
> > +
> > + struct bio *orig;
> > + struct bio split;
> > };
>
> So, this is struct is allocated as frontpad, which is a pretty unusual
> thing to do. Please explain and emphasize that ->split should come
> last.

Will do.

> Also, given that it's a pair split, it would be nice to somehow
> indicate that ->split is the earlier half. Before this change it was
> pretty clear with ->bio1/2.

Fixed the comment to indicate that too.


commit 5db0562e43099d06ed9488e7ae6bf0a0cc6ef5da
Author: Kent Overstreet <[email protected]>
Date: Fri Aug 24 03:27:24 2012 -0700

block: Rework bio_pair_split()

This changes bio_pair_split() to use the new bio_split() underneath.

This has two user visible changes: First, it's not restricted to single
page bios anymore. Secondly, where before callers used bio1 and bio2 in
struct bio_pair, now they use split and orig - this being a result of
the new implementation.

bio_integrity_split() is removed, as the old bio_pair_split() was the
only user; the new bio_split() uses bio_integrity_clone/trim much like
the dm code does.

v5: Move extern declaration to proper patch, per Boaz

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Lars Ellenberg <[email protected]>
CC: Peter Osterlund <[email protected]>
CC: Sage Weil <[email protected]>
CC: Martin K. Petersen <[email protected]>

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index fbb0471..7d3e662 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1122,18 +1122,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
do {
inc_ap_bio(mdev, 1);
} while (drbd_make_request_common(mdev, bio, start_time));
- return;
- }
-
- /* can this bio be split generically?
- * Maybe add our own split-arbitrary-bios function. */
- if (bio->bi_vcnt != 1 || bio->bi_idx != 0 || bio->bi_size > DRBD_MAX_BIO_SIZE) {
- /* rather error out here than BUG in bio_split */
- dev_err(DEV, "bio would need to, but cannot, be split: "
- "(vcnt=%u,idx=%u,size=%u,sector=%llu)\n",
- bio->bi_vcnt, bio->bi_idx, bio->bi_size,
- (unsigned long long)bio->bi_sector);
- bio_endio(bio, -EINVAL);
} else {
/* This bio crosses some boundary, so we have to split it. */
struct bio_pair *bp;
@@ -1160,10 +1148,10 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)

D_ASSERT(e_enr == s_enr + 1);

- while (drbd_make_request_common(mdev, &bp->bio1, start_time))
+ while (drbd_make_request_common(mdev, &bp->split, start_time))
inc_ap_bio(mdev, 1);

- while (drbd_make_request_common(mdev, &bp->bio2, start_time))
+ while (drbd_make_request_common(mdev, bio, start_time))
inc_ap_bio(mdev, 1);

dec_ap_bio(mdev);
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 310d699..66e2fe5 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2468,8 +2468,8 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
first_sectors = last_zone - bio->bi_sector;
bp = bio_pair_split(bio, first_sectors);
BUG_ON(!bp);
- pkt_make_request(q, &bp->bio1);
- pkt_make_request(q, &bp->bio2);
+ pkt_make_request(q, &bp->split);
+ pkt_make_request(q, bio);
bio_pair_release(bp);
return;
}
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b4d106e..63e5852 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -755,9 +755,9 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
if (!bp)
goto err_out;

- __bio_clone(tmp, &bp->bio1);
+ __bio_clone(tmp, &bp->split);

- *next = &bp->bio2;
+ *next = bp->orig;
} else {
__bio_clone(tmp, old_chain);
*next = old_chain->bi_next;
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index e860cb9..7c6cafd 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -316,8 +316,8 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)

bp = bio_pair_split(bio, end_sector - bio->bi_sector);

- linear_make_request(mddev, &bp->bio1);
- linear_make_request(mddev, &bp->bio2);
+ linear_make_request(mddev, &bp->split);
+ linear_make_request(mddev, bio);
bio_pair_release(bp);
return;
}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index c89c8aa..3469adf 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -520,9 +520,9 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
(chunk_sects-1)));
else
bp = bio_pair_split(bio, chunk_sects -
- sector_div(sector, chunk_sects));
- raid0_make_request(mddev, &bp->bio1);
- raid0_make_request(mddev, &bp->bio2);
+ sector_div(sector, chunk_sects));
+ raid0_make_request(mddev, &bp->split);
+ raid0_make_request(mddev, bio);
bio_pair_release(bp);
return;
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b770f3e..9fa07c7 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1080,13 +1080,7 @@ static void make_request(struct mddev *mddev, struct bio * bio)
&& (conf->geo.near_copies < conf->geo.raid_disks
|| conf->prev.near_copies < conf->prev.raid_disks))) {
struct bio_pair *bp;
- /* Sanity check -- queue functions should prevent this happening */
- if (bio->bi_vcnt != 1 ||
- bio->bi_idx != 0)
- goto bad_map;
- /* This is a one page bio that upper layers
- * refuse to split for us, so we need to split it.
- */
+
bp = bio_pair_split(bio,
chunk_sects - (bio->bi_sector & (chunk_sects - 1)));

@@ -1102,8 +1096,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
conf->nr_waiting++;
spin_unlock_irq(&conf->resync_lock);

- make_request(mddev, &bp->bio1);
- make_request(mddev, &bp->bio2);
+ make_request(mddev, &bp->split);
+ make_request(mddev, bio);

spin_lock_irq(&conf->resync_lock);
conf->nr_waiting--;
@@ -1112,13 +1106,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)

bio_pair_release(bp);
return;
- bad_map:
- printk("md/raid10:%s: make_request bug: can't convert block across chunks"
- " or bigger than %dk %llu %d\n", mdname(mddev), chunk_sects/2,
- (unsigned long long)bio->bi_sector, bio->bi_size >> 10);
-
- bio_io_error(bio);
- return;
}

md_write_start(mddev, bio);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 35ee3d4..08a9e12c 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -681,50 +681,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
EXPORT_SYMBOL(bio_integrity_trim);

/**
- * bio_integrity_split - Split integrity metadata
- * @bio: Protected bio
- * @bp: Resulting bio_pair
- * @sectors: Offset
- *
- * Description: Splits an integrity page into a bio_pair.
- */
-void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
-{
- struct blk_integrity *bi;
- struct bio_integrity_payload *bip = bio->bi_integrity;
- unsigned int nr_sectors;
-
- if (bio_integrity(bio) == 0)
- return;
-
- bi = bdev_get_integrity(bio->bi_bdev);
- BUG_ON(bi == NULL);
- BUG_ON(bip->bip_vcnt != 1);
-
- nr_sectors = bio_integrity_hw_sectors(bi, sectors);
-
- bp->bio1.bi_integrity = &bp->bip1;
- bp->bio2.bi_integrity = &bp->bip2;
-
- bp->iv1 = bip->bip_vec[0];
- bp->iv2 = bip->bip_vec[0];
-
- bp->bip1.bip_vec[0] = bp->iv1;
- bp->bip2.bip_vec[0] = bp->iv2;
-
- bp->iv1.bv_len = sectors * bi->tuple_size;
- bp->iv2.bv_offset += sectors * bi->tuple_size;
- bp->iv2.bv_len -= sectors * bi->tuple_size;
-
- bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
- bp->bip2.bip_sector = bio->bi_integrity->bip_sector + nr_sectors;
-
- bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
- bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
-}
-EXPORT_SYMBOL(bio_integrity_split);
-
-/**
* bio_integrity_clone - Callback for cloning bios with integrity metadata
* @bio: New bio
* @bio_src: Original bio
diff --git a/fs/bio.c b/fs/bio.c
index 0f76172..bff9a8e 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -37,7 +37,7 @@
*/
#define BIO_INLINE_VECS 4

-static mempool_t *bio_split_pool __read_mostly;
+static struct bio_set *bio_split_pool __read_mostly;

/*
* if you change this list, also change bvec_alloc or things will
@@ -1478,30 +1478,27 @@ void bio_endio(struct bio *bio, int error)
}
EXPORT_SYMBOL(bio_endio);

+/**
+ * bio_pair_release - drop refcount on a bio_pair
+ *
+ * This is called by bio_pair_split's endio function, and also must be called by
+ * the caller of bio_pair_split().
+ */
void bio_pair_release(struct bio_pair *bp)
{
if (atomic_dec_and_test(&bp->cnt)) {
- struct bio *master = bp->bio1.bi_private;
+ bp->orig->bi_end_io = bp->bi_end_io;
+ bp->orig->bi_private = bp->bi_private;

- bio_endio(master, bp->error);
- mempool_free(bp, bp->bio2.bi_private);
+ bio_endio(bp->orig, bp->error);
+ bio_put(&bp->split);
}
}
EXPORT_SYMBOL(bio_pair_release);

-static void bio_pair_end_1(struct bio *bi, int err)
-{
- struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
-
- if (err)
- bp->error = err;
-
- bio_pair_release(bp);
-}
-
-static void bio_pair_end_2(struct bio *bi, int err)
+static void bio_pair_endio(struct bio *bio, int err)
{
- struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);
+ struct bio_pair *bp = bio->bi_private;

if (err)
bp->error = err;
@@ -1509,49 +1506,46 @@ static void bio_pair_end_2(struct bio *bi, int err)
bio_pair_release(bp);
}

-/*
- * split a bio - only worry about a bio with a single page in its iovec
+/**
+ * bio_pair_split - split a bio, and chain the completions
+ * @bio: bio to split
+ * @sectors: number of sectors to split from the front of @bio
+ *
+ * This wraps bio_split(), and puts the split and the original bio in a struct
+ * bio_pair. It also hooks into the completions so the original bio will be
+ * completed once both splits have been completed.
+ *
+ * The caller will own a refcount on the returned bio_pair, which must be
+ * dropped with bio_pair_release().
+ *
+ * Note: this interface is not safe and could cause deadlocks when used with
+ * stacked virtual block devices - to be correct it should allow the caller to
+ * pass in a bio_set to use. It shouldn't be used to split a bio more than once
+ * either, for that use bio_split().
*/
-struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors)
+struct bio_pair *bio_pair_split(struct bio *bio, int sectors)
{
- struct bio_pair *bp = mempool_alloc(bio_split_pool, GFP_NOIO);
+ struct bio_pair *bp;
+ struct bio *split;

- if (!bp)
- return bp;
+ split = bio_split(bio, sectors, GFP_NOIO, bio_split_pool);
+ if (!split)
+ return NULL;

- trace_block_split(bdev_get_queue(bi->bi_bdev), bi,
- bi->bi_sector + first_sectors);
+ bp = container_of(split, struct bio_pair, split);

- BUG_ON(bi->bi_vcnt != 1);
- BUG_ON(bi->bi_idx != 0);
atomic_set(&bp->cnt, 3);
bp->error = 0;
- bp->bio1 = *bi;
- bp->bio2 = *bi;
- bp->bio2.bi_sector += first_sectors;
- bp->bio2.bi_size -= first_sectors << 9;
- bp->bio1.bi_size = first_sectors << 9;
-
- bp->bv1 = bi->bi_io_vec[0];
- bp->bv2 = bi->bi_io_vec[0];
- bp->bv2.bv_offset += first_sectors << 9;
- bp->bv2.bv_len -= first_sectors << 9;
- bp->bv1.bv_len = first_sectors << 9;
-
- bp->bio1.bi_io_vec = &bp->bv1;
- bp->bio2.bi_io_vec = &bp->bv2;
-
- bp->bio1.bi_max_vecs = 1;
- bp->bio2.bi_max_vecs = 1;
+ bp->orig = bio;

- bp->bio1.bi_end_io = bio_pair_end_1;
- bp->bio2.bi_end_io = bio_pair_end_2;
+ bp->bi_end_io = bio->bi_end_io;
+ bp->bi_private = bio->bi_private;

- bp->bio1.bi_private = bi;
- bp->bio2.bi_private = bio_split_pool;
+ bio->bi_private = bp;
+ bio->bi_end_io = bio_pair_endio;

- if (bio_integrity(bi))
- bio_integrity_split(bi, bp, first_sectors);
+ split->bi_private = bp;
+ split->bi_end_io = bio_pair_endio;

return bp;
}
@@ -1874,8 +1868,7 @@ static int __init init_bio(void)
if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
panic("bio: can't create integrity pool\n");

- bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
- sizeof(struct bio_pair));
+ bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
if (!bio_split_pool)
panic("bio: can't create split pool\n");

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3afef0b..090ce10 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -182,24 +182,23 @@ struct bio_integrity_payload {
#endif /* CONFIG_BLK_DEV_INTEGRITY */

/*
- * A bio_pair is used when we need to split a bio.
- * This can only happen for a bio that refers to just one
- * page of data, and in the unusual situation when the
- * page crosses a chunk/device boundary
- *
- * The address of the master bio is stored in bio1.bi_private
- * The address of the pool the pair was allocated from is stored
- * in bio2.bi_private
+ * A bio_pair is used for splitting a bio and chaining the completions. split is
+ * the first half of the split, representing first_sectors from the original
+ * bio; orig is updated to represent the second half.
*/
struct bio_pair {
- struct bio bio1, bio2;
- struct bio_vec bv1, bv2;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- struct bio_integrity_payload bip1, bip2;
- struct bio_vec iv1, iv2;
-#endif
+ bio_end_io_t *bi_end_io;
+ void *bi_private;
+
atomic_t cnt;
int error;
+
+ struct bio *orig;
+ /*
+ * Since this struct is allocated using the front_pad option of struct
+ * bio_set, split must come last.
+ */
+ struct bio split;
};

extern struct bio *bio_split(struct bio *bio, int sectors,
@@ -554,7 +553,6 @@ extern int bio_integrity_prep(struct bio *);
extern void bio_integrity_endio(struct bio *, int);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
-extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t, struct bio_set *);
extern int bioset_integrity_create(struct bio_set *, int);
extern void bioset_integrity_free(struct bio_set *);
@@ -598,12 +596,6 @@ static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
return 0;
}

-static inline void bio_integrity_split(struct bio *bio, struct bio_pair *bp,
- int sectors)
-{
- return;
-}
-
static inline void bio_integrity_advance(struct bio *bio,
unsigned int bytes_done)
{

2012-08-24 10:38:13

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v6 11/13] block: Rework bio_pair_split()

On Thu, Aug 23, 2012 at 10:25:47PM -0400, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <[email protected]> writes:
>
> Tejun> I complained about this in the last posting and in the previous
> Tejun> patch. Please respond. Martin, are you okay with these
> Tejun> integrity changes?
>
> I missed the first several iterations of all this while I was out on
> vacation. I'll have to try to wrap my head around the new approach.
>
> However, I'm not sure I like the overall approach of the new splitting.
> Instead of all this cloning, slicing and dicing of bio_vecs I'd rather
> we bit the bullet and had an offset + length for the vector inside each
> bio. That way we could keep the bio_vec immutable and make clones more
> lightweight since their vecs would always point to the parent. This also
> makes it trivial to split I/Os in the stacking drivers and removes evils
> in the partial completion code path. It would also allow to sever the
> ties between "size of block range operated on" vs. bi_size which we need
> for copy offload, discard, etc.

Agree 110% - making bio_vecs immutable and keeping the offset in the bio
is something I've been talking about for ages; I'd love to see it
happen.

But that's going to be a much more invasive change, so if I'm going to
do it (and I am willing to work on it) it's going to take a while. This
is really a stopgap solution.

As far as the integrity splitting goes, it's similar to what the
existing dm code does (the main difference is that dm already has the
bio cloned; my bio_split() doesn't assume anything about the bio being
split). Not sure how that affects ownership of the integrity data;
honestly that part kind of confuses me.

2012-08-24 12:47:38

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v6 00/13] Block cleanups

On Thu, Aug 23, 2012 at 02:00:41PM -0400, Vivek Goyal wrote:
> On Wed, Aug 22, 2012 at 10:03:57AM -0700, Kent Overstreet wrote:
>
> [..]
> > Kent Overstreet (13):
> > block: Generalized bio pool freeing
> > dm: Use bioset's front_pad for dm_rq_clone_bio_info
> > block: Add bio_reset()
> > pktcdvd: Switch to bio_kmalloc()
> > block: Kill bi_destructor
> > block: Consolidate bio_alloc_bioset(), bio_kmalloc()
>
> This patch series says "Block Cleanups" in the subject. I think that
> it can be divided into two parts. The first part seems to be cleanup,
> where we get rid of ->bi_destructor and merge kmalloc() and allocation
> from a bio_set.
>
> The second part seems to be more about enhancing bio splitting
> functionality and not necessarily a cleanup. So maybe breaking this
> series into two parts makes sense: the first part is cleanup and the
> second part is enhanced bio splitting capabilities of the block layer.

Yeah, that's a good idea - will do.

2012-08-24 18:40:47

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 02/13] dm: Use bioset's front_pad for dm_rq_clone_bio_info

On Fri, Aug 24, 2012 at 12:14:48AM -0700, Kent Overstreet wrote:

[..]
> > > -static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md)
> > > -{
> > > - return mempool_alloc(md->io_pool, GFP_ATOMIC);
> > > -}
> > > -
> > > -static void free_bio_info(struct dm_rq_clone_bio_info *info)
> > > -{
> > > - mempool_free(info, info->tio->md->io_pool);
> > > -}
> > > -
> >
> > With this change, do you still need "_rq_bio_info_cache" slab cache? I would
> > think that it can be cleaned up now?
>
> It looks like it, but I'm hesitant to make more extensive changes to the
> dm code given that I'm unfamiliar with it and I haven't been able to
> personally test the request type dm target code.
>
> That and the way io_pool is overloaded. I see too many ways I could
> screw things up.

I understand your concern, but if you leave it behind the job is only
half done. You moved rq_bio_info into the bio front padding but left the
associated cache and mempool behind. I would say we need to clean
them up and then get an ACK from the dm/md folks.

I am looking at the code and one thing which is not clear to me is
__bind_mempools() which assumes that md->io_pool is always set. With
your change md->io_pool is set only for BIO based targets and not
request based targets. So that will need some tidying up.

Testing a request-based target should be easy. Just enable multipath
for your SATA disk.


>
> Also it looks like the equivalent change ought to be done with struct
> dm_io first (then we'd have removed all the users of io_pool), but
> honestly it takes me forever to do anything in the dm code so I'd rather
> leave that to someone else.

I think we can leave io_pool behind; it just remains NULL for
request-based targets.

Thanks
Vivek

2012-08-24 20:08:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 06/13] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

Hello,

On Thu, Aug 23, 2012 at 10:04:00PM -0700, Kent Overstreet wrote:
> > > struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > > {
> > > + unsigned front_pad;
> > > + unsigned inline_vecs;
> > > unsigned long idx = BIO_POOL_NONE;
> > > struct bio_vec *bvl = NULL;
> > > struct bio *bio;
> > > void *p;
> > >
> > > - p = mempool_alloc(bs->bio_pool, gfp_mask);
> > > + if (nr_iovecs > UIO_MAXIOV)
> > > + return NULL;
> >
> > This test used to only happen for bio_kmalloc(). If I follow the code
> > I can see that UIO_MAXIOV is larger than BIOVEC_MAX_IDX, so this
> > doesn't really affect the capability of bioset allocs; however, given
> > that bioset allocation already checks against BIOVEC_MAX_IDX, I don't
> > see why this test is now also applied to them.
>
> Having arbitrary limits that are different for kmalloced bios and bioset
> allocated bios seems _very_ sketchy to me. I tend to think that the
> UIO_MAXIOV check shouldn't be there at all... but if it is, IMO it makes
> slightly more sense for it to apply to all bio allocations.
>
> As you mentioned it doesn't affect the behaviour of the code, but
> supposing UIO_MAXIOV was less than BIO_MAX_PAGES, whatever was depending
> on that check would then implicitly depend on the bios not being
> allocated from a bioset. Ugly.

Please keep UIO_MAXIOV test on kmalloc case only. If you want to
change that, please do that in a separate patch with its own
justification.

> > And we lose /** comments on two exported functions and
> > bio_alloc_bioset() comment doesn't explain that it now also handles
> > NULL bioset case. Now that they share common implementation, you can
> > update bio_alloc_bioset() and refer to it from its wrappers but please
> > don't drop them like this.
>
> So if I follow you, you're fine with dropping the comments from the
> single line wrappers provided their information is rolled into
> bio_alloc_bioset()'s documentation? That's what I should have done,
> I'll do that now.

Not really, for example, if you had

/* explain A in detail */
a() {};

and if you introduce __a() which does __A and make a its wrapper.

/* explain __A in detail */
__a() {};

/* explain A briefly and refer to __a() for details */
a() {};

Thanks.

--
tejun

2012-08-24 20:29:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 07/13] block: Avoid deadlocks with bio allocation by stacking drivers

Hello,

On Thu, Aug 23, 2012 at 10:55:54PM -0700, Kent Overstreet wrote:
> > Why aren't we turning off __GFP_WAIT instead? e.g. What if the caller
> > has one of NUMA flags or __GFP_COLD specified?
>
> Didn't think of that. The reason I did it that way is I wasn't sure if
> just doing &= ~__GFP_WAIT would be correct, since that would leave
> __GFP_IO|__GFP_FS set.

Using the appropriate __GFP_IO/FS flags is the caller's
responsibility. The only thing bioset needs to worry and take action
about here is __GFP_WAIT causing indefinite wait in mempool.

> > Plesae don't mix struct definition relocation (or any relocation
> > really) with actual changes. It buries the actual changes and makes
> > reviewing difficult.
>
> Make a new patch that does nothing more than reorder the definitions,
> then?

Yeap, with justification.

> block: Avoid deadlocks with bio allocation by stacking drivers
>
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.
>
> This can be worked around in the driver code - we could check if we're
> running under generic_make_request(), then mask out __GFP_WAIT when we
> go to allocate a bio, and if the allocation fails punt to workqueue and
> retry the allocation.
>
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.
>
> Tested it by forcing the rescue codepath to be taken (by disabling the
> first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
> of arbitrary bio splitting) and verified that the rescuer was being
> invoked.

Yeah, the description looks good to me.

> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> + gfp_t saved_gfp = gfp_mask;
> unsigned front_pad;
> unsigned inline_vecs;
> unsigned long idx = BIO_POOL_NONE;
> @@ -318,16 +336,44 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> p = kmalloc(sizeof(struct bio) +
> nr_iovecs * sizeof(struct bio_vec),
> gfp_mask);
> +
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */

Can you please add a comment in generic_make_request() to describe the
issue briefly and link back here?

> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)

Why is the conditional necessary? Is it possible to have a bioset w/o
rescue_workqueue?

> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);

This makes bioset_free() require process context, which probably is
okay for bioset but still isn't nice. Might worth noting in the patch
description.

Thanks.

--
tejun

2012-08-24 20:36:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 12/13] block: Add bio_clone_bioset(), bio_clone_kmalloc()

Hello,

On Thu, Aug 23, 2012 at 11:24:18PM -0700, Kent Overstreet wrote:
> > I'd prefer simply adding @bioset to bio_clone() so that the caller
> > always has to make the choice consciously. We're updating all the
> > callers anyway.
>
> Possibly, but the btrfs code uses bio_clone(), and there fs_bio_set may
> be correct (I'll have to look at what it's doing; if it's cloning a bio
> that was allocated out of fs_bio_set, that would be bad..)

Yeah, I think it's generally a good idea to require explicit bioset
specification even if that ends up being fs_bio_set or NULL.

> I would also prefer to simply drop bio_clone() so that
> bio_clone_bioset() matches bio_alloc_bioset(), but regardless that'll
> have to be a different patch (and I don't think I've had to update any
> of the bio_clone() callers in this patch series anyways).

Ooh yeah, bio_clone_bioset() would be the better name for it.

Thanks.

--
tejun

2012-08-24 20:42:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 13/13] block: Only clone bio vecs that are in use

Hello, Kent.

On Fri, Aug 24, 2012 at 12:05:08AM -0700, Kent Overstreet wrote:
> > I'm pretty sure I sound like a broken record by now, but
> >
> > * How was this tested?
> >
> > * What are the implications and possible dangers?
>
> I've said all that on list, but I gather what you really wanted was to
> have it all in the patch description. Will do.

Yeap.

> > > @@ -463,10 +468,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
> > > bio->bi_sector = bio_src->bi_sector;
> > > bio->bi_bdev = bio_src->bi_bdev;
> > > bio->bi_flags |= 1 << BIO_CLONED;
> > > + bio->bi_flags &= ~(1 << BIO_SEG_VALID);
> >
> > For the n'th time, explain please.
>
> Argh, I could've sworn I dropped that part.

Can we drop it tho? If we're changing bvecs, we probably should be
clearing SEG_VALID on both bios.

> commit 0edda563aef9432b45f0c6a50f52590b92594560
> Author: Kent Overstreet <[email protected]>
> Date: Thu Aug 23 23:26:38 2012 -0700
>
> block: Only clone bio vecs that are in use
>
> bcache creates large bios internally, and then splits them according to
> the device requirements before it sends them down. If a lower level
> device tries to clone the bio, and the original bio had more than
> BIO_MAX_PAGES, the clone will fail unnecessarily.
>
> We can fix this by only cloning the bio vecs that are actually in use -
> as far as the block layer is concerned the new bio is still equivalent
> to the old bio.
>
> This code should in general be safe as long as all the block layer code
> uses bi_idx, bi_vcnt consistently; since bios are cloned by code that
> doesn't own the original bio there's little room for issues caused by
> code playing games with the original bio's bi_io_vec. One could perhaps
> imagine code depending on the clone's and the original bio's io vecs
> lining up a certain way, but auditing and testing haven't turned up
> anything.
>
> Testing: This code has been in the bcache tree for quite a while, and
> has been tested with various md layers and dm targets (including
> strange things like multipath).

Yeap, looks much better to me.

Thanks.

--
tejun

2012-08-24 20:53:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 11/13] block: Rework bio_pair_split()

Hello,

On Fri, Aug 24, 2012 at 03:30:49AM -0700, Kent Overstreet wrote:
> > I complained about this in the last posting and in the previous patch.
> > Please respond. Martin, are you okay with these integrity changes?
>
> I did respond. I said more before, but in short the old
> bio_integrity_split() only worked for single-page bios, and thus wasn't
> useful even as a starting point.
>
> What my bio_split() does with the integrity is actually basically what
> dm does (that's the only other place bio_integrity_trim() is used).

And all that information should be in the patch description. Imagine
this patch hitting some enterprise deployment, say four years from
now, and causing a weird performance regression, and someone succeeding
in bisecting it down to this commit. None of us would remember
much about this and could in fact be doing something completely
different. Patches should at least try to record their visible
implications.

This type of information should be always attached to the patch not
only for future references but also to alert reviewers who aren't
necessarily following the whole series and its iterations.

> > Also, given that it's a pair split, it would be nice to somehow
> > indicate that ->split is the earlier half. Before this change it was
> > pretty clear with ->bio1/2.
>
> Fixed the comment to indicate that too.

Any chance we can do that with field name?

> commit 5db0562e43099d06ed9488e7ae6bf0a0cc6ef5da
> Author: Kent Overstreet <[email protected]>
> Date: Fri Aug 24 03:27:24 2012 -0700
>
> block: Rework bio_pair_split()
>
> This changes bio_pair_split() to use the new bio_split() underneath.
>
> This has two user visible changes: First, it's not restricted to single
> page bios anymore. Second, where before callers used bio1 and bio2 in
> struct bio_pair, they now use split and orig - this being a result of
> the new implementation.
>
> bio_integrity_split() is removed, as the old bio_pair_split() was the
> only user; the new bio_split() uses bio_integrity_clone/trim much like
> the dm code does.

"which may affect XXX in YYY cases but I think it's okay because ZZZ."

If you aren't that sure,

"which may affect AAA but I don't think that's a big deal
because BBB. This definitely requires Martin's ack."

Thanks.

--
tejun

2012-08-24 20:58:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6 11/13] block: Rework bio_pair_split()

Hello, Martin.

On Thu, Aug 23, 2012 at 10:25:47PM -0400, Martin K. Petersen wrote:
> However, I'm not sure I like the overall approach of the new splitting.
> Instead of all this cloning, slicing and dicing of bio_vecs I'd rather
> we bit the bullet and had an offset + length for the vector inside each
> bio. That way we could keep the bio_vec immutable and make clones more
> lightweight since their vecs would always point to the parent. This also
> makes it trivial to split I/Os in the stacking drivers and removes evils
> in the partial completion code path. It would also allow to sever the
> ties between "size of block range operated on" vs. bi_size which we need
> for copy offload, discard, etc.

Yeah, I'm fairly sure we all want that, but that's gonna have to be a
separate not-so-small project. Also, how we split the underlying bvec
doesn't affect the split interface update being done here. This
patchset is updating split so that it can handle arbitrarily sized
bios, which is useful whether its implementation is using immutable
bvecs or COWing them.

Thanks.

--
tejun

2012-08-28 17:23:43

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v6 08/13] block: Add an explicit bio flag for bios that own their bvec

On Wed, Aug 22, 2012 at 11:00:38PM +0300, Adrian Bunk wrote:
> On Wed, Aug 22, 2012 at 12:22:41PM -0700, Kent Overstreet wrote:
> > On Wed, Aug 22, 2012 at 08:43:52PM +0300, Adrian Bunk wrote:
> > > On Wed, Aug 22, 2012 at 10:04:05AM -0700, Kent Overstreet wrote:
> > > >...
> > > > --- a/include/linux/blk_types.h
> > > > +++ b/include/linux/blk_types.h
> > > > @@ -117,6 +117,7 @@ struct bio {
> > > > * BIO_POOL_IDX()
> > > > */
> > > > #define BIO_RESET_BITS 12
> > > > +#define BIO_OWNS_VEC 12 /* bio_free() should free bvec */
> > > >...
> > >
> > > This doesn't look right.
> >
> > Well, the first 12 bits are reset, so bit 12 will get preserved... I
> > guess it's unusual to have a duplicated enum value like that, but
> > BIO_RESET_BITS is just a marker, not a real bit.
>
> Wouldn't a BIO_RESET_MASK be better than BIO_RESET_BITS?

Ehh... Might be good to have BIO_RESET_MASK too, but we really want
BIO_RESET_BITS with the flags as that makes it explicit which flags get
cleared and which don't. At first I was going to add BIO_RESET_MASK too,
but it just seemed to get ugly and verbose and didn't fit nicely
anywhere, so I'm dithering and not doing anything with it for now.

2012-08-28 23:19:58

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v6 04/13] pktcdvd: Switch to bio_kmalloc()

On Wed, 22 Aug 2012, Tejun Heo wrote:

> (cc'ing Jiri, hi!)

Hi there! :)

> On Wed, Aug 22, 2012 at 10:04:01AM -0700, Kent Overstreet wrote:
> > This is prep work for killing bi_destructor - previously, pktcdvd had
> > its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
> > necessitating its own bi_destructor implementation.
> >
> > v5: Un-reorder some functions, to make the patch easier to review
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > CC: Peter Osterlund <[email protected]>
>
> Apart from bio_reset() not resetting bi_end_io and bi_private, this
> looks fine to me but lack of testing makes me a bit hesitant to ack
> it.
>
> Peter doesn't want to be the maintainer of pktcdvd anymore. Maybe
> Jiri can be tricked into taking this one too? :)

Gah, seems like the floppy saga is going to cause me some more trouble in
the future ... :)

Well, I definitely have hardware back at home that supports this, so I can
take it. Doesn't seem to be heavy traffic code either.

Peter, ok with that? Also, how was this usually fed upstream -- through
Jens' tree?

--
Jiri Kosina
SUSE Labs

2012-08-29 04:35:37

by Peter Osterlund

[permalink] [raw]
Subject: Re: [PATCH v6 04/13] pktcdvd: Switch to bio_kmalloc()

On Tue, 28 Aug 2012, Jiri Kosina wrote:

> On Wed, 22 Aug 2012, Tejun Heo wrote:
>
>> (cc'ing Jiri, hi!)
>
> Hi there! :)
>
>> On Wed, Aug 22, 2012 at 10:04:01AM -0700, Kent Overstreet wrote:
>>> This is prep work for killing bi_destructor - previously, pktcdvd had
>>> its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
>>> necessitating its own bi_destructor implementation.
>>>
>>> v5: Un-reorder some functions, to make the patch easier to review
>>>
>>> Signed-off-by: Kent Overstreet <[email protected]>
>>> CC: Peter Osterlund <[email protected]>
>>
>> Apart from bio_reset() not resetting bi_end_io and bi_private, this
>> looks fine to me but lack of testing makes me a bit hesitant to ack
>> it.
>>
>> Peter doesn't want to be the maintainer of pktcdvd anymore. Maybe
>> Jiri can be tricked into taking this one too? :)
>
> Gah, seems like the floppy saga is going to cause me some more trouble in
> the future ... :)
>
> Well, I definitely have hardware back at home that supports this, so I can
> take it. Doesn't seem to be heavy traffic code either.
>
> Peter, ok with that? Also, how was this usually fed upstream -- through
> Jens' tree?

Yes, I am ok with that.

--
Peter Osterlund - [email protected], [email protected]
http://web.comhem.se/petero2home