2012-08-28 17:37:45

by Kent Overstreet

Subject: [PATCH v7 0/9] Block cleanups, deadlock fix

Since v6:

* Rebased onto Linus master
* Split off the bio splitting patches
* Various review feedback

Kent Overstreet (9):
block: Generalized bio pool freeing
dm: Use bioset's front_pad for dm_rq_clone_bio_info
block: Add bio_reset()
pktcdvd: Switch to bio_kmalloc()
block: Kill bi_destructor
block: Consolidate bio_alloc_bioset(), bio_kmalloc()
block: Add bio_clone_bioset(), bio_clone_kmalloc()
block: Reorder struct bio_set
block: Avoid deadlocks with bio allocation by stacking drivers

Documentation/block/biodoc.txt | 5 -
block/blk-core.c | 10 +-
drivers/block/drbd/drbd_main.c | 13 +-
drivers/block/osdblk.c | 3 +-
drivers/block/pktcdvd.c | 52 ++------
drivers/md/dm-crypt.c | 9 --
drivers/md/dm-io.c | 11 --
drivers/md/dm.c | 68 +++-------
drivers/md/md.c | 44 +------
drivers/target/target_core_iblock.c | 9 --
fs/bio.c | 243 ++++++++++++++++++++----------------
fs/exofs/ore.c | 5 +-
include/linux/bio.h | 111 ++++++++++------
include/linux/blk_types.h | 23 +++-
14 files changed, 260 insertions(+), 346 deletions(-)

--
1.7.12


2012-08-28 17:37:49

by Kent Overstreet

Subject: [PATCH v7 2/9] dm: Use bioset's front_pad for dm_rq_clone_bio_info

Previously, dm_rq_clone_bio_info needed to be freed by the bio's
destructor to avoid a memory leak in the blk_rq_prep_clone() error path.
This gets rid of a memory allocation and means we can kill
dm_rq_bio_destructor.

The _rq_bio_info_cache kmem cache is now unused and needs to be deleted,
but due to the way io_pool is used and overloaded this is not quite
trivial, so I'm leaving it for a later patch.
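
For illustration, the embedding pattern is roughly the following; this is a
minimal sketch using the names from the diff below, with error handling
omitted:

  /* bioset sized so the bio handed out is the 'clone' member: */
  pools->bs = bioset_create(pool_size,
                            offsetof(struct dm_rq_clone_bio_info, clone));

  /* later, the containing struct is recovered from the clone bio: */
  struct dm_rq_clone_bio_info *info =
          container_of(bio, struct dm_rq_clone_bio_info, clone);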

v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun

Signed-off-by: Kent Overstreet <[email protected]>
CC: Alasdair Kergon <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
drivers/md/dm.c | 44 ++++++++++++++++----------------------------
1 file changed, 16 insertions(+), 28 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 0c3d6dd..9206781 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -86,12 +86,17 @@ struct dm_rq_target_io {
};

/*
- * For request-based dm.
- * One of these is allocated per bio.
+ * For request-based dm - the bio clones we allocate are embedded in these
+ * structs.
+ *
+ * We allocate these with bio_alloc_bioset, using the front_pad parameter when
+ * the bioset is created - this means the bio has to come at the end of the
+ * struct.
*/
struct dm_rq_clone_bio_info {
struct bio *orig;
struct dm_rq_target_io *tio;
+ struct bio clone;
};

union map_info *dm_get_mapinfo(struct bio *bio)
@@ -211,6 +216,11 @@ struct dm_md_mempools {
static struct kmem_cache *_io_cache;
static struct kmem_cache *_tio_cache;
static struct kmem_cache *_rq_tio_cache;
+
+/*
+ * Unused now, and needs to be deleted. But since io_pool is overloaded and it's
+ * still used for _io_cache, I'm leaving this for a later cleanup
+ */
static struct kmem_cache *_rq_bio_info_cache;

static int __init local_init(void)
@@ -467,16 +477,6 @@ static void free_rq_tio(struct dm_rq_target_io *tio)
mempool_free(tio, tio->md->tio_pool);
}

-static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md)
-{
- return mempool_alloc(md->io_pool, GFP_ATOMIC);
-}
-
-static void free_bio_info(struct dm_rq_clone_bio_info *info)
-{
- mempool_free(info, info->tio->md->io_pool);
-}
-
static int md_in_flight(struct mapped_device *md)
{
return atomic_read(&md->pending[READ]) +
@@ -1460,30 +1460,17 @@ void dm_dispatch_request(struct request *rq)
}
EXPORT_SYMBOL_GPL(dm_dispatch_request);

-static void dm_rq_bio_destructor(struct bio *bio)
-{
- struct dm_rq_clone_bio_info *info = bio->bi_private;
- struct mapped_device *md = info->tio->md;
-
- free_bio_info(info);
- bio_free(bio, md->bs);
-}
-
static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
void *data)
{
struct dm_rq_target_io *tio = data;
- struct mapped_device *md = tio->md;
- struct dm_rq_clone_bio_info *info = alloc_bio_info(md);
-
- if (!info)
- return -ENOMEM;
+ struct dm_rq_clone_bio_info *info =
+ container_of(bio, struct dm_rq_clone_bio_info, clone);

info->orig = bio_orig;
info->tio = tio;
bio->bi_end_io = end_clone_bio;
bio->bi_private = info;
- bio->bi_destructor = dm_rq_bio_destructor;

return 0;
}
@@ -2718,7 +2705,8 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity)
if (!pools->tio_pool)
goto free_io_pool_and_out;

- pools->bs = bioset_create(pool_size, 0);
+ pools->bs = bioset_create(pool_size,
+ offsetof(struct dm_rq_clone_bio_info, clone));
if (!pools->bs)
goto free_tio_pool_and_out;

--
1.7.12

2012-08-28 17:37:54

by Kent Overstreet

Subject: [PATCH v7 4/9] pktcdvd: Switch to bio_kmalloc()

This is prep work for killing bi_destructor - previously, pktcdvd had
its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
necessitating its own bi_destructor implementation.

v5: Un-reorder some functions, to make the patch easier to review

Signed-off-by: Kent Overstreet <[email protected]>
CC: Peter Osterlund <[email protected]>
---
drivers/block/pktcdvd.c | 52 +++++++------------------------------------------
1 file changed, 7 insertions(+), 45 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index ba66e44..2e7de7a 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -522,38 +522,6 @@ static void pkt_bio_finished(struct pktcdvd_device *pd)
}
}

-static void pkt_bio_destructor(struct bio *bio)
-{
- kfree(bio->bi_io_vec);
- kfree(bio);
-}
-
-static struct bio *pkt_bio_alloc(int nr_iovecs)
-{
- struct bio_vec *bvl = NULL;
- struct bio *bio;
-
- bio = kmalloc(sizeof(struct bio), GFP_KERNEL);
- if (!bio)
- goto no_bio;
- bio_init(bio);
-
- bvl = kcalloc(nr_iovecs, sizeof(struct bio_vec), GFP_KERNEL);
- if (!bvl)
- goto no_bvl;
-
- bio->bi_max_vecs = nr_iovecs;
- bio->bi_io_vec = bvl;
- bio->bi_destructor = pkt_bio_destructor;
-
- return bio;
-
- no_bvl:
- kfree(bio);
- no_bio:
- return NULL;
-}
-
/*
* Allocate a packet_data struct
*/
@@ -567,7 +535,7 @@ static struct packet_data *pkt_alloc_packet_data(int frames)
goto no_pkt;

pkt->frames = frames;
- pkt->w_bio = pkt_bio_alloc(frames);
+ pkt->w_bio = bio_kmalloc(GFP_KERNEL, frames);
if (!pkt->w_bio)
goto no_bio;

@@ -581,9 +549,10 @@ static struct packet_data *pkt_alloc_packet_data(int frames)
bio_list_init(&pkt->orig_bios);

for (i = 0; i < frames; i++) {
- struct bio *bio = pkt_bio_alloc(1);
+ struct bio *bio = bio_kmalloc(GFP_KERNEL, 1);
if (!bio)
goto no_rd_bio;
+
pkt->r_bios[i] = bio;
}

@@ -1111,21 +1080,17 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
* Schedule reads for missing parts of the packet.
*/
for (f = 0; f < pkt->frames; f++) {
- struct bio_vec *vec;
-
int p, offset;
+
if (written[f])
continue;
+
bio = pkt->r_bios[f];
- vec = bio->bi_io_vec;
- bio_init(bio);
- bio->bi_max_vecs = 1;
+ bio_reset(bio);
bio->bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
bio->bi_bdev = pd->bdev;
bio->bi_end_io = pkt_end_io_read;
bio->bi_private = pkt;
- bio->bi_io_vec = vec;
- bio->bi_destructor = pkt_bio_destructor;

p = (f * CD_FRAMESIZE) / PAGE_SIZE;
offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
@@ -1418,14 +1383,11 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
}

/* Start the write request */
- bio_init(pkt->w_bio);
- pkt->w_bio->bi_max_vecs = PACKET_MAX_SIZE;
+ bio_reset(pkt->w_bio);
pkt->w_bio->bi_sector = pkt->sector;
pkt->w_bio->bi_bdev = pd->bdev;
pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
pkt->w_bio->bi_private = pkt;
- pkt->w_bio->bi_io_vec = bvec;
- pkt->w_bio->bi_destructor = pkt_bio_destructor;
for (f = 0; f < pkt->frames; f++)
if (!bio_add_page(pkt->w_bio, bvec[f].bv_page, CD_FRAMESIZE, bvec[f].bv_offset))
BUG();
--
1.7.12

2012-08-28 17:38:03

by Kent Overstreet

Subject: [PATCH v7 8/9] block: Reorder struct bio_set

This is prep work for the next patch, which embeds a struct bio_list in
struct bio_set.

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
include/linux/bio.h | 66 ++++++++++++++++++++++++++---------------------------
1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index f9b61b4..3a8345e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -299,39 +299,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */

-/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-#define BIO_POOL_SIZE 2
-#define BIOVEC_NR_POOLS 6
-#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
-
-struct bio_set {
- struct kmem_cache *bio_slab;
- unsigned int front_pad;
-
- mempool_t *bio_pool;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- mempool_t *bio_integrity_pool;
-#endif
- mempool_t *bvec_pool;
-};
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
-
#ifdef CONFIG_HIGHMEM
/*
* remember never ever reenable interrupts between a bvec_kmap_irq and
@@ -506,6 +473,39 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
return bio;
}

+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
+
+struct bio_set {
+ struct kmem_cache *bio_slab;
+ unsigned int front_pad;
+
+ mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ mempool_t *bio_integrity_pool;
+#endif
+ mempool_t *bvec_pool;
+};
+
+struct biovec_slab {
+ int nr_vecs;
+ char *name;
+ struct kmem_cache *slab;
+};
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
--
1.7.12

2012-08-28 17:38:00

by Kent Overstreet

Subject: [PATCH v7 7/9] block: Add bio_clone_bioset(), bio_clone_kmalloc()

Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().

This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch added.

This will also help in a later patch changing how bio cloning works.
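
For a typical caller the conversion looks roughly like this (a condensed
sketch of the blk-core.c hunk below, not a complete function):

  /* before: allocate, open-code the clone, clone integrity data by hand */
  bio = bio_alloc_bioset(gfp_mask, bio_src->bi_max_vecs, bs);
  __bio_clone(bio, bio_src);

  /* after: one call, which also handles bio_integrity_clone() from @bs */
  bio = bio_clone_bioset(bio_src, gfp_mask, bs);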

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Boaz Harrosh <[email protected]>
CC: Jeff Garzik <[email protected]>
Acked-by: Jeff Garzik <[email protected]>
---
block/blk-core.c | 8 +-------
drivers/block/osdblk.c | 3 +--
drivers/md/dm.c | 4 ++--
drivers/md/md.c | 20 +-------------------
fs/bio.c | 13 ++++++++-----
fs/exofs/ore.c | 5 ++---
include/linux/bio.h | 17 ++++++++++++++---
7 files changed, 29 insertions(+), 41 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f3780f5..66f0bb4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2781,16 +2781,10 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
blk_rq_init(NULL, rq);

__rq_for_each_bio(bio_src, rq_src) {
- bio = bio_alloc_bioset(gfp_mask, bio_src->bi_max_vecs, bs);
+ bio = bio_clone_bioset(bio_src, gfp_mask, bs);
if (!bio)
goto free_and_out;

- __bio_clone(bio, bio_src);
-
- if (bio_integrity(bio_src) &&
- bio_integrity_clone(bio, bio_src, gfp_mask, bs))
- goto free_and_out;
-
if (bio_ctr && bio_ctr(bio, bio_src, data))
goto free_and_out;

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 87311eb..1bbc681 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -266,11 +266,10 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
struct bio *tmp, *new_chain = NULL, *tail = NULL;

while (old_chain) {
- tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
+ tmp = bio_clone_kmalloc(old_chain, gfpmask);
if (!tmp)
goto err_out;

- __bio_clone(tmp, old_chain);
tmp->bi_bdev = NULL;
gfpmask &= ~__GFP_WAIT;
tmp->bi_next = NULL;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9206781..4b67966 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1129,8 +1129,8 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
* ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
* and discard, so no need for concern about wasted bvec allocations.
*/
- clone = bio_alloc_bioset(GFP_NOIO, ci->bio->bi_max_vecs, ci->md->bs);
- __bio_clone(clone, ci->bio);
+ clone = bio_clone_bioset(ci->bio, GFP_NOIO, ci->md->bs);
+
if (len) {
clone->bi_sector = ci->sector;
clone->bi_size = to_bytes(len);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b8eebe3..7a2b079 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -173,28 +173,10 @@ EXPORT_SYMBOL_GPL(bio_alloc_mddev);
struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
{
- struct bio *b;
-
if (!mddev || !mddev->bio_set)
return bio_clone(bio, gfp_mask);

- b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
- if (!b)
- return NULL;
-
- __bio_clone(b, bio);
- if (bio_integrity(bio)) {
- int ret;
-
- ret = bio_integrity_clone(b, bio, gfp_mask, mddev->bio_set);
-
- if (ret < 0) {
- bio_put(b);
- return NULL;
- }
- }
-
- return b;
+ return bio_clone_bioset(bio, gfp_mask, mddev->bio_set);
}
EXPORT_SYMBOL_GPL(bio_clone_mddev);

diff --git a/fs/bio.c b/fs/bio.c
index 5e947d3..31e637a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -430,16 +430,19 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
EXPORT_SYMBOL(__bio_clone);

/**
- * bio_clone - clone a bio
+ * bio_clone_bioset - clone a bio
* @bio: bio to clone
* @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
*
* Like __bio_clone, only also allocates the returned bio
*/
-struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
+ struct bio_set *bs)
{
- struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);
+ struct bio *b;

+ b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
if (!b)
return NULL;

@@ -448,7 +451,7 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
if (bio_integrity(bio)) {
int ret;

- ret = bio_integrity_clone(b, bio, gfp_mask, fs_bio_set);
+ ret = bio_integrity_clone(b, bio, gfp_mask, bs);

if (ret < 0) {
bio_put(b);
@@ -458,7 +461,7 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)

return b;
}
-EXPORT_SYMBOL(bio_clone);
+EXPORT_SYMBOL(bio_clone_bioset);

/**
* bio_get_nr_vecs - return approx number of vecs
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 1585db1..f936cb5 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -814,8 +814,8 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
struct bio *bio;

if (per_dev != master_dev) {
- bio = bio_kmalloc(GFP_KERNEL,
- master_dev->bio->bi_max_vecs);
+ bio = bio_clone_kmalloc(master_dev->bio,
+ GFP_KERNEL);
if (unlikely(!bio)) {
ORE_DBGMSG(
"Failed to allocate BIO size=%u\n",
@@ -824,7 +824,6 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
goto out;
}

- __bio_clone(bio, master_dev->bio);
bio->bi_bdev = NULL;
bio->bi_next = NULL;
per_dev->offset = master_dev->offset;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8bfd572..f9b61b4 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -216,6 +216,9 @@ extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
extern void bio_free(struct bio *);

+extern void __bio_clone(struct bio *, struct bio *);
+extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
+
extern struct bio_set *fs_bio_set;

static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
@@ -223,18 +226,26 @@ static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
}

+static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+{
+ return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
+}
+
static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
}

+static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
+{
+ return bio_clone_bioset(bio, gfp_mask, NULL);
+
+}
+
extern void bio_endio(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);

-extern void __bio_clone(struct bio *, struct bio *);
-extern struct bio *bio_clone(struct bio *, gfp_t);
-
extern void bio_init(struct bio *);
extern void bio_reset(struct bio *);

--
1.7.12

2012-08-28 17:38:34

by Kent Overstreet

Subject: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

Previously, if we ever try to allocate more than once from the same bio
set while running under generic_make_request() (i.e. a stacking block
driver), we risk deadlock.

This is because of the code in generic_make_request() that converts
recursion to iteration; any bios we submit won't actually be submitted
(so they can complete and eventually be freed) until after we return -
this means if we allocate a second bio, we're blocking the first one
from ever being freed.

Thus if enough threads call into a stacking block driver at the same
time with bios that need multiple splits, and the bio_set's reserve gets
used up, we deadlock.

This can be worked around in the driver code - we could check if we're
running under generic_make_request(), then mask out __GFP_WAIT when we
go to allocate a bio, and if the allocation fails punt to workqueue and
retry the allocation.

But this is tricky and not a generic solution. This patch solves it for
all users by inverting the previously described technique. We allocate a
rescuer workqueue for each bio_set, and then in the allocation code if
there are bios on current->bio_list we would be blocking, we punt them
to the rescuer workqueue to be submitted.
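
Condensed to its core, the allocation path in bio_alloc_bioset() becomes
roughly the following (a sketch of the fs/bio.c change below; the other
failure paths are omitted):

  if (current->bio_list && !bio_list_empty(current->bio_list))
          gfp_mask &= ~__GFP_WAIT;        /* first attempt must not block */
retry:
  p = mempool_alloc(bs->bio_pool, gfp_mask);

  if (!p && gfp_mask != saved_gfp) {
          /*
           * Punt the bios we would be blocking on to the rescuer
           * workqueue, then retry with the caller's original gfp_mask.
           */
          gfp_mask = saved_gfp;

          spin_lock(&bs->rescue_lock);
          bio_list_merge(&bs->rescue_list, current->bio_list);
          bio_list_init(current->bio_list);
          spin_unlock(&bs->rescue_lock);

          queue_work(bs->rescue_workqueue, &bs->rescue_work);
          goto retry;
  }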

Tested by forcing the rescue codepath to be taken (by disabling the
first GFP_NOWAIT attempt), then running it with bcache (which does a lot
of arbitrary bio splitting) and verifying that the rescuer was being
invoked.

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
fs/bio.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++---
include/linux/bio.h | 9 +++++++
2 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 31e637a..5d46318 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -285,6 +285,23 @@ void bio_reset(struct bio *bio)
}
EXPORT_SYMBOL(bio_reset);

+static void bio_alloc_rescue(struct work_struct *work)
+{
+ struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+ struct bio *bio;
+
+ while (1) {
+ spin_lock(&bs->rescue_lock);
+ bio = bio_list_pop(&bs->rescue_list);
+ spin_unlock(&bs->rescue_lock);
+
+ if (!bio)
+ break;
+
+ generic_make_request(bio);
+ }
+}
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -307,6 +324,7 @@ EXPORT_SYMBOL(bio_reset);
*/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
front_pad = 0;
inline_vecs = nr_iovecs;
} else {
+ /*
+ * generic_make_request() converts recursion to iteration; this
+ * means if we're running beneath it, any bios we allocate and
+ * submit will not be submitted (and thus freed) until after we
+ * return.
+ *
+ * This exposes us to a potential deadlock if we allocate
+ * multiple bios from the same bio_set() while running
+ * underneath generic_make_request(). If we were to allocate
+ * multiple bios (say a stacking block driver that was splitting
+ * bios), we would deadlock if we exhausted the mempool's
+ * reserve.
+ *
+ * We solve this, and guarantee forward progress, with a rescuer
+ * workqueue per bio_set. If we go to allocate and there are
+ * bios on current->bio_list, we first try the allocation
+ * without __GFP_WAIT; if that fails, we punt those bios we
+ * would be blocking to the rescuer workqueue before we retry
+ * with the original gfp_flags.
+ */
+
+ if (current->bio_list && !bio_list_empty(current->bio_list))
+ gfp_mask &= ~__GFP_WAIT;
+retry:
p = mempool_alloc(bs->bio_pool, gfp_mask);
front_pad = bs->front_pad;
inline_vecs = BIO_INLINE_VECS;
}

if (unlikely(!p))
- return NULL;
+ goto err;

bio = p + front_pad;
bio_init(bio);
@@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

err_free:
mempool_free(p, bs->bio_pool);
+err:
+ if (gfp_mask != saved_gfp) {
+ gfp_mask = saved_gfp;
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_merge(&bs->rescue_list, current->bio_list);
+ bio_list_init(current->bio_list);
+ spin_unlock(&bs->rescue_lock);
+
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ goto retry;
+ }
+
return NULL;
}
EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1562,6 +1617,9 @@ static void biovec_free_pools(struct bio_set *bs)

void bioset_free(struct bio_set *bs)
{
+ if (bs->rescue_workqueue)
+ destroy_workqueue(bs->rescue_workqueue);
+
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);

@@ -1597,6 +1655,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)

bs->front_pad = front_pad;

+ spin_lock_init(&bs->rescue_lock);
+ bio_list_init(&bs->rescue_list);
+ INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
if (!bs->bio_slab) {
kfree(bs);
@@ -1607,9 +1669,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
if (!bs->bio_pool)
goto bad;

- if (!biovec_create_pools(bs, pool_size))
- return bs;
+ if (biovec_create_pools(bs, pool_size))
+ goto bad;
+
+ bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+ if (!bs->rescue_workqueue)
+ goto bad;

+ return bs;
bad:
bioset_free(bs);
return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3a8345e..84fdaac 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -492,6 +492,15 @@ struct bio_set {
mempool_t *bio_integrity_pool;
#endif
mempool_t *bvec_pool;
+
+ /*
+ * Deadlock avoidance for stacking block drivers: see comments in
+ * bio_alloc_bioset() for details
+ */
+ spinlock_t rescue_lock;
+ struct bio_list rescue_list;
+ struct work_struct rescue_work;
+ struct workqueue_struct *rescue_workqueue;
};

struct biovec_slab {
--
1.7.12

2012-08-28 17:38:55

by Kent Overstreet

Subject: [PATCH v7 6/9] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

Previously, bio_kmalloc() and bio_alloc_bioset() behaved slightly
differently because there was some almost-duplicated code - this fixes
some of that.

The important change is that previously bio_kmalloc() always set
bi_io_vec = bi_inline_vecs, even if nr_iovecs == 0 - unlike
bio_alloc_bioset(). This would cause bio_has_data() to return true; I
don't know if this resulted in any actual bugs but it was certainly
wrong.

bio_kmalloc() and bio_alloc_bioset() also have different arbitrary
limits on nr_iovecs - 1024 (UIO_MAXIOV) for bio_kmalloc(), 256
(BIO_MAX_PAGES) for bio_alloc_bioset(). This patch doesn't fix that, but
at least they're enforced closer together and hopefully they will be
fixed in a later patch.

This'll also help with some future cleanups - there are a fair number of
functions that allocate bios (e.g. bio_clone()), and now they don't have
to be duplicated for bio_alloc(), bio_alloc_bioset(), and bio_kmalloc().

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
v7: Re-add dropped comments, improv patch description
---
fs/bio.c | 111 ++++++++++++++++++----------------------------------
include/linux/bio.h | 16 ++++++--
2 files changed, 49 insertions(+), 78 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index ce829c8..5e947d3 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -55,6 +55,7 @@ static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
* IO code that does not need private memory pools.
*/
struct bio_set *fs_bio_set;
+EXPORT_SYMBOL(fs_bio_set);

/*
* Our slab pool management
@@ -291,39 +292,58 @@ EXPORT_SYMBOL(bio_reset);
* @bs: the bio_set to allocate from.
*
* Description:
- * bio_alloc_bioset will try its own mempool to satisfy the allocation.
- * If %__GFP_WAIT is set then we will block on the internal pool waiting
- * for a &struct bio to become free.
- **/
+ * If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
+ * backed by the @bs's mempool.
+ *
+ * When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
+ * able to allocate a bio. This is due to the mempool guarantees. To make this
+ * work, callers must never allocate more than 1 bio at a time from this pool.
+ * Callers that need to allocate more than 1 bio must always submit the
+ * previously allocated bio for IO before attempting to allocate a new one.
+ * Failure to do so can cause deadlocks under memory pressure.
+ *
+ * RETURNS:
+ * Pointer to new bio on success, NULL on failure.
+ */
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ unsigned front_pad;
+ unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
struct bio_vec *bvl = NULL;
struct bio *bio;
void *p;

- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ if (!bs) {
+ if (nr_iovecs > UIO_MAXIOV)
+ return NULL;
+
+ p = kmalloc(sizeof(struct bio) +
+ nr_iovecs * sizeof(struct bio_vec),
+ gfp_mask);
+ front_pad = 0;
+ inline_vecs = nr_iovecs;
+ } else {
+ p = mempool_alloc(bs->bio_pool, gfp_mask);
+ front_pad = bs->front_pad;
+ inline_vecs = BIO_INLINE_VECS;
+ }
+
if (unlikely(!p))
return NULL;
- bio = p + bs->front_pad;

+ bio = p + front_pad;
bio_init(bio);
- bio->bi_pool = bs;
-
- if (unlikely(!nr_iovecs))
- goto out_set;

- if (nr_iovecs <= BIO_INLINE_VECS) {
- bvl = bio->bi_inline_vecs;
- nr_iovecs = BIO_INLINE_VECS;
- } else {
+ if (nr_iovecs > inline_vecs) {
bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
if (unlikely(!bvl))
goto err_free;
-
- nr_iovecs = bvec_nr_vecs(idx);
+ } else if (nr_iovecs) {
+ bvl = bio->bi_inline_vecs;
}
-out_set:
+
+ bio->bi_pool = bs;
bio->bi_flags |= idx << BIO_POOL_OFFSET;
bio->bi_max_vecs = nr_iovecs;
bio->bi_io_vec = bvl;
@@ -335,63 +355,6 @@ err_free:
}
EXPORT_SYMBOL(bio_alloc_bioset);

-/**
- * bio_alloc - allocate a new bio, memory pool backed
- * @gfp_mask: allocation mask to use
- * @nr_iovecs: number of iovecs
- *
- * bio_alloc will allocate a bio and associated bio_vec array that can hold
- * at least @nr_iovecs entries. Allocations will be done from the
- * fs_bio_set. Also see @bio_alloc_bioset and @bio_kmalloc.
- *
- * If %__GFP_WAIT is set, then bio_alloc will always be able to allocate
- * a bio. This is due to the mempool guarantees. To make this work, callers
- * must never allocate more than 1 bio at a time from this pool. Callers
- * that need to allocate more than 1 bio must always submit the previously
- * allocated bio for IO before attempting to allocate a new one. Failure to
- * do so can cause livelocks under memory pressure.
- *
- * RETURNS:
- * Pointer to new bio on success, NULL on failure.
- */
-struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-{
- return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-}
-EXPORT_SYMBOL(bio_alloc);
-
-/**
- * bio_kmalloc - allocate a bio for I/O using kmalloc()
- * @gfp_mask: the GFP_ mask given to the slab allocator
- * @nr_iovecs: number of iovecs to pre-allocate
- *
- * Description:
- * Allocate a new bio with @nr_iovecs bvecs. If @gfp_mask contains
- * %__GFP_WAIT, the allocation is guaranteed to succeed.
- *
- **/
-struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-{
- struct bio *bio;
-
- if (nr_iovecs > UIO_MAXIOV)
- return NULL;
-
- bio = kmalloc(sizeof(struct bio) + nr_iovecs * sizeof(struct bio_vec),
- gfp_mask);
- if (unlikely(!bio))
- return NULL;
-
- bio_init(bio);
- bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
- bio->bi_max_vecs = nr_iovecs;
- bio->bi_io_vec = bio->bi_inline_vecs;
- bio->bi_pool = NULL;
-
- return bio;
-}
-EXPORT_SYMBOL(bio_kmalloc);
-
void zero_fill_bio(struct bio *bio)
{
unsigned long flags;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 393c87e..8bfd572 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -212,12 +212,22 @@ extern void bio_pair_release(struct bio_pair *dbio);
extern struct bio_set *bioset_create(unsigned int, unsigned int);
extern void bioset_free(struct bio_set *);

-extern struct bio *bio_alloc(gfp_t, unsigned int);
-extern struct bio *bio_kmalloc(gfp_t, unsigned int);
extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
extern void bio_free(struct bio *);

+extern struct bio_set *fs_bio_set;
+
+static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+{
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
+}
+
+static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
+{
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
+}
+
extern void bio_endio(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);
@@ -305,8 +315,6 @@ struct biovec_slab {
struct kmem_cache *slab;
};

-extern struct bio_set *fs_bio_set;
-
/*
* a small number of entries is fine, not going to be performance critical.
* basically we just need to survive
--
1.7.12

2012-08-28 17:39:17

by Kent Overstreet

Subject: [PATCH v7 5/9] block: Kill bi_destructor

Now that we've got generic code for freeing bios allocated from bio
pools, this isn't needed anymore.

This also changes the semantics of bio_free() a bit - it now also frees
bios allocated by bio_kmalloc(). It's also no longer exported, as
without bi_destructor there should be no need for it to be called
anywhere else.

v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
v7: No #define BIO_KMALLOC_POOL anymore

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
---
Documentation/block/biodoc.txt | 5 -----
block/blk-core.c | 2 +-
fs/bio.c | 33 ++++++++++++---------------------
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 3 ---
5 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index e418dc0..8df5e8e 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -465,7 +465,6 @@ struct bio {
bio_end_io_t *bi_end_io; /* bi_end_io (bio) */
atomic_t bi_cnt; /* pin count: free when it hits zero */
void *bi_private;
- bio_destructor_t *bi_destructor; /* bi_destructor (bio) */
};

With this multipage bio design:
@@ -647,10 +646,6 @@ for a non-clone bio. There are the 6 pools setup for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.

-The bi_destructor() routine takes into account the possibility of the bio
-having originated from a different source (see later discussions on
-n/w to block transfers and kvec_cb)
-
The bio_get() routine may be used to hold an extra reference on a bio prior
to i/o submission, if the bio fields are likely to be accessed after the
i/o is issued (since the bio may otherwise get freed in case i/o completion
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b4dbdf..f3780f5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2807,7 +2807,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,

free_and_out:
if (bio)
- bio_free(bio, bs);
+ bio_free(bio);
blk_rq_unprep_clone(rq);

return -ENOMEM;
diff --git a/fs/bio.c b/fs/bio.c
index 52da084..ce829c8 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -233,10 +233,19 @@ fallback:
return bvl;
}

-void bio_free(struct bio *bio, struct bio_set *bs)
+void bio_free(struct bio *bio)
{
+ struct bio_set *bs = bio->bi_pool;
void *p;

+ if (!bs) {
+ /* Bio was allocated by bio_kmalloc() */
+ if (bio_integrity(bio))
+ bio_integrity_free(bio, fs_bio_set);
+ kfree(bio);
+ return;
+ }
+
if (bio_has_allocated_vec(bio))
bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));

@@ -252,7 +261,6 @@ void bio_free(struct bio *bio, struct bio_set *bs)

mempool_free(p, bs->bio_pool);
}
-EXPORT_SYMBOL(bio_free);

void bio_init(struct bio *bio)
{
@@ -352,13 +360,6 @@ struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
}
EXPORT_SYMBOL(bio_alloc);

-static void bio_kmalloc_destructor(struct bio *bio)
-{
- if (bio_integrity(bio))
- bio_integrity_free(bio, fs_bio_set);
- kfree(bio);
-}
-
/**
* bio_kmalloc - allocate a bio for I/O using kmalloc()
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -385,7 +386,7 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
bio->bi_max_vecs = nr_iovecs;
bio->bi_io_vec = bio->bi_inline_vecs;
- bio->bi_destructor = bio_kmalloc_destructor;
+ bio->bi_pool = NULL;

return bio;
}
@@ -423,17 +424,7 @@ void bio_put(struct bio *bio)
*/
if (atomic_dec_and_test(&bio->bi_cnt)) {
bio_disassociate_task(bio);
- bio->bi_next = NULL;
-
- /*
- * This if statement is temporary - bi_pool is replacing
- * bi_destructor, but bi_destructor will be taken out in another
- * patch.
- */
- if (bio->bi_pool)
- bio_free(bio, bio->bi_pool);
- else
- bio->bi_destructor(bio);
+ bio_free(bio);
}
}
EXPORT_SYMBOL(bio_put);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ba60319..393c87e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -216,7 +216,7 @@ extern struct bio *bio_alloc(gfp_t, unsigned int);
extern struct bio *bio_kmalloc(gfp_t, unsigned int);
extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
extern void bio_put(struct bio *);
-extern void bio_free(struct bio *, struct bio_set *);
+extern void bio_free(struct bio *);

extern void bio_endio(struct bio *, int);
struct request_queue;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 691edd1..7a9c047 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -84,11 +84,8 @@ struct bio {
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif

- /* If bi_pool is non NULL, bi_destructor is not called */
struct bio_set *bi_pool;

- bio_destructor_t *bi_destructor; /* destructor */
-
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
--
1.7.12

2012-08-28 17:39:42

by Kent Overstreet

Subject: [PATCH v7 3/9] block: Add bio_reset()

Reusing bios is something that's been highly frowned upon in the past,
but driver code keeps doing it anyways. If it's going to happen anyways,
we should provide a generic method.

This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
was open coding it, by doing a bio_init() and resetting bi_destructor.
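
The intended usage replaces that open-coded reinitialization with a single
call; a minimal sketch based on the pktcdvd conversion in patch 4 (fields
from bi_max_vecs onward, including bi_io_vec, are preserved across the
reset):

  /* before: re-init by hand and restore what bio_init() wipes out */
  vec = bio->bi_io_vec;
  bio_init(bio);
  bio->bi_max_vecs = 1;
  bio->bi_io_vec = vec;
  bio->bi_destructor = pkt_bio_destructor;

  /* after */
  bio_reset(bio);
  bio->bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
  bio->bi_end_io = pkt_end_io_read;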

v5: Add a define BIO_RESET_BITS, to be very explicit about what parts of
bio->bi_flags are saved.
v6: Further commenting verbosity, per Tejun

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
fs/bio.c | 14 ++++++++++++++
include/linux/bio.h | 1 +
include/linux/blk_types.h | 21 +++++++++++++++++----
3 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index e017f7a..52da084 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -262,6 +262,20 @@ void bio_init(struct bio *bio)
}
EXPORT_SYMBOL(bio_init);

+void bio_reset(struct bio *bio)
+{
+ unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
+
+ if (bio_integrity(bio))
+ bio_integrity_free(bio, bio->bi_pool);
+
+ bio_disassociate_task(bio);
+
+ memset(bio, 0, BIO_RESET_BYTES);
+ bio->bi_flags = flags|(1 << BIO_UPTODATE);
+}
+EXPORT_SYMBOL(bio_reset);
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 2643589..ba60319 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -226,6 +226,7 @@ extern void __bio_clone(struct bio *, struct bio *);
extern struct bio *bio_clone(struct bio *, gfp_t);

extern void bio_init(struct bio *);
+extern void bio_reset(struct bio *);

extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index af9dd9d..691edd1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -59,15 +59,19 @@ struct bio {
unsigned int bi_seg_front_size;
unsigned int bi_seg_back_size;

+ bio_end_io_t *bi_end_io;
+
+ void *bi_private;
+
+ /*
+ * Everything starting with bi_max_vecs will be preserved by bio_reset()
+ */
+
unsigned int bi_max_vecs; /* max bvl_vecs we can hold */

atomic_t bi_cnt; /* pin count */

struct bio_vec *bi_io_vec; /* the actual vec list */
-
- bio_end_io_t *bi_end_io;
-
- void *bi_private;
#ifdef CONFIG_BLK_CGROUP
/*
* Optional ioc and css associated with this bio. Put on bio
@@ -93,6 +97,8 @@ struct bio {
struct bio_vec bi_inline_vecs[0];
};

+#define BIO_RESET_BYTES offsetof(struct bio, bi_max_vecs)
+
/*
* bio flags
*/
@@ -108,6 +114,13 @@ struct bio {
#define BIO_FS_INTEGRITY 9 /* fs owns integrity data, not block layer */
#define BIO_QUIET 10 /* Make BIO Quiet */
#define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+
+/*
+ * Flags starting here get preserved by bio_reset() - this includes
+ * BIO_POOL_IDX()
+ */
+#define BIO_RESET_BITS 12
+
#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

/*
--
1.7.12

2012-08-28 17:40:04

by Kent Overstreet

Subject: [PATCH v7 1/9] block: Generalized bio pool freeing

With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.

This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.
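
The resulting pattern, condensed from the fs/bio.c hunks below:

  /* bio_alloc_bioset() records the owning pool in the bio ... */
  bio_init(bio);
  bio->bi_pool = bs;

  /* ... so bio_put() can free it without a driver-supplied destructor: */
  if (bio->bi_pool)
          bio_free(bio, bio->bi_pool);
  else
          bio->bi_destructor(bio);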

v6: Explain the temporary if statement in bio_put

Signed-off-by: Kent Overstreet <[email protected]>
CC: Jens Axboe <[email protected]>
CC: NeilBrown <[email protected]>
CC: Alasdair Kergon <[email protected]>
CC: Nicholas Bellinger <[email protected]>
CC: Lars Ellenberg <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Nicholas Bellinger <[email protected]>
---
drivers/block/drbd/drbd_main.c | 13 +------------
drivers/md/dm-crypt.c | 9 ---------
drivers/md/dm-io.c | 11 -----------
drivers/md/dm.c | 20 --------------------
drivers/md/md.c | 28 ++++------------------------
drivers/target/target_core_iblock.c | 9 ---------
fs/bio.c | 31 +++++++++++++------------------
include/linux/blk_types.h | 3 +++
8 files changed, 21 insertions(+), 103 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index f93a032..f55683a 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -162,23 +162,12 @@ static const struct block_device_operations drbd_ops = {
.release = drbd_release,
};

-static void bio_destructor_drbd(struct bio *bio)
-{
- bio_free(bio, drbd_md_io_bio_set);
-}
-
struct bio *bio_alloc_drbd(gfp_t gfp_mask)
{
- struct bio *bio;
-
if (!drbd_md_io_bio_set)
return bio_alloc(gfp_mask, 1);

- bio = bio_alloc_bioset(gfp_mask, 1, drbd_md_io_bio_set);
- if (!bio)
- return NULL;
- bio->bi_destructor = bio_destructor_drbd;
- return bio;
+ return bio_alloc_bioset(gfp_mask, 1, drbd_md_io_bio_set);
}

#ifdef __CHECKER__
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 664743d..3c0acba 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -798,14 +798,6 @@ static int crypt_convert(struct crypt_config *cc,
return 0;
}

-static void dm_crypt_bio_destructor(struct bio *bio)
-{
- struct dm_crypt_io *io = bio->bi_private;
- struct crypt_config *cc = io->cc;
-
- bio_free(bio, cc->bs);
-}
-
/*
* Generate a new unfragmented bio with the given size
* This should never violate the device limitations
@@ -974,7 +966,6 @@ static void clone_init(struct dm_crypt_io *io, struct bio *clone)
clone->bi_end_io = crypt_endio;
clone->bi_bdev = cc->dev->bdev;
clone->bi_rw = io->base_bio->bi_rw;
- clone->bi_destructor = dm_crypt_bio_destructor;
}

static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index ea5dd28..1c46f97 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -249,16 +249,6 @@ static void vm_dp_init(struct dpages *dp, void *data)
dp->context_ptr = data;
}

-static void dm_bio_destructor(struct bio *bio)
-{
- unsigned region;
- struct io *io;
-
- retrieve_io_and_region_from_bio(bio, &io, &region);
-
- bio_free(bio, io->client->bios);
-}
-
/*
* Functions for getting the pages from kernel memory.
*/
@@ -317,7 +307,6 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
bio->bi_sector = where->sector + (where->count - remaining);
bio->bi_bdev = where->bdev;
bio->bi_end_io = endio;
- bio->bi_destructor = dm_bio_destructor;
store_io_and_region_in_bio(bio, io, region);

if (rw & REQ_DISCARD) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4e09b6f..0c3d6dd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -681,11 +681,6 @@ static void clone_endio(struct bio *bio, int error)
}
}

- /*
- * Store md for cleanup instead of tio which is about to get freed.
- */
- bio->bi_private = md->bs;
-
free_tio(md, tio);
bio_put(bio);
dec_pending(io, error);
@@ -1032,11 +1027,6 @@ static void __map_bio(struct dm_target *ti, struct bio *clone,
/* error the io and bail out, or requeue it if needed */
md = tio->io->md;
dec_pending(tio->io, r);
- /*
- * Store bio_set for cleanup.
- */
- clone->bi_end_io = NULL;
- clone->bi_private = md->bs;
bio_put(clone);
free_tio(md, tio);
} else if (r) {
@@ -1055,13 +1045,6 @@ struct clone_info {
unsigned short idx;
};

-static void dm_bio_destructor(struct bio *bio)
-{
- struct bio_set *bs = bio->bi_private;
-
- bio_free(bio, bs);
-}
-
/*
* Creates a little bio that just does part of a bvec.
*/
@@ -1073,7 +1056,6 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
struct bio_vec *bv = bio->bi_io_vec + idx;

clone = bio_alloc_bioset(GFP_NOIO, 1, bs);
- clone->bi_destructor = dm_bio_destructor;
*clone->bi_io_vec = *bv;

clone->bi_sector = sector;
@@ -1105,7 +1087,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
clone->bi_vcnt = idx + bv_count;
@@ -1150,7 +1131,6 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
*/
clone = bio_alloc_bioset(GFP_NOIO, ci->bio->bi_max_vecs, ci->md->bs);
__bio_clone(clone, ci->bio);
- clone->bi_destructor = dm_bio_destructor;
if (len) {
clone->bi_sector = ci->sector;
clone->bi_size = to_bytes(len);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3f6203a..b8eebe3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -155,32 +155,17 @@ static int start_readonly;
* like bio_clone, but with a local bio set
*/

-static void mddev_bio_destructor(struct bio *bio)
-{
- struct mddev *mddev, **mddevp;
-
- mddevp = (void*)bio;
- mddev = mddevp[-1];
-
- bio_free(bio, mddev->bio_set);
-}
-
struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
struct mddev *mddev)
{
struct bio *b;
- struct mddev **mddevp;

if (!mddev || !mddev->bio_set)
return bio_alloc(gfp_mask, nr_iovecs);

- b = bio_alloc_bioset(gfp_mask, nr_iovecs,
- mddev->bio_set);
+ b = bio_alloc_bioset(gfp_mask, nr_iovecs, mddev->bio_set);
if (!b)
return NULL;
- mddevp = (void*)b;
- mddevp[-1] = mddev;
- b->bi_destructor = mddev_bio_destructor;
return b;
}
EXPORT_SYMBOL_GPL(bio_alloc_mddev);
@@ -189,18 +174,14 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
struct mddev *mddev)
{
struct bio *b;
- struct mddev **mddevp;

if (!mddev || !mddev->bio_set)
return bio_clone(bio, gfp_mask);

- b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs,
- mddev->bio_set);
+ b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
if (!b)
return NULL;
- mddevp = (void*)b;
- mddevp[-1] = mddev;
- b->bi_destructor = mddev_bio_destructor;
+
__bio_clone(b, bio);
if (bio_integrity(bio)) {
int ret;
@@ -5006,8 +4987,7 @@ int md_run(struct mddev *mddev)
}

if (mddev->bio_set == NULL)
- mddev->bio_set = bioset_create(BIO_POOL_SIZE,
- sizeof(struct mddev *));
+ mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);

spin_lock(&pers_lock);
pers = find_pers(mddev->level, mddev->clevel);
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 76db75e..e58cd7d 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -543,14 +543,6 @@ static void iblock_complete_cmd(struct se_cmd *cmd)
kfree(ibr);
}

-static void iblock_bio_destructor(struct bio *bio)
-{
- struct se_cmd *cmd = bio->bi_private;
- struct iblock_dev *ib_dev = cmd->se_dev->dev_ptr;
-
- bio_free(bio, ib_dev->ibd_bio_set);
-}
-
static struct bio *
iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
{
@@ -572,7 +564,6 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)

bio->bi_bdev = ib_dev->ibd_bd;
bio->bi_private = cmd;
- bio->bi_destructor = iblock_bio_destructor;
bio->bi_end_io = &iblock_bio_done;
bio->bi_sector = lba;
return bio;
diff --git a/fs/bio.c b/fs/bio.c
index 71072ab..e017f7a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -272,10 +272,6 @@ EXPORT_SYMBOL(bio_init);
* bio_alloc_bioset will try its own mempool to satisfy the allocation.
* If %__GFP_WAIT is set then we will block on the internal pool waiting
* for a &struct bio to become free.
- *
- * Note that the caller must set ->bi_destructor on successful return
- * of a bio, to do the appropriate freeing of the bio once the reference
- * count drops to zero.
**/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
@@ -290,6 +286,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
bio = p + bs->front_pad;

bio_init(bio);
+ bio->bi_pool = bs;

if (unlikely(!nr_iovecs))
goto out_set;
@@ -316,11 +313,6 @@ err_free:
}
EXPORT_SYMBOL(bio_alloc_bioset);

-static void bio_fs_destructor(struct bio *bio)
-{
- bio_free(bio, fs_bio_set);
-}
-
/**
* bio_alloc - allocate a new bio, memory pool backed
* @gfp_mask: allocation mask to use
@@ -342,12 +334,7 @@ static void bio_fs_destructor(struct bio *bio)
*/
struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
- struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-
- if (bio)
- bio->bi_destructor = bio_fs_destructor;
-
- return bio;
+ return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
}
EXPORT_SYMBOL(bio_alloc);

@@ -423,7 +410,16 @@ void bio_put(struct bio *bio)
if (atomic_dec_and_test(&bio->bi_cnt)) {
bio_disassociate_task(bio);
bio->bi_next = NULL;
- bio->bi_destructor(bio);
+
+ /*
+ * This if statement is temporary - bi_pool is replacing
+ * bi_destructor, but bi_destructor will be taken out in another
+ * patch.
+ */
+ if (bio->bi_pool)
+ bio_free(bio, bio->bi_pool);
+ else
+ bio->bi_destructor(bio);
}
}
EXPORT_SYMBOL(bio_put);
@@ -474,12 +470,11 @@ EXPORT_SYMBOL(__bio_clone);
*/
struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
{
- struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
+ struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);

if (!b)
return NULL;

- b->bi_destructor = bio_fs_destructor;
__bio_clone(b, bio);

if (bio_integrity(bio)) {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7b7ac9c..af9dd9d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -80,6 +80,9 @@ struct bio {
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif

+ /* If bi_pool is non NULL, bi_destructor is not called */
+ struct bio_set *bi_pool;
+
bio_destructor_t *bi_destructor; /* destructor */

/*
--
1.7.12

2012-08-28 20:31:52

by Tejun Heo

Subject: Re: [PATCH v7 3/9] block: Add bio_reset()

Hello, Kent.

On Tue, Aug 28, 2012 at 10:37:30AM -0700, Kent Overstreet wrote:
> Reusing bios is something that's been highly frowned upon in the past,
> but driver code keeps doing it anyways. If it's going to happen anyways,
> we should provide a generic method.
>
> This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
> was open coding it, by doing a bio_init() and resetting bi_destructor.

Better to explain why some bio fields are re-ordered and why that
shouldn't make things worse cacheline-wise?

> +void bio_reset(struct bio *bio)
> +{

Function comment explaining what it does and why it does what it does
with integrity / bi_css / whatnot?

> + unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
> +
> + if (bio_integrity(bio))
> + bio_integrity_free(bio, bio->bi_pool);
> +
> + bio_disassociate_task(bio);

Is this desirable? Why?

Thanks.

--
tejun

2012-08-28 20:32:52

by Tejun Heo

Subject: Re: [PATCH v7 4/9] pktcdvd: Switch to bio_kmalloc()

On Tue, Aug 28, 2012 at 10:37:31AM -0700, Kent Overstreet wrote:
> This is prep work for killing bi_destructor - previously, pktcdvd had
> its own pkt_bio_alloc which was basically duplicating bio_kmalloc(),
> necessitating its own bi_destructor implementation.

How was this tested?

--
tejun

2012-08-28 20:36:17

by Tejun Heo

Subject: Re: [PATCH v7 5/9] block: Kill bi_destructor

On Tue, Aug 28, 2012 at 10:37:32AM -0700, Kent Overstreet wrote:
> @@ -385,7 +386,7 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
> bio->bi_max_vecs = nr_iovecs;
> bio->bi_io_vec = bio->bi_inline_vecs;
> - bio->bi_destructor = bio_kmalloc_destructor;
> + bio->bi_pool = NULL;

Doesn't bio_init() already memset the whole thing?

Thanks.

--
tejun

2012-08-28 20:41:53

by Tejun Heo

Subject: Re: [PATCH v7 6/9] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

Hello, Kent.

On Tue, Aug 28, 2012 at 10:37:33AM -0700, Kent Overstreet wrote:
> v7: Re-add dropped comments, improv patch description

I don't think you did the former part. Other than that looks good to
me.

Thanks.

--
tejun

2012-08-28 20:44:08

by Tejun Heo

Subject: Re: [PATCH v7 7/9] block: Add bio_clone_bioset(), bio_clone_kmalloc()

On Tue, Aug 28, 2012 at 10:37:34AM -0700, Kent Overstreet wrote:
> +static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> +{
> + return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
> +}
> +
...
> +static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
> +{
> + return bio_clone_bioset(bio, gfp_mask, NULL);
> +
> +}

Do we really need these wrappers? I'd prefer requiring users to
explicit choose @bioset when cloning.

Thanks.

--
tejun

2012-08-28 20:49:15

by Tejun Heo

Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

Hello,

On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */
> +
> + if (current->bio_list && !bio_list_empty(current->bio_list))
> + gfp_mask &= ~__GFP_WAIT;
> +retry:
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }

I wonder whether it would be easier to follow if this part is in-line
where retry: is. All that needs to be duplicated is single
mempool_alloc() call, right?

Overall, I *think* this is correct but need to think more about it to
be sure.

Thanks!

--
tejun

2012-08-28 22:04:09

by Kent Overstreet

Subject: Re: [PATCH v7 6/9] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

On Tue, Aug 28, 2012 at 01:41:48PM -0700, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Aug 28, 2012 at 10:37:33AM -0700, Kent Overstreet wrote:
> > v7: Re-add dropped comments, improv patch description
>
> I don't think you did the former part. Other than that looks good to
> me.

I folded them into the bio_alloc_bioset() comments - so all the
information that was there previously should still be there, just
centralized. That good enough for your acked-by?

2012-08-28 22:06:08

by Kent Overstreet

Subject: Re: [PATCH v7 7/9] block: Add bio_clone_bioset(), bio_clone_kmalloc()

On Tue, Aug 28, 2012 at 01:44:01PM -0700, Tejun Heo wrote:
> On Tue, Aug 28, 2012 at 10:37:34AM -0700, Kent Overstreet wrote:
> > +static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> > +{
> > + return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
> > +}
> > +
> ...
> > +static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
> > +{
> > + return bio_clone_bioset(bio, gfp_mask, NULL);
> > +
> > +}
>
> Do we really need these wrappers? I'd prefer requiring users to
> explicit choose @bioset when cloning.

bio_clone() is an existing API. I agree, but I'd prefer to convert
existing users in a separate patch, and when I do that I want to spend
some time actually looking at the existing code instead of doing the
conversion blindly (at least some of the existing users are incorrect
and I'll have to add bio_sets for them).

2012-08-28 22:06:24

by Vivek Goyal

Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.

Hi Kent,

So the above deadlock possibility arises only if multiple threads are doing
multiple splits from the same pool and all the threads get blocked on the
mempool without returning from the ->make_request function.

Is there any existing driver in the tree which can cause this deadlock or
it becomes a possibility only when splitting and bcache code shows up?

Thanks
Vivek

2012-08-28 22:07:52

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 5/9] block: Kill bi_destructor

On Tue, Aug 28, 2012 at 01:36:13PM -0700, Tejun Heo wrote:
> On Tue, Aug 28, 2012 at 10:37:32AM -0700, Kent Overstreet wrote:
> > @@ -385,7 +386,7 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> > bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
> > bio->bi_max_vecs = nr_iovecs;
> > bio->bi_io_vec = bio->bi_inline_vecs;
> > - bio->bi_destructor = bio_kmalloc_destructor;
> > + bio->bi_pool = NULL;
>
> Doesn't bio_init() already memset the whole thing?

Yes it does, holdover from BIO_KMALLOC_POOL. Dropping it.

2012-08-28 22:17:51

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] block: Add bio_reset()

On Tue, Aug 28, 2012 at 01:31:48PM -0700, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Aug 28, 2012 at 10:37:30AM -0700, Kent Overstreet wrote:
> > Reusing bios is something that's been highly frowned upon in the past,
> > but driver code keeps doing it anyways. If it's going to happen anyways,
> > we should provide a generic method.
> >
> > This'll help with getting rid of bi_destructor - drivers/block/pktcdvd.c
> > was open coding it, by doing a bio_init() and resetting bi_destructor.
>
> Better to explain why some bio fields are re-ordered and why that
> shouldn't make things worse cacheline-wise?

Well, it may (struct bio is what, 3 or 4 cachelines now?), but even on
ridiculous million-IOP devices struct bio just isn't hot enough for this
kind of stuff to show up in the profiles... and I've done enough
profiling to be pretty confident of that.

So um, if there was anything to say performance-wise I would, but to me
it seems more like there isn't.

(Pruning struct bio would have more of an effect performance-wise, which
you know is something I'd like to do.)

>
> > +void bio_reset(struct bio *bio)
> > +{
>
> Function comment explaining what it does and why it does what it does
> with integrity / bi_css / whatnot?

Yeah, good idea.

>
> > + unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
> > +
> > + if (bio_integrity(bio))
> > + bio_integrity_free(bio, bio->bi_pool);
> > +
> > + bio_disassociate_task(bio);
>
> Is this desirable? Why?

*twitch* I should've thought more when I made that change.

It occurs to me now that we specifically talked about that and decided
to do it the other way - when I changed that I was just looking at
bio_free() and decided they needed to be symmetrical.

I still think they should be symmetrical, but if that's true bi_ioc and
bi_css need to be moved, and also bio_disassociate_task() should be
getting called from bio_free(), not bio_put().

Were you the one that added that call? I know you've been working on
that area of the code recently. Sticking it in bio_put() instead of
bio_free() seems odd to me, and they're completely equivalent now that
bio_free() is only called from bio_put() (save one instance I should
probably fix).

2012-08-28 22:23:40

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 06:06:10PM -0400, Vivek Goyal wrote:
> On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> > Previously, if we ever try to allocate more than once from the same bio
> > set while running under generic_make_request() (i.e. a stacking block
> > driver), we risk deadlock.
> >
> > This is because of the code in generic_make_request() that converts
> > recursion to iteration; any bios we submit won't actually be submitted
> > (so they can complete and eventually be freed) until after we return -
> > this means if we allocate a second bio, we're blocking the first one
> > from ever being freed.
> >
> > Thus if enough threads call into a stacking block driver at the same
> > time with bios that need multiple splits, and the bio_set's reserve gets
> > used up, we deadlock.
>
> Hi Kent,
>
> So above deadlock possibility arises only if multiple threads are doing
> multiple splits from same pool and all the threads get blocked on mempool
> and don't return from ->make_request function.
>
> Is there any existing driver in the tree which can cause this deadlock or
> it becomes a possibility only when splitting and bcache code shows up?

It is emphatically possible with dm, though you would probably need a
pathological configuration of your targets for it to be possible in
practice - at least with the targets currently in common use.

With most of the newer dm targets coming down the pipe I expect it
should be possible under real world configurations, with the caveat that
under that kind of memory pressure in practice many other things will
fall over first.

2012-08-28 22:24:51

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] pktcdvd: Switch to bio_kmalloc()

On Tue, Aug 28, 2012 at 01:32:47PM -0700, Tejun Heo wrote:
> On Tue, Aug 28, 2012 at 10:37:31AM -0700, Kent Overstreet wrote:
> > This is prep work for killing bi_destructor - previously, pktcdvd had
> > its own pkt_bio_alloc which was basically duplication bio_kmalloc(),
> > necessitating its own bi_destructor implementation.
>
> How was this tested?

It hasn't been yet, but I'm getting a new test machine in the next day or
so (and the patch is also a lot smaller after the bio_reset() changes
you suggested).

2012-08-28 22:28:36

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> > @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > front_pad = 0;
> > inline_vecs = nr_iovecs;
> > } else {
> > + /*
> > + * generic_make_request() converts recursion to iteration; this
> > + * means if we're running beneath it, any bios we allocate and
> > + * submit will not be submitted (and thus freed) until after we
> > + * return.
> > + *
> > + * This exposes us to a potential deadlock if we allocate
> > + * multiple bios from the same bio_set() while running
> > + * underneath generic_make_request(). If we were to allocate
> > + * multiple bios (say a stacking block driver that was splitting
> > + * bios), we would deadlock if we exhausted the mempool's
> > + * reserve.
> > + *
> > + * We solve this, and guarantee forward progress, with a rescuer
> > + * workqueue per bio_set. If we go to allocate and there are
> > + * bios on current->bio_list, we first try the allocation
> > + * without __GFP_WAIT; if that fails, we punt those bios we
> > + * would be blocking to the rescuer workqueue before we retry
> > + * with the original gfp_flags.
> > + */
> > +
> > + if (current->bio_list && !bio_list_empty(current->bio_list))
> > + gfp_mask &= ~__GFP_WAIT;
> > +retry:
> > p = mempool_alloc(bs->bio_pool, gfp_mask);
> > front_pad = bs->front_pad;
> > inline_vecs = BIO_INLINE_VECS;
> > }
> >
> > if (unlikely(!p))
> > - return NULL;
> > + goto err;
> >
> > bio = p + front_pad;
> > bio_init(bio);
> > @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >
> > err_free:
> > mempool_free(p, bs->bio_pool);
> > +err:
> > + if (gfp_mask != saved_gfp) {
> > + gfp_mask = saved_gfp;
> > +
> > + spin_lock(&bs->rescue_lock);
> > + bio_list_merge(&bs->rescue_list, current->bio_list);
> > + bio_list_init(current->bio_list);
> > + spin_unlock(&bs->rescue_lock);
> > +
> > + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > + goto retry;
> > + }
>
> I wonder whether it would be easier to follow if this part is in-line
> where retry: is. All that needs to be duplicated is single
> mempool_alloc() call, right?

Actually, what might be better than both of those approaches is shoving
that code into another function. Then it's just:

	if (gfp_mask != saved_gfp) {
		gfp_mask = saved_gfp;
		shovel_bios_to_rescuer();
		goto retry;
	}
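
i.e. something like this (untested - it'd take the bio_set as an argument
since that's in scope at the call site, so the call above would be
shovel_bios_to_rescuer(bs); the body is just the existing err: path code
pulled out into a helper):

static void shovel_bios_to_rescuer(struct bio_set *bs)
{
	/*
	 * Punt everything blocked on current->bio_list to the rescuer so it
	 * can make forward progress while we sleep in mempool_alloc().
	 */
	spin_lock(&bs->rescue_lock);
	bio_list_merge(&bs->rescue_list, current->bio_list);
	bio_list_init(current->bio_list);
	spin_unlock(&bs->rescue_lock);

	queue_work(bs->rescue_workqueue, &bs->rescue_work);
}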

> Overall, I *think* this is correct but need to think more about it to
> be sure.

Please do. As much time as I've spent staring at this kind of stuff,
I'm pretty sure I've got it correct but it still makes my head hurt to
work out all the various possible deadlocks.

2012-08-28 22:55:10

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] block: Add bio_reset()

On Tue, Aug 28, 2012 at 03:17:15PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 01:31:48PM -0700, Tejun Heo wrote:
> > > + unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
> > > +
> > > + if (bio_integrity(bio))
> > > + bio_integrity_free(bio, bio->bi_pool);
> > > +
> > > + bio_disassociate_task(bio);
> >
> > Is this desirable? Why?
>
> *twitch* I should've thought more when I made that change.
>
> It occurs to me now that we specifically talked about that and decided
> to do it the other way - when I changed that I was just looking at
> bio_free() and decided they needed to be symmetrical.
>
> I still think they should be symmetrical, but if that's true bi_ioc and
> bi_css need to be moved, and also bio_disassociate_task() should be
> getting called from bio_free(), not bio_put().

Though - looking at it again I didn't actually screw anything up when I
made that change, it's just bad style.

Just screwing around a bit with the patch below, but - couple thoughts:

1) I hate duplicated code, and for the stuff in
bio_init()/bio_free()/bio_reset() it's perhaps not worth it - it's the
kind of stuff that gets out of sync.

2) It'd be better to have bio_reset() reset as much as possible, i.e. be
as close to bio_init() as possible. Fewer differences to confuse people.


diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 7c8fe1d..a38bc7d 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -148,6 +148,9 @@ void bio_integrity_free(struct bio *bio, struct bio_set *bs)

BUG_ON(bip == NULL);

+ if (!bs)
+ bs = fs_bio_set;
+
/* A cloned bio doesn't own the integrity metadata */
if (!bio_flagged(bio, BIO_CLONED) && !bio_flagged(bio, BIO_FS_INTEGRITY)
&& bip->bip_buf != NULL)
diff --git a/fs/bio.c b/fs/bio.c
index c938f42..56d8fa2 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -234,39 +234,47 @@ fallback:
return bvl;
}

+static void bio_free_stuff(struct bio *bio)
+{
+ bio_disassociate_task(bio);
+
+ if (bio_integrity(bio))
+ bio_integrity_free(bio, bio->bi_pool);
+}
+
void bio_free(struct bio *bio)
{
struct bio_set *bs = bio->bi_pool;
void *p;

+ bio_free_stuff(bio);
+
if (!bs) {
- /* Bio was allocated by bio_kmalloc() */
- if (bio_integrity(bio))
- bio_integrity_free(bio, fs_bio_set);
kfree(bio);
- return;
- }
-
- if (bio_has_allocated_vec(bio))
- bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));
+ } else {
+ if (bio_has_allocated_vec(bio))
+ bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));

- if (bio_integrity(bio))
- bio_integrity_free(bio, bs);
+ /*
+ * If we have front padding, adjust the bio pointer before freeing
+ */
+ p = bio;
+ if (bs->front_pad)
+ p -= bs->front_pad;

- /*
- * If we have front padding, adjust the bio pointer before freeing
- */
- p = bio;
- if (bs->front_pad)
- p -= bs->front_pad;
+ mempool_free(p, bs->bio_pool);
+ }
+}

- mempool_free(p, bs->bio_pool);
+static void __bio_init(struct bio *bio)
+{
+ memset(bio, 0, BIO_RESET_BYTES);
+ bio->bi_flags = 1 << BIO_UPTODATE;
}

void bio_init(struct bio *bio)
{
- memset(bio, 0, sizeof(*bio));
- bio->bi_flags = 1 << BIO_UPTODATE;
+ __bio_init(bio);
atomic_set(&bio->bi_cnt, 1);
}
EXPORT_SYMBOL(bio_init);
@@ -275,13 +283,10 @@ void bio_reset(struct bio *bio)
{
unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);

- if (bio_integrity(bio))
- bio_integrity_free(bio, bio->bi_pool);
+ bio_free_stuff(bio);
+ __bio_init(bio);

- bio_disassociate_task(bio);
-
- memset(bio, 0, BIO_RESET_BYTES);
- bio->bi_flags = flags|(1 << BIO_UPTODATE);
+ bio->bi_flags |= flags;
}
EXPORT_SYMBOL(bio_reset);

@@ -440,10 +445,8 @@ void bio_put(struct bio *bio)
/*
* last put frees it
*/
- if (atomic_dec_and_test(&bio->bi_cnt)) {
- bio_disassociate_task(bio);
+ if (atomic_dec_and_test(&bio->bi_cnt))
bio_free(bio);
- }
}
EXPORT_SYMBOL(bio_put);

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ca63847..28bddc0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -68,15 +68,6 @@ struct bio {

struct bvec_iter bi_iter;

- /*
- * Everything starting with bi_max_vecs will be preserved by bio_reset()
- */
-
- unsigned int bi_max_vecs; /* max bvl_vecs we can hold */
-
- atomic_t bi_cnt; /* pin count */
-
- struct bio_vec *bi_io_vec; /* the actual vec list */
#ifdef CONFIG_BLK_CGROUP
/*
* Optional ioc and css associated with this bio. Put on bio
@@ -89,6 +80,16 @@ struct bio {
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif

+ /*
+ * Everything starting with bi_max_vecs will be preserved by bio_reset()
+ */
+
+ unsigned int bi_max_vecs; /* max bvl_vecs we can hold */
+
+ atomic_t bi_cnt; /* pin count */
+
+ struct bio_vec *bi_io_vec; /* the actual vec list */
+
struct bio_set *bi_pool;

/*

2012-08-28 23:01:45

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 03:28:00PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> > Overall, I *think* this is correct but need to think more about it to
> > be sure.
>
> Please do. As much time as I've spent staring at this kind of stuff,
> I'm pretty sure I've got it correct but it still makes my head hurt to
> work out all the various possible deadlocks.

Hilarious thought: We're punting bios to a rescuer thread that's
specific to a certain bio_set, right? What if we happen to punt bios
from a different bio_set? And then the rescuer goes to resubmit those
bios, and in the process they happen to have dependencies on the
original bio_set...

I think it's actually necessary to filter out only bios from the current
bio_set to punt to the rescuer.
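
Something like this, maybe - a rough, untested sketch of the filtering
inside bio_alloc_bioset() (bs is the bio_set we're allocating from, and
this assumes bi_pool is enough to tell which bio_set a bio came from):

	struct bio_list punt, nopunt;
	struct bio *bio;

	bio_list_init(&punt);
	bio_list_init(&nopunt);

	/* only punt bios that were allocated from this bio_set... */
	while ((bio = bio_list_pop(current->bio_list)))
		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);

	/* ...and leave everything else on current->bio_list */
	*current->bio_list = nopunt;

	spin_lock(&bs->rescue_lock);
	bio_list_merge(&bs->rescue_list, &punt);
	spin_unlock(&bs->rescue_lock);

	queue_work(bs->rescue_workqueue, &bs->rescue_work);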

God I love the block layer sometimes.

2012-08-29 01:32:09

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 04:01:08PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 03:28:00PM -0700, Kent Overstreet wrote:
> > On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> > > Overall, I *think* this is correct but need to think more about it to
> > > be sure.
> >
> > Please do. As much time as I've spent staring at this kind of stuff,
> > I'm pretty sure I've got it correct but it still makes my head hurt to
> > work out all the various possible deadlocks.
>
> Hilarious thought: We're punting bios to a rescuer thread that's
> specific to a certain bio_set, right? What if we happen to punt bios
> from a different bio_set? And then the rescuer goes to resubmit those
> bios, and in the process they happen to have dependencies on the
> original bio_set...

Are they not fully allocated bios? When you submit these to the underlying
device, ideally we should not be sharing a memory pool across different
layers of the stack - otherwise we will deadlock anyway as the stack depth
increases. So there should not be a dependency on the original bio_set?

Or am I missing something? Maybe an example will help.

Thanks
Vivek

2012-08-29 03:26:35

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 09:31:50PM -0400, Vivek Goyal wrote:
> On Tue, Aug 28, 2012 at 04:01:08PM -0700, Kent Overstreet wrote:
> > On Tue, Aug 28, 2012 at 03:28:00PM -0700, Kent Overstreet wrote:
> > > On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> > > > Overall, I *think* this is correct but need to think more about it to
> > > > be sure.
> > >
> > > Please do. As much time as I've spent staring at this kind of stuff,
> > > I'm pretty sure I've got it correct but it still makes my head hurt to
> > > work out all the various possible deadlocks.
> >
> > Hilarious thought: We're punting bios to a rescuer thread that's
> > specific to a certain bio_set, right? What if we happen to punt bios
> > from a different bio_set? And then the rescuer goes to resubmit those
> > bios, and in the process they happen to have dependencies on the
> > original bio_set...
>
> Are they not fully allocated bios and when you submit these to underlying
> device, ideally we should not be sharing memory pool at different layers
> of stack otherwise we will deadlock any way as stack depth increases. So
> there should not be a dependency on original bio_set?
>
> Or, am I missing something. May be an example will help.

Uh, it's more complicated than that. My brain is too fried right now to
walk through it in detail, but the problem (if it is a problem; I can't
convince myself one way or the other) is roughly:

one virt block device stacked on top of another - they both do arbitrary
splitting:

So once they've submitted a bio, that bio needs to make forward progress
even if the thread goes to allocate another bio and blocks before it
returns from its make_request fn.

That much my patch solves, with the rescuer thread; if the thread goes
to block, it punts those blocked bios off to a rescuer thread - and we
create one rescuer per bio set.

So going back to the stacked block devices, if you've got say dm on top
of md (or something else since md doesn't really do much splitting) -
each block device will have its own rescuer and everything should be
hunky dory.

Except that when a thread goes to punt those blocked bios to its
rescuer, it punts _all_ the bios on current->bio_list - even those
generated by/belonging to other bio_sets.

So thread 1, running in device b, punts bios to device b's rescuer - call
it thread 2.

But thread 2 ends up with bios for both devices a and b - because they're
stacked.

Thread 2 starts on bios for device a before it gets to those for device
b. But a is stacked on top of b, so in the process it generates more
bios for b.

So now it's uhm...

Yeah, I'm gonna sleep on this. I'm pretty sure that to be rigorously
correct we need to filter which bios we punt to the rescuer, though.

2012-08-29 12:58:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 08:25:58PM -0700, Kent Overstreet wrote:

[..]
> Except that when thread a goes to punt those blocked bios to its
> rescuer, it punts _all_ the bios on current->bio_list. Even those
> generated by/belonging to other bio_sets.
>
> So thread 1 in device b punts bios to its rescuer, thread 2
>
> But thread 2 ends up with bios for both device a and b - because they're
> stacked.

Ok, just to add more details to the above example. Say we have device A stacked
on top of B, and B stacked on top of C. Say a bio a is submitted to device A
and is split into two bios b1 and b2. Now b1 is submitted to device B and is
split into c1 and c2. Now the current bio list has three bios: b2, c1 and c2.
If the submitter is now about to block on any bio_set, then all three bios
b2, c1, c2 will be punted to the rescue thread, and submission of b2 will again
block, blocking the rescue thread itself.

I would say keep all the bio splitting patches and any fixes w.r.t.
deadlocks in a separate series, as this is a little complicated and a lot
of it is just theoretical corner cases. If you limit this series to just
the bio_set related cleanups, it becomes more acceptable for inclusion.

Thanks
Vivek

2012-08-29 14:39:34

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 08:57:59AM -0400, Vivek Goyal wrote:
> I would say keep all the bio splitting patches and any fixes w.r.t
> deadlocks in a seprate series. As this is little complicated and a lot
> of is just theoritical corner cases. If you limit this series to just
> bio_set related cleanups, it becomes more acceptable for inclusion.

Yes, please keep the splitting patches separate. The current code gets
away with what it does through statistics making the deadlocks very
unlikely.

It's also instructive to remember why the code is the way it is: it used
to process bios for underlying devices immediately, but this sometimes
meant too much recursive stack growth. If a per-device rescuer thread
is to be made available (as well as the mempool), the option of
reinstating recursion is there too - only punting to workqueue when the
stack actually becomes "too big". (Also bear in mind that some dm
targets may have dependencies on their own mempools - submission can
block there too.) I find it helpful only to consider splitting into two
pieces - it must always be possible to process the first piece (i.e.
process it at the next layer down in the stack) and complete it
independently of what happens to the second piece (which might require
further splitting and block until the first piece has completed).

Alasdair

2012-08-29 16:25:00

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

Hi

This fixes the bio allocation problems, but doesn't fix a similar deadlock
in device mapper when allocating from md->io_pool or other mempools in
the target driver.

The problem is that the majority of device mapper code assumes that if we
submit a bio, that bio will be finished in a finite time. The commit
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.

I suggest - instead of writing workarounds for this current->bio_list
misbehavior, why not remove current->bio_list altogether? We could revert
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
test stack usage in generic_make_request, and if it is too high (more than
half of the stack used, or so), put the bio on the target device's
workqueue.

That could be simpler than allocating a per-bioset workqueue and it also
solves more problems (possible deadlocks in dm).

Mikulas



On Tue, 28 Aug 2012, Kent Overstreet wrote:

> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.
>
> This can be worked around in the driver code - we could check if we're
> running under generic_make_request(), then mask out __GFP_WAIT when we
> go to allocate a bio, and if the allocation fails punt to workqueue and
> retry the allocation.
>
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.
>
> Tested it by forcing the rescue codepath to be taken (by disabling the
> first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
> of arbitrary bio splitting) and verified that the rescuer was being
> invoked.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> CC: Jens Axboe <[email protected]>
> ---
> fs/bio.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> include/linux/bio.h | 9 +++++++
> 2 files changed, 79 insertions(+), 3 deletions(-)
>
> diff --git a/fs/bio.c b/fs/bio.c
> index 31e637a..5d46318 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -285,6 +285,23 @@ void bio_reset(struct bio *bio)
> }
> EXPORT_SYMBOL(bio_reset);
>
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> + struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> + struct bio *bio;
> +
> + while (1) {
> + spin_lock(&bs->rescue_lock);
> + bio = bio_list_pop(&bs->rescue_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + if (!bio)
> + break;
> +
> + generic_make_request(bio);
> + }
> +}
> +
> /**
> * bio_alloc_bioset - allocate a bio for I/O
> * @gfp_mask: the GFP_ mask given to the slab allocator
> @@ -307,6 +324,7 @@ EXPORT_SYMBOL(bio_reset);
> */
> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> + gfp_t saved_gfp = gfp_mask;
> unsigned front_pad;
> unsigned inline_vecs;
> unsigned long idx = BIO_POOL_NONE;
> @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */
> +
> + if (current->bio_list && !bio_list_empty(current->bio_list))
> + gfp_mask &= ~__GFP_WAIT;
> +retry:
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }
> +
> return NULL;
> }
> EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1562,6 +1617,9 @@ static void biovec_free_pools(struct bio_set *bs)
>
> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)
> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);
>
> @@ -1597,6 +1655,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>
> bs->front_pad = front_pad;
>
> + spin_lock_init(&bs->rescue_lock);
> + bio_list_init(&bs->rescue_list);
> + INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
> bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
> if (!bs->bio_slab) {
> kfree(bs);
> @@ -1607,9 +1669,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
> if (!bs->bio_pool)
> goto bad;
>
> - if (!biovec_create_pools(bs, pool_size))
> - return bs;
> + if (biovec_create_pools(bs, pool_size))
> + goto bad;
> +
> + bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> + if (!bs->rescue_workqueue)
> + goto bad;
>
> + return bs;
> bad:
> bioset_free(bs);
> return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 3a8345e..84fdaac 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -492,6 +492,15 @@ struct bio_set {
> mempool_t *bio_integrity_pool;
> #endif
> mempool_t *bvec_pool;
> +
> + /*
> + * Deadlock avoidance for stacking block drivers: see comments in
> + * bio_alloc_bioset() for details
> + */
> + spinlock_t rescue_lock;
> + struct bio_list rescue_list;
> + struct work_struct rescue_work;
> + struct workqueue_struct *rescue_workqueue;
> };
>
> struct biovec_slab {
> --
> 1.7.12
>

2012-08-29 16:26:18

by Kent Overstreet

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
> It's also instructive to remember why the code is the way it is: it used
> to process bios for underlying devices immediately, but this sometimes
> meant too much recursive stack growth. If a per-device rescuer thread
> is to be made available (as well as the mempool), the option of
> reinstating recursion is there too - only punting to workqueue when the
> stack actually becomes "too big". (Also bear in mind that some dm
> targets may have dependencies on their own mempools - submission can
> block there too.) I find it helpful only to consider splitting into two
> pieces - it must always be possible to process the first piece (i.e.
> process it at the next layer down in the stack) and complete it
> independently of what happens to the second piece (which might require
> further splitting and block until the first piece has completed).

I'm sure it could be made to work (and it may well be simpler), but it
seems problematic from a performance pov.

With stacked devices you'd then have to switch stacks on _every_ bio.
That could be made fast enough I'm sure, but it wouldn't be free and I
don't know of any existing code in the kernel that implements what we'd
need (though if you know how you'd go about doing that, I'd love to
know! Would be useful for other things).

The real problem is that because we'd need these extra stacks for
handling all bios we couldn't get by with just one per bio_set. We'd
only need one to make forward progress so the rest could be allocated
on demand (i.e. what the workqueue code does) but that sounds like it's
starting to get expensive.

2012-08-29 16:50:48

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 12:24:43PM -0400, Mikulas Patocka wrote:
> Hi
>
> This fixes the bio allocation problems, but doesn't fix a similar deadlock
> in device mapper when allocating from md->io_pool or other mempools in
> the target driver.

Ick. That is a problem.

There are a bunch of mempools in drivers/md too :(

Though honestly bio_sets have the front_pad thing for a reason - for
per-bio state it makes sense to use that anyway: simpler code, because
you're doing fewer explicit allocations, and more efficient too.

So I wonder whether it'd still be a problem if we fixed that.
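
The pattern I mean is roughly this (made-up names, just a sketch - the
driver's per-bio state lives in the bioset's front_pad, so one allocation
covers both the bio and the state):

struct my_per_bio_state {
	void		*driver_state;	/* whatever the driver needs per bio */
	struct bio	bio;		/* must come last */
};

/* at init time: reserve front_pad bytes in front of every bio */
bs = bioset_create(64, offsetof(struct my_per_bio_state, bio));

/* per bio: one allocation, no separate mempool */
bio = bio_alloc_bioset(GFP_NOIO, nr_iovecs, bs);
state = container_of(bio, struct my_per_bio_state, bio);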

The other thing we could do is make the rescuer thread per block device
(which arguably makes more sense than per bio_set anyways), then
associate bio_sets with specific block devices/rescuers.

Then the rescuer code can be abstracted out from the bio allocation
code, and we just create a thin wrapper around mempool_alloc() that does
the same dance bio_alloc_bioset() does now.

I think that'd be pretty easy and work out really nicely.
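
Roughly like this (made-up names, untested - assumes the rescuer state
gets pulled out of struct bio_set into its own struct that any mempool
user can point at):

struct mempool_rescuer {
	spinlock_t		lock;
	struct bio_list		list;
	struct work_struct	work;
	struct workqueue_struct	*wq;
};

static void *mempool_alloc_rescuing(mempool_t *pool, gfp_t gfp_mask,
				    struct mempool_rescuer *r)
{
	gfp_t saved_gfp = gfp_mask;
	void *p;

	/* same dance bio_alloc_bioset() does now */
	if (current->bio_list && !bio_list_empty(current->bio_list))
		gfp_mask &= ~__GFP_WAIT;
retry:
	p = mempool_alloc(pool, gfp_mask);
	if (!p && gfp_mask != saved_gfp) {
		gfp_mask = saved_gfp;

		spin_lock(&r->lock);
		bio_list_merge(&r->list, current->bio_list);
		bio_list_init(current->bio_list);
		spin_unlock(&r->lock);

		queue_work(r->wq, &r->work);
		goto retry;
	}

	return p;
}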

> The problem is that majority of device mapper code assumes that if we
> submit a bio, that bio will be finished in a finite time. The commit
> d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
>
> I suggest - instead of writing workarounds for this current->bio_list
> misbehavior, why not remove current->bio_list at all? We could revert
> d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> test stack usage in generic_make_request, and if it is too high (more than
> half of the stack used, or so), put the bio to the target device's
> blockqueue.
>
> That could be simpler than allocating per-bioset workqueue and it also
> solves more problems (possible deadlocks in dm).

It certainly would be simpler, but honestly the potential for
performance regressions scares me (and bcache at least is used on fast
enough devices where it's going to matter). Also it's not so much the
performance overhead - we can just measure that - it's that if we're
just using the workqueue code the scheduler's getting involved and we
can't just measure what the effects of that are going to be in
production.

2012-08-29 16:58:13

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:
> The other thing we could do is make the rescuer thread per block device
> (which arguably makes more sense than per bio_set anyways), then
> associate bio_sets with specific block devices/rescuers.

For dm, it's equivalent - already one bioset required per live device:
dm_alloc_md_mempools().

Alasdair

2012-08-29 17:07:26

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:

[..]
> > The problem is that majority of device mapper code assumes that if we
> > submit a bio, that bio will be finished in a finite time. The commit
> > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
> >
> > I suggest - instead of writing workarounds for this current->bio_list
> > misbehavior, why not remove current->bio_list at all? We could revert
> > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> > test stack usage in generic_make_request, and if it is too high (more than
> > half of the stack used, or so), put the bio to the target device's
> > blockqueue.
> >
> > That could be simpler than allocating per-bioset workqueue and it also
> > solves more problems (possible deadlocks in dm).
>
> It certainly would be simpler, but honestly the potential for
> performance regressions scares me (and bcache at least is used on fast
> enough devices where it's going to matter). Also it's not so much the
> performance overhead - we can just measure that - it's that if we're
> just using the workqueue code the scheduler's getting involved and we
> can't just measure what the effects of that are going to be in
> production.

Are workqueues not getting involved already in your solution of punting
to the rescuer thread? In the proposal above, too, workers get involved
only if the stack depth is too deep, so for normal stack usage performance
should not be impacted.

Performance aside, punting submission to a per-device worker in case of
deep stack usage sounds like a cleaner solution to me.

Thanks
Vivek

2012-08-29 17:13:59

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 01:07:11PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:
>
> [..]
> > > The problem is that majority of device mapper code assumes that if we
> > > submit a bio, that bio will be finished in a finite time. The commit
> > > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
> > >
> > > I suggest - instead of writing workarounds for this current->bio_list
> > > misbehavior, why not remove current->bio_list at all? We could revert
> > > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> > > test stack usage in generic_make_request, and if it is too high (more than
> > > half of the stack used, or so), put the bio to the target device's
> > > blockqueue.
> > >
> > > That could be simpler than allocating per-bioset workqueue and it also
> > > solves more problems (possible deadlocks in dm).
> >
> > It certainly would be simpler, but honestly the potential for
> > performance regressions scares me (and bcache at least is used on fast
> > enough devices where it's going to matter). Also it's not so much the
> > performance overhead - we can just measure that - it's that if we're
> > just using the workqueue code the scheduler's getting involved and we
> > can't just measure what the effects of that are going to be in
> > production.
>
> Are workqueues not getting involved already in your solution of punting
> to rescuer thread.

Only on allocation failure.

> In the proposal above also, workers get involved
> only if stack depth is too deep. So for normal stack usage performance
> should not be impacted.
>
> Performance aside, punting submission to per device worker in case of deep
> stack usage sounds cleaner solution to me.

Agreed, but performance tends to matter in the real world. And either
way the tricky bits are going to be confined to a few functions, so I
don't think it matters that much.

If someone wants to code up the workqueue version and test it, they're
more than welcome...

2012-08-29 17:23:53

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> Only on allocation failure.

Which as you already said assumes wrapping together the other mempools in the
submission path first...

Alasdair

2012-08-29 17:32:54

by Kent Overstreet

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 06:23:36PM +0100, Alasdair G Kergon wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > Only on allocation failure.
>
> Which as you already said assumes wrapping together the other mempools in the
> submission path first...

Yes? I'm not sure what you're getting at.

2012-08-29 21:09:14

by Kent Overstreet

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 05:01:09PM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <[email protected]> writes:
>
> Kent> On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
> >> It's also instructive to remember why the code is the way it is: it used
> >> to process bios for underlying devices immediately, but this sometimes
> >> meant too much recursive stack growth. If a per-device rescuer thread
> >> is to be made available (as well as the mempool), the option of
> >> reinstating recursion is there too - only punting to workqueue when the
> >> stack actually becomes "too big". (Also bear in mind that some dm
> >> targets may have dependencies on their own mempools - submission can
> >> block there too.) I find it helpful only to consider splitting into two
> >> pieces - it must always be possible to process the first piece (i.e.
> >> process it at the next layer down in the stack) and complete it
> >> independently of what happens to the second piece (which might require
> >> further splitting and block until the first piece has completed).
>
> Kent> I'm sure it could be made to work (and it may well simpler), but it
> Kent> seems problematic from a performance pov.
>
> Kent> With stacked devices you'd then have to switch stacks on _every_ bio.
> Kent> That could be made fast enough I'm sure, but it wouldn't be free and I
> Kent> don't know of any existing code in the kernel that implements what we'd
> Kent> need (though if you know how you'd go about doing that, I'd love to
> Kent> know! Would be useful for other things).
>
> Kent> The real problem is that because we'd need these extra stacks for
> Kent> handling all bios we couldn't get by with just one per bio_set. We'd
> Kent> only need one to make forward progress so the rest could be allocated
> Kent> on demand (i.e. what the workqueue code does) but that sounds like it's
> Kent> starting to get expensive.
>
> Maybe we need to limit the size of BIOs to that of the bottom-most
> underlying device instead? Or maybe limit BIOs to some small
> multiple? As you stack up DM targets one on top of each other, they
> should respect the limits of the underlying devices and pass those
> limits up the chain.

That's the approach the block layer currently tries to take. It's
brittle, tricky and inefficient, and it completely breaks down when the
limits are dynamic - so things like dm and bcache are always going to
have to split anyways.

2012-08-29 21:16:54

by John Stoffel

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

>>>>> "Kent" == Kent Overstreet <[email protected]> writes:

Kent> On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
>> It's also instructive to remember why the code is the way it is: it used
>> to process bios for underlying devices immediately, but this sometimes
>> meant too much recursive stack growth. If a per-device rescuer thread
>> is to be made available (as well as the mempool), the option of
>> reinstating recursion is there too - only punting to workqueue when the
>> stack actually becomes "too big". (Also bear in mind that some dm
>> targets may have dependencies on their own mempools - submission can
>> block there too.) I find it helpful only to consider splitting into two
>> pieces - it must always be possible to process the first piece (i.e.
>> process it at the next layer down in the stack) and complete it
>> independently of what happens to the second piece (which might require
>> further splitting and block until the first piece has completed).

Kent> I'm sure it could be made to work (and it may well simpler), but it
Kent> seems problematic from a performance pov.

Kent> With stacked devices you'd then have to switch stacks on _every_ bio.
Kent> That could be made fast enough I'm sure, but it wouldn't be free and I
Kent> don't know of any existing code in the kernel that implements what we'd
Kent> need (though if you know how you'd go about doing that, I'd love to
Kent> know! Would be useful for other things).

Kent> The real problem is that because we'd need these extra stacks for
Kent> handling all bios we couldn't get by with just one per bio_set. We'd
Kent> only need one to make forward progress so the rest could be allocated
Kent> on demand (i.e. what the workqueue code does) but that sounds like it's
Kent> starting to get expensive.

Maybe we need to limit the size of BIOs to that of the bottom-most
underlying device instead? Or maybe limit BIOs to some small
multiple? As you stack up DM targets one on top of each other, they
should respect the limits of the underlying devices and pass those
limits up the chain.

Or maybe I'm speaking gibberish...

John

2012-08-30 22:08:03

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:

[..]
> > Performance aside, punting submission to per device worker in case of deep
> > stack usage sounds cleaner solution to me.
>
> Agreed, but performance tends to matter in the real world. And either
> way the tricky bits are going to be confined to a few functions, so I
> don't think it matters that much.
>
> If someone wants to code up the workqueue version and test it, they're
> more than welcome...

Here is one quick and dirty proof-of-concept patch. It checks the stack
depth, and if the remaining space is less than 20% of the stack size, it
defers the bio submission to a per-queue worker.

Thanks
Vivek


---
block/blk-core.c | 171 ++++++++++++++++++++++++++++++++++------------
block/blk-sysfs.c | 1
include/linux/blk_types.h | 1
include/linux/blkdev.h | 8 ++
4 files changed, 138 insertions(+), 43 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h 2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/include/linux/blkdev.h 2012-09-01 18:09:58.805577658 -0400
@@ -430,6 +430,14 @@ struct request_queue {
/* Throttle data */
struct throtl_data *td;
#endif
+
+ /*
+ * Bio submission to queue can be deferred to a workqueue if stack
+ * usage of submitter is high.
+ */
+ struct bio_list deferred_bios;
+ struct work_struct deferred_bio_work;
+ struct workqueue_struct *deferred_bio_workqueue;
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c 2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-core.c 2012-09-02 00:34:55.204091269 -0400
@@ -211,6 +211,23 @@ static void blk_delay_work(struct work_s
spin_unlock_irq(q->queue_lock);
}

+static void blk_deferred_bio_work(struct work_struct *work)
+{
+ struct request_queue *q;
+ struct bio *bio = NULL;
+
+ q = container_of(work, struct request_queue, deferred_bio_work);
+
+ do {
+ spin_lock_irq(q->queue_lock);
+ bio = bio_list_pop(&q->deferred_bios);
+ spin_unlock_irq(q->queue_lock);
+ if (!bio)
+ break;
+ generic_make_request(bio);
+ } while (1);
+}
+
/**
* blk_delay_queue - restart queueing after defined interval
* @q: The &struct request_queue in question
@@ -289,6 +306,7 @@ void blk_sync_queue(struct request_queue
{
del_timer_sync(&q->timeout);
cancel_delayed_work_sync(&q->delay_work);
+ cancel_work_sync(&q->deferred_bio_work);
}
EXPORT_SYMBOL(blk_sync_queue);

@@ -351,6 +369,29 @@ void blk_put_queue(struct request_queue
EXPORT_SYMBOL(blk_put_queue);

/**
+ * blk_drain_deferred_bios - drain deferred bios
+ * @q: request_queue to drain deferred bios for
+ *
+ * Dispatch all currently deferred bios on @q through ->make_request_fn().
+ */
+static void blk_drain_deferred_bios(struct request_queue *q)
+{
+ struct bio_list bl;
+ struct bio *bio;
+ unsigned long flags;
+
+ bio_list_init(&bl);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ bio_list_merge(&bl, &q->deferred_bios);
+ bio_list_init(&q->deferred_bios);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ while ((bio = bio_list_pop(&bl)))
+ generic_make_request(bio);
+}
+
+/**
* blk_drain_queue - drain requests from request_queue
* @q: queue to drain
* @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
@@ -358,6 +399,10 @@ EXPORT_SYMBOL(blk_put_queue);
* Drain requests from @q. If @drain_all is set, all requests are drained.
* If not, only ELVPRIV requests are drained. The caller is responsible
* for ensuring that no new requests which need to be drained are queued.
+ *
+ * Note: It does not drain bios on q->deferred_bios list.
+ * Call blk_drain_deferred_bios() if need be.
+ *
*/
void blk_drain_queue(struct request_queue *q, bool drain_all)
{
@@ -505,6 +550,9 @@ void blk_cleanup_queue(struct request_qu
spin_unlock_irq(lock);
mutex_unlock(&q->sysfs_lock);

+ /* First drain all deferred bios. */
+ blk_drain_deferred_bios(q);
+
/* drain all requests queued before DEAD marking */
blk_drain_queue(q, true);

@@ -614,11 +662,19 @@ struct request_queue *blk_alloc_queue_no
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);

- if (blkcg_init_queue(q))
+ bio_list_init(&q->deferred_bios);
+ INIT_WORK(&q->deferred_bio_work, blk_deferred_bio_work);
+ q->deferred_bio_workqueue = alloc_workqueue("kdeferbiod", WQ_MEM_RECLAIM, 0);
+ if (!q->deferred_bio_workqueue)
goto fail_id;

+ if (blkcg_init_queue(q))
+ goto fail_deferred_bio_wq;
+
return q;

+fail_deferred_bio_wq:
+ destroy_workqueue(q->deferred_bio_workqueue);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_q:
@@ -1635,8 +1691,10 @@ static inline int bio_check_eod(struct b
return 0;
}

+
+
static noinline_for_stack bool
-generic_make_request_checks(struct bio *bio)
+generic_make_request_checks_early(struct bio *bio)
{
struct request_queue *q;
int nr_sectors = bio_sectors(bio);
@@ -1715,9 +1773,6 @@ generic_make_request_checks(struct bio *
*/
create_io_context(GFP_ATOMIC, q->node);

- if (blk_throtl_bio(q, bio))
- return false; /* throttled, will be resubmitted later */
-
trace_block_bio_queue(q, bio);
return true;

@@ -1726,6 +1781,56 @@ end_io:
return false;
}

+static noinline_for_stack bool
+generic_make_request_checks_late(struct bio *bio)
+{
+ struct request_queue *q;
+
+ q = bdev_get_queue(bio->bi_bdev);
+
+ /*
+ * Various block parts want %current->io_context and lazy ioc
+ * allocation ends up trading a lot of pain for a small amount of
+ * memory. Just allocate it upfront. This may fail and block
+ * layer knows how to live with it.
+ */
+ create_io_context(GFP_ATOMIC, q->node);
+
+ if (blk_throtl_bio(q, bio))
+ return false; /* throttled, will be resubmitted later */
+
+ return true;
+}
+
+static void __generic_make_request(struct bio *bio)
+{
+ struct request_queue *q;
+
+ if (!generic_make_request_checks_late(bio))
+ return;
+ q = bdev_get_queue(bio->bi_bdev);
+ q->make_request_fn(q, bio);
+}
+
+static void generic_make_request_defer_bio(struct bio *bio)
+{
+ struct request_queue *q;
+ unsigned long flags;
+
+ q = bdev_get_queue(bio->bi_bdev);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ if (unlikely(blk_queue_dead(q))) {
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ bio_endio(bio, -ENODEV);
+ return;
+ }
+ set_bit(BIO_DEFERRED, &bio->bi_flags);
+ bio_list_add(&q->deferred_bios, bio);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ queue_work(q->deferred_bio_workqueue, &q->deferred_bio_work);
+}
+
/**
* generic_make_request - hand a buffer to its device driver for I/O
* @bio: The bio describing the location in memory and on the device.
@@ -1752,51 +1857,31 @@ end_io:
*/
void generic_make_request(struct bio *bio)
{
- struct bio_list bio_list_on_stack;
+ unsigned long sp = 0;
+ unsigned int threshold = (THREAD_SIZE * 2)/10;

- if (!generic_make_request_checks(bio))
- return;
+ BUG_ON(bio->bi_next);

- /*
- * We only want one ->make_request_fn to be active at a time, else
- * stack usage with stacked devices could be a problem. So use
- * current->bio_list to keep a list of requests submited by a
- * make_request_fn function. current->bio_list is also used as a
- * flag to say if generic_make_request is currently active in this
- * task or not. If it is NULL, then no make_request is active. If
- * it is non-NULL, then a make_request is active, and new requests
- * should be added at the tail
- */
- if (current->bio_list) {
- bio_list_add(current->bio_list, bio);
+ /* Submitteing deferred bio from worker context. */
+ if (bio_flagged(bio, BIO_DEFERRED)) {
+ clear_bit(BIO_DEFERRED, &bio->bi_flags);
+ __generic_make_request(bio);
return;
}

- /* following loop may be a bit non-obvious, and so deserves some
- * explanation.
- * Before entering the loop, bio->bi_next is NULL (as all callers
- * ensure that) so we have a list with a single bio.
- * We pretend that we have just taken it off a longer list, so
- * we assign bio_list to a pointer to the bio_list_on_stack,
- * thus initialising the bio_list of new bios to be
- * added. ->make_request() may indeed add some more bios
- * through a recursive call to generic_make_request. If it
- * did, we find a non-NULL value in bio_list and re-enter the loop
- * from the top. In this case we really did just take the bio
- * of the top of the list (no pretending) and so remove it from
- * bio_list, and call into ->make_request() again.
- */
- BUG_ON(bio->bi_next);
- bio_list_init(&bio_list_on_stack);
- current->bio_list = &bio_list_on_stack;
- do {
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ if (!generic_make_request_checks_early(bio))
+ return;

- q->make_request_fn(q, bio);
+ /*
+ * FIXME. Provide an arch dependent function to return left stack
+ * space for current task. This is hack for x86_64.
+ */
+ asm volatile("movq %%rsp,%0" : "=m"(sp));

- bio = bio_list_pop(current->bio_list);
- } while (bio);
- current->bio_list = NULL; /* deactivate */
+ if ((sp - (unsigned long)end_of_stack(current)) < threshold)
+ generic_make_request_defer_bio(bio);
+ else
+ __generic_make_request(bio);
}
EXPORT_SYMBOL(generic_make_request);

Index: linux-2.6/block/blk-sysfs.c
===================================================================
--- linux-2.6.orig/block/blk-sysfs.c 2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-sysfs.c 2012-09-01 18:09:58.808577661 -0400
@@ -505,6 +505,7 @@ static void blk_release_queue(struct kob

ida_simple_remove(&blk_queue_ida, q->id);
kmem_cache_free(blk_requestq_cachep, q);
+ destroy_workqueue(q->deferred_bio_workqueue);
}

static const struct sysfs_ops queue_sysfs_ops = {
Index: linux-2.6/include/linux/blk_types.h
===================================================================
--- linux-2.6.orig/include/linux/blk_types.h 2012-09-02 00:34:17.607086696 -0400
+++ linux-2.6/include/linux/blk_types.h 2012-09-02 00:34:21.997087104 -0400
@@ -105,6 +105,7 @@ struct bio {
#define BIO_FS_INTEGRITY 9 /* fs owns integrity data, not block layer */
#define BIO_QUIET 10 /* Make BIO Quiet */
#define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_DEFERRED 12 /* Bio was deferred for submission by worker */
#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

/*

2012-08-31 01:44:35

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
>
> [..]
> > > Performance aside, punting submission to per device worker in case of deep
> > > stack usage sounds cleaner solution to me.
> >
> > Agreed, but performance tends to matter in the real world. And either
> > way the tricky bits are going to be confined to a few functions, so I
> > don't think it matters that much.
> >
> > If someone wants to code up the workqueue version and test it, they're
> > more than welcome...
>
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

I can't think of any correctness issues. I see some stuff that could be
simplified (blk_drain_deferred_bios() is redundant, just make it a
wrapper around blk_deferred_bio_work()).
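
i.e. something like this (sketch):

static void blk_drain_deferred_bios(struct request_queue *q)
{
	/* the work function already pops and resubmits everything */
	blk_deferred_bio_work(&q->deferred_bio_work);
}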

Still skeptical about the performance impact, though - frankly, on some
of the hardware I've been running bcache on this would be a visible
performance regression - probably double digit percentages but I'd have
to benchmark it. That kind of hardware/usage is not normal today,
but I've put a lot of work into performance and I don't want to make
things worse without good reason.

Have you tested/benchmarked it?

There's scheduling behaviour, too. We really want the workqueue thread's
cpu time to be charged to the process that submitted the bio. (We could
use a mechanism like that in other places, too... not like this is a new
issue).

This is going to be a real issue for users that need strong isolation -
for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
breaking that (not that it wasn't broken already, but this makes it
worse).

I could be convinced, but right now I prefer my solution.

2012-08-31 01:55:51

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Thu, Aug 30, 2012 at 6:43 PM, Kent Overstreet <[email protected]> wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
>> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
>>
>> [..]
>> > > Performance aside, punting submission to per device worker in case of deep
>> > > stack usage sounds cleaner solution to me.
>> >
>> > Agreed, but performance tends to matter in the real world. And either
>> > way the tricky bits are going to be confined to a few functions, so I
>> > don't think it matters that much.
>> >
>> > If someone wants to code up the workqueue version and test it, they're
>> > more than welcome...
>>
>> Here is one quick and dirty proof of concept patch. It checks for stack
>> depth and if remaining space is less than 20% of stack size, then it
>> defers the bio submission to per queue worker.
>
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deferred_bio_work()).
>
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it. That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.

Here's another crazy idea - we don't really need another thread, just
more stack space.

We could check if we're running out of stack space, and if we are, just
allocate another two pages and memcpy the struct thread_info over.

I think the main obstacle is that we'd need some per-arch code for
mucking with the stack pointer. And it'd break backtraces, but that's
fixable.

2012-08-31 15:02:16

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

On Thu, Aug 30, 2012 at 06:43:59PM -0700, Kent Overstreet wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> >
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > >
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > >
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> >
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
>
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deffered_bio_work()).
>
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it. That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.

Would you like to give this patch a quick try with bcache on your
hardware and see how much performance impact you see?

Given that submission through the worker happens only when stack usage
is high, that should reduce the impact of the patch, and common use
cases should remain unaffected.

>
> Have you tested/benchmarked it?

No, I have not. I will run some simple workloads on SSD.

>
> There's scheduling behaviour, too. We really want the workqueue thread's
> cpu time to be charged to the process that submitted the bio. (We could
> use a mechanism like that in other places, too... not like this is a new
> issue).
>
> This is going to be a real issue for users that need strong isolation -
> for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
> breaking that (not that it wasn't broken already, but this makes it
> worse).

There are so many places in the kernel where worker threads do work on
behalf of other processes. I think this is really a minor concern and I
would not be too worried about it.

What really is concerning, though, is the greater stack usage due to the
recursive nature of make_request() and the performance impact of
deferring to a worker thread.

Thanks
Vivek

2012-09-01 02:13:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] block: Avoid deadlocks with bio allocation by stacking drivers

Hello, Vivek.

On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

So, it removes breadth-first walking of bio construction: instead, stack
overflow is prevented by bouncing submission to a workqueue whenever
stack usage looks too high.

I do like removal of breadth-first walking. It makes failure
scenarios a lot less mind-bending. That said, Kent is right that this
can incur significant overhead for certain configurations, and looking
at stack usage in the block layer is rather nasty both in design and
implementation.

If we're gonna need a rescuer anyway, and we can get it right and the
mechanism can be contained in block proper relatively well, I think it
would be better to make breadth-first walking safe. Both are nasty in
their own ways, after all.

Thanks.

--
tejun

2012-09-01 02:17:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v7 6/9] block: Consolidate bio_alloc_bioset(), bio_kmalloc()

On Tue, Aug 28, 2012 at 03:03:32PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 01:41:48PM -0700, Tejun Heo wrote:
> > Hello, Kent.
> >
> > On Tue, Aug 28, 2012 at 10:37:33AM -0700, Kent Overstreet wrote:
> > > v7: Re-add dropped comments, improve patch description
> >
> > I don't think you did the former part. Other than that looks good to
> > me.
>
> I folded them into the bio_alloc_bioset() comments - so all the
> information that was there previously should still be there, just
> centralized. That good enough for your acked-by?

I think it would still be better to at least refer to
bio_alloc_bioset() from bio_alloc*()'s comments.
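
(Something along these lines in the wrappers would do - a sketch of the
kind of comment meant, not the actual text of the patch:)

/**
 * bio_alloc - allocate a bio for I/O, backed by fs_bio_set
 * @gfp_mask:	allocation mask to use
 * @nr_iovecs:	number of iovecs to pre-allocate
 *
 * Allocation is handled by bio_alloc_bioset() - see the comment over that
 * function for the full story on gfp semantics, front_pad and failure modes.
 */
static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
{
	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
}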

Thanks.

--
tejun

2012-09-01 02:19:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v7 7/9] block: Add bio_clone_bioset(), bio_clone_kmalloc()

Hello,

On Tue, Aug 28, 2012 at 03:05:32PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 01:44:01PM -0700, Tejun Heo wrote:
> > On Tue, Aug 28, 2012 at 10:37:34AM -0700, Kent Overstreet wrote:
> > > +static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
> > > +{
> > > + return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
> > > +}
> > > +
> > ...
> > > +static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
> > > +{
> > > + return bio_clone_bioset(bio, gfp_mask, NULL);
> > > +
> > > +}
> >
> > Do we really need these wrappers? I'd prefer requiring users to
> > explicit choose @bioset when cloning.
>
> bio_clone() is an existing API. I agree, but I'd prefer to convert
> existing users in a separate patch, and when I do that I want to spend
> some time actually looking at the existing code instead of doing the
> conversion blindly (at least some of the existing users are incorrect
> and I'll have to add bio_sets for them).
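
(For illustration, converting one of those users would look roughly like
this - the my_dev driver, its clone_bs field and the pool size are all
hypothetical:)

struct my_dev {
	struct bio_set *clone_bs;	/* private pool, so clones stop depending on fs_bio_set */
};

static int my_dev_init(struct my_dev *dev)
{
	dev->clone_bs = bioset_create(4, 0);	/* pool size / front_pad picked arbitrarily */
	return dev->clone_bs ? 0 : -ENOMEM;
}

static struct bio *my_dev_clone(struct my_dev *dev, struct bio *bio)
{
	/* was: bio_clone(bio, GFP_NOIO), which allocates from the shared fs_bio_set */
	return bio_clone_bioset(bio, GFP_NOIO, dev->clone_bs);
}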

Aren't there like three users in the kernel? If you wanna clean it up
later, that's fine too, but I don't think it would make much difference
either way.

Thanks.

--
tejun

2012-09-01 02:23:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v7 3/9] block: Add bio_reset()

Hello,

On Tue, Aug 28, 2012 at 03:17:15PM -0700, Kent Overstreet wrote:
> > Better to explain why some bio fields are re-ordered and why that
> > shouldn't make things worse cacheline-wise?
>
> Well it may (struct bio is what, 3 or 4 cachelines now?) but even on
> ridiculous million iop devices struct bio just isn't hot enough for this
> kind of stuff to show up in the profiles... and I've done enough
> profiling to be pretty confident of that.
>
> So um, if there was anything to say performance wise I would, but to me
> it seems more that there isn't.
>
> (pruning struct bio would have more of an effect performance wise, which
> you know is something I'd like to do.)

Yeah, please put something like the above in the patch description.

> > > + unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);
> > > +
> > > + if (bio_integrity(bio))
> > > + bio_integrity_free(bio, bio->bi_pool);
> > > +
> > > + bio_disassociate_task(bio);
> >
> > Is this desirable? Why?
>
> *twitch* I should've thought more when I made that change.
>
> It occurs to me now that we specifically talked about that and decided
> to do it the other way - when I changed that I was just looking at
> bio_free() and decided they needed to be symmetrical.
>
> I still think they should be symmetrical, but if that's true bi_ioc and
> bi_css need to be moved, and also bio_disassociate_task() should be
> getting called from bio_free(), not bio_put().
>
> Were you the one that added that call? I know you've been working on
> that area of the code recently. Sticking it in bio_put() instead of
> bio_free() seems odd to be, and they're completely equivalent now that
> bio_free() is only called from bio_put() (save one instance I should
> probably fix).

Maybe I botched the symmetry, but anyways I *suspect* it probably would
be better to keep css association across bio_reset(), given the current
usages of both mechanisms. css association indicates the ownership of
the bio, which isn't likely to change while recycling the bio.
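
(For reference, the reset being discussed has roughly this shape - a
sketch reconstructed from the fragment quoted above rather than the exact
patch, with BIO_RESET_BYTES assumed to mark how much of struct bio gets
zeroed:)

void bio_reset(struct bio *bio)
{
	/* flags at or above BIO_RESET_BITS survive the reset */
	unsigned long flags = bio->bi_flags & (~0UL << BIO_RESET_BITS);

	if (bio_integrity(bio))
		bio_integrity_free(bio, bio->bi_pool);

	/* the call in question - drops the bio's css/ioc association */
	bio_disassociate_task(bio);

	/* everything below BIO_RESET_BYTES in struct bio gets zeroed */
	memset(bio, 0, BIO_RESET_BYTES);
	bio->bi_flags = flags | (1 << BIO_UPTODATE);
}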

Thanks.

--
tejun