Here is the start of the DM changes I've queued for 4.2 (via
linux-dm.git's for-next branch).
I'm posting them to the lists mainly to get feedback on the block
changes (first 4 patches) and to publish Joe's dm-thinp discard
improvements which depend on the new blkdev_issue_discard_async
interface.
Jens, if all looks OK please pick up the first 4 block patches. The
"block: remove management of bi_remaining when restoring original
bi_end_io" has been tested in terms of DM. I haven't tested btrfs,
bcache, etc.
All review/feedback is welcome. Thanks!
Mike
Christoph Hellwig (1):
block, dm: don't copy bios for request clones
Joe Thornber (5):
block: factor out blkdev_issue_discard_async
dm btree: add dm_btree_remove_leaves()
dm thin metadata: add dm_thin_find_mapped_range()
dm thin metadata: add dm_thin_remove_range()
dm thin: range discard support
Mike Snitzer (8):
block: remove management of bi_remaining when restoring original bi_end_io
block: remove export for blk_queue_bio
dm: do not allocate any mempools for blk-mq request-based DM
dm: rename methods that requeue requests
dm: factor out a common cleanup_mapped_device()
dm thin: cleanup overwrite's endio restore to be centralized
dm thin: cleanup schedule_zero() to read more logically
dm thin metadata: remove in-core 'read_only' flag
block/bio-integrity.c | 4 +-
block/bio.c | 20 --
block/blk-core.c | 99 +-----
block/blk-lib.c | 57 ++-
drivers/md/bcache/io.c | 2 +-
drivers/md/dm-cache-target.c | 6 -
drivers/md/dm-raid1.c | 2 -
drivers/md/dm-snap.c | 1 -
drivers/md/dm-table.c | 29 +-
drivers/md/dm-thin-metadata.c | 117 +++++-
drivers/md/dm-thin-metadata.h | 11 +
drivers/md/dm-thin.c | 493 +++++++++++++++++---------
drivers/md/dm-verity.c | 2 +-
drivers/md/dm.c | 314 +++++++---------
drivers/md/dm.h | 5 +-
drivers/md/persistent-data/dm-block-manager.c | 6 +
drivers/md/persistent-data/dm-block-manager.h | 1 +
drivers/md/persistent-data/dm-btree-remove.c | 127 +++++++
drivers/md/persistent-data/dm-btree.h | 9 +
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/volumes.c | 16 +-
fs/btrfs/volumes.h | 2 -
include/linux/bio.h | 1 -
include/linux/blk_types.h | 3 +
include/linux/blkdev.h | 11 +-
25 files changed, 806 insertions(+), 534 deletions(-)
--
2.3.2 (Apple Git-55)
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
The bio_inc_remaining() interface has been left exported because it is
still useful for more elaborate uses of bio_chain() -- it will be used
in an upcoming DM commit "dm thin: range discard support".
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Signed-off-by: Mike Snitzer <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Chris Mason <[email protected]>
---
block/bio-integrity.c | 4 ++--
block/bio.c | 20 --------------------
drivers/md/bcache/io.c | 2 +-
drivers/md/dm-cache-target.c | 6 ------
drivers/md/dm-raid1.c | 2 --
drivers/md/dm-snap.c | 1 -
drivers/md/dm-thin.c | 9 +++------
drivers/md/dm-verity.c | 2 +-
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/volumes.c | 16 +++++-----------
fs/btrfs/volumes.h | 2 --
include/linux/bio.h | 1 -
12 files changed, 13 insertions(+), 54 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5cbd5d9..0436c21 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
}
/**
@@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
*/
if (error) {
bio->bi_end_io = bip->bip_end_io;
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
return;
}
diff --git a/block/bio.c b/block/bio.c
index c2ff8a8..3690f1c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1809,26 +1809,6 @@ void bio_endio(struct bio *bio, int error)
EXPORT_SYMBOL(bio_endio);
/**
- * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
- * @bio: bio
- * @error: error, if any
- *
- * For code that has saved and restored bi_end_io; thing hard before using this
- * function, probably you should've cloned the entire bio.
- **/
-void bio_endio_nodec(struct bio *bio, int error)
-{
- /*
- * If it's not flagged as a chain, we are not going to dec the count
- */
- if (bio_flagged(bio, BIO_CHAIN))
- bio_inc_remaining(bio);
-
- bio_endio(bio, error);
-}
-EXPORT_SYMBOL(bio_endio_nodec);
-
-/**
* bio_split - split a bio
* @bio: bio to split
* @sectors: number of sectors to split from the front of @bio
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..cb64e64a 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
s->bio->bi_end_io = s->bi_end_io;
s->bio->bi_private = s->bi_private;
- bio_endio_nodec(s->bio, 0);
+ bio_endio(s->bio, 0);
closure_debug_destroy(&s->cl);
mempool_free(s, s->p->bio_split_hook);
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 705eb7b..41b2594 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
{
bio->bi_end_io = h->bi_end_io;
bio->bi_private = h->bi_private;
-
- /*
- * Must bump bi_remaining to allow bio to complete with
- * restored bi_end_io.
- */
- bio_inc_remaining(bio);
}
/*----------------------------------------------------------------*/
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index d6a1c09..743fa9b 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
dm_bio_restore(bd, bio);
bio_record->details.bi_bdev = NULL;
- bio_inc_remaining(bio);
-
queue_bio(ms, bio, rw);
return DM_ENDIO_INCOMPLETE;
}
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 8bfeae2..7c82d3c 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1478,7 +1478,6 @@ out:
if (full_bio) {
full_bio->bi_end_io = pe->full_bio_end_io;
full_bio->bi_private = pe->full_bio_private;
- bio_inc_remaining(full_bio);
}
increment_pending_exceptions_done_count();
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 342dbda..e852602c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
{
- if (m->bio) {
+ if (m->bio)
m->bio->bi_end_io = m->saved_bi_end_io;
- bio_inc_remaining(m->bio);
- }
+
cell_error(m->tc->pool, m->cell);
list_del(&m->list);
mempool_free(m, m->tc->pool->mapping_pool);
@@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
int r;
bio = m->bio;
- if (bio) {
+ if (bio)
bio->bi_end_io = m->saved_bi_end_io;
- bio_inc_remaining(bio);
- }
if (m->err) {
cell_error(pool, m->cell);
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 66616db..bb9c6a0 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
bio->bi_end_io = io->orig_bi_end_io;
bio->bi_private = io->orig_bi_private;
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
}
static void verity_work(struct work_struct *w)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..fbdbee4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
bio->bi_private = end_io_wq->private;
bio->bi_end_io = end_io_wq->end_io;
kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
}
static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e8d1d1..dac77d4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
{
- if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
- bio_endio_nodec(bio, err);
- else
- bio_endio(bio, err);
+ bio->bi_private = bbio->private;
+ bio->bi_end_io = bbio->end_io;
+ bio_endio(bio, err);
+
btrfs_put_bbio(bbio);
}
@@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
bio = bbio->orig_bio;
}
- bio->bi_private = bbio->private;
- bio->bi_end_io = bbio->end_io;
btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
/* only send an error to the higher layers if it is
* beyond the tolerance of the btrfs bio
@@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
/* Shoud be the original bio. */
WARN_ON(bio != bbio->orig_bio);
- bio->bi_private = bbio->private;
- bio->bi_end_io = bbio->end_io;
btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
bio->bi_iter.bi_sector = logical >> 9;
@@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
if (dev_nr < total_devs - 1) {
bio = btrfs_bio_clone(first_bio, GFP_NOFS);
BUG_ON(!bio); /* -ENOMEM */
- } else {
+ } else
bio = first_bio;
- bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
- }
submit_stripe_bio(root, bbio, bio,
bbio->stripes[dev_nr].physical, dev_nr, rw,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc3133..cedae03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
struct btrfs_bio;
typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
-#define BTRFS_BIO_ORIG_BIO_SUBMITTED (1 << 0)
-
struct btrfs_bio {
atomic_t refs;
atomic_t stripes_pending;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7486ea1..fe0a88d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
}
extern void bio_endio(struct bio *, int);
-extern void bio_endio_nodec(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);
--
2.3.2 (Apple Git-55)
With commit ff36ab345 ("dm: remove request-based logic from
make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
so remove its export. Doing so required a forward declaration in
blk-core.c.
Signed-off-by: Mike Snitzer <[email protected]>
---
block/blk-core.c | 5 +++--
include/linux/blkdev.h | 2 --
2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 96b24b0..69035cb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -735,6 +735,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
}
EXPORT_SYMBOL(blk_init_queue_node);
+static void blk_queue_bio(struct request_queue *q, struct bio *bio);
+
struct request_queue *
blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
spinlock_t *lock)
@@ -1588,7 +1590,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
blk_rq_bio_prep(req->q, req, bio);
}
-void blk_queue_bio(struct request_queue *q, struct bio *bio)
+static void blk_queue_bio(struct request_queue *q, struct bio *bio)
{
const bool sync = !!(bio->bi_rw & REQ_SYNC);
struct blk_plug *plug;
@@ -1696,7 +1698,6 @@ out_unlock:
spin_unlock_irq(q->queue_lock);
}
}
-EXPORT_SYMBOL_GPL(blk_queue_bio); /* for device mapper only */
/*
* If bio->bi_dev is a partition, remap the location
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2da818a..c8cb30e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -792,8 +792,6 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
-extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
-
/*
* A queue has just exitted congestion. Note this in the global counter of
* congested queues, and wake up anyone who was waiting for requests to be
--
2.3.2 (Apple Git-55)
From: Christoph Hellwig <[email protected]>
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
---
block/blk-core.c | 94 +++----------------------
drivers/md/dm-table.c | 25 ++++---
drivers/md/dm.c | 171 +++++++++++-----------------------------------
drivers/md/dm.h | 5 +-
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 6 +-
6 files changed, 73 insertions(+), 230 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 69035cb..8605dce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(blk_rq_init);
static void req_bio_endio(struct request *rq, struct bio *bio,
unsigned int nbytes, int error)
{
- if (error)
+ if (error && !(rq->cmd_flags & REQ_CLONE))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
error = -EIO;
@@ -128,7 +128,8 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
bio_advance(bio, nbytes);
/* don't actually finish bio if it's part of flush sequence */
- if (bio->bi_iter.bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+ if (bio->bi_iter.bi_size == 0 &&
+ !(rq->cmd_flags & (REQ_FLUSH_SEQ|REQ_CLONE)))
bio_endio(bio, error);
}
@@ -2914,95 +2915,22 @@ int blk_lld_busy(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_lld_busy);
-/**
- * blk_rq_unprep_clone - Helper function to free all bios in a cloned request
- * @rq: the clone request to be cleaned up
- *
- * Description:
- * Free all bios in @rq for a cloned request.
- */
-void blk_rq_unprep_clone(struct request *rq)
-{
- struct bio *bio;
-
- while ((bio = rq->bio) != NULL) {
- rq->bio = bio->bi_next;
-
- bio_put(bio);
- }
-}
-EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
-
-/*
- * Copy attributes of the original request to the clone request.
- * The actual data parts (e.g. ->cmd, ->sense) are not copied.
- */
-static void __blk_rq_prep_clone(struct request *dst, struct request *src)
+void blk_rq_prep_clone(struct request *dst, struct request *src)
{
dst->cpu = src->cpu;
- dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+ dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK);
+ dst->cmd_flags |= REQ_NOMERGE | REQ_CLONE;
dst->cmd_type = src->cmd_type;
dst->__sector = blk_rq_pos(src);
dst->__data_len = blk_rq_bytes(src);
dst->nr_phys_segments = src->nr_phys_segments;
dst->ioprio = src->ioprio;
dst->extra_len = src->extra_len;
-}
-
-/**
- * blk_rq_prep_clone - Helper function to setup clone request
- * @rq: the request to be setup
- * @rq_src: original request to be cloned
- * @bs: bio_set that bios for clone are allocated from
- * @gfp_mask: memory allocation mask for bio
- * @bio_ctr: setup function to be called for each clone bio.
- * Returns %0 for success, non %0 for failure.
- * @data: private data to be passed to @bio_ctr
- *
- * Description:
- * Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
- * The actual data parts of @rq_src (e.g. ->cmd, ->sense)
- * are not copied, and copying such parts is the caller's responsibility.
- * Also, pages which the original bios are pointing to are not copied
- * and the cloned bios just point same pages.
- * So cloned bios must be completed before original bios, which means
- * the caller must complete @rq before @rq_src.
- */
-int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
- struct bio_set *bs, gfp_t gfp_mask,
- int (*bio_ctr)(struct bio *, struct bio *, void *),
- void *data)
-{
- struct bio *bio, *bio_src;
-
- if (!bs)
- bs = fs_bio_set;
-
- __rq_for_each_bio(bio_src, rq_src) {
- bio = bio_clone_fast(bio_src, gfp_mask, bs);
- if (!bio)
- goto free_and_out;
-
- if (bio_ctr && bio_ctr(bio, bio_src, data))
- goto free_and_out;
-
- if (rq->bio) {
- rq->biotail->bi_next = bio;
- rq->biotail = bio;
- } else
- rq->bio = rq->biotail = bio;
- }
-
- __blk_rq_prep_clone(rq, rq_src);
-
- return 0;
-
-free_and_out:
- if (bio)
- bio_put(bio);
- blk_rq_unprep_clone(rq);
-
- return -ENOMEM;
+ dst->bio = src->bio;
+ dst->biotail = src->biotail;
+ dst->cmd = src->cmd;
+ dst->cmd_len = src->cmd_len;
+ dst->sense = src->sense;
}
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d9b00b8..3662b2e 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -940,21 +940,28 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
{
unsigned type = dm_table_get_type(t);
unsigned per_bio_data_size = 0;
- struct dm_target *tgt;
unsigned i;
- if (unlikely(type == DM_TYPE_NONE)) {
+ switch (type) {
+ case DM_TYPE_BIO_BASED:
+ for (i = 0; i < t->num_targets; i++) {
+ struct dm_target *tgt = t->targets + i;
+
+ per_bio_data_size = max(per_bio_data_size,
+ tgt->per_bio_data_size);
+ }
+ t->mempools = dm_alloc_bio_mempools(t->integrity_supported,
+ per_bio_data_size);
+ break;
+ case DM_TYPE_REQUEST_BASED:
+ case DM_TYPE_MQ_REQUEST_BASED:
+ t->mempools = dm_alloc_rq_mempools(md, type);
+ break;
+ default:
DMWARN("no table type is set, can't allocate mempools");
return -EINVAL;
}
- if (type == DM_TYPE_BIO_BASED)
- for (i = 0; i < t->num_targets; i++) {
- tgt = t->targets + i;
- per_bio_data_size = max(per_bio_data_size, tgt->per_bio_data_size);
- }
-
- t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_bio_data_size);
if (!t->mempools)
return -ENOMEM;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a930b72..38837f8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -990,57 +990,6 @@ static void clone_endio(struct bio *bio, int error)
dec_pending(io, error);
}
-/*
- * Partial completion handling for request-based dm
- */
-static void end_clone_bio(struct bio *clone, int error)
-{
- struct dm_rq_clone_bio_info *info =
- container_of(clone, struct dm_rq_clone_bio_info, clone);
- struct dm_rq_target_io *tio = info->tio;
- struct bio *bio = info->orig;
- unsigned int nr_bytes = info->orig->bi_iter.bi_size;
-
- bio_put(clone);
-
- if (tio->error)
- /*
- * An error has already been detected on the request.
- * Once error occurred, just let clone->end_io() handle
- * the remainder.
- */
- return;
- else if (error) {
- /*
- * Don't notice the error to the upper layer yet.
- * The error handling decision is made by the target driver,
- * when the request is completed.
- */
- tio->error = error;
- return;
- }
-
- /*
- * I/O for the bio successfully completed.
- * Notice the data completion to the upper layer.
- */
-
- /*
- * bios are processed from the head of the list.
- * So the completing bio should always be rq->bio.
- * If it's not, something wrong is happening.
- */
- if (tio->orig->bio != bio)
- DMERR("bio completion is going in the middle of the request");
-
- /*
- * Update the original request.
- * Do not use blk_end_request() here, because it may complete
- * the original request before the clone, and break the ordering.
- */
- blk_update_request(tio->orig, 0, nr_bytes);
-}
-
static struct dm_rq_target_io *tio_from_request(struct request *rq)
{
return (rq->q->mq_ops ? blk_mq_rq_to_pdu(rq) : rq->special);
@@ -1089,8 +1038,6 @@ static void free_rq_clone(struct request *clone, bool must_be_mapped)
WARN_ON_ONCE(must_be_mapped && !clone->q);
- blk_rq_unprep_clone(clone);
-
if (md->type == DM_TYPE_MQ_REQUEST_BASED)
/* stacked on blk-mq queue(s) */
tio->ti->type->release_clone_rq(clone);
@@ -1821,39 +1768,13 @@ static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
dm_complete_request(rq, r);
}
-static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
- void *data)
+static void setup_clone(struct request *clone, struct request *rq,
+ struct dm_rq_target_io *tio)
{
- struct dm_rq_target_io *tio = data;
- struct dm_rq_clone_bio_info *info =
- container_of(bio, struct dm_rq_clone_bio_info, clone);
-
- info->orig = bio_orig;
- info->tio = tio;
- bio->bi_end_io = end_clone_bio;
-
- return 0;
-}
-
-static int setup_clone(struct request *clone, struct request *rq,
- struct dm_rq_target_io *tio, gfp_t gfp_mask)
-{
- int r;
-
- r = blk_rq_prep_clone(clone, rq, tio->md->bs, gfp_mask,
- dm_rq_bio_constructor, tio);
- if (r)
- return r;
-
- clone->cmd = rq->cmd;
- clone->cmd_len = rq->cmd_len;
- clone->sense = rq->sense;
+ blk_rq_prep_clone(clone, rq);
clone->end_io = end_clone_request;
clone->end_io_data = tio;
-
tio->clone = clone;
-
- return 0;
}
static struct request *clone_rq(struct request *rq, struct mapped_device *md,
@@ -1874,12 +1795,7 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
clone = tio->clone;
blk_rq_init(NULL, clone);
- if (setup_clone(clone, rq, tio, gfp_mask)) {
- /* -ENOMEM */
- if (alloc_clone)
- free_clone_request(md, clone);
- return NULL;
- }
+ setup_clone(clone, rq, tio);
return clone;
}
@@ -1973,11 +1889,7 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
}
if (IS_ERR(clone))
return DM_MAPIO_REQUEUE;
- if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
- /* -ENOMEM */
- ti->type->release_clone_rq(clone);
- return DM_MAPIO_REQUEUE;
- }
+ setup_clone(clone, rq, tio);
}
switch (r) {
@@ -2431,8 +2343,6 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
goto out;
}
- BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
-
md->io_pool = p->io_pool;
p->io_pool = NULL;
md->rq_pool = p->rq_pool;
@@ -3536,48 +3446,23 @@ int dm_noflush_suspending(struct dm_target *ti)
}
EXPORT_SYMBOL_GPL(dm_noflush_suspending);
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
- unsigned integrity, unsigned per_bio_data_size)
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+ unsigned per_bio_data_size)
{
- struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL);
- struct kmem_cache *cachep = NULL;
- unsigned int pool_size = 0;
+ struct dm_md_mempools *pools;
+ unsigned int pool_size = dm_get_reserved_bio_based_ios();
unsigned int front_pad;
+ pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
return NULL;
- type = filter_md_type(type, md);
+ front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) +
+ offsetof(struct dm_target_io, clone);
- switch (type) {
- case DM_TYPE_BIO_BASED:
- cachep = _io_cache;
- pool_size = dm_get_reserved_bio_based_ios();
- front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
- break;
- case DM_TYPE_REQUEST_BASED:
- cachep = _rq_tio_cache;
- pool_size = dm_get_reserved_rq_based_ios();
- pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
- if (!pools->rq_pool)
- goto out;
- /* fall through to setup remaining rq-based pools */
- case DM_TYPE_MQ_REQUEST_BASED:
- if (!pool_size)
- pool_size = dm_get_reserved_rq_based_ios();
- front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
- /* per_bio_data_size is not used. See __bind_mempools(). */
- WARN_ON(per_bio_data_size != 0);
- break;
- default:
- BUG();
- }
-
- if (cachep) {
- pools->io_pool = mempool_create_slab_pool(pool_size, cachep);
- if (!pools->io_pool)
- goto out;
- }
+ pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
+ if (!pools->io_pool)
+ goto out;
pools->bs = bioset_create_nobvec(pool_size, front_pad);
if (!pools->bs)
@@ -3587,10 +3472,34 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
goto out;
return pools;
-
out:
dm_free_md_mempools(pools);
+ return NULL;
+}
+
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
+ unsigned type)
+{
+ unsigned int pool_size = dm_get_reserved_rq_based_ios();
+ struct dm_md_mempools *pools;
+ pools = kzalloc(sizeof(*pools), GFP_KERNEL);
+ if (!pools)
+ return NULL;
+
+ if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
+ pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+ if (!pools->rq_pool)
+ goto out;
+ }
+
+ pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
+ if (!pools->io_pool)
+ goto out;
+
+ return pools;
+out:
+ dm_free_md_mempools(pools);
return NULL;
}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6123c2b..e6e66d0 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -222,8 +222,9 @@ void dm_kcopyd_exit(void);
/*
* Mempool operations
*/
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
- unsigned integrity, unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+ unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md, unsigned type);
void dm_free_md_mempools(struct dm_md_mempools *pools);
/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cc2e85a..745588b 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -195,6 +195,7 @@ enum rq_flag_bits {
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
__REQ_NO_TIMEOUT, /* requests may never expire */
+ __REQ_CLONE, /* cloned bios */
__REQ_NR_BITS, /* stops here */
};
@@ -249,5 +250,6 @@ enum rq_flag_bits {
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
#define REQ_NO_TIMEOUT (1ULL << __REQ_NO_TIMEOUT)
+#define REQ_CLONE (1ULL << __REQ_CLONE)
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c8cb30e..cd14e2d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -775,11 +775,7 @@ extern void blk_add_request_payload(struct request *rq, struct page *page,
unsigned int len);
extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
extern int blk_lld_busy(struct request_queue *q);
-extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
- struct bio_set *bs, gfp_t gfp_mask,
- int (*bio_ctr)(struct bio *, struct bio *, void *),
- void *data);
-extern void blk_rq_unprep_clone(struct request *rq);
+extern void blk_rq_prep_clone(struct request *rq, struct request *rq_src);
extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern void blk_delay_queue(struct request_queue *, unsigned long);
--
2.3.2 (Apple Git-55)
From: Joe Thornber <[email protected]>
Useful for callers who wish to manage the async completion of discard(s)
without explicitly blocking waiting for completion.
blkdev_issue_discard() is updated to call blkdev_issue_discard_async()
and DM thinp will make use of blkdev_issue_discard_async() in the
upcoming "dm thin: range discard support" commit.
Signed-off-by: Joe Thornber <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
---
block/blk-lib.c | 57 +++++++++++++++++++++++++++++++++++++----------
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 3 +++
3 files changed, 49 insertions(+), 12 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..2815973 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,26 +26,36 @@ static void bio_batch_end_io(struct bio *bio, int err)
bio_put(bio);
}
+static void bio_batch_set_completion(struct bio *bio, void *data)
+{
+ struct bio_batch *bb = data;
+
+ bio->bi_end_io = bio_batch_end_io;
+ bio->bi_private = data;
+ atomic_inc(&bb->done);
+}
+
/**
- * blkdev_issue_discard - queue a discard
+ * blkdev_issue_discard_async - queue a discard with async completion
* @bdev: blockdev to issue discard for
* @sector: start sector
* @nr_sects: number of sectors to discard
* @gfp_mask: memory allocation flags (for bio_alloc)
* @flags: BLKDEV_IFL_* flags to control behaviour
+ * @set_completion: callback to set completion mechanism for discard bios
+ * @data: callback data passed to @set_completion
*
* Description:
* Issue a discard request for the sectors in question.
*/
-int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
- sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
+int blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
+ bio_discard_completion_t set_completion, void *data)
{
- DECLARE_COMPLETION_ONSTACK(wait);
struct request_queue *q = bdev_get_queue(bdev);
int type = REQ_WRITE | REQ_DISCARD;
unsigned int max_discard_sectors, granularity;
int alignment;
- struct bio_batch bb;
struct bio *bio;
int ret = 0;
struct blk_plug plug;
@@ -77,10 +87,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
type |= REQ_SECURE;
}
- atomic_set(&bb.done, 1);
- bb.flags = 1 << BIO_UPTODATE;
- bb.wait = &wait;
-
blk_start_plug(&plug);
while (nr_sects) {
unsigned int req_sects;
@@ -108,16 +114,15 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
req_sects = end_sect - sector;
}
+ set_completion(bio, data);
+
bio->bi_iter.bi_sector = sector;
- bio->bi_end_io = bio_batch_end_io;
bio->bi_bdev = bdev;
- bio->bi_private = &bb;
bio->bi_iter.bi_size = req_sects << 9;
nr_sects -= req_sects;
sector = end_sect;
- atomic_inc(&bb.done);
submit_bio(type, bio);
/*
@@ -129,6 +134,34 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
cond_resched();
}
blk_finish_plug(&plug);
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_discard_async);
+
+/**
+ * blkdev_issue_discard - queue a discard
+ * @bdev: blockdev to issue discard for
+ * @sector: start sector
+ * @nr_sects: number of sectors to discard
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ * @flags: BLKDEV_IFL_* flags to control behaviour
+ *
+ * Description:
+ * Issue a discard request for the sectors in question.
+ */
+int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
+{
+ int ret;
+ DECLARE_COMPLETION_ONSTACK(wait);
+ struct bio_batch bb;
+
+ atomic_set(&bb.done, 1);
+ bb.flags = 1 << BIO_UPTODATE;
+ bb.wait = &wait;
+
+ ret = blkdev_issue_discard_async(bdev, sector, nr_sects, gfp_mask, flags,
+ bio_batch_set_completion, &bb);
/* Wait for bios in-flight */
if (!atomic_dec_and_test(&bb.done))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 745588b..a0c00e8 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -16,6 +16,7 @@ struct io_context;
struct cgroup_subsys_state;
typedef void (bio_end_io_t) (struct bio *, int);
typedef void (bio_destructor_t) (struct bio *);
+typedef void (bio_discard_completion_t) (struct bio *, void *);
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cd14e2d..16beac3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1128,6 +1128,9 @@ static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
+extern int blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
+ bio_discard_completion_t set_completion, void *data);
extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct page *page);
extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
--
2.3.2 (Apple Git-55)
On Thu 14-05-15 17:04:59, Mike Snitzer wrote:
> Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> non-chains") regressed all existing callers that followed this pattern:
> 1) saving a bio's original bi_end_io
> 2) wiring up an intermediate bi_end_io
> 3) restoring the original bi_end_io from intermediate bi_end_io
> 4) calling bio_endio() to execute the restored original bi_end_io
>
> The regression was due to BIO_CHAIN only ever getting set if
> bio_inc_remaining() is called. For the above pattern it isn't set until
> step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
> the first bio_endio(), in step 2 above, never decremented __bi_remaining
> before calling the intermediate bi_end_io -- leaving __bi_remaining with
> the value 1 instead of 0. When bio_inc_remaining() occurred during step
> 3 it brought it to a value of 2. When the second bio_endio() was
> called, in step 4 above, it should've called the original bi_end_io but
> it didn't because there was an extra reference that wasn't dropped (due
> to atomic operations being optimized away since BIO_CHAIN wasn't set
> upfront).
>
> Fix this issue by removing the __bi_remaining management complexity for
> all callers that use the above pattern -- bio_chain() is the only
> interface that _needs_ to be concerned with __bi_remaining. For the
> above pattern callers just expect the bi_end_io they set to get called!
> Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> that aren't associated with the bio_chain() interface.
>
> The bio_inc_remaining() interface has been left exported because it is
> still useful for more elaborate uses of bio_chain() -- it will be used
> in an upcoming DM commit "dm thin: range discard support".
>
> Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> Signed-off-by: Mike Snitzer <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Chris Mason <[email protected]>
One question: What happens if you stack dm-thin on top of e.g. dm-linear?
dm-thin will do it's thing to a bio and passes it to dm-linear. That will
split & chain the bio so BIO_CHAIN will be set. And on IO completion you
will have troubles in dm-thinp as now bi_remaining gets decremented in
bio_endio(). That's the reason why I suggested that we should clear
BIO_CHAIN once bi_remaining hits zero...
Honza
> ---
> block/bio-integrity.c | 4 ++--
> block/bio.c | 20 --------------------
> drivers/md/bcache/io.c | 2 +-
> drivers/md/dm-cache-target.c | 6 ------
> drivers/md/dm-raid1.c | 2 --
> drivers/md/dm-snap.c | 1 -
> drivers/md/dm-thin.c | 9 +++------
> drivers/md/dm-verity.c | 2 +-
> fs/btrfs/disk-io.c | 2 +-
> fs/btrfs/volumes.c | 16 +++++-----------
> fs/btrfs/volumes.h | 2 --
> include/linux/bio.h | 1 -
> 12 files changed, 13 insertions(+), 54 deletions(-)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 5cbd5d9..0436c21 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
>
> /* Restore original bio completion handler */
> bio->bi_end_io = bip->bip_end_io;
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
> }
>
> /**
> @@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
> */
> if (error) {
> bio->bi_end_io = bip->bip_end_io;
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
>
> return;
> }
> diff --git a/block/bio.c b/block/bio.c
> index c2ff8a8..3690f1c 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1809,26 +1809,6 @@ void bio_endio(struct bio *bio, int error)
> EXPORT_SYMBOL(bio_endio);
>
> /**
> - * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
> - * @bio: bio
> - * @error: error, if any
> - *
> - * For code that has saved and restored bi_end_io; thing hard before using this
> - * function, probably you should've cloned the entire bio.
> - **/
> -void bio_endio_nodec(struct bio *bio, int error)
> -{
> - /*
> - * If it's not flagged as a chain, we are not going to dec the count
> - */
> - if (bio_flagged(bio, BIO_CHAIN))
> - bio_inc_remaining(bio);
> -
> - bio_endio(bio, error);
> -}
> -EXPORT_SYMBOL(bio_endio_nodec);
> -
> -/**
> * bio_split - split a bio
> * @bio: bio to split
> * @sectors: number of sectors to split from the front of @bio
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index fa028fa..cb64e64a 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
>
> s->bio->bi_end_io = s->bi_end_io;
> s->bio->bi_private = s->bi_private;
> - bio_endio_nodec(s->bio, 0);
> + bio_endio(s->bio, 0);
>
> closure_debug_destroy(&s->cl);
> mempool_free(s, s->p->bio_split_hook);
> diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
> index 705eb7b..41b2594 100644
> --- a/drivers/md/dm-cache-target.c
> +++ b/drivers/md/dm-cache-target.c
> @@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
> {
> bio->bi_end_io = h->bi_end_io;
> bio->bi_private = h->bi_private;
> -
> - /*
> - * Must bump bi_remaining to allow bio to complete with
> - * restored bi_end_io.
> - */
> - bio_inc_remaining(bio);
> }
>
> /*----------------------------------------------------------------*/
> diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
> index d6a1c09..743fa9b 100644
> --- a/drivers/md/dm-raid1.c
> +++ b/drivers/md/dm-raid1.c
> @@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
> dm_bio_restore(bd, bio);
> bio_record->details.bi_bdev = NULL;
>
> - bio_inc_remaining(bio);
> -
> queue_bio(ms, bio, rw);
> return DM_ENDIO_INCOMPLETE;
> }
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 8bfeae2..7c82d3c 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1478,7 +1478,6 @@ out:
> if (full_bio) {
> full_bio->bi_end_io = pe->full_bio_end_io;
> full_bio->bi_private = pe->full_bio_private;
> - bio_inc_remaining(full_bio);
> }
> increment_pending_exceptions_done_count();
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 342dbda..e852602c 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
>
> static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
> {
> - if (m->bio) {
> + if (m->bio)
> m->bio->bi_end_io = m->saved_bi_end_io;
> - bio_inc_remaining(m->bio);
> - }
> +
> cell_error(m->tc->pool, m->cell);
> list_del(&m->list);
> mempool_free(m, m->tc->pool->mapping_pool);
> @@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
> int r;
>
> bio = m->bio;
> - if (bio) {
> + if (bio)
> bio->bi_end_io = m->saved_bi_end_io;
> - bio_inc_remaining(bio);
> - }
>
> if (m->err) {
> cell_error(pool, m->cell);
> diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
> index 66616db..bb9c6a0 100644
> --- a/drivers/md/dm-verity.c
> +++ b/drivers/md/dm-verity.c
> @@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
> bio->bi_end_io = io->orig_bi_end_io;
> bio->bi_private = io->orig_bi_private;
>
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
> }
>
> static void verity_work(struct work_struct *w)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..fbdbee4 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
> bio->bi_private = end_io_wq->private;
> bio->bi_end_io = end_io_wq->end_io;
> kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
> }
>
> static int cleaner_kthread(void *arg)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8e8d1d1..dac77d4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
>
> static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
> {
> - if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
> - bio_endio_nodec(bio, err);
> - else
> - bio_endio(bio, err);
> + bio->bi_private = bbio->private;
> + bio->bi_end_io = bbio->end_io;
> + bio_endio(bio, err);
> +
> btrfs_put_bbio(bbio);
> }
>
> @@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
> bio = bbio->orig_bio;
> }
>
> - bio->bi_private = bbio->private;
> - bio->bi_end_io = bbio->end_io;
> btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
> /* only send an error to the higher layers if it is
> * beyond the tolerance of the btrfs bio
> @@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
> /* Shoud be the original bio. */
> WARN_ON(bio != bbio->orig_bio);
>
> - bio->bi_private = bbio->private;
> - bio->bi_end_io = bbio->end_io;
> btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
> bio->bi_iter.bi_sector = logical >> 9;
>
> @@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
> if (dev_nr < total_devs - 1) {
> bio = btrfs_bio_clone(first_bio, GFP_NOFS);
> BUG_ON(!bio); /* -ENOMEM */
> - } else {
> + } else
> bio = first_bio;
> - bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
> - }
>
> submit_stripe_bio(root, bbio, bio,
> bbio->stripes[dev_nr].physical, dev_nr, rw,
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ebc3133..cedae03 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
> struct btrfs_bio;
> typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
>
> -#define BTRFS_BIO_ORIG_BIO_SUBMITTED (1 << 0)
> -
> struct btrfs_bio {
> atomic_t refs;
> atomic_t stripes_pending;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 7486ea1..fe0a88d 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
> }
>
> extern void bio_endio(struct bio *, int);
> -extern void bio_endio_nodec(struct bio *, int);
> struct request_queue;
> extern int bio_phys_segments(struct request_queue *, struct bio *);
>
> --
> 2.3.2 (Apple Git-55)
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
In general this looks good. But as Jan mentioned you need to
clear BIO_CHAIN when bi_remaining reaches zero, and I'd really
prefer if bio_inc_remaining wuld not be left exported and folded
into bio_chain so that we prevent new abuses from showing up
and keep the code centralized.
On Thu, May 14, 2015 at 05:05:02PM -0400, Mike Snitzer wrote:
> From: Joe Thornber <[email protected]>
>
> Useful for callers who wish to manage the async completion of discard(s)
> without explicitly blocking waiting for completion.
>
> blkdev_issue_discard() is updated to call blkdev_issue_discard_async()
> and DM thinp will make use of blkdev_issue_discard_async() in the
> upcoming "dm thin: range discard support" commit.
I think this is the wrong level of interface. I think dm should just
submit the bios directly, which will also allow it to use bio_chain
properly instead of needing the inc_remaining hack. Instead export
helpers that properly split up the discard chunk sectors without
touching the bio itself. And with bio split on demand work even
that will hopefully go away soon.
On Mon, May 18 2015 at 3:22am -0400,
Jan Kara <[email protected]> wrote:
> On Thu 14-05-15 17:04:59, Mike Snitzer wrote:
> > Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> > non-chains") regressed all existing callers that followed this pattern:
> > 1) saving a bio's original bi_end_io
> > 2) wiring up an intermediate bi_end_io
> > 3) restoring the original bi_end_io from intermediate bi_end_io
> > 4) calling bio_endio() to execute the restored original bi_end_io
> >
> > The regression was due to BIO_CHAIN only ever getting set if
> > bio_inc_remaining() is called. For the above pattern it isn't set until
> > step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
> > the first bio_endio(), in step 2 above, never decremented __bi_remaining
> > before calling the intermediate bi_end_io -- leaving __bi_remaining with
> > the value 1 instead of 0. When bio_inc_remaining() occurred during step
> > 3 it brought it to a value of 2. When the second bio_endio() was
> > called, in step 4 above, it should've called the original bi_end_io but
> > it didn't because there was an extra reference that wasn't dropped (due
> > to atomic operations being optimized away since BIO_CHAIN wasn't set
> > upfront).
> >
> > Fix this issue by removing the __bi_remaining management complexity for
> > all callers that use the above pattern -- bio_chain() is the only
> > interface that _needs_ to be concerned with __bi_remaining. For the
> > above pattern callers just expect the bi_end_io they set to get called!
> > Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> > that aren't associated with the bio_chain() interface.
> >
> > The bio_inc_remaining() interface has been left exported because it is
> > still useful for more elaborate uses of bio_chain() -- it will be used
> > in an upcoming DM commit "dm thin: range discard support".
> >
> > Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> > Signed-off-by: Mike Snitzer <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: Chris Mason <[email protected]>
> One question: What happens if you stack dm-thin on top of e.g. dm-linear?
> dm-thin will do it's thing to a bio and passes it to dm-linear. That will
> split & chain the bio so BIO_CHAIN will be set. And on IO completion you
> will have troubles in dm-thinp as now bi_remaining gets decremented in
> bio_endio(). That's the reason why I suggested that we should clear
> BIO_CHAIN once bi_remaining hits zero...
I think you need to be more precise in explaining the scenario you're
concerned about. Could be there is an issue but I'm not seeing it yet.
Are you referring to the patch that makes DM thinp use the proposed
blkdev_issue_discard_async() interface? The bios issued to DM linear
are generated by blkdev_issue_discard_async(). By using bio_chain()
they establish ancestory with the parent DM thinp bio (which has
had BIO_CHAIN set even before calling blkdev_issue_discard_async because
there is potential for DM thinp to complete the parent bio before all N
blkdev_issue_discard_async() generated bios complete -- so that is why
DM thinp itself takes an extra reference on the parent bio using
bio_inc_remaining() before calling blkdev_issue_discard_async)
In practice I'm not having any problems with the
device-mapper-test-suite's thin-provisioning tests, e.g.:
dmtest run --suite thin-provisioning -n /discard/
(dmts happily uses dm-linear devices and stacks DM thinp ontop.)
Mike
On Mon, May 18 2015 at 4:24am -0400,
Christoph Hellwig <[email protected]> wrote:
> In general this looks good. But as Jan mentioned you need to
> clear BIO_CHAIN when bi_remaining reaches zero
OK I just replied to Jan -- I'm not yet understanding why. That said, I
also don't have a problem with doing what you guys are asking. I'd just
like to understand the problem you're forseeing because in practice I'm
not hitting it in testing.
> and I'd really prefer if bio_inc_remaining wuld not be left exported
> and folded into bio_chain so that we prevent new abuses from showing
> up and keep the code centralized.
Your desire to make bio_inc_remaining() private is noted but I think the
proposed blkdev_issue_discard_async() is useful. In the context of DM
thinp's use of blkdev_issue_discard_async(): yes it is (ab)using
bio_chain() and bio_inc_remaining() to setup the async IO completion
scheme but it makes for a pretty clean solution to the problem of
wanting to have an async interface for discard.
On Mon, May 18 2015 at 4:27am -0400,
Christoph Hellwig <[email protected]> wrote:
> On Thu, May 14, 2015 at 05:05:02PM -0400, Mike Snitzer wrote:
> > From: Joe Thornber <[email protected]>
> >
> > Useful for callers who wish to manage the async completion of discard(s)
> > without explicitly blocking waiting for completion.
> >
> > blkdev_issue_discard() is updated to call blkdev_issue_discard_async()
> > and DM thinp will make use of blkdev_issue_discard_async() in the
> > upcoming "dm thin: range discard support" commit.
>
> I think this is the wrong level of interface. I think dm should just
> submit the bios directly, which will also allow it to use bio_chain
> properly instead of needing the inc_remaining hack. Instead export
> helpers that properly split up the discard chunk sectors without
> touching the bio itself. And with bio split on demand work even
> that will hopefully go away soon.
The proposed blkdev_issue_discard_async interface allows DM (or any
caller) to not have to concern itself with how discard(s) gets issued.
It leaves all the details of how large a discard can be, etc to block
core. The entire point of doing things this way is to _not_ pollute DM
with code that breaks up a discard into N bios based on the discard
limits of the underlying device.
What you're suggesting sounds a lot like having DM open code
blkdev_issue_discard() -- blkdev_issue_discard_async() was engineered to
avoid that completely.
I hope we can reach consensus on this because as it stands I know Jens
will be less inclined to take this blkdev_issue_discard_async() change
given your early disapproval. Which basically pretty much screws me up
for the coming merge window.. I'm OK with that (and exploring
alternatives) but I _really_ hope you've explored this carefully (not
getting that vibe yet given your suggestion appears to be "open code all
of blkdev_issue_discard in DM").
Mike
On Mon 18-05-15 09:13:59, Mike Snitzer wrote:
> On Mon, May 18 2015 at 3:22am -0400,
> Jan Kara <[email protected]> wrote:
>
> > On Thu 14-05-15 17:04:59, Mike Snitzer wrote:
> > > Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> > > non-chains") regressed all existing callers that followed this pattern:
> > > 1) saving a bio's original bi_end_io
> > > 2) wiring up an intermediate bi_end_io
> > > 3) restoring the original bi_end_io from intermediate bi_end_io
> > > 4) calling bio_endio() to execute the restored original bi_end_io
> > >
> > > The regression was due to BIO_CHAIN only ever getting set if
> > > bio_inc_remaining() is called. For the above pattern it isn't set until
> > > step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
> > > the first bio_endio(), in step 2 above, never decremented __bi_remaining
> > > before calling the intermediate bi_end_io -- leaving __bi_remaining with
> > > the value 1 instead of 0. When bio_inc_remaining() occurred during step
> > > 3 it brought it to a value of 2. When the second bio_endio() was
> > > called, in step 4 above, it should've called the original bi_end_io but
> > > it didn't because there was an extra reference that wasn't dropped (due
> > > to atomic operations being optimized away since BIO_CHAIN wasn't set
> > > upfront).
> > >
> > > Fix this issue by removing the __bi_remaining management complexity for
> > > all callers that use the above pattern -- bio_chain() is the only
> > > interface that _needs_ to be concerned with __bi_remaining. For the
> > > above pattern callers just expect the bi_end_io they set to get called!
> > > Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> > > that aren't associated with the bio_chain() interface.
> > >
> > > The bio_inc_remaining() interface has been left exported because it is
> > > still useful for more elaborate uses of bio_chain() -- it will be used
> > > in an upcoming DM commit "dm thin: range discard support".
> > >
> > > Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> > > Signed-off-by: Mike Snitzer <[email protected]>
> > > Cc: Jan Kara <[email protected]>
> > > Cc: Chris Mason <[email protected]>
> > One question: What happens if you stack dm-thin on top of e.g. dm-linear?
> > dm-thin will do it's thing to a bio and passes it to dm-linear. That will
> > split & chain the bio so BIO_CHAIN will be set. And on IO completion you
> > will have troubles in dm-thinp as now bi_remaining gets decremented in
> > bio_endio(). That's the reason why I suggested that we should clear
> > BIO_CHAIN once bi_remaining hits zero...
>
> I think you need to be more precise in explaining the scenario you're
> concerned about. Could be there is an issue but I'm not seeing it yet.
>
> Are you referring to the patch that makes DM thinp use the proposed
> blkdev_issue_discard_async() interface? The bios issued to DM linear
> are generated by blkdev_issue_discard_async(). By using bio_chain()
> they establish ancestory with the parent DM thinp bio (which has
> had BIO_CHAIN set even before calling blkdev_issue_discard_async because
> there is potential for DM thinp to complete the parent bio before all N
> blkdev_issue_discard_async() generated bios complete -- so that is why
> DM thinp itself takes an extra reference on the parent bio using
> bio_inc_remaining() before calling blkdev_issue_discard_async)
No, I'm not referring to your proposed interface. I'm referring to
current kernel + your patch to remove bio_inc_remaining() from all the dm
targets. Ah, after checking again I see where misunderstanding may have
come from - the device below has to be handled by drivers/md/linear.c which
is MD linear driver, not DM one. I confused those two. Anyway here is the
failure I envision (and frankly, I don't understand dm details much so I may
be just completely wrong but I'd like to understand what prevents the following
from happening):
* We have dm-thin stacked on top of drivers/dm/linear.c
* FS issues bio to dm-thin. remap_and_issue_overwrite() sets bi_end_io to
overwrite_endio. dm-thin eventually calls generic_make_request(bio).
* Now linear_make_request() gets called and it ends up calling
bio_chain(split, bio). This sets BIO_CHAIN on bio.
* IO for all chained bios is completed. So bio->bi_remaining is now zero,
bio still has BIO_CHAIN set and overwrite_endio gets called.
* process_prepared_mapping() will eventually try to call original bi_end_io
callback but that never happens because bi_remaining is 0 and BIO_CHAIN
remained set.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, May 18 2015 at 11:36am -0400,
Jan Kara <[email protected]> wrote:
> No, I'm not referring to your proposed interface. I'm referring to
> current kernel + your patch to remove bio_inc_remaining() from all the dm
> targets. Ah, after checking again I see where misunderstanding may have
> come from - the device below has to be handled by drivers/md/linear.c which
> is MD linear driver, not DM one. I confused those two. Anyway here is the
> failure I envision (and frankly, I don't understand dm details much so I may
> be just completely wrong but I'd like to understand what prevents the following
> from happening):
> * We have dm-thin stacked on top of drivers/dm/linear.c
> * FS issues bio to dm-thin. remap_and_issue_overwrite() sets bi_end_io to
> overwrite_endio. dm-thin eventually calls generic_make_request(bio).
> * Now linear_make_request() gets called and it ends up calling
> bio_chain(split, bio). This sets BIO_CHAIN on bio.
> * IO for all chained bios is completed. So bio->bi_remaining is now zero,
> bio still has BIO_CHAIN set and overwrite_endio gets called.
> * process_prepared_mapping() will eventually try to call original bi_end_io
> callback but that never happens because bi_remaining is 0 and BIO_CHAIN
> remained set.
Makes sense, you have a valid concern (I knew you and hch must, I just
didn't understand).
I'll clear the BIO_CHAIN like you suggest and post v2.
Thanks!
On Mon, May 18, 2015 at 09:32:23AM -0400, Mike Snitzer wrote:
> The proposed blkdev_issue_discard_async interface allows DM (or any
> caller) to not have to concern itself with how discard(s) gets issued.
>
> It leaves all the details of how large a discard can be, etc to block
> core. The entire point of doing things this way is to _not_ pollute DM
> with code that breaks up a discard into N bios based on the discard
> limits of the underlying device.
>
> What you're suggesting sounds a lot like having DM open code
> blkdev_issue_discard() -- blkdev_issue_discard_async() was engineered to
> avoid that completely.
Parts of it anyway. The splitting logic can still be factored into
helpers to keep the nasty details out of DM. But except for that I
think async discards should be handled exactly like async reads, writes
or flushes.
And besides that generic high level sentiment I think the interface
for blkdev_issue_discard_async is simply wrong. Either you want to keep
the internals private and just expose a completion callback that gets
your private data and an error, or you want to build your own bios as
suggested above. But not one that is mostly opaque except for allowing
the caller to hook into the submission process and thus taking over I/O
completion.
On Mon, May 18 2015 at 12:17pm -0400,
Christoph Hellwig <[email protected]> wrote:
> On Mon, May 18, 2015 at 09:32:23AM -0400, Mike Snitzer wrote:
> > The proposed blkdev_issue_discard_async interface allows DM (or any
> > caller) to not have to concern itself with how discard(s) gets issued.
> >
> > It leaves all the details of how large a discard can be, etc to block
> > core. The entire point of doing things this way is to _not_ pollute DM
> > with code that breaks up a discard into N bios based on the discard
> > limits of the underlying device.
> >
> > What you're suggesting sounds a lot like having DM open code
> > blkdev_issue_discard() -- blkdev_issue_discard_async() was engineered to
> > avoid that completely.
>
> Parts of it anyway. The splitting logic can still be factored into
> helpers to keep the nasty details out of DM. But except for that I
> think async discards should be handled exactly like async reads, writes
> or flushes.
OK.
> And besides that generic high level sentiment I think the interface
> for blkdev_issue_discard_async is simply wrong. Either you want to keep
> the internals private and just expose a completion callback that gets
> your private data and an error, or you want to build your own bios as
> suggested above. But not one that is mostly opaque except for allowing
> the caller to hook into the submission process and thus taking over I/O
> completion.
I'll see what I can come up with.
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
changes in v2:
Care is now taken to clear the BIO_CHAIN flag once ->__bi_remaining
drops to 0. Also, the bio_inc_remaining() interface has moved local to
bio.c.
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Signed-off-by: Mike Snitzer <[email protected]>
---
block/bio-integrity.c | 4 ++--
block/bio.c | 35 ++++++++++++++---------------------
drivers/md/bcache/io.c | 2 +-
drivers/md/dm-cache-target.c | 6 ------
drivers/md/dm-raid1.c | 2 --
drivers/md/dm-snap.c | 1 -
drivers/md/dm-thin.c | 9 +++------
drivers/md/dm-verity.c | 2 +-
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/volumes.c | 16 +++++-----------
fs/btrfs/volumes.h | 2 --
include/linux/bio.h | 12 ------------
12 files changed, 27 insertions(+), 66 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5cbd5d9..0436c21 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
}
/**
@@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
*/
if (error) {
bio->bi_end_io = bip->bip_end_io;
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
return;
}
diff --git a/block/bio.c b/block/bio.c
index c2ff8a8..259197d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -303,6 +303,17 @@ static void bio_chain_endio(struct bio *bio, int error)
bio_put(bio);
}
+/*
+ * Increment chain count for the bio. Make sure the CHAIN flag update
+ * is visible before the raised count.
+ */
+static inline void bio_inc_remaining(struct bio *bio)
+{
+ bio->bi_flags |= (1 << BIO_CHAIN);
+ smp_mb__before_atomic();
+ atomic_inc(&bio->__bi_remaining);
+}
+
/**
* bio_chain - chain bio completions
* @bio: the target bio
@@ -1756,8 +1767,10 @@ static inline bool bio_remaining_done(struct bio *bio)
BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
- if (atomic_dec_and_test(&bio->__bi_remaining))
+ if (atomic_dec_and_test(&bio->__bi_remaining)) {
+ clear_bit(BIO_CHAIN, &bio->bi_flags);
return true;
+ }
return false;
}
@@ -1809,26 +1822,6 @@ void bio_endio(struct bio *bio, int error)
EXPORT_SYMBOL(bio_endio);
/**
- * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
- * @bio: bio
- * @error: error, if any
- *
- * For code that has saved and restored bi_end_io; thing hard before using this
- * function, probably you should've cloned the entire bio.
- **/
-void bio_endio_nodec(struct bio *bio, int error)
-{
- /*
- * If it's not flagged as a chain, we are not going to dec the count
- */
- if (bio_flagged(bio, BIO_CHAIN))
- bio_inc_remaining(bio);
-
- bio_endio(bio, error);
-}
-EXPORT_SYMBOL(bio_endio_nodec);
-
-/**
* bio_split - split a bio
* @bio: bio to split
* @sectors: number of sectors to split from the front of @bio
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..cb64e64a 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
s->bio->bi_end_io = s->bi_end_io;
s->bio->bi_private = s->bi_private;
- bio_endio_nodec(s->bio, 0);
+ bio_endio(s->bio, 0);
closure_debug_destroy(&s->cl);
mempool_free(s, s->p->bio_split_hook);
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 705eb7b..41b2594 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
{
bio->bi_end_io = h->bi_end_io;
bio->bi_private = h->bi_private;
-
- /*
- * Must bump bi_remaining to allow bio to complete with
- * restored bi_end_io.
- */
- bio_inc_remaining(bio);
}
/*----------------------------------------------------------------*/
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index d6a1c09..743fa9b 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
dm_bio_restore(bd, bio);
bio_record->details.bi_bdev = NULL;
- bio_inc_remaining(bio);
-
queue_bio(ms, bio, rw);
return DM_ENDIO_INCOMPLETE;
}
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 8bfeae2..7c82d3c 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1478,7 +1478,6 @@ out:
if (full_bio) {
full_bio->bi_end_io = pe->full_bio_end_io;
full_bio->bi_private = pe->full_bio_private;
- bio_inc_remaining(full_bio);
}
increment_pending_exceptions_done_count();
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 342dbda..e852602c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
{
- if (m->bio) {
+ if (m->bio)
m->bio->bi_end_io = m->saved_bi_end_io;
- bio_inc_remaining(m->bio);
- }
+
cell_error(m->tc->pool, m->cell);
list_del(&m->list);
mempool_free(m, m->tc->pool->mapping_pool);
@@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
int r;
bio = m->bio;
- if (bio) {
+ if (bio)
bio->bi_end_io = m->saved_bi_end_io;
- bio_inc_remaining(bio);
- }
if (m->err) {
cell_error(pool, m->cell);
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 66616db..bb9c6a0 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
bio->bi_end_io = io->orig_bi_end_io;
bio->bi_private = io->orig_bi_private;
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
}
static void verity_work(struct work_struct *w)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..fbdbee4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
bio->bi_private = end_io_wq->private;
bio->bi_end_io = end_io_wq->end_io;
kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
- bio_endio_nodec(bio, error);
+ bio_endio(bio, error);
}
static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e8d1d1..dac77d4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
{
- if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
- bio_endio_nodec(bio, err);
- else
- bio_endio(bio, err);
+ bio->bi_private = bbio->private;
+ bio->bi_end_io = bbio->end_io;
+ bio_endio(bio, err);
+
btrfs_put_bbio(bbio);
}
@@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
bio = bbio->orig_bio;
}
- bio->bi_private = bbio->private;
- bio->bi_end_io = bbio->end_io;
btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
/* only send an error to the higher layers if it is
* beyond the tolerance of the btrfs bio
@@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
/* Shoud be the original bio. */
WARN_ON(bio != bbio->orig_bio);
- bio->bi_private = bbio->private;
- bio->bi_end_io = bbio->end_io;
btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
bio->bi_iter.bi_sector = logical >> 9;
@@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
if (dev_nr < total_devs - 1) {
bio = btrfs_bio_clone(first_bio, GFP_NOFS);
BUG_ON(!bio); /* -ENOMEM */
- } else {
+ } else
bio = first_bio;
- bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
- }
submit_stripe_bio(root, bbio, bio,
bbio->stripes[dev_nr].physical, dev_nr, rw,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc3133..cedae03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
struct btrfs_bio;
typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
-#define BTRFS_BIO_ORIG_BIO_SUBMITTED (1 << 0)
-
struct btrfs_bio {
atomic_t refs;
atomic_t stripes_pending;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7486ea1..f0291cf 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
}
extern void bio_endio(struct bio *, int);
-extern void bio_endio_nodec(struct bio *, int);
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);
@@ -659,17 +658,6 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
}
/*
- * Increment chain count for the bio. Make sure the CHAIN flag update
- * is visible before the raised count.
- */
-static inline void bio_inc_remaining(struct bio *bio)
-{
- bio->bi_flags |= (1 << BIO_CHAIN);
- smp_mb__before_atomic();
- atomic_inc(&bio->__bi_remaining);
-}
-
-/*
* bio_set is used to allow other portions of the IO system to
* allocate their own private memory pools for bio and iovec structures.
* These memory pools in turn all allocate from the bio_slab
--
2.3.2 (Apple Git-55)
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
On Mon 18-05-15 16:40:17, Mike Snitzer wrote:
> Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> non-chains") regressed all existing callers that followed this pattern:
> 1) saving a bio's original bi_end_io
> 2) wiring up an intermediate bi_end_io
> 3) restoring the original bi_end_io from intermediate bi_end_io
> 4) calling bio_endio() to execute the restored original bi_end_io
>
> The regression was due to BIO_CHAIN only ever getting set if
> bio_inc_remaining() is called. For the above pattern it isn't set until
> step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
> the first bio_endio(), in step 2 above, never decremented __bi_remaining
> before calling the intermediate bi_end_io -- leaving __bi_remaining with
> the value 1 instead of 0. When bio_inc_remaining() occurred during step
> 3 it brought it to a value of 2. When the second bio_endio() was
> called, in step 4 above, it should've called the original bi_end_io but
> it didn't because there was an extra reference that wasn't dropped (due
> to atomic operations being optimized away since BIO_CHAIN wasn't set
> upfront).
>
> Fix this issue by removing the __bi_remaining management complexity for
> all callers that use the above pattern -- bio_chain() is the only
> interface that _needs_ to be concerned with __bi_remaining. For the
> above pattern callers just expect the bi_end_io they set to get called!
> Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> that aren't associated with the bio_chain() interface.
>
> changes in v2:
> Care is now taken to clear the BIO_CHAIN flag once ->__bi_remaining
> drops to 0. Also, the bio_inc_remaining() interface has moved local to
> bio.c.
>
> Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> Signed-off-by: Mike Snitzer <[email protected]>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> block/bio-integrity.c | 4 ++--
> block/bio.c | 35 ++++++++++++++---------------------
> drivers/md/bcache/io.c | 2 +-
> drivers/md/dm-cache-target.c | 6 ------
> drivers/md/dm-raid1.c | 2 --
> drivers/md/dm-snap.c | 1 -
> drivers/md/dm-thin.c | 9 +++------
> drivers/md/dm-verity.c | 2 +-
> fs/btrfs/disk-io.c | 2 +-
> fs/btrfs/volumes.c | 16 +++++-----------
> fs/btrfs/volumes.h | 2 --
> include/linux/bio.h | 12 ------------
> 12 files changed, 27 insertions(+), 66 deletions(-)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 5cbd5d9..0436c21 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
>
> /* Restore original bio completion handler */
> bio->bi_end_io = bip->bip_end_io;
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
> }
>
> /**
> @@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
> */
> if (error) {
> bio->bi_end_io = bip->bip_end_io;
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
>
> return;
> }
> diff --git a/block/bio.c b/block/bio.c
> index c2ff8a8..259197d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -303,6 +303,17 @@ static void bio_chain_endio(struct bio *bio, int error)
> bio_put(bio);
> }
>
> +/*
> + * Increment chain count for the bio. Make sure the CHAIN flag update
> + * is visible before the raised count.
> + */
> +static inline void bio_inc_remaining(struct bio *bio)
> +{
> + bio->bi_flags |= (1 << BIO_CHAIN);
> + smp_mb__before_atomic();
> + atomic_inc(&bio->__bi_remaining);
> +}
> +
> /**
> * bio_chain - chain bio completions
> * @bio: the target bio
> @@ -1756,8 +1767,10 @@ static inline bool bio_remaining_done(struct bio *bio)
>
> BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
>
> - if (atomic_dec_and_test(&bio->__bi_remaining))
> + if (atomic_dec_and_test(&bio->__bi_remaining)) {
> + clear_bit(BIO_CHAIN, &bio->bi_flags);
> return true;
> + }
>
> return false;
> }
> @@ -1809,26 +1822,6 @@ void bio_endio(struct bio *bio, int error)
> EXPORT_SYMBOL(bio_endio);
>
> /**
> - * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
> - * @bio: bio
> - * @error: error, if any
> - *
> - * For code that has saved and restored bi_end_io; thing hard before using this
> - * function, probably you should've cloned the entire bio.
> - **/
> -void bio_endio_nodec(struct bio *bio, int error)
> -{
> - /*
> - * If it's not flagged as a chain, we are not going to dec the count
> - */
> - if (bio_flagged(bio, BIO_CHAIN))
> - bio_inc_remaining(bio);
> -
> - bio_endio(bio, error);
> -}
> -EXPORT_SYMBOL(bio_endio_nodec);
> -
> -/**
> * bio_split - split a bio
> * @bio: bio to split
> * @sectors: number of sectors to split from the front of @bio
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index fa028fa..cb64e64a 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
>
> s->bio->bi_end_io = s->bi_end_io;
> s->bio->bi_private = s->bi_private;
> - bio_endio_nodec(s->bio, 0);
> + bio_endio(s->bio, 0);
>
> closure_debug_destroy(&s->cl);
> mempool_free(s, s->p->bio_split_hook);
> diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
> index 705eb7b..41b2594 100644
> --- a/drivers/md/dm-cache-target.c
> +++ b/drivers/md/dm-cache-target.c
> @@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
> {
> bio->bi_end_io = h->bi_end_io;
> bio->bi_private = h->bi_private;
> -
> - /*
> - * Must bump bi_remaining to allow bio to complete with
> - * restored bi_end_io.
> - */
> - bio_inc_remaining(bio);
> }
>
> /*----------------------------------------------------------------*/
> diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
> index d6a1c09..743fa9b 100644
> --- a/drivers/md/dm-raid1.c
> +++ b/drivers/md/dm-raid1.c
> @@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
> dm_bio_restore(bd, bio);
> bio_record->details.bi_bdev = NULL;
>
> - bio_inc_remaining(bio);
> -
> queue_bio(ms, bio, rw);
> return DM_ENDIO_INCOMPLETE;
> }
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 8bfeae2..7c82d3c 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1478,7 +1478,6 @@ out:
> if (full_bio) {
> full_bio->bi_end_io = pe->full_bio_end_io;
> full_bio->bi_private = pe->full_bio_private;
> - bio_inc_remaining(full_bio);
> }
> increment_pending_exceptions_done_count();
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 342dbda..e852602c 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
>
> static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
> {
> - if (m->bio) {
> + if (m->bio)
> m->bio->bi_end_io = m->saved_bi_end_io;
> - bio_inc_remaining(m->bio);
> - }
> +
> cell_error(m->tc->pool, m->cell);
> list_del(&m->list);
> mempool_free(m, m->tc->pool->mapping_pool);
> @@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
> int r;
>
> bio = m->bio;
> - if (bio) {
> + if (bio)
> bio->bi_end_io = m->saved_bi_end_io;
> - bio_inc_remaining(bio);
> - }
>
> if (m->err) {
> cell_error(pool, m->cell);
> diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
> index 66616db..bb9c6a0 100644
> --- a/drivers/md/dm-verity.c
> +++ b/drivers/md/dm-verity.c
> @@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
> bio->bi_end_io = io->orig_bi_end_io;
> bio->bi_private = io->orig_bi_private;
>
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
> }
>
> static void verity_work(struct work_struct *w)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..fbdbee4 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
> bio->bi_private = end_io_wq->private;
> bio->bi_end_io = end_io_wq->end_io;
> kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
> - bio_endio_nodec(bio, error);
> + bio_endio(bio, error);
> }
>
> static int cleaner_kthread(void *arg)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8e8d1d1..dac77d4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
>
> static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
> {
> - if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
> - bio_endio_nodec(bio, err);
> - else
> - bio_endio(bio, err);
> + bio->bi_private = bbio->private;
> + bio->bi_end_io = bbio->end_io;
> + bio_endio(bio, err);
> +
> btrfs_put_bbio(bbio);
> }
>
> @@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
> bio = bbio->orig_bio;
> }
>
> - bio->bi_private = bbio->private;
> - bio->bi_end_io = bbio->end_io;
> btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
> /* only send an error to the higher layers if it is
> * beyond the tolerance of the btrfs bio
> @@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
> /* Shoud be the original bio. */
> WARN_ON(bio != bbio->orig_bio);
>
> - bio->bi_private = bbio->private;
> - bio->bi_end_io = bbio->end_io;
> btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
> bio->bi_iter.bi_sector = logical >> 9;
>
> @@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
> if (dev_nr < total_devs - 1) {
> bio = btrfs_bio_clone(first_bio, GFP_NOFS);
> BUG_ON(!bio); /* -ENOMEM */
> - } else {
> + } else
> bio = first_bio;
> - bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
> - }
>
> submit_stripe_bio(root, bbio, bio,
> bbio->stripes[dev_nr].physical, dev_nr, rw,
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ebc3133..cedae03 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
> struct btrfs_bio;
> typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
>
> -#define BTRFS_BIO_ORIG_BIO_SUBMITTED (1 << 0)
> -
> struct btrfs_bio {
> atomic_t refs;
> atomic_t stripes_pending;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 7486ea1..f0291cf 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
> }
>
> extern void bio_endio(struct bio *, int);
> -extern void bio_endio_nodec(struct bio *, int);
> struct request_queue;
> extern int bio_phys_segments(struct request_queue *, struct bio *);
>
> @@ -659,17 +658,6 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
> }
>
> /*
> - * Increment chain count for the bio. Make sure the CHAIN flag update
> - * is visible before the raised count.
> - */
> -static inline void bio_inc_remaining(struct bio *bio)
> -{
> - bio->bi_flags |= (1 << BIO_CHAIN);
> - smp_mb__before_atomic();
> - atomic_inc(&bio->__bi_remaining);
> -}
> -
> -/*
> * bio_set is used to allow other portions of the IO system to
> * allocate their own private memory pools for bio and iovec structures.
> * These memory pools in turn all allocate from the bio_slab
> --
> 2.3.2 (Apple Git-55)
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, May 18, 2015 at 03:18:30PM -0400, Mike Snitzer wrote:
> > And besides that generic high level sentiment I think the interface
> > for blkdev_issue_discard_async is simply wrong. Either you want to keep
> > the internals private and just expose a completion callback that gets
> > your private data and an error, or you want to build your own bios as
> > suggested above. But not one that is mostly opaque except for allowing
> > the caller to hook into the submission process and thus taking over I/O
> > completion.
>
> I'll see what I can come up with.
Another option would be to help with reviewing and testing this
series: http://thread.gmane.org/gmane.linux.kernel/1947453
If we can get thast one in the submitter doesn't have to care
about aligning or splitting discard (or any other) bios.