Hi Guys,
It is not a good practice to access bio->bi_vcnt and
bio->bi_io_vec directly from drivers. Also, this kind of direct
access will cause trouble when converting to multipage bvecs.
The 1st patch introduces the following 4 bio helpers, which drivers
can use to avoid direct access to .bi_vcnt and .bi_io_vec:
bio_pages()
bio_is_full()
bio_get_base_vec()
bio_set_vec_table()
Both bio_pages() and bio_is_full() will be easy to convert to
multipage bvecs.
bio_get_base_vec() and bio_set_vec_table() are mostly used
when initializing a new bio or in the single-bvec bio case. With the
two new helpers, it becomes quite easy to audit access to .bi_io_vec
and .bi_vcnt.
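For illustration only (this exact snippet is not taken from any of the
patched drivers, and 'page' is just a placeholder), a single-bvec bio
prepared on the stack would use the helpers roughly like this:

	struct bio bio;
	struct bio_vec bvec, *bv;

	bio_init(&bio);
	/* attach the caller-owned bvec table; no direct write to .bi_io_vec */
	bio_set_vec_table(&bio, &bvec, 1);
	bio_add_page(&bio, page, PAGE_SIZE, 0);

	/* look at the single bvec without indexing .bi_io_vec[0] */
	bv = bio_get_base_vec(&bio);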
Most of the other patches use the 4 helpers to clean up direct
access to .bi_vcnt and .bi_io_vec from drivers, except for MD and btrfs;
those two subsystems will be handled in the future.
bio_add_page() is also used in floppy, dm-crypt and fs/logfs to
avoid direct access to .bi_vcnt & .bi_io_vec.
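The conversion pattern in those places is roughly the following sketch
(page/size are placeholders): the open-coded bvec setup

	bvec.bv_page = page;
	bvec.bv_len = size;
	bvec.bv_offset = 0;
	bio.bi_vcnt = 1;
	bio.bi_iter.bi_size = size;

collapses into a single call:

	bio_add_page(&bio, page, size, 0);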
Thanks,
Ming
Ming Lei (27):
block: bio: introduce 4 helpers for cleanup
block: drbd: use bio_get_base_vec() to retrieve the 1st bvec
block: drbd: remove impossible failure handling
block: loop: use bio_get_base_vec() to retrieve bvec table
block: pktcdvd: use bio_get_base_vec() to retrieve bvec table
block: floppy: use bio_set_vec_table()
block: floppy: use bio_add_page()
staging: lustre: avoid to use bio->bi_vcnt directly
target: use bio_is_full()
bcache: debug: avoid to access .bi_io_vec directly
bcache: io.c: use bio_set_vec_table
bcache: journal.c: use bio_set_vec_table()
bcache: movinggc: use bio_set_vec_table()
bcache: writeback: use bio_set_vec_table()
bcache: super: use bio_set_vec_table()
bcache: super: use bio_get_base_vec
dm: crypt: use bio_add_page()
dm: dm-io.c: use bio_get_base_vec()
dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
dm: dm-bufio.c: use bio_set_vec_table()
fs: logfs: use bio_set_vec_table()
fs: logfs: convert to bio_add_page() in sync_request()
fs: logfs: use bio_add_page() in __bdev_writeseg()
fs: logfs: use bio_add_page() in do_erase()
fs: logfs: remove unnecessary check
kernel/power/swap.c: use bio_get_base_vec()
mm: page_io.c: use bio_get_base_vec()
drivers/block/drbd/drbd_bitmap.c | 4 +-
drivers/block/drbd/drbd_receiver.c | 14 +---
drivers/block/floppy.c | 9 +--
drivers/block/loop.c | 5 +-
drivers/block/pktcdvd.c | 3 +-
drivers/md/bcache/debug.c | 11 ++-
drivers/md/bcache/io.c | 3 +-
drivers/md/bcache/journal.c | 3 +-
drivers/md/bcache/movinggc.c | 6 +-
drivers/md/bcache/super.c | 28 +++++---
drivers/md/bcache/writeback.c | 4 +-
drivers/md/dm-bufio.c | 3 +-
drivers/md/dm-crypt.c | 8 +--
drivers/md/dm-io.c | 7 +-
drivers/md/dm.c | 3 +-
drivers/staging/lustre/lustre/llite/lloop.c | 9 +--
drivers/target/target_core_pscsi.c | 2 +-
fs/logfs/dev_bdev.c | 107 +++++++++++-----------------
include/linux/bio.h | 28 ++++++++
kernel/power/swap.c | 10 ++-
mm/page_io.c | 18 ++++-
21 files changed, 156 insertions(+), 129 deletions(-)
--
1.9.1
Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
firstly it isn't a good practice, secondly it may cause trouble
for converting to multipage bvecs.
So this patch introduces 4 helpers for cleaning up this kind
of usage.
Both bio_pages() and bio_is_full() can be converted to support
multipage bvecs easily.
bio_get_base_vec() and bio_set_vec_table() are mostly
used when initializing a new bio or in the single-bvec
bio case. With the two new helpers, it becomes easy to audit access
to .bi_io_vec and .bi_vcnt.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 88bc64f..2179bc4 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -310,6 +310,34 @@ static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
bio->bi_flags &= ~(1U << bit);
}
+static inline bool bio_is_full(struct bio *bio)
+{
+ WARN_ONCE(bio_flagged(bio, BIO_CLONED), "cloned bio");
+
+ return bio->bi_vcnt >= bio->bi_max_vecs;
+}
+
+static inline struct bio_vec *bio_get_base_vec(struct bio *bio)
+{
+ return __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+}
+
+/* This helper should be used for setting bvec table on a new bio */
+static inline void bio_set_vec_table(struct bio *bio, struct bio_vec *table,
+ unsigned max_vecs)
+{
+ bio->bi_io_vec = table;
+ bio->bi_max_vecs = max_vecs;
+}
+
+/* For singlepage bvecs, one segment includes one page */
+static inline unsigned bio_pages(struct bio *bio)
+{
+ if (!bio_flagged(bio, BIO_CLONED))
+ return bio->bi_vcnt;
+ return bio_segments(bio);
+}
+
static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
{
*bv = bio_iovec(bio);
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/drbd/drbd_bitmap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 92d6fc0..ccbd1e0 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -938,7 +938,9 @@ static void drbd_bm_endio(struct bio *bio)
struct drbd_bm_aio_ctx *ctx = bio->bi_private;
struct drbd_device *device = ctx->device;
struct drbd_bitmap *b = device->bitmap;
- unsigned int idx = bm_page_to_idx(bio->bi_io_vec[0].bv_page);
+ /* single bvec bio */
+ const struct bio_vec *bvec = bio_get_base_vec(bio);
+ unsigned int idx = bm_page_to_idx(bvec->bv_page);
if ((ctx->flags & BM_AIO_COPY_PAGES) == 0 &&
!bm_test_page_unchanged(b->bm_pages[idx]))
--
1.9.1
For a non-cloned bio, bio_add_page() only returns failure when
the io vec table is full, but in that case, bio->bi_vcnt can't
be zero at all.
So remove the impossible failure handling.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/drbd/drbd_receiver.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 050aaa1..1b0ed15 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1465,20 +1465,8 @@ next_bio:
page_chain_for_each(page) {
unsigned len = min_t(unsigned, data_size, PAGE_SIZE);
- if (!bio_add_page(bio, page, len, 0)) {
- /* A single page must always be possible!
- * But in case it fails anyways,
- * we deal with it, and complain (below). */
- if (bio->bi_vcnt == 0) {
- drbd_err(device,
- "bio_add_page failed for len=%u, "
- "bi_vcnt=0 (bi_sector=%llu)\n",
- len, (uint64_t)bio->bi_iter.bi_sector);
- err = -ENOSPC;
- goto fail;
- }
+ if (!bio_add_page(bio, page, len, 0))
goto next_bio;
- }
data_size -= len;
sector += len >> 9;
--nr_pages;
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/loop.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 423f4ca..2a94d3bb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -477,7 +477,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
loff_t pos, bool rw)
{
struct iov_iter iter;
- struct bio_vec *bvec;
+ const struct bio_vec *bvec;
struct bio *bio = cmd->rq->bio;
struct file *file = lo->lo_backing_file;
int ret;
@@ -485,7 +485,8 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
/* nomerge for loop request queue */
WARN_ON(cmd->rq->bio != cmd->rq->biotail);
- bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+ /* passed to iterate_bvec() */
+ bvec = bio_get_base_vec(bio);
iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
bio_segments(bio), blk_rq_bytes(cmd->rq));
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/pktcdvd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index d06c62e..8f37435 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1298,7 +1298,8 @@ try_next_bio:
static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
{
int f;
- struct bio_vec *bvec = pkt->w_bio->bi_io_vec;
+ /* need to fix this usage after multipage bvecs */
+ struct bio_vec *bvec = bio_get_base_vec(pkt->w_bio);
bio_reset(pkt->w_bio);
pkt->w_bio->bi_iter.bi_sector = pkt->sector;
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/floppy.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index b5b0e68..a093de0 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3812,17 +3812,14 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
bio_init(&bio);
bio_set_vec_table(&bio, &bio_vec, 1);
- bio_vec.bv_page = page;
- bio_vec.bv_len = size;
- bio_vec.bv_offset = 0;
- bio.bi_vcnt = 1;
- bio.bi_iter.bi_size = size;
bio.bi_bdev = bdev;
bio.bi_iter.bi_sector = 0;
bio.bi_flags |= (1 << BIO_QUIET);
bio.bi_private = &cbdata;
bio.bi_end_io = floppy_rb0_cb;
+ bio_add_page(&bio, page, size, 0);
+
submit_bio(READ, &bio);
process_fd_request();
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/movinggc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index b929fc9..dbe5af2 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -85,10 +85,10 @@ static void moving_init(struct moving_io *io)
bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
bio->bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
- bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&io->w->key),
- PAGE_SECTORS);
bio->bi_private = &io->cl;
- bio->bi_io_vec = bio->bi_inline_vecs;
+ bio_set_vec_table(bio, bio->bi_inline_vecs,
+ DIV_ROUND_UP(KEY_SIZE(&io->w->key),
+ PAGE_SECTORS));
bch_bio_map(bio, NULL);
}
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/writeback.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index b9346cd..49a8f8a 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -112,9 +112,9 @@ static void dirty_init(struct keybuf_key *w)
bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
bio->bi_iter.bi_size = KEY_SIZE(&w->key) << 9;
- bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
bio->bi_private = w;
- bio->bi_io_vec = bio->bi_inline_vecs;
+ bio_set_vec_table(bio, bio->bi_inline_vecs,
+ DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS));
bch_bio_map(bio, NULL);
}
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/journal.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 29eba72..bf8924f 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -453,8 +453,7 @@ static void do_journal_discard(struct cache *ca)
ca->sb.d[ja->discard_idx]);
bio->bi_bdev = ca->bdev;
bio->bi_rw = REQ_WRITE|REQ_DISCARD;
- bio->bi_max_vecs = 1;
- bio->bi_io_vec = bio->bi_inline_vecs;
+ bio_set_vec_table(bio, bio->bi_inline_vecs, 1);
bio->bi_iter.bi_size = bucket_bytes(ca);
bio->bi_end_io = journal_discard_endio;
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/io.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 86a0bb8..1c48462 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -26,8 +26,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
bio_init(bio);
bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
- bio->bi_max_vecs = bucket_pages(c);
- bio->bi_io_vec = bio->bi_inline_vecs;
+ bio_set_vec_table(bio, bio->bi_inline_vecs, bucket_pages(c));
return bio;
}
--
1.9.1
Use the standard bvec iterator instead to walk the cloned bio.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/debug.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 8b1f1d5..d1ad49d 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -106,8 +106,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
{
char name[BDEVNAME_SIZE];
struct bio *check;
- struct bio_vec bv, *bv2;
- struct bvec_iter iter;
+ struct bio_vec bv, cbv, *bv2;
+ struct bvec_iter iter, citer = { 0 };
int i;
check = bio_clone(bio, GFP_NOIO);
@@ -119,9 +119,13 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
submit_bio_wait(READ_SYNC, check);
+ citer.bi_size = UINT_MAX;
bio_for_each_segment(bv, bio, iter) {
void *p1 = kmap_atomic(bv.bv_page);
- void *p2 = page_address(check->bi_io_vec[iter.bi_idx].bv_page);
+ void *p2;
+
+ cbv = bio_iter_iovec(check, citer);
+ p2 = page_address(cbv.bv_page);
cache_set_err_on(memcmp(p1 + bv.bv_offset,
p2 + bv.bv_offset,
@@ -132,6 +136,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
(uint64_t) bio->bi_iter.bi_sector);
kunmap_atomic(p1);
+ bio_advance_iter(check, &citer, bv.bv_len);
}
bio_for_each_segment_all(bv2, check, i)
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/target/target_core_pscsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/target/target_core_pscsi.c b/drivers/target/target_core_pscsi.c
index de18790..24906c3 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -951,7 +951,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
bio->bi_vcnt, nr_vecs);
- if (bio->bi_vcnt > nr_vecs) {
+ if (bio_is_full(bio)) {
pr_debug("PSCSI: Reached bio->bi_vcnt max:"
" %d i: %d bio: %p, allocating another"
" bio\n", bio->bi_vcnt, i, bio);
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/staging/lustre/lustre/llite/lloop.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index b725fc1..67323db 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -302,19 +302,20 @@ static unsigned int loop_get_bio(struct lloop_device *lo, struct bio **req)
}
/* TODO: need to split the bio, too bad. */
- LASSERT(first->bi_vcnt <= LLOOP_MAX_SEGMENTS);
+ LASSERT(bio_pages(first) <= LLOOP_MAX_SEGMENTS);
rw = first->bi_rw;
bio = &lo->lo_bio;
while (*bio && (*bio)->bi_rw == rw) {
+ unsigned curr_cnt = bio_pages(*bio);
CDEBUG(D_INFO, "bio sector %llu size %u count %u vcnt%u\n",
(unsigned long long)(*bio)->bi_iter.bi_sector,
(*bio)->bi_iter.bi_size,
- page_count, (*bio)->bi_vcnt);
- if (page_count + (*bio)->bi_vcnt > LLOOP_MAX_SEGMENTS)
+ page_count, curr_cnt);
+ if (page_count + curr_cnt > LLOOP_MAX_SEGMENTS)
break;
- page_count += (*bio)->bi_vcnt;
+ page_count += curr_cnt;
count++;
bio = &(*bio)->bi_next;
}
--
1.9.1
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/floppy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 84708a5..b5b0e68 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3811,7 +3811,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
cbdata.drive = drive;
bio_init(&bio);
- bio.bi_io_vec = &bio_vec;
+ bio_set_vec_table(&bio, &bio_vec, 1);
bio_vec.bv_page = page;
bio_vec.bv_len = size;
bio_vec.bv_offset = 0;
--
1.9.1
On Tue, Apr 05, 2016 at 07:56:48PM +0800, Ming Lei wrote:
> For a non-cloned bio, bio_add_page() only returns failure when
> the io vec table is full, but in that case, bio->bi_vcnt can't
> be zero at all.
>
> So remove the impossible failure handling.
Before the immutable bvecs,
we did in fact see this trigger in the wild.
On "strange" deployments.
But for the current implementation of bio_add_page(),
you are correct, this is impossible now.
Ack.
Thanks,
Lars
On Tue, Apr 05, 2016 at 07:56:56PM +0800, Ming Lei wrote:
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index 86a0bb8..1c48462 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -26,8 +26,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
>
> bio_init(bio);
> bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
> - bio->bi_max_vecs = bucket_pages(c);
> - bio->bi_io_vec = bio->bi_inline_vecs;
> + bio_set_vec_table(bio, bio->bi_inline_vecs, bucket_pages(c));
All this bcache code needs to move away from bio_init on a bio
embedded in a driver private structure toward properly using
bio_alloc / bio_alloc_bioset. That will also fix the crash
with bcache over md that Shaohua reported, so I'd suggest to fast
track this part of the series.
On Tue, Apr 05, 2016 at 07:56:53PM +0800, Ming Lei wrote:
> Signed-off-by: Ming Lei <[email protected]>
A bit more of a commit message is always nice :)
Acked-by: Greg Kroah-Hartman <[email protected]>
The lloop driver should be removed entirely - use the loop driver
instead.
On Tue, Apr 05, 2016 at 07:56:54PM +0800, Ming Lei wrote:
> +++ b/drivers/target/target_core_pscsi.c
> @@ -951,7 +951,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
> pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
> bio->bi_vcnt, nr_vecs);
>
> - if (bio->bi_vcnt > nr_vecs) {
> + if (bio_is_full(bio)) {
> pr_debug("PSCSI: Reached bio->bi_vcnt max:"
> " %d i: %d bio: %p, allocating another"
> " bio\n", bio->bi_vcnt, i, bio);
This check should be removed entirely - bio_add_pc_page takes care of
it.
On Tue, Apr 5, 2016 at 8:49 PM, Christoph Hellwig <[email protected]> wrote:
> On Tue, Apr 05, 2016 at 07:56:56PM +0800, Ming Lei wrote:
>> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
>> index 86a0bb8..1c48462 100644
>> --- a/drivers/md/bcache/io.c
>> +++ b/drivers/md/bcache/io.c
>> @@ -26,8 +26,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
>>
>> bio_init(bio);
>> bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
>> - bio->bi_max_vecs = bucket_pages(c);
>> - bio->bi_io_vec = bio->bi_inline_vecs;
>> + bio_set_vec_table(bio, bio->bi_inline_vecs, bucket_pages(c));
>
> All this bcache code needs to move away from bio_init on a bio
> embedded in a driver private structure toward properly using
> bio_alloc / bio_alloc_bioset. That will also fix the crash
> with bcache over md that Shaohua reported, so I'd suggest to fast
> track this part of the series.
I suggest keeping this usage for the following reasons:
- the bio can be embedded into a bigger instance, which is often allocated
dynamically, so one extra allocation for the bio can be avoided.
- we should support arbitrary bio sizes this way; at least bio_add_page()
supports this usage. Also, code gets a lot of simplification with arbitrary
bio size support, such as prio_io() in bcache.
BTW, the root cause of the bcache crash still isn't clear, because
blk_bio_segment_split() should split a big bio into a proper size within
all of the queue's limits. Maybe the max segment limit isn't figured out correctly.
Thanks,
Ming Lei
On Tue, Apr 05, 2016 at 11:24:30PM +0800, Ming Lei wrote:
> - bio can be embedded into one biger instance, which is often allocated
> dynamically, so one extra allocation for bio can be avoided.
We can also do this the other way around with the bios front_pad,
which avoids the caller poking into bio details.
> - we should support arbitrary bio size by this way, at least bio_add_page()
> supports this usage. Also code gets lots of simplication with arbitrary bio
> size support, such as prio_io(): bcache
There is no reason for not supporting huge bios in the core bio code;
in fact, using bio_kmalloc you can already allocate huge bios
dynamically right now. Except that you can't really use it, because the
layers below don't expect that. Bio based drivers expect to be able to
call bio_clone and friends on bios passed to them, and might
also make assumptions about the max number of bio segments for now.
> BTW, the root cause for bcache crash still isn't clear now because
> blk_bio_segment_split() should split big bio into proper size with
> all queue's limits. Maybe the max segment limit isn't figured out correctly.
The root cause is pretty simple: The queue limits matter for request
based drivers, which are the only ones getting bios > BIO_MAX_PAGES
except for the buggy bcache use case. You'll need to either adjust the
limit for all bio based drivers, or get rid of that one magic caller
not playing by the rules.
On Tue, Apr 05, 2016 at 07:56:46PM +0800, Ming Lei wrote:
> Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
> firstly it isn't a good practice, secondly it may cause trouble
> for converting to multipage bvecs.
"not good practice" is OO bullshit snake oil without more justification. We
don't plaster accessors everywhere without an actual reason.
How would it cause trouble with multipage bvecs?
On Tue, Apr 05, 2016 at 05:49:02AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 05, 2016 at 07:56:56PM +0800, Ming Lei wrote:
> > diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> > index 86a0bb8..1c48462 100644
> > --- a/drivers/md/bcache/io.c
> > +++ b/drivers/md/bcache/io.c
> > @@ -26,8 +26,7 @@ struct bio *bch_bbio_alloc(struct cache_set *c)
> >
> > bio_init(bio);
> > bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
> > - bio->bi_max_vecs = bucket_pages(c);
> > - bio->bi_io_vec = bio->bi_inline_vecs;
> > + bio_set_vec_table(bio, bio->bi_inline_vecs, bucket_pages(c));
>
> All this bcache code needs to move away from bio_init on a bio
> embedded in a driver private structure toward properly using
> bio_alloc / bio_alloc_bioset. That will also fix the crash
> with bcache over md that Shaohua reported, so I'd suggest to fast
> track this part of the series.
Why?
bio_init() is a publicly exported function, it's always been one, and bcache is
not the only driver to use it directly.
Bios with > BIO_MAX_PAGES bvecs are a separate issue; I would argue that the bug
is in md's queue_limits: it uses blk_set_stacking_limits(), which sets
max_segments = USHRT_MAX, which is wrong if it's going to clone the biovec.
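For reference, the kind of limit adjustment being pointed at here would
look roughly like the line below; capping at BIO_MAX_PAGES and the exact
call site in md are only assumptions for illustration, not a tested patch:

	/* in md's queue setup, after blk_set_stacking_limits() */
	blk_queue_max_segments(mddev->queue, BIO_MAX_PAGES);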
On Wed, Apr 6, 2016 at 8:18 AM, Kent Overstreet
<[email protected]> wrote:
> On Tue, Apr 05, 2016 at 07:56:46PM +0800, Ming Lei wrote:
>> Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
>> firstly it isn't a good practice, secondly it may cause trouble
>> for converting to multipage bvecs.
>
> "not good practice" is OO bullshit snake oil without more justification. We
> don't plaster accessors everywhere without an actual reason.
>
> How would it cause trouble with multipage bvecs?
Simply speaking, the current drivers may depend on .bi_vcnt for
computing how many pages there are in one bio. After multipage bvecs,
that is not true any more. Isn't that an actual reason?
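The dependence looks roughly like this in the affected drivers (the
foo_* names are invented for illustration):

	static void foo_complete_pages(struct bio *bio)
	{
		int i;

		/* assumes one page per bvec, which only holds for singlepage bvecs */
		for (i = 0; i < bio->bi_vcnt; i++)
			foo_process_page(bio->bi_io_vec[i].bv_page);
	}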
Thanks,
Ming Lei
On Wed, Apr 06, 2016 at 09:34:34AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 8:18 AM, Kent Overstreet
> <[email protected]> wrote:
> > On Tue, Apr 05, 2016 at 07:56:46PM +0800, Ming Lei wrote:
> >> Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
> >> firstly it isn't a good practice, secondly it may cause trouble
> >> for converting to multipage bvecs.
> >
> > "not good practice" is OO bullshit snake oil without more justification. We
> > don't plaster accessors everywhere without an actual reason.
> >
> > How would it cause trouble with multipage bvecs?
>
> Simply speaking, the current drivers may depend on .bi_vcnt for
> computing how many page there are in one bio. After multipage bvecs,
> it is not true any more. Isn't it a actual reason?
But it's completely valid to use bi_vcnt for segments, which is what it's always
_really_ meant anyways.
Sometimes you have cases where the meaning of a member changes significantly
enough that you really don't want code using it accidentally anymore - like with
Jens' patches that changed how bi_remaining and bi_cnt work, but after those
patches it really wasn't correct to use those members directly anymore so he
renamed them to prevent that.
I don't buy that that's the case for multipage bvecs - the meaning of bi_vcnt
itself isn't changing (it's just the number of entries in the array!) and it'll
still be possible for code to correctly use it directly.
Same with bio->bi_io_vec, it's still an array of biovecs, that's not changing.
Your helpers are at the wrong level of abstraction.
Also, there isn't a huge number of bi_vcnt references in the kernel anyways -
the immutable biovec work required removing most of them.
Instead of adding these low level accessors, it'd be better to convert code to
higher level helpers (especially bio_add_page()) where applicable.
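A sketch of that kind of conversion, loosely modelled on the drbd
receiver change earlier in this series (pages/npages are placeholders):

	for (i = 0; i < npages; i++)
		if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
			break;	/* bio is full: submit it and start a new one */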
On Wed, Apr 6, 2016 at 9:46 AM, Kent Overstreet
<[email protected]> wrote:
> On Wed, Apr 06, 2016 at 09:34:34AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 8:18 AM, Kent Overstreet
>> <[email protected]> wrote:
>> > On Tue, Apr 05, 2016 at 07:56:46PM +0800, Ming Lei wrote:
>> >> Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
>> >> firstly it isn't a good practice, secondly it may cause trouble
>> >> for converting to multipage bvecs.
>> >
>> > "not good practice" is OO bullshit snake oil without more justification. We
>> > don't plaster accessors everywhere without an actual reason.
>> >
>> > How would it cause trouble with multipage bvecs?
>>
>> Simply speaking, the current drivers may depend on .bi_vcnt for
>> computing how many page there are in one bio. After multipage bvecs,
>> it is not true any more. Isn't it a actual reason?
>
> But it's completely valid to use bi_vcnt for segments, which is what it's always
> _really_ meant anyways.
Previously drivers may have been confused about segments and pages, so they just
assumed a segment is the same as a page. The situation will change after
multipage bvecs are introduced.
Drivers may loop over .bi_io_vec and .bi_vcnt to access each page
(pktcdvd, staging: lustre, raid, ...).
It isn't practical to fix all these drivers before introducing multipage bvecs.
Meanwhile we can't cause regressions with multipage bvecs. But we can
disable multipage bvecs for some insane drivers if they insist on the
misuse.
With these helpers, it is easy to audit drivers' access to
.bi_vcnt & .bi_io_vec.
>
> Sometimes you have cases where the meaning of a member changes significantly
> enough that you really don't want code using it accidentally anymore - like with
> Jens' patches that changed how bi_remaining and bi_cnt work, but after those
> patches it really wasn't correct to use those members directly anymore so he
> renamed them to prevent that.
>
> I don't buy that that's the case for multipage bvecs - the meaning of bi_vcnt
> itself isn't changing (it's just the number of entries in the array!) and it'll
It depends on the point of view; from a driver's view, they have changed significantly enough.
> still be possible for code to correctly use it directly.
>
> Same with bio->bi_io_vec, it's still an array of biovecs, that's not changing.
> Your helpers are at the wrong level of abstraction.
>
> Also, there isn't a huge number of bi_vcnt references in the kernel anyways -
> the immutable biovec work required removing most of them.
After this patch is applied, only btrfs and md are left with these references.
For btrfs, we still need to audit each usage and try to clean them up.
For md, we can't enable multipage bvecs until all these usages
are cleaned up or audited.
>
> Instead of adding these low level accessors, it'd be better to convert code to
> higher level helpers (especially bio_add_page()) where applicable.
Using bio_add_page() is always the better way, but sometimes
.bi_vcnt and .bi_io_vec are used for something other than adding pages to a bio.
Thanks,
Ming Lei
On Wed, Apr 06, 2016 at 10:11:27AM +0800, Ming Lei wrote:
> On Wed, Apr 6, 2016 at 9:46 AM, Kent Overstreet
> <[email protected]> wrote:
> > On Wed, Apr 06, 2016 at 09:34:34AM +0800, Ming Lei wrote:
> >> On Wed, Apr 6, 2016 at 8:18 AM, Kent Overstreet
> >> <[email protected]> wrote:
> >> > On Tue, Apr 05, 2016 at 07:56:46PM +0800, Ming Lei wrote:
> >> >> Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
> >> >> firstly it isn't a good practice, secondly it may cause trouble
> >> >> for converting to multipage bvecs.
> >> >
> >> > "not good practice" is OO bullshit snake oil without more justification. We
> >> > don't plaster accessors everywhere without an actual reason.
> >> >
> >> > How would it cause trouble with multipage bvecs?
> >>
> >> Simply speaking, the current drivers may depend on .bi_vcnt for
> >> computing how many page there are in one bio. After multipage bvecs,
> >> it is not true any more. Isn't it a actual reason?
> >
> > But it's completely valid to use bi_vcnt for segments, which is what it's always
> > _really_ meant anyways.
>
> Previously drivers may be confused with segment and page, so they just thought
> segment is same with page. The situation will change after multipage bvecs
> is introduced.
>
> Drivers may loop over .bi_io_vec and .bi_vcnt for accessing each pages.
> (pktcdvd, staging: lustre, raid,...)
>
> It isn't practical to fix all these drivers before introducing multipage bvecs.
> Meantime we can't cause regressions with multipage bvecs. But we can
> disable multipage bvecs for some insane drivers if they insist on their
> misusing.
No - it is both practical and IMO _required_ to convert those drivers to
bio_for_each_segment() or bio_for_each_page() as appropriate, before multipage
bvecs.
Especially code that needs pages and segments _has_ to be converted before
multipage bvecs.
If you'll recall looking at my various patch series from way back, especially
around immutable biovecs - most of the work was in converting drivers, not the
actual implementation (and I got rid of more bi_io_vec/bi_vcnt uses than you
have left, so honestly there's no excuse for not doing it right).
> With these helpers, it is easy to audit drivers about their access to
> .bi_vcnt & .bi_io_vec.
It's easy to grep for those uses now!
> After this ptach is applied, only btrfs and md are left with these references.
>
> For btrfs, we still need to audit each usage and try to clean them up.
> For md, we can't enable multipage bvecs for them until all these usage
> are cleaned up or audited.
Cleaning up those should be your focus now, not adding these helpers. You don't
need these patches to go in to tell you what needs to be cleaned up; we already
know what has to be done.
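The iterator form of the page loop sketched earlier would look roughly
like this (again with invented foo_* names):

	static void foo_complete_pages(struct bio *bio)
	{
		struct bio_vec bv;
		struct bvec_iter iter;

		/* walk the data via the iterator instead of indexing .bi_io_vec[] */
		bio_for_each_segment(bv, bio, iter)
			foo_process_page(bv.bv_page);
	}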
On Wed, Apr 6, 2016 at 10:21 AM, Kent Overstreet
<[email protected]> wrote:
> On Wed, Apr 06, 2016 at 10:11:27AM +0800, Ming Lei wrote:
>> On Wed, Apr 6, 2016 at 9:46 AM, Kent Overstreet
>> <[email protected]> wrote:
>> > On Wed, Apr 06, 2016 at 09:34:34AM +0800, Ming Lei wrote:
>> >> On Wed, Apr 6, 2016 at 8:18 AM, Kent Overstreet
>> >> <[email protected]> wrote:
>> >> > On Tue, Apr 05, 2016 at 07:56:46PM +0800, Ming Lei wrote:
>> >> >> Some drivers access bio->bi_vcnt and bio->bi_io_vec directly,
>> >> >> firstly it isn't a good practice, secondly it may cause trouble
>> >> >> for converting to multipage bvecs.
>> >> >
>> >> > "not good practice" is OO bullshit snake oil without more justification. We
>> >> > don't plaster accessors everywhere without an actual reason.
>> >> >
>> >> > How would it cause trouble with multipage bvecs?
>> >>
>> >> Simply speaking, the current drivers may depend on .bi_vcnt for
>> >> computing how many page there are in one bio. After multipage bvecs,
>> >> it is not true any more. Isn't it a actual reason?
>> >
>> > But it's completely valid to use bi_vcnt for segments, which is what it's always
>> > _really_ meant anyways.
>>
>> Previously drivers may be confused with segment and page, so they just thought
>> segment is same with page. The situation will change after multipage bvecs
>> is introduced.
>>
>> Drivers may loop over .bi_io_vec and .bi_vcnt for accessing each pages.
>> (pktcdvd, staging: lustre, raid,...)
>>
>> It isn't practical to fix all these drivers before introducing multipage bvecs.
>> Meantime we can't cause regressions with multipage bvecs. But we can
>> disable multipage bvecs for some insane drivers if they insist on their
>> misusing.
>
> No - it is both practical and IMO _required_ to convert those drivers to
> bio_for_each_segment() or bio_for_each_page() as appropriate, before multipage
> bvecs.
>
> Especially code that needs pages and segments _has_ to be converted before
> multipage bvecs.
>
> If you'll recall looking at my various patch series from way back, especially
> around immutable biovecs - most of the work was in converting drivers, not the
> actual implementation (and I got rid of a more bi_io_vec/bi_vcnt uses than you
> have left, so honestly there's no excuse for not doing it right).
It looks like your style for a new feature is the following:
- convert all drivers to the new interface
- convert core code to the new feature and enable it
My style is:
- if a driver is easy to convert, switch it to the new interface; otherwise just
leave it alone without using the new feature
- convert core code to the new feature and enable it
I don't want to discuss which way is better.
But my way introduces as few driver changes as possible, and
I try to avoid regressions because I don't want to change code heavily
without detailed testing.
That is why the changes to drivers in this patchset are
so small.
Thanks,
>
>> With these helpers, it is easy to audit drivers about their access to
>> .bi_vcnt & .bi_io_vec.
>
> It's easy to grep for those uses now!
>
>> After this ptach is applied, only btrfs and md are left with these references.
>>
>> For btrfs, we still need to audit each usage and try to clean them up.
>> For md, we can't enable multipage bvecs for them until all these usage
>> are cleaned up or audited.
>
> Cleaning up those should be your focus now, not adding these helpers. You don't
> need these patches to go in to tell you what needs to be cleaned up, we already
> know wha thas to be done.
--
Ming Lei
On Tue, Apr 5, 2016 at 9:02 PM, Christoph Hellwig <[email protected]> wrote:
> On Tue, Apr 05, 2016 at 07:56:54PM +0800, Ming Lei wrote:
>> +++ b/drivers/target/target_core_pscsi.c
>> @@ -951,7 +951,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
>> pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
>> bio->bi_vcnt, nr_vecs);
>>
>> - if (bio->bi_vcnt > nr_vecs) {
>> + if (bio_is_full(bio)) {
>> pr_debug("PSCSI: Reached bio->bi_vcnt max:"
>> " %d i: %d bio: %p, allocating another"
>> " bio\n", bio->bi_vcnt, i, bio);
>
> This check should be removed entirely - bio_add_pc_page takes care of
> it.
OK.
--
Ming Lei
> The lloop driver should be removed entirely - use the loop driver
> instead.
I talked with Andreas last week at our annual Lustre users group meeting
about this. The reason I was told for its existence is that some users were
using files on a Lustre file system with the loopback device. The
performance was really bad at the time, so lloop was developed to
overcome those limitations. It's been a long time, so perhaps it's time
to look at the default loop driver again to see if it can perform now. If
it doesn't, we will go the route of reworking the lloop driver in the
spirit of the cryptoloop device.
On Sun, Apr 10, 2016 at 03:37:42PM +0100, James Simmons wrote:
>
> > The lloop driver should be removed entirely - use the loop driver
> > instead.
>
> I talked with Andreas last week at our annual Lustre users group meeting
> about this. The reason I was told for existance is that some users were
> using files on a Lustre file system with the loop back device. The
> performance was really bad at the time so a lloop was developed to
> overcome those limitations. Its been a long time so perhaps its time
> to look at the default loop driver again to see if can perform now. If
> it doesn't we will go the route of reworking the lloop driver in the
> spirit of the cryptoloop device.
The loop driver now supports using AIO/DIO on any file system that
implements ->read_iter and ->write_iter. If lustre doesn't support
those or doesn't have proper performance using them, it should be
addressed in the file system.
Note that the dio mode in the loop device is not the default and you
need to manually enable it; keep that in mind when testing.
> On Sun, Apr 10, 2016 at 03:37:42PM +0100, James Simmons wrote:
> >
> > > The lloop driver should be removed entirely - use the loop driver
> > > instead.
> >
> > I talked with Andreas last week at our annual Lustre users group meeting
> > about this. The reason I was told for existance is that some users were
> > using files on a Lustre file system with the loop back device. The
> > performance was really bad at the time so a lloop was developed to
> > overcome those limitations. Its been a long time so perhaps its time
> > to look at the default loop driver again to see if can perform now. If
> > it doesn't we will go the route of reworking the lloop driver in the
> > spirit of the cryptoloop device.
>
> The loop driver now supports using AIO/DIO on any file systems that
> implements ->read_iter and ->write_iter. If lustre doesn't support
> those or doesn't have proper performance using them it should be
> addressed in the file system.
>
> Note that the dio mode in the loop device is not the default and you
> need to manually enabled it, keep that in mind when testing.
This is excellent news. The only sad thing is that most lustre users
are running distros that use kernels before the AIO/DIO enhancements
landed :-( We will have to keep a copy around for those guys. But
first I need to test the performance of the loopback driver this
week before this can be dropped.
On Mon, Apr 11, 2016 at 12:02 AM, James Simmons <[email protected]> wrote:
>
>> On Sun, Apr 10, 2016 at 03:37:42PM +0100, James Simmons wrote:
>> >
>> > > The lloop driver should be removed entirely - use the loop driver
>> > > instead.
>> >
>> > I talked with Andreas last week at our annual Lustre users group meeting
>> > about this. The reason I was told for existance is that some users were
>> > using files on a Lustre file system with the loop back device. The
>> > performance was really bad at the time so a lloop was developed to
>> > overcome those limitations. Its been a long time so perhaps its time
>> > to look at the default loop driver again to see if can perform now. If
>> > it doesn't we will go the route of reworking the lloop driver in the
>> > spirit of the cryptoloop device.
>>
>> The loop driver now supports using AIO/DIO on any file systems that
>> implements ->read_iter and ->write_iter. If lustre doesn't support
>> those or doesn't have proper performance using them it should be
>> addressed in the file system.
>>
>> Note that the dio mode in the loop device is not the default and you
>> need to manually enabled it, keep that in mind when testing.
>
> This is excellent news. The only sad thing is that most lustre users
> are running distros that use kernels before the AIO/DIO enhancements
> were landed :-( We will have to keep a copy around for those guys. But
> first I need to test the performance of the loop back driver this
> week before this can be dropped.
Considering that this cleanup patch for the lustre loop driver is quite simple
and straightforward, I suggest keeping it as-is and doing the removal in
another patchset. Christoph, are you OK with that?
Thanks,
Ming Lei