Hi,
This patchset brings multi-page bvec into the block layer:
1) what is multi-page bvec?
A multi-page bvec means that one 'struct bio_vec' can hold multiple
physically contiguous pages, instead of the single page the Linux
kernel has used for a long time.
2) why is multi-page bvec introduced?
Kent proposed the idea[1] first.
As system RAM becomes much bigger than before, and huge pages, transparent
huge pages and memory compaction are widely used, it is now quite common
to see physically contiguous pages coming from filesystems in I/O. On the
other hand, from the block layer's view, it isn't necessary to store the
intermediate pages in the bvec table; it is enough to store just the
physically contiguous 'segment' in each io vector.
Also, huge pages are being brought to filesystems and swap [2][6], so we
can do I/O on a huge page each time [3], which requires that one bio be
able to transfer at least one huge page at a time. It turns out that
simply changing BIO_MAX_PAGES isn't flexible [3][5]; multi-page bvec fits
this case very well. As we saw, if CONFIG_THP_SWAP is enabled,
BIO_MAX_PAGES can be configured much bigger, such as 512, which requires
at least two 4K pages just to hold the bvec table.
With multi-page bvec:
- Inside the block layer, both bio splitting and sg mapping become more
efficient than before, by traversing the physically contiguous
'segment' instead of each page.
- Segment handling in the block layer can be improved further in the
future, since converting a multi-page bvec into a segment is
straightforward. For example, we might store the segment in each bvec
directly.
- bio size can be increased, which should improve some high-bandwidth
I/O cases in theory [4].
- There is an opportunity to improve the memory footprint of bvecs in
the future.
3) how is multi-page bvec implemented in this patchset?
Patch 1 fixes one issue of misusing bio->bi_vcnt.
Patches 2 ~ 15 implement multipage bvec in block layer:
- put all tricks into the bvec/bio/rq iterators; as long as
drivers and filesystems use these standard iterators, they work
fine with multi-page bvec
- introduce bio_for_each_bvec() to iterate over multi-page bvecs for
splitting bios and mapping sg
- keep the current bio_for_each_segment*() to iterate over single-page
bvecs and make sure current users won't be broken; especially, convert
to the new helper prototype in the single patch 21, given it is basically
a mechanical conversion
- deal with iomap & xfs's sub-pagesize io vec in patch 13
- enable multi-page bvec in patch 14
Patch 16 redefines BIO_MAX_PAGES as 256.
Patch 17 documents usages of bio iterator helpers.
Patches 18~19 kill NO_SG_MERGE.
These patches can be found in the following git tree:
git: https://github.com/ming1/linux.git for-4.21-block-mp-bvec-V11
Lots of tests (blktests, xfstests, LTP I/O, ...) have been run with this
patchset, and no regressions were seen.
Thanks to Christoph for reviewing the early versions and providing very
good suggestions, such as introducing bio_init_with_vec_table(),
removing other unnecessary helpers, and so on.
Thanks to Christoph and Omar for reviewing V10 and providing lots of
helpful comments.
Any comments are welcome!
V11:
- address most of the review comments from Omar and Christoph
- rename mp_bvec_* as segment_* helpers
- remove 'mp' parameter from bvec_iter_advance() and related helpers
- cleanup patch on bvec_split_segs() and blk_bio_segment_split(),
remove unnecessary checks
- simplify bvec_last_segment()
- drop bio_pages_all()
- introduce dedicated functions/file for handling non-cluster bios, to
avoid checking the queue's cluster flag before adding a page to a bio
- introduce bio_try_merge_segment() for simplifying iomap/xfs page
accounting code
- fix the documentation change
V10:
- no code changes; just add more people and lists to the patches' CC
lists, as suggested by Christoph and Dave Chinner
V9:
- fix regression on iomap's sub-pagesize io vec, covered by patch 13
V8:
- remove preparation patches, which have all been merged into Linus' tree
- rebase on for-4.21/block
- address comments on V7
- add patches of killing NO_SG_MERGE
V7:
- include Christoph and Mike's bio_clone_bioset() patches, which are
actually preparation patches for multi-page bvec
- address Christoph's comments
V6:
- avoid introducing lots of renaming; follow Jens' suggestion of
using the name 'chunk' for the multi-page io vector
- include Christoph's three preparation patches
- decrease stack usage when using bio_for_each_chunk_segment_all()
- address Kent's comment
V5:
- remove some preparation patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio cloning, an issue
introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18
V4:
- rename bio_for_each_segment*() to bio_for_each_page*(), rename
bio_segments() to bio_pages(), and rename rq_for_each_segment() to
rq_for_each_pages(), because these helpers never return a real
segment; they always return a single-page bvec
- introduce segment_for_each_page_all()
- introduce new bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
that return real multi-page segments
- rewrite segment_last_page()
- rename bvec iterator helper as suggested by Christoph
- replace comment with applying bio helpers as suggested by Christoph
- document usage of bio iterator helpers
- redefine BIO_MAX_PAGES as 256 so that the biggest bvec table
fits in one 4K page
- move bio_alloc_pages() into bcache as suggested by Christoph
V3:
- rebase on v4.13-rc3 with for-next of block tree
- run more xfstests: xfs/ext4 over NVMe, SATA, DM(linear),
MD(raid1), with no regressions triggered
- add Reviewed-by on some btrfs patches
- remove two MD patches, because both have already been merged into
Linus' tree
V2:
- direct access to the bvec table in raid has been cleaned up, so the
NO_MP flag is dropped
- rebase on Neil Brown's recent changes to the bio and bounce code
- reorganize the patchset
V1:
- rebase against v4.10-rc1; some cleanups from V0 are in -linus already
- handle queue_virt_boundary() in mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename the two new helpers of bio_for_each_segment_all()
- fix the bounce conversion
- address comments in V0
[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=147735447100001&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2
[5], http://marc.info/?t=149569484500007&r=1&w=2
[6], http://marc.info/?t=149820215300004&r=1&w=2
Ming Lei (19):
block: don't use bio->bi_vcnt to figure out segment number
block: introduce multi-page bvec helpers
block: introduce bio_for_each_bvec()
block: use bio_for_each_bvec() to compute multi-page bvec count
block: use bio_for_each_bvec() to map sg
block: introduce bvec_last_segment()
fs/buffer.c: use bvec iterator to truncate the bio
btrfs: use bvec_last_segment to get bio's last page
btrfs: move bio_pages_all() to btrfs
block: loop: pass multi-page bvec to iov_iter
bcache: avoid to use bio_for_each_segment_all() in
bch_bio_alloc_pages()
block: allow bio_for_each_segment_all() to iterate over multi-page
bvec
block: move bounce_clone_bio into bio.c
block: handle non-cluster bio out of blk_bio_segment_split
block: enable multipage bvecs
block: always define BIO_MAX_PAGES as 256
block: document usage of bio iterator helpers
block: kill QUEUE_FLAG_NO_SG_MERGE
block: kill BLK_MQ_F_SG_MERGE
Documentation/block/biovecs.txt | 24 +++++
block/Makefile | 3 +-
block/bio.c | 128 +++++++++++++++++++++---
block/blk-merge.c | 198 ++++++++++++++++++++++++--------------
block/blk-mq-debugfs.c | 2 -
block/blk-mq.c | 3 -
block/blk.h | 4 +
block/bounce.c | 76 +--------------
block/non-cluster.c | 70 ++++++++++++++
drivers/block/loop.c | 22 ++---
drivers/block/nbd.c | 2 +-
drivers/block/rbd.c | 2 +-
drivers/block/skd_main.c | 1 -
drivers/block/xen-blkfront.c | 2 +-
drivers/md/bcache/btree.c | 3 +-
drivers/md/bcache/util.c | 6 +-
drivers/md/dm-crypt.c | 3 +-
drivers/md/dm-rq.c | 2 +-
drivers/md/dm-table.c | 13 ---
drivers/md/raid1.c | 3 +-
drivers/mmc/core/queue.c | 3 +-
drivers/scsi/scsi_lib.c | 2 +-
drivers/staging/erofs/data.c | 3 +-
drivers/staging/erofs/unzip_vle.c | 3 +-
fs/block_dev.c | 6 +-
fs/btrfs/compression.c | 3 +-
fs/btrfs/disk-io.c | 3 +-
fs/btrfs/extent_io.c | 29 ++++--
fs/btrfs/inode.c | 6 +-
fs/btrfs/raid56.c | 3 +-
fs/buffer.c | 5 +-
fs/crypto/bio.c | 3 +-
fs/direct-io.c | 4 +-
fs/exofs/ore.c | 3 +-
fs/exofs/ore_raid.c | 3 +-
fs/ext4/page-io.c | 3 +-
fs/ext4/readpage.c | 3 +-
fs/f2fs/data.c | 9 +-
fs/gfs2/lops.c | 6 +-
fs/gfs2/meta_io.c | 3 +-
fs/iomap.c | 8 +-
fs/mpage.c | 3 +-
fs/xfs/xfs_aops.c | 7 +-
include/linux/bio.h | 50 ++++++----
include/linux/blk-mq.h | 1 -
include/linux/blkdev.h | 5 +-
include/linux/bvec.h | 115 ++++++++++++++++++++--
47 files changed, 595 insertions(+), 264 deletions(-)
create mode 100644 block/non-cluster.c
--
2.9.5
This helper is used by the bio split & merge code to iterate over
multi-page bvecs.
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 25 ++++++++++++++++++++++---
include/linux/bvec.h | 36 +++++++++++++++++++++++++++++-------
2 files changed, 51 insertions(+), 10 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 056fb627edb3..7560209d6a8a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -76,6 +76,9 @@
#define bio_data_dir(bio) \
(op_is_write(bio_op(bio)) ? WRITE : READ)
+#define bio_iter_mp_iovec(bio, iter) \
+ segment_iter_bvec((bio)->bi_io_vec, (iter))
+
/*
* Check whether this bio carries any data or not. A NULL bio is allowed.
*/
@@ -135,18 +138,24 @@ static inline bool bio_full(struct bio *bio)
#define bio_for_each_segment_all(bvl, bio, i) \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes)
+static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes, unsigned max_seg_len)
{
iter->bi_sector += bytes >> 9;
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
- bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+ __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
/* TODO: It is reasonable to complete bio with error here. */
}
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes)
+{
+ __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
+}
+
#define __bio_for_each_segment(bvl, bio, iter, start) \
for (iter = (start); \
(iter).bi_size && \
@@ -156,6 +165,16 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
#define bio_for_each_segment(bvl, bio, iter) \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
+#define __bio_for_each_bvec(bvl, bio, iter, start) \
+ for (iter = (start); \
+ (iter).bi_size && \
+ ((bvl = bio_iter_mp_iovec((bio), (iter))), 1); \
+ __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
+
+/* returns one real segment(multi-page bvec) each time */
+#define bio_for_each_bvec(bvl, bio, iter) \
+ __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ed90bbf4c9c9..b279218c5c4d 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -25,6 +25,8 @@
#include <linux/errno.h>
#include <linux/mm.h>
+#define BVEC_MAX_LEN ((unsigned int)-1)
+
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
@@ -87,8 +89,15 @@ struct bvec_iter {
.bv_offset = bvec_iter_offset((bvec), (iter)), \
})
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
- struct bvec_iter *iter, unsigned bytes)
+#define segment_iter_bvec(bvec, iter) \
+((struct bio_vec) { \
+ .bv_page = segment_iter_page((bvec), (iter)), \
+ .bv_len = segment_iter_len((bvec), (iter)), \
+ .bv_offset = segment_iter_offset((bvec), (iter)), \
+})
+
+static inline bool __bvec_iter_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
{
if (WARN_ONCE(bytes > iter->bi_size,
"Attempted to advance past end of bvec iter\n")) {
@@ -97,12 +106,18 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
}
while (bytes) {
- unsigned iter_len = bvec_iter_len(bv, *iter);
- unsigned len = min(bytes, iter_len);
+ unsigned segment_len = segment_iter_len(bv, *iter);
- bytes -= len;
- iter->bi_size -= len;
- iter->bi_bvec_done += len;
+ if (max_seg_len < BVEC_MAX_LEN)
+ segment_len = min_t(unsigned, segment_len,
+ max_seg_len -
+ bvec_iter_offset(bv, *iter));
+
+ segment_len = min(bytes, segment_len);
+
+ bytes -= segment_len;
+ iter->bi_size -= segment_len;
+ iter->bi_bvec_done += segment_len;
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -136,6 +151,13 @@ static inline bool bvec_iter_rewind(const struct bio_vec *bv,
return true;
}
+static inline bool bvec_iter_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter,
+ unsigned bytes)
+{
+ return __bvec_iter_advance(bv, iter, bytes, PAGE_SIZE);
+}
+
#define for_each_bvec(bvl, bio_vec, iter, start) \
for (iter = (start); \
(iter).bi_size && \
--
2.9.5
Once multi-page bvec is enabled, the last bvec may include more than one
page; this patch uses bvec_last_segment() to truncate the bio.
Reviewed-by: Omar Sandoval <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
fs/buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b95498..fa37ad52e962 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
- zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+ struct bio_vec bv;
+
+ bvec_last_segment(bvec, &bv);
+ zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
}
--
2.9.5
BTRFS and guard_bio_eod() need to get the last single-page segment
from a multi-page bvec, so introduce this helper to make them happy.
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bvec.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index b279218c5c4d..b37d13a79a7d 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -173,4 +173,26 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
.bi_bvec_done = 0, \
}
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void bvec_last_segment(const struct bio_vec *bvec,
+ struct bio_vec *seg)
+{
+ unsigned total = bvec->bv_offset + bvec->bv_len;
+ unsigned last_page = (total - 1) / PAGE_SIZE;
+
+ seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+ /* the whole segment is inside the last page */
+ if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+ seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+ seg->bv_len = bvec->bv_len;
+ } else {
+ seg->bv_offset = 0;
+ seg->bv_len = total - last_page * PAGE_SIZE;
+ }
+}
+
#endif /* __LINUX_BVEC_ITER_H */
--
2.9.5
Prepare for supporting multi-page bvecs.
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
fs/btrfs/extent_io.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d228f706ff3e..5d5965297e7e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2720,11 +2720,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
{
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
- struct page *page = bvec->bv_page;
+ struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
- start = page_offset(page) + bvec->bv_offset;
+ bvec_last_segment(bvec, &bv);
+ start = page_offset(bv.bv_page) + bv.bv_offset;
bio->bi_private = NULL;
--
2.9.5
BTRFS is the only user of this helper, so move it into BTRFS and
implement it via bio_for_each_segment_all(), since bio->bi_vcnt may
not equal the number of pages once multi-page bvec is enabled.
Signed-off-by: Ming Lei <[email protected]>
---
fs/btrfs/extent_io.c | 14 +++++++++++++-
include/linux/bio.h | 6 ------
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5d5965297e7e..874bb9aeebdc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,18 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
return bio;
}
+static unsigned btrfs_bio_pages_all(struct bio *bio)
+{
+ unsigned i;
+ struct bio_vec *bv;
+
+ WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+
+ bio_for_each_segment_all(bv, bio, i)
+ ;
+ return i;
+}
+
/*
* this is a generic handler for readpage errors (default
* readpage_io_failed_hook). if other copies exist, read those and write back
@@ -2368,7 +2380,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
- unsigned failed_bio_pages = bio_pages_all(failed_bio);
+ unsigned failed_bio_pages = btrfs_bio_pages_all(failed_bio);
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7560209d6a8a..9d6284f53c07 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -282,12 +282,6 @@ static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
}
-static inline unsigned bio_pages_all(struct bio *bio)
-{
- WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
- return bio->bi_vcnt;
-}
-
static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
{
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
--
2.9.5
iov_iter is implemented on top of the bvec iterator helpers, so it is
safe to pass multi-page bvecs to it, and this is much more efficient
than passing one page per bvec.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/loop.c | 20 ++++++++++----------
include/linux/blkdev.h | 4 ++++
2 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 176ab1f28eca..e3683211f12d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -510,21 +510,22 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
loff_t pos, bool rw)
{
struct iov_iter iter;
+ struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+ struct bio_vec tmp;
unsigned int offset;
- int segments = 0;
+ int nr_bvec = 0;
int ret;
+ rq_for_each_bvec(tmp, rq, rq_iter)
+ nr_bvec++;
+
if (rq->bio != rq->biotail) {
- struct req_iterator iter;
- struct bio_vec tmp;
- __rq_for_each_bio(bio, rq)
- segments += bio_segments(bio);
- bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+ bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
GFP_NOIO);
if (!bvec)
return -EIO;
@@ -533,10 +534,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
/*
* The bios of the request may be started from the middle of
* the 'bvec' because of bio splitting, so we can't directly
- * copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+ * copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
* API will take care of all details for us.
*/
- rq_for_each_segment(tmp, rq, iter) {
+ rq_for_each_bvec(tmp, rq, rq_iter) {
*bvec = tmp;
bvec++;
}
@@ -550,11 +551,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
*/
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
- segments = bio_segments(bio);
}
atomic_set(&cmd->ref, 2);
- iov_iter_bvec(&iter, rw, bvec, segments, blk_rq_bytes(rq));
+ iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
cmd->iocb.ki_pos = pos;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ad6eafc43f2..a281b6737b61 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -805,6 +805,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq) \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
+#define rq_for_each_bvec(bvl, _rq, _iter) \
+ __rq_for_each_bio(_iter.bio, _rq) \
+ bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
+
#define rq_iter_last(bvec, _iter) \
(_iter.bio->bi_next == NULL && \
bio_iter_last(bvec, _iter.iter))
--
2.9.5
We will reuse bounce_clone_bio() for cloning bios in the
!blk_queue_cluster(q) case, so move this helper into bio.c and
rename it to bio_clone_bioset().
No functional change.
Signed-off-by: Ming Lei <[email protected]>
---
block/bio.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
block/blk.h | 2 ++
block/bounce.c | 70 +---------------------------------------------------------
3 files changed, 72 insertions(+), 69 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 2680aa42a625..0f1635b9ec50 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -647,6 +647,75 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);
+/* block core only helper */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs)
+{
+ struct bvec_iter iter;
+ struct bio_vec bv;
+ struct bio *bio;
+
+ /*
+ * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
+ * bio_src->bi_io_vec to bio->bi_io_vec.
+ *
+ * We can't do that anymore, because:
+ *
+ * - The point of cloning the biovec is to produce a bio with a biovec
+ * the caller can modify: bi_idx and bi_bvec_done should be 0.
+ *
+ * - The original bio could've had more than BIO_MAX_PAGES biovecs; if
+ * we tried to clone the whole thing bio_alloc_bioset() would fail.
+ * But the clone should succeed as long as the number of biovecs we
+ * actually need to allocate is fewer than BIO_MAX_PAGES.
+ *
+ * - Lastly, bi_vcnt should not be looked at or relied upon by code
+ * that does not own the bio - reason being drivers don't use it for
+ * iterating over the biovec anymore, so expecting it to be kept up
+ * to date (i.e. for clones that share the parent biovec) is just
+ * asking for trouble and would force extra work on
+ * __bio_clone_fast() anyways.
+ */
+
+ bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
+ if (!bio)
+ return NULL;
+ bio->bi_disk = bio_src->bi_disk;
+ bio->bi_opf = bio_src->bi_opf;
+ bio->bi_ioprio = bio_src->bi_ioprio;
+ bio->bi_write_hint = bio_src->bi_write_hint;
+ bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
+ bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
+
+ switch (bio_op(bio)) {
+ case REQ_OP_DISCARD:
+ case REQ_OP_SECURE_ERASE:
+ case REQ_OP_WRITE_ZEROES:
+ break;
+ case REQ_OP_WRITE_SAME:
+ bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
+ break;
+ default:
+ bio_for_each_segment(bv, bio_src, iter)
+ bio->bi_io_vec[bio->bi_vcnt++] = bv;
+ break;
+ }
+
+ if (bio_integrity(bio_src)) {
+ int ret;
+
+ ret = bio_integrity_clone(bio, bio_src, gfp_mask);
+ if (ret < 0) {
+ bio_put(bio);
+ return NULL;
+ }
+ }
+
+ bio_clone_blkcg_association(bio, bio_src);
+
+ return bio;
+}
+
/**
* bio_add_pc_page - attempt to add page to bio
* @q: the target queue
diff --git a/block/blk.h b/block/blk.h
index 816a9abb87cd..31c0e45aba3a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -336,6 +336,8 @@ static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask, struct bio_set *bs);
+
#ifdef CONFIG_BLK_DEV_ZONED
void blk_queue_free_zone_bitmaps(struct request_queue *q);
#else
diff --git a/block/bounce.c b/block/bounce.c
index 7338041e3042..4947c36173b2 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -215,74 +215,6 @@ static void bounce_end_io_read_isa(struct bio *bio)
__bounce_end_io_read(bio, &isa_page_pool);
}
-static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
- struct bio_set *bs)
-{
- struct bvec_iter iter;
- struct bio_vec bv;
- struct bio *bio;
-
- /*
- * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
- * bio_src->bi_io_vec to bio->bi_io_vec.
- *
- * We can't do that anymore, because:
- *
- * - The point of cloning the biovec is to produce a bio with a biovec
- * the caller can modify: bi_idx and bi_bvec_done should be 0.
- *
- * - The original bio could've had more than BIO_MAX_PAGES biovecs; if
- * we tried to clone the whole thing bio_alloc_bioset() would fail.
- * But the clone should succeed as long as the number of biovecs we
- * actually need to allocate is fewer than BIO_MAX_PAGES.
- *
- * - Lastly, bi_vcnt should not be looked at or relied upon by code
- * that does not own the bio - reason being drivers don't use it for
- * iterating over the biovec anymore, so expecting it to be kept up
- * to date (i.e. for clones that share the parent biovec) is just
- * asking for trouble and would force extra work on
- * __bio_clone_fast() anyways.
- */
-
- bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
- if (!bio)
- return NULL;
- bio->bi_disk = bio_src->bi_disk;
- bio->bi_opf = bio_src->bi_opf;
- bio->bi_ioprio = bio_src->bi_ioprio;
- bio->bi_write_hint = bio_src->bi_write_hint;
- bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
- bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
-
- switch (bio_op(bio)) {
- case REQ_OP_DISCARD:
- case REQ_OP_SECURE_ERASE:
- case REQ_OP_WRITE_ZEROES:
- break;
- case REQ_OP_WRITE_SAME:
- bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
- break;
- default:
- bio_for_each_segment(bv, bio_src, iter)
- bio->bi_io_vec[bio->bi_vcnt++] = bv;
- break;
- }
-
- if (bio_integrity(bio_src)) {
- int ret;
-
- ret = bio_integrity_clone(bio, bio_src, gfp_mask);
- if (ret < 0) {
- bio_put(bio);
- return NULL;
- }
- }
-
- bio_clone_blkcg_association(bio, bio_src);
-
- return bio;
-}
-
static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
mempool_t *pool)
{
@@ -311,7 +243,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
generic_make_request(*bio_orig);
*bio_orig = bio;
}
- bio = bounce_clone_bio(*bio_orig, GFP_NOIO, passthrough ? NULL :
+ bio = bio_clone_bioset(*bio_orig, GFP_NOIO, passthrough ? NULL :
&bounce_bio_set);
bio_for_each_segment_all(to, bio, i, iter_all) {
--
2.9.5
We will enable multi-page bvec soon, but non-cluster queues can't
handle multi-page bvecs at all. This patch borrows bounce's idea of
cloning a new single-page bio for non-cluster queues, and moves their
handling out of blk_bio_segment_split().
Signed-off-by: Ming Lei <[email protected]>
---
block/Makefile | 3 ++-
block/blk-merge.c | 6 ++++-
block/blk.h | 2 ++
block/non-cluster.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 79 insertions(+), 2 deletions(-)
create mode 100644 block/non-cluster.c
diff --git a/block/Makefile b/block/Makefile
index eee1b4ceecf9..e07d59438c4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o partition-generic.o ioprio.o \
- badblocks.o partitions/ blk-rq-qos.o
+ badblocks.o partitions/ blk-rq-qos.o \
+ non-cluster.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8829c51b4e75..7c44216c1b58 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -247,7 +247,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
goto split;
}
- if (bvprvp && blk_queue_cluster(q)) {
+ if (bvprvp) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprvp, &bv))
@@ -307,6 +307,10 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
split = blk_bio_write_same_split(q, *bio, &q->bio_split, &nsegs);
break;
default:
+ if (!blk_queue_cluster(q)) {
+ blk_queue_non_cluster_bio(q, bio);
+ return;
+ }
split = blk_bio_segment_split(q, *bio, &q->bio_split, &nsegs);
break;
}
diff --git a/block/blk.h b/block/blk.h
index 31c0e45aba3a..6fc5821ced55 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -338,6 +338,8 @@ struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask, struct bio_set *bs);
+void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig);
+
#ifdef CONFIG_BLK_DEV_ZONED
void blk_queue_free_zone_bitmaps(struct request_queue *q);
#else
diff --git a/block/non-cluster.c b/block/non-cluster.c
new file mode 100644
index 000000000000..9c2910be9404
--- /dev/null
+++ b/block/non-cluster.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* non-cluster handling for block devices */
+
+#include <linux/kernel.h>
+#include <linux/export.h>
+#include <linux/swap.h>
+#include <linux/gfp.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+
+#include "blk.h"
+
+static struct bio_set non_cluster_bio_set, non_cluster_bio_split;
+
+static __init int init_non_cluster_bioset(void)
+{
+ WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
+ BIOSET_NEED_BVECS));
+ WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
+ WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
+
+ return 0;
+}
+__initcall(init_non_cluster_bioset);
+
+static void non_cluster_end_io(struct bio *bio)
+{
+ struct bio *bio_orig = bio->bi_private;
+
+ bio_orig->bi_status = bio->bi_status;
+ bio_endio(bio_orig);
+ bio_put(bio);
+}
+
+void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig)
+{
+ struct bio *bio;
+ struct bvec_iter iter;
+ struct bio_vec from;
+ unsigned i = 0;
+ unsigned sectors = 0;
+ unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES,
+ queue_max_segments(q));
+
+ bio_for_each_segment(from, *bio_orig, iter) {
+ if (i++ < max_segs)
+ sectors += from.bv_len >> 9;
+ else
+ break;
+ }
+
+ if (sectors < bio_sectors(*bio_orig)) {
+ bio = bio_split(*bio_orig, sectors, GFP_NOIO,
+ &non_cluster_bio_split);
+ bio_chain(bio, *bio_orig);
+ generic_make_request(*bio_orig);
+ *bio_orig = bio;
+ }
+ bio = bio_clone_bioset(*bio_orig, GFP_NOIO, &non_cluster_bio_set);
+
+ bio->bi_phys_segments = bio_segments(bio);
+ bio_set_flag(bio, BIO_SEG_VALID);
+ bio->bi_end_io = non_cluster_end_io;
+
+ bio->bi_private = *bio_orig;
+ *bio_orig = bio;
+}
--
2.9.5
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.
CONFIG_THP_SWAP needs to split one THP into normal pages and add
them all to one bio. With multi-page bvec, it takes just one bvec to
hold them all.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 8 --------
1 file changed, 8 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7edad188568a..e5b975fa0558 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
#define BIO_BUG_ON
#endif
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES HPAGE_PMD_NR
-#else
#define BIO_MAX_PAGES 256
-#endif
-#else
-#define BIO_MAX_PAGES 256
-#endif
#define bio_prio(bio) (bio)->bi_ioprio
#define bio_set_prio(bio, prio) ((bio)->bi_ioprio = prio)
--
2.9.5
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
so that bio_for_each_segment_all() can iterate over multi-page bvecs.
Given it is just a mechanical & simple change for all bio_for_each_segment_all()
users, this patch does the tree-wide change in one single patch, so that we can
avoid using a temporary helper for the conversion.
Signed-off-by: Ming Lei <[email protected]>
---
block/bio.c | 27 ++++++++++++++++++---------
block/bounce.c | 6 ++++--
drivers/md/bcache/btree.c | 3 ++-
drivers/md/dm-crypt.c | 3 ++-
drivers/md/raid1.c | 3 ++-
drivers/staging/erofs/data.c | 3 ++-
drivers/staging/erofs/unzip_vle.c | 3 ++-
fs/block_dev.c | 6 ++++--
fs/btrfs/compression.c | 3 ++-
fs/btrfs/disk-io.c | 3 ++-
fs/btrfs/extent_io.c | 12 ++++++++----
fs/btrfs/inode.c | 6 ++++--
fs/btrfs/raid56.c | 3 ++-
fs/crypto/bio.c | 3 ++-
fs/direct-io.c | 4 +++-
fs/exofs/ore.c | 3 ++-
fs/exofs/ore_raid.c | 3 ++-
fs/ext4/page-io.c | 3 ++-
fs/ext4/readpage.c | 3 ++-
fs/f2fs/data.c | 9 ++++++---
fs/gfs2/lops.c | 6 ++++--
fs/gfs2/meta_io.c | 3 ++-
fs/iomap.c | 6 ++++--
fs/mpage.c | 3 ++-
fs/xfs/xfs_aops.c | 5 +++--
include/linux/bio.h | 11 +++++++++--
include/linux/bvec.h | 31 +++++++++++++++++++++++++++++++
27 files changed, 128 insertions(+), 46 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 4f4d9884443b..2680aa42a625 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1073,8 +1073,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
{
int i;
struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
ret = copy_page_from_iter(bvec->bv_page,
@@ -1104,8 +1105,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
{
int i;
struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
ret = copy_page_to_iter(bvec->bv_page,
@@ -1127,8 +1129,9 @@ void bio_free_pages(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
__free_page(bvec->bv_page);
}
EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
out_unmap:
- bio_for_each_segment_all(bvec, bio, j) {
+ bio_for_each_segment_all(bvec, bio, j, iter_all) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
/*
* make sure we dirty pages we wrote to
*/
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1597,8 +1604,9 @@ static void bio_release_pages(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
put_page(bvec->bv_page);
}
@@ -1645,8 +1653,9 @@ void bio_check_pages_dirty(struct bio *bio)
struct bio_vec *bvec;
unsigned long flags;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
goto defer;
}
diff --git a/block/bounce.c b/block/bounce.c
index 559c55bda040..7338041e3042 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -165,11 +165,12 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool)
struct bio_vec *bvec, orig_vec;
int i;
struct bvec_iter orig_iter = bio_orig->bi_iter;
+ struct bvec_iter_all iter_all;
/*
* free up bounce indirect pages used
*/
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
orig_vec = bio_iter_iovec(bio_orig, orig_iter);
if (bvec->bv_page != orig_vec.bv_page) {
dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
@@ -293,6 +294,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
bool bounce = false;
int sectors = 0;
bool passthrough = bio_is_passthrough(*bio_orig);
+ struct bvec_iter_all iter_all;
bio_for_each_segment(from, *bio_orig, iter) {
if (i++ < BIO_MAX_PAGES)
@@ -312,7 +314,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
bio = bounce_clone_bio(*bio_orig, GFP_NOIO, passthrough ? NULL :
&bounce_bio_set);
- bio_for_each_segment_all(to, bio, i) {
+ bio_for_each_segment_all(to, bio, i, iter_all) {
struct page *page = to->bv_page;
if (page_to_pfn(page) <= q->limits.bounce_pfn)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 3f4211b5cd33..6242ae4e2127 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -427,8 +427,9 @@ static void do_btree_node_write(struct btree *b)
int j;
struct bio_vec *bv;
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bv, b->bio, j)
+ bio_for_each_segment_all(bv, b->bio, j, iter_all)
memcpy(page_address(bv->bv_page),
base + j * PAGE_SIZE, PAGE_SIZE);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index b8eec515a003..a0dcf28c01b5 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1447,8 +1447,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
{
unsigned int i;
struct bio_vec *bv;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bv, clone, i) {
+ bio_for_each_segment_all(bv, clone, i, iter_all) {
BUG_ON(!bv->bv_page);
mempool_free(bv->bv_page, &cc->page_pool);
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1d54109071cc..6f74a3b06c7e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2114,13 +2114,14 @@ static void process_checks(struct r1bio *r1_bio)
struct page **spages = get_resync_pages(sbio)->pages;
struct bio_vec *bi;
int page_len[RESYNC_PAGES] = { 0 };
+ struct bvec_iter_all iter_all;
if (sbio->bi_end_io != end_sync_read)
continue;
/* Now we can 'fixup' the error value */
sbio->bi_status = 0;
- bio_for_each_segment_all(bi, sbio, j)
+ bio_for_each_segment_all(bi, sbio, j, iter_all)
page_len[j] = bi->bv_len;
if (!status) {
diff --git a/drivers/staging/erofs/data.c b/drivers/staging/erofs/data.c
index 6384f73e5418..96240ceca02a 100644
--- a/drivers/staging/erofs/data.c
+++ b/drivers/staging/erofs/data.c
@@ -20,8 +20,9 @@ static inline void read_endio(struct bio *bio)
int i;
struct bio_vec *bvec;
const blk_status_t err = bio->bi_status;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
/* page is already locked */
diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 79d3ba62b298..41a8a9399863 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -731,11 +731,12 @@ static inline void z_erofs_vle_read_endio(struct bio *bio)
const blk_status_t err = bio->bi_status;
unsigned int i;
struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
#ifdef EROFS_FS_HAS_MANAGED_CACHE
struct address_space *mngda = NULL;
#endif
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
bool cachemngd = false;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d79bc80fb41..6f505982e6b1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -197,6 +197,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
ssize_t ret;
blk_qc_t qc;
int i;
+ struct bvec_iter_all iter_all;
if ((pos | iov_iter_alignment(iter)) &
(bdev_logical_block_size(bdev) - 1))
@@ -246,7 +247,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
}
__set_current_state(TASK_RUNNING);
- bio_for_each_segment_all(bvec, &bio, i) {
+ bio_for_each_segment_all(bvec, &bio, i, iter_all) {
if (should_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
put_page(bvec->bv_page);
@@ -314,8 +315,9 @@ static void blkdev_bio_end_io(struct bio *bio)
} else {
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
put_page(bvec->bv_page);
bio_put(bio);
}
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2955a4ea2fa8..602a74c645c3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -162,13 +162,14 @@ static void end_compressed_bio_read(struct bio *bio)
} else {
int i;
struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
/*
* we have verified the checksum already, set page
* checked so the end_io handlers know about it
*/
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, cb->orig_bio, i)
+ bio_for_each_segment_all(bvec, cb->orig_bio, i, iter_all)
SetPageChecked(bvec->bv_page);
bio_endio(cb->orig_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3f0b6d1936e8..9f2a31bea08d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -819,9 +819,10 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
struct bio_vec *bvec;
struct btrfs_root *root;
int i, ret = 0;
+ struct bvec_iter_all iter_all;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
root = BTRFS_I(bvec->bv_page->mapping->host)->root;
ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
if (ret)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 874bb9aeebdc..9373eb8ade06 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2352,10 +2352,11 @@ static unsigned btrfs_bio_pages_all(struct bio *bio)
{
unsigned i;
struct bio_vec *bv;
+ struct bvec_iter_all iter_all;
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bv, bio, i)
+ bio_for_each_segment_all(bv, bio, i, iter_all)
;
return i;
}
@@ -2457,9 +2458,10 @@ static void end_bio_extent_writepage(struct bio *bio)
u64 start;
u64 end;
int i;
+ struct bvec_iter_all iter_all;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2528,9 +2530,10 @@ static void end_bio_extent_readpage(struct bio *bio)
int mirror;
int ret;
int i;
+ struct bvec_iter_all iter_all;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -3682,9 +3685,10 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
struct bio_vec *bvec;
struct extent_buffer *eb;
int i, done;
+ struct bvec_iter_all iter_all;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
eb = (struct extent_buffer *)page->private;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ea4c6f0352f..2850fca6cc44 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7822,6 +7822,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
struct bio_vec *bvec;
struct extent_io_tree *io_tree, *failure_tree;
int i;
+ struct bvec_iter_all iter_all;
if (bio->bi_status)
goto end;
@@ -7833,7 +7834,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
done->uptodate = 1;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree,
io_tree, done->start, bvec->bv_page,
btrfs_ino(BTRFS_I(inode)), 0);
@@ -7912,6 +7913,7 @@ static void btrfs_retry_endio(struct bio *bio)
int uptodate;
int ret;
int i;
+ struct bvec_iter_all iter_all;
if (bio->bi_status)
goto end;
@@ -7925,7 +7927,7 @@ static void btrfs_retry_endio(struct bio *bio)
failure_tree = &BTRFS_I(inode)->io_failure_tree;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
bvec->bv_offset, done->start,
bvec->bv_len);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index df41d7049936..e33a99871d60 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1443,10 +1443,11 @@ static void set_bio_pages_uptodate(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
SetPageUptodate(bvec->bv_page);
}
diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index 0959044c5cee..5759bcd018cd 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -30,8 +30,9 @@ static void __fscrypt_decrypt_bio(struct bio *bio, bool done)
{
struct bio_vec *bv;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_segment_all(bv, bio, i, iter_all) {
struct page *page = bv->bv_page;
int ret = fscrypt_decrypt_page(page->mapping->host, page,
PAGE_SIZE, 0, page->index);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index ea07d5a34317..5904fc2e180c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -551,7 +551,9 @@ static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
- bio_for_each_segment_all(bvec, bio, i) {
+ struct bvec_iter_all iter_all;
+
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
if (dio->op == REQ_OP_READ && !PageCompound(page) &&
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 5331a15a61f1..24a8e34882e9 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -420,8 +420,9 @@ static void _clear_bio(struct bio *bio)
{
struct bio_vec *bv;
unsigned i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_segment_all(bv, bio, i, iter_all) {
unsigned this_count = bv->bv_len;
if (likely(PAGE_SIZE == this_count))
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 199590f36203..e83bab54b03e 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -468,11 +468,12 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
/* loop on all devices all pages */
for (d = 0; d < ios->numdevs; d++) {
struct bio *bio = ios->per_dev[d].bio;
+ struct bvec_iter_all iter_all;
if (!bio)
continue;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_segment_all(bv, bio, i, iter_all) {
struct page *page = bv->bv_page;
SetPageUptodate(page);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index db7590178dfc..0644b4e7d6d4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -63,8 +63,9 @@ static void ext4_finish_bio(struct bio *bio)
{
int i;
struct bio_vec *bvec;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
#ifdef CONFIG_EXT4_FS_ENCRYPTION
struct page *data_page = NULL;
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index f461d75ac049..b0d9537bc797 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -72,6 +72,7 @@ static void mpage_end_io(struct bio *bio)
{
struct bio_vec *bv;
int i;
+ struct bvec_iter_all iter_all;
if (ext4_bio_encrypted(bio)) {
if (bio->bi_status) {
@@ -81,7 +82,7 @@ static void mpage_end_io(struct bio *bio)
return;
}
}
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_segment_all(bv, bio, i, iter_all) {
struct page *page = bv->bv_page;
if (!bio->bi_status) {
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b293cb3e27a2..d28f482a0d52 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -87,8 +87,9 @@ static void __read_end_io(struct bio *bio)
struct page *page;
struct bio_vec *bv;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_segment_all(bv, bio, i, iter_all) {
page = bv->bv_page;
/* PG_error was set if any post_read step failed */
@@ -164,13 +165,14 @@ static void f2fs_write_end_io(struct bio *bio)
struct f2fs_sb_info *sbi = bio->bi_private;
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
if (time_to_inject(sbi, FAULT_WRITE_IO)) {
f2fs_show_injection_info(FAULT_WRITE_IO);
bio->bi_status = BLK_STS_IOERR;
}
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
enum count_type type = WB_DATA_TYPE(page);
@@ -347,6 +349,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode,
struct bio_vec *bvec;
struct page *target;
int i;
+ struct bvec_iter_all iter_all;
if (!io->bio)
return false;
@@ -354,7 +357,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode,
if (!inode && !page && !ino)
return true;
- bio_for_each_segment_all(bvec, io->bio, i) {
+ bio_for_each_segment_all(bvec, io->bio, i, iter_all) {
if (bvec->bv_page->mapping)
target = bvec->bv_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 4c7069b8f3c1..f2f165620161 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -168,7 +168,8 @@ u64 gfs2_log_bmap(struct gfs2_sbd *sdp)
* that is pinned in the pagecache.
*/
-static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
+static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
+ struct bio_vec *bvec,
blk_status_t error)
{
struct buffer_head *bh, *next;
@@ -207,6 +208,7 @@ static void gfs2_end_log_write(struct bio *bio)
struct bio_vec *bvec;
struct page *page;
int i;
+ struct bvec_iter_all iter_all;
if (bio->bi_status) {
fs_err(sdp, "Error %d writing to journal, jid=%u\n",
@@ -214,7 +216,7 @@ static void gfs2_end_log_write(struct bio *bio)
wake_up(&sdp->sd_logd_waitq);
}
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
page = bvec->bv_page;
if (page_has_buffers(page))
gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index be9c0bf697fe..3201342404a7 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -190,8 +190,9 @@ static void gfs2_meta_read_endio(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_segment_all(bvec, bio, i, iter_all) {
struct page *page = bvec->bv_page;
struct buffer_head *bh = page_buffers(page);
unsigned int len = bvec->bv_len;
diff --git a/fs/iomap.c b/fs/iomap.c
index b0462b363bad..f5fb8bf75cc8 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -262,8 +262,9 @@ iomap_read_end_io(struct bio *bio)
int error = blk_status_to_errno(bio->bi_status);
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
iomap_read_page_end_io(bvec, error);
bio_put(bio);
}
@@ -1541,8 +1542,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
} else {
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
put_page(bvec->bv_page);
bio_put(bio);
}
diff --git a/fs/mpage.c b/fs/mpage.c
index c820dc9bebab..3f19da75178b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,8 +48,9 @@ static void mpage_end_io(struct bio *bio)
{
struct bio_vec *bv;
int i;
+ struct bvec_iter_all iter_all;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_segment_all(bv, bio, i, iter_all) {
struct page *page = bv->bv_page;
page_endio(page, bio_op(bio),
blk_status_to_errno(bio->bi_status));
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..1f1829e506e8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -62,7 +62,7 @@ xfs_find_daxdev_for_inode(
static void
xfs_finish_page_writeback(
struct inode *inode,
- struct bio_vec *bvec,
+ struct bio_vec *bvec,
int error)
{
struct iomap_page *iop = to_iomap_page(bvec->bv_page);
@@ -98,6 +98,7 @@ xfs_destroy_ioend(
for (bio = &ioend->io_inline_bio; bio; bio = next) {
struct bio_vec *bvec;
int i;
+ struct bvec_iter_all iter_all;
/*
* For the last bio, bi_private points to the ioend, so we
@@ -109,7 +110,7 @@ xfs_destroy_ioend(
next = bio->bi_private;
/* walk each page on bio, ending page IO on them */
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_segment_all(bvec, bio, i, iter_all)
xfs_finish_page_writeback(inode, bvec, error);
bio_put(bio);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9d6284f53c07..7edad188568a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -131,12 +131,19 @@ static inline bool bio_full(struct bio *bio)
return bio->bi_vcnt >= bio->bi_max_vecs;
}
+#define bvec_for_each_segment(bv, bvl, i, iter_all) \
+ for (bv = bvec_init_iter_all(&iter_all); \
+ (iter_all.done < (bvl)->bv_len) && \
+ (bvec_next_segment((bvl), &iter_all), 1); \
+ iter_all.done += bv->bv_len, i += 1)
+
/*
* drivers should _never_ use the all version - the bio may have been split
* before it got to the driver and the driver won't own all of it
*/
-#define bio_for_each_segment_all(bvl, bio, i) \
- for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
+#define bio_for_each_segment_all(bvl, bio, i, iter_all) \
+ for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++) \
+ bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
unsigned bytes, unsigned max_seg_len)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index b37d13a79a7d..cab36d838ed0 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -47,6 +47,12 @@ struct bvec_iter {
current bvec */
};
+struct bvec_iter_all {
+ struct bio_vec bv;
+ int idx;
+ unsigned done;
+};
+
/*
* various member access, note that bio_data should of course not be used
* on highmem page vectors
@@ -173,6 +179,31 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
.bi_bvec_done = 0, \
}
+static inline struct bio_vec *bvec_init_iter_all(struct bvec_iter_all *iter_all)
+{
+ iter_all->bv.bv_page = NULL;
+ iter_all->done = 0;
+
+ return &iter_all->bv;
+}
+
+/* used for bvec_for_each_segment */
+static inline void bvec_next_segment(const struct bio_vec *bvec,
+ struct bvec_iter_all *iter_all)
+{
+ struct bio_vec *bv = &iter_all->bv;
+
+ if (bv->bv_page) {
+ bv->bv_page += 1;
+ bv->bv_offset = 0;
+ } else {
+ bv->bv_page = bvec->bv_page;
+ bv->bv_offset = bvec->bv_offset;
+ }
+ bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
+ bvec->bv_len - iter_all->done);
+}
+
/*
* Get the last single-page segment from the multi-page bvec and store it
* in @seg
--
2.9.5
Now that multi-page bvec is supported, some helpers return data page by
page, while others return it segment by segment. This patch documents
the usage.
Signed-off-by: Ming Lei <[email protected]>
---
Documentation/block/biovecs.txt | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..bb008f7afb05 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,27 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=================
+
+* The following helpers, whose names have the suffix "_all", can only be used
+on non-BIO_CLONED bios. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+ bio_for_each_segment_all()
+ bio_first_bvec_all()
+ bio_first_page_all()
+ bio_last_bvec_all()
+
+* The following helpers iterate over single-page bvecs. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration.
+
+ bio_for_each_segment()
+ bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvecs. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration.
+
+ bio_for_each_bvec()
--
2.9.5
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-mq-debugfs.c | 1 -
drivers/block/loop.c | 2 +-
drivers/block/nbd.c | 2 +-
drivers/block/rbd.c | 2 +-
drivers/block/skd_main.c | 1 -
drivers/block/xen-blkfront.c | 2 +-
drivers/md/dm-rq.c | 2 +-
drivers/mmc/core/queue.c | 3 +--
drivers/scsi/scsi_lib.c | 2 +-
include/linux/blk-mq.h | 1 -
10 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d752fe4461af..a6ec055b54fa 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -249,7 +249,6 @@ static const char *const alloc_policy_name[] = {
static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
- HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
};
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e3683211f12d..4cf5486689de 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1906,7 +1906,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
- lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08696f5f00bb..999c94de78e5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1570,7 +1570,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
- BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+ BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8e5140bbf241..3dfd300b5283 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
rbd_dev->tag_set.ops = &rbd_mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
- rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a10d5736d8f7..a7040f9a1b1b 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
- BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
- info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1f1fe9a618ea..afbac62a02a2 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -536,7 +536,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
md->tag_set->ops = &dm_mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
- md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
else
mq->tag_set.queue_depth = MMC_QUEUE_DEPTH;
mq->tag_set.numa_node = NUMA_NO_NODE;
- mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE |
- BLK_MQ_F_BLOCKING;
+ mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
mq->tag_set.nr_hw_queues = 1;
mq->tag_set.cmd_size = sizeof(struct mmc_queue_req);
mq->tag_set.driver_data = mq;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0df15cb738d2..4091a67d23e5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1890,7 +1890,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
shost->tag_set.queue_depth = shost->can_queue;
shost->tag_set.cmd_size = cmd_size;
shost->tag_set.numa_node = NUMA_NO_NODE;
- shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
shost->tag_set.flags |=
BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
shost->tag_set.driver_data = shost;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 929e8abc5535..ca7389d7e04f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -211,7 +211,6 @@ struct blk_mq_ops {
enum {
BLK_MQ_F_SHOULD_MERGE = 1 << 0,
BLK_MQ_F_TAG_SHARED = 1 << 1,
- BLK_MQ_F_SG_MERGE = 1 << 2,
BLK_MQ_F_BLOCKING = 1 << 5,
BLK_MQ_F_NO_SCHED = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
--
2.9.5
This patch introduces 'segment_iter_*' helpers for multi-page bvec
support.
The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.
The existing bvec_iter_* helpers are the interfaces for the current bvec
iterator, which drivers, filesystems, dm and so on treat as single-page.
They now build single-page bvecs in flight, so this way won't break
current bio/bvec users, which need no change.
Some multi-page bvec background:
- bvecs stored in bio->bi_io_vec are always multi-page style
- a bvec (struct bio_vec) represents one physically contiguous I/O
buffer; the buffer may include more than one page once multi-page
bvec is supported, and all the pages represented by one bvec are
physically contiguous. Before multi-page bvec support, at most one
page was included in one bvec; we call that a single-page bvec.
- .bv_page of the bvec points to the 1st page in the multi-page bvec
- .bv_offset of the bvec is the offset of the buffer in the bvec
The effect on the current drivers/filesystem/dm/bcache/...:
- almost everyone assumes that one bvec only includes one single
page, so we keep the single-page interfaces unchanged; for example,
bio_for_each_segment() still returns single-page bvecs
- bio_for_each_segment_all() will return single-page bvecs too
- during iteration, the iterator variable (struct bvec_iter) is always
updated in multi-page bvec style, and bvec_iter_advance() is kept
unchanged
- the returned (copied) single-page bvec is built in flight by bvec
helpers from the stored multi-page bvec
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bvec.h | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ed90bbf4c9c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
#include <linux/kernel.h>
#include <linux/bug.h>
#include <linux/errno.h>
+#include <linux/mm.h>
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,35 @@ struct bvec_iter {
*/
#define __bvec_iter_bvec(bvec, iter) (&(bvec)[(iter).bi_idx])
-#define bvec_iter_page(bvec, iter) \
+#define segment_iter_page(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
-#define bvec_iter_len(bvec, iter) \
+#define segment_iter_len(bvec, iter) \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
-#define bvec_iter_offset(bvec, iter) \
+#define segment_iter_offset(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
+#define segment_iter_page_idx(bvec, iter) \
+ (segment_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+/*
+ * <page, offset, length> of a single-page segment.
+ *
+ * These helpers are for building single-page bvecs in flight.
+ */
+#define bvec_iter_offset(bvec, iter) \
+ (segment_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter) \
+ min_t(unsigned, segment_iter_len((bvec), (iter)), \
+ PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define bvec_iter_page(bvec, iter) \
+ nth_page(segment_iter_page((bvec), (iter)), \
+ segment_iter_page_idx((bvec), (iter)))
+
#define bvec_iter_bvec(bvec, iter) \
((struct bio_vec) { \
.bv_page = bvec_iter_page((bvec), (iter)), \
--
2.9.5
First, it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.
Secondly, once bio_for_each_bvec() is used, a bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.
Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so the bio split needs to be considered in
this situation too.
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 87 +++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 68 insertions(+), 19 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index f52400ce2187..ec0b93fa1ff8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,54 @@ static inline unsigned get_max_io_size(struct request_queue *q,
return sectors;
}
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+ unsigned *nsegs, unsigned *last_seg_size,
+ unsigned *front_seg_size, unsigned *sectors)
+{
+ unsigned len = bv->bv_len;
+ unsigned total_len = 0;
+ unsigned new_nsegs = 0, seg_size = 0;
+
+ /*
+ * Multipage bvec may be too big to hold in one segment,
+ * so the current bvec has to be split into multiple
+ * segments.
+ */
+ while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+ seg_size = min(queue_max_segment_size(q), len);
+
+ new_nsegs++;
+ total_len += seg_size;
+ len -= seg_size;
+
+ if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+ break;
+ }
+
+ /* update front segment size */
+ if (!*nsegs) {
+ unsigned first_seg_size = seg_size;
+
+ if (new_nsegs > 1)
+ first_seg_size = queue_max_segment_size(q);
+ if (*front_seg_size < first_seg_size)
+ *front_seg_size = first_seg_size;
+ }
+
+ /* update other variables */
+ *last_seg_size = seg_size;
+ *nsegs += new_nsegs;
+ if (sectors)
+ *sectors += total_len >> 9;
+
+ /* split in the middle of the bvec if len != 0 */
+ return !!len;
+}
+
static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -174,7 +222,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
- bio_for_each_segment(bv, bio, iter) {
+ bio_for_each_bvec(bv, bio, iter) {
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
@@ -189,8 +237,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
*/
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
- nsegs++;
- sectors = max_sectors;
+ /* split in the middle of bvec */
+ bv.bv_len = (max_sectors - sectors) << 9;
+ bvec_split_segs(q, &bv, &nsegs,
+ &seg_size,
+ &front_seg_size,
+ &sectors);
}
goto split;
}
@@ -212,14 +264,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (nsegs == queue_max_segments(q))
goto split;
- if (nsegs == 1 && seg_size > front_seg_size)
- front_seg_size = seg_size;
-
- nsegs++;
bvprv = bv;
bvprvp = &bvprv;
- seg_size = bv.bv_len;
- sectors += bv.bv_len >> 9;
+
+ if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
+ &front_seg_size, &sectors))
+ goto split;
}
@@ -233,8 +283,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
bio = new;
}
- if (nsegs == 1 && seg_size > front_seg_size)
- front_seg_size = seg_size;
bio->bi_seg_front_size = front_seg_size;
if (seg_size > bio->bi_seg_back_size)
bio->bi_seg_back_size = seg_size;
@@ -297,6 +345,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio_vec bv, bvprv = { NULL };
int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
+ unsigned front_seg_size = bio->bi_seg_front_size;
struct bio *fbio, *bbio;
struct bvec_iter iter;
@@ -317,7 +366,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
seg_size = 0;
nr_phys_segs = 0;
for_each_bio(bio) {
- bio_for_each_segment(bv, bio, iter) {
+ bio_for_each_bvec(bv, bio, iter) {
/*
* If SG merging is disabled, each bio vector is
* a segment
@@ -337,20 +386,20 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
continue;
}
new_segment:
- if (nr_phys_segs == 1 && seg_size >
- fbio->bi_seg_front_size)
- fbio->bi_seg_front_size = seg_size;
+ if (nr_phys_segs == 1 && seg_size > front_seg_size)
+ front_seg_size = seg_size;
- nr_phys_segs++;
bvprv = bv;
prev = 1;
- seg_size = bv.bv_len;
+ bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
+ &front_seg_size, NULL);
}
bbio = bio;
}
- if (nr_phys_segs == 1 && seg_size > fbio->bi_seg_front_size)
- fbio->bi_seg_front_size = seg_size;
+ if (nr_phys_segs == 1 && seg_size > front_seg_size)
+ front_seg_size = seg_size;
+ fbio->bi_seg_front_size = front_seg_size;
if (seg_size > bbio->bi_seg_back_size)
bbio->bi_seg_back_size = seg_size;
--
2.9.5
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio, even if the CLONED flag isn't set on the bio,
because the bio may have been split or advanced.
So always use bio_segments() in blk_recount_segments(). This shouldn't
cause any performance loss now, because the physical segment number is
figured out in blk_queue_split() and BIO_SEG_VALID is set there too, since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
Reviewed-by: Christoph Hellwig <[email protected]>
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index b1df622cbd85..f52400ce2187 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -368,13 +368,7 @@ void blk_recalc_rq_segments(struct request *rq)
void blk_recount_segments(struct request_queue *q, struct bio *bio)
{
- unsigned short seg_cnt;
-
- /* estimate segment number by bi_vcnt for non-cloned bio */
- if (bio_flagged(bio, BIO_CLONED))
- seg_cnt = bio_segments(bio);
- else
- seg_cnt = bio->bi_vcnt;
+ unsigned short seg_cnt = bio_segments(bio);
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
(seg_cnt < queue_max_segments(q)))
--
2.9.5
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given this is the only such
case, open-code the bvec table access, since bio_for_each_segment_all()
will be changed to iterate over multi-page bvecs.
Acked-by: Coly Li <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/util.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
- bio_for_each_segment_all(bv, bio, i) {
+ /*
+ * This is called on a freshly allocated bio, so it is safe to access
+ * the bvec table directly.
+ */
+ for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
--
2.9.5
It is more efficient to use bio_for_each_bvec() to map sg; meanwhile
we have to consider splitting the multi-page bvec as done in blk_bio_segment_split().
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 68 +++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 48 insertions(+), 20 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ec0b93fa1ff8..8829c51b4e75 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -455,6 +455,52 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
}
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+ struct scatterlist *sglist)
+{
+ if (!*sg)
+ return sglist;
+
+ /*
+ * If the driver previously mapped a shorter list, we could see a
+ * termination bit prematurely unless it fully inits the sg table
+ * on each mapping. We KNOW that there must be more entries here
+ * or the driver would be buggy, so force clear the termination bit
+ * to avoid doing a full sg_init_table() in drivers for each command.
+ */
+ sg_unmark_end(*sg);
+ return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+ struct bio_vec *bvec, struct scatterlist *sglist,
+ struct scatterlist **sg)
+{
+ unsigned nbytes = bvec->bv_len;
+ unsigned nsegs = 0, total = 0;
+
+ while (nbytes > 0) {
+ unsigned seg_size;
+ struct page *pg;
+ unsigned offset, idx;
+
+ *sg = blk_next_sg(sg, sglist);
+
+ seg_size = min(nbytes, queue_max_segment_size(q));
+ offset = (total + bvec->bv_offset) % PAGE_SIZE;
+ idx = (total + bvec->bv_offset) / PAGE_SIZE;
+ pg = nth_page(bvec->bv_page, idx);
+
+ sg_set_page(*sg, pg, seg_size, offset);
+
+ total += seg_size;
+ nbytes -= seg_size;
+ nsegs++;
+ }
+
+ return nsegs;
+}
+
static inline void
__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -472,25 +518,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
(*sg)->length += nbytes;
} else {
new_segment:
- if (!*sg)
- *sg = sglist;
- else {
- /*
- * If the driver previously mapped a shorter
- * list, we could see a termination bit
- * prematurely unless it fully inits the sg
- * table on each mapping. We KNOW that there
- * must be more entries here or the driver
- * would be buggy, so force clear the
- * termination bit to avoid doing a full
- * sg_init_table() in drivers for each command.
- */
- sg_unmark_end(*sg);
- *sg = sg_next(*sg);
- }
-
- sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
- (*nsegs)++;
+ (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
}
@@ -512,7 +540,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
int cluster = blk_queue_cluster(q), nsegs = 0;
for_each_bio(bio)
- bio_for_each_segment(bvec, bio, iter)
+ bio_for_each_bvec(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
&nsegs, &cluster);
--
2.9.5
This patch pulls the trigger for multi-page bvecs.
Signed-off-by: Ming Lei <[email protected]>
---
block/bio.c | 32 +++++++++++++++++++++++++++-----
fs/iomap.c | 2 +-
fs/xfs/xfs_aops.c | 2 +-
3 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 0f1635b9ec50..854676edc438 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -823,7 +823,7 @@ EXPORT_SYMBOL(bio_add_pc_page);
* @len: length of the data to add
* @off: offset of the data in @page
*
- * Try to add the data at @page + @off to the last bvec of @bio. This is a
+ * Try to add the data at @page + @off to the last page of @bio. This is a
* a useful optimisation for file systems with a block size smaller than the
* page size.
*
@@ -836,10 +836,13 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
return false;
if (bio->bi_vcnt > 0) {
- struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+ struct bio_vec bv;
+ struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
- if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
- bv->bv_len += len;
+ bvec_last_segment(seg, &bv);
+
+ if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
+ seg->bv_len += len;
bio->bi_iter.bi_size += len;
return true;
}
@@ -848,6 +851,25 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL_GPL(__bio_try_merge_page);
+static bool bio_try_merge_segment(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int off)
+{
+ if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
+ return false;
+
+ if (bio->bi_vcnt > 0) {
+ struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ if (page_to_phys(seg->bv_page) + seg->bv_offset + seg->bv_len ==
+ page_to_phys(page) + off) {
+ seg->bv_len += len;
+ bio->bi_iter.bi_size += len;
+ return true;
+ }
+ }
+ return false;
+}
+
/**
* __bio_add_page - add page to a bio in a new segment
* @bio: destination bio
@@ -888,7 +910,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
int bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int offset)
{
- if (!__bio_try_merge_page(bio, page, len, offset)) {
+ if (!bio_try_merge_segment(bio, page, len, offset)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index f5fb8bf75cc8..ccc2ba115f4d 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -344,7 +344,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
- __bio_add_page(ctx->bio, page, plen, poff);
+ bio_add_page(ctx->bio, page, plen, poff);
done:
/*
* Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..5c2190216614 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -621,7 +621,7 @@ xfs_add_to_ioend(
atomic_inc(&iop->write_count);
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
- __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+ bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
wpc->ioend->io_size += len;
--
2.9.5
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
the physical segment number is mainly figured out in blk_queue_split() for
the fast path, and the BIO_SEG_VALID flag is set there too.
Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.
Basically blk_recount_segments() is bypassed in the fast path, given
BIO_SEG_VALID is set in blk_queue_split().
As for the other users of blk_recalc_rq_segments():
- one runs in the partial-completion branch of blk_update_request(), which
is an unusual case
- the other runs in blk_cloned_rq_check_limits(); still not a big problem if
the flag is killed, since dm-rq is the only user.
Now that multi-page bvec is enabled, not doing S/G merging is rather
pointless with the current setup of the I/O path, as it isn't going to
save a significant number of cycles.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 31 ++++++-------------------------
block/blk-mq-debugfs.c | 1 -
block/blk-mq.c | 3 ---
drivers/md/dm-table.c | 13 -------------
include/linux/blkdev.h | 1 -
5 files changed, 6 insertions(+), 43 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7c44216c1b58..8fcac7855a45 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -343,8 +343,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
EXPORT_SYMBOL(blk_queue_split);
static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
- struct bio *bio,
- bool no_sg_merge)
+ struct bio *bio)
{
struct bio_vec bv, bvprv = { NULL };
int cluster, prev = 0;
@@ -371,13 +370,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_bvec(bv, bio, iter) {
- /*
- * If SG merging is disabled, each bio vector is
- * a segment
- */
- if (no_sg_merge)
- goto new_segment;
-
if (prev && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -412,27 +404,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
void blk_recalc_rq_segments(struct request *rq)
{
- bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
- &rq->q->queue_flags);
-
- rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
- no_sg_merge);
+ rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
}
void blk_recount_segments(struct request_queue *q, struct bio *bio)
{
- unsigned short seg_cnt = bio_segments(bio);
-
- if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
- (seg_cnt < queue_max_segments(q)))
- bio->bi_phys_segments = seg_cnt;
- else {
- struct bio *nxt = bio->bi_next;
+ struct bio *nxt = bio->bi_next;
- bio->bi_next = NULL;
- bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
- bio->bi_next = nxt;
- }
+ bio->bi_next = NULL;
+ bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+ bio->bi_next = nxt;
bio_set_flag(bio, BIO_SEG_VALID);
}
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a32bb79d6c95..d752fe4461af 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -127,7 +127,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
- QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32b246ed44c0..0375c3bd410e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2755,9 +2755,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
- if (!(set->flags & BLK_MQ_F_SG_MERGE))
- blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 844f7d0f2ef8..a41832cf0c98 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
return q && !blk_queue_add_random(q);
}
-static int queue_supports_sg_merge(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
-{
- struct request_queue *q = bdev_get_queue(dev->bdev);
-
- return q && !test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
-}
-
static bool dm_table_all_devices_attribute(struct dm_table *t,
iterate_devices_callout_fn func)
{
@@ -1902,11 +1894,6 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (!dm_table_supports_write_zeroes(t))
q->limits.max_write_zeroes_sectors = 0;
- if (dm_table_all_devices_attribute(t, queue_supports_sg_merge))
- blk_queue_flag_clear(QUEUE_FLAG_NO_SG_MERGE, q);
- else
- blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
dm_table_verify_integrity(t);
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a281b6737b61..8e05966ffe94 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -590,7 +590,6 @@ struct request_queue {
#define QUEUE_FLAG_SAME_FORCE 15 /* force complete on same CPU */
#define QUEUE_FLAG_DEAD 16 /* queue tear-down finished */
#define QUEUE_FLAG_INIT_DONE 17 /* queue is initialized */
-#define QUEUE_FLAG_NO_SG_MERGE 18 /* don't attempt to merge SG segments*/
#define QUEUE_FLAG_POLL 19 /* IO polling enabled if set */
#define QUEUE_FLAG_WC 20 /* Write back caching */
#define QUEUE_FLAG_FUA 21 /* device supports FUA writes */
--
2.9.5
On 21.11.18 г. 5:23 ч., Ming Lei wrote:
> Now that multi-page bvec is supported, some helpers may return page by
> page, while some may return segment by segment; this patch
> documents the usage.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> Documentation/block/biovecs.txt | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
> index 25689584e6e0..bb008f7afb05 100644
> --- a/Documentation/block/biovecs.txt
> +++ b/Documentation/block/biovecs.txt
> @@ -117,3 +117,27 @@ Other implications:
> size limitations and the limitations of the underlying devices. Thus
> there's no need to define ->merge_bvec_fn() callbacks for individual block
> drivers.
> +
> +Usage of helpers:
> +=================
> +
> +* The following helpers whose names have the suffix of "_all" can only be used
> +on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
> +shouldn't use them because the bio may have been split before it reached the
> +driver.
> +
> + bio_for_each_segment_all()
> + bio_first_bvec_all()
> + bio_first_page_all()
> + bio_last_bvec_all()
> +
> +* The following helpers iterate over single-page bvecs. The passed 'struct
> +bio_vec' will contain a single-page IO vector during the iteration
> +
> + bio_for_each_segment()
> + bio_for_each_segment_all()
> +
> +* The following helpers iterate over single-page bvecs. The passed 'struct
> +bio_vec' will contain a single-page IO vector during the iteration
> +
> + bio_for_each_bvec()
Just put this helper right below the above 2, no need to repeat the
explanation. Also I'd suggest introducing another catch-all sentence
"All other helpers are assumed to iterate multipage bio vecs" and
perhaps give an example with 1-2 helpers.
>
> +#define bio_iter_mp_iovec(bio, iter) \
> + segment_iter_bvec((bio)->bi_io_vec, (iter))
Besides the mp naming we'd like to get rid of, there also is just
a single user of this macro; please just expand it there.
> +#define segment_iter_bvec(bvec, iter) \
> +((struct bio_vec) { \
> + .bv_page = segment_iter_page((bvec), (iter)), \
> + .bv_len = segment_iter_len((bvec), (iter)), \
> + .bv_offset = segment_iter_offset((bvec), (iter)), \
> +})
And for this one, please keep the segment vs bvec versions of these
macros close together in the file; right now it follows the
bvec_iter_bvec variant closely.
> +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes, unsigned max_seg_len)
> {
> iter->bi_sector += bytes >> 9;
>
> if (bio_no_advance_iter(bio))
> iter->bi_size -= bytes;
> else
> - bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> /* TODO: It is reasonable to complete bio with error here. */
> }
>
> +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes)
> +{
> + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> +}
Btw, I think the remaining users of bio_advance_iter() in bio.h
should probably switch to using __bio_advance_iter to make them a little
clearer to read.
> +/* returns one real segment(multi-page bvec) each time */
space before the brace, please.
> +#define BVEC_MAX_LEN ((unsigned int)-1)
> while (bytes) {
> + unsigned segment_len = segment_iter_len(bv, *iter);
>
> - iter->bi_bvec_done += len;
> + if (max_seg_len < BVEC_MAX_LEN)
> + segment_len = min_t(unsigned, segment_len,
> + max_seg_len -
> + bvec_iter_offset(bv, *iter));
> +
> + segment_len = min(bytes, segment_len);
Please stick to passing the magic zero here, as it can often generate more
efficient code.
Talking about efficient code - I wonder how much code size we'd save
by moving this function out of line.
But while looking over this I wonder why we even need the max_seg_len
here. The only thing __bvec_iter_advance does is to move bi_bvec_done
and bi_idx forward, with corresponding decrements of bi_size. As far
as I can tell the only thing that max_seg_len does is that we need
more iterations of the while loop to achieve the same thing.
And the actual bvec used by the caller will be obtained using
bvec_iter_bvec or segment_iter_bvec depending on whether they want multi-page
or single-page variants.
On Wed, Nov 21, 2018 at 11:23:10AM +0800, Ming Lei wrote:
> This patch introduces helpers of 'segment_iter_*' for multipage
> bvec support.
>
> The introduced helpers treat one bvec as a real multi-page segment,
> which may include more than one page.
Unless I'm missing something these bvec vs segment names are exactly
inverted vs how we use it elsewhere.
In the iterators we use segment for single-page bvecs, and bvec for multi-page
ones, and here it is the inverse. Please switch it around.
On Wed, Nov 21, 2018 at 02:19:28PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:23:10AM +0800, Ming Lei wrote:
> > This patch introduces helpers of 'segment_iter_*' for multipage
> > bvec support.
> >
> > The introduced helpers treat one bvec as a real multi-page segment,
> > which may include more than one page.
>
> Unless I'm missing something these bvec vs segment names are exactly
> inverted vs how we use it elsewhere.
>
> In the iterators we use segment for single-page bvec, and bvec for multi
> page ones, and here it is inverse. Please switch it around.
bvec_iter_* is used for single-page bvecs in the current Linus tree, and there
are lots of users now:
[linux]$ git grep -n "bvec_iter_*" ./ | wc
191 995 13242
If we have to switch it first, it would be a big change; just wondering if Jens
is happy with that?
Thanks,
Ming
On Wed, Nov 21, 2018 at 02:32:44PM +0100, Christoph Hellwig wrote:
> > +#define bio_iter_mp_iovec(bio, iter) \
> > + segment_iter_bvec((bio)->bi_io_vec, (iter))
>
> Besides the mp naming we'd like to get rid off there also is just
> a single user of this macro, please just expand it there.
OK.
>
> > +#define segment_iter_bvec(bvec, iter) \
> > +((struct bio_vec) { \
> > + .bv_page = segment_iter_page((bvec), (iter)), \
> > + .bv_len = segment_iter_len((bvec), (iter)), \
> > + .bv_offset = segment_iter_offset((bvec), (iter)), \
> > +})
>
> And for this one please keep the segment vs bvec versions of these
> macros close together in the file please, right now it follow the
> bvec_iter_bvec variant closely.
OK.
>
> > +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> > + unsigned bytes, unsigned max_seg_len)
> > {
> > iter->bi_sector += bytes >> 9;
> >
> > if (bio_no_advance_iter(bio))
> > iter->bi_size -= bytes;
> > else
> > - bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> > + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> > /* TODO: It is reasonable to complete bio with error here. */
> > }
> >
> > +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> > + unsigned bytes)
> > +{
> > + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> > +}
>
> Btw, I think the remaining users of bio_advance_iter() in bio.h
> should probably switch to using __bio_advance_iter to make them a little
> more clear to read.
Good point.
>
> > +/* returns one real segment(multi-page bvec) each time */
>
> space before the brace, please.
OK.
>
> > +#define BVEC_MAX_LEN ((unsigned int)-1)
>
> > while (bytes) {
> > + unsigned segment_len = segment_iter_len(bv, *iter);
> >
> > - iter->bi_bvec_done += len;
> > + if (max_seg_len < BVEC_MAX_LEN)
> > + segment_len = min_t(unsigned, segment_len,
> > + max_seg_len -
> > + bvec_iter_offset(bv, *iter));
> > +
> > + segment_len = min(bytes, segment_len);
>
> Please stick to passing the magic zero here, as it can often generate more
> efficient code.
But zero may decrease code readability. Actually the passed
'max_seg_len' is just a constant, and the compiler should generate
equally efficient code for any constant, whether 0 or something else.
>
> Talking about efficient code - I wonder how much code size we'd save
> by moving this function out of line.
That is good point, see the following diff:
[mingl@hp kernel]$ diff -u inline.size non_inline.size
--- inline.size 2018-11-21 23:24:52.305312076 +0800
+++ non_inline.size 2018-11-21 23:24:59.908393010 +0800
@@ -1,2 +1,2 @@
text data bss dec hex filename
-13429213 6893922 4292692 24615827 1779b93 vmlinux.inline
+13429153 6893346 4292692 24615191 1779917 vmlinux.non_inline
vmlinux(non_inline) is built by just moving/exporting __bvec_iter_advance()
into block/bio.c.
The difference is about 276 bytes.
>
> But while looking over this I wonder why we even need the max_seg_len
> here. The only thing __bvec_iter_advance does is to move bi_bvec_done
> and bi_idx forward, with corresponding decrements of bi_size. As far
> as I can tell the only thing that max_seg_len does is that we need
> more iterations of the while loop to achieve the same thing.
>
> And actual bvec used by the caller will be obtained using
> bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> or single-page variants.
Right, we let __bvec_iter_advance() serve both the multi-page and single-page
cases, so we have to tell it which one way or another; right now we use the
constant 'max_seg_len'.
Or do you suggest implementing two versions of __bvec_iter_advance()?
Thanks,
Ming
On Wed, Nov 21, 2018 at 03:55:02PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:23:23AM +0800, Ming Lei wrote:
> > if (bio->bi_vcnt > 0) {
> > - struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> > + struct bio_vec bv;
> > + struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
> >
> > - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > - bv->bv_len += len;
> > + bvec_last_segment(seg, &bv);
> > +
> > + if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
>
> I think this we can simplify the try to merge into bio case a bit,
> and also document it better with something like this:
>
> diff --git a/block/bio.c b/block/bio.c
> index 854676edc438..cc913281a723 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,54 +822,40 @@ EXPORT_SYMBOL(bio_add_pc_page);
> * @page: page to add
> * @len: length of the data to add
> * @off: offset of the data in @page
> + * @same_page: if %true only merge if the new data is in the same physical
> + * page as the last segment of the bio.
> *
> - * Try to add the data at @page + @off to the last page of @bio. This is a
> + * Try to add the data at @page + @off to the last bvec of @bio. This is a
> * a useful optimisation for file systems with a block size smaller than the
> * page size.
> *
> * Return %true on success or %false on failure.
> */
> bool __bio_try_merge_page(struct bio *bio, struct page *page,
> - unsigned int len, unsigned int off)
> + unsigned int len, unsigned int off, bool same_page)
> {
> if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> return false;
>
> if (bio->bi_vcnt > 0) {
> - struct bio_vec bv;
> - struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
> -
> - bvec_last_segment(seg, &bv);
> -
> - if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
> - seg->bv_len += len;
> - bio->bi_iter.bi_size += len;
> - return true;
> - }
> + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> + phys_addr_t vec_addr = page_to_phys(bv->bv_page);
> + phys_addr_t page_addr = page_to_phys(page);
> +
> + if (vec_addr + bv->bv_offset + bv->bv_len != page_addr + off)
> + return false;
> + if (same_page &&
> + (vec_addr & PAGE_SIZE) != (page_addr & PAGE_SIZE))
> + return false;
I guess the correct check should be:
end_addr = vec_addr + bv->bv_offset + bv->bv_len;
if (same_page &&
(end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
return false;
And this approach is good, will take it in V12.
Thanks,
Ming
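For illustration, the corrected same_page check above can be modelled in
userspace like this (plain integers stand in for page_to_phys() values; the
names and the 4K page size are assumptions of the sketch, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_PAGE_MASK (~(uint64_t)(MODEL_PAGE_SIZE - 1))

/*
 * Model of __bio_try_merge_page()'s merge condition with the fixed
 * same_page test: merging requires physical contiguity, and with
 * same_page set, the end of the existing vector must fall in the
 * same physical page as the data being added.
 */
static bool can_merge(uint64_t vec_addr, unsigned int bv_offset,
		      unsigned int bv_len, uint64_t page_addr,
		      unsigned int off, bool same_page)
{
	uint64_t end_addr = vec_addr + bv_offset + bv_len;

	if (end_addr != page_addr + off)
		return false;	/* not physically contiguous */
	if (same_page &&
	    (end_addr & MODEL_PAGE_MASK) != (page_addr & MODEL_PAGE_MASK))
		return false;	/* compare full page numbers, not one bit */
	return true;
}
```

Note the difference from the buggy version: 'vec_addr & PAGE_SIZE' compares a
single address bit, while masking the end address with PAGE_MASK compares whole
page numbers.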
On Wed, Nov 21, 2018 at 11:37:27PM +0800, Ming Lei wrote:
> > + bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
> > + &non_cluster_bio_set);
>
> bio_segments(*bio_orig) may be > 256, so bio_alloc_bioset() may fail.
Nothing a little min with BIO_MAX_PAGES couldn't fix. This patch
was just intended as an idea of how I think this code could work.
On Wed, Nov 21, 2018 at 11:48:13PM +0800, Ming Lei wrote:
> I guess the correct check should be:
>
> end_addr = vec_addr + bv->bv_offset + bv->bv_len;
> if (same_page &&
> (end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
> return false;
Indeed.
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1ad6eafc43f2..a281b6737b61 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -805,6 +805,10 @@ struct req_iterator {
> __rq_for_each_bio(_iter.bio, _rq) \
> bio_for_each_segment(bvl, _iter.bio, _iter.iter)
>
> +#define rq_for_each_bvec(bvl, _rq, _iter) \
> + __rq_for_each_bio(_iter.bio, _rq) \
> + bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
> +
I think this should go into the patch adding bio_for_each_bvec and
friends.
Otherwise looks fine:
Reviewed-by: Christoph Hellwig <[email protected]>
On Wed, Nov 21, 2018 at 11:23:20AM +0800, Ming Lei wrote:
> This patch introduces one extra iterator variable to bio_for_each_segment_all(),
> so that bio_for_each_segment_all() can iterate over multi-page bvecs.
>
> Given it is just a mechanical & simple change on all bio_for_each_segment_all()
> users, this patch makes the tree-wide change in one single patch, so that we can
> avoid using a temporary helper for this conversion.
>
> Signed-off-by: Ming Lei <[email protected]>
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
On Wed, Nov 21, 2018 at 11:23:19AM +0800, Ming Lei wrote:
> bch_bio_alloc_pages() is always called on a new bio, so it is safe
> to access the bvec table directly. Given it is the only case of this
> kind, open-code the bvec table access, since bio_for_each_segment_all()
> will be changed to support iterating over multi-page bvecs.
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
> + non-cluster.o
Do we really need a new source file for these few functions?
> default:
> + if (!blk_queue_cluster(q)) {
> + blk_queue_non_cluster_bio(q, bio);
> + return;
I'd name this blk_bio_segment_split_singlepage or similar.
> +static __init int init_non_cluster_bioset(void)
> +{
> + WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
> + BIOSET_NEED_BVECS));
> + WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
> + WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
Please only allocate the resources once a queue without the cluster
flag is registered, there are only very few modern drivers that do that.
> +static void non_cluster_end_io(struct bio *bio)
> +{
> + struct bio *bio_orig = bio->bi_private;
> +
> + bio_orig->bi_status = bio->bi_status;
> + bio_endio(bio_orig);
> + bio_put(bio);
> +}
Why can't we use bio_chain for the split bios?
> + bio_for_each_segment(from, *bio_orig, iter) {
> + if (i++ < max_segs)
> + sectors += from.bv_len >> 9;
> + else
> + break;
> + }
The easy to read way would be:
bio_for_each_segment(from, *bio_orig, iter) {
if (i++ == max_segs)
break;
sectors += from.bv_len >> 9;
}
> + if (sectors < bio_sectors(*bio_orig)) {
> + bio = bio_split(*bio_orig, sectors, GFP_NOIO,
> + &non_cluster_bio_split);
> + bio_chain(bio, *bio_orig);
> + generic_make_request(*bio_orig);
> + *bio_orig = bio;
I don't think this is very efficient, as this means we now
clone the bio twice, first to split it at the sector boundary,
and then again when converting it to single-page bio_vec.
I think this could be something like this (totally untested):
diff --git a/block/non-cluster.c b/block/non-cluster.c
index 9c2910be9404..60389f275c43 100644
--- a/block/non-cluster.c
+++ b/block/non-cluster.c
@@ -13,58 +13,59 @@
#include "blk.h"
-static struct bio_set non_cluster_bio_set, non_cluster_bio_split;
+static struct bio_set non_cluster_bio_set;
static __init int init_non_cluster_bioset(void)
{
WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
BIOSET_NEED_BVECS));
WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
- WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
return 0;
}
__initcall(init_non_cluster_bioset);
-static void non_cluster_end_io(struct bio *bio)
-{
- struct bio *bio_orig = bio->bi_private;
-
- bio_orig->bi_status = bio->bi_status;
- bio_endio(bio_orig);
- bio_put(bio);
-}
-
void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig)
{
- struct bio *bio;
struct bvec_iter iter;
- struct bio_vec from;
- unsigned i = 0;
- unsigned sectors = 0;
- unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES,
- queue_max_segments(q));
+ struct bio *bio;
+ struct bio_vec bv;
+ unsigned short max_segs, segs = 0;
+
+ bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
+ &non_cluster_bio_set);
+ bio->bi_disk = (*bio_orig)->bi_disk;
+ bio->bi_partno = (*bio_orig)->bi_partno;
+ bio_set_flag(bio, BIO_CLONED);
+ if (bio_flagged(*bio_orig, BIO_THROTTLED))
+ bio_set_flag(bio, BIO_THROTTLED);
+ bio->bi_opf = (*bio_orig)->bi_opf;
+ bio->bi_ioprio = (*bio_orig)->bi_ioprio;
+ bio->bi_write_hint = (*bio_orig)->bi_write_hint;
+ bio->bi_iter.bi_sector = (*bio_orig)->bi_iter.bi_sector;
+ bio->bi_iter.bi_size = (*bio_orig)->bi_iter.bi_size;
+
+ if (bio_integrity(*bio_orig))
+ bio_integrity_clone(bio, *bio_orig, GFP_NOIO);
- bio_for_each_segment(from, *bio_orig, iter) {
- if (i++ < max_segs)
- sectors += from.bv_len >> 9;
- else
+ bio_clone_blkcg_association(bio, *bio_orig);
+
+ max_segs = min_t(unsigned short, queue_max_segments(q), BIO_MAX_PAGES);
+ bio_for_each_segment(bv, *bio_orig, iter) {
> + bio->bi_io_vec[segs] = bv;
> + if (++segs == max_segs)
break;
}
- if (sectors < bio_sectors(*bio_orig)) {
- bio = bio_split(*bio_orig, sectors, GFP_NOIO,
- &non_cluster_bio_split);
- bio_chain(bio, *bio_orig);
- generic_make_request(*bio_orig);
- *bio_orig = bio;
- }
- bio = bio_clone_bioset(*bio_orig, GFP_NOIO, &non_cluster_bio_set);
+ bio->bi_vcnt = segs;
+ bio->bi_phys_segments = segs;
+ bio_set_flag(bio, BIO_SEG_VALID);
+ bio_chain(bio, *bio_orig);
- bio->bi_phys_segments = bio_segments(bio);
- bio_set_flag(bio, BIO_SEG_VALID);
- bio->bi_end_io = non_cluster_end_io;
+ if (bio_integrity(bio))
+ bio_integrity_trim(bio);
+ bio_advance(bio, (*bio_orig)->bi_iter.bi_size);
- bio->bi_private = *bio_orig;
+ generic_make_request(*bio_orig);
*bio_orig = bio;
}
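For illustration, the bookkeeping of the single-clone idea above can be
sketched in userspace: copy up to max_segs single-page vectors into the
child's table and report how many bytes the child covers, so the parent can be
advanced past them and resubmitted (bio_chain-style). The types and names
here are hypothetical stand-ins, not kernel API:

```c
#include <assert.h>

/* Hypothetical stand-in for a bio_vec; only the length matters here. */
struct model_vec {
	unsigned int len;
};

/*
 * Copy up to max_segs vectors from the front of 'src' into 'dst',
 * returning how many were taken and, via *bytes_taken, how much data
 * the child bio would cover.  The parent would then be advanced by
 * that many bytes and resubmitted.
 */
static unsigned int take_front(const struct model_vec *src, unsigned int nr_src,
			       struct model_vec *dst, unsigned int max_segs,
			       unsigned int *bytes_taken)
{
	unsigned int i, segs = 0;

	*bytes_taken = 0;
	for (i = 0; i < nr_src && segs < max_segs; i++) {
		dst[segs++] = src[i];	/* one increment per copied vector */
		*bytes_taken += src[i].len;
	}
	return segs;
}
```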
On Wed, Nov 21, 2018 at 09:45:25AM +0200, Nikolay Borisov wrote:
> > + bio_for_each_segment_all()
> > + bio_first_bvec_all()
> > + bio_first_page_all()
> > + bio_last_bvec_all()
> > +
> > +* The following helpers iterate over single-page bvecs. The passed 'struct
> > +bio_vec' will contain a single-page IO vector during the iteration
> > +
> > + bio_for_each_segment()
> > + bio_for_each_segment_all()
> > +
> > +* The following helpers iterate over single-page bvecs. The passed 'struct
> > +bio_vec' will contain a single-page IO vector during the iteration
> > +
> > + bio_for_each_bvec()
>
> Just put this helper right below the above 2, no need to repeat the
> explanation. Also I'd suggest introducing another catch-all sentence
> "All other helpers are assumed to iterate multipage bio vecs" and
> perhaps give an example with 1-2 helpers.
Well, I think the second explanation is wrong - bio_for_each_bvec
iterates over whole (multi-page) bvecs, not just single pages.
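To make the two granularities concrete, here is a userspace sketch
(illustrative types and names, not kernel code): a multi-page bvec is yielded
once by the bvec-level iterator, while the segment-level iterator splits it at
page boundaries.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for a multi-page bio_vec. */
struct mp_vec {
	unsigned int offset;	/* offset from a page-aligned base */
	unsigned int len;
};

/*
 * bio_for_each_bvec() would yield each entry once; this models
 * bio_for_each_segment(), which yields one single-page chunk per
 * iteration, i.e. splits every bvec at page boundaries.
 */
static unsigned int count_single_page_segments(const struct mp_vec *v,
					       unsigned int nr)
{
	unsigned int i, segs = 0;

	for (i = 0; i < nr; i++) {
		unsigned int done = 0;

		while (done < v[i].len) {
			unsigned int off = (v[i].offset + done) % MODEL_PAGE_SIZE;
			unsigned int len = v[i].len - done;

			if (len > MODEL_PAGE_SIZE - off)
				len = MODEL_PAGE_SIZE - off; /* clamp to page end */
			done += len;
			segs++;
		}
	}
	return segs;
}
```

So a single physically contiguous 12K bvec counts as one bvec but three
single-page segments.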
On Wed, Nov 21, 2018 at 11:23:23AM +0800, Ming Lei wrote:
> if (bio->bi_vcnt > 0) {
> - struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> + struct bio_vec bv;
> + struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
>
> - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> - bv->bv_len += len;
> + bvec_last_segment(seg, &bv);
> +
> + if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
I think we can simplify the try-to-merge-into-bio case a bit,
and also document it better with something like this:
diff --git a/block/bio.c b/block/bio.c
index 854676edc438..cc913281a723 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -822,54 +822,40 @@ EXPORT_SYMBOL(bio_add_pc_page);
* @page: page to add
* @len: length of the data to add
* @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ * page as the last segment of the bio.
*
- * Try to add the data at @page + @off to the last page of @bio. This is a
+ * Try to add the data at @page + @off to the last bvec of @bio. This is a
* a useful optimisation for file systems with a block size smaller than the
* page size.
*
* Return %true on success or %false on failure.
*/
bool __bio_try_merge_page(struct bio *bio, struct page *page,
- unsigned int len, unsigned int off)
+ unsigned int len, unsigned int off, bool same_page)
{
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return false;
if (bio->bi_vcnt > 0) {
- struct bio_vec bv;
- struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
- bvec_last_segment(seg, &bv);
-
- if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
- seg->bv_len += len;
- bio->bi_iter.bi_size += len;
- return true;
- }
+ struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+ phys_addr_t vec_addr = page_to_phys(bv->bv_page);
+ phys_addr_t page_addr = page_to_phys(page);
+
+ if (vec_addr + bv->bv_offset + bv->bv_len != page_addr + off)
+ return false;
+ if (same_page &&
+ (vec_addr & PAGE_SIZE) != (page_addr & PAGE_SIZE))
+ return false;
+
+ bv->bv_len += len;
+ bio->bi_iter.bi_size += len;
+ return true;
}
return false;
}
EXPORT_SYMBOL_GPL(__bio_try_merge_page);
-static bool bio_try_merge_segment(struct bio *bio, struct page *page,
- unsigned int len, unsigned int off)
-{
- if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
- return false;
-
- if (bio->bi_vcnt > 0) {
- struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
- if (page_to_phys(seg->bv_page) + seg->bv_offset + seg->bv_len ==
- page_to_phys(page) + off) {
- seg->bv_len += len;
- bio->bi_iter.bi_size += len;
- return true;
- }
- }
- return false;
-}
-
/**
* __bio_add_page - add page to a bio in a new segment
* @bio: destination bio
@@ -910,7 +896,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
int bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int offset)
{
- if (!bio_try_merge_segment(bio, page, len, offset)) {
+ if (!__bio_try_merge_page(bio, page, len, offset, false)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index ccc2ba115f4d..d918acb9bfc9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -313,7 +313,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
*/
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
- if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+ if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
goto done;
is_contig = true;
}
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5c2190216614..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,7 +616,7 @@ xfs_add_to_ioend(
bdev, sector);
}
- if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+ if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
if (iop)
atomic_inc(&iop->write_count);
if (bio_full(wpc->ioend->io_bio))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e5b975fa0558..f08e6940c1ab 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -442,7 +442,7 @@ extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
bool __bio_try_merge_page(struct bio *bio, struct page *page,
- unsigned int len, unsigned int off);
+ unsigned int len, unsigned int off, bool same_page);
void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
On Wed, Nov 21, 2018 at 03:33:55PM +0100, Christoph Hellwig wrote:
> > + non-cluster.o
>
> Do we really need a new source file for these few functions?
>
> > default:
> > + if (!blk_queue_cluster(q)) {
> > + blk_queue_non_cluster_bio(q, bio);
> > + return;
>
> I'd name this blk_bio_segment_split_singlepage or similar.
OK.
>
> > +static __init int init_non_cluster_bioset(void)
> > +{
> > + WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
> > + BIOSET_NEED_BVECS));
> > + WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
> > + WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
>
> Please only allocate the resources once a queue without the cluster
> flag is registered, there are only very few modern drivers that do that.
OK.
>
> > +static void non_cluster_end_io(struct bio *bio)
> > +{
> > + struct bio *bio_orig = bio->bi_private;
> > +
> > + bio_orig->bi_status = bio->bi_status;
> > + bio_endio(bio_orig);
> > + bio_put(bio);
> > +}
>
> Why can't we use bio_chain for the split bios?
The parent bio holds multi-page bvecs, so we can't submit it directly on a non-cluster queue.
>
> > + bio_for_each_segment(from, *bio_orig, iter) {
> > + if (i++ < max_segs)
> > + sectors += from.bv_len >> 9;
> > + else
> > + break;
> > + }
>
> The easy to read way would be:
>
> bio_for_each_segment(from, *bio_orig, iter) {
> if (i++ == max_segs)
> break;
> sectors += from.bv_len >> 9;
> }
OK.
>
> > + if (sectors < bio_sectors(*bio_orig)) {
> > + bio = bio_split(*bio_orig, sectors, GFP_NOIO,
> > + &non_cluster_bio_split);
> > + bio_chain(bio, *bio_orig);
> > + generic_make_request(*bio_orig);
> > + *bio_orig = bio;
>
> I don't think this is very efficient, as this means we now
> clone the bio twice, first to split it at the sector boundary,
> and then again when converting it to single-page bio_vec.
That is exactly what the bounce code does. The problem is actually the
same for both bounce and non-cluster, because the bvec table itself has
to be changed.
>
> I think this could be something like this (totally untested):
>
> diff --git a/block/non-cluster.c b/block/non-cluster.c
> index 9c2910be9404..60389f275c43 100644
> --- a/block/non-cluster.c
> +++ b/block/non-cluster.c
> @@ -13,58 +13,59 @@
>
> #include "blk.h"
>
> -static struct bio_set non_cluster_bio_set, non_cluster_bio_split;
> +static struct bio_set non_cluster_bio_set;
>
> static __init int init_non_cluster_bioset(void)
> {
> WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
> BIOSET_NEED_BVECS));
> WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
> - WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
>
> return 0;
> }
> __initcall(init_non_cluster_bioset);
>
> -static void non_cluster_end_io(struct bio *bio)
> -{
> - struct bio *bio_orig = bio->bi_private;
> -
> - bio_orig->bi_status = bio->bi_status;
> - bio_endio(bio_orig);
> - bio_put(bio);
> -}
> -
> void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig)
> {
> - struct bio *bio;
> struct bvec_iter iter;
> - struct bio_vec from;
> - unsigned i = 0;
> - unsigned sectors = 0;
> - unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES,
> - queue_max_segments(q));
> + struct bio *bio;
> + struct bio_vec bv;
> + unsigned short max_segs, segs = 0;
> +
> + bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
> + &non_cluster_bio_set);
bio_segments(*bio_orig) may be > 256, so bio_alloc_bioset() may fail.
Thanks,
Ming
On Wed, Nov 21, 2018 at 11:06:11PM +0800, Ming Lei wrote:
> bvec_iter_* is used for single-page bvec in current linus tree, and there are
> lots of users now:
>
> [linux]$ git grep -n "bvec_iter_*" ./ | wc
> 191 995 13242
>
> If we have to switch it first, it can be a big change, just wondering if Jens
> is happy with that?
Your above grep statement seems to catch every use of struct bvec_iter,
due to the *.
Most uses of bvec_iter_ are either in the block headers, or are
ceph wrappers that match the above and can easily be redefined.
On Wed, Nov 21, 2018 at 11:31:36PM +0800, Ming Lei wrote:
> > But while looking over this I wonder why we even need the max_seg_len
> > here. The only thing __bvec_iter_advance does is to move bi_bvec_done
> > and bi_idx forward, with corresponding decrements of bi_size. As far
> > as I can tell the only thing that max_seg_len does is that we need
> > to do more iterations of the while loop to achieve the same thing.
> >
> > And actual bvec used by the caller will be obtained using
> > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> > or single-page variants.
>
> Right, we let __bvec_iter_advance() serve for both the multi-page and single-page
> cases, then we have to tell it one way or another; currently we use the
> 'max_seg_len' constant.
>
> Or do you suggest implementing two versions of __bvec_iter_advance()?
No - I think we can always use the code without any segment handling in
bvec_iter_advance. Because bvec_iter_advance only operates on the
iterator, the generation of an actual single-page or multi-page
bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
helpers. The only difference is how many bytes you can move the
iterator forward in a single loop iteration - so if you pass in
PAGE_SIZE as the max_seg_len you will just have to loop more often
for a large enough byte count, but not actually do anything different.
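The argument can be demonstrated with a userspace model of the advance loop
(simplified: the cap is a flat per-iteration limit rather than an in-page
length, and the types are hypothetical stand-ins): a smaller max_seg_len only
means more loop iterations; the final iterator state is identical.

```c
#include <assert.h>

/* Hypothetical stand-ins for bio_vec and bvec_iter. */
struct model_vec { unsigned int len; };

struct model_iter {
	unsigned int idx;	/* bi_idx */
	unsigned int done;	/* bi_bvec_done */
	unsigned int size;	/* bi_size */
};

/*
 * Simplified model of __bvec_iter_advance(): max_step caps how far a
 * single loop iteration may move, nothing else.  The resulting
 * iterator state does not depend on it.
 */
static void model_iter_advance(const struct model_vec *bv,
			       struct model_iter *it,
			       unsigned int bytes, unsigned int max_step)
{
	while (bytes) {
		unsigned int len = bv[it->idx].len - it->done;

		if (len > max_step)
			len = max_step;	/* cap one iteration's progress */
		if (len > bytes)
			len = bytes;

		bytes -= len;
		it->size -= len;
		it->done += len;
		if (it->done == bv[it->idx].len) {
			it->done = 0;
			it->idx++;
		}
	}
}
```

Advancing the same distance with a page-sized cap and with no cap lands the
iterator in the same place.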
On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> No - I think we can always use the code without any segment handling in
> bvec_iter_advance. Because bvec_iter_advance only operates on the
> iterator, the generation of an actual single-page or multi-page
> bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> helpers. The only difference is how many bytes you can move the
> iterator forward in a single loop iteration - so if you pass in
> PAGE_SIZE as the max_seg_len you will just have to loop more often
> for a large enough byte count, but not actually do anything different.
FYI, this patch reverts the max_seg_len related changes back to where
we are in mainline, and as expected everything works fine for me:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e5b975fa0558..926550ce2d21 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++) \
bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
-static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes, unsigned max_seg_len)
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes)
{
iter->bi_sector += bytes >> 9;
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
- __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
+ bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
}
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes)
-{
- __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
-}
-
#define __bio_for_each_segment(bvl, bio, iter, start) \
for (iter = (start); \
(iter).bi_size && \
@@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
for (iter = (start); \
(iter).bi_size && \
((bvl = bio_iter_mp_iovec((bio), (iter))), 1); \
- __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
+ bio_advance_iter((bio), &(iter), (bvl).bv_len))
/* returns one real segment(multi-page bvec) each time */
#define bio_for_each_bvec(bvl, bio, iter) \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index cab36d838ed0..138b4007b8f2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -25,8 +25,6 @@
#include <linux/errno.h>
#include <linux/mm.h>
-#define BVEC_MAX_LEN ((unsigned int)-1)
-
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
@@ -102,8 +100,8 @@ struct bvec_iter_all {
.bv_offset = segment_iter_offset((bvec), (iter)), \
})
-static inline bool __bvec_iter_advance(const struct bio_vec *bv,
- struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
+static inline bool bvec_iter_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter, unsigned bytes)
{
if (WARN_ONCE(bytes > iter->bi_size,
"Attempted to advance past end of bvec iter\n")) {
@@ -112,18 +110,12 @@ static inline bool __bvec_iter_advance(const struct bio_vec *bv,
}
while (bytes) {
- unsigned segment_len = segment_iter_len(bv, *iter);
-
- if (max_seg_len < BVEC_MAX_LEN)
- segment_len = min_t(unsigned, segment_len,
- max_seg_len -
- bvec_iter_offset(bv, *iter));
+ unsigned iter_len = bvec_iter_len(bv, *iter);
+ unsigned len = min(bytes, iter_len);
- segment_len = min(bytes, segment_len);
-
- bytes -= segment_len;
- iter->bi_size -= segment_len;
- iter->bi_bvec_done += segment_len;
+ bytes -= len;
+ iter->bi_size -= len;
+ iter->bi_bvec_done += len;
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -157,13 +149,6 @@ static inline bool bvec_iter_rewind(const struct bio_vec *bv,
return true;
}
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
- struct bvec_iter *iter,
- unsigned bytes)
-{
- return __bvec_iter_advance(bv, iter, bytes, PAGE_SIZE);
-}
-
#define for_each_bvec(bvl, bio_vec, iter, start) \
for (iter = (start); \
(iter).bi_size && \
Actually..
I think we can kill this code entirely. If we look at what the
clustering setting is really about, it is to avoid ever merging a
segment that spans a page boundary. And we should be able to do
that with something like this before your series:
---
From 0d46fa76c376493a74ea0dbe77305bd5fa2cf011 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Wed, 21 Nov 2018 18:39:47 +0100
Subject: block: remove the "cluster" flag
The cluster flag implements some very old SCSI behavior. As far as I
can tell the original intent was to enable or disable any kind of
segment merging. But the actually visible effect to the LLDD is that
it limits each segment to be inside a single page, which we can
also affect by setting the maximum segment size and the virt
boundary.
Signed-off-by: Christoph Hellwig <[email protected]>
---
block/blk-merge.c | 20 ++++++++------------
block/blk-settings.c | 3 ---
block/blk-sysfs.c | 5 +----
drivers/scsi/scsi_lib.c | 16 +++++++++++++---
include/linux/blkdev.h | 6 ------
5 files changed, 22 insertions(+), 28 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6be04ef8da5b..e69d8f8ba819 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -195,7 +195,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
goto split;
}
- if (bvprvp && blk_queue_cluster(q)) {
+ if (bvprvp) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprvp, &bv))
@@ -295,10 +295,10 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
bool no_sg_merge)
{
struct bio_vec bv, bvprv = { NULL };
- int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
struct bvec_iter iter;
+ bool prev = false;
if (!bio)
return 0;
@@ -313,7 +313,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
}
fbio = bio;
- cluster = blk_queue_cluster(q);
seg_size = 0;
nr_phys_segs = 0;
for_each_bio(bio) {
@@ -325,7 +324,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (no_sg_merge)
goto new_segment;
- if (prev && cluster) {
+ if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
@@ -343,7 +342,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs++;
bvprv = bv;
- prev = 1;
+ prev = true;
seg_size = bv.bv_len;
}
bbio = bio;
@@ -396,9 +395,6 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
{
struct bio_vec end_bv = { NULL }, nxt_bv;
- if (!blk_queue_cluster(q))
- return 0;
-
if (bio->bi_seg_back_size + nxt->bi_seg_front_size >
queue_max_segment_size(q))
return 0;
@@ -415,12 +411,12 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
static inline void
__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
struct scatterlist *sglist, struct bio_vec *bvprv,
- struct scatterlist **sg, int *nsegs, int *cluster)
+ struct scatterlist **sg, int *nsegs)
{
int nbytes = bvec->bv_len;
- if (*sg && *cluster) {
+ if (*sg) {
if ((*sg)->length + nbytes > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprv, bvec))
@@ -466,12 +462,12 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
{
struct bio_vec bvec, bvprv = { NULL };
struct bvec_iter iter;
- int cluster = blk_queue_cluster(q), nsegs = 0;
+ int nsegs = 0;
for_each_bio(bio)
bio_for_each_segment(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
- &nsegs, &cluster);
+ &nsegs);
return nsegs;
}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3abe831e92c8..3e7038e475ee 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -56,7 +56,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->alignment_offset = 0;
lim->io_opt = 0;
lim->misaligned = 0;
- lim->cluster = 1;
lim->zoned = BLK_ZONED_NONE;
}
EXPORT_SYMBOL(blk_set_default_limits);
@@ -547,8 +546,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->io_min = max(t->io_min, b->io_min);
t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
- t->cluster &= b->cluster;
-
/* Physical block size a multiple of the logical block size? */
if (t->physical_block_size & (t->logical_block_size - 1)) {
t->physical_block_size = t->logical_block_size;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 80eef48fddc8..ef7b844a3e00 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -132,10 +132,7 @@ static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *
static ssize_t queue_max_segment_size_show(struct request_queue *q, char *page)
{
- if (blk_queue_cluster(q))
- return queue_var_show(queue_max_segment_size(q), (page));
-
- return queue_var_show(PAGE_SIZE, (page));
+ return queue_var_show(queue_max_segment_size(q), page);
}
static ssize_t queue_logical_block_size_show(struct request_queue *q, char *page)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0df15cb738d2..c1ea50962286 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1810,6 +1810,7 @@ static int scsi_map_queues(struct blk_mq_tag_set *set)
void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
{
struct device *dev = shost->dma_dev;
+ unsigned max_segment_size = dma_get_max_seg_size(dev);
/*
* this limit is imposed by hardware restrictions
@@ -1831,10 +1832,19 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
blk_queue_segment_boundary(q, shost->dma_boundary);
dma_set_seg_boundary(dev, shost->dma_boundary);
- blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
+ /*
+ * Clustering is a really old concept from the stone age of Linux
+ * SCSI support. But the basic idea is that we never give the
+ * driver a segment that spans multiple pages. For that we need
+ * to limit the segment size, and set the virt boundary so that
+ * we never merge a second segment which is not page aligned.
+ */
+ if (!shost->use_clustering) {
+ blk_queue_virt_boundary(q, PAGE_SIZE - 1);
+ max_segment_size = min_t(unsigned, max_segment_size, PAGE_SIZE);
+ }
- if (!shost->use_clustering)
- q->limits.cluster = 0;
+ blk_queue_max_segment_size(q, max_segment_size);
/*
* Set a reasonable default alignment: The larger of 32-byte (dword),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9b53db06ad08..399a7a415609 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -341,7 +341,6 @@ struct queue_limits {
unsigned char misaligned;
unsigned char discard_misaligned;
- unsigned char cluster;
unsigned char raid_partial_stripes_expensive;
enum blk_zoned_model zoned;
};
@@ -660,11 +659,6 @@ static inline bool queue_is_mq(struct request_queue *q)
return q->mq_ops;
}
-static inline unsigned int blk_queue_cluster(struct request_queue *q)
-{
- return q->limits.cluster;
-}
-
static inline enum blk_zoned_model
blk_queue_zoned_model(struct request_queue *q)
{
--
2.19.1
On Wed, Nov 21, 2018 at 05:08:11PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:06:11PM +0800, Ming Lei wrote:
> > bvec_iter_* is used for single-page bvec in current linus tree, and there are
> > lots of users now:
> >
> > [linux]$ git grep -n "bvec_iter_*" ./ | wc
> > 191 995 13242
> >
> > If we have to switch it first, it can be a big change, just wondering if Jens
> > is happy with that?
>
> Your above grep statement seems to catch every use of struct bvec_iter,
> due to the *.
>
> Most uses of bvec_iter_ are either in the block headers, or are
> ceph wrappers that match the above and can easily be redefined.
OK, it looks like you are right; it seems not so widely used:
$ git grep -n -w -E "bvec_iter_len|bvec_iter_bvec|bvec_iter_advance|bvec_iter_page|bvec_iter_offset" ./ | wc
36 194 2907
I will switch to that given the affected drivers are only dm, nvdimm and ceph.
Thanks,
Ming
On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:31:36PM +0800, Ming Lei wrote:
> > > But while looking over this I wonder why we even need the max_seg_len
> > > here. The only thing __bvec_iter_advance does is to move bi_bvec_done
> > > and bi_idx forward, with corresponding decrements of bi_size. As far
> > > as I can tell the only thing that max_seg_len does is that we need
> > > to do more iterations of the while loop to achieve the same thing.
> > >
> > > And actual bvec used by the caller will be obtained using
> > > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> > > or single-page variants.
> >
> > Right, we let __bvec_iter_advance() serve for both the multi-page and single-page
> > cases, then we have to tell it one way or another; currently we use the
> > 'max_seg_len' constant.
> >
> > Or do you suggest implementing two versions of __bvec_iter_advance()?
>
> No - I think we can always use the code without any segment handling in
> bvec_iter_advance. Because bvec_iter_advance only operates on the
> iterator, the generation of an actual single-page or multi-page
> bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> helpers. The only difference is how many bytes you can move the
> iterator forward in a single loop iteration - so if you pass in
> PAGE_SIZE as the max_seg_len you will just have to loop more often
> for a large enough byte count, but not actually do anything different.
Yeah, I see that.
The difference is made by bio_iter_iovec()/bio_iter_mp_iovec() in
__bio_for_each_segment()/__bio_for_each_bvec().
Thanks,
Ming
On Wed, Nov 21, 2018 at 06:46:21PM +0100, Christoph Hellwig wrote:
> Actually..
>
> I think we can kill this code entirely. If we look at what the
> clustering setting is really about, it is to avoid ever merging a
> segment that spans a page boundary. And we should be able to do
> that with something like this before your series:
>
> ---
> From 0d46fa76c376493a74ea0dbe77305bd5fa2cf011 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Wed, 21 Nov 2018 18:39:47 +0100
> Subject: block: remove the "cluster" flag
>
> The cluster flag implements some very old SCSI behavior. As far as I
> can tell the original intent was to enable or disable any kind of
> segment merging. But the actually visible effect to the LLDD is that
> it limits each segments to be inside a single page, which we can
> also affect by setting the maximum segment size and the virt
> boundary.
This approach is pretty good, given that we can do a post-split while mapping
the sg list.
However, using the virt boundary limit for non-cluster seems like overkill,
because the bio will be over-split (each small bvec may be split off as its
own bio) if it includes lots of small segments.
What we want is simply to avoid merging bvecs into segments, which should
have been handled by NO_SG_MERGE. However, after multi-page bvec is enabled,
two adjacent bvecs won't be merged any more; I just forgot to remove the
bvec merge code in V11.
So it seems we can simply avoid using the virt boundary limit for the
non-cluster case after multi-page bvec is enabled?
thanks,
Ming
On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> However, using virt boundary limit on non-cluster seems over-kill,
> because the bio will be over-split(each small bvec may be split as one bio)
> if it includes lots of small segment.
The combination of a virt boundary of PAGE_SIZE - 1 and a
max_segment_size of PAGE_SIZE will only split if the to-be-merged
segment is in a different page than the previous one, which is exactly
what we need here. Multiple small bvecs inside the same page (e.g.
512-byte buffer_heads) will still be merged.
> What we want to do is just to avoid to merge bvecs to segment, which
> should have been done by NO_SG_MERGE simply. However, after multi-page
> is enabled, two adjacent bvecs won't be merged any more, I just forget
> to remove the bvec merge code in V11.
>
> So seems we can simply avoid to use virt boundary limit for non-cluster
> after multipage bvec is enabled?
No, we can't just remove it. As explained in the patch, there is one very
visible difference of setting the flag, and that is that no segment will
span a page boundary; at least the iSCSI code seems to rely on that.
On Wed, Nov 21, 2018 at 06:12:17PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> > No - I think we can always use the code without any segment in
> > bvec_iter_advance. Because bvec_iter_advance only operates on the
> > iteractor, the generation of an actual single-page or multi-page
> > bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> > helpers. The only difference is how many bytes you can move the
> > iterator forward in a single loop iteration - so if you pass in
> > PAGE_SIZE as the max_seg_len you just will have to loop more often
> > for a large enough bytes, but not actually do anything different.
>
> FYI, this patch reverts the max_seg_len related changes back to where
> we are in mainline, and as expected everything works fine for me:
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index e5b975fa0558..926550ce2d21 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
> for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++) \
> bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
>
> -static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes, unsigned max_seg_len)
> +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes)
> {
> iter->bi_sector += bytes >> 9;
>
> if (bio_no_advance_iter(bio))
> iter->bi_size -= bytes;
> else
> - __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> + bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> /* TODO: It is reasonable to complete bio with error here. */
> }
>
> -static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes)
> -{
> - __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> -}
> -
> #define __bio_for_each_segment(bvl, bio, iter, start) \
> for (iter = (start); \
> (iter).bi_size && \
> @@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> for (iter = (start); \
> (iter).bi_size && \
> ((bvl = bio_iter_mp_iovec((bio), (iter))), 1); \
> - __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
> + bio_advance_iter((bio), &(iter), (bvl).bv_len))
>
> /* returns one real segment(multi-page bvec) each time */
> #define bio_for_each_bvec(bvl, bio, iter) \
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index cab36d838ed0..138b4007b8f2 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -25,8 +25,6 @@
> #include <linux/errno.h>
> #include <linux/mm.h>
>
> -#define BVEC_MAX_LEN ((unsigned int)-1)
> -
> /*
> * was unsigned short, but we might as well be ready for > 64kB I/O pages
> */
> @@ -102,8 +100,8 @@ struct bvec_iter_all {
> .bv_offset = segment_iter_offset((bvec), (iter)), \
> })
>
> -static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> - struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
> +static inline bool bvec_iter_advance(const struct bio_vec *bv,
> + struct bvec_iter *iter, unsigned bytes)
> {
> if (WARN_ONCE(bytes > iter->bi_size,
> "Attempted to advance past end of bvec iter\n")) {
> @@ -112,18 +110,12 @@ static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> }
>
> while (bytes) {
> - unsigned segment_len = segment_iter_len(bv, *iter);
> -
> - if (max_seg_len < BVEC_MAX_LEN)
> - segment_len = min_t(unsigned, segment_len,
> - max_seg_len -
> - bvec_iter_offset(bv, *iter));
> + unsigned iter_len = bvec_iter_len(bv, *iter);
> + unsigned len = min(bytes, iter_len);
It may not work to always use bvec_iter_len() here: 'segment_len' should
be the max length of the passed 'bv', but we don't know whether it is a
single-page or multi-page bvec unless someone tells us.
Thanks,
Ming
On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > However, using virt boundary limit on non-cluster seems over-kill,
> > because the bio will be over-split(each small bvec may be split as one bio)
> > if it includes lots of small segment.
>
> The combination of the virt boundary of PAGE_SIZE - 1 and a
> max_segment_size of PAGE_SIZE will only split if the to me merged
> segment is in a different page than the previous one, which is exactly
> what we need here. Multiple small bvec inside the same page (e.g.
> 512 byte buffer_heads) will still be merged.
>
> > What we want to do is just to avoid to merge bvecs to segment, which
> > should have been done by NO_SG_MERGE simply. However, after multi-page
> > is enabled, two adjacent bvecs won't be merged any more, I just forget
> > to remove the bvec merge code in V11.
> >
> > So seems we can simply avoid to use virt boundary limit for non-cluster
> > after multipage bvec is enabled?
>
> No, we can't just remove it. As explained in the patch there is one very
> visible difference of setting the flag amd that is no segment will span a
> page boundary, and at least the iSCSI code seems to rely on that.
IMO, we should use queue_segment_boundary() to enforce the rule when
splitting segments after multi-page bvec is enabled.
It seems we are missing the segment boundary limit in bvec_split_segs().
Thanks,
Ming
On Thu, Nov 22, 2018 at 11:41:50AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 06:32:09PM +0800, Ming Lei wrote:
> > On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> > > On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > > > However, using virt boundary limit on non-cluster seems over-kill,
> > > > because the bio will be over-split(each small bvec may be split as one bio)
> > > > if it includes lots of small segment.
> > >
> > > The combination of the virt boundary of PAGE_SIZE - 1 and a
> > > max_segment_size of PAGE_SIZE will only split if the to me merged
> > > segment is in a different page than the previous one, which is exactly
> > > what we need here. Multiple small bvec inside the same page (e.g.
> > > 512 byte buffer_heads) will still be merged.
> > >
> > > > What we want to do is just to avoid to merge bvecs to segment, which
> > > > should have been done by NO_SG_MERGE simply. However, after multi-page
> > > > is enabled, two adjacent bvecs won't be merged any more, I just forget
> > > > to remove the bvec merge code in V11.
> > > >
> > > > So seems we can simply avoid to use virt boundary limit for non-cluster
> > > > after multipage bvec is enabled?
> > >
> > > No, we can't just remove it. As explained in the patch there is one very
> > > visible difference of setting the flag amd that is no segment will span a
> > > page boundary, and at least the iSCSI code seems to rely on that.
> >
> > IMO, we should use queue_segment_boundary() to enhance the rule during splitting
> > segment after multi-page bvec is enabled.
> >
> > Seems we miss the segment boundary limit in bvec_split_segs().
>
> Yes, that looks like the right fix!
Then your patch should work by just replacing the virt boundary with the
segment boundary limit. I will make that change in V12 if you don't object.
Thanks,
Ming
> +/* used for chunk_for_each_segment */
> +static inline void bvec_next_segment(const struct bio_vec *bvec,
> + struct bvec_iter_all *iter_all)
FYI, chunk_for_each_segment doesn't exist anymore, this is
bvec_for_each_segment now. Not sure the comment helps much, though.
> +{
> + struct bio_vec *bv = &iter_all->bv;
> +
> + if (bv->bv_page) {
> + bv->bv_page += 1;
I think this needs to use nth_page() given that with discontigmem
page structures might not be allocated contiguously.
On Thu, Nov 22, 2018 at 06:15:28PM +0800, Ming Lei wrote:
> > while (bytes) {
> > - unsigned segment_len = segment_iter_len(bv, *iter);
> > -
> > - if (max_seg_len < BVEC_MAX_LEN)
> > - segment_len = min_t(unsigned, segment_len,
> > - max_seg_len -
> > - bvec_iter_offset(bv, *iter));
> > + unsigned iter_len = bvec_iter_len(bv, *iter);
> > + unsigned len = min(bytes, iter_len);
>
> It may not work to always use bvec_iter_len() here, and 'segment_len'
> should be max length of the passed 'bv', however we don't know if it is
> single-page or mutli-page bvec if no one tells us.
The plain revert I sent isn't totally correct against your current tree,
because of the odd change that segment_iter_len now returns the full bvec
len; it should be adjusted to that until we switch to the sane naming
scheme used elsewhere.
But except for that, it will work. The bvec we operate on (part of the
array in the bio pointed to by the iter) is always a multi-page capable
one. We only fake up a single-page one for users via segment_iter_len.
Think of what happens in the (I think mostly hypothetical) case that we
call bvec_iter_advance with bytes > PAGE_SIZE on a segment larger than
the page size.
In this case each iteration of the while loop is limited to the next page
boundary. This means we will not hit the condition that moves to the next
(real, in the array) bvec at the end of the loop, and we just go on into
the next iteration of the loop with the reduced amount of bytes. Depending
on how large bytes is, this might repeat a few more times. Then on the
final one we hit the limit and move to the next (real, in the array) bvec.
Now if we don't have the segment limit we do exactly the same thing,
just a whole lot more efficiently in a single loop iteration.
On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > However, using virt boundary limit on non-cluster seems over-kill,
> > because the bio will be over-split(each small bvec may be split as one bio)
> > if it includes lots of small segment.
>
> The combination of the virt boundary of PAGE_SIZE - 1 and a
> max_segment_size of PAGE_SIZE will only split if the to me merged
> segment is in a different page than the previous one, which is exactly
> what we need here. Multiple small bvec inside the same page (e.g.
> 512 byte buffer_heads) will still be merged.
Suppose one bio includes (pg0, 0, 512) and (pg1, 512, 512):
The split is introduced by the following code in blk_bio_segment_split():
if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
goto split;
Without this patch, for non-cluster, the two bvecs are just in different
segments but are still handled by the same bio. Now you convert them into two bios.
Thanks,
Ming
Btw, this patch instead of the plain revert might make it a little
clearer what is going on, by skipping the confusing helper altogether
and operating on the raw bvec array:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e5b975fa0558..926550ce2d21 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++) \
bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
-static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes, unsigned max_seg_len)
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes)
{
iter->bi_sector += bytes >> 9;
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
- __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
+ bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
}
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes)
-{
- __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
-}
-
#define __bio_for_each_segment(bvl, bio, iter, start) \
for (iter = (start); \
(iter).bi_size && \
@@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
for (iter = (start); \
(iter).bi_size && \
((bvl = bio_iter_mp_iovec((bio), (iter))), 1); \
- __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
+ bio_advance_iter((bio), &(iter), (bvl).bv_len))
/* returns one real segment(multi-page bvec) each time */
#define bio_for_each_bvec(bvl, bio, iter) \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index cab36d838ed0..7d0f9bdb6f05 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -25,8 +25,6 @@
#include <linux/errno.h>
#include <linux/mm.h>
-#define BVEC_MAX_LEN ((unsigned int)-1)
-
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
@@ -102,8 +100,8 @@ struct bvec_iter_all {
.bv_offset = segment_iter_offset((bvec), (iter)), \
})
-static inline bool __bvec_iter_advance(const struct bio_vec *bv,
- struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
+static inline bool bvec_iter_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter, unsigned bytes)
{
if (WARN_ONCE(bytes > iter->bi_size,
"Attempted to advance past end of bvec iter\n")) {
@@ -112,20 +110,15 @@ static inline bool __bvec_iter_advance(const struct bio_vec *bv,
}
while (bytes) {
- unsigned segment_len = segment_iter_len(bv, *iter);
-
- if (max_seg_len < BVEC_MAX_LEN)
- segment_len = min_t(unsigned, segment_len,
- max_seg_len -
- bvec_iter_offset(bv, *iter));
+ const struct bio_vec *cur = bv + iter->bi_idx;
+ unsigned len = min3(bytes, iter->bi_size,
+ cur->bv_len - iter->bi_bvec_done);
- segment_len = min(bytes, segment_len);
-
- bytes -= segment_len;
- iter->bi_size -= segment_len;
- iter->bi_bvec_done += segment_len;
+ bytes -= len;
+ iter->bi_size -= len;
+ iter->bi_bvec_done += len;
- if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
+ if (iter->bi_bvec_done == cur->bv_len) {
iter->bi_bvec_done = 0;
iter->bi_idx++;
}
@@ -157,13 +150,6 @@ static inline bool bvec_iter_rewind(const struct bio_vec *bv,
return true;
}
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
- struct bvec_iter *iter,
- unsigned bytes)
-{
- return __bvec_iter_advance(bv, iter, bytes, PAGE_SIZE);
-}
-
#define for_each_bvec(bvl, bio_vec, iter, start) \
for (iter = (start); \
(iter).bi_size && \
On Thu, Nov 22, 2018 at 06:26:17PM +0800, Ming Lei wrote:
> Suppose one bio includes (pg0, 0, 512) and (pg1, 512, 512):
>
> The split is introduced by the following code in blk_bio_segment_split():
>
> if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
> goto split;
>
> Without this patch, for non-cluster, the two bvecs are just in different
> segment, but still handled by one same bio. Now you convert into two bios.
Oh, true - we create new bios instead of new segments. So I guess we
need to do something special here if we don't want to pay that overhead.
Or just look into killing the cluster setting...
On Thu, Nov 22, 2018 at 06:32:09PM +0800, Ming Lei wrote:
> On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > > However, using virt boundary limit on non-cluster seems over-kill,
> > > because the bio will be over-split(each small bvec may be split as one bio)
> > > if it includes lots of small segment.
> >
> > The combination of the virt boundary of PAGE_SIZE - 1 and a
> > max_segment_size of PAGE_SIZE will only split if the to me merged
> > segment is in a different page than the previous one, which is exactly
> > what we need here. Multiple small bvec inside the same page (e.g.
> > 512 byte buffer_heads) will still be merged.
> >
> > > What we want to do is just to avoid to merge bvecs to segment, which
> > > should have been done by NO_SG_MERGE simply. However, after multi-page
> > > is enabled, two adjacent bvecs won't be merged any more, I just forget
> > > to remove the bvec merge code in V11.
> > >
> > > So seems we can simply avoid to use virt boundary limit for non-cluster
> > > after multipage bvec is enabled?
> >
> > No, we can't just remove it. As explained in the patch there is one very
> > visible difference of setting the flag amd that is no segment will span a
> > page boundary, and at least the iSCSI code seems to rely on that.
>
> IMO, we should use queue_segment_boundary() to enhance the rule during splitting
> segment after multi-page bvec is enabled.
>
> Seems we miss the segment boundary limit in bvec_split_segs().
Yes, that looks like the right fix!
On Thu, Nov 22, 2018 at 06:46:05PM +0800, Ming Lei wrote:
> Then your patch should work by just replacing virt boundary with segment
> boudary limit. I will do that change in V12 if you don't object.
Please do, thanks a lot.
Btw, given that this is the last user of bvec_last_segment after my
other patches I think we should kill bvec_last_segment and do something
like this here:
diff --git a/fs/buffer.c b/fs/buffer.c
index fa37ad52e962..af5e135d2b83 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2981,6 +2981,14 @@ static void end_bio_bh_io_sync(struct bio *bio)
bio_put(bio);
}
+static void zero_trailing_sectors(struct bio_vec *bvec, unsigned bytes)
+{
+ unsigned last_page = (bvec->bv_offset + bvec->bv_len - 1) >> PAGE_SHIFT;
+
+ zero_user(nth_page(bvec->bv_page, last_page),
+ bvec->bv_offset % PAGE_SIZE + bvec->bv_len, bytes);
+}
+
/*
* This allows us to do IO even on the odd last sectors
* of a device, even if the block size is some multiple
@@ -3031,13 +3039,8 @@ void guard_bio_eod(int op, struct bio *bio)
bvec->bv_len -= truncated_bytes;
/* ..and clear the end of the buffer for reads */
- if (op == REQ_OP_READ) {
- struct bio_vec bv;
-
- bvec_last_segment(bvec, &bv);
- zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
- truncated_bytes);
- }
+ if (op == REQ_OP_READ)
+ zero_trailing_sectors(bvec, truncated_bytes);
}
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
On Thu, Nov 22, 2018 at 11:30:33AM +0100, Christoph Hellwig wrote:
> Btw, this patch instead of the plain rever might make it a little
> more clear what is going on by skipping the confusing helper altogher
> and operating on the raw bvec array:
>
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index e5b975fa0558..926550ce2d21 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
> for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; iter_all.idx++) \
> bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), i, iter_all)
>
> -static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes, unsigned max_seg_len)
> +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes)
> {
> iter->bi_sector += bytes >> 9;
>
> if (bio_no_advance_iter(bio))
> iter->bi_size -= bytes;
> else
> - __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> + bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> /* TODO: It is reasonable to complete bio with error here. */
> }
>
> -static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes)
> -{
> - __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> -}
> -
> #define __bio_for_each_segment(bvl, bio, iter, start) \
> for (iter = (start); \
> (iter).bi_size && \
> @@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> for (iter = (start); \
> (iter).bi_size && \
> ((bvl = bio_iter_mp_iovec((bio), (iter))), 1); \
> - __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
> + bio_advance_iter((bio), &(iter), (bvl).bv_len))
>
> /* returns one real segment(multi-page bvec) each time */
> #define bio_for_each_bvec(bvl, bio, iter) \
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index cab36d838ed0..7d0f9bdb6f05 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -25,8 +25,6 @@
> #include <linux/errno.h>
> #include <linux/mm.h>
>
> -#define BVEC_MAX_LEN ((unsigned int)-1)
> -
> /*
> * was unsigned short, but we might as well be ready for > 64kB I/O pages
> */
> @@ -102,8 +100,8 @@ struct bvec_iter_all {
> .bv_offset = segment_iter_offset((bvec), (iter)), \
> })
>
> -static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> - struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
> +static inline bool bvec_iter_advance(const struct bio_vec *bv,
> + struct bvec_iter *iter, unsigned bytes)
> {
> if (WARN_ONCE(bytes > iter->bi_size,
> "Attempted to advance past end of bvec iter\n")) {
> @@ -112,20 +110,15 @@ static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> }
>
> while (bytes) {
> - unsigned segment_len = segment_iter_len(bv, *iter);
> -
> - if (max_seg_len < BVEC_MAX_LEN)
> - segment_len = min_t(unsigned, segment_len,
> - max_seg_len -
> - bvec_iter_offset(bv, *iter));
> + const struct bio_vec *cur = bv + iter->bi_idx;
> + unsigned len = min3(bytes, iter->bi_size,
> + cur->bv_len - iter->bi_bvec_done);
>
> - segment_len = min(bytes, segment_len);
> -
> - bytes -= segment_len;
> - iter->bi_size -= segment_len;
> - iter->bi_bvec_done += segment_len;
> + bytes -= len;
> + iter->bi_size -= len;
> + iter->bi_bvec_done += len;
>
> - if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
> + if (iter->bi_bvec_done == cur->bv_len) {
> iter->bi_bvec_done = 0;
> iter->bi_idx++;
> }
I'd rather not do the optimization part in this patchset, given it doesn't
belong here and may decrease readability. So I plan to revert the delta
part in V12 first.
Thanks,
Ming
On Thu, Nov 22, 2018 at 11:58:49AM +0100, Christoph Hellwig wrote:
> Btw, given that this is the last user of bvec_last_segment after my
> other patches I think we should kill bvec_last_segment and do something
> like this here:
>
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index fa37ad52e962..af5e135d2b83 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2981,6 +2981,14 @@ static void end_bio_bh_io_sync(struct bio *bio)
> bio_put(bio);
> }
>
> +static void zero_trailing_sectors(struct bio_vec *bvec, unsigned bytes)
> +{
> + unsigned last_page = (bvec->bv_offset + bvec->bv_len - 1) >> PAGE_SHIFT;
> +
> + zero_user(nth_page(bvec->bv_page, last_page),
> + bvec->bv_offset % PAGE_SIZE + bvec->bv_len, bytes);
> +}
The 'start' parameter above is computed incorrectly, and the computation
isn't very obvious, so I'd suggest keeping bvec_last_segment().
Thanks,
Ming
On Thu, Nov 22, 2018 at 12:03:15PM +0100, Christoph Hellwig wrote:
> > +/* used for chunk_for_each_segment */
> > +static inline void bvec_next_segment(const struct bio_vec *bvec,
> > + struct bvec_iter_all *iter_all)
>
> FYI, chunk_for_each_segment doesn't exist anymore, this is
> bvec_for_each_segment now. Not sure the comment helps much, though.
OK, will remove the comment.
>
> > +{
> > + struct bio_vec *bv = &iter_all->bv;
> > +
> > + if (bv->bv_page) {
> > + bv->bv_page += 1;
>
> I think this needs to use nth_page() given that with discontigmem
> page structures might not be allocated contigously.
Good catch!
Thanks,
Ming
On Wed, Nov 21, 2018 at 05:12:06PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:48:13PM +0800, Ming Lei wrote:
> > I guess the correct check should be:
> >
> > end_addr = vec_addr + bv->bv_offset + bv->bv_len;
> > if (same_page &&
> > (end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
> > return false;
>
> Indeed.
The above is still not totally correct; it should have been:
end_addr = vec_addr + bv->bv_offset + bv->bv_len - 1;
if (same_page && (end_addr & PAGE_MASK) != page_addr)
return false;
Also, bv->bv_len should be guaranteed to be bigger than zero.
This also shows how easy it is to get the last page wrong, :-(
Thanks,
Ming