Hi,
This patchset brings multipage bvec into block layer:
1) what is multipage bvec?
Multipage bvec means that one 'struct bio_vec' can hold multiple pages
which are physically contiguous, instead of the single page per bvec that
the Linux kernel has used for a long time.
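To make the idea concrete, here is a minimal userspace sketch, not kernel code. It assumes pages are identified by page frame number, and the names sketch_bvec and sketch_add_pages are hypothetical. A page that directly follows the previous bvec is merged into it; any other page starts a new bvec.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical userspace stand-in for struct bio_vec: the page is
 * identified by its page frame number instead of a struct page pointer. */
struct sketch_bvec {
	unsigned long pfn;	/* first page of the chunk */
	unsigned int len;	/* total bytes, may span several pages */
	unsigned int offset;	/* byte offset into the first page */
};

/*
 * Add whole pages, given by frame number, to a bvec table. A page that
 * directly follows the end of the previous bvec is merged into it; any
 * other page starts a new bvec. Returns the number of bvecs used.
 */
static size_t sketch_add_pages(struct sketch_bvec *tbl, size_t max,
			       const unsigned long *pfns, size_t nr)
{
	size_t used = 0;

	for (size_t i = 0; i < nr; i++) {
		if (used) {
			struct sketch_bvec *prev = &tbl[used - 1];
			unsigned long end = prev->pfn +
				(prev->offset + prev->len) / SKETCH_PAGE_SIZE;

			if (end == pfns[i]) {	/* physically contiguous */
				prev->len += SKETCH_PAGE_SIZE;
				continue;
			}
		}
		assert(used < max);
		tbl[used++] = (struct sketch_bvec){ pfns[i], SKETCH_PAGE_SIZE, 0 };
	}
	return used;
}
```

With frame numbers 100, 101, 102, 200, 201, 300 this model needs three table entries instead of six.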
2) why is multipage bvec introduced?
Kent proposed the idea[1] first.
As system RAM becomes much bigger than before, and huge pages, transparent
huge pages and memory compaction are widely used, it is now fairly common
to see physically contiguous pages from the fs in I/O. On the other hand, from
the block layer's view, it isn't necessary to store intermediate pages in the
bvec; it is enough to just store the physically contiguous 'segment' in each
io vector.
Also, huge pages are being brought to filesystems and swap [2][6], so we can
do IO on a whole hugepage each time[3], which requires that one bio can
transfer at least one huge page at a time. It turns out that simply changing
BIO_MAX_PAGES isn't flexible enough[3][5]. Multipage bvec fits this case very
well. As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be configured
much bigger, such as 512, which requires at least two 4K pages for holding
the bvec table.
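The table size arithmetic can be checked with a small userspace sketch. It assumes a struct with the same three fields as struct bio_vec (a page pointer plus two unsigned ints); the names sketch_bio_vec and sketch_table_pages are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mirror of the three fields of struct bio_vec; the page
 * pointer is left as an opaque pointer here. */
struct sketch_bio_vec {
	void *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Number of 4K pages needed to hold a table of nr_vecs entries. */
static size_t sketch_table_pages(size_t nr_vecs)
{
	size_t bytes;

	assert(nr_vecs > 0);
	bytes = nr_vecs * sizeof(struct sketch_bio_vec);
	return (bytes + 4095) / 4096;
}
```

On a 64-bit build each entry is 16 bytes, so 256 entries fill exactly one 4K page and 512 entries need two.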
With multipage bvec:
- Inside block layer, both bio splitting and sg map can become more
efficient than before by just traversing the physically contiguous
'segment' instead of each page.
- segment handling in block layer can be improved much in future since it
should be quite easy to convert multipage bvec into segment. For
example, we might just store segment in each bvec directly in future.
- bio size can be increased and it should improve some high-bandwidth IO
case in theory[4].
- there is opportunity in future to improve memory footprint of bvecs.
3) how is multipage bvec implemented in this patchset?
The first 3 patches are preparatory patches for multipage bvec, from Christoph.
Patches 4~28 implement multipage bvec in the block layer:
- put all tricks into bvec/bio/rq iterators, and as long as
drivers and fs use these standard iterators, they are happy
with multipage bvec
- use multipage bvec to split bio and map sg
- introduce bio_for_each_chunk*() for iterating bio segment by
segment
- make current bio_for_each_segment*() iterate page by page and
make sure current users won't be broken
Patch 29 redefines BIO_MAX_PAGES as 256.
Patch 30 documents usage of the bio iterator helpers.
These patches can be found in the following git tree:
gitweb: https://github.com/ming1/linux/commits/v4.18-pre-rc-mp-bvec-V6
git: https://github.com/ming1/linux.git #v4.18-pre-rc-mp-bvec-V6
Lots of tests (blktests, xfstests, ltp io, ...) have been run with this
patchset, and no regressions were seen.
Thanks to Christoph for reviewing the early version and providing very good
suggestions, such as: introduce bio_init_with_vec_table(), remove other
unnecessary helpers for cleanup and so on.
Any comments are welcome!
V6:
- avoid introducing lots of renaming, follow Jens' suggestion of
using the name 'chunk' for multipage io vector
- include Christoph's three preparatory patches
- decrease stack usage for using bio_for_each_chunk_segment_all()
- address Kent's and Randy's comments
V5:
- remove some of the preparatory patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio clone, which
is introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18
V4:
- rename bio_for_each_segment*() as bio_for_each_page*(), rename
bio_segments() as bio_pages(), rename rq_for_each_segment() as
rq_for_each_pages(), because these helpers never return real
segment, and they always return single page bvec
- introduce segment_for_each_page_all()
- introduce new bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
for returning real multipage segment
- rewrite segment_last_page()
- rename bvec iterator helper as suggested by Christoph
- replace comment with applying bio helpers as suggested by Christoph
- document usage of bio iterator helpers
- redefine BIO_MAX_PAGES as 256 to make the biggest bvec table
accommodated in 4K page
- move bio_alloc_pages() into bcache as suggested by Christoph
V3:
- rebase on v4.13-rc3 with for-next of block tree
- run more xfstests: xfs/ext4 over NVMe, Sata, DM(linear),
MD(raid1), and no regressions were triggered
- add Reviewed-by on some btrfs patches
- remove two MD patches because both are merged to linus tree
already
V2:
- bvec table direct access in raid has been cleaned, so NO_MP
flag is dropped
- rebase on recent Neil Brown's change on bio and bounce code
- reorganize the patchset
V1:
- against v4.10-rc1 and some cleanup in V0 are in -linus already
- handle queue_virt_boundary() in mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename for two new helpers of bio_for_each_segment_all()
- fix bounce conversion
- address comments in V0
[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=147735447100001&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2
[5], http://marc.info/?t=149569484500007&r=1&w=2
[6], http://marc.info/?t=149820215300004&r=1&w=2
Christoph Hellwig (3):
block: simplify bio_check_pages_dirty
block: bio_set_pages_dirty can't see NULL bv_page in a valid bio_vec
block: use bio_add_page in bio_iov_iter_get_pages
Ming Lei (27):
block: introduce multipage page bvec helpers
block: introduce bio_for_each_chunk()
block: use bio_for_each_chunk() to compute multipage bvec count
block: use bio_for_each_chunk() to map sg
block: introduce chunk_last_segment()
fs/buffer.c: use bvec iterator to truncate the bio
btrfs: use chunk_last_segment to get bio's last page
block: implement bio_pages_all() via bio_for_each_segment_all()
block: introduce bio_chunks()
block: introduce rq_for_each_chunk()
block: loop: pass multipage chunks to iov_iter
block: introduce bio_clone_chunk_bioset()
dm: clone bio via bio_clone_chunk_bioset
block: introduce bio_for_each_chunk_all and
bio_for_each_chunk_segment_all
block: convert to bio_for_each_chunk_segment_all()
md/dm/bcache: conver to bio_for_each_chunk_segment_all and
bio_for_each_chunk_all
fs: conver to bio_for_each_chunk_segment_all()
btrfs: conver to bio_for_each_chunk_segment_all
ext4: conver to bio_for_each_chunk_segment_all
f2fs: conver to bio_for_each_chunk_segment_all
xfs: conver to bio_for_each_chunk_segment_all
exofs: conver to bio_for_each_chunk_segment_all
gfs2: conver to bio_for_each_chunk_segment_all
block: kill bio_for_each_segment_all()
block: enable multipage bvecs
block: always define BIO_MAX_PAGES as 256
block: document usage of bio iterator helpers
Documentation/block/biovecs.txt | 30 ++++++
block/bio.c | 211 +++++++++++++++++++++++-----------------
block/blk-merge.c | 162 +++++++++++++++++++++++-------
block/blk-zoned.c | 5 +-
block/bounce.c | 6 +-
drivers/block/loop.c | 6 +-
drivers/md/bcache/btree.c | 3 +-
drivers/md/bcache/util.c | 2 +-
drivers/md/dm-crypt.c | 3 +-
drivers/md/dm.c | 4 +-
drivers/md/raid1.c | 3 +-
fs/block_dev.c | 6 +-
fs/btrfs/compression.c | 8 +-
fs/btrfs/disk-io.c | 3 +-
fs/btrfs/extent_io.c | 14 ++-
fs/btrfs/inode.c | 6 +-
fs/btrfs/raid56.c | 3 +-
fs/buffer.c | 5 +-
fs/crypto/bio.c | 3 +-
fs/direct-io.c | 4 +-
fs/exofs/ore.c | 3 +-
fs/exofs/ore_raid.c | 3 +-
fs/ext4/page-io.c | 3 +-
fs/ext4/readpage.c | 3 +-
fs/f2fs/data.c | 9 +-
fs/gfs2/lops.c | 6 +-
fs/gfs2/meta_io.c | 3 +-
fs/iomap.c | 3 +-
fs/mpage.c | 3 +-
fs/xfs/xfs_aops.c | 5 +-
include/linux/bio.h | 94 ++++++++++++++----
include/linux/blkdev.h | 4 +
include/linux/bvec.h | 155 +++++++++++++++++++++++++++--
33 files changed, 588 insertions(+), 193 deletions(-)
--
2.9.5
From: Christoph Hellwig <[email protected]>
bio_check_pages_dirty currently violates the invariant that bv_page of
a bio_vec inside bi_vcnt shouldn't be zero, and that is going to become
really annoying with multipage biovecs. Fortunately there isn't any
all that good reason for it - once we decide to defer freeing the bio
to a workqueue holding onto a few additional pages isn't really an
issue anymore. So just check if there is a clean page that needs
dirtying in the first pass, and do a second pass to free them if there
was none, while the cache is still hot.
Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
irq state - we know we are called from a workqueue.
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
block/bio.c | 56 +++++++++++++++++++++-----------------------------------
1 file changed, 21 insertions(+), 35 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 5f7563598b1c..3e7d117c3346 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1647,19 +1647,15 @@ static void bio_release_pages(struct bio *bio)
struct bio_vec *bvec;
int i;
- bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
-
- if (page)
- put_page(page);
- }
+ bio_for_each_segment_all(bvec, bio, i)
+ put_page(bvec->bv_page);
}
/*
* bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
* If they are, then fine. If, however, some pages are clean then they must
* have been written out during the direct-IO read. So we take another ref on
- * the BIO and the offending pages and re-dirty the pages in process context.
+ * the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
* here on. It will run one put_page() against each page and will run one
@@ -1677,52 +1673,42 @@ static struct bio *bio_dirty_list;
*/
static void bio_dirty_fn(struct work_struct *work)
{
- unsigned long flags;
- struct bio *bio;
+ struct bio *bio, *next;
- spin_lock_irqsave(&bio_dirty_lock, flags);
- bio = bio_dirty_list;
+ spin_lock_irq(&bio_dirty_lock);
+ next = bio_dirty_list;
bio_dirty_list = NULL;
- spin_unlock_irqrestore(&bio_dirty_lock, flags);
+ spin_unlock_irq(&bio_dirty_lock);
- while (bio) {
- struct bio *next = bio->bi_private;
+ while ((bio = next) != NULL) {
+ next = bio->bi_private;
bio_set_pages_dirty(bio);
bio_release_pages(bio);
bio_put(bio);
- bio = next;
}
}
void bio_check_pages_dirty(struct bio *bio)
{
struct bio_vec *bvec;
- int nr_clean_pages = 0;
+ unsigned long flags;
int i;
bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
-
- if (PageDirty(page) || PageCompound(page)) {
- put_page(page);
- bvec->bv_page = NULL;
- } else {
- nr_clean_pages++;
- }
+ if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
+ goto defer;
}
- if (nr_clean_pages) {
- unsigned long flags;
-
- spin_lock_irqsave(&bio_dirty_lock, flags);
- bio->bi_private = bio_dirty_list;
- bio_dirty_list = bio;
- spin_unlock_irqrestore(&bio_dirty_lock, flags);
- schedule_work(&bio_dirty_work);
- } else {
- bio_put(bio);
- }
+ bio_release_pages(bio);
+ bio_put(bio);
+ return;
+defer:
+ spin_lock_irqsave(&bio_dirty_lock, flags);
+ bio->bi_private = bio_dirty_list;
+ bio_dirty_list = bio;
+ spin_unlock_irqrestore(&bio_dirty_lock, flags);
+ schedule_work(&bio_dirty_work);
}
EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
--
2.9.5
From: Christoph Hellwig <[email protected]>
Replace a nasty hack with a different nasty hack to prepare for multipage
bio_vecs. By moving the temporary page array as far up as possible in
the space allocated for the bio_vec array we can iterate forward over it
and thus use bio_add_page. Using bio_add_page means we'll be able to
merge physically contiguous pages once support for multipage bio_vecs is
merged.
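Why the forward walk is safe can be shown with a small userspace model, not the kernel code itself. It assumes a bio_vec is exactly two pointer-sized slots, and the name sketch_fill_forward is hypothetical. The temporary page array sits in the upper half of the space, and each page pointer is read before the bvec being written can reach it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_BVEC 2	/* models sizeof(struct bio_vec) / sizeof(struct page *) */

/*
 * 'mem' models the space of nr bio_vecs as 2 * nr pointer-sized slots.
 * Slot 2*i holds the page of bvec i, slot 2*i + 1 its length.
 * Returns 1 if every page survived the in-place conversion, else 0.
 */
static int sketch_fill_forward(uintptr_t *mem, size_t nr)
{
	/* temporary page array, moved as far up as possible */
	uintptr_t *pages = mem + nr * (SLOTS_PER_BVEC - 1);
	size_t i;

	assert(nr > 0);

	/* stand-in for iov_iter_get_pages(): fake page "pointers" */
	for (i = 0; i < nr; i++)
		pages[i] = 0x1000 * (i + 1);

	/* walk forward, turning each page pointer into a full bvec */
	for (i = 0; i < nr; i++) {
		uintptr_t page = pages[i];	/* read before it can be overwritten */

		mem[SLOTS_PER_BVEC * i] = page;
		mem[SLOTS_PER_BVEC * i + 1] = 4096;	/* len, offset 0 */
	}

	for (i = 0; i < nr; i++)
		if (mem[SLOTS_PER_BVEC * i] != 0x1000 * (i + 1))
			return 0;
	return 1;
}
```

Writing bvec i touches slots 2i and 2i+1, while the next unread page pointer lives at slot nr+i+1, which is always above 2i+1 for i < nr.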
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
block/bio.c | 45 +++++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 24 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index ebd3ca62e037..cb0f46e2752b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -902,6 +902,8 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
+#define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))
+
/**
* bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
* @bio: bio to add pages to
@@ -913,38 +915,33 @@ EXPORT_SYMBOL(bio_add_page);
int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+ unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
- size_t offset, diff;
- ssize_t size;
+ ssize_t size, left;
+ unsigned len, i;
+ size_t offset;
+
+ /*
+ * Move page array up in the allocated memory for the bio vecs as
+ * far as possible so that we can start filling biovecs from the
+ * beginning without overwriting the temporary page array.
+ */
+ BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+ pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
- nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
- /*
- * Deep magic below: We need to walk the pinned pages backwards
- * because we are abusing the space allocated for the bio_vecs
- * for the page array. Because the bio_vecs are larger than the
- * page pointers by definition this will always work. But it also
- * means we can't use bio_add_page, so any changes to it's semantics
- * need to be reflected here as well.
- */
- bio->bi_iter.bi_size += size;
- bio->bi_vcnt += nr_pages;
-
- diff = (nr_pages * PAGE_SIZE - offset) - size;
- while (nr_pages--) {
- bv[nr_pages].bv_page = pages[nr_pages];
- bv[nr_pages].bv_len = PAGE_SIZE;
- bv[nr_pages].bv_offset = 0;
- }
+ for (left = size, i = 0; left > 0; left -= len, i++) {
+ struct page *page = pages[i];
- bv[0].bv_offset += offset;
- bv[0].bv_len -= offset;
- if (diff)
- bv[bio->bi_vcnt - 1].bv_len -= diff;
+ len = min_t(size_t, PAGE_SIZE - offset, left);
+ if (WARN_ON_ONCE(bio_add_page(bio, page, len, offset) != len))
+ return -EINVAL;
+ offset = 0;
+ }
iov_iter_advance(iter, size);
return 0;
--
2.9.5
Firstly, it is more efficient to use bio_for_each_chunk() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how many
multipage bvecs there are in the bio.
Secondly, once bio_for_each_chunk() is used, a bvec may need to
be split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.
Thirdly, while splitting a multipage bvec into segments, the max segment
number may be reached, and then the bio needs to be split when this happens.
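The second and third points can be sketched in userspace as follows. This is a simplified model that ignores the virt boundary and the front/back segment size bookkeeping; sketch_limits and sketch_split_segs are hypothetical names standing in for the queue limits and the helper added by this patch.

```c
#include <assert.h>

/* Hypothetical queue limits, standing in for queue_max_segments() and
 * queue_max_segment_size(). */
struct sketch_limits {
	unsigned int max_segments;
	unsigned int max_segment_size;
};

/*
 * Account a bvec of 'len' bytes as one or more segments. *nsegs is the
 * running segment count of the bio. Returns the number of bytes that
 * did not fit; a non-zero result means the bio has to be split inside
 * this bvec.
 */
static unsigned int sketch_split_segs(const struct sketch_limits *lim,
				      unsigned int len, unsigned int *nsegs)
{
	assert(lim->max_segment_size > 0);

	while (len && *nsegs < lim->max_segments) {
		unsigned int seg = len < lim->max_segment_size ?
				   len : lim->max_segment_size;

		len -= seg;
		(*nsegs)++;
	}
	return len;
}
```

With 4 segments of at most 64K each, a 150000-byte bvec takes three segments, and a following 200000-byte bvec only gets the last one, so the bio must be split in the middle of that bvec.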
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 76 insertions(+), 14 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index aaec38cc37b8..2493fe027953 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -97,6 +97,62 @@ static inline unsigned get_max_io_size(struct request_queue *q,
return sectors;
}
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+ unsigned *nsegs, unsigned *last_seg_size,
+ unsigned *front_seg_size, unsigned *sectors)
+{
+ bool need_split = false;
+ unsigned len = bv->bv_len;
+ unsigned total_len = 0;
+ unsigned new_nsegs = 0, seg_size = 0;
+
+ if ((*nsegs >= queue_max_segments(q)) || !len)
+ return need_split;
+
+ /*
+ * Multipage bvec may be too big to hold in one segment,
+ * so the current bvec has to be split into multiple
+ * segments.
+ */
+ while (new_nsegs + *nsegs < queue_max_segments(q)) {
+ seg_size = min(queue_max_segment_size(q), len);
+
+ new_nsegs++;
+ total_len += seg_size;
+ len -= seg_size;
+
+ if ((queue_virt_boundary(q) && ((bv->bv_offset +
+ total_len) & queue_virt_boundary(q))) || !len)
+ break;
+ }
+
+ /* split in the middle of the bvec */
+ if (len)
+ need_split = true;
+
+ /* update front segment size */
+ if (!*nsegs) {
+ unsigned first_seg_size = seg_size;
+
+ if (new_nsegs > 1)
+ first_seg_size = queue_max_segment_size(q);
+ if (*front_seg_size < first_seg_size)
+ *front_seg_size = first_seg_size;
+ }
+
+ /* update other variables */
+ *last_seg_size = seg_size;
+ *nsegs += new_nsegs;
+ if (sectors)
+ *sectors += total_len >> 9;
+
+ return need_split;
+}
+
static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -110,7 +166,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
- bio_for_each_segment(bv, bio, iter) {
+ bio_for_each_chunk(bv, bio, iter) {
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
@@ -125,8 +181,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
*/
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
- nsegs++;
- sectors = max_sectors;
+ /* split in the middle of bvec */
+ bv.bv_len = (max_sectors - sectors) << 9;
+ bvec_split_segs(q, &bv, &nsegs,
+ &seg_size,
+ &front_seg_size,
+ &sectors);
}
goto split;
}
@@ -153,11 +213,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (nsegs == 1 && seg_size > front_seg_size)
front_seg_size = seg_size;
- nsegs++;
bvprv = bv;
bvprvp = &bvprv;
- seg_size = bv.bv_len;
- sectors += bv.bv_len >> 9;
+
+ if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
+ &front_seg_size, &sectors))
+ goto split;
}
@@ -235,6 +296,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio_vec bv, bvprv = { NULL };
int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
+ unsigned front_seg_size = bio->bi_seg_front_size;
struct bio *fbio, *bbio;
struct bvec_iter iter;
@@ -255,7 +317,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
seg_size = 0;
nr_phys_segs = 0;
for_each_bio(bio) {
- bio_for_each_segment(bv, bio, iter) {
+ bio_for_each_chunk(bv, bio, iter) {
/*
* If SG merging is disabled, each bio vector is
* a segment
@@ -277,20 +339,20 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
continue;
}
new_segment:
- if (nr_phys_segs == 1 && seg_size >
- fbio->bi_seg_front_size)
- fbio->bi_seg_front_size = seg_size;
+ if (nr_phys_segs == 1 && seg_size > front_seg_size)
+ front_seg_size = seg_size;
- nr_phys_segs++;
bvprv = bv;
prev = 1;
- seg_size = bv.bv_len;
+ bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size,
+ &front_seg_size, NULL);
}
bbio = bio;
}
- if (nr_phys_segs == 1 && seg_size > fbio->bi_seg_front_size)
- fbio->bi_seg_front_size = seg_size;
+ if (nr_phys_segs == 1 && seg_size > front_seg_size)
+ front_seg_size = seg_size;
+ fbio->bi_seg_front_size = front_seg_size;
if (seg_size > bbio->bi_seg_back_size)
bbio->bi_seg_back_size = seg_size;
--
2.9.5
This patch introduces helpers of 'bvec_iter_chunk_*' for multipage
bvec(chunk) support.
The introduced interfaces treat one bvec as a real multipage chunk,
for example, .bv_len is the total length of the multipage chunk.
The existing helpers of bvec_iter_* are interfaces for supporting the current
bvec iterator, which is assumed to be singlepage only by drivers, fs, dm
etc. These helpers will build a singlepage bvec on the fly, so users of
the current bio/bvec iterator still work well and needn't change even
though we store real multipage chunks in the bvec table.
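How a singlepage bvec is derived from a position inside a multipage chunk can be sketched in userspace as below. The struct sketch_pos and the sketch_* functions are hypothetical stand-ins for what the bvec_iter_* macros compute; the real chunk length is additionally clamped by the iterator's bi_size, which this sketch folds into chunk_len.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u

/* Position inside one multipage chunk, mirroring what the macros read. */
struct sketch_pos {
	unsigned int chunk_offset;	/* bv_offset + bi_bvec_done */
	unsigned int chunk_len;		/* bytes left in the chunk */
};

/* index of the current page, counted from the chunk's first page */
static unsigned int sketch_page_idx(const struct sketch_pos *p)
{
	return p->chunk_offset / SKETCH_PAGE_SIZE;
}

/* byte offset inside the current page */
static unsigned int sketch_sp_offset(const struct sketch_pos *p)
{
	return p->chunk_offset % SKETCH_PAGE_SIZE;
}

/* bytes of the singlepage bvec: clamp to the end of the current page */
static unsigned int sketch_sp_len(const struct sketch_pos *p)
{
	unsigned int room = SKETCH_PAGE_SIZE - sketch_sp_offset(p);

	assert(p->chunk_len > 0);
	return p->chunk_len < room ? p->chunk_len : room;
}
```

For a chunk with bv_offset 512 of which 8192 bytes are already done, the current page is the third one (index 2) and the offset inside it is again 512.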
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bvec.h | 63 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 60 insertions(+), 3 deletions(-)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index fe7a22dd133b..52c90ea1a96a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,44 @@
#include <linux/kernel.h>
#include <linux/bug.h>
#include <linux/errno.h>
+#include <linux/mm.h>
+
+/*
+ * What is multipage bvecs(chunk)?
+ *
+ * - bvec stored in bio->bi_io_vec is always multipage(mp) style
+ *
+ * - bvec(struct bio_vec) represents one physically contiguous I/O
+ * buffer, now the buffer may include more than one pages since
+ * multipage(mp) bvec is supported, and all these pages represented
+ * by one bvec is physically contiguous. Before mp support, at most
+ * one page can be included in one bvec, we call it singlepage(sp)
+ * bvec.
+ *
+ * - .bv_page of the bvec represents the 1st page in the mp chunk
+ *
+ * - .bv_offset of the bvec represents offset of the buffer in the bvec
+ *
+ * The effect on the current drivers/filesystem/dm/bcache/...:
+ *
+ * - almost everyone supposes that one bvec only includes one single
+ * page, so we keep the sp interface not changed, for example,
+ * bio_for_each_segment() still returns bvec with single page
+ *
+ * - bio_for_each_segment_all() will be changed to return singlepage
+ * bvec too
+ *
+ * - during iterating, iterator variable(struct bvec_iter) is always
+ * updated in multipage bvec style and that means bvec_iter_advance()
+ * is kept not changed
+ *
+ * - returned(copied) singlepage bvec is generated in flight by bvec
+ * helpers from the stored multipage bvec(chunk)
+ *
+ * - In case that some components(such as iov_iter) need to support
+ * multipage chunk, we introduce new helpers(bvec_iter_chunk_*) for
+ * them.
+ */
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -52,16 +90,35 @@ struct bvec_iter {
*/
#define __bvec_iter_bvec(bvec, iter) (&(bvec)[(iter).bi_idx])
-#define bvec_iter_page(bvec, iter) \
+#define bvec_iter_chunk_page(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
-#define bvec_iter_len(bvec, iter) \
+#define bvec_iter_chunk_len(bvec, iter) \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
-#define bvec_iter_offset(bvec, iter) \
+#define bvec_iter_chunk_offset(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
+#define bvec_iter_page_idx_in_seg(bvec, iter) \
+ (bvec_iter_chunk_offset((bvec), (iter)) / PAGE_SIZE)
+
+/*
+ * <page, offset,length> of singlepage(sp) segment.
+ *
+ * This helpers will be implemented for building sp bvec in flight.
+ */
+#define bvec_iter_offset(bvec, iter) \
+ (bvec_iter_chunk_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter) \
+ min_t(unsigned, bvec_iter_chunk_len((bvec), (iter)), \
+ (PAGE_SIZE - (bvec_iter_offset((bvec), (iter)))))
+
+#define bvec_iter_page(bvec, iter) \
+ nth_page(bvec_iter_chunk_page((bvec), (iter)), \
+ bvec_iter_page_idx_in_seg((bvec), (iter)))
+
#define bvec_iter_bvec(bvec, iter) \
((struct bio_vec) { \
.bv_page = bvec_iter_page((bvec), (iter)), \
--
2.9.5
From: Christoph Hellwig <[email protected]>
So don't bother handling it.
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
block/bio.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 3e7d117c3346..ebd3ca62e037 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1634,10 +1634,8 @@ void bio_set_pages_dirty(struct bio *bio)
int i;
bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
-
- if (page && !PageCompound(page))
- set_page_dirty_lock(page);
+ if (!PageCompound(bvec->bv_page))
+ set_page_dirty_lock(bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
--
2.9.5
This helper is used to iterate over multipage bvecs for bio splitting/merging,
and it is required in bio_clone_bioset() too, so introduce it.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 34 +++++++++++++++++++++++++++++++---
include/linux/bvec.h | 36 ++++++++++++++++++++++++++++++++----
2 files changed, 63 insertions(+), 7 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 397a38aca182..e9f74c73bbe6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -80,6 +80,9 @@
#define bio_data_dir(bio) \
(op_is_write(bio_op(bio)) ? WRITE : READ)
+#define bio_iter_chunk_iovec(bio, iter) \
+ bvec_iter_chunk_bvec((bio)->bi_io_vec, (iter))
+
/*
* Check whether this bio carries any data or not. A NULL bio is allowed.
*/
@@ -165,8 +168,8 @@ static inline bool bio_full(struct bio *bio)
#define bio_for_each_segment_all(bvl, bio, i) \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes)
+static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes, bool chunk)
{
iter->bi_sector += bytes >> 9;
@@ -174,11 +177,26 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
iter->bi_size -= bytes;
iter->bi_done += bytes;
} else {
- bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+ if (!chunk)
+ bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+ else
+ bvec_iter_chunk_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
}
}
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes)
+{
+ __bio_advance_iter(bio, iter, bytes, false);
+}
+
+static inline void bio_advance_chunk_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes)
+{
+ __bio_advance_iter(bio, iter, bytes, true);
+}
+
static inline bool bio_rewind_iter(struct bio *bio, struct bvec_iter *iter,
unsigned int bytes)
{
@@ -202,6 +220,16 @@ static inline bool bio_rewind_iter(struct bio *bio, struct bvec_iter *iter,
#define bio_for_each_segment(bvl, bio, iter) \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
+#define __bio_for_each_chunk(bvl, bio, iter, start) \
+ for (iter = (start); \
+ (iter).bi_size && \
+ ((bvl = bio_iter_chunk_iovec((bio), (iter))), 1); \
+ bio_advance_chunk_iter((bio), &(iter), (bvl).bv_len))
+
+/* returns one real segment(multipage bvec) each time */
+#define bio_for_each_chunk(bvl, bio, iter) \
+ __bio_for_each_chunk(bvl, bio, iter, (bio)->bi_iter)
+
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 52c90ea1a96a..9e082d023392 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -126,8 +126,16 @@ struct bvec_iter {
.bv_offset = bvec_iter_offset((bvec), (iter)), \
})
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
- struct bvec_iter *iter, unsigned bytes)
+#define bvec_iter_chunk_bvec(bvec, iter) \
+((struct bio_vec) { \
+ .bv_page = bvec_iter_chunk_page((bvec), (iter)), \
+ .bv_len = bvec_iter_chunk_len((bvec), (iter)), \
+ .bv_offset = bvec_iter_chunk_offset((bvec), (iter)), \
+})
+
+static inline bool __bvec_iter_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter,
+ unsigned bytes, bool chunk)
{
if (WARN_ONCE(bytes > iter->bi_size,
"Attempted to advance past end of bvec iter\n")) {
@@ -136,8 +144,14 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
}
while (bytes) {
- unsigned iter_len = bvec_iter_len(bv, *iter);
- unsigned len = min(bytes, iter_len);
+ unsigned len;
+
+ if (chunk)
+ len = bvec_iter_chunk_len(bv, *iter);
+ else
+ len = bvec_iter_len(bv, *iter);
+
+ len = min(bytes, len);
bytes -= len;
iter->bi_size -= len;
@@ -176,6 +190,20 @@ static inline bool bvec_iter_rewind(const struct bio_vec *bv,
return true;
}
+static inline bool bvec_iter_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter,
+ unsigned bytes)
+{
+ return __bvec_iter_advance(bv, iter, bytes, false);
+}
+
+static inline bool bvec_iter_chunk_advance(const struct bio_vec *bv,
+ struct bvec_iter *iter,
+ unsigned bytes)
+{
+ return __bvec_iter_advance(bv, iter, bytes, true);
+}
+
#define for_each_bvec(bvl, bio_vec, iter, start) \
for (iter = (start); \
(iter).bi_size && \
--
2.9.5
It is more efficient to use bio_for_each_chunk() to map sg; at the same time
we have to consider splitting multipage bvec as done in blk_bio_segment_split().
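The per-segment arithmetic of the mapping can be sketched in userspace like this. struct sketch_sg and sketch_bvec_map_sg are hypothetical stand-ins for a scatterlist entry and for the helper added by this patch, and the page is recorded as an index relative to the bvec's first page instead of a struct page pointer.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for one scatterlist entry. */
struct sketch_sg {
	unsigned int page_idx;	/* page index relative to the bvec's first page */
	unsigned int offset;	/* byte offset inside that page */
	unsigned int length;
};

/* Map one multipage bvec to sg entries of at most max_seg_size bytes.
 * Returns the number of entries written. */
static size_t sketch_bvec_map_sg(unsigned int bv_offset, unsigned int bv_len,
				 unsigned int max_seg_size,
				 struct sketch_sg *sg, size_t max_sg)
{
	unsigned int total = 0;
	size_t nsegs = 0;

	assert(max_seg_size > 0);

	while (bv_len > 0) {
		unsigned int seg = bv_len < max_seg_size ? bv_len : max_seg_size;

		assert(nsegs < max_sg);
		sg[nsegs].page_idx = (total + bv_offset) / SKETCH_PAGE_SIZE;
		sg[nsegs].offset = (total + bv_offset) % SKETCH_PAGE_SIZE;
		sg[nsegs].length = seg;

		total += seg;
		bv_len -= seg;
		nsegs++;
	}
	return nsegs;
}
```

A 20000-byte bvec starting at offset 512 with an 8K max segment size yields three entries, each starting at offset 512 of pages 0, 2 and 4 of the chunk.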
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 72 +++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 52 insertions(+), 20 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2493fe027953..044db0fa2f89 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -424,6 +424,56 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
return 0;
}
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+ struct scatterlist *sglist)
+{
+ if (!*sg)
+ return sglist;
+ else {
+ /*
+ * If the driver previously mapped a shorter
+ * list, we could see a termination bit
+ * prematurely unless it fully inits the sg
+ * table on each mapping. We KNOW that there
+ * must be more entries here or the driver
+ * would be buggy, so force clear the
+ * termination bit to avoid doing a full
+ * sg_init_table() in drivers for each command.
+ */
+ sg_unmark_end(*sg);
+ return sg_next(*sg);
+ }
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+ struct bio_vec *bvec, struct scatterlist *sglist,
+ struct scatterlist **sg)
+{
+ unsigned nbytes = bvec->bv_len;
+ unsigned nsegs = 0, total = 0;
+
+ while (nbytes > 0) {
+ unsigned seg_size;
+ struct page *pg;
+ unsigned offset, idx;
+
+ *sg = blk_next_sg(sg, sglist);
+
+ seg_size = min(nbytes, queue_max_segment_size(q));
+ offset = (total + bvec->bv_offset) % PAGE_SIZE;
+ idx = (total + bvec->bv_offset) / PAGE_SIZE;
+ pg = nth_page(bvec->bv_page, idx);
+
+ sg_set_page(*sg, pg, seg_size, offset);
+
+ total += seg_size;
+ nbytes -= seg_size;
+ nsegs++;
+ }
+
+ return nsegs;
+}
+
static inline void
__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -444,25 +494,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
(*sg)->length += nbytes;
} else {
new_segment:
- if (!*sg)
- *sg = sglist;
- else {
- /*
- * If the driver previously mapped a shorter
- * list, we could see a termination bit
- * prematurely unless it fully inits the sg
- * table on each mapping. We KNOW that there
- * must be more entries here or the driver
- * would be buggy, so force clear the
- * termination bit to avoid doing a full
- * sg_init_table() in drivers for each command.
- */
- sg_unmark_end(*sg);
- *sg = sg_next(*sg);
- }
-
- sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
- (*nsegs)++;
+ (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
}
@@ -484,7 +516,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
int cluster = blk_queue_cluster(q), nsegs = 0;
for_each_bio(bio)
- bio_for_each_segment(bvec, bio, iter)
+ bio_for_each_chunk(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
&nsegs, &cluster);
--
2.9.5
Preparing for supporting multipage bvec.
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: David Sterba <[email protected]>
Cc: [email protected]
Signed-off-by: Ming Lei <[email protected]>
---
fs/btrfs/compression.c | 5 ++++-
fs/btrfs/extent_io.c | 5 +++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d3e447b45bf7..02da66acc57d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -407,8 +407,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
static u64 bio_end_offset(struct bio *bio)
{
struct bio_vec *last = bio_last_bvec_all(bio);
+ struct bio_vec bv;
- return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
+ chunk_last_segment(last, &bv);
+
+ return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
}
static noinline int add_ra_bio_pages(struct inode *inode,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 51fc015c7d2c..efb8db0fd73c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2728,11 +2728,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
{
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
- struct page *page = bvec->bv_page;
+ struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
- start = page_offset(page) + bvec->bv_offset;
+ chunk_last_segment(bvec, &bv);
+ start = page_offset(bv.bv_page) + bv.bv_offset;
bio->bi_private = NULL;
--
2.9.5
BTRFS and guard_bio_eod() need to get the last singlepage segment
from one multipage chunk, so introduce this helper to make them happy.
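To illustrate what the helper computes, here is a minimal userspace sketch
of the same arithmetic. It is only a model: struct sketch_bvec,
SKETCH_PAGE_SIZE and sketch_last_segment() are hypothetical names, a plain
page index stands in for 'struct page *', and a 4K page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Userspace stand-in for struct bio_vec; a page index replaces struct page *. */
struct sketch_bvec {
	size_t page;		/* index of the first page of the chunk */
	unsigned int len;	/* bytes covered by the chunk */
	unsigned int offset;	/* byte offset from the start of 'page' */
};

/* Store the last single-page segment of 'chunk' in 'seg'. */
static void sketch_last_segment(const struct sketch_bvec *chunk,
				struct sketch_bvec *seg)
{
	unsigned int total = chunk->offset + chunk->len;
	unsigned int last_page = total / SKETCH_PAGE_SIZE;

	assert(chunk->len > 0);

	/* A chunk ending exactly on a page boundary ends in the previous page. */
	if (last_page * SKETCH_PAGE_SIZE == total)
		last_page--;

	seg->page = chunk->page + last_page;

	if (chunk->offset >= last_page * SKETCH_PAGE_SIZE) {
		/* The whole chunk lives inside its last page. */
		seg->offset = chunk->offset % SKETCH_PAGE_SIZE;
		seg->len = chunk->len;
	} else {
		/* The last segment starts at the beginning of the last page. */
		seg->offset = 0;
		seg->len = total - last_page * SKETCH_PAGE_SIZE;
	}
}
```

For example, a chunk starting at page 10 with offset 512 and length 8192
ends 512 bytes into page 12, so the last segment is (page 12, offset 0,
len 512).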
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bvec.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 9e082d023392..aac75d87d884 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -219,4 +219,29 @@ static inline bool bvec_iter_chunk_advance(const struct bio_vec *bv,
.bi_bvec_done = 0, \
}
+/*
+ * Get the last singlepage segment from the multipage bvec and store it
+ * in @seg
+ */
+static inline void chunk_last_segment(const struct bio_vec *bvec,
+ struct bio_vec *seg)
+{
+ unsigned total = bvec->bv_offset + bvec->bv_len;
+ unsigned last_page = total / PAGE_SIZE;
+
+ if (last_page * PAGE_SIZE == total)
+ last_page--;
+
+ seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+ /* the whole segment is inside the last page */
+ if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+ seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+ seg->bv_len = bvec->bv_len;
+ } else {
+ seg->bv_offset = 0;
+ seg->bv_len = total - last_page * PAGE_SIZE;
+ }
+}
+
#endif /* __LINUX_BVEC_ITER_H */
--
2.9.5
Once multipage bvec is enabled, bio->bi_vcnt will no longer equal the
number of pages in the bio, so compute the page count with
bio_for_each_segment_all(), which will continue to iterate page by page.
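As a rough illustration of why bi_vcnt stops being the page count, the
userspace sketch below counts the pages covered by a table of multipage
chunks. It uses a closed-form count per chunk rather than the iterator the
patch relies on, and every name in it (struct sketch_bvec,
sketch_chunk_pages(), sketch_pages_all(), the 4K SKETCH_PAGE_SIZE) is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Userspace stand-in for struct bio_vec; a page index replaces struct page *. */
struct sketch_bvec {
	size_t page;		/* index of the first page of the chunk */
	unsigned int len;	/* bytes covered by the chunk */
	unsigned int offset;	/* byte offset from the start of 'page' */
};

/* Pages touched by one chunk; assumes offset < page size and len > 0. */
static unsigned int sketch_chunk_pages(const struct sketch_bvec *chunk)
{
	assert(chunk->offset < SKETCH_PAGE_SIZE);
	assert(chunk->len > 0);

	return (chunk->offset + chunk->len + SKETCH_PAGE_SIZE - 1) /
		SKETCH_PAGE_SIZE;
}

/* Total pages in a table of 'vcnt' chunks (the analogue of bio_pages_all()). */
static unsigned int sketch_pages_all(const struct sketch_bvec *table,
				     unsigned int vcnt)
{
	unsigned int i, pages = 0;

	for (i = 0; i < vcnt; i++)
		pages += sketch_chunk_pages(&table[i]);

	return pages;
}
```

With two chunks, one covering two full pages and one covering 100 bytes
that straddle a page boundary, the table has two entries but four pages.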
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e9f74c73bbe6..c17b8f80d650 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -339,8 +339,14 @@ static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
static inline unsigned bio_pages_all(struct bio *bio)
{
+ unsigned i;
+ struct bio_vec *bv;
+
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
- return bio->bi_vcnt;
+
+ bio_for_each_segment_all(bv, bio, i)
+ ;
+ return i;
}
static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
--
2.9.5
Once multipage bvec is enabled, the last bvec may span more than one
page, so this patch uses chunk_last_segment() to locate the last page when
truncating the bio.
Signed-off-by: Ming Lei <[email protected]>
---
fs/buffer.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index cabc045f483d..630ae3e0238b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3021,7 +3021,10 @@ void guard_bio_eod(int op, struct bio *bio)
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
- zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+ struct bio_vec bv;
+
+ chunk_last_segment(bvec, &bv);
+ zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
}
--
2.9.5
Some callers still need the number of multipage chunks in a bio, so
introduce bio_chunks() to return it.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c17b8f80d650..13fd7bc30390 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -242,7 +242,6 @@ static inline unsigned bio_segments(struct bio *bio)
* We special case discard/write same/write zeroes, because they
* interpret bi_size differently:
*/
-
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
@@ -251,13 +250,34 @@ static inline unsigned bio_segments(struct bio *bio)
case REQ_OP_WRITE_SAME:
return 1;
default:
- break;
+ bio_for_each_segment(bv, bio, iter)
+ segs++;
+ return segs;
}
+}
- bio_for_each_segment(bv, bio, iter)
- segs++;
+static inline unsigned bio_chunks(struct bio *bio)
+{
+ unsigned chunks = 0;
+ struct bio_vec bv;
+ struct bvec_iter iter;
- return segs;
+ /*
+ * We special case discard/write same/write zeroes, because they
+ * interpret bi_size differently:
+ */
+ switch (bio_op(bio)) {
+ case REQ_OP_DISCARD:
+ case REQ_OP_SECURE_ERASE:
+ case REQ_OP_WRITE_ZEROES:
+ return 0;
+ case REQ_OP_WRITE_SAME:
+ return 1;
+ default:
+ bio_for_each_chunk(bv, bio, iter)
+ chunks++;
+ return chunks;
+ }
}
/*
--
2.9.5
rq_for_each_chunk() is still required in some cases, for example in the
loop driver.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/blkdev.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bca3a92eb55f..4eaba73c784a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -941,6 +941,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq) \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
+#define rq_for_each_chunk(bvl, _rq, _iter) \
+ __rq_for_each_bio(_iter.bio, _rq) \
+ bio_for_each_chunk(bvl, _iter.bio, _iter.iter)
+
#define rq_iter_last(bvec, _iter) \
(_iter.bio->bi_next == NULL && \
bio_iter_last(bvec, _iter.iter))
--
2.9.5
iov_iter is implemented on top of the bvec iterator, so it is safe to pass
multipage chunks to it, and doing so is much more efficient than passing
one page per bvec.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/loop.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4838b0dbaad3..c25963a9df00 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -521,7 +521,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
struct bio_vec tmp;
__rq_for_each_bio(bio, rq)
- segments += bio_segments(bio);
+ segments += bio_chunks(bio);
bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_NOIO);
if (!bvec)
return -EIO;
@@ -533,7 +533,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
* API will take care of all details for us.
*/
- rq_for_each_segment(tmp, rq, iter) {
+ rq_for_each_chunk(tmp, rq, iter) {
*bvec = tmp;
bvec++;
}
@@ -547,7 +547,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
*/
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
- segments = bio_segments(bio);
+ segments = bio_chunks(bio);
}
atomic_set(&cmd->ref, 2);
--
2.9.5
The incoming bio can become very big once multipage bvec is enabled,
so we can't clone it page by page.
This patch uses the newly introduced bio_clone_chunk_bioset() so that the
incoming bio can be cloned successfully. This is safe because device mapper
won't modify the bio vector of the cloned multipage bio.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/dm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 98dff36b89a3..13ca3574d972 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1582,8 +1582,8 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
* the usage of io->orig_bio in dm_remap_zone_report()
* won't be affected by this reassignment.
*/
- struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
- &md->queue->bio_split);
+ struct bio *b = bio_clone_chunk_bioset(bio, GFP_NOIO,
+ &md->queue->bio_split);
ci.io->orig_bio = b;
bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
bio_chain(b, bio);
--
2.9.5
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Signed-off-by: Ming Lei <[email protected]>
---
fs/btrfs/compression.c | 3 ++-
fs/btrfs/disk-io.c | 3 ++-
fs/btrfs/extent_io.c | 9 ++++++---
fs/btrfs/inode.c | 6 ++++--
fs/btrfs/raid56.c | 3 ++-
5 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 02da66acc57d..388804e9dde6 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -166,13 +166,14 @@ static void end_compressed_bio_read(struct bio *bio)
} else {
int i;
struct bio_vec *bvec;
+ struct bvec_chunk_iter citer;
/*
* we have verified the checksum already, set page
* checked so the end_io handlers know about it
*/
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, cb->orig_bio, i)
+ bio_for_each_chunk_segment_all(bvec, cb->orig_bio, i, citer)
SetPageChecked(bvec->bv_page);
bio_endio(cb->orig_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..e9ad3daa3247 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -829,9 +829,10 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
struct bio_vec *bvec;
struct btrfs_root *root;
int i, ret = 0;
+ struct bvec_chunk_iter citer;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
root = BTRFS_I(bvec->bv_page->mapping->host)->root;
ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
if (ret)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index efb8db0fd73c..805bac595f3a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2453,9 +2453,10 @@ static void end_bio_extent_writepage(struct bio *bio)
u64 start;
u64 end;
int i;
+ struct bvec_chunk_iter citer;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -2524,9 +2525,10 @@ static void end_bio_extent_readpage(struct bio *bio)
int mirror;
int ret;
int i;
+ struct bvec_chunk_iter citer;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -3678,9 +3680,10 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
struct bio_vec *bvec;
struct extent_buffer *eb;
int i, done;
+ struct bvec_chunk_iter citer;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
eb = (struct extent_buffer *)page->private;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 89b208201783..deee00ca8e6d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7894,6 +7894,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
struct bio_vec *bvec;
struct extent_io_tree *io_tree, *failure_tree;
int i;
+ struct bvec_chunk_iter citer;
if (bio->bi_status)
goto end;
@@ -7905,7 +7906,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
done->uptodate = 1;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree,
io_tree, done->start, bvec->bv_page,
btrfs_ino(BTRFS_I(inode)), 0);
@@ -7984,6 +7985,7 @@ static void btrfs_retry_endio(struct bio *bio)
int uptodate;
int ret;
int i;
+ struct bvec_chunk_iter citer;
if (bio->bi_status)
goto end;
@@ -7997,7 +7999,7 @@ static void btrfs_retry_endio(struct bio *bio)
failure_tree = &BTRFS_I(inode)->io_failure_tree;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
bvec->bv_offset, done->start,
bvec->bv_len);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 5e4ad134b9ad..c9f21039e1db 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1463,10 +1463,11 @@ static void set_bio_pages_uptodate(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
ASSERT(!bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
SetPageUptodate(bvec->bv_page);
}
--
2.9.5
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Signed-off-by: Ming Lei <[email protected]>
---
fs/f2fs/data.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 02237d4d91f5..00ee386b902c 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -54,6 +54,7 @@ static void f2fs_read_end_io(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
#ifdef CONFIG_F2FS_FAULT_INJECTION
if (time_to_inject(F2FS_P_SB(bio_first_page_all(bio)), FAULT_IO)) {
@@ -71,7 +72,7 @@ static void f2fs_read_end_io(struct bio *bio)
}
}
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
if (!bio->bi_status) {
@@ -91,8 +92,9 @@ static void f2fs_write_end_io(struct bio *bio)
struct f2fs_sb_info *sbi = bio->bi_private;
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
enum count_type type = WB_DATA_TYPE(page);
@@ -267,6 +269,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io,
struct bio_vec *bvec;
struct page *target;
int i;
+ struct bvec_chunk_iter citer;
if (!io->bio)
return false;
@@ -274,7 +277,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io,
if (!inode && !ino)
return true;
- bio_for_each_segment_all(bvec, io->bio, i) {
+ bio_for_each_chunk_segment_all(bvec, io->bio, i, citer) {
if (bvec->bv_page->mapping)
target = bvec->bv_page;
--
2.9.5
One use case (DM) needs to clone a bio chunk by chunk, so introduce this
API.
Signed-off-by: Ming Lei <[email protected]>
---
block/bio.c | 56 +++++++++++++++++++++++++++++++++++++++--------------
include/linux/bio.h | 1 +
2 files changed, 43 insertions(+), 14 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index cb0f46e2752b..60219f82ddab 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -644,21 +644,13 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);
-/**
- * bio_clone_bioset - clone a bio
- * @bio_src: bio to clone
- * @gfp_mask: allocation priority
- * @bs: bio_set to allocate from
- *
- * Clone bio. Caller will own the returned bio, but not the actual data it
- * points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
- struct bio_set *bs)
+static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, bool seg)
{
struct bvec_iter iter;
struct bio_vec bv;
struct bio *bio;
+ int nr_vecs = seg ? bio_segments(bio_src) : bio_chunks(bio_src);
/*
* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -682,7 +674,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
* __bio_clone_fast() anyways.
*/
- bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
+ bio = bio_alloc_bioset(gfp_mask, nr_vecs, bs);
if (!bio)
return NULL;
bio->bi_disk = bio_src->bi_disk;
@@ -700,8 +692,13 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
break;
default:
- bio_for_each_segment(bv, bio_src, iter)
- bio->bi_io_vec[bio->bi_vcnt++] = bv;
+ if (seg) {
+ bio_for_each_segment(bv, bio_src, iter)
+ bio->bi_io_vec[bio->bi_vcnt++] = bv;
+ } else {
+ bio_for_each_chunk(bv, bio_src, iter)
+ bio->bi_io_vec[bio->bi_vcnt++] = bv;
+ }
break;
}
@@ -719,9 +716,40 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
return bio;
}
+
+/**
+ * bio_clone_bioset - clone a bio
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs)
+{
+ return __bio_clone_bioset(bio_src, gfp_mask, bs, true);
+}
EXPORT_SYMBOL(bio_clone_bioset);
/**
+ * bio_clone_chunk_bioset - clone a bio chunk by chunk
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_chunk_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs)
+{
+ return __bio_clone_bioset(bio_src, gfp_mask, bs, false);
+}
+EXPORT_SYMBOL(bio_clone_chunk_bioset);
+
+/**
* bio_add_pc_page - attempt to add page to bio
* @q: the target queue
* @bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 13fd7bc30390..0fa1035dde38 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -483,6 +483,7 @@ extern void bio_put(struct bio *);
extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
+extern struct bio *bio_clone_chunk_bioset(struct bio *, gfp_t, struct bio_set *bs);
extern struct bio_set fs_bio_set;
--
2.9.5
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Given that the bvec can't be changed under bio_for_each_chunk_segment_all(),
this patch marks the bvec parameter of xfs_finish_page_writeback() as 'const'.
Signed-off-by: Ming Lei <[email protected]>
---
fs/xfs/xfs_aops.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ca6903726689..c134db97911d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -107,7 +107,7 @@ xfs_find_daxdev_for_inode(
static void
xfs_finish_page_writeback(
struct inode *inode,
- struct bio_vec *bvec,
+ const struct bio_vec *bvec,
int error)
{
struct buffer_head *head = page_buffers(bvec->bv_page), *bh = head;
@@ -169,6 +169,7 @@ xfs_destroy_ioend(
for (bio = &ioend->io_inline_bio; bio; bio = next) {
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
/*
* For the last bio, bi_private points to the ioend, so we
@@ -180,7 +181,7 @@ xfs_destroy_ioend(
next = bio->bi_private;
/* walk each page on bio, ending page IO on them */
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
xfs_finish_page_writeback(inode, bvec, error);
bio_put(bio);
--
2.9.5
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Signed-off-by: Ming Lei <[email protected]>
---
fs/exofs/ore.c | 3 ++-
fs/exofs/ore_raid.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index ddbf87246898..fe2eb28de358 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -406,8 +406,9 @@ static void _clear_bio(struct bio *bio)
{
struct bio_vec *bv;
unsigned i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_segment_all(bv, bio, i, citer) {
unsigned this_count = bv->bv_len;
if (likely(PAGE_SIZE == this_count))
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 27cbdb697649..4da3065ce976 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -433,11 +433,12 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
/* loop on all devices all pages */
for (d = 0; d < ios->numdevs; d++) {
struct bio *bio = ios->per_dev[d].bio;
+ struct bvec_chunk_iter citer;
if (!bio)
continue;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_segment_all(bv, bio, i, citer) {
struct page *page = bv->bv_page;
SetPageUptodate(page);
--
2.9.5
In bch_bio_alloc_pages(), bio_for_each_chunk_all() is fine because this
helper can only be used on a freshly allocated bio.
For the other cases, we convert to bio_for_each_chunk_segment_all() since
they don't need to update the bvec table.
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Signed-off-by: Ming Lei <[email protected]>
---
drivers/md/bcache/btree.c | 3 ++-
drivers/md/bcache/util.c | 2 +-
drivers/md/dm-crypt.c | 3 ++-
drivers/md/raid1.c | 3 ++-
4 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a0968c04e21..dc0747c37bdf 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -423,8 +423,9 @@ static void do_btree_node_write(struct btree *b)
int j;
struct bio_vec *bv;
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bv, b->bio, j)
+ bio_for_each_chunk_segment_all(bv, b->bio, j, citer)
memcpy(page_address(bv->bv_page),
base + j * PAGE_SIZE, PAGE_SIZE);
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index fc479b026d6d..2f05199f7edb 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -268,7 +268,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_all(bv, bio, i) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index da02f4d8e4b9..637ef1b1dc43 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1450,8 +1450,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
{
unsigned int i;
struct bio_vec *bv;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bv, clone, i) {
+ bio_for_each_chunk_segment_all(bv, clone, i, citer) {
BUG_ON(!bv->bv_page);
mempool_free(bv->bv_page, &cc->page_pool);
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bad28520719b..2a4f1037c680 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2116,13 +2116,14 @@ static void process_checks(struct r1bio *r1_bio)
struct page **spages = get_resync_pages(sbio)->pages;
struct bio_vec *bi;
int page_len[RESYNC_PAGES] = { 0 };
+ struct bvec_chunk_iter citer;
if (sbio->bi_end_io != end_sync_read)
continue;
/* Now we can 'fixup' the error value */
sbio->bi_status = 0;
- bio_for_each_segment_all(bi, sbio, j)
+ bio_for_each_chunk_segment_all(bi, sbio, j, citer)
page_len[j] = bi->bv_len;
if (!status) {
--
2.9.5
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Given that the bvec can't be changed inside bio_for_each_chunk_segment_all(),
this patch marks the bvec parameter of gfs2_end_log_write_bh() as 'const'.
Signed-off-by: Ming Lei <[email protected]>
---
fs/gfs2/lops.c | 6 ++++--
fs/gfs2/meta_io.c | 3 ++-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 4d6567990baf..e48f215006dd 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -168,7 +168,8 @@ u64 gfs2_log_bmap(struct gfs2_sbd *sdp)
* that is pinned in the pagecache.
*/
-static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
+static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
+ const struct bio_vec *bvec,
blk_status_t error)
{
struct buffer_head *bh, *next;
@@ -207,6 +208,7 @@ static void gfs2_end_log_write(struct bio *bio)
struct bio_vec *bvec;
struct page *page;
int i;
+ struct bvec_chunk_iter citer;
if (bio->bi_status) {
fs_err(sdp, "Error %d writing to journal, jid=%u\n",
@@ -214,7 +216,7 @@ static void gfs2_end_log_write(struct bio *bio)
wake_up(&sdp->sd_logd_waitq);
}
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
page = bvec->bv_page;
if (page_has_buffers(page))
gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 52de1036d9f9..1448f42f9c91 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -190,8 +190,9 @@ static void gfs2_meta_read_endio(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
struct buffer_head *bh = page_buffers(page);
unsigned int len = bvec->bv_len;
--
2.9.5
This patch introduces bio_for_each_chunk_all() and
bio_for_each_chunk_segment_all(), which replace the current
bio_for_each_segment_all().
bio_for_each_chunk_all() iterates chunk by chunk, where each chunk may span
multiple pages.
bio_for_each_chunk_segment_all() iterates segment by segment, where each
segment lies within a single page.
Using bio_for_each_chunk_segment_all() requires one extra 24-byte local
variable (struct bvec_chunk_iter).
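The way one multipage chunk is walked as single-page segments can be
modelled in userspace roughly as follows. This is a sketch under stated
assumptions, not the kernel macros themselves: struct sketch_bvec,
struct sketch_chunk_iter and the sketch_*() functions are hypothetical, a
page index stands in for 'struct page *', a 4K page size is assumed, and
the chunk's offset is assumed to be smaller than the page size.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Userspace stand-in for struct bio_vec; a page index replaces struct page *. */
struct sketch_bvec {
	size_t page;
	unsigned int len;
	unsigned int offset;
};

/* Iterator state, loosely modelled on struct bvec_chunk_iter. */
struct sketch_chunk_iter {
	struct sketch_bvec seg;	/* current single-page segment */
	unsigned int done;	/* bytes of the chunk already visited */
	int started;		/* set once the first segment was produced */
};

static void sketch_iter_init(struct sketch_chunk_iter *it)
{
	it->done = 0;
	it->started = 0;
}

/* Fill it->seg with the next segment; return 0 when the chunk is exhausted. */
static int sketch_next_segment(const struct sketch_bvec *chunk,
			       struct sketch_chunk_iter *it)
{
	unsigned int room, left;

	assert(chunk->offset < SKETCH_PAGE_SIZE);

	if (it->started)
		it->done += it->seg.len;	/* account for the previous segment */
	if (it->done >= chunk->len)
		return 0;

	if (it->started) {
		/* Later segments start at the top of the next page. */
		it->seg.page += 1;
		it->seg.offset = 0;
	} else {
		/* The first segment inherits the chunk's page and offset. */
		it->seg.page = chunk->page;
		it->seg.offset = chunk->offset;
		it->started = 1;
	}

	room = SKETCH_PAGE_SIZE - it->seg.offset;
	left = chunk->len - it->done;
	it->seg.len = room < left ? room : left;
	return 1;
}

/* Visit every single-page segment of every chunk and count them. */
static unsigned int sketch_count_segments(const struct sketch_bvec *table,
					  unsigned int vcnt)
{
	struct sketch_chunk_iter it;
	unsigned int i, segs = 0;

	for (i = 0; i < vcnt; i++) {
		sketch_iter_init(&it);
		while (sketch_next_segment(&table[i], &it))
			segs++;
	}
	return segs;
}
```

A 9000-byte chunk starting 1000 bytes into page 7 is visited as three
segments of 3096, 4096 and 1808 bytes on pages 7, 8 and 9.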
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 13 +++++++++++++
include/linux/bvec.h | 31 +++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0fa1035dde38..f21384be9b51 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,6 +168,19 @@ static inline bool bio_full(struct bio *bio)
#define bio_for_each_segment_all(bvl, bio, i) \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
+#define bio_for_each_chunk_all(bvl, bio, i) \
+ bio_for_each_segment_all(bvl, bio, i)
+
+#define chunk_for_each_segment(bv, bvl, i, citer) \
+ for (bv = bvec_init_chunk_iter(&citer); \
+ (citer.done < (bvl)->bv_len) && \
+ ((chunk_next_segment((bvl), &citer)), 1); \
+ citer.done += bv->bv_len, i += 1)
+
+#define bio_for_each_chunk_segment_all(bvl, bio, i, citer) \
+ for (i = 0, citer.idx = 0; citer.idx < (bio)->bi_vcnt; citer.idx++) \
+ chunk_for_each_segment(bvl, &((bio)->bi_io_vec[citer.idx]), i, citer)
+
static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
unsigned bytes, bool chunk)
{
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index aac75d87d884..d4eaa0c26bb5 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -84,6 +84,12 @@ struct bvec_iter {
current bvec */
};
+struct bvec_chunk_iter {
+ struct bio_vec bv;
+ int idx;
+ unsigned done;
+};
+
/*
* various member access, note that bio_data should of course not be used
* on highmem page vectors
@@ -219,6 +225,31 @@ static inline bool bvec_iter_chunk_advance(const struct bio_vec *bv,
.bi_bvec_done = 0, \
}
+static inline struct bio_vec *bvec_init_chunk_iter(struct bvec_chunk_iter *citer)
+{
+ citer->bv.bv_page = NULL;
+ citer->done = 0;
+
+ return &citer->bv;
+}
+
+/* used for chunk_for_each_segment */
+static inline void chunk_next_segment(const struct bio_vec *chunk,
+ struct bvec_chunk_iter *iter)
+{
+ struct bio_vec *bv = &iter->bv;
+
+ if (bv->bv_page) {
+ bv->bv_page += 1;
+ bv->bv_offset = 0;
+ } else {
+ bv->bv_page = chunk->bv_page;
+ bv->bv_offset = chunk->bv_offset;
+ }
+ bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
+ chunk->bv_len - iter->done);
+}
+
/*
* Get the last singlepage segment from the multipage bvec and store it
* in @seg
--
2.9.5
We have to convert to bio_for_each_chunk_segment_all() to iterate page by
page, because bio_for_each_segment_all() can't be used any more after
multipage bvec is enabled.
Signed-off-by: Ming Lei <[email protected]>
---
block/bio.c | 27 ++++++++++++++++++---------
block/blk-zoned.c | 5 +++--
block/bounce.c | 6 ++++--
include/linux/bio.h | 3 ++-
4 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 60219f82ddab..276fc35ec559 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1146,8 +1146,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
{
int i;
struct bio_vec *bvec;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
ssize_t ret;
ret = copy_page_from_iter(bvec->bv_page,
@@ -1177,8 +1178,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
{
int i;
struct bio_vec *bvec;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
ssize_t ret;
ret = copy_page_to_iter(bvec->bv_page,
@@ -1200,8 +1202,9 @@ void bio_free_pages(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
__free_page(bvec->bv_page);
}
EXPORT_SYMBOL(bio_free_pages);
@@ -1367,6 +1370,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+ struct bvec_chunk_iter citer;
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1440,7 +1444,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
out_unmap:
- bio_for_each_segment_all(bvec, bio, j) {
+ bio_for_each_chunk_segment_all(bvec, bio, j, citer) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1451,11 +1455,12 @@ static void __bio_unmap_user(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
/*
* make sure we dirty pages we wrote to
*/
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
@@ -1547,8 +1552,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1657,8 +1663,9 @@ void bio_set_pages_dirty(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1669,8 +1676,9 @@ static void bio_release_pages(struct bio *bio)
{
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
put_page(bvec->bv_page);
}
@@ -1717,8 +1725,9 @@ void bio_check_pages_dirty(struct bio *bio)
struct bio_vec *bvec;
unsigned long flags;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
goto defer;
}
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 3d08dc84db16..9223666c845d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -123,6 +123,7 @@ int blkdev_report_zones(struct block_device *bdev,
unsigned int ofst;
void *addr;
int ret;
+ struct bvec_chunk_iter citer;
if (!q)
return -ENXIO;
@@ -190,7 +191,7 @@ int blkdev_report_zones(struct block_device *bdev,
n = 0;
nz = 0;
nr_rep = 0;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_segment_all(bv, bio, i, citer) {
if (!bv->bv_page)
break;
@@ -223,7 +224,7 @@ int blkdev_report_zones(struct block_device *bdev,
*nr_zones = nz;
out:
- bio_for_each_segment_all(bv, bio, i)
+ bio_for_each_chunk_segment_all(bv, bio, i, citer)
__free_page(bv->bv_page);
bio_put(bio);
diff --git a/block/bounce.c b/block/bounce.c
index fd31347b7836..c6af0bd29ec9 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -146,11 +146,12 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool)
struct bio_vec *bvec, orig_vec;
int i;
struct bvec_iter orig_iter = bio_orig->bi_iter;
+ struct bvec_chunk_iter citer;
/*
* free up bounce indirect pages used
*/
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
orig_vec = bio_iter_iovec(bio_orig, orig_iter);
if (bvec->bv_page != orig_vec.bv_page) {
dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
@@ -206,6 +207,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
bool bounce = false;
int sectors = 0;
bool passthrough = bio_is_passthrough(*bio_orig);
+ struct bvec_chunk_iter citer;
bio_for_each_segment(from, *bio_orig, iter) {
if (i++ < BIO_MAX_PAGES)
@@ -225,7 +227,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
bio = bio_clone_bioset(*bio_orig, GFP_NOIO, passthrough ? NULL :
&bounce_bio_set);
- bio_for_each_segment_all(to, bio, i) {
+ bio_for_each_chunk_segment_all(to, bio, i, citer) {
struct page *page = to->bv_page;
if (page_to_pfn(page) <= q->limits.bounce_pfn)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f21384be9b51..c22b8be961ce 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -374,10 +374,11 @@ static inline unsigned bio_pages_all(struct bio *bio)
{
unsigned i;
struct bio_vec *bv;
+ struct bvec_chunk_iter citer;
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
- bio_for_each_segment_all(bv, bio, i)
+ bio_for_each_chunk_segment_all(bv, bio, i, citer)
;
return i;
}
--
2.9.5
This patch pulls the trigger for multipage bvecs.
Now any request queue which supports queue cluster will see multipage
bvecs.
Signed-off-by: Ming Lei <[email protected]>
---
block/bio.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 276fc35ec559..284085ab97e7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -870,12 +870,23 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
if (bio->bi_vcnt > 0) {
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
- if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
- bv->bv_len += len;
- bio->bi_iter.bi_size += len;
- return true;
- }
+ struct request_queue *q = NULL;
+
+ if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len)
+ goto merge;
+
+ if (bio->bi_disk)
+ q = bio->bi_disk->queue;
+
+ /* disable multipage bvec too if cluster isn't enabled */
+ if (!q || !blk_queue_cluster(q) ||
+ (bvec_to_phys(bv) + bv->bv_len !=
+ page_to_phys(page) + off))
+ return false;
+ merge:
+ bv->bv_len += len;
+ bio->bi_iter.bi_size += len;
+ return true;
}
return false;
}
--
2.9.5
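For illustration only, the merge decision in the hunk above can be modelled in userspace. This is a sketch, not the kernel code: struct model_bvec, model_bvec_to_phys() and model_try_merge() are hypothetical names, a page is represented by its page frame number, and the "cluster" flag stands in for "q && blk_queue_cluster(q)".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for struct bio_vec: the page is identified by
 * its page frame number instead of a struct page pointer. */
struct model_bvec {
	uint64_t pfn;    /* first page covered by the vector */
	uint32_t len;    /* length in bytes, may span several pages */
	uint32_t offset; /* byte offset into the first page */
};

/* Physical address of the first byte covered by the bvec. */
static uint64_t model_bvec_to_phys(const struct model_bvec *bv)
{
	return bv->pfn * MODEL_PAGE_SIZE + bv->offset;
}

/*
 * Try to append (pfn, off, len) to the last bvec. Two cases merge:
 *  - same page and the new data starts right at the end of the bvec;
 *  - clustering is allowed and the new data starts at the physical
 *    address immediately following the end of the bvec.
 */
static bool model_try_merge(struct model_bvec *bv, bool cluster,
			    uint64_t pfn, uint32_t off, uint32_t len)
{
	assert(len > 0 && off < MODEL_PAGE_SIZE);

	bool same_page = pfn == bv->pfn && off == bv->offset + bv->len;
	bool contiguous = cluster &&
		model_bvec_to_phys(bv) + bv->len ==
		pfn * MODEL_PAGE_SIZE + off;

	if (!same_page && !contiguous)
		return false;

	bv->len += len;
	return true;
}
```

In this model, appending page 11 at offset 0 to a bvec covering all of page 10 merges only when clustering is enabled, while appending more bytes of the same page merges either way.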
Now multipage bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 8 --------
1 file changed, 8 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 69ef05dc7019..58838dc12d69 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -38,15 +38,7 @@
#define BIO_BUG_ON
#endif
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES HPAGE_PMD_NR
-#else
#define BIO_MAX_PAGES 256
-#endif
-#else
-#define BIO_MAX_PAGES 256
-#endif
#define bio_prio(bio) (bio)->bi_ioprio
#define bio_set_prio(bio, prio) ((bio)->bi_ioprio = prio)
--
2.9.5
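A rough sketch of the arithmetic behind this change, assuming 4K pages and a bvec entry made of one pointer plus two 32-bit fields (struct model_bvec and model_bvec_table_pages() are hypothetical names used only here): a table with 512 single-page entries no longer fits in one 4K page, whereas a physically contiguous huge page needs a single multipage entry.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Assumed layout of one bvec table entry: a page pointer plus two
 * 32-bit fields (16 bytes on a typical 64-bit build). */
struct model_bvec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Number of 4K pages needed to hold a bvec table with nr_vecs entries. */
static size_t model_bvec_table_pages(size_t nr_vecs)
{
	assert(nr_vecs > 0);

	size_t bytes = nr_vecs * sizeof(struct model_bvec);

	return (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

Under these assumptions, model_bvec_table_pages(256) is one page and model_bvec_table_pages(512) is two, which matches the cover letter's remark about needing at least two 4K pages for a 512-entry table.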
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Signed-off-by: Ming Lei <[email protected]>
---
fs/block_dev.c | 6 ++++--
fs/crypto/bio.c | 3 ++-
fs/direct-io.c | 4 +++-
fs/iomap.c | 3 ++-
fs/mpage.c | 3 ++-
5 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bef6934b6189..6726f8297a7b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -197,6 +197,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
ssize_t ret;
blk_qc_t qc;
int i;
+ struct bvec_chunk_iter citer;
if ((pos | iov_iter_alignment(iter)) &
(bdev_logical_block_size(bdev) - 1))
@@ -242,7 +243,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
}
__set_current_state(TASK_RUNNING);
- bio_for_each_segment_all(bvec, &bio, i) {
+ bio_for_each_chunk_segment_all(bvec, &bio, i, citer) {
if (should_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
put_page(bvec->bv_page);
@@ -309,8 +310,9 @@ static void blkdev_bio_end_io(struct bio *bio)
} else {
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
put_page(bvec->bv_page);
bio_put(bio);
}
diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index 0d5e6a569d58..13bcbdbf3440 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -37,8 +37,9 @@ static void completion_pages(struct work_struct *work)
struct bio *bio = ctx->r.bio;
struct bio_vec *bv;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_segment_all(bv, bio, i, citer) {
struct page *page = bv->bv_page;
int ret = fscrypt_decrypt_page(page->mapping->host, page,
PAGE_SIZE, 0, page->index);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 093fb54cd316..8f7fd985450a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -551,7 +551,9 @@ static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
- bio_for_each_segment_all(bvec, bio, i) {
+ struct bvec_chunk_iter citer;
+
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
if (dio->op == REQ_OP_READ && !PageCompound(page) &&
diff --git a/fs/iomap.c b/fs/iomap.c
index 206539d369a8..dbc35c40a1c4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -934,8 +934,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
} else {
struct bio_vec *bvec;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i)
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer)
put_page(bvec->bv_page);
bio_put(bio);
}
diff --git a/fs/mpage.c b/fs/mpage.c
index b7e7f570733a..78b372607650 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,8 +48,9 @@ static void mpage_end_io(struct bio *bio)
{
struct bio_vec *bv;
int i;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_segment_all(bv, bio, i, citer) {
struct page *page = bv->bv_page;
page_endio(page, op_is_write(bio_op(bio)),
blk_status_to_errno(bio->bi_status));
--
2.9.5
bio_for_each_segment_all() can't be used any more after multipage bvec is
enabled, so we have to convert to bio_for_each_chunk_segment_all().
Signed-off-by: Ming Lei <[email protected]>
---
fs/ext4/page-io.c | 3 ++-
fs/ext4/readpage.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index db7590178dfc..d4e737e51176 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -63,8 +63,9 @@ static void ext4_finish_bio(struct bio *bio)
{
int i;
struct bio_vec *bvec;
+ struct bvec_chunk_iter citer;
- bio_for_each_segment_all(bvec, bio, i) {
+ bio_for_each_chunk_segment_all(bvec, bio, i, citer) {
struct page *page = bvec->bv_page;
#ifdef CONFIG_EXT4_FS_ENCRYPTION
struct page *data_page = NULL;
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 9ffa6fad18db..234e4486a90c 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -72,6 +72,7 @@ static void mpage_end_io(struct bio *bio)
{
struct bio_vec *bv;
int i;
+ struct bvec_chunk_iter citer;
if (ext4_bio_encrypted(bio)) {
if (bio->bi_status) {
@@ -81,7 +82,7 @@ static void mpage_end_io(struct bio *bio)
return;
}
}
- bio_for_each_segment_all(bv, bio, i) {
+ bio_for_each_chunk_segment_all(bv, bio, i, citer) {
struct page *page = bv->bv_page;
if (!bio->bi_status) {
--
2.9.5
No one uses it any more, so kill it now.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/bio.h | 5 +----
include/linux/bvec.h | 2 +-
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c22b8be961ce..69ef05dc7019 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -165,11 +165,8 @@ static inline bool bio_full(struct bio *bio)
* drivers should _never_ use the all version - the bio may have been split
* before it got to the driver and the driver won't own all of it
*/
-#define bio_for_each_segment_all(bvl, bio, i) \
- for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
-
#define bio_for_each_chunk_all(bvl, bio, i) \
- bio_for_each_segment_all(bvl, bio, i)
+ for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
#define chunk_for_each_segment(bv, bvl, i, citer) \
for (bv = bvec_init_chunk_iter(&citer); \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d4eaa0c26bb5..58267bde111e 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -47,7 +47,7 @@
* page, so we keep the sp interface not changed, for example,
* bio_for_each_segment() still returns bvec with single page
*
- * - bio_for_each_segment_all() will be changed to return singlepage
+ * - bio_for_each_chunk_all() will be changed to return singlepage
* bvec too
*
* - during iterating, iterator variable(struct bvec_iter) is always
--
2.9.5
Now that multipage bvec is supported, some helpers return data page by
page while others return it segment by segment. This patch documents
their usage to help us use them correctly.
Signed-off-by: Ming Lei <[email protected]>
---
Documentation/block/biovecs.txt | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..3ab72566141f 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,33 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=================
+
+* The following helpers, whose names have the suffix "_all", can only be
+used on non-BIO_CLONED bios. They are usually used by filesystem code.
+Drivers shouldn't use them because the bio may have been split before it
+got to the driver:
+
+ bio_for_each_chunk_segment_all()
+ bio_for_each_chunk_all()
+ bio_pages_all()
+ bio_first_bvec_all()
+ bio_first_page_all()
+ bio_last_bvec_all()
+
+* The following helpers iterate bio page by page, and the local variable of
+'struct bio_vec' or the reference records single page io vector during the
+iteration:
+
+ bio_for_each_segment()
+ bio_for_each_chunk_segment_all()
+
+* The following helpers iterate bio chunk by chunk, and each chunk may
+include multiple physically contiguous pages, and the local variable of
+'struct bio_vec' or the reference records multi page io vector during the
+iteration:
+
+ bio_for_each_chunk()
+ bio_for_each_chunk_all()
--
2.9.5
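The difference between chunk-by-chunk and page-by-page iteration documented above can be sketched in userspace. This is only a model under stated assumptions: struct model_bvec and the model_* functions are hypothetical, pages are 4K, and a page is identified by its page frame number. Each multipage "chunk" is cut into single-page pieces, which is what the page-by-page helpers are described as returning.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for a (possibly multipage) bvec. */
struct model_bvec {
	unsigned long pfn;   /* first page of the chunk */
	unsigned int len;    /* bytes covered, may span pages */
	unsigned int offset; /* byte offset into the first page */
};

/* Number of single-page segments contained in one chunk. */
static unsigned int model_chunk_pages(const struct model_bvec *bv)
{
	assert(bv->len > 0 && bv->offset < MODEL_PAGE_SIZE);
	return (bv->offset + bv->len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Return the idx-th single-page segment of a chunk. */
static struct model_bvec model_chunk_segment(const struct model_bvec *bv,
					     unsigned int idx)
{
	assert(idx < model_chunk_pages(bv));

	/* Only the first page starts at a non-zero offset. */
	unsigned int start = idx == 0 ? bv->offset : 0;
	/* Bytes of the chunk that live in pages before this one. */
	unsigned int consumed =
		idx == 0 ? 0 : idx * MODEL_PAGE_SIZE - bv->offset;
	unsigned int left = bv->len - consumed;
	unsigned int room = MODEL_PAGE_SIZE - start;
	struct model_bvec seg;

	seg.pfn = bv->pfn + idx;
	seg.offset = start;
	seg.len = left < room ? left : room;
	return seg;
}

/* Pages visited when walking a whole table page by page. */
static unsigned int model_count_pages(const struct model_bvec *tbl, size_t n)
{
	unsigned int pages = 0;

	for (size_t i = 0; i < n; i++)
		pages += model_chunk_pages(&tbl[i]);
	return pages;
}
```

For a chunk starting at offset 512 with length 8192, the model yields three single-page segments of 3584, 4096 and 512 bytes, while chunk-by-chunk iteration would see one vector.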
On 2018/6/9 8:30 PM, Ming Lei wrote:
> In bch_bio_alloc_pages(), bio_for_each_chunk_all() is fine because this
> helper can only be used on a freshly allocated bio.
>
> For other cases, we convert to bio_for_each_chunk_segment_all() since they
> don't need to update the bvec table.
>
> bio_for_each_segment_all() can't be used any more after multipage bvec is
> enabled, so we have to convert to bio_for_each_chunk_segment_all().
>
> Signed-off-by: Ming Lei <[email protected]>
I am OK with the bcache part. Acked-by: Coly Li <[email protected]>
Thanks.
Coly Li
> ---
> drivers/md/bcache/btree.c | 3 ++-
> drivers/md/bcache/util.c | 2 +-
> drivers/md/dm-crypt.c | 3 ++-
> drivers/md/raid1.c | 3 ++-
> 4 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 2a0968c04e21..dc0747c37bdf 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -423,8 +423,9 @@ static void do_btree_node_write(struct btree *b)
> int j;
> struct bio_vec *bv;
> void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
> + struct bvec_chunk_iter citer;
>
> - bio_for_each_segment_all(bv, b->bio, j)
> + bio_for_each_chunk_segment_all(bv, b->bio, j, citer)
> memcpy(page_address(bv->bv_page),
> base + j * PAGE_SIZE, PAGE_SIZE);
>
> diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> index fc479b026d6d..2f05199f7edb 100644
> --- a/drivers/md/bcache/util.c
> +++ b/drivers/md/bcache/util.c
> @@ -268,7 +268,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
> int i;
> struct bio_vec *bv;
>
> - bio_for_each_segment_all(bv, bio, i) {
> + bio_for_each_chunk_all(bv, bio, i) {
> bv->bv_page = alloc_page(gfp_mask);
> if (!bv->bv_page) {
> while (--bv >= bio->bi_io_vec)
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index da02f4d8e4b9..637ef1b1dc43 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1450,8 +1450,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
> {
> unsigned int i;
> struct bio_vec *bv;
> + struct bvec_chunk_iter citer;
>
> - bio_for_each_segment_all(bv, clone, i) {
> + bio_for_each_chunk_segment_all(bv, clone, i, citer) {
> BUG_ON(!bv->bv_page);
> mempool_free(bv->bv_page, &cc->page_pool);
> }
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index bad28520719b..2a4f1037c680 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2116,13 +2116,14 @@ static void process_checks(struct r1bio *r1_bio)
> struct page **spages = get_resync_pages(sbio)->pages;
> struct bio_vec *bi;
> int page_len[RESYNC_PAGES] = { 0 };
> + struct bvec_chunk_iter citer;
>
> if (sbio->bi_end_io != end_sync_read)
> continue;
> /* Now we can 'fixup' the error value */
> sbio->bi_status = 0;
>
> - bio_for_each_segment_all(bi, sbio, j)
> + bio_for_each_chunk_segment_all(bi, sbio, j, citer)
> page_len[j] = bi->bv_len;
>
> if (!status) {
>
I think the new naming scheme in this series is a nightmare. It
confuses the heck out of me, and that is despite knowing many bits of
the block layer inside out, and reviewing previous series.
I think we need to take a step back and figure out what names we
want in the end, and how we get there separately.
For the end result using bio_for_each_page in some form for the per-page
iteration seems like the only sensible idea, as that is what it does.
For the bio-vec iteration I'm fine with either bio_for_each_bvec as that
exactly explains what it does, or bio_for_each_segment to keep the
change at a minimum.
And in terms of how to get there: maybe we need to move all the drivers
and file systems to the new names first before the actual changes to
document all the intent. For that using the bio_for_each_bvec variant
might be beneficial as it allows us to easily see the difference between
old unconverted code and the already converted one.
I think both callers would be just as easy to understand by using
nth_page() instead of these magic helpers. E.g. for guard_bio_eod:
unsigned offset = (bv.bv_offset + bv.bv_len);
struct page *page = nth_page(bv.bv_page, offset);
zero_user(page, offset & PAGE_MASK, truncated_bytes);
On Mon, Jun 11, 2018 at 10:19:38AM -0700, Christoph Hellwig wrote:
> I think both callers would be just as easy to understand by using
> nth_page() instead of these magic helpers. E.g. for guard_bio_eod:
>
> unsigned offset = (bv.bv_offset + bv.bv_len);
> struct page *page = nth_page(bv.bv_page, offset);
The above lines should have been written as:
struct page *page = nth_page(bv.bv_page, offset / PAGE_SIZE)
but this may cause 'page' to point to the page after bv's last
page if offset == N * PAGE_SIZE.
Thanks,
Ming
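The boundary case discussed above can be checked with plain integer arithmetic. This is a userspace sketch assuming 4K pages; model_page_index_at_end() and model_last_page_index() are hypothetical helpers, and the returned index is what would be passed as the second argument of nth_page().

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u

/*
 * Page index (relative to bv_page) obtained from the end offset
 * bv_offset + bv_len, i.e. from the first byte after the bvec.
 */
static unsigned int model_page_index_at_end(unsigned int bv_offset,
					    unsigned int bv_len)
{
	return (bv_offset + bv_len) / MODEL_PAGE_SIZE;
}

/* Page index of the last byte that actually belongs to the bvec. */
static unsigned int model_last_page_index(unsigned int bv_offset,
					  unsigned int bv_len)
{
	assert(bv_len > 0);
	return (bv_offset + bv_len - 1) / MODEL_PAGE_SIZE;
}
```

With bv_offset == 0 and bv_len == 2 * 4096 the first helper yields index 2, one past the last page of the bvec (index 1). When the end offset is not page aligned, both helpers agree.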
On Mon, Jun 11, 2018 at 09:48:06AM -0700, Christoph Hellwig wrote:
> I think the new naming scheme in this series is a nightmare. It
> confuses the heck out of me, and that is despite knowing many bits of
> the block layer inside out, and reviewing previous series.
In V5, there isn't such issue, since bio_for_each_segment* is renamed
into bio_for_each_page* first before doing the change.
>
> I think we need to take a step back and figure out what names we
> want in the end, and how we get there separately.
Right, I agree, last year I told people that naming may be the biggest
issue for this patchset.
>
> For the end result using bio_for_each_page in some form for the per-page
> iteration seems like the only sensible idea, as that is what it does.
Yeah, I agree, but except for renaming bio_for_each_segment* into
bio_for_each_page* or whatever first, I don't see any way to deal with
it cleanly.
Seems Jens isn't fine with the big renaming, so I followed the suggestion
of using 'chunk' to represent multipage bvec in V6.
>
> For the bio-vec iteration I'm fine with either bio_for_each_bvec as that
> exactly explains what it does, or bio_for_each_segment to keep the
> change at a minimum.
If bio_for_each_segment() is fine, that is basically what this patch is doing,
so could you tell me what the actual naming issue is in V6? Basically,
the name 'chunk' is introduced for multipage bvec.
>
> And in terms of how to get there: maybe we need to move all the drivers
> and file systems to the new names first before the actual changes to
> document all the intent.
That is exactly what I did in V5, but that approach was rejected.
Guys, so what can we do to make progress on this naming issue?
Thanks,
Ming
On Tue, Jun 12, 2018 at 11:42:49AM +0800, Ming Lei wrote:
> On Mon, Jun 11, 2018 at 09:48:06AM -0700, Christoph Hellwig wrote:
> > I think the new naming scheme in this series is a nightmare. It
> > confuses the heck out of me, and that is despite knowing many bits of
> > the block layer inside out, and reviewing previous series.
>
> In V5, there isn't such issue, since bio_for_each_segment* is renamed
> into bio_for_each_page* first before doing the change.
But now we are at V6 where that isn't the case..
> Seems Jens isn't fine with the big renaming, so I followed the suggestion
> of using 'chunk' to represent multipage bvec in V6.
Please don't use chunk. We are iterating over bio_vec structures, while
we have the concept of a chunk size for something else in the block layer,
so this just creates confusion. Nevermind names like
bio_for_each_chunk_segment_all which just double the confusion.
So assuming that bio_for_each_segment is set to stay as-is for now,
here is a proposal for sanity by using the vec name.
OLD: bio_for_each_segment
NEW(page): bio_for_each_segment, to be renamed bio_for_each_page later
NEW(bvec): bio_for_each_bvec
OLD: __bio_for_each_segment
NEW(page): __bio_for_each_segment, to be renamed __bio_for_each_page later
NEW(bvec): (no bvec version needed)
OLD: bio_for_each_segment_all
NEW(page): bio_for_each_page_all (needs updated prototype anyway)
NEW(bvec): (no bvec version needed once bcache is fixed up)
Given that we have a single, dubious user of bio_pages_all I'd rather
see it as an opencoded bio_for_each_ loop in the caller.
> +static inline unsigned bio_chunks(struct bio *bio)
> +{
> + unsigned chunks = 0;
> + struct bio_vec bv;
> + struct bvec_iter iter;
>
> - return segs;
> + /*
> + * We special case discard/write same/write zeroes, because they
> + * interpret bi_size differently:
> + */
> + switch (bio_op(bio)) {
> + case REQ_OP_DISCARD:
> + case REQ_OP_SECURE_ERASE:
> + case REQ_OP_WRITE_ZEROES:
> + return 0;
> + case REQ_OP_WRITE_SAME:
> + return 1;
> + default:
> + bio_for_each_chunk(bv, bio, iter)
> + chunks++;
> + return chunks;
Shouldn't this just return bio->bi_vcnt?
On Sat, Jun 09, 2018 at 08:29:57PM +0800, Ming Lei wrote:
> There are still cases in which rq_for_each_chunk() is required, for
> example, loop.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/blkdev.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bca3a92eb55f..4eaba73c784a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -941,6 +941,10 @@ struct req_iterator {
> __rq_for_each_bio(_iter.bio, _rq) \
> bio_for_each_segment(bvl, _iter.bio, _iter.iter)
>
> +#define rq_for_each_chunk(bvl, _rq, _iter) \
> + __rq_for_each_bio(_iter.bio, _rq) \
> + bio_for_each_chunk(bvl, _iter.bio, _iter.iter)
We have a single user of this in the loop driver. I'd rather
see the obvious loop open coded.
On Sat, Jun 09, 2018 at 08:29:59PM +0800, Ming Lei wrote:
> There is one use case(DM) which requires to clone bio chunk by
> chunk, so introduce this API.
I don't think DM is the special case here. The special case is the
bounce code that only wants single page bios. Between that, and the
fact that we only have two callers and one of them is inside the
block layer I would suggest to fold in the following patch to make
bio_clone_bioset clone in multi-page bvecs and make the bounce code
use the low-level interface directly:
diff --git a/block/bio.c b/block/bio.c
index 284085ab97e7..cef45c8d0a19 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -644,13 +644,14 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);
-static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
- struct bio_set *bs, bool seg)
+struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, bool single_page_only)
{
struct bvec_iter iter;
struct bio_vec bv;
struct bio *bio;
- int nr_vecs = seg ? bio_segments(bio_src) : bio_chunks(bio_src);
+ int nr_vecs = single_page_only ?
+ bio_segments(bio_src) : bio_chunks(bio_src);
/*
* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -692,7 +693,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
break;
default:
- if (seg) {
+ if (single_page_only) {
bio_for_each_segment(bv, bio_src, iter)
bio->bi_io_vec[bio->bi_vcnt++] = bv;
} else {
@@ -728,26 +729,10 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
*/
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
struct bio_set *bs)
-{
- return __bio_clone_bioset(bio_src, gfp_mask, bs, true);
-}
-EXPORT_SYMBOL(bio_clone_bioset);
-
-/**
- * bio_clone_seg_bioset - clone a bio segment by segment
- * @bio_src: bio to clone
- * @gfp_mask: allocation priority
- * @bs: bio_set to allocate from
- *
- * Clone bio. Caller will own the returned bio, but not the actual data it
- * points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_chunk_bioset(struct bio *bio_src, gfp_t gfp_mask,
- struct bio_set *bs)
{
return __bio_clone_bioset(bio_src, gfp_mask, bs, false);
}
-EXPORT_SYMBOL(bio_clone_chunk_bioset);
+EXPORT_SYMBOL(bio_clone_bioset);
/**
* bio_add_pc_page - attempt to add page to bio
diff --git a/block/bounce.c b/block/bounce.c
index c6af0bd29ec9..62dab528dc1b 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -224,8 +224,8 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
generic_make_request(*bio_orig);
*bio_orig = bio;
}
- bio = bio_clone_bioset(*bio_orig, GFP_NOIO, passthrough ? NULL :
- &bounce_bio_set);
+ bio = __bio_clone_bioset(*bio_orig, GFP_NOIO, passthrough ? NULL :
+ &bounce_bio_set, true);
bio_for_each_chunk_segment_all(to, bio, i, citer) {
struct page *page = to->bv_page;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 13ca3574d972..98dff36b89a3 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1582,8 +1582,8 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
* the usage of io->orig_bio in dm_remap_zone_report()
* won't be affected by this reassignment.
*/
- struct bio *b = bio_clone_chunk_bioset(bio, GFP_NOIO,
- &md->queue->bio_split);
+ struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+ &md->queue->bio_split);
ci.io->orig_bio = b;
bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
bio_chain(b, bio);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 58838dc12d69..5ccafeadbe95 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -486,7 +486,8 @@ extern void bio_put(struct bio *);
extern void __bio_clone_fast(struct bio *, struct bio *);
extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
-extern struct bio *bio_clone_chunk_bioset(struct bio *, gfp_t, struct bio_set *bs);
+extern struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+ struct bio_set *bs, bool single_page_only);
extern struct bio_set fs_bio_set;
On Wed, Jun 13, 2018 at 07:47:41AM -0700, Christoph Hellwig wrote:
> > +static inline unsigned bio_chunks(struct bio *bio)
> > +{
> > + unsigned chunks = 0;
> > + struct bio_vec bv;
> > + struct bvec_iter iter;
> >
> > - return segs;
> > + /*
> > + * We special case discard/write same/write zeroes, because they
> > + * interpret bi_size differently:
> > + */
> > + switch (bio_op(bio)) {
> > + case REQ_OP_DISCARD:
> > + case REQ_OP_SECURE_ERASE:
> > + case REQ_OP_WRITE_ZEROES:
> > + return 0;
> > + case REQ_OP_WRITE_SAME:
> > + return 1;
> > + default:
> > + bio_for_each_chunk(bv, bio, iter)
> > + chunks++;
> > + return chunks;
>
> Shouldn't this just return bio->bi_vcnt?
No.
bio->bi_vcnt is only for the owner of a bio (the code that originally allocated
it and filled it out) to use, and really the only legit use is
bio_for_each_segment_all() (iterating over segments without using bi_iter
because it's already been iterated to the end), and as a convenience thing for
bio_add_page.
Code that has a bio submitted to it can _not_ use bio->bi_vcnt, it's perfectly
legal for it to be 0 (and it is for e.g. bio splits).
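To make the point about bi_vcnt concrete, here is a userspace model. It is a sketch only: struct model_bio, struct model_iter and model_count_vecs() are hypothetical, and the split bio is modelled as sharing its parent's table with vcnt == 0, as described above. Counting goes through the iterator state and never reads vcnt.

```c
#include <assert.h>

/* Hypothetical bvec: only the length matters for counting. */
struct model_bvec {
	unsigned int len;
};

/* Hypothetical counterpart of the iterator state kept in a bio. */
struct model_iter {
	unsigned int idx;  /* current index into the bvec table */
	unsigned int done; /* bytes already consumed in table[idx] */
	unsigned int size; /* bytes remaining in this bio */
};

struct model_bio {
	const struct model_bvec *table; /* may be shared with a parent */
	unsigned int vcnt;              /* only meaningful to the owner */
	struct model_iter iter;
};

/*
 * Count the vectors a submitted bio covers by walking from its
 * iterator until the remaining size is exhausted. vcnt is ignored.
 */
static unsigned int model_count_vecs(const struct model_bio *bio)
{
	struct model_iter it = bio->iter;
	unsigned int n = 0;

	while (it.size > 0) {
		unsigned int avail = bio->table[it.idx].len - it.done;
		unsigned int step = avail < it.size ? avail : it.size;

		assert(step > 0); /* zero-length vectors are not modelled */
		it.size -= step;
		it.idx++;
		it.done = 0;
		n++;
	}
	return n;
}
```

In this model a bio that starts at index 2 of a four-entry table with 8192 bytes remaining covers two vectors even though its vcnt is 0, so returning vcnt would give the wrong answer for it.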
s/conver/convert/ in the subject here and a couple other patches.
On Sat, Jun 09, 2018 at 08:29:44PM +0800, Ming Lei wrote:
> Hi,
>
> This patchset brings multipage bvec into block layer:
Ming, what's going on with the chunk naming? I haven't been paying attention
because it feels like it's turned into bike shedding, but I just saw something
about a 3rd way of iterating over bios? (page/segment/chunk...?)
On Wed, Jun 13, 2018 at 07:42:53AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 12, 2018 at 11:42:49AM +0800, Ming Lei wrote:
> > On Mon, Jun 11, 2018 at 09:48:06AM -0700, Christoph Hellwig wrote:
> > > I think the new naming scheme in this series is a nightmare. It
> > > confuses the heck out of me, and that is despite knowing many bits of
> > > the block layer inside out, and reviewing previous series.
> >
> > In V5, there isn't such issue, since bio_for_each_segment* is renamed
> > into bio_for_each_page* first before doing the change.
>
> But now we are at V6 where that isn't the case..
>
> > Seems Jens isn't fine with the big renaming, so I followed the suggestion
> > of using 'chunk' to represent multipage bvec in V6.
>
> Please don't use chunk. We are iterating over bio_vec structures, while
> we have the concept of a chunk size for something else in the block layer,
> so this just creates confusion. Nevermind names like
> bio_for_each_chunk_segment_all which just double the confusion.
We may keep the name of bio_for_each_segment_all(), and just change
the prototype in one single big patch.
>
> So assuming that bio_for_each_segment is set to stay as-is for now,
> here is a proposal for sanity by using the vec name.
>
> OLD: bio_for_each_segment
> NEW(page): bio_for_each_segment, to be renamed bio_for_each_page later
> NEW(bvec): bio_for_each_bvec
>
> OLD: __bio_for_each_segment
> NEW(page): __bio_for_each_segment, to be renamed __bio_for_each_page later
> NEW(bvec): (no bvec version needed)
The above two are basically the same as in V6, except that V6 uses 'chunk'. :-)
>
> OLD: bio_for_each_segment_all
> NEW(page): bio_for_each_page_all (needs updated prototype anyway)
> NEW(bvec): (no bvec version needed once bcache is fixed up)
This one may cause confusion, since we iterate over pages via
bio_for_each_segment(), but the _all version takes another name
(page) while still iterating over pages.
So could we change it in the following way?
OLD: bio_for_each_segment_all
NEW(page): bio_for_each_segment_all (update prototype in one tree-wide &
big patch, to be renamed bio_for_each_page_all)
NEW(bvec): (no bvec version needed once bcache is fixed up)
Thanks,
Ming
On Wed, Jun 13, 2018 at 10:59:08AM -0400, Kent Overstreet wrote:
> On Sat, Jun 09, 2018 at 08:29:44PM +0800, Ming Lei wrote:
> > Hi,
> >
> > This patchset brings multipage bvec into block layer:
>
> Ming, what's going on with the chunk naming? I haven't been paying attention
> because it feels like it's turned into bike shedding, but I just saw something
> about a 3rd way of iterating over bios? (page/segment/chunk...?)
This patchset uses the chunk naming: 'chunk' represents a multipage
bvec, and 'segment' basically represents a singlepage bvec.
Thanks,
Ming
On Wed, Jun 13, 2018 at 07:44:12AM -0700, Christoph Hellwig wrote:
> Given that we have a single, dubious user of bio_pages_all I'd rather
> see it as an opencoded bio_for_each_ loop in the caller.
Yeah, that is fine since there is only one user in btrfs.
Thanks,
Ming
On Wed, Jun 13, 2018 at 07:48:18AM -0700, Christoph Hellwig wrote:
> On Sat, Jun 09, 2018 at 08:29:57PM +0800, Ming Lei wrote:
> > There are still cases in which rq_for_each_chunk() is required, for
> > example, loop.
> >
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
> > include/linux/blkdev.h | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bca3a92eb55f..4eaba73c784a 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -941,6 +941,10 @@ struct req_iterator {
> > __rq_for_each_bio(_iter.bio, _rq) \
> > bio_for_each_segment(bvl, _iter.bio, _iter.iter)
> >
> > +#define rq_for_each_chunk(bvl, _rq, _iter) \
> > + __rq_for_each_bio(_iter.bio, _rq) \
> > + bio_for_each_chunk(bvl, _iter.bio, _iter.iter)
>
> > We have a single user of this in the loop driver. I'd rather
> see the obvious loop open coded.
OK.
Thanks,
Ming
On Wed, Jun 13, 2018 at 07:56:54AM -0700, Christoph Hellwig wrote:
> On Sat, Jun 09, 2018 at 08:29:59PM +0800, Ming Lei wrote:
> > There is one use case(DM) which requires to clone bio chunk by
> > chunk, so introduce this API.
>
> I don't think DM is the special case here. The special case is the
> bounce code that only wants single page bios. Between that, and the
> fact that we only have two callers and one of them is inside the
> block layer I would suggest to fold in the following patch to make
> bio_clone_bioset clone in multi-page bvecs and make the bounce code
> use the low-level interface directly:
The bounce code limits the max pages to 256 and will do bio splitting, so it
won't need this change.
Thanks,
Ming
On Thu, Jun 14, 2018 at 09:18:58AM +0800, Ming Lei wrote:
> This one may cause confusion, since we iterate over pages via
> bio_for_each_segment(), but the _all version takes another name
> (page) while still iterating over pages.
>
> So could we change it in the following way?
>
> OLD: bio_for_each_segment_all
> NEW(page): bio_for_each_segment_all (update prototype in one tree-wide &
> big patch, to be renamed bio_for_each_page_all)
> NEW(bvec): (no bvec version needed once bcache is fixed up)
Fine with me, but I thought Jens didn't like that sweeping change?
On Thu, Jun 14, 2018 at 09:23:54AM +0800, Ming Lei wrote:
> On Wed, Jun 13, 2018 at 07:44:12AM -0700, Christoph Hellwig wrote:
> > Given that we have a single, dubious user of bio_pages_all I'd rather
> > see it as an opencoded bio_for_each_ loop in the caller.
>
> Yeah, that is fine since there is only one user in btrfs.
And I suspect it really is checking for the wrong thing. I don't
fully understand that code, but as far as I can tell it really
needs to know if there is more than a file system block of data in
the bio, and btrfs conflats pages with blocks. But I'd need btrfs
folks to confirm this.
On Thu, Jun 14, 2018 at 10:01:38AM +0800, Ming Lei wrote:
> The bounce code limits the max pages to 256 and will do bio splitting, so it
> won't need this change.
Behavior for the bounce code does not change with my patch.
The important points are:
- the default interface (bio_clone_bioset in this case) should always
operate on full biosets
- if the bounce code needs biovecs limited to single pages it should
be treated as the special case
- given that the bounce code is inside the block layer using the
__-prefixed internal interface is perfectly fine
- last but not least I think the parameter switching the behavior
needs a much more descriptive name as suggested in my patch
On Wed, Jun 13, 2018 at 11:39:20PM -0700, Christoph Hellwig wrote:
> On Thu, Jun 14, 2018 at 10:01:38AM +0800, Ming Lei wrote:
> > The bounce code limits the max pages to 256 and will do bio splitting, so this
> > change isn't needed.
>
> Behavior for the bounce code does not change with my patch.
>
> The important points are:
>
> - the default interface (bio_clone_bioset in this case) should always
> operate on full biosets
> - if the bounce code needs biovecs limited to single pages it should
> be treated as the special case
> - given that the bounce code is inside the block layer, using the
> __-prefixed internal interface is perfectly fine
> - last but not least I think the parameter switching the behavior
> needs a much more descriptive name as suggested in my patch
Fair enough, I will switch to this approach and avoid DM's change, even though
it is a dying interface.
Thanks,
Ming
>
> - bio size can be increased and it should improve some high-bandwidth IO
> case in theory[4].
>
Hi,
I would like to report that your patch set works well on my system,
which is based on v4.14.48.
I thought multipage bvec could improve the performance of my system.
(FYI, my system runs v4.14.48 and provides a KVM-based virtualization service.)
So I back-ported your patches to v4.14.48.
The back-port went through without any serious problems.
I only needed to cherry-pick the "blk-merge: compute
bio->bi_seg_front_size efficiently" and
"block: move bio_alloc_pages() to bcache" patches before back-porting
to prevent conflicts.
I then ran my own test suite, which checks features of the md and RAID1 layers.
There were no problems; all test cases passed.
(If you want, I will send you the back-ported patches.)
Then I ran two performance tests, described below.
To state the conclusion first: I failed to show any performance improvement
from the patch set.
Of course, my test cases may not be suitable for testing your patch set,
or maybe I ran the tests incorrectly.
Please let me know which tools are suitable, and I will try them.
1. fio
First I ran fio against a null_blk device to check the performance of the
block layer.
I am not sure this test is suitable for showing a performance
improvement or degradation.
Nevertheless, there was a small (about -7%) performance degradation.
If it is not too much trouble, please review my fio options and
let me know if any of them are wrong.
Then I will run the test again.
1.1 The following are my options for fio.
gkim@ib1:~/pb-ltp/benchmark/fio$ cat go_local.sh
#!/bin/bash
echo "fio start : $(date)"
echo "kernel info : $(uname -a)"
echo "fio version : $(fio --version)"
# set "none" io-scheduler
modprobe -r null_blk
modprobe null_blk
echo "none" > /sys/block/nullb0/queue/scheduler
FIO_OPTION="--direct=1 --rw=randrw:2 --time_based=1 --group_reporting \
--ioengine=libaio --iodepth=64 --name=fiotest --numjobs=8 \
--bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4 \
--fadvise_hint=0 --iodepth_batch_submit=64 \
--iodepth_batch_complete=64"
# fio tests the null_blk device, so a long run is not necessary.
fio $FIO_OPTION --filename=/dev/nullb0 --runtime=600
1.2 Following is the result before porting.
fio start : Mon Jun 11 04:30:01 CEST 2018
kernel info : Linux ib1 4.14.48-1-pserver #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP x86_64 GNU/Linux
fio version : fio-2.2.10
fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
...
fio-2.2.10
Starting 8 processes
fiotest: (groupid=0, jobs=8): err= 0: pid=1655: Mon Jun 11 04:40:02 2018
read : io=7133.2GB, bw=12174MB/s, iops=1342.1K, runt=600001msec
slat (usec): min=1, max=15750, avg=123.78, stdev=153.79
clat (usec): min=0, max=15758, avg=24.70, stdev=77.93
lat (usec): min=2, max=15782, avg=148.49, stdev=167.54
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6],
| 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107],
| 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416],
| 99.99th=[ 2960]
bw (MB /s): min= 1096, max= 2147, per=12.51%, avg=1522.69, stdev=253.89
write: io=7131.3GB, bw=12171MB/s, iops=1343.6K, runt=600001msec
slat (usec): min=1, max=15751, avg=124.73, stdev=154.11
clat (usec): min=0, max=15758, avg=24.69, stdev=77.84
lat (usec): min=2, max=15780, avg=149.43, stdev=167.82
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6],
| 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107],
| 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416],
| 99.99th=[ 2960]
bw (MB /s): min= 1080, max= 2121, per=12.51%, avg=1522.33, stdev=253.96
lat (usec) : 2=21.63%, 4=37.80%, 10=2.12%, 20=6.43%, 50=16.70%
lat (usec) : 100=8.86%, 250=6.07%, 500=0.17%, 750=0.08%, 1000=0.05%
lat (msec) : 2=0.06%, 4=0.02%, 10=0.01%, 20=0.01%
cpu : usr=22.39%, sys=64.19%, ctx=15425825, majf=0, minf=97
IO depths : 1=1.8%, 2=1.8%, 4=8.8%, 8=14.4%, 16=12.3%, 32=41.7%, >=64=19.3%
submit : 0=0.0%, 4=5.8%, 8=9.7%, 16=15.0%, 32=18.0%, 64=51.5%, >=64=0.0%
complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
issued : total=r=805764385/w=806127393/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: io=7133.2GB, aggrb=12174MB/s, minb=12174MB/s, maxb=12174MB/s, mint=600001msec, maxt=600001msec
WRITE: io=7131.3GB, aggrb=12171MB/s, minb=12171MB/s, maxb=12171MB/s, mint=600001msec, maxt=600001msec
Disk stats (read/write):
nullb0: ios=442461761/442546060, merge=363197836/363473703, ticks=12280990/12452480, in_queue=2740, util=0.43%
1.3 Following is the result after porting.
fio start : Fri Jun 15 12:42:47 CEST 2018
kernel info : Linux ib1 4.14.48-1-pserver-mpbvec+ #12 SMP Fri Jun 15 12:21:36 CEST 2018 x86_64 GNU/Linux
fio version : fio-2.2.10
fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
...
fio-2.2.10
Starting 8 processes
Jobs: 4 (f=0): [m(1),_(2),m(1),_(1),m(2),_(1)] [100.0% done] [8430MB/8444MB/0KB /s] [961K/963K/0 iops] [eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=14096: Fri Jun 15 12:52:48 2018
read : io=6633.8GB, bw=11322MB/s, iops=1246.9K, runt=600005msec
slat (usec): min=1, max=16939, avg=135.34, stdev=156.23
clat (usec): min=0, max=16947, avg=26.10, stdev=78.50
lat (usec): min=2, max=16957, avg=161.45, stdev=168.88
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5],
| 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115],
| 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416],
| 99.99th=[ 2864]
bw (MB /s): min= 1106, max= 2031, per=12.51%, avg=1416.05, stdev=201.81
write: io=6631.1GB, bw=11318MB/s, iops=1247.5K, runt=600005msec
slat (usec): min=1, max=16938, avg=136.48, stdev=156.54
clat (usec): min=0, max=16947, avg=26.08, stdev=78.43
lat (usec): min=2, max=16957, avg=162.58, stdev=169.15
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5],
| 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115],
| 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416],
| 99.99th=[ 2864]
bw (MB /s): min= 1084, max= 2044, per=12.51%, avg=1415.67, stdev=201.93
lat (usec) : 2=20.98%, 4=38.82%, 10=2.15%, 20=5.08%, 50=16.91%
lat (usec) : 100=8.75%, 250=6.91%, 500=0.19%, 750=0.09%, 1000=0.05%
lat (msec) : 2=0.07%, 4=0.02%, 10=0.01%, 20=0.01%
cpu : usr=21.02%, sys=65.53%, ctx=15321661, majf=0, minf=78
IO depths : 1=1.9%, 2=1.9%, 4=9.5%, 8=13.6%, 16=11.2%, 32=42.1%, >=64=19.9%
submit : 0=0.0%, 4=6.3%, 8=10.1%, 16=14.1%, 32=18.2%, 64=51.3%, >=64=0.0%
complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
issued : total=r=748120019/w=748454509/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: io=6633.8GB, aggrb=11322MB/s, minb=11322MB/s, maxb=11322MB/s, mint=600005msec, maxt=600005msec
WRITE: io=6631.1GB, aggrb=11318MB/s, minb=11318MB/s, maxb=11318MB/s, mint=600005msec, maxt=600005msec
Disk stats (read/write):
nullb0: ios=410911387/410974086, merge=337127604/337396176, ticks=12482050/12662790, in_queue=1780, util=0.27%
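For reference, the relative change between the two fio runs can be worked out from the aggregate figures reported above. The snippet below is only a small arithmetic sketch; the dictionary keys and the helper name are made up for this illustration, and the numbers are copied from the two reports.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


# Aggregate bandwidth (MB/s) and IOPS (in thousands) from the two reports.
before = {"read_bw": 12174.0, "write_bw": 12171.0,
          "read_kiops": 1342.1, "write_kiops": 1343.6}
after = {"read_bw": 11322.0, "write_bw": 11318.0,
         "read_kiops": 1246.9, "write_kiops": 1247.5}

changes = {name: percent_change(before[name], after[name]) for name in before}

for name, pct in changes.items():
    print(f"{name:12s} {pct:+.2f}%")
```

All four figures come out at roughly 7% lower after the back-port.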
2. Unixbench
Second, I ran Unixbench to check general performance.
I see no meaningful difference before and after porting the patches.
Unixbench might not be suitable for checking performance improvements
in the block layer.
If you let me know which tools are suitable, I will try them on my system.
2.1 Following is the result before porting.
BYTE UNIX Benchmarks (Version 5.1.3)
System: ib1: GNU/Linux
OS: GNU/Linux -- 4.14.48-1-pserver -- #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
Machine: x86_64 (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
05:00:01 up 3 days, 16:20, 2 users, load average: 0.00, 0.11, 1.11; runlevel 2018-06-07
------------------------------------------------------------------------
Benchmark Run: Mon Jun 11 2018 05:00:01 - 05:28:54
8 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 47158867.7 lps (10.0 s, 7 samples)
Double-Precision Whetstone 3878.8 MWIPS (15.2 s, 7 samples)
Execl Throughput 9203.9 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 1490834.8 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 388784.2 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 3744780.2 KBps (30.0 s, 2 samples)
Pipe Throughput 2682620.1 lps (10.0 s, 7 samples)
Pipe-based Context Switching 263786.5 lps (10.0 s, 7 samples)
Process Creation 19674.0 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 16121.5 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 5623.5 lpm (60.0 s, 2 samples)
System Call Overhead 4068991.3 lps (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE        RESULT     INDEX
Dhrystone 2 using register variables         116700.0    47158867.7    4041.0
Double-Precision Whetstone                       55.0        3878.8     705.2
Execl Throughput                                 43.0        9203.9    2140.4
File Copy 1024 bufsize 2000 maxblocks          3960.0     1490834.8    3764.7
File Copy 256 bufsize 500 maxblocks            1655.0      388784.2    2349.1
File Copy 4096 bufsize 8000 maxblocks          5800.0     3744780.2    6456.5
Pipe Throughput                               12440.0     2682620.1    2156.4
Pipe-based Context Switching                   4000.0      263786.5     659.5
Process Creation                                126.0       19674.0    1561.4
Shell Scripts (1 concurrent)                     42.4       16121.5    3802.2
Shell Scripts (8 concurrent)                      6.0        5623.5    9372.5
System Call Overhead                          15000.0     4068991.3    2712.7
                                                                     ========
System Benchmarks Index Score                                          2547.7
------------------------------------------------------------------------
Benchmark Run: Mon Jun 11 2018 05:28:54 - 05:57:07
8 CPUs in system; running 8 parallel copies of tests
Dhrystone 2 using register variables 234727639.9 lps (10.0 s, 7 samples)
Double-Precision Whetstone 35350.9 MWIPS (10.7 s, 7 samples)
Execl Throughput 43811.3 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 1401373.1 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 366033.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 4360829.6 KBps (30.0 s, 2 samples)
Pipe Throughput 12875165.6 lps (10.0 s, 7 samples)
Pipe-based Context Switching 2431725.6 lps (10.0 s, 7 samples)
Process Creation 97360.8 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 58879.6 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 9232.5 lpm (60.0 s, 2 samples)
System Call Overhead 9497958.7 lps (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE        RESULT     INDEX
Dhrystone 2 using register variables         116700.0   234727639.9   20113.8
Double-Precision Whetstone                       55.0       35350.9    6427.4
Execl Throughput                                 43.0       43811.3   10188.7
File Copy 1024 bufsize 2000 maxblocks          3960.0     1401373.1    3538.8
File Copy 256 bufsize 500 maxblocks            1655.0      366033.9    2211.7
File Copy 4096 bufsize 8000 maxblocks          5800.0     4360829.6    7518.7
Pipe Throughput                               12440.0    12875165.6   10349.8
Pipe-based Context Switching                   4000.0     2431725.6    6079.3
Process Creation                                126.0       97360.8    7727.0
Shell Scripts (1 concurrent)                     42.4       58879.6   13886.7
Shell Scripts (8 concurrent)                      6.0        9232.5   15387.5
System Call Overhead                          15000.0     9497958.7    6332.0
                                                                     ========
System Benchmarks Index Score                                          7803.5
2.2 Following is the result after porting.
BYTE UNIX Benchmarks (Version 5.1.3)
System: ib1: GNU/Linux
OS: GNU/Linux -- 4.14.48-1-pserver-mpbvec+ -- #12 SMP Fri Jun 15 12:21:36 CEST 2018
Machine: x86_64 (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
Hyper-Threading, x86-64, MMX, Physical Address Ext,
SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
13:16:11 up 50 min, 1 user, load average: 0.00, 1.40, 3.46; runlevel 2018-06-15
------------------------------------------------------------------------
Benchmark Run: Fri Jun 15 2018 13:16:11 - 13:45:04
8 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 47103754.6 lps (10.0 s, 7 samples)
Double-Precision Whetstone 3886.3 MWIPS (15.1 s, 7 samples)
Execl Throughput 8965.0 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 1510285.9 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 395196.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 3802788.0 KBps (30.0 s, 2 samples)
Pipe Throughput 2670169.1 lps (10.0 s, 7 samples)
Pipe-based Context Switching 275093.8 lps (10.0 s, 7 samples)
Process Creation 19707.1 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 16046.8 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 5600.8 lpm (60.0 s, 2 samples)
System Call Overhead 4104142.0 lps (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE        RESULT     INDEX
Dhrystone 2 using register variables         116700.0    47103754.6    4036.3
Double-Precision Whetstone                       55.0        3886.3     706.6
Execl Throughput                                 43.0        8965.0    2084.9
File Copy 1024 bufsize 2000 maxblocks          3960.0     1510285.9    3813.9
File Copy 256 bufsize 500 maxblocks            1655.0      395196.9    2387.9
File Copy 4096 bufsize 8000 maxblocks          5800.0     3802788.0    6556.5
Pipe Throughput                               12440.0     2670169.1    2146.4
Pipe-based Context Switching                   4000.0      275093.8     687.7
Process Creation                                126.0       19707.1    1564.1
Shell Scripts (1 concurrent)                     42.4       16046.8    3784.6
Shell Scripts (8 concurrent)                      6.0        5600.8    9334.6
System Call Overhead                          15000.0     4104142.0    2736.1
                                                                     ========
System Benchmarks Index Score                                          2560.0
------------------------------------------------------------------------
Benchmark Run: Fri Jun 15 2018 13:45:04 - 14:13:17
8 CPUs in system; running 8 parallel copies of tests
Dhrystone 2 using register variables 237271982.6 lps (10.0 s, 7 samples)
Double-Precision Whetstone 35186.8 MWIPS (10.7 s, 7 samples)
Execl Throughput 42557.8 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 1403922.0 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 367436.5 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 4380468.3 KBps (30.0 s, 2 samples)
Pipe Throughput 12872664.6 lps (10.0 s, 7 samples)
Pipe-based Context Switching 2451404.5 lps (10.0 s, 7 samples)
Process Creation 97788.2 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 58505.9 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 9195.4 lpm (60.0 s, 2 samples)
System Call Overhead 9467372.2 lps (10.0 s, 7 samples)
System Benchmarks Index Values               BASELINE        RESULT     INDEX
Dhrystone 2 using register variables         116700.0   237271982.6   20331.8
Double-Precision Whetstone                       55.0       35186.8    6397.6
Execl Throughput                                 43.0       42557.8    9897.2
File Copy 1024 bufsize 2000 maxblocks          3960.0     1403922.0    3545.3
File Copy 256 bufsize 500 maxblocks            1655.0      367436.5    2220.2
File Copy 4096 bufsize 8000 maxblocks          5800.0     4380468.3    7552.5
Pipe Throughput                               12440.0    12872664.6   10347.8
Pipe-based Context Switching                   4000.0     2451404.5    6128.5
Process Creation                                126.0       97788.2    7761.0
Shell Scripts (1 concurrent)                     42.4       58505.9   13798.6
Shell Scripts (8 concurrent)                      6.0        9195.4   15325.6
System Call Overhead                          15000.0     9467372.2    6311.6
                                                                     ========
System Benchmarks Index Score                                          7794.3
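As a rough cross-check of the Unixbench numbers, the sketch below assumes that each INDEX value is RESULT / BASELINE * 10 and that the Index Score is the geometric mean of the INDEX column; the figures in the tables above are consistent with that reading. The helper names are invented for this example, and the inputs are copied from the reports.

```python
import math


def index_value(result: float, baseline: float) -> float:
    """Assumed per-test index: result relative to baseline, scaled by 10."""
    return result / baseline * 10.0


def geometric_mean(values) -> float:
    """Geometric mean, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def percent_change(before: float, after: float) -> float:
    return (after - before) / before * 100.0


# INDEX column of the 1-copy run before porting (section 2.1).
indices_before_1copy = [4041.0, 705.2, 2140.4, 3764.7, 2349.1, 6456.5,
                        2156.4, 659.5, 1561.4, 3802.2, 9372.5, 2712.7]

dhrystone_index = index_value(47158867.7, 116700.0)
score_1copy = geometric_mean(indices_before_1copy)

# Reported Index Scores: before vs. after porting.
change_1copy = percent_change(2547.7, 2560.0)
change_8copy = percent_change(7803.5, 7794.3)

print(f"Dhrystone index (1 copy, before): {dhrystone_index:.1f}")
print(f"Recomputed score (1 copy, before): {score_1copy:.1f}")
print(f"Score change, 1 copy:   {change_1copy:+.2f}%")
print(f"Score change, 8 copies: {change_8copy:+.2f}%")
```

Both score changes are well under 1%, which fits the observation that Unixbench shows no real difference between the two kernels.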
--
GIOH KIM
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 176 2697 8962
Fax: +49 30 577 008 299
Email: [email protected]
URL: https://www.profitbricks.de
Registered office: Berlin
Commercial register: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Achim Weiss, Matthias Steinberg, Christoph Steffens
On Sat, Jun 9, 2018 at 2:36 PM Ming Lei <[email protected]> wrote:
>
> Now multipage bvec is supported, and some helpers may return page by
> page, and some may return segment by segment, this patch documents the
> usage for helping us use them correctly.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> Documentation/block/biovecs.txt | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
> index 25689584e6e0..3ab72566141f 100644
> --- a/Documentation/block/biovecs.txt
> +++ b/Documentation/block/biovecs.txt
> @@ -117,3 +117,33 @@ Other implications:
> size limitations and the limitations of the underlying devices. Thus
> there's no need to define ->merge_bvec_fn() callbacks for individual block
> drivers.
> +
> +Usage of helpers:
> +=================
> +
> +* The following helpers, whose names have the suffix "_all", can only be
> +used on non-BIO_CLONED bio, and usually they are used by filesystem code,
> +and driver shouldn't use them because bio may have been split before they
> +got to the driver:
> +
> + bio_for_each_chunk_segment_all()
> + bio_for_each_chunk_all()
> + bio_pages_all()
> + bio_first_bvec_all()
> + bio_first_page_all()
> + bio_last_bvec_all()
> +
> +* The following helpers iterate bio page by page, and the local variable of
> +'struct bio_vec' or the reference records single page io vector during the
> +iteration:
> +
> + bio_for_each_segment()
> + bio_for_each_segment_all()
bio_for_each_segment_all() is removed, isn't it?
> +
> +* The following helpers iterate bio chunk by chunk, and each chunk may
> +include multiple physically contiguous pages, and the local variable of
> +'struct bio_vec' or the reference records multi page io vector during the
> +iteration:
> +
> + bio_for_each_chunk()
> + bio_for_each_chunk_all()
> --
> 2.9.5
>
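To check my understanding of the two kinds of helpers, here is a toy model in Python (not kernel code). BioVec, PAGE_SIZE, iter_chunks() and iter_pages() are names invented for this sketch; they only mimic the documented behaviour, where the chunk helpers hand out a whole multi-page vector and the segment helpers hand out single-page vectors.

```python
from typing import Iterable, Iterator, NamedTuple

PAGE_SIZE = 4096  # assumed page size for this toy model


class BioVec(NamedTuple):
    """Toy stand-in for 'struct bio_vec' (not the kernel layout)."""
    page: int    # index of the first physical page
    offset: int  # byte offset into that first page
    length: int  # length in bytes; may cross page boundaries


def iter_chunks(bvecs: Iterable[BioVec]) -> Iterator[BioVec]:
    """Chunk-by-chunk walk: yield each multi-page vector unchanged."""
    for bv in bvecs:
        yield bv


def iter_pages(bvecs: Iterable[BioVec]) -> Iterator[BioVec]:
    """Page-by-page walk: split every vector at page boundaries."""
    for bv in bvecs:
        page = bv.page + bv.offset // PAGE_SIZE
        offset = bv.offset % PAGE_SIZE
        remaining = bv.length
        while remaining > 0:
            step = min(remaining, PAGE_SIZE - offset)
            yield BioVec(page, offset, step)
            remaining -= step
            page += 1
            offset = 0


# Hypothetical bio: a 12K chunk starting 512 bytes into page 10,
# followed by one full page at page 40.
bio = [BioVec(10, 512, 3 * PAGE_SIZE), BioVec(40, 0, PAGE_SIZE)]
chunks = list(iter_chunks(bio))
pages = list(iter_pages(bio))
print(len(chunks), "chunks,", len(pages), "single-page vectors")
```

In this model the same data is seen as 2 chunks or as 5 single-page vectors, and the total byte count is identical either way.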
On Fri, Jun 15, 2018 at 02:59:19PM +0200, Gi-Oh Kim wrote:
> >
> > - bio size can be increased and it should improve some high-bandwidth IO
> > case in theory[4].
> >
>
> Hi,
>
> I would like to report that your patch set works well on my system,
> which is based on v4.14.48.
> I thought multipage bvec could improve the performance of my system.
> (FYI, my system runs v4.14.48 and provides a KVM-based virtualization service.)
Thanks for your test!
>
> So I back-ported your patches to v4.14.48.
> The back-port went through without any serious problems.
> I only needed to cherry-pick the "blk-merge: compute
> bio->bi_seg_front_size efficiently" and
> "block: move bio_alloc_pages() to bcache" patches before back-porting
> to prevent conflicts.
I'm not sure I understand your point; you have to backport all of the patches.
> I then ran my own test suite, which checks features of the md and RAID1 layers.
> There were no problems; all test cases passed.
> (If you want, I will send you the back-ported patches.)
>
> Then I ran two performance tests, described below.
> To state the conclusion first: I failed to show any performance improvement
> from the patch set.
> Of course, my test cases may not be suitable for testing your patch set,
> or maybe I ran the tests incorrectly.
> Please let me know which tools are suitable, and I will try them.
>
> 1. fio
>
> First I ran fio against a null_blk device to check the performance of the
> block layer.
> I am not sure this test is suitable for showing a performance
> improvement or degradation.
> Nevertheless, there was a small (about -7%) performance degradation.
>
> If it is not too much trouble, please review my fio options and
> let me know if any of them are wrong.
> Then I will run the test again.
>
> 1.1 The following are my options for fio.
>
> gkim@ib1:~/pb-ltp/benchmark/fio$ cat go_local.sh
> #!/bin/bash
> echo "fio start : $(date)"
> echo "kernel info : $(uname -a)"
> echo "fio version : $(fio --version)"
>
> # set "none" io-scheduler
> modprobe -r null_blk
> modprobe null_blk
> echo "none" > /sys/block/nullb0/queue/scheduler
>
> FIO_OPTION="--direct=1 --rw=randrw:2 --time_based=1 --group_reporting \
> --ioengine=libaio --iodepth=64 --name=fiotest --numjobs=8 \
> --bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4 \
> --fadvise_hint=0 --iodepth_batch_submit=64
> --iodepth_batch_complete=64"
> # fio test null_blk device, so it is not necessary to run long.
> fio $FIO_OPTION --filename=/dev/nullb0 --runtime=600
>
> 1.2 Following is the result before porting.
>
> fio start : Mon Jun 11 04:30:01 CEST 2018
> kernel info : Linux ib1 4.14.48-1-pserver
> #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
> x86_64 GNU/Linux
> fio version : fio-2.2.10
> fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K,
> ioengine=libaio, iodepth=64
> ...
> fio-2.2.10
> Starting 8 processes
>
> fiotest: (groupid=0, jobs=8): err= 0: pid=1655: Mon Jun 11 04:40:02 2018
> read : io=7133.2GB, bw=12174MB/s, iops=1342.1K, runt=600001msec
> slat (usec): min=1, max=15750, avg=123.78, stdev=153.79
> clat (usec): min=0, max=15758, avg=24.70, stdev=77.93
> lat (usec): min=2, max=15782, avg=148.49, stdev=167.54
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6],
> | 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107],
> | 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416],
> | 99.99th=[ 2960]
> bw (MB /s): min= 1096, max= 2147, per=12.51%, avg=1522.69, stdev=253.89
> write: io=7131.3GB, bw=12171MB/s, iops=1343.6K, runt=600001msec
> slat (usec): min=1, max=15751, avg=124.73, stdev=154.11
> clat (usec): min=0, max=15758, avg=24.69, stdev=77.84
> lat (usec): min=2, max=15780, avg=149.43, stdev=167.82
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6],
> | 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107],
> | 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416],
> | 99.99th=[ 2960]
> bw (MB /s): min= 1080, max= 2121, per=12.51%, avg=1522.33, stdev=253.96
> lat (usec) : 2=21.63%, 4=37.80%, 10=2.12%, 20=6.43%, 50=16.70%
> lat (usec) : 100=8.86%, 250=6.07%, 500=0.17%, 750=0.08%, 1000=0.05%
> lat (msec) : 2=0.06%, 4=0.02%, 10=0.01%, 20=0.01%
> cpu : usr=22.39%, sys=64.19%, ctx=15425825, majf=0, minf=97
> IO depths : 1=1.8%, 2=1.8%, 4=8.8%, 8=14.4%, 16=12.3%, 32=41.7%, >=64=19.3%
> submit : 0=0.0%, 4=5.8%, 8=9.7%, 16=15.0%, 32=18.0%, 64=51.5%, >=64=0.0%
> complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
> issued : total=r=805764385/w=806127393/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: io=7133.2GB, aggrb=12174MB/s, minb=12174MB/s, maxb=12174MB/s,
> mint=600001msec, maxt=600001msec
> WRITE: io=7131.3GB, aggrb=12171MB/s, minb=12171MB/s, maxb=12171MB/s,
> mint=600001msec, maxt=600001msec
>
> Disk stats (read/write):
> nullb0: ios=442461761/442546060, merge=363197836/363473703,
> ticks=12280990/12452480, in_queue=2740, util=0.43%
>
> 1.3 Following is the result after porting.
>
> fio start : Fri Jun 15 12:42:47 CEST 2018
> kernel info : Linux ib1 4.14.48-1-pserver-mpbvec+ #12 SMP Fri Jun 15
> 12:21:36 CEST 2018 x86_64 GNU/Linux
> fio version : fio-2.2.10
> fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K,
> ioengine=libaio, iodepth=64
> ...
> fio-2.2.10
> Starting 8 processes
> Jobs: 4 (f=0): [m(1),_(2),m(1),_(1),m(2),_(1)] [100.0% done]
> [8430MB/8444MB/0KB /s] [961K/963K/0 iops] [eta 00m:00s]
> fiotest: (groupid=0, jobs=8): err= 0: pid=14096: Fri Jun 15 12:52:48 2018
> read : io=6633.8GB, bw=11322MB/s, iops=1246.9K, runt=600005msec
> slat (usec): min=1, max=16939, avg=135.34, stdev=156.23
> clat (usec): min=0, max=16947, avg=26.10, stdev=78.50
> lat (usec): min=2, max=16957, avg=161.45, stdev=168.88
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5],
> | 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115],
> | 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416],
> | 99.99th=[ 2864]
> bw (MB /s): min= 1106, max= 2031, per=12.51%, avg=1416.05, stdev=201.81
> write: io=6631.1GB, bw=11318MB/s, iops=1247.5K, runt=600005msec
> slat (usec): min=1, max=16938, avg=136.48, stdev=156.54
> clat (usec): min=0, max=16947, avg=26.08, stdev=78.43
> lat (usec): min=2, max=16957, avg=162.58, stdev=169.15
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5],
> | 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115],
> | 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416],
> | 99.99th=[ 2864]
> bw (MB /s): min= 1084, max= 2044, per=12.51%, avg=1415.67, stdev=201.93
> lat (usec) : 2=20.98%, 4=38.82%, 10=2.15%, 20=5.08%, 50=16.91%
> lat (usec) : 100=8.75%, 250=6.91%, 500=0.19%, 750=0.09%, 1000=0.05%
> lat (msec) : 2=0.07%, 4=0.02%, 10=0.01%, 20=0.01%
> cpu : usr=21.02%, sys=65.53%, ctx=15321661, majf=0, minf=78
> IO depths : 1=1.9%, 2=1.9%, 4=9.5%, 8=13.6%, 16=11.2%, 32=42.1%, >=64=19.9%
> submit : 0=0.0%, 4=6.3%, 8=10.1%, 16=14.1%, 32=18.2%,
> 64=51.3%, >=64=0.0%
> complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
> issued : total=r=748120019/w=748454509/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: io=6633.8GB, aggrb=11322MB/s, minb=11322MB/s, maxb=11322MB/s,
> mint=600005msec, maxt=600005msec
> WRITE: io=6631.1GB, aggrb=11318MB/s, minb=11318MB/s, maxb=11318MB/s,
> mint=600005msec, maxt=600005msec
>
> Disk stats (read/write):
> nullb0: ios=410911387/410974086, merge=337127604/337396176,
> ticks=12482050/12662790, in_queue=1780, util=0.27%
>
>
> 2. Unixbench
>
> Second, I ran Unixbench to check general performance.
> I see no meaningful difference before and after porting the patches.
> Unixbench might not be suitable for checking performance improvements
> in the block layer.
> If you let me know which tools are suitable, I will try them on my system.
>
> 2.1 Following is the result before porting.
>
> BYTE UNIX Benchmarks (Version 5.1.3)
>
> System: ib1: GNU/Linux
> OS: GNU/Linux -- 4.14.48-1-pserver --
> #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
> Machine: x86_64 (unknown)
> Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
> CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> 05:00:01 up 3 days, 16:20, 2 users, load average: 0.00, 0.11,
> 1.11; runlevel 2018-06-07
>
> ------------------------------------------------------------------------
> Benchmark Run: Mon Jun 11 2018 05:00:01 - 05:28:54
> 8 CPUs in system; running 1 parallel copy of tests
>
> Dhrystone 2 using register variables 47158867.7 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 3878.8 MWIPS (15.2 s, 7 samples)
> Execl Throughput 9203.9 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1490834.8 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 388784.2 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 3744780.2 KBps (30.0 s, 2 samples)
> Pipe Throughput 2682620.1 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 263786.5 lps (10.0 s, 7 samples)
> Process Creation 19674.0 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 16121.5 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 5623.5 lpm (60.0 s, 2 samples)
> System Call Overhead 4068991.3 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 47158867.7 4041.0
> Double-Precision Whetstone 55.0 3878.8 705.2
> Execl Throughput 43.0 9203.9 2140.4
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1490834.8 3764.7
> File Copy 256 bufsize 500 maxblocks 1655.0 388784.2 2349.1
> File Copy 4096 bufsize 8000 maxblocks 5800.0 3744780.2 6456.5
> Pipe Throughput 12440.0 2682620.1 2156.4
> Pipe-based Context Switching 4000.0 263786.5 659.5
> Process Creation 126.0 19674.0 1561.4
> Shell Scripts (1 concurrent) 42.4 16121.5 3802.2
> Shell Scripts (8 concurrent) 6.0 5623.5 9372.5
> System Call Overhead 15000.0 4068991.3 2712.7
> ========
> System Benchmarks Index Score 2547.7
>
> ------------------------------------------------------------------------
> Benchmark Run: Mon Jun 11 2018 05:28:54 - 05:57:07
> 8 CPUs in system; running 8 parallel copies of tests
>
> Dhrystone 2 using register variables 234727639.9 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 35350.9 MWIPS (10.7 s, 7 samples)
> Execl Throughput 43811.3 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1401373.1 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 366033.9 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 4360829.6 KBps (30.0 s, 2 samples)
> Pipe Throughput 12875165.6 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 2431725.6 lps (10.0 s, 7 samples)
> Process Creation 97360.8 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 58879.6 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 9232.5 lpm (60.0 s, 2 samples)
> System Call Overhead 9497958.7 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 234727639.9 20113.8
> Double-Precision Whetstone 55.0 35350.9 6427.4
> Execl Throughput 43.0 43811.3 10188.7
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1401373.1 3538.8
> File Copy 256 bufsize 500 maxblocks 1655.0 366033.9 2211.7
> File Copy 4096 bufsize 8000 maxblocks 5800.0 4360829.6 7518.7
> Pipe Throughput 12440.0 12875165.6 10349.8
> Pipe-based Context Switching 4000.0 2431725.6 6079.3
> Process Creation 126.0 97360.8 7727.0
> Shell Scripts (1 concurrent) 42.4 58879.6 13886.7
> Shell Scripts (8 concurrent) 6.0 9232.5 15387.5
> System Call Overhead 15000.0 9497958.7 6332.0
> ========
> System Benchmarks Index Score 7803.5
>
>
> 2.2 Following is the result after porting.
>
> BYTE UNIX Benchmarks (Version 5.1.3)
>
> System: ib1: GNU/Linux
> OS: GNU/Linux -- 4.14.48-1-pserver-mpbvec+ -- #12 SMP Fri Jun 15
> 12:21:36 CEST 2018
> Machine: x86_64 (unknown)
> Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
> CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> 13:16:11 up 50 min, 1 user, load average: 0.00, 1.40, 3.46;
> runlevel 2018-06-15
>
> ------------------------------------------------------------------------
> Benchmark Run: Fri Jun 15 2018 13:16:11 - 13:45:04
> 8 CPUs in system; running 1 parallel copy of tests
>
> Dhrystone 2 using register variables 47103754.6 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 3886.3 MWIPS (15.1 s, 7 samples)
> Execl Throughput 8965.0 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1510285.9 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 395196.9 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 3802788.0 KBps (30.0 s, 2 samples)
> Pipe Throughput 2670169.1 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 275093.8 lps (10.0 s, 7 samples)
> Process Creation 19707.1 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 16046.8 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 5600.8 lpm (60.0 s, 2 samples)
> System Call Overhead 4104142.0 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 47103754.6 4036.3
> Double-Precision Whetstone 55.0 3886.3 706.6
> Execl Throughput 43.0 8965.0 2084.9
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1510285.9 3813.9
> File Copy 256 bufsize 500 maxblocks 1655.0 395196.9 2387.9
> File Copy 4096 bufsize 8000 maxblocks 5800.0 3802788.0 6556.5
> Pipe Throughput 12440.0 2670169.1 2146.4
> Pipe-based Context Switching 4000.0 275093.8 687.7
> Process Creation 126.0 19707.1 1564.1
> Shell Scripts (1 concurrent) 42.4 16046.8 3784.6
> Shell Scripts (8 concurrent) 6.0 5600.8 9334.6
> System Call Overhead 15000.0 4104142.0 2736.1
> ========
> System Benchmarks Index Score 2560.0
>
> ------------------------------------------------------------------------
> Benchmark Run: Fri Jun 15 2018 13:45:04 - 14:13:17
> 8 CPUs in system; running 8 parallel copies of tests
>
> Dhrystone 2 using register variables 237271982.6 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 35186.8 MWIPS (10.7 s, 7 samples)
> Execl Throughput 42557.8 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1403922.0 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 367436.5 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 4380468.3 KBps (30.0 s, 2 samples)
> Pipe Throughput 12872664.6 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 2451404.5 lps (10.0 s, 7 samples)
> Process Creation 97788.2 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 58505.9 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 9195.4 lpm (60.0 s, 2 samples)
> System Call Overhead 9467372.2 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 237271982.6 20331.8
> Double-Precision Whetstone 55.0 35186.8 6397.6
> Execl Throughput 43.0 42557.8 9897.2
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1403922.0 3545.3
> File Copy 256 bufsize 500 maxblocks 1655.0 367436.5 2220.2
> File Copy 4096 bufsize 8000 maxblocks 5800.0 4380468.3 7552.5
> Pipe Throughput 12440.0 12872664.6 10347.8
> Pipe-based Context Switching 4000.0 2451404.5 6128.5
> Process Creation 126.0 97788.2 7761.0
> Shell Scripts (1 concurrent) 42.4 58505.9 13798.6
> Shell Scripts (8 concurrent) 6.0 9195.4 15325.6
> System Call Overhead 15000.0 9467372.2 6311.6
> ========
> System Benchmarks Index Score 7794.3
At least now, BIO_MAX_PAGES can stay fixed at 256 even when
CONFIG_THP_SWAP is enabled; otherwise 2 pages may have to be allocated
to hold the bvec table. So tests with THP_SWAP may see an improvement.
Filesystems may also support I/O to/from THP, and multipage bvec should
improve that case too.
Long term, there is an opportunity to improve fs code by allocating a
bvec table of only 'nr_segment' entries instead of 'nr_page' entries,
because mm often allocates physically contiguous pages for the same
process.
So this patchset is just a start. At the current stage I am focusing on
making it stable, since storing only the multipage segment instead of
each page is the correct approach.
Thanks again for your test.
Thanks,
Ming
On Thu, Jun 21, 2018 at 3:17 AM, Ming Lei wrote:
> On Fri, Jun 15, 2018 at 02:59:19PM +0200, Gi-Oh Kim wrote:
>> >
>> > - bio size can be increased and it should improve some high-bandwidth IO
>> > case in theory[4].
>> >
>>
>> Hi,
>>
>> I would like to report that your patch set works well on my system based
>> on v4.14.48. I thought multipage bvec could improve its performance.
>> (FYI, my system runs v4.14.48 and provides a KVM-based virtualization service.)
>
> Thanks for your test!
>
>>
>> So I backported your patches to v4.14.48.
>> It went through without any serious problem.
>> I only needed to cherry-pick the "blk-merge: compute
>> bio->bi_seg_front_size efficiently" and
>> "block: move bio_alloc_pages() to bcache" patches before backporting
>> to prevent conflicts.
>
> Not sure I understand your point; you have to backport all the patches.
Never mind.
I just meant that I backported the patches myself and they are still working well.
>
> At least now, BIO_MAX_PAGES can stay fixed at 256 even when
> CONFIG_THP_SWAP is enabled; otherwise 2 pages may have to be allocated
> to hold the bvec table. So tests with THP_SWAP may see an improvement.
>
> Filesystems may also support I/O to/from THP, and multipage bvec should
> improve that case too.
OK, I got it.
I will find a workload that uses THP_SWAP and run the performance test with it.
Thank you ;-)
--
GIOH KIM
Linux Kernel Developer
ProfitBricks GmbH