2020-08-24 15:25:17

by Matthew Wilcox

Subject: [PATCH 00/11] iomap/fs/block patches for 5.11

As promised earlier [1], here are the patches which I would like to
merge into 5.11 to support THPs. They depend on that earlier series.
If there's anything in here that you'd like to see pulled out and added
to that earlier series, let me know.

There are a couple of pieces in here which aren't exactly part of
iomap, but I think make sense to take through the iomap tree.

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/

Matthew Wilcox (Oracle) (11):
fs: Make page_mkwrite_check_truncate thp-aware
mm: Support THPs in zero_user_segments
mm: Zero the head page, not the tail page
block: Add bio_for_each_thp_segment_all
iomap: Support THPs in iomap_adjust_read_range
iomap: Support THPs in invalidatepage
iomap: Support THPs in read paths
iomap: Change iomap_write_begin calling convention
iomap: Support THPs in write paths
iomap: Inline data shouldn't see THPs
iomap: Handle tail pages in iomap_page_mkwrite

fs/iomap/buffered-io.c | 178 ++++++++++++++++++++++++----------------
include/linux/bio.h | 13 +++
include/linux/bvec.h | 27 ++++++
include/linux/highmem.h | 15 +++-
include/linux/pagemap.h | 10 +--
mm/highmem.c | 62 +++++++++++++-
mm/shmem.c | 7 ++
mm/truncate.c | 7 ++
8 files changed, 236 insertions(+), 83 deletions(-)

--
2.28.0


2020-08-24 15:25:31

by Matthew Wilcox

Subject: [PATCH 10/11] iomap: Inline data shouldn't see THPs

Assert that we're not seeing THPs in functions that read/write
inline data, rather than zeroing out the tail.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/iomap/buffered-io.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 52d371c59758..ca2aa1995519 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -210,6 +210,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
return;

BUG_ON(page->index);
+ BUG_ON(PageCompound(page));
BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));

addr = kmap_atomic(page);
@@ -727,6 +728,7 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,

flush_dcache_page(page);
WARN_ON_ONCE(!PageUptodate(page));
+ BUG_ON(PageCompound(page));
BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));

addr = kmap_atomic(page);
--
2.28.0

2020-08-24 15:26:35

by Matthew Wilcox

Subject: [PATCH 06/11] iomap: Support THPs in invalidatepage

If we're punching a hole in a THP, we need to remove the per-page
iomap data as the THP is about to be split and each page will need
its own. This means that writepage can now come across a page with
no iop allocated, so remove the assertions that there is already one,
and just create one (with the Uptodate bits set) if there isn't one.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5cc0343b6a8e..9ea162617398 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -54,6 +54,8 @@ iomap_page_create(struct inode *inode, struct page *page)
iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(&iop->uptodate_lock);
+ if (PageUptodate(page))
+ bitmap_fill(iop->uptodate, nr_blocks);
attach_page_private(page, iop);
return iop;
}
@@ -483,10 +485,17 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
* If we are invalidating the entire page, clear the dirty state from it
* and release it to avoid unnecessary buildup of the LRU.
*/
- if (offset == 0 && len == PAGE_SIZE) {
+ if (offset == 0 && len == thp_size(page)) {
WARN_ON_ONCE(PageWriteback(page));
cancel_dirty_page(page);
iomap_page_release(page);
+ return;
+ }
+
+ /* Punching a hole in a THP requires releasing the iop */
+ if (PageTransHuge(page)) {
+ VM_BUG_ON_PAGE(!PageUptodate(page), page);
+ iomap_page_release(page);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidatepage);
@@ -1043,14 +1052,13 @@ static void
iomap_finish_page_writeback(struct inode *inode, struct page *page,
int error, unsigned int len)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct iomap_page *iop = iomap_page_create(inode, page);

if (error) {
SetPageError(page);
mapping_set_error(inode->i_mapping, -EIO);
}

- WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);

if (!iop || atomic_sub_and_test(len, &iop->write_count))
@@ -1340,14 +1348,13 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct page *page, u64 end_offset)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct iomap_page *iop = iomap_page_create(inode, page);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
u64 file_offset; /* file offset of page */
int error = 0, count = 0, i;
LIST_HEAD(submit_list);

- WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);

/*
--
2.28.0

2020-08-24 15:27:43

by Matthew Wilcox

Subject: [PATCH 11/11] iomap: Handle tail pages in iomap_page_mkwrite

iomap_page_mkwrite() can be called with a tail page. If it is,
operate on the head page, since we're treating the entire THP as a
single unit and the whole THP is dirtied at the same time.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ca2aa1995519..eb8202424665 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1043,7 +1043,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,

vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
{
- struct page *page = vmf->page;
+ struct page *page = thp_head(vmf->page);
struct inode *inode = file_inode(vmf->vma->vm_file);
unsigned long length;
loff_t offset;
--
2.28.0

2020-08-24 15:28:30

by Matthew Wilcox

Subject: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

Iterate once for each THP instead of once for each base page.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/bio.h | 13 +++++++++++++
include/linux/bvec.h | 27 +++++++++++++++++++++++++++
2 files changed, 40 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..a0e104910097 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -129,12 +129,25 @@ static inline bool bio_next_segment(const struct bio *bio,
return true;
}

+static inline bool bio_next_thp_segment(const struct bio *bio,
+ struct bvec_iter_all *iter)
+{
+ if (iter->idx >= bio->bi_vcnt)
+ return false;
+
+ bvec_thp_advance(&bio->bi_io_vec[iter->idx], iter);
+ return true;
+}
+
/*
* drivers should _never_ use the all version - the bio may have been split
* before it got to the driver and the driver won't own all of it
*/
#define bio_for_each_segment_all(bvl, bio, iter) \
for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_thp_segment_all(bvl, bio, iter) \
+ for (bvl = bvec_init_iter_all(&iter); \
+ bio_next_thp_segment((bio), &iter); )

static inline void bio_advance_iter(const struct bio *bio,
struct bvec_iter *iter, unsigned int bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ac0c7299d5b8..ea8a37a7515b 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -162,4 +162,31 @@ static inline void bvec_advance(const struct bio_vec *bvec,
}
}

+static inline void bvec_thp_advance(const struct bio_vec *bvec,
+ struct bvec_iter_all *iter_all)
+{
+ struct bio_vec *bv = &iter_all->bv;
+ unsigned int page_size;
+
+ if (iter_all->done) {
+ bv->bv_page += thp_nr_pages(bv->bv_page);
+ page_size = thp_size(bv->bv_page);
+ bv->bv_offset = 0;
+ } else {
+ bv->bv_page = thp_head(bvec->bv_page +
+ (bvec->bv_offset >> PAGE_SHIFT));
+ page_size = thp_size(bv->bv_page);
+ bv->bv_offset = bvec->bv_offset -
+ (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
+ BUG_ON(bv->bv_offset >= page_size);
+ }
+ bv->bv_len = min(page_size - bv->bv_offset,
+ bvec->bv_len - iter_all->done);
+ iter_all->done += bv->bv_len;
+
+ if (iter_all->done == bvec->bv_len) {
+ iter_all->idx++;
+ iter_all->done = 0;
+ }
+}
#endif /* __LINUX_BVEC_ITER_H */
--
2.28.0

2020-08-24 15:28:41

by Matthew Wilcox

Subject: [PATCH 07/11] iomap: Support THPs in read paths

Use thp_size() instead of PAGE_SIZE, offset_in_thp() instead of
offset_in_page(), and bio_for_each_thp_segment_all() instead of
bio_for_each_segment_all().

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9ea162617398..d14de8886d5c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -187,7 +187,7 @@ iomap_read_end_io(struct bio *bio)
struct bio_vec *bvec;
struct bvec_iter_all iter_all;

- bio_for_each_segment_all(bvec, bio, iter_all)
+ bio_for_each_thp_segment_all(bvec, bio, iter_all)
iomap_read_page_end_io(bvec, error);
bio_put(bio);
}
@@ -227,6 +227,16 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
pos >= i_size_read(inode);
}

+/*
+ * Estimate the number of vectors we need based on the current page size;
+ * if we're wrong we'll end up doing an overly large allocation or needing
+ * to do a second allocation, neither of which is a big deal.
+ */
+static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
+{
+ return (length + thp_size(page) - 1) >> page_shift(page);
+}
+
static loff_t
iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap, struct iomap *srcmap)
@@ -280,7 +290,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
- int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ int nr_vecs = iomap_nr_vecs(page, length);

if (ctx->bio)
submit_bio(ctx->bio);
@@ -324,9 +334,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)

trace_iomap_readpage(page->mapping->host, 1);

- for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+ for (poff = 0; poff < thp_size(page); poff += ret) {
ret = iomap_apply(inode, page_offset(page) + poff,
- PAGE_SIZE - poff, 0, ops, &ctx,
+ thp_size(page) - poff, 0, ops, &ctx,
iomap_readpage_actor);
if (ret <= 0) {
WARN_ON_ONCE(ret == 0);
@@ -360,7 +370,8 @@ iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
loff_t done, ret;

for (done = 0; done < length; done += ret) {
- if (ctx->cur_page && offset_in_page(pos + done) == 0) {
+ if (ctx->cur_page &&
+ offset_in_thp(ctx->cur_page, pos + done) == 0) {
if (!ctx->cur_page_in_bio)
unlock_page(ctx->cur_page);
put_page(ctx->cur_page);
--
2.28.0

2020-08-24 15:28:59

by Matthew Wilcox

Subject: [PATCH 03/11] mm: Zero the head page, not the tail page

Pass the head page to zero_user_segment(), not the tail page, and adjust
the byte offsets appropriately.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/shmem.c | 7 +++++++
mm/truncate.c | 7 +++++++
2 files changed, 14 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..77982149b437 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -958,11 +958,18 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
struct page *page = NULL;
shmem_getpage(inode, start - 1, &page, SGP_READ);
if (page) {
+ struct page *head = thp_head(page);
unsigned int top = PAGE_SIZE;
if (start > end) {
top = partial_end;
partial_end = 0;
}
+ if (head != page) {
+ unsigned int diff = start - 1 - head->index;
+ partial_start += diff << PAGE_SHIFT;
+ top += diff << PAGE_SHIFT;
+ page = head;
+ }
zero_user_segment(page, partial_start, top);
set_page_dirty(page);
unlock_page(page);
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..152974888124 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -374,12 +374,19 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (partial_start) {
struct page *page = find_lock_page(mapping, start - 1);
if (page) {
+ struct page *head = thp_head(page);
unsigned int top = PAGE_SIZE;
if (start > end) {
/* Truncation within a single page */
top = partial_end;
partial_end = 0;
}
+ if (head != page) {
+ unsigned int diff = start - 1 - head->index;
+ partial_start += diff << PAGE_SHIFT;
+ top += diff << PAGE_SHIFT;
+ page = head;
+ }
wait_on_page_writeback(page);
zero_user_segment(page, partial_start, top);
cleancache_invalidate_page(mapping, page);
--
2.28.0

2020-08-24 15:29:00

by Matthew Wilcox

Subject: [PATCH 05/11] iomap: Support THPs in iomap_adjust_read_range

Pass the struct page instead of the iomap_page so we can determine the
size of the page. Use offset_in_thp() instead of offset_in_page() and
use thp_size() instead of PAGE_SIZE. Convert the arguments to be size_t
instead of unsigned int, in case pages ever get larger than 2^31 bytes.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
fs/iomap/buffered-io.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2dba054095e8..5cc0343b6a8e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -76,16 +76,16 @@ iomap_page_release(struct page *page)
/*
* Calculate the range inside the page that we actually need to read.
*/
-static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
- loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
+static void iomap_adjust_read_range(struct inode *inode, struct page *page,
+ loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
{
+ struct iomap_page *iop = to_iomap_page(page);
loff_t orig_pos = *pos;
loff_t isize = i_size_read(inode);
unsigned block_bits = inode->i_blkbits;
unsigned block_size = (1 << block_bits);
- unsigned poff = offset_in_page(*pos);
- unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+ size_t poff = offset_in_thp(page, *pos);
+ size_t plen = min_t(loff_t, thp_size(page) - poff, length);
unsigned first = poff >> block_bits;
unsigned last = (poff + plen - 1) >> block_bits;

@@ -123,7 +123,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
* page cache for blocks that are entirely outside of i_size.
*/
if (orig_pos <= isize && orig_pos + length > isize) {
- unsigned end = offset_in_page(isize - 1) >> block_bits;
+ unsigned end = offset_in_thp(page, isize - 1) >> block_bits;

if (first <= end && last > end)
plen -= (last - end) * block_size;
@@ -234,7 +234,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap_page *iop = iomap_page_create(inode, page);
bool same_page = false, is_contig = false;
loff_t orig_pos = pos;
- unsigned poff, plen;
+ size_t poff, plen;
sector_t sector;

if (iomap->type == IOMAP_INLINE) {
@@ -244,7 +244,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
}

/* zero post-eof blocks as the page may be mapped */
- iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+ iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
if (plen == 0)
goto done;

@@ -550,18 +550,19 @@ static int
__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
struct page *page, struct iomap *srcmap)
{
- struct iomap_page *iop = iomap_page_create(inode, page);
loff_t block_size = i_blocksize(inode);
loff_t block_start = pos & ~(block_size - 1);
loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
- unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+ unsigned from = offset_in_page(pos), to = from + len;
+ size_t poff, plen;
int status;

if (PageUptodate(page))
return 0;
+ iomap_page_create(inode, page);

do {
- iomap_adjust_read_range(inode, iop, &block_start,
+ iomap_adjust_read_range(inode, page, &block_start,
block_end - block_start, &poff, &plen);
if (plen == 0)
break;
--
2.28.0

2020-08-25 12:50:21

by William Kucharski

Subject: Re: [PATCH 00/11] iomap/fs/block patches for 5.11

Really nice improvements here.

Reviewed-by: William Kucharski <[email protected]>

> On Aug 24, 2020, at 9:16 AM, Matthew Wilcox (Oracle) <[email protected]> wrote:
>
> As promised earlier [1], here are the patches which I would like to
> merge into 5.11 to support THPs. They depend on that earlier series.
> If there's anything in here that you'd like to see pulled out and added
> to that earlier series, let me know.
>
> There are a couple of pieces in here which aren't exactly part of
> iomap, but I think make sense to take through the iomap tree.
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Matthew Wilcox (Oracle) (11):
> fs: Make page_mkwrite_check_truncate thp-aware
> mm: Support THPs in zero_user_segments
> mm: Zero the head page, not the tail page
> block: Add bio_for_each_thp_segment_all
> iomap: Support THPs in iomap_adjust_read_range
> iomap: Support THPs in invalidatepage
> iomap: Support THPs in read paths
> iomap: Change iomap_write_begin calling convention
> iomap: Support THPs in write paths
> iomap: Inline data shouldn't see THPs
> iomap: Handle tail pages in iomap_page_mkwrite
>
> fs/iomap/buffered-io.c | 178 ++++++++++++++++++++++++----------------
> include/linux/bio.h | 13 +++
> include/linux/bvec.h | 27 ++++++
> include/linux/highmem.h | 15 +++-
> include/linux/pagemap.h | 10 +--
> mm/highmem.c | 62 +++++++++++++-
> mm/shmem.c | 7 ++
> mm/truncate.c | 7 ++
> 8 files changed, 236 insertions(+), 83 deletions(-)
>
> --
> 2.28.0
>
>

2020-08-27 08:47:38

by Christoph Hellwig

Subject: Re: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

On Mon, Aug 24, 2020 at 04:16:53PM +0100, Matthew Wilcox (Oracle) wrote:
> Iterate once for each THP instead of once for each base page.

FYI, I've always been wondering if bio_for_each_segment_all is the
right interface for the I/O completions, because we generally don't
need the fake bvecs for each page. Only the first page can have an
offset, and only the last page can end earlier than the end of the page.

It would seem way more efficient to just have a helper that extracts
the offset and end, and just use that in a loop that does the way
cheaper iteration over the physical addresses only. This might (or
might not) be a good time to switch to that model for iomap.

2020-08-31 22:47:36

by Matthew Wilcox

Subject: Re: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

On Thu, Aug 27, 2020 at 09:44:31AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 24, 2020 at 04:16:53PM +0100, Matthew Wilcox (Oracle) wrote:
> > Iterate once for each THP instead of once for each base page.
>
> FYI, I've always been wondering if bio_for_each_segment_all is the
> right interface for the I/O completions, because we generally don't
> need the fake bvecs for each page. Only the first page can have an
> offset, and only the last page can end earlier than the end of the page.
>
> It would seem way more efficient to just have a helper that extracts
> the offset and end, and just use that in a loop that does the way
> cheaper iteration over the physical addresses only. This might (or
> might not) be a good time to switch to that model for iomap.

Something like this?

static void iomap_read_end_io(struct bio *bio)
{
int i, error = blk_status_to_errno(bio->bi_status);

for (i = 0; i < bio->bi_vcnt; i++) {
struct bio_vec *bvec = &bio->bi_io_vec[i];
size_t offset = bvec->bv_offset;
size_t length = bvec->bv_len;
struct page *page = bvec->bv_page;

while (length > 0) {
size_t count = thp_size(page) - offset;

if (count > length)
count = length;
iomap_read_page_end_io(page, offset, count, error);
page += (offset + count) / PAGE_SIZE;
length -= count;
offset = 0;
}
}

bio_put(bio);
}

Maybe I'm missing something important here, but it's significantly
simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
(256 bytes less!). iomap_read_page_end_io() is inlined into it both before
and after.

There is some weirdness going on with regards to bv_offset that I don't
quite understand. In the original bvec_advance:

bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;

which I cargo-culted into bvec_thp_advance as:

bv->bv_page = thp_head(bvec->bv_page +
(bvec->bv_offset >> PAGE_SHIFT));
page_size = thp_size(bv->bv_page);
bv->bv_offset = bvec->bv_offset -
(bv->bv_page - bvec->bv_page) * PAGE_SIZE;

Is it possible to have a bvec with an offset that is larger than the
size of bv_page? That doesn't seem like a useful thing to do, but
if that needs to be supported, then the code up top doesn't do that.
We maybe gain a little bit by counting length down to 0 instead of
counting it up to bv_len. I dunno; reading the code over now, it
doesn't seem like that much of a difference.

Maybe you meant something different?

2020-09-01 05:36:02

by Christoph Hellwig

Subject: Re: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

On Mon, Aug 31, 2020 at 08:48:37PM +0100, Matthew Wilcox wrote:
> static void iomap_read_end_io(struct bio *bio)
> {
> int i, error = blk_status_to_errno(bio->bi_status);
>
> for (i = 0; i < bio->bi_vcnt; i++) {
> struct bio_vec *bvec = &bio->bi_io_vec[i];

This should probably use bio_for_each_bvec_all instead of directly
poking into the bio. I'd also be tempted to move the loop body into
a separate helper, but that's just a slight stylistic preference.

> size_t offset = bvec->bv_offset;
> size_t length = bvec->bv_len;
> struct page *page = bvec->bv_page;
>
> while (length > 0) {
> size_t count = thp_size(page) - offset;
>
> if (count > length)
> count = length;
> iomap_read_page_end_io(page, offset, count, error);
> page += (offset + count) / PAGE_SIZE;

Shouldn't the page_size here be thp_size?

> Maybe I'm missing something important here, but it's significantly
> simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
> (256 bytes less!) iomap_read_page_end_io is inlined into it both before
> and after.

Yes, that's exactly why I think avoiding bio_for_each_segment_all is
a good idea in general.

> There is some weirdness going on with regards to bv_offset that I don't
> quite understand. In the original bvec_advance:
>
> bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
> bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;
>
> which I cargo-culted into bvec_thp_advance as:
>
> bv->bv_page = thp_head(bvec->bv_page +
> (bvec->bv_offset >> PAGE_SHIFT));
> page_size = thp_size(bv->bv_page);
> bv->bv_offset = bvec->bv_offset -
> (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
>
> Is it possible to have a bvec with an offset that is larger than the
> size of bv_page? That doesn't seem like a useful thing to do, but
> if that needs to be supported, then the code up top doesn't do that.
> We maybe gain a little bit by counting length down to 0 instead of
> counting it up to bv_len. I dunno; reading the code over now, it
> doesn't seem like that much of a difference.

Drivers can absolutely see a bv_offset that is larger due to bio
splitting. However the submitting file system should never see one
unless it creates one, which would be stupid.

And yes, eventually bv_page and bv_offset should be replaced with a

phys_addr_t bv_phys;

and life would become simpler in many places (and the bvec would
shrink for most common setups as well).

For now I'd end up with something like:

static void iomap_read_end_bvec(struct page *page, size_t offset,
size_t length, int error)
{
while (length > 0) {
size_t page_size = thp_size(page);
size_t count = min(page_size - offset, length);

iomap_read_page_end_io(page, offset, count, error);

page += (offset + count) / page_size;
length -= count;
offset = 0;
}
}

static void iomap_read_end_io(struct bio *bio)
{
int i, error = blk_status_to_errno(bio->bi_status);
struct bio_vec *bvec;

bio_for_each_bvec_all(bvec, bio, i)
iomap_read_end_bvec(bvec->bv_page, bvec->bv_offset,
bvec->bv_len, error);
bio_put(bio);
}

and maybe even merge iomap_read_page_end_io into iomap_read_end_bvec.

2020-09-01 13:16:47

by Matthew Wilcox

Subject: Re: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

On Tue, Sep 01, 2020 at 06:34:26AM +0100, Christoph Hellwig wrote:
> On Mon, Aug 31, 2020 at 08:48:37PM +0100, Matthew Wilcox wrote:
> > static void iomap_read_end_io(struct bio *bio)
> > {
> > int i, error = blk_status_to_errno(bio->bi_status);
> >
> > for (i = 0; i < bio->bi_vcnt; i++) {
> > struct bio_vec *bvec = &bio->bi_io_vec[i];
>
> This should probably use bio_for_each_bvec_all instead of directly
> poking into the bio. I'd also be tempted to move the loop body into
> a separate helper, but that's just a slight stylistic preference.

Ah, got it.

> > size_t offset = bvec->bv_offset;
> > size_t length = bvec->bv_len;
> > struct page *page = bvec->bv_page;
> >
> > while (length > 0) {
> > size_t count = thp_size(page) - offset;
> >
> > if (count > length)
> > count = length;
> > iomap_read_page_end_io(page, offset, count, error);
> > page += (offset + count) / PAGE_SIZE;
>
> Shouldn't the page_size here be thp_size?

No. Let's suppose we have a 20kB I/O which starts on a page boundary and
the first page is order-2. To get from the first head page to the second
page, we need to add 4, which is 16kB / 4kB, not 16kB / 16kB.
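
To make that arithmetic concrete, here's a throwaway userspace sketch
(all sizes are made up for illustration; PAGE_SIZE is assumed to be 4kB):

	#include <stdio.h>

	#define PAGE_SIZE	4096UL

	int main(void)
	{
		unsigned long offset = 0;		/* I/O starts on a page boundary */
		unsigned long length = 20480;		/* 20kB I/O */
		unsigned long head_size = 16384;	/* thp_size() of an order-2 page */
		unsigned long count = head_size - offset;

		if (count > length)
			count = length;
		/* advance in units of base pages, so divide by PAGE_SIZE */
		printf("advance by %lu base pages\n", (offset + count) / PAGE_SIZE);
		return 0;				/* prints "advance by 4" */
	}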

> > Maybe I'm missing something important here, but it's significantly
> > simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes
> > (256 bytes less!) iomap_read_page_end_io is inlined into it both before
> > and after.
>
> Yes, that's exactly why I think avoiding bio_for_each_segment_all is
> a good idea in general.

I took out all the attempts to cope with insane bv_offset to compare like
with like and I got the bio_for_each_thp_segment_all() version down to
656 bytes:

@@ -166,21 +166,15 @@ static inline void bvec_thp_advance(const struct bio_vec *bvec,
struct bvec_iter_all *iter_all)
{
struct bio_vec *bv = &iter_all->bv;
- unsigned int page_size;

if (iter_all->done) {
bv->bv_page += thp_nr_pages(bv->bv_page);
- page_size = thp_size(bv->bv_page);
bv->bv_offset = 0;
} else {
- bv->bv_page = thp_head(bvec->bv_page +
- (bvec->bv_offset >> PAGE_SHIFT));
- page_size = thp_size(bv->bv_page);
- bv->bv_offset = bvec->bv_offset -
- (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
- BUG_ON(bv->bv_offset >= page_size);
+ bv->bv_page = bvec->bv_page;
+ bv->bv_offset = bvec->bv_offset;
}
- bv->bv_len = min(page_size - bv->bv_offset,
+ bv->bv_len = min_t(unsigned int, thp_size(bv->bv_page) - bv->bv_offset,
bvec->bv_len - iter_all->done);
iter_all->done += bv->bv_len;

> And yes, eventually bv_page and bv_offset should be replaced with a
>
> phys_addr_t bv_phys;
>
> and life would become simpler in many places (and the bvec would
> shrink for most common setups as well).

I'd very much like to see that. It causes quite considerable pain for
our virtualisation people that we need a struct page. They'd like the
hypervisor to not have struct pages for the guest's memory, but if they
don't have them, they can't do I/O to them. Perhaps I'll try getting
one of them to work on this.

I'm not entirely sure the bvec would shrink. On 64-bit systems, it's
currently 8 bytes for the struct page, 4 bytes for the len and 4 bytes
for the offset. Sure, we can get rid of the offset, but the compiler
will just pad the struct from 12 bytes back to 16. On 32-bit systems
with 32-bit phys_addr_t, we go from 12 bytes down to 8, but most 32-bit
systems have a 64-bit phys_addr_t these days, don't they?
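
To check the padding point, here's a throwaway userspace sketch (the
struct names are invented and an LP64 ABI is assumed):

	#include <stdio.h>

	/* today: page pointer + len + offset */
	struct bvec_today { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };
	/* drop bv_offset: 12 bytes of fields, padded back to 16 */
	struct bvec_no_offset { void *bv_page; unsigned int bv_len; };
	/* phys_addr_t-style: same story on 64-bit */
	struct bvec_phys { unsigned long long bv_phys; unsigned int bv_len; };

	int main(void)
	{
		printf("%zu / %zu / %zu\n", sizeof(struct bvec_today),
		       sizeof(struct bvec_no_offset), sizeof(struct bvec_phys));
		return 0;	/* prints "16 / 16 / 16" on LP64 */
	}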

> For now I'd end up with something like:
>
> static void iomap_read_end_bvec(struct page *page, size_t offset,
> size_t length, int error)
> {
> while (length > 0) {
> size_t page_size = thp_size(page);
> size_t count = min(page_size - offset, length);
>
> iomap_read_page_end_io(page, offset, count, error);
>
> page += (offset + count) / page_size;
> length -= count;
> offset = 0;
> }
> }
>
> static void iomap_read_end_io(struct bio *bio)
> {
> int i, error = blk_status_to_errno(bio->bi_status);
> struct bio_vec *bvec;
>
> bio_for_each_bvec_all(bvec, bio, i)
> iomap_read_end_bvec(bvec->bv_page, bvec->bv_offset,
> bvec->bv_len, error;
> bio_put(bio);
> }
>
> and maybe even merge iomap_read_page_end_io into iomap_read_end_bvec.

The lines start to get a bit long. Here's what I currently have on
the write side:

@@ -1104,6 +1117,20 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
end_page_writeback(page);
}

+static void iomap_finish_bvec_writeback(struct inode *inode, struct page *page,
+ size_t offset, size_t length, int error)
+{
+ while (length > 0) {
+ size_t count = min(thp_size(page) - offset, length);
+
+ iomap_finish_page_writeback(inode, page, error, count);
+
+ page += (offset + count) / PAGE_SIZE;
+ offset = 0;
+ length -= count;
+ }
+}
+
/*
* We're now finished for good with this ioend structure. Update the page
* state, release holds on bios, and finally free up memory. Do not use the
@@ -1121,7 +1148,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)

for (bio = &ioend->io_inline_bio; bio; bio = next) {
struct bio_vec *bv;
- struct bvec_iter_all iter_all;
+ int i;

/*
* For the last bio, bi_private points to the ioend, so we
@@ -1133,9 +1160,9 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
next = bio->bi_private;

/* walk each page on bio, ending page IO on them */
- bio_for_each_thp_segment_all(bv, bio, iter_all)
- iomap_finish_page_writeback(inode, bv->bv_page, error,
- bv->bv_len);
+ bio_for_each_bvec_all(bv, bio, i)
+ iomap_finish_bvec_writeback(inode, bv->bv_page,
+ bv->bv_offset, bv->bv_len, error);
bio_put(bio);
}
/* The ioend has been freed by bio_put() */

That's a bit more boilerplate than I'd like, but if bio_vec is going to
lose its bv_page then I don't see a better way. Unless we come up with
a different page/offset/length struct that bio_vecs are decomposed into.

2020-09-01 14:54:35

by Christoph Hellwig

Subject: Re: [PATCH 04/11] block: Add bio_for_each_thp_segment_all

On Tue, Sep 01, 2020 at 02:05:25PM +0100, Matthew Wilcox wrote:
> > > struct page *page = bvec->bv_page;
> > >
> > > while (length > 0) {
> > > size_t count = thp_size(page) - offset;
> > >
> > > if (count > length)
> > > count = length;
> > > iomap_read_page_end_io(page, offset, count, error);
> > > page += (offset + count) / PAGE_SIZE;
> >
> > Shouldn't the page_size here be thp_size?
>
> No. Let's suppose we have a 20kB I/O which starts on a page boundary and
> the first page is order-2. To get from the first head page to the second
> page, we need to add 4, which is 16kB / 4kB, not 16kB / 16kB.

True.

> I'm not entirely sure the bvec would shrink. On 64-bit systems, it's
> currently 8 bytes for the struct page, 4 bytes for the len and 4 bytes
> for the offset. Sure, we can get rid of the offset, but the compiler
> will just pad the struct from 12 bytes back to 16. On 32-bit systems
> with 32-bit phys_addr_t, we go from 12 bytes down to 8, but most 32-bit
> systems have a 64-bit phys_addr_t these days, don't they?

Actually, on those systems that are still 32-bit because they are so
tiny I'd very much still expect a 32-bit phys_addr_t, e.g. arm
without LPAE or 32-bit RISC-V.

But yeah, point taken on the alignment for the 64-bit ones.

> That's a bit more boilerplate than I'd like, but if bio_vec is going to
> lose its bv_page then I don't see a better way. Unless we come up with
> a different page/offset/length struct that bio_vecs are decomposed into.

I'm not sure it is going to lose bv_page any time soon. I'd sure like
it to, but the last time something like that came up Linus wasn't entirely
in favor. Things might have changed now, though, and I think it is about
time to give it another try.