2019-10-16 12:26:46

by Ritesh Harjani

[permalink] [raw]
Subject: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

This patch series adds the support for blocksize < pagesize for
dioread_nolock feature.

Since in case of blocksize < pagesize, we can have multiple
small buffers of page as unwritten extents, we need to
maintain a vector of these unwritten extents which needs
the conversion after the IO is complete. Thus, we maintain
a list of tuple <offset, size> pair (io_end_vec) for this &
traverse this list to do the unwritten to written conversion.

Appreciate any reviews/comments on this patches.

Tests completed
===============

All (which also passes in default config) "quick" group xfstests
are passing. Tested xfstests with below configurations.
dioread_nolock with blocksize < pagesize
dioread_nolock with blocksize == pagesize
without dioread_nolock with blocksize < pagesize
without dioread_nolock with blocksize == pagesize
ltp/fsx test with multiple iterations of 1 million ops
did not show any error.


About patches
=============

Patch 1 - 3: These are some cleanup and refactoring patches.
Patch 4: This patch adds the required support.
Patch 5: This patch removes the checks which was not allowing to mount
with dioread_nolock when blocksize != pagesize was true.

_Patches can be cleanly applied on today's linus tree master branch_

Ritesh Harjani (5):
ext4: keep uniform naming convention for io & io_end variables
ext4: Add API to bring in support for unwritten io_end_vec conversion
ext4: Refactor mpage_map_and_submit_buffers function
ext4: Add support for blocksize < pagesize in dioread_nolock
ext4: Enable blocksize < pagesize for dioread_nolock

fs/ext4/ext4.h | 13 ++++-
fs/ext4/extents.c | 49 ++++++++++++-----
fs/ext4/inode.c | 136 ++++++++++++++++++++++++++++++----------------
fs/ext4/page-io.c | 110 ++++++++++++++++++++++++-------------
fs/ext4/super.c | 10 ----
5 files changed, 208 insertions(+), 110 deletions(-)

--
2.21.0


2019-10-16 12:26:46

by Ritesh Harjani

[permalink] [raw]
Subject: [RFC 1/5] ext4: keep uniform naming convention for io & io_end variables

Let's keep uniform naming convention for ext4_submit_io (io)
& ext4_end_io_t (io_end) structures, to avoid any confusion.
No functionality change in this patch.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/page-io.c | 55 ++++++++++++++++++++++++-----------------------
1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 12ceadef32c5..d9b96fc976a3 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -136,19 +136,19 @@ static void ext4_release_io_end(ext4_io_end_t *io_end)
* cannot get to ext4_ext_truncate() before all IOs overlapping that range are
* completed (happens from ext4_free_ioend()).
*/
-static int ext4_end_io(ext4_io_end_t *io)
+static int ext4_end_io_end(ext4_io_end_t *io_end)
{
- struct inode *inode = io->inode;
- loff_t offset = io->offset;
- ssize_t size = io->size;
- handle_t *handle = io->handle;
+ struct inode *inode = io_end->inode;
+ loff_t offset = io_end->offset;
+ ssize_t size = io_end->size;
+ handle_t *handle = io_end->handle;
int ret = 0;

- ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
+ ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %lu,list->next 0x%p,"
"list->prev 0x%p\n",
- io, inode->i_ino, io->list.next, io->list.prev);
+ io_end, inode->i_ino, io_end->list.next, io_end->list.prev);

- io->handle = NULL; /* Following call will use up the handle */
+ io_end->handle = NULL; /* Following call will use up the handle */
ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
ext4_msg(inode->i_sb, KERN_EMERG,
@@ -157,8 +157,8 @@ static int ext4_end_io(ext4_io_end_t *io)
"(inode %lu, offset %llu, size %zd, error %d)",
inode->i_ino, offset, size, ret);
}
- ext4_clear_io_unwritten_flag(io);
- ext4_release_io_end(io);
+ ext4_clear_io_unwritten_flag(io_end);
+ ext4_release_io_end(io_end);
return ret;
}

@@ -166,21 +166,21 @@ static void dump_completed_IO(struct inode *inode, struct list_head *head)
{
#ifdef EXT4FS_DEBUG
struct list_head *cur, *before, *after;
- ext4_io_end_t *io, *io0, *io1;
+ ext4_io_end_t *io_end, *io_end0, *io_end1;

if (list_empty(head))
return;

ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
- list_for_each_entry(io, head, list) {
- cur = &io->list;
+ list_for_each_entry(io_end, head, list) {
+ cur = &io_end->list;
before = cur->prev;
- io0 = container_of(before, ext4_io_end_t, list);
+ io_end0 = container_of(before, ext4_io_end_t, list);
after = cur->next;
- io1 = container_of(after, ext4_io_end_t, list);
+ io_end1 = container_of(after, ext4_io_end_t, list);

ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
- io, inode->i_ino, io0, io1);
+ io_end, inode->i_ino, io_end0, io_end1);
}
#endif
}
@@ -207,7 +207,7 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
static int ext4_do_flush_completed_IO(struct inode *inode,
struct list_head *head)
{
- ext4_io_end_t *io;
+ ext4_io_end_t *io_end;
struct list_head unwritten;
unsigned long flags;
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -219,11 +219,11 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);

while (!list_empty(&unwritten)) {
- io = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
- list_del_init(&io->list);
+ io_end = list_entry(unwritten.next, ext4_io_end_t, list);
+ BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
+ list_del_init(&io_end->list);

- err = ext4_end_io(io);
+ err = ext4_end_io_end(io_end);
if (unlikely(!ret && err))
ret = err;
}
@@ -242,13 +242,14 @@ void ext4_end_io_rsv_work(struct work_struct *work)

ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
{
- ext4_io_end_t *io = kmem_cache_zalloc(io_end_cachep, flags);
- if (io) {
- io->inode = inode;
- INIT_LIST_HEAD(&io->list);
- atomic_set(&io->count, 1);
+ ext4_io_end_t *io_end = kmem_cache_zalloc(io_end_cachep, flags);
+
+ if (io_end) {
+ io_end->inode = inode;
+ INIT_LIST_HEAD(&io_end->list);
+ atomic_set(&io_end->count, 1);
}
- return io;
+ return io_end;
}

void ext4_put_io_end_defer(ext4_io_end_t *io_end)
--
2.21.0

2019-10-16 12:26:46

by Ritesh Harjani

[permalink] [raw]
Subject: [RFC 2/5] ext4: Add API to bring in support for unwritten io_end_vec conversion

This patch just brings in the API for conversion of unwritten io_end_vec
extents which will be required for blocksize < pagesize support
for dioread_nolock feature.

No functional changes in this patch.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/extents.c | 42 +++++++++++++++++++++++++++---------------
fs/ext4/page-io.c | 7 +++----
3 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..8d924bd19ca7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3264,6 +3264,8 @@ extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
loff_t len);
extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
loff_t offset, ssize_t len);
+extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
+ ext4_io_end_t *io_end);
extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
extern int ext4_ext_calc_metadata_amount(struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fb0f99dc8c22..731e67ccab22 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4962,23 +4962,13 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
int ret = 0;
int ret2 = 0;
struct ext4_map_blocks map;
- unsigned int credits, blkbits = inode->i_blkbits;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int credits = 0;

map.m_lblk = offset >> blkbits;
max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);

- /*
- * This is somewhat ugly but the idea is clear: When transaction is
- * reserved, everything goes into it. Otherwise we rather start several
- * smaller transactions for conversion of each extent separately.
- */
- if (handle) {
- handle = ext4_journal_start_reserved(handle,
- EXT4_HT_EXT_CONVERT);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
- credits = 0;
- } else {
+ if (!handle) {
/*
* credits to insert 1 extent into extent tree
*/
@@ -5009,11 +4999,33 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
if (ret <= 0 || ret2)
break;
}
- if (!credits)
- ret2 = ext4_journal_stop(handle);
return ret > 0 ? ret2 : ret;
}

+int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end)
+{
+ int ret, err = 0;
+
+ /*
+ * This is somewhat ugly but the idea is clear: When transaction is
+ * reserved, everything goes into it. Otherwise we rather start several
+ * smaller transactions for conversion of each extent separately.
+ */
+ if (handle) {
+ handle = ext4_journal_start_reserved(handle,
+ EXT4_HT_EXT_CONVERT);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ }
+
+ ret = ext4_convert_unwritten_extents(handle, io_end->inode,
+ io_end->offset, io_end->size);
+ if (handle)
+ err = ext4_journal_stop(handle);
+
+ return ret < 0 ? ret : err;
+}
+
/*
* If newes is not existing extent (newes->ec_pblk equals zero) find
* delayed extent at start of newes and update newes accordingly and
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index d9b96fc976a3..3ccf54a380b2 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -149,7 +149,7 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
io_end, inode->i_ino, io_end->list.next, io_end->list.prev);

io_end->handle = NULL; /* Following call will use up the handle */
- ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
+ ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
ext4_msg(inode->i_sb, KERN_EMERG,
"failed to convert unwritten extents to written "
@@ -269,9 +269,8 @@ int ext4_put_io_end(ext4_io_end_t *io_end)

if (atomic_dec_and_test(&io_end->count)) {
if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
- err = ext4_convert_unwritten_extents(io_end->handle,
- io_end->inode, io_end->offset,
- io_end->size);
+ err = ext4_convert_unwritten_io_end_vec(io_end->handle,
+ io_end);
io_end->handle = NULL;
ext4_clear_io_unwritten_flag(io_end);
}
--
2.21.0

2019-10-16 12:26:49

by Ritesh Harjani

[permalink] [raw]
Subject: [RFC 4/5] ext4: Add support for blocksize < pagesize in dioread_nolock

This patch adds the support for blocksize < pagesize for
dioread_nolock feature.

Since in case of blocksize < pagesize, we can have multiple
small buffers of page as unwritten extents, we need to
maintain a vector of these unwritten extents which needs
the conversion after the IO is complete. Thus, we maintain
a list of tuple <offset, size> pair (io_end_vec) for this &
traverse this list to do the unwritten to written conversion.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 11 ++++++++--
fs/ext4/extents.c | 11 ++++++++--
fs/ext4/inode.c | 37 +++++++++++++++++----------------
fs/ext4/page-io.c | 52 +++++++++++++++++++++++++++++++++++++++--------
4 files changed, 82 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8d924bd19ca7..3616f1b0c987 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -198,6 +198,12 @@ struct ext4_system_blocks {
*/
#define EXT4_IO_END_UNWRITTEN 0x0001

+struct ext4_io_end_vec {
+ struct list_head list; /* list of io_end_vec */
+ loff_t offset; /* offset in the file */
+ ssize_t size; /* size of the extent */
+};
+
/*
* For converting unwritten extents on a work queue. 'handle' is used for
* buffered writeback.
@@ -211,8 +217,7 @@ typedef struct ext4_io_end {
* bios covering the extent */
unsigned int flag; /* unwritten or not */
atomic_t count; /* reference counter */
- loff_t offset; /* offset in the file */
- ssize_t size; /* size of the extent */
+ struct list_head list_vec; /* list of ext4_io_end_vec */
} ext4_io_end_t;

struct ext4_io_submit {
@@ -3324,6 +3329,8 @@ extern int ext4_bio_write_page(struct ext4_io_submit *io,
int len,
struct writeback_control *wbc,
bool keep_towrite);
+extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
+extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);

/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 731e67ccab22..cf6c5f64cb58 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5005,6 +5005,7 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end)
{
int ret, err = 0;
+ struct ext4_io_end_vec *io_end_vec;

/*
* This is somewhat ugly but the idea is clear: When transaction is
@@ -5018,8 +5019,14 @@ int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end)
return PTR_ERR(handle);
}

- ret = ext4_convert_unwritten_extents(handle, io_end->inode,
- io_end->offset, io_end->size);
+ list_for_each_entry(io_end_vec, &io_end->list_vec, list) {
+ ret = ext4_convert_unwritten_extents(handle, io_end->inode,
+ io_end_vec->offset,
+ io_end_vec->size);
+ if (ret)
+ break;
+ }
+
if (handle)
err = ext4_journal_stop(handle);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3204a7d35581..026666bdc058 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2364,6 +2364,9 @@ static int mpage_process_page(struct mpage_da_data *mpd, struct page *page,
ext4_lblk_t lblk = *m_lblk;
ext4_fsblk_t pblock = *m_pblk;
int err = 0;
+ int blkbits = mpd->inode->i_blkbits;
+ ssize_t io_end_size = 0;
+ struct ext4_io_end_vec *io_end_vec = ext4_last_io_end_vec(io_end);

bh = head = page_buffers(page);
do {
@@ -2376,17 +2379,16 @@ static int mpage_process_page(struct mpage_da_data *mpd, struct page *page,
*/
mpd->map.m_len = 0;
mpd->map.m_flags = 0;
+ io_end_vec->size += io_end_size;
+ io_end_size = 0;

- /*
- * FIXME: If dioread_nolock supports
- * blocksize < pagesize, we need to make
- * sure we add size mapped so far to
- * io_end->size as the following call
- * can submit the page for IO.
- */
err = mpage_process_page_bufs(mpd, head, bh, lblk);
if (err > 0)
err = 0;
+ if (!err && mpd->map.m_len && mpd->map.m_lblk > lblk) {
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ io_end_vec->offset = mpd->map.m_lblk << blkbits;
+ }
*map_bh = true;
goto out;
}
@@ -2395,13 +2397,11 @@ static int mpage_process_page(struct mpage_da_data *mpd, struct page *page,
bh->b_blocknr = pblock++;
}
clear_buffer_unwritten(bh);
+ io_end_size += (1 << blkbits);
} while (lblk++, (bh = bh->b_this_page) != head);
- /*
- * FIXME: This is going to break if dioread_nolock
- * supports blocksize < pagesize as we will try to
- * convert potentially unmapped parts of inode.
- */
- io_end->size += PAGE_SIZE;
+
+ io_end_vec->size += io_end_size;
+ io_end_size = 0;
*map_bh = false;
out:
*m_lblk = lblk;
@@ -2551,9 +2551,10 @@ static int mpage_map_and_submit_extent(handle_t *handle,
int err;
loff_t disksize;
int progress = 0;
+ ext4_io_end_t *io_end = mpd->io_submit.io_end;
+ struct ext4_io_end_vec *io_end_vec = ext4_alloc_io_end_vec(io_end);

- mpd->io_submit.io_end->offset =
- ((loff_t)map->m_lblk) << inode->i_blkbits;
+ io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
do {
err = mpage_map_one_extent(handle, mpd);
if (err < 0) {
@@ -3654,6 +3655,7 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
{
ext4_io_end_t *io_end = private;
+ struct ext4_io_end_vec *io_end_vec;

/* if not async direct IO just return */
if (!io_end)
@@ -3671,8 +3673,9 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ext4_clear_io_unwritten_flag(io_end);
size = 0;
}
- io_end->offset = offset;
- io_end->size = size;
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ io_end_vec->offset = offset;
+ io_end_vec->size = size;
ext4_put_io_end(io_end);

return 0;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 3ccf54a380b2..893913d1c2fe 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -31,18 +31,56 @@
#include "acl.h"

static struct kmem_cache *io_end_cachep;
+static struct kmem_cache *io_end_vec_cachep;

int __init ext4_init_pageio(void)
{
io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
if (io_end_cachep == NULL)
return -ENOMEM;
+
+ io_end_vec_cachep = KMEM_CACHE(ext4_io_end_vec, 0);
+ if (io_end_vec_cachep == NULL) {
+ kmem_cache_destroy(io_end_cachep);
+ return -ENOMEM;
+ }
return 0;
}

void ext4_exit_pageio(void)
{
kmem_cache_destroy(io_end_cachep);
+ kmem_cache_destroy(io_end_vec_cachep);
+}
+
+struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end)
+{
+ struct ext4_io_end_vec *io_end_vec;
+
+ io_end_vec = kmem_cache_zalloc(io_end_vec_cachep, GFP_NOFS);
+ if (!io_end_vec)
+ return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&io_end_vec->list);
+ list_add_tail(&io_end_vec->list, &io_end->list_vec);
+ return io_end_vec;
+}
+
+static void ext4_free_io_end_vec(ext4_io_end_t *io_end)
+{
+ struct ext4_io_end_vec *io_end_vec, *tmp;
+
+ if (list_empty(&io_end->list_vec))
+ return;
+ list_for_each_entry_safe(io_end_vec, tmp, &io_end->list_vec, list) {
+ list_del(&io_end_vec->list);
+ kmem_cache_free(io_end_vec_cachep, io_end_vec);
+ }
+}
+
+struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end)
+{
+ BUG_ON(list_empty(&io_end->list_vec));
+ return list_last_entry(&io_end->list_vec, struct ext4_io_end_vec, list);
}

/*
@@ -125,6 +163,7 @@ static void ext4_release_io_end(ext4_io_end_t *io_end)
ext4_finish_bio(bio);
bio_put(bio);
}
+ ext4_free_io_end_vec(io_end);
kmem_cache_free(io_end_cachep, io_end);
}

@@ -139,8 +178,6 @@ static void ext4_release_io_end(ext4_io_end_t *io_end)
static int ext4_end_io_end(ext4_io_end_t *io_end)
{
struct inode *inode = io_end->inode;
- loff_t offset = io_end->offset;
- ssize_t size = io_end->size;
handle_t *handle = io_end->handle;
int ret = 0;

@@ -154,8 +191,7 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
ext4_msg(inode->i_sb, KERN_EMERG,
"failed to convert unwritten extents to written "
"extents -- potential data loss! "
- "(inode %lu, offset %llu, size %zd, error %d)",
- inode->i_ino, offset, size, ret);
+ "(inode %lu, error %d)", inode->i_ino, ret);
}
ext4_clear_io_unwritten_flag(io_end);
ext4_release_io_end(io_end);
@@ -247,6 +283,7 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
if (io_end) {
io_end->inode = inode;
INIT_LIST_HEAD(&io_end->list);
+ INIT_LIST_HEAD(&io_end->list_vec);
atomic_set(&io_end->count, 1);
}
return io_end;
@@ -255,7 +292,8 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
void ext4_put_io_end_defer(ext4_io_end_t *io_end)
{
if (atomic_dec_and_test(&io_end->count)) {
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
+ if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
+ list_empty(&io_end->list_vec)) {
ext4_release_io_end(io_end);
return;
}
@@ -307,10 +345,8 @@ static void ext4_end_bio(struct bio *bio)
struct inode *inode = io_end->inode;

ext4_warning(inode->i_sb, "I/O error %d writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
+ "starting block %llu)",
bio->bi_status, inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
(unsigned long long)
bi_sector >> (inode->i_blkbits - 9));
mapping_set_error(inode->i_mapping,
--
2.21.0

2019-10-16 12:27:10

by Ritesh Harjani

[permalink] [raw]
Subject: [RFC 3/5] ext4: Refactor mpage_map_and_submit_buffers function

This patch refactors mpage_map_and_submit_buffers to take
out the page buffers processing, as a separate function.
This will be required to add support for blocksize < pagesize
for dioread_nolock feature.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 125 ++++++++++++++++++++++++++++++++----------------
1 file changed, 83 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 123e3dee7733..3204a7d35581 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2340,6 +2340,75 @@ static int mpage_process_page_bufs(struct mpage_da_data *mpd,
return lblk < blocks;
}

+/*
+ * mpage_process_page - update page buffers corresponding to changed extent and
+ * may submit fully mapped page for IO
+ *
+ * @mpd - description of extent to map, on return next extent to map
+ * @m_lblk - logical block mapping.
+ * @m_pblk - corresponding physical mapping.
+ * @map_bh - determines on return whether this page requires any further
+ * mapping or not.
+ * Scan given page buffers corresponding to changed extent and update buffer
+ * state according to new extent state.
+ * We map delalloc buffers to their physical location, clear unwritten bits.
+ * If the given page is not fully mapped, we update @map to the next extent in
+ * the given page that needs mapping & return @map_bh as true.
+ */
+static int mpage_process_page(struct mpage_da_data *mpd, struct page *page,
+ ext4_lblk_t *m_lblk, ext4_fsblk_t *m_pblk,
+ bool *map_bh)
+{
+ struct buffer_head *head, *bh;
+ ext4_io_end_t *io_end = mpd->io_submit.io_end;
+ ext4_lblk_t lblk = *m_lblk;
+ ext4_fsblk_t pblock = *m_pblk;
+ int err = 0;
+
+ bh = head = page_buffers(page);
+ do {
+ if (lblk < mpd->map.m_lblk)
+ continue;
+ if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
+ /*
+ * Buffer after end of mapped extent.
+ * Find next buffer in the page to map.
+ */
+ mpd->map.m_len = 0;
+ mpd->map.m_flags = 0;
+
+ /*
+ * FIXME: If dioread_nolock supports
+ * blocksize < pagesize, we need to make
+ * sure we add size mapped so far to
+ * io_end->size as the following call
+ * can submit the page for IO.
+ */
+ err = mpage_process_page_bufs(mpd, head, bh, lblk);
+ if (err > 0)
+ err = 0;
+ *map_bh = true;
+ goto out;
+ }
+ if (buffer_delay(bh)) {
+ clear_buffer_delay(bh);
+ bh->b_blocknr = pblock++;
+ }
+ clear_buffer_unwritten(bh);
+ } while (lblk++, (bh = bh->b_this_page) != head);
+ /*
+ * FIXME: This is going to break if dioread_nolock
+ * supports blocksize < pagesize as we will try to
+ * convert potentially unmapped parts of inode.
+ */
+ io_end->size += PAGE_SIZE;
+ *map_bh = false;
+out:
+ *m_lblk = lblk;
+ *m_pblk = pblock;
+ return err;
+}
+
/*
* mpage_map_buffers - update buffers corresponding to changed extent and
* submit fully mapped pages for IO
@@ -2359,12 +2428,12 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
struct pagevec pvec;
int nr_pages, i;
struct inode *inode = mpd->inode;
- struct buffer_head *head, *bh;
int bpp_bits = PAGE_SHIFT - inode->i_blkbits;
pgoff_t start, end;
ext4_lblk_t lblk;
- sector_t pblock;
+ ext4_fsblk_t pblock;
int err;
+ bool map_bh = false;

start = mpd->map.m_lblk >> bpp_bits;
end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
@@ -2380,50 +2449,19 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];

- bh = head = page_buffers(page);
- do {
- if (lblk < mpd->map.m_lblk)
- continue;
- if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
- /*
- * Buffer after end of mapped extent.
- * Find next buffer in the page to map.
- */
- mpd->map.m_len = 0;
- mpd->map.m_flags = 0;
- /*
- * FIXME: If dioread_nolock supports
- * blocksize < pagesize, we need to make
- * sure we add size mapped so far to
- * io_end->size as the following call
- * can submit the page for IO.
- */
- err = mpage_process_page_bufs(mpd, head,
- bh, lblk);
- pagevec_release(&pvec);
- if (err > 0)
- err = 0;
- return err;
- }
- if (buffer_delay(bh)) {
- clear_buffer_delay(bh);
- bh->b_blocknr = pblock++;
- }
- clear_buffer_unwritten(bh);
- } while (lblk++, (bh = bh->b_this_page) != head);
-
+ err = mpage_process_page(mpd, page, &lblk, &pblock,
+ &map_bh);
/*
- * FIXME: This is going to break if dioread_nolock
- * supports blocksize < pagesize as we will try to
- * convert potentially unmapped parts of inode.
+ * If map_bh is true, means page may require further bh
+ * mapping, or maybe the page was submitted for IO.
+ * So we return to call further extent mapping.
*/
- mpd->io_submit.io_end->size += PAGE_SIZE;
+ if (err < 0 || map_bh == true)
+ goto out;
/* Page fully mapped - let IO run! */
err = mpage_submit_page(mpd, page);
- if (err < 0) {
- pagevec_release(&pvec);
- return err;
- }
+ if (err < 0)
+ goto out;
}
pagevec_release(&pvec);
}
@@ -2431,6 +2469,9 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
mpd->map.m_len = 0;
mpd->map.m_flags = 0;
return 0;
+out:
+ pagevec_release(&pvec);
+ return err;
}

static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
--
2.21.0

2019-10-16 12:27:14

by Ritesh Harjani

[permalink] [raw]
Subject: [RFC 5/5] ext4: Enable blocksize < pagesize for dioread_nolock

All support is now added for blocksize < pagesize for dioread_nolock.
This patch removes those checks which disables dioread_nolock
feature for blocksize != pagesize.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/super.c | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dd654e53ba3d..7796e2ffc294 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2105,16 +2105,6 @@ static int parse_options(char *options, struct super_block *sb,
}
}
#endif
- if (test_opt(sb, DIOREAD_NOLOCK)) {
- int blocksize =
- BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
-
- if (blocksize < PAGE_SIZE) {
- ext4_msg(sb, KERN_ERR, "can't mount with "
- "dioread_nolock if block size != PAGE_SIZE");
- return 0;
- }
- }
return 1;
}

--
2.21.0

2019-10-24 10:53:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

Hi Ritesh,

I haven't had a chance to dig into the test failures yet, but FYI....
when I ran the auto test group in xfstests, I saw failures for
generic/219, generic 273, and generic/476 --- these errors did not
show up when running using a standard 4k blocksize on x86, and they
also did not show up when running dioread_nolock using a 4k blocksize.

So I tried running "generic/219 generic/273 generic/476" 30 times,
using in a Google Compute Engine VM, using gce-xfstests, and while I
wasn't able to get generic/219 to fail when run in isolation,
generic/273 seems to fail quite reliably, and generic/476 about a
third of the time.

How much testing have you done with these patches?

Thanks,

- Ted

TESTRUNID: tytso-20191023144956
KERNEL: kernel 5.4.0-rc3-xfstests-00005-g39b811602906 #1244 SMP Wed Oct 23 11:30:25 EDT 2019 x86_64
CMDLINE: --update-files -C 30 -c dioread_nolock_1k generic/219 generic/273 generic/476
CPUS: 2
MEM: 7680

ext4/dioread_nolock_1k: 90 tests, 42 failures, 10434 seconds
Failures: generic/273 generic/273 generic/273 generic/273
generic/476 generic/273 generic/476 generic/273 generic/273
generic/273 generic/476 generic/273 generic/476 generic/273
generic/476 generic/273 generic/476 generic/273 generic/273
generic/273 generic/273 generic/273 generic/476 generic/273
generic/273 generic/273 generic/273 generic/476 generic/273
generic/476 generic/273 generic/476 generic/273 generic/273
generic/273 generic/273 generic/273 generic/273 generic/273
generic/476 generic/273 generic/476
Totals: 90 tests, 0 skipped, 42 failures, 0 errors, 10434s

2019-10-24 11:01:20

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

Hello Ted,

On 10/24/19 4:56 AM, Theodore Y. Ts'o wrote:
> Hi Ritesh,
>
> I haven't had a chance to dig into the test failures yet, but FYI....
> when I ran the auto test group in xfstests, I saw failures for
> generic/219, generic 273, and generic/476 --- these errors did not
> show up when running using a standard 4k blocksize on x86, and they
> also did not show up when running dioread_nolock using a 4k blocksize.
>

Sorry about that. Were these 3 the only tests you saw to be failing,
or there were more?


> So I tried running "generic/219 generic/273 generic/476" 30 times,
> using in a Google Compute Engine VM, using gce-xfstests, and while I
> wasn't able to get generic/219 to fail when run in isolation,
> generic/273 seems to fail quite reliably, and generic/476 about a
> third of the time.
>
> How much testing have you done with these patches?

I did test "quick" group of xfstests & ltp/fsx tests with the posted
version. And had tested a full suite of xfstests with one of my previous
version.
I guess I was comparing against 1K blocksize without dioread_nolock.
But as I think more about it, I may need to compare against vanilla
kernel with 1K blocksize even without dioread_nolock. Since there may
be some changes in blocksize < pagesize path with this patch.


Also I see these tests(generic/273 & generic/476) are not part
of quick group. Let me check more about these failing tests at my end.

Thanks for your inputs.

-ritesh


>
> Thanks,
>
> - Ted
>
> TESTRUNID: tytso-20191023144956
> KERNEL: kernel 5.4.0-rc3-xfstests-00005-g39b811602906 #1244 SMP Wed Oct 23 11:30:25 EDT 2019 x86_64
> CMDLINE: --update-files -C 30 -c dioread_nolock_1k generic/219 generic/273 generic/476
> CPUS: 2
> MEM: 7680
>
> ext4/dioread_nolock_1k: 90 tests, 42 failures, 10434 seconds
> Failures: generic/273 generic/273 generic/273 generic/273
> generic/476 generic/273 generic/476 generic/273 generic/273
> generic/273 generic/476 generic/273 generic/476 generic/273
> generic/476 generic/273 generic/476 generic/273 generic/273
> generic/273 generic/273 generic/273 generic/476 generic/273
> generic/273 generic/273 generic/273 generic/476 generic/273
> generic/476 generic/273 generic/476 generic/273 generic/273
> generic/273 generic/273 generic/273 generic/273 generic/273
> generic/476 generic/273 generic/476
> Totals: 90 tests, 0 skipped, 42 failures, 0 errors, 10434s
>

2019-10-25 20:16:55

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 1/5] ext4: keep uniform naming convention for io & io_end variables

On Wed 16-10-19 13:07:07, Ritesh Harjani wrote:
> Let's keep uniform naming convention for ext4_submit_io (io)
> & ext4_end_io_t (io_end) structures, to avoid any confusion.
> No functionality change in this patch.
>
> Signed-off-by: Ritesh Harjani <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/page-io.c | 55 ++++++++++++++++++++++++-----------------------
> 1 file changed, 28 insertions(+), 27 deletions(-)
>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 12ceadef32c5..d9b96fc976a3 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -136,19 +136,19 @@ static void ext4_release_io_end(ext4_io_end_t *io_end)
> * cannot get to ext4_ext_truncate() before all IOs overlapping that range are
> * completed (happens from ext4_free_ioend()).
> */
> -static int ext4_end_io(ext4_io_end_t *io)
> +static int ext4_end_io_end(ext4_io_end_t *io_end)
> {
> - struct inode *inode = io->inode;
> - loff_t offset = io->offset;
> - ssize_t size = io->size;
> - handle_t *handle = io->handle;
> + struct inode *inode = io_end->inode;
> + loff_t offset = io_end->offset;
> + ssize_t size = io_end->size;
> + handle_t *handle = io_end->handle;
> int ret = 0;
>
> - ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
> + ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %lu,list->next 0x%p,"
> "list->prev 0x%p\n",
> - io, inode->i_ino, io->list.next, io->list.prev);
> + io_end, inode->i_ino, io_end->list.next, io_end->list.prev);
>
> - io->handle = NULL; /* Following call will use up the handle */
> + io_end->handle = NULL; /* Following call will use up the handle */
> ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
> if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
> ext4_msg(inode->i_sb, KERN_EMERG,
> @@ -157,8 +157,8 @@ static int ext4_end_io(ext4_io_end_t *io)
> "(inode %lu, offset %llu, size %zd, error %d)",
> inode->i_ino, offset, size, ret);
> }
> - ext4_clear_io_unwritten_flag(io);
> - ext4_release_io_end(io);
> + ext4_clear_io_unwritten_flag(io_end);
> + ext4_release_io_end(io_end);
> return ret;
> }
>
> @@ -166,21 +166,21 @@ static void dump_completed_IO(struct inode *inode, struct list_head *head)
> {
> #ifdef EXT4FS_DEBUG
> struct list_head *cur, *before, *after;
> - ext4_io_end_t *io, *io0, *io1;
> + ext4_io_end_t *io_end, *io_end0, *io_end1;
>
> if (list_empty(head))
> return;
>
> ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
> - list_for_each_entry(io, head, list) {
> - cur = &io->list;
> + list_for_each_entry(io_end, head, list) {
> + cur = &io_end->list;
> before = cur->prev;
> - io0 = container_of(before, ext4_io_end_t, list);
> + io_end0 = container_of(before, ext4_io_end_t, list);
> after = cur->next;
> - io1 = container_of(after, ext4_io_end_t, list);
> + io_end1 = container_of(after, ext4_io_end_t, list);
>
> ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
> - io, inode->i_ino, io0, io1);
> + io_end, inode->i_ino, io_end0, io_end1);
> }
> #endif
> }
> @@ -207,7 +207,7 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
> static int ext4_do_flush_completed_IO(struct inode *inode,
> struct list_head *head)
> {
> - ext4_io_end_t *io;
> + ext4_io_end_t *io_end;
> struct list_head unwritten;
> unsigned long flags;
> struct ext4_inode_info *ei = EXT4_I(inode);
> @@ -219,11 +219,11 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
> spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>
> while (!list_empty(&unwritten)) {
> - io = list_entry(unwritten.next, ext4_io_end_t, list);
> - BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> - list_del_init(&io->list);
> + io_end = list_entry(unwritten.next, ext4_io_end_t, list);
> + BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> + list_del_init(&io_end->list);
>
> - err = ext4_end_io(io);
> + err = ext4_end_io_end(io_end);
> if (unlikely(!ret && err))
> ret = err;
> }
> @@ -242,13 +242,14 @@ void ext4_end_io_rsv_work(struct work_struct *work)
>
> ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
> {
> - ext4_io_end_t *io = kmem_cache_zalloc(io_end_cachep, flags);
> - if (io) {
> - io->inode = inode;
> - INIT_LIST_HEAD(&io->list);
> - atomic_set(&io->count, 1);
> + ext4_io_end_t *io_end = kmem_cache_zalloc(io_end_cachep, flags);
> +
> + if (io_end) {
> + io_end->inode = inode;
> + INIT_LIST_HEAD(&io_end->list);
> + atomic_set(&io_end->count, 1);
> }
> - return io;
> + return io_end;
> }
>
> void ext4_put_io_end_defer(ext4_io_end_t *io_end)
> --
> 2.21.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-29 07:31:17

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

Hello Ted,

I tried reproducing issue generic/273 & generic/476 on vanilla 5.4-rc4.
I could see that both of these could be very well reproduced with same
symptoms on vanilla 5.4-rc4 as well with 1K blocksize on x86
(i.e. without this patch series).

Although when I tried reproducing generic/273 & generic/476 on ppc64le
(64K pagesize & 4K blocksize) both with & without these patches,
I could not reproduce this issue on this platform.

As of now, I haven't gone deep into analyzing failure for generic/476,
but generic/273 seems to be failing with 1K blocksize because of
"No space left in device". This is happening, since the free inode count
is exhausted (mostly only with 1K blocksize).
I will have to look into this on whether it needs any xfstest fixing or
if there is something else going on.


So it looks like these failed tests does not seem to be because of this
patch series. But these are broken in general for at least 1K blocksize.

Also as an FYI, it seems generic/388 is also broken for blocksize <
pagesize for both platforms. So I will be planning to look into these
failures (generic/273 generic/388 generic/476) in parallel.


Do you think we can work on these failing tests separately, since it
does not look to be failing because of these patches and go ahead in
reviewing this current patch series?


-ritesh


On 10/24/19 4:56 AM, Theodore Y. Ts'o wrote:
> Hi Ritesh,
>
> I haven't had a chance to dig into the test failures yet, but FYI....
> when I ran the auto test group in xfstests, I saw failures for
> generic/219, generic 273, and generic/476 --- these errors did not
> show up when running using a standard 4k blocksize on x86, and they
> also did not show up when running dioread_nolock using a 4k blocksize.
>
> So I tried running "generic/219 generic/273 generic/476" 30 time
> using in a Google Compute Engine VM, using gce-xfstests, and while I
> wasn't able to get generic/219 to fail when run in isolation,
> generic/273 seems to fail quite reliably, and generic/476 about a
> third of the time.
>
> How much testing have you done with these patches?
>
> Thanks,
>
> - Ted
>
> TESTRUNID: tytso-20191023144956
> KERNEL: kernel 5.4.0-rc3-xfstests-00005-g39b811602906 #1244 SMP Wed Oct 23 11:30:25 EDT 2019 x86_64
> CMDLINE: --update-files -C 30 -c dioread_nolock_1k generic/219 generic/273 generic/476
> CPUS: 2
> MEM: 7680
>
> ext4/dioread_nolock_1k: 90 tests, 42 failures, 10434 seconds
> Failures: generic/273 generic/273 generic/273 generic/273
> generic/476 generic/273 generic/476 generic/273 generic/273
> generic/273 generic/476 generic/273 generic/476 generic/273
> generic/476 generic/273 generic/476 generic/273 generic/273
> generic/273 generic/273 generic/273 generic/476 generic/273
> generic/273 generic/273 generic/273 generic/476 generic/273
> generic/476 generic/273 generic/476 generic/273 generic/273
> generic/273 generic/273 generic/273 generic/273 generic/273
> generic/476 generic/273 generic/476
> Totals: 90 tests, 0 skipped, 42 failures, 0 errors, 10434s
>

2019-11-03 19:16:47

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

On Tue, Oct 29, 2019 at 12:49:24PM +0530, Ritesh Harjani wrote:
>
> So it looks like these failed tests does not seem to be because of this
> patch series. But these are broken in general for at least 1K blocksize.

Agreed, I failed to add them to the exclude list for diread_nolock_1k.
Thanks for pointing that out!

After looking through these patches, it looks good. So, I've landed
this series on the ext4 git tree.

There are some potential conflicts with Matthew's DIO using imap patch
set. I tried resolving them in the obvious way (see the tt/mb-dio
branch[1] on ext4.git), and unfortunately, there is a flaky test
failure with generic/270 --- 2 times out 30 runs of generic/270, the
file system is left inconsistent, with problems found in the block
allocation bitmap.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=tt/mb-dio

I've verified that generic/270 isn't a problem on -rc3, and it's not a
problem with just your patch series. So, it's almost certain it's
because I screwed up the merge. I applied each of Matthew's patch one
at a time, and conflict was in changes in ext4_end_io_dio, which is
dropped in Matthew's patch. It wasn't obvious though where the
dioread-nolock-1k change should be applied in Matthew's patch series.
Could you take a look? Thanks!!

> Also as an FYI, it seems generic/388 is also broken for blocksize <
> pagesize for both platforms. So I will be planning to look into these
> failures (generic/273 generic/388 generic/476) in parallel.

generic/388 is broken in a flaky fashion on all of the tests. That's
a shutdown test, and it's a known issue, having to do with how we
forcibly shut down the journalling machinery not being quite right.
Since unclean power off and/or dropped fibre channel/iSCSI case is
handled correctly, this hasn't been high on the priority list to track
down and fix.

> Do you think we can work on these failing tests separately, since it does
> not look to be failing because of these patches and go ahead in
> reviewing this current patch series?

Oh, I agree, those failures are pre-existing test failures, and so it
shouldn't and doens't block this series.

Thanks for your work on this feature!

- Ted

2019-11-04 10:17:12

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

On Sun, Nov 03, 2019 at 02:16:06PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Oct 29, 2019 at 12:49:24PM +0530, Ritesh Harjani wrote:
> >
> > So it looks like these failed tests does not seem to be because of this
> > patch series. But these are broken in general for at least 1K blocksize.
>
> Agreed, I failed to add them to the exclude list for diread_nolock_1k.
> Thanks for pointing that out!
>
> After looking through these patches, it looks good. So, I've landed
> this series on the ext4 git tree.
>
> There are some potential conflicts with Matthew's DIO using imap patch
> set. I tried resolving them in the obvious way (see the tt/mb-dio
> branch[1] on ext4.git), and unfortunately, there is a flaky test
> failure with generic/270 --- 2 times out 30 runs of generic/270, the
> file system is left inconsistent, with problems found in the block
> allocation bitmap.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=tt/mb-dio
>
> I've verified that generic/270 isn't a problem on -rc3, and it's not a
> problem with just your patch series. So, it's almost certain it's
> because I screwed up the merge. I applied each of Matthew's patch one
> at a time, and conflict was in changes in ext4_end_io_dio, which is
> dropped in Matthew's patch. It wasn't obvious though where the
> dioread-nolock-1k change should be applied in Matthew's patch series.
> Could you take a look? Thanks!!

Hang on a second.

Are we not prematurely merging this series in with master? I thought
that this is something that should've come after the iomap direct I/O
port, no? The use of io_end's within the new direct I/O implementation
are effectively redundant...

/M

2019-11-04 10:38:47

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock



On 11/4/19 3:46 PM, Matthew Bobrowski wrote:
> On Sun, Nov 03, 2019 at 02:16:06PM -0500, Theodore Y. Ts'o wrote:
>> On Tue, Oct 29, 2019 at 12:49:24PM +0530, Ritesh Harjani wrote:
>>>
>>> So it looks like these failed tests does not seem to be because of this
>>> patch series. But these are broken in general for at least 1K blocksize.
>>
>> Agreed, I failed to add them to the exclude list for diread_nolock_1k.
>> Thanks for pointing that out!
>>
>> After looking through these patches, it looks good. So, I've landed
>> this series on the ext4 git tree.
>>
>> There are some potential conflicts with Matthew's DIO using imap patch
>> set. I tried resolving them in the obvious way (see the tt/mb-dio
>> branch[1] on ext4.git), and unfortunately, there is a flaky test
>> failure with generic/270 --- 2 times out 30 runs of generic/270, the
>> file system is left inconsistent, with problems found in the block
>> allocation bitmap.
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=tt/mb-dio
>>
>> I've verified that generic/270 isn't a problem on -rc3, and it's not a
>> problem with just your patch series. So, it's almost certain it's
>> because I screwed up the merge. I applied each of Matthew's patch one
>> at a time, and conflict was in changes in ext4_end_io_dio, which is
>> dropped in Matthew's patch. It wasn't obvious though where the
>> dioread-nolock-1k change should be applied in Matthew's patch series.
>> Could you take a look? Thanks!!
>
> Hang on a second.
>
> Are we not prematurely merging this series in with master? I thought
> that this is something that should've come after the iomap direct I/O
> port, no? The use of io_end's within the new direct I/O implementation
> are effectively redundant...

It sure may be giving a merge conflict (due to io_end structure).
But this dioread_nolock series was not dependent over iomap series.

-ritesh

2019-11-04 10:46:20

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock



On 11/4/19 12:46 AM, Theodore Y. Ts'o wrote:
> On Tue, Oct 29, 2019 at 12:49:24PM +0530, Ritesh Harjani wrote:
>>
>> So it looks like these failed tests does not seem to be because of this
>> patch series. But these are broken in general for at least 1K blocksize.
>
> Agreed, I failed to add them to the exclude list for diread_nolock_1k.
> Thanks for pointing that out!
>
> After looking through these patches, it looks good. So, I've landed
> this series on the ext4 git tree.
>
> There are some potential conflicts with Matthew's DIO using imap patch
> set. I tried resolving them in the obvious way (see the tt/mb-dio
> branch[1] on ext4.git), and unfortunately, there is a flaky test
> failure with generic/270 --- 2 times out 30 runs of generic/270, the
> file system is left inconsistent, with problems found in the block
> allocation bitmap.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=tt/mb-dio
>
> I've verified that generic/270 isn't a problem on -rc3, and it's not a
> problem with just your patch series. So, it's almost certain it's
> because I screwed up the merge. I applied each of Matthew's patch one
> at a time, and conflict was in changes in ext4_end_io_dio, which is
> dropped in Matthew's patch. It wasn't obvious though where the
> dioread-nolock-1k change should be applied in Matthew's patch series.
> Could you take a look? Thanks!!

Sure. Let me take a look at this and get back.
Meanwhile, if possible could you please help with what xfstest config is
failing and the failure details, if possible. Just curious to know
about it.

-ritesh

2019-11-04 10:52:09

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

On Mon, Nov 04, 2019 at 04:07:56PM +0530, Ritesh Harjani wrote:
> On 11/4/19 3:46 PM, Matthew Bobrowski wrote:
> > On Sun, Nov 03, 2019 at 02:16:06PM -0500, Theodore Y. Ts'o wrote:
> > > On Tue, Oct 29, 2019 at 12:49:24PM +0530, Ritesh Harjani wrote:
> > > >
> > > > So it looks like these failed tests does not seem to be because of this
> > > > patch series. But these are broken in general for at least 1K blocksize.
> > >
> > > Agreed, I failed to add them to the exclude list for diread_nolock_1k.
> > > Thanks for pointing that out!
> > >
> > > After looking through these patches, it looks good. So, I've landed
> > > this series on the ext4 git tree.
> > >
> > > There are some potential conflicts with Matthew's DIO using imap patch
> > > set. I tried resolving them in the obvious way (see the tt/mb-dio
> > > branch[1] on ext4.git), and unfortunately, there is a flaky test
> > > failure with generic/270 --- 2 times out 30 runs of generic/270, the
> > > file system is left inconsistent, with problems found in the block
> > > allocation bitmap.
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=tt/mb-dio
> > >
> > > I've verified that generic/270 isn't a problem on -rc3, and it's not a
> > > problem with just your patch series. So, it's almost certain it's
> > > because I screwed up the merge. I applied each of Matthew's patch one
> > > at a time, and conflict was in changes in ext4_end_io_dio, which is
> > > dropped in Matthew's patch. It wasn't obvious though where the
> > > dioread-nolock-1k change should be applied in Matthew's patch series.
> > > Could you take a look? Thanks!!
> >
> > Hang on a second.
> >
> > Are we not prematurely merging this series in with master? I thought
> > that this is something that should've come after the iomap direct I/O
> > port, no? The use of io_end's within the new direct I/O implementation
> > are effectively redundant...
>
> It sure may be giving a merge conflict (due to io_end structure).
> But this dioread_nolock series was not dependent over iomap series.

Uh ha. Well, there's been a chunk of code injected into
ext4_end_io_dio() here and by me removing it, I'm not entirely sure
what the downstream effects will be for this specific change...

/M

2019-11-04 12:00:27

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock


On 11/4/19 4:13 PM, Ritesh Harjani wrote:
>
>
> On 11/4/19 12:46 AM, Theodore Y. Ts'o wrote:
>> On Tue, Oct 29, 2019 at 12:49:24PM +0530, Ritesh Harjani wrote:
>>>
>>> So it looks like these failed tests does not seem to be because of this
>>> patch series. But these are broken in general for at least 1K blocksize.
>>
>> Agreed, I failed to add them to the exclude list for diread_nolock_1k.
>> Thanks for pointing that out!
>>
>> After looking through these patches, it looks good.  So, I've landed
>> this series on the ext4 git tree.
>>
>> There are some potential conflicts with Matthew's DIO using imap patch
>> set.  I tried resolving them in the obvious way (see the tt/mb-dio
>> branch[1] on ext4.git), and unfortunately, there is a flaky test
>> failure with generic/270 --- 2 times out 30 runs of generic/270, the
>> file system is left inconsistent, with problems found in the block
>> allocation bitmap.
>>
>> [1]
>> https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=tt/mb-dio
>>
>>
>> I've verified that generic/270 isn't a problem on -rc3, and it's not a
>> problem with just your patch series.  So, it's almost certain it's
>> because I screwed up the merge.  I applied each of Matthew's patch one
>> at a time, and conflict was in changes in ext4_end_io_dio, which is
>> dropped in Matthew's patch.  It wasn't obvious though where the
>> dioread-nolock-1k change should be applied in Matthew's patch series.
>> Could you take a look?  Thanks!!

Looked into the above mentioned branch.
I see the patches of iomap patch series were properly
applied. It is ok to completely remove function "ext4_end_io_dio" and
directly call for "ext4_convert_unwritten_extents" from
"ext4_dio_write_end_io" (which is also done by default).

Actually io_end_vec in dioread_nolock series was used for AIO
DIO, since AIO+DIO was already using io_end structure and all
io_end usage was moved to use io_end_vec.
But since in iomap, we don't use io_end structure so it is ok
to directly call for ext4_convert_unwritten_extents with offset &
size argument.

hmm, I will try and run this generic/270 at my end to see what is going
wrong with this. It would be good if you could share the xfstest config
under which this is failing & if possible the failed logs/report.

Meanwhile, I will test generic/270 on your given branch with
dioread_nolock & 1K bs configuration.

-ritesh

2019-11-04 16:10:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

On Mon, Nov 04, 2019 at 09:49:14PM +1100, Matthew Bobrowski wrote:
> > It sure may be giving a merge conflict (due to io_end structure).
> > But this dioread_nolock series was not dependent over iomap series.
>
> Uh ha. Well, there's been a chunk of code injected into
> ext4_end_io_dio() here and by me removing it, I'm not entirely sure
> what the downstream effects will be for this specific change...

Yeah, that was probably my failure to do the merge correctly; I'm
hoping that Ritesh will be able to fix that up. If not we can throw
an "experimental" config to enable dioread_nolock on subpage
blocksizes, just to warn people that under some extreme workloads,
they might end up corrupting their allocation bitmap, which then might
lead to data loss. I suspect it would actually work fine for most
users; but out of paranoia, if we can't figure out the generic/270
failure before the merge window, we can just make dioread_nolock_1k
experimental for now.

- Ted

2019-11-06 17:25:20

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock

On Wed 16-10-19 13:07:06, Ritesh Harjani wrote:
> This patch series adds the support for blocksize < pagesize for
> dioread_nolock feature.
>
> Since in case of blocksize < pagesize, we can have multiple
> small buffers of page as unwritten extents, we need to
> maintain a vector of these unwritten extents which needs
> the conversion after the IO is complete. Thus, we maintain
> a list of tuple <offset, size> pair (io_end_vec) for this &
> traverse this list to do the unwritten to written conversion.
>
> Appreciate any reviews/comments on this patches.

I know Ted has merged the patches already so this is just informational but
I've read the patches and they look fine to me. Thanks for the work! I was
just thinking that we could actually make the vector tracking more
efficient because the io_end always looks like:

one-big-extent-to-fully-write + whatever it takes to fully write out the
last page

So your vectors could be also expressed as "extent to write" + bitmap of
blocks written in the last page. And 64-bits are enough for the bitmap for
anything ext4 supports so we could easily save allocation of ioend_vec etc.
Just a suggestion.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-11-07 11:18:39

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFC 0/5] Ext4: Add support for blocksize < pagesize for dioread_nolock



On 11/6/19 10:53 PM, Jan Kara wrote:
> On Wed 16-10-19 13:07:06, Ritesh Harjani wrote:
>> This patch series adds the support for blocksize < pagesize for
>> dioread_nolock feature.
>>
>> Since in case of blocksize < pagesize, we can have multiple
>> small buffers of page as unwritten extents, we need to
>> maintain a vector of these unwritten extents which needs
>> the conversion after the IO is complete. Thus, we maintain
>> a list of tuple <offset, size> pair (io_end_vec) for this &
>> traverse this list to do the unwritten to written conversion.
>>
>> Appreciate any reviews/comments on this patches.
>
> I know Ted has merged the patches already so this is just informational but
> I've read the patches and they look fine to me. Thanks for the work! I was

Appreciate your help too for valuable feedback & pointers at various places.


> just thinking that we could actually make the vector tracking more
> efficient because the io_end always looks like:
>
> one-big-extent-to-fully-write + whatever it takes to fully write out the
> last page
>
> So your vectors could be also expressed as "extent to write" + bitmap of
> blocks written in the last page. And 64-bits are enough for the bitmap for
> anything ext4 supports so we could easily save allocation of ioend_vec etc.
> Just a suggestion.

Yes, sounds good to me. Although slab allocations are also fast.
However I agree this should be more efficient and will also avoid the
list management and or list pointer traversal.

Sure, will work over this optimization once I get some closure on some
ongoing open items.


Thanks & appreciate your feedback.
ritesh