2021-07-06 02:34:52

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH 0/4] ext4: improve delalloc buffer write performance

Hi,

This patchset address to improve buffer write performance with delalloc.
The first patch reduce the unnecessary update i_disksize, the second two
patch refactor the inline data write procedure and also do some small
fix, the last patch do improve by remove all unnecessary journal handle
in the delalloc write procedure.

After this patch set, we could get a lot of performance improvement.
Below is the Unixbench comparison data test on my machine with 'Intel
Xeon Gold 5120' CPU and nvme SSD backend.

Test cmd:

./Run -c 56 -i 3 fstime fsbuffer fsdisk

Before this patch:

System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 422965.0 1068.1
File Copy 256 bufsize 500 maxblocks 1655.0 105077.0 634.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 1429092.0 2464.0
========
System Benchmarks Index Score (Partial Only) 1186.6

After this patch:

System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 732716.0 1850.3
File Copy 256 bufsize 500 maxblocks 1655.0 184940.0 1117.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 2427152.0 4184.7
========
System Benchmarks Index Score (Partial Only) 2053.0



Already test by fstests under below configuration, no add failure.

delalloc blocksize journal inline
1. Y 4K N N
2. Y 4K N Y
3. N 4K N N
4. N 4K N Y
5. Y 4K Y N
6. Y 4K Y Y
7. Y 1K N N
8. N 1K N N
9. Y 1K Y N

Thanks,
Yi.



Zhang Yi (4):
ext4: check and update i_disksize properly
ext4: correct the error path of ext4_write_inline_data_end()
ext4: factor out write end code of inline file
ext4: drop unnecessary journal handle in delalloc write

fs/ext4/ext4.h | 3 --
fs/ext4/inline.c | 124 +++++++++++++++++++++---------------------
fs/ext4/inode.c | 137 ++++++++++-------------------------------------
3 files changed, 93 insertions(+), 171 deletions(-)

--
2.31.1


2021-07-06 02:34:56

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH 2/4] ext4: correct the error path of ext4_write_inline_data_end()

Current error path of ext4_write_inline_data_end() is not correct.

Firstly, it should pass out the error value if ext4_get_inode_loc()
return fail, or else it could trigger infinite loop if we inject error
here. And then it's better to add inode to orphan list if it return fail
in ext4_journal_stop(), otherwise we could not restore inline xattr
entry after power failure. Finally, we need to reset the 'ret' value if
ext4_write_inline_data_end() return success in ext4_write_end() and
ext4_journalled_write_end(), otherwise we could not get the error return
value of ext4_journal_stop().

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inline.c | 15 +++++----------
fs/ext4/inode.c | 7 +++++--
2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 70cb64db33f7..28b666f25ac2 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -733,25 +733,20 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
void *kaddr;
struct ext4_iloc iloc;

- if (unlikely(copied < len)) {
- if (!PageUptodate(page)) {
- copied = 0;
- goto out;
- }
- }
+ if (unlikely(copied < len) && !PageUptodate(page))
+ return 0;

ret = ext4_get_inode_loc(inode, &iloc);
if (ret) {
ext4_std_error(inode->i_sb, ret);
- copied = 0;
- goto out;
+ return ret;
}

ext4_write_lock_xattr(inode, &no_expand);
BUG_ON(!ext4_has_inline_data(inode));

kaddr = kmap_atomic(page);
- ext4_write_inline_data(inode, &iloc, kaddr, pos, len);
+ ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
kunmap_atomic(kaddr);
SetPageUptodate(page);
/* clear page dirty so that writepages wouldn't work for us. */
@@ -760,7 +755,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
ext4_write_unlock_xattr(inode, &no_expand);
brelse(iloc.bh);
mark_inode_dirty(inode);
-out:
+
return copied;
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6f6a61f3ae5f..4efd50a844b9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1295,6 +1295,7 @@ static int ext4_write_end(struct file *file,
goto errout;
}
copied = ret;
+ ret = 0;
} else
copied = block_write_end(file, mapping, pos,
len, copied, page, fsdata);
@@ -1321,13 +1322,14 @@ static int ext4_write_end(struct file *file,
if (i_size_changed || inline_data)
ret = ext4_mark_inode_dirty(handle, inode);

+errout:
if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
* inode->i_size. So truncate them
*/
ext4_orphan_add(handle, inode);
-errout:
+
ret2 = ext4_journal_stop(handle);
if (!ret)
ret = ret2;
@@ -1410,6 +1412,7 @@ static int ext4_journalled_write_end(struct file *file,
goto errout;
}
copied = ret;
+ ret = 0;
} else if (unlikely(copied < len) && !PageUptodate(page)) {
copied = 0;
ext4_journalled_zero_new_buffers(handle, page, from, to);
@@ -1439,6 +1442,7 @@ static int ext4_journalled_write_end(struct file *file,
ret = ret2;
}

+errout:
if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
@@ -1446,7 +1450,6 @@ static int ext4_journalled_write_end(struct file *file,
*/
ext4_orphan_add(handle, inode);

-errout:
ret2 = ext4_journal_stop(handle);
if (!ret)
ret = ret2;
--
2.31.1

2021-07-06 02:36:19

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH 4/4] ext4: drop unnecessary journal handle in delalloc write

After we factor out the inline data write procedure from
ext4_da_write_end(), we don't need to start journal handle for the cases
of both buffer overwrite and append-write. If we need to update
i_disksize, mark_inode_dirty() do start handle and update inode buffer.
So we could just remove all the journal handle codes in the delalloc
write procedure.

After this patch, we could get a lot of performance improvement. Below
is the Unixbench comparison data test on my machine with 'Intel Xeon
Gold 5120' CPU and nvme SSD backend.

Test cmd:

./Run -c 56 -i 3 fstime fsbuffer fsdisk

Before this patch:

System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 422965.0 1068.1
File Copy 256 bufsize 500 maxblocks 1655.0 105077.0 634.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 1429092.0 2464.0
======
System Benchmarks Index Score (Partial Only) 1186.6

After this patch:

System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 732716.0 1850.3
File Copy 256 bufsize 500 maxblocks 1655.0 184940.0 1117.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 2427152.0 4184.7
======
System Benchmarks Index Score (Partial Only) 2053.0

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 60 +++++--------------------------------------------
1 file changed, 5 insertions(+), 55 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 650da0648eba..9c86cada9a54 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2910,19 +2910,6 @@ static int ext4_nonda_switch(struct super_block *sb)
return 0;
}

-/* We always reserve for an inode update; the superblock could be there too */
-static int ext4_da_write_credits(struct inode *inode, loff_t pos, unsigned len)
-{
- if (likely(ext4_has_feature_large_file(inode->i_sb)))
- return 1;
-
- if (pos + len <= 0x7fffffffULL)
- return 1;
-
- /* We might need to update the superblock to set LARGE_FILE */
- return 2;
-}
-
static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
@@ -2931,7 +2918,6 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
struct page *page;
pgoff_t index;
struct inode *inode = mapping->host;
- handle_t *handle;

if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
@@ -2957,41 +2943,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
return 0;
}

- /*
- * grab_cache_page_write_begin() can take a long time if the
- * system is thrashing due to memory pressure, or if the page
- * is being written back. So grab it first before we start
- * the transaction handle. This also allows us to allocate
- * the page (if needed) without using GFP_NOFS.
- */
-retry_grab:
+retry:
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
- unlock_page(page);
-
- /*
- * With delayed allocation, we don't log the i_disksize update
- * if there is delayed block allocation. But we still need
- * to journalling the i_disksize update if writes to the end
- * of file which has an already mapped buffer.
- */
-retry_journal:
- handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
- ext4_da_write_credits(inode, pos, len));
- if (IS_ERR(handle)) {
- put_page(page);
- return PTR_ERR(handle);
- }

- lock_page(page);
- if (page->mapping != mapping) {
- /* The page got truncated from under us */
- unlock_page(page);
- put_page(page);
- ext4_journal_stop(handle);
- goto retry_grab;
- }
/* In case writeback began while the page was unlocked */
wait_for_stable_page(page);

@@ -3003,20 +2959,18 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
#endif
if (ret < 0) {
unlock_page(page);
- ext4_journal_stop(handle);
+ put_page(page);
/*
* block_write_begin may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
- * i_size_read because we hold i_mutex.
+ * i_size_read because we hold inode lock.
*/
if (pos + len > inode->i_size)
ext4_truncate_failed_write(inode);

if (ret == -ENOSPC &&
ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry_journal;
-
- put_page(page);
+ goto retry;
return ret;
}

@@ -3053,8 +3007,6 @@ static int ext4_da_write_end(struct file *file,
struct page *page, void *fsdata)
{
struct inode *inode = mapping->host;
- int ret;
- handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
unsigned long start, end;
int write_mode = (int)(unsigned long)fsdata;
@@ -3086,9 +3038,7 @@ static int ext4_da_write_end(struct file *file,
ext4_da_should_update_i_disksize(page, end))
ext4_update_i_disksize(inode, new_i_size);

- copied = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
- ret = ext4_journal_stop(handle);
- return ret ? ret : copied;
+ return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
}

/*
--
2.31.1

2021-07-06 02:36:19

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH 3/4] ext4: factor out write end code of inline file

Now that the inline_data file write end procedure are falled into the
common write end functions, it is not clear. Factor them out and do
some cleanup. This patch also drop ext4_da_write_inline_data_end()
and switch to use ext4_write_inline_data_end() instead because we also
need to do the same error processing if we failed to write data into
inline entry.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 3 --
fs/ext4/inline.c | 121 +++++++++++++++++++++++++----------------------
fs/ext4/inode.c | 73 +++++++++-------------------
3 files changed, 88 insertions(+), 109 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c51e243450d..d4799e619048 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3530,9 +3530,6 @@ extern int ext4_da_write_inline_data_begin(struct address_space *mapping,
unsigned flags,
struct page **pagep,
void **fsdata);
-extern int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
- unsigned len, unsigned copied,
- struct page *page);
extern int ext4_try_add_inline_entry(handle_t *handle,
struct ext4_filename *fname,
struct inode *dir, struct inode *inode);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 28b666f25ac2..8fbf8ec05bd5 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -729,34 +729,80 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
unsigned copied, struct page *page)
{
- int ret, no_expand;
+ handle_t *handle = ext4_journal_current_handle();
+ int i_size_changed = 0;
+ int no_expand;
void *kaddr;
struct ext4_iloc iloc;
+ int ret, ret2;

if (unlikely(copied < len) && !PageUptodate(page))
- return 0;
+ copied = 0;

- ret = ext4_get_inode_loc(inode, &iloc);
- if (ret) {
- ext4_std_error(inode->i_sb, ret);
- return ret;
- }
+ if (likely(copied)) {
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret) {
+ unlock_page(page);
+ put_page(page);
+ ext4_std_error(inode->i_sb, ret);
+ goto out;
+ }
+ ext4_write_lock_xattr(inode, &no_expand);
+ BUG_ON(!ext4_has_inline_data(inode));

- ext4_write_lock_xattr(inode, &no_expand);
- BUG_ON(!ext4_has_inline_data(inode));
+ kaddr = kmap_atomic(page);
+ ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
+ kunmap_atomic(kaddr);
+ SetPageUptodate(page);
+ /* clear page dirty so that writepages wouldn't work for us. */
+ ClearPageDirty(page);

- kaddr = kmap_atomic(page);
- ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
- kunmap_atomic(kaddr);
- SetPageUptodate(page);
- /* clear page dirty so that writepages wouldn't work for us. */
- ClearPageDirty(page);
+ ext4_write_unlock_xattr(inode, &no_expand);
+ brelse(iloc.bh);
+ }

- ext4_write_unlock_xattr(inode, &no_expand);
- brelse(iloc.bh);
- mark_inode_dirty(inode);
+ /*
+ * It's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+ i_size_changed = ext4_update_inode_size(inode, pos + copied);
+ if (ext4_should_journal_data(inode)) {
+ ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+ EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
+ }
+ unlock_page(page);
+ put_page(page);

- return copied;
+ /*
+ * Don't mark the inode dirty under page lock. First, it unnecessarily
+ * makes the holding time of page lock longer. Second, it forces lock
+ * ordering of page lock and transaction start for journaling
+ * filesystems.
+ */
+ if (likely(copied) || i_size_changed)
+ mark_inode_dirty(inode);
+out:
+ /*
+ * If we have allocated more blocks and copied less. We will have
+ * blocks allocated outside inode->i_size, so truncate them.
+ */
+ if (pos + len > inode->i_size && ext4_can_truncate(inode))
+ ext4_orphan_add(handle, inode);
+
+ ret2 = ext4_journal_stop(handle);
+ if (!ret)
+ ret = ret2;
+ if (pos + len > inode->i_size) {
+ ext4_truncate_failed_write(inode);
+ /*
+ * If truncate failed early the inode might still be
+ * on the orphan list; we need to make sure the inode
+ * is removed from the orphan list in that case.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+ return ret ? ret : copied;
}

struct buffer_head *
@@ -937,43 +983,6 @@ int ext4_da_write_inline_data_begin(struct address_space *mapping,
return ret;
}

-int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
- unsigned len, unsigned copied,
- struct page *page)
-{
- int ret;
-
- ret = ext4_write_inline_data_end(inode, pos, len, copied, page);
- if (ret < 0) {
- unlock_page(page);
- put_page(page);
- return ret;
- }
- copied = ret;
-
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- *
- * But it's important to update i_size while still holding page lock:
- * page writeout could otherwise come in and zero beyond i_size.
- */
- if (pos+copied > inode->i_size)
- i_size_write(inode, pos+copied);
- unlock_page(page);
- put_page(page);
-
- /*
- * Don't mark the inode dirty under page lock. First, it unnecessarily
- * makes the holding time of page lock longer. Second, it forces lock
- * ordering of page lock and transaction start for journaling
- * filesystems.
- */
- mark_inode_dirty(inode);
-
- return copied;
-}
-
#ifdef INLINE_DIR_DEBUG
void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
void *inline_start, int inline_size)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4efd50a844b9..650da0648eba 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1282,23 +1282,14 @@ static int ext4_write_end(struct file *file,
loff_t old_size = inode->i_size;
int ret = 0, ret2;
int i_size_changed = 0;
- int inline_data = ext4_has_inline_data(inode);
bool verity = ext4_verity_in_progress(inode);

trace_ext4_write_end(inode, pos, len, copied);
- if (inline_data) {
- ret = ext4_write_inline_data_end(inode, pos, len,
- copied, page);
- if (ret < 0) {
- unlock_page(page);
- put_page(page);
- goto errout;
- }
- copied = ret;
- ret = 0;
- } else
- copied = block_write_end(file, mapping, pos,
- len, copied, page, fsdata);
+
+ if (ext4_has_inline_data(inode))
+ return ext4_write_inline_data_end(inode, pos, len, copied, page);
+
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
/*
* it's important to update i_size while still holding page lock:
* page writeout could otherwise come in and zero beyond i_size.
@@ -1319,10 +1310,9 @@ static int ext4_write_end(struct file *file,
* ordering of page lock and transaction start for journaling
* filesystems.
*/
- if (i_size_changed || inline_data)
+ if (i_size_changed)
ret = ext4_mark_inode_dirty(handle, inode);

-errout:
if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
@@ -1394,7 +1384,6 @@ static int ext4_journalled_write_end(struct file *file,
int partial = 0;
unsigned from, to;
int size_changed = 0;
- int inline_data = ext4_has_inline_data(inode);
bool verity = ext4_verity_in_progress(inode);

trace_ext4_journalled_write_end(inode, pos, len, copied);
@@ -1403,17 +1392,10 @@ static int ext4_journalled_write_end(struct file *file,

BUG_ON(!ext4_handle_valid(handle));

- if (inline_data) {
- ret = ext4_write_inline_data_end(inode, pos, len,
- copied, page);
- if (ret < 0) {
- unlock_page(page);
- put_page(page);
- goto errout;
- }
- copied = ret;
- ret = 0;
- } else if (unlikely(copied < len) && !PageUptodate(page)) {
+ if (ext4_has_inline_data(inode))
+ return ext4_write_inline_data_end(inode, pos, len, copied, page);
+
+ if (unlikely(copied < len) && !PageUptodate(page)) {
copied = 0;
ext4_journalled_zero_new_buffers(handle, page, from, to);
} else {
@@ -1436,13 +1418,12 @@ static int ext4_journalled_write_end(struct file *file,
if (old_size < pos && !verity)
pagecache_isize_extended(inode, old_size, pos);

- if (size_changed || inline_data) {
+ if (size_changed) {
ret2 = ext4_mark_inode_dirty(handle, inode);
if (!ret)
ret = ret2;
}

-errout:
if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
@@ -3072,7 +3053,7 @@ static int ext4_da_write_end(struct file *file,
struct page *page, void *fsdata)
{
struct inode *inode = mapping->host;
- int ret = 0, ret2;
+ int ret;
handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
unsigned long start, end;
@@ -3083,6 +3064,12 @@ static int ext4_da_write_end(struct file *file,
len, copied, page, fsdata);

trace_ext4_da_write_end(inode, pos, len, copied);
+
+ if (write_mode != CONVERT_INLINE_DATA &&
+ ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
+ ext4_has_inline_data(inode))
+ return ext4_write_inline_data_end(inode, pos, len, copied, page);
+
start = pos & (PAGE_SIZE - 1);
end = start + copied - 1;

@@ -3095,26 +3082,12 @@ static int ext4_da_write_end(struct file *file,
* update i_disksize from i_size for delay allocated blocks properly.
*/
new_i_size = pos + copied;
- if (copied && new_i_size > inode->i_size) {
- if (ext4_has_inline_data(inode) ||
- ext4_da_should_update_i_disksize(page, end))
- ext4_update_i_disksize(inode, new_i_size);
- }
-
- if (write_mode != CONVERT_INLINE_DATA &&
- ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
- ext4_has_inline_data(inode))
- ret = ext4_da_write_inline_data_end(inode, pos, len, copied,
- page);
- else
- ret = generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
-
- copied = ret;
- ret2 = ext4_journal_stop(handle);
- if (unlikely(ret2 && !ret))
- ret = ret2;
+ if (copied && new_i_size > inode->i_size &&
+ ext4_da_should_update_i_disksize(page, end))
+ ext4_update_i_disksize(inode, new_i_size);

+ copied = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+ ret = ext4_journal_stop(handle);
return ret ? ret : copied;
}

--
2.31.1

2021-07-06 12:35:51

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] ext4: correct the error path of ext4_write_inline_data_end()

On Tue 06-07-21 10:42:08, Zhang Yi wrote:
> Current error path of ext4_write_inline_data_end() is not correct.
>
> Firstly, it should pass out the error value if ext4_get_inode_loc()
> return fail, or else it could trigger infinite loop if we inject error
> here.
> And then it's better to add inode to orphan list if it return fail
> in ext4_journal_stop(), otherwise we could not restore inline xattr
> entry after power failure.
> Finally, we need to reset the 'ret' value if
> ext4_write_inline_data_end() return success in ext4_write_end() and
> ext4_journalled_write_end(), otherwise we could not get the error return
> value of ext4_journal_stop().
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inline.c | 15 +++++----------
> fs/ext4/inode.c | 7 +++++--
> 2 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 70cb64db33f7..28b666f25ac2 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -733,25 +733,20 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
> void *kaddr;
> struct ext4_iloc iloc;
>
> - if (unlikely(copied < len)) {
> - if (!PageUptodate(page)) {
> - copied = 0;
> - goto out;
> - }
> - }
> + if (unlikely(copied < len) && !PageUptodate(page))
> + return 0;
>
> ret = ext4_get_inode_loc(inode, &iloc);
> if (ret) {
> ext4_std_error(inode->i_sb, ret);
> - copied = 0;
> - goto out;
> + return ret;
> }
>
> ext4_write_lock_xattr(inode, &no_expand);
> BUG_ON(!ext4_has_inline_data(inode));
>
> kaddr = kmap_atomic(page);
> - ext4_write_inline_data(inode, &iloc, kaddr, pos, len);
> + ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
> kunmap_atomic(kaddr);
> SetPageUptodate(page);
> /* clear page dirty so that writepages wouldn't work for us. */
> @@ -760,7 +755,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
> ext4_write_unlock_xattr(inode, &no_expand);
> brelse(iloc.bh);
> mark_inode_dirty(inode);
> -out:
> +
> return copied;
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6f6a61f3ae5f..4efd50a844b9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1295,6 +1295,7 @@ static int ext4_write_end(struct file *file,
> goto errout;
> }
> copied = ret;
> + ret = 0;
> } else
> copied = block_write_end(file, mapping, pos,
> len, copied, page, fsdata);
> @@ -1321,13 +1322,14 @@ static int ext4_write_end(struct file *file,
> if (i_size_changed || inline_data)
> ret = ext4_mark_inode_dirty(handle, inode);
>
> +errout:
> if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
> /* if we have allocated more blocks and copied
> * less. We will have blocks allocated outside
> * inode->i_size. So truncate them
> */
> ext4_orphan_add(handle, inode);
> -errout:
> +
> ret2 = ext4_journal_stop(handle);
> if (!ret)
> ret = ret2;
> @@ -1410,6 +1412,7 @@ static int ext4_journalled_write_end(struct file *file,
> goto errout;
> }
> copied = ret;
> + ret = 0;
> } else if (unlikely(copied < len) && !PageUptodate(page)) {
> copied = 0;
> ext4_journalled_zero_new_buffers(handle, page, from, to);
> @@ -1439,6 +1442,7 @@ static int ext4_journalled_write_end(struct file *file,
> ret = ret2;
> }
>
> +errout:
> if (pos + len > inode->i_size && !verity && ext4_can_truncate(inode))
> /* if we have allocated more blocks and copied
> * less. We will have blocks allocated outside
> @@ -1446,7 +1450,6 @@ static int ext4_journalled_write_end(struct file *file,
> */
> ext4_orphan_add(handle, inode);
>
> -errout:
> ret2 = ext4_journal_stop(handle);
> if (!ret)
> ret = ret2;
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-07-07 16:50:46

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] ext4: factor out write end code of inline file

On Tue 06-07-21 10:42:09, Zhang Yi wrote:
> Now that the inline_data file write end procedure are falled into the
> common write end functions, it is not clear. Factor them out and do
> some cleanup. This patch also drop ext4_da_write_inline_data_end()
> and switch to use ext4_write_inline_data_end() instead because we also
> need to do the same error processing if we failed to write data into
> inline entry.
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good. Just two nits below.

> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 28b666f25ac2..8fbf8ec05bd5 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -729,34 +729,80 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
> int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
> unsigned copied, struct page *page)
> {
> - int ret, no_expand;
> + handle_t *handle = ext4_journal_current_handle();
> + int i_size_changed = 0;
> + int no_expand;
> void *kaddr;
> struct ext4_iloc iloc;
> + int ret, ret2;
>
> if (unlikely(copied < len) && !PageUptodate(page))
> - return 0;
> + copied = 0;
>
> - ret = ext4_get_inode_loc(inode, &iloc);
> - if (ret) {
> - ext4_std_error(inode->i_sb, ret);
> - return ret;
> - }
> + if (likely(copied)) {
> + ret = ext4_get_inode_loc(inode, &iloc);
> + if (ret) {
> + unlock_page(page);
> + put_page(page);
> + ext4_std_error(inode->i_sb, ret);
> + goto out;
> + }
> + ext4_write_lock_xattr(inode, &no_expand);
> + BUG_ON(!ext4_has_inline_data(inode));
>
> - ext4_write_lock_xattr(inode, &no_expand);
> - BUG_ON(!ext4_has_inline_data(inode));
> + kaddr = kmap_atomic(page);
> + ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
> + kunmap_atomic(kaddr);
> + SetPageUptodate(page);
> + /* clear page dirty so that writepages wouldn't work for us. */
> + ClearPageDirty(page);
>
> - kaddr = kmap_atomic(page);
> - ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
> - kunmap_atomic(kaddr);
> - SetPageUptodate(page);
> - /* clear page dirty so that writepages wouldn't work for us. */
> - ClearPageDirty(page);
> + ext4_write_unlock_xattr(inode, &no_expand);
> + brelse(iloc.bh);
> + }
>
> - ext4_write_unlock_xattr(inode, &no_expand);
> - brelse(iloc.bh);
> - mark_inode_dirty(inode);
> + /*
> + * It's important to update i_size while still holding page lock:
> + * page writeout could otherwise come in and zero beyond i_size.
> + */
> + i_size_changed = ext4_update_inode_size(inode, pos + copied);
> + if (ext4_should_journal_data(inode)) {
> + ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> + EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> + }

I think this hunk should also go into the "if (copied)" block. There's no
point in changing i_size or i_disksize when nothing was written.

> + unlock_page(page);
> + put_page(page);
>
> - return copied;
> + /*
> + * Don't mark the inode dirty under page lock. First, it unnecessarily
> + * makes the holding time of page lock longer. Second, it forces lock
> + * ordering of page lock and transaction start for journaling
> + * filesystems.
> + */
> + if (likely(copied) || i_size_changed)
> + mark_inode_dirty(inode);

And then it is obvious here that (copied == 0) => !i_size_changed so we can
just remove the i_size_changed term from the condition.

> +out:
> + /*
> + * If we have allocated more blocks and copied less. We will have
> + * blocks allocated outside inode->i_size, so truncate them.
> + */
> + if (pos + len > inode->i_size && ext4_can_truncate(inode))
> + ext4_orphan_add(handle, inode);
> +
> + ret2 = ext4_journal_stop(handle);
> + if (!ret)
> + ret = ret2;
> + if (pos + len > inode->i_size) {
> + ext4_truncate_failed_write(inode);
> + /*
> + * If truncate failed early the inode might still be
> + * on the orphan list; we need to make sure the inode
> + * is removed from the orphan list in that case.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> + return ret ? ret : copied;
> }

Honza
---
Jan Kara <[email protected]>
SUSE Labs, CR

2021-07-07 17:00:07

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] ext4: drop unnecessary journal handle in delalloc write

On Tue 06-07-21 10:42:10, Zhang Yi wrote:
> After we factor out the inline data write procedure from
> ext4_da_write_end(), we don't need to start journal handle for the cases
> of both buffer overwrite and append-write. If we need to update
> i_disksize, mark_inode_dirty() do start handle and update inode buffer.
> So we could just remove all the journal handle codes in the delalloc
> write procedure.
>
> After this patch, we could get a lot of performance improvement. Below
> is the Unixbench comparison data test on my machine with 'Intel Xeon
> Gold 5120' CPU and nvme SSD backend.
>
> Test cmd:
>
> ./Run -c 56 -i 3 fstime fsbuffer fsdisk
>
> Before this patch:
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 422965.0 1068.1
> File Copy 256 bufsize 500 maxblocks 1655.0 105077.0 634.9
> File Copy 4096 bufsize 8000 maxblocks 5800.0 1429092.0 2464.0
> ======
> System Benchmarks Index Score (Partial Only) 1186.6
>
> After this patch:
>
> System Benchmarks Partial Index BASELINE RESULT INDEX
> File Copy 1024 bufsize 2000 maxblocks 3960.0 732716.0 1850.3
> File Copy 256 bufsize 500 maxblocks 1655.0 184940.0 1117.5
> File Copy 4096 bufsize 8000 maxblocks 5800.0 2427152.0 4184.7
> ======
> System Benchmarks Index Score (Partial Only) 2053.0
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good and nice speedup! Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 60 +++++--------------------------------------------
> 1 file changed, 5 insertions(+), 55 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 650da0648eba..9c86cada9a54 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2910,19 +2910,6 @@ static int ext4_nonda_switch(struct super_block *sb)
> return 0;
> }
>
> -/* We always reserve for an inode update; the superblock could be there too */
> -static int ext4_da_write_credits(struct inode *inode, loff_t pos, unsigned len)
> -{
> - if (likely(ext4_has_feature_large_file(inode->i_sb)))
> - return 1;
> -
> - if (pos + len <= 0x7fffffffULL)
> - return 1;
> -
> - /* We might need to update the superblock to set LARGE_FILE */
> - return 2;
> -}
> -
> static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> @@ -2931,7 +2918,6 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> struct page *page;
> pgoff_t index;
> struct inode *inode = mapping->host;
> - handle_t *handle;
>
> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
> @@ -2957,41 +2943,11 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> return 0;
> }
>
> - /*
> - * grab_cache_page_write_begin() can take a long time if the
> - * system is thrashing due to memory pressure, or if the page
> - * is being written back. So grab it first before we start
> - * the transaction handle. This also allows us to allocate
> - * the page (if needed) without using GFP_NOFS.
> - */
> -retry_grab:
> +retry:
> page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> - unlock_page(page);
> -
> - /*
> - * With delayed allocation, we don't log the i_disksize update
> - * if there is delayed block allocation. But we still need
> - * to journalling the i_disksize update if writes to the end
> - * of file which has an already mapped buffer.
> - */
> -retry_journal:
> - handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
> - ext4_da_write_credits(inode, pos, len));
> - if (IS_ERR(handle)) {
> - put_page(page);
> - return PTR_ERR(handle);
> - }
>
> - lock_page(page);
> - if (page->mapping != mapping) {
> - /* The page got truncated from under us */
> - unlock_page(page);
> - put_page(page);
> - ext4_journal_stop(handle);
> - goto retry_grab;
> - }
> /* In case writeback began while the page was unlocked */
> wait_for_stable_page(page);
>
> @@ -3003,20 +2959,18 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> #endif
> if (ret < 0) {
> unlock_page(page);
> - ext4_journal_stop(handle);
> + put_page(page);
> /*
> * block_write_begin may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> - * i_size_read because we hold i_mutex.
> + * i_size_read because we hold inode lock.
> */
> if (pos + len > inode->i_size)
> ext4_truncate_failed_write(inode);
>
> if (ret == -ENOSPC &&
> ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry_journal;
> -
> - put_page(page);
> + goto retry;
> return ret;
> }
>
> @@ -3053,8 +3007,6 @@ static int ext4_da_write_end(struct file *file,
> struct page *page, void *fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret;
> - handle_t *handle = ext4_journal_current_handle();
> loff_t new_i_size;
> unsigned long start, end;
> int write_mode = (int)(unsigned long)fsdata;
> @@ -3086,9 +3038,7 @@ static int ext4_da_write_end(struct file *file,
> ext4_da_should_update_i_disksize(page, end))
> ext4_update_i_disksize(inode, new_i_size);
>
> - copied = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> - ret = ext4_journal_stop(handle);
> - return ret ? ret : copied;
> + return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> }
>
> /*
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-07-10 08:15:08

by Zhang Yi

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] ext4: factor out write end code of inline file

On 2021/7/8 0:49, Jan Kara wrote:
> On Tue 06-07-21 10:42:09, Zhang Yi wrote:
>> Now that the inline_data file write end procedure are falled into the
>> common write end functions, it is not clear. Factor them out and do
>> some cleanup. This patch also drop ext4_da_write_inline_data_end()
>> and switch to use ext4_write_inline_data_end() instead because we also
>> need to do the same error processing if we failed to write data into
>> inline entry.
>>
>> Signed-off-by: Zhang Yi <[email protected]>
>
> Looks good. Just two nits below.
>
>> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
>> index 28b666f25ac2..8fbf8ec05bd5 100644
>> --- a/fs/ext4/inline.c
>> +++ b/fs/ext4/inline.c
>> @@ -729,34 +729,80 @@ int ext4_try_to_write_inline_data(struct address_space *mapping,
>> int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
>> unsigned copied, struct page *page)
>> {
>> - int ret, no_expand;
>> + handle_t *handle = ext4_journal_current_handle();
>> + int i_size_changed = 0;
>> + int no_expand;
>> void *kaddr;
>> struct ext4_iloc iloc;
>> + int ret, ret2;
>>
>> if (unlikely(copied < len) && !PageUptodate(page))
>> - return 0;
>> + copied = 0;
>>
>> - ret = ext4_get_inode_loc(inode, &iloc);
>> - if (ret) {
>> - ext4_std_error(inode->i_sb, ret);
>> - return ret;
>> - }
>> + if (likely(copied)) {
>> + ret = ext4_get_inode_loc(inode, &iloc);
>> + if (ret) {
>> + unlock_page(page);
>> + put_page(page);
>> + ext4_std_error(inode->i_sb, ret);
>> + goto out;
>> + }
>> + ext4_write_lock_xattr(inode, &no_expand);
>> + BUG_ON(!ext4_has_inline_data(inode));
>>
>> - ext4_write_lock_xattr(inode, &no_expand);
>> - BUG_ON(!ext4_has_inline_data(inode));
>> + kaddr = kmap_atomic(page);
>> + ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
>> + kunmap_atomic(kaddr);
>> + SetPageUptodate(page);
>> + /* clear page dirty so that writepages wouldn't work for us. */
>> + ClearPageDirty(page);
>>
>> - kaddr = kmap_atomic(page);
>> - ext4_write_inline_data(inode, &iloc, kaddr, pos, copied);
>> - kunmap_atomic(kaddr);
>> - SetPageUptodate(page);
>> - /* clear page dirty so that writepages wouldn't work for us. */
>> - ClearPageDirty(page);
>> + ext4_write_unlock_xattr(inode, &no_expand);
>> + brelse(iloc.bh);
>> + }
>>
>> - ext4_write_unlock_xattr(inode, &no_expand);
>> - brelse(iloc.bh);
>> - mark_inode_dirty(inode);
>> + /*
>> + * It's important to update i_size while still holding page lock:
>> + * page writeout could otherwise come in and zero beyond i_size.
>> + */
>> + i_size_changed = ext4_update_inode_size(inode, pos + copied);
>> + if (ext4_should_journal_data(inode)) {
>> + ext4_set_inode_state(inode, EXT4_STATE_JDATA);
>> + EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>> + }
>
> I think this hunk should also go into the "if (copied)" block. There's no
> point in changing i_size or i_disksize when nothing was written.
>

Yeah, I will put ext4_update_inode_size() into the "if (copied)" block.
Thinking about it again, IIUC, the hunk in "if (ext4_should_journal_data(inode))"
also seems useless for inline inode, and could be dropped.

Thanks,
Yi.