v1->v2:
-Filter non-expired in requeue_inode in patch "fs/writeback: avoid to
writeback non-expired inode in kupdate writeback"
-Wrap the comment at 80 columns in patch "fs/writeback: only calculate
dirtied_before when b_io is empty"
-Abandon patch "fs/writeback: remove unneeded check in
writeback_single_inode"
-Collect RVB from Jan and Tim
Kemeng Shi (6):
fs/writeback: avoid to writeback non-expired inode in kupdate
writeback
fs/writeback: bail out if there is no more inodes for IO and queued
once
fs/writeback: remove unused parameter wb of finish_writeback_work
fs/writeback: only calculate dirtied_before when b_io is empty
fs/writeback: correct comment of __wakeup_flusher_threads_bdi
fs/writeback: remove unnecessary return in writeback_inodes_sb
fs/fs-writeback.c | 57 +++++++++++++++++++++++++++--------------------
1 file changed, 33 insertions(+), 24 deletions(-)
--
2.30.0
In kupdate writeback, only expired inode (have been dirty for longer than
dirty_expire_interval) is supposed to be written back. However, kupdate
writeback will writeback non-expired inode left in b_io or b_more_io from
last wb_writeback. As a result, writeback will keep being triggered
unexpected when we keep dirtying pages even dirty memory is under
threshold and inode is not expired. To be more specific:
Assume dirty background threshold is > 1G and dirty_expire_centisecs is
> 60s. When we running fio -size=1G -invalidate=0 -ioengine=libaio
--time_based -runtime=60... (keep dirtying), the writeback will keep
being triggered as following:
wb_workfn
wb_do_writeback
wb_check_background_flush
/*
* Wb dirty background threshold starts at 0 if device was idle and
* grows up when bandwidth of wb is updated. So a background
* writeback is triggered.
*/
wb_over_bg_thresh
/*
* Dirtied inode will be written back and added to b_more_io list
* after slice used up (because we keep dirtying the inode).
*/
wb_writeback
Writeback is triggered per dirty_writeback_centisecs as following:
wb_workfn
wb_do_writeback
wb_check_old_data_flush
/*
* Write back inode left in b_io and b_more_io from last wb_writeback
* even the inode is non-expired and it will be added to b_more_io
* again as slice will be used up (because we keep dirtying the
* inode)
*/
wb_writeback
Fix this by moving non-expired inode to dirty list instead of more io
list for kupdate writeback in requeue_inode.
Test as following:
/* make it more easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
/* create a idle device */
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
/* run buffer write with fio */
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0
Fio result before fix (run three tests):
1360MB/s
1329MB/s
1455MB/s
Fio result after fix (run three tests):
1737MB/s
1729MB/s
1789MB/s
Writeback for non-expired inode is gone as expeted. Observe this with trace
writeback_start and writeback_written as following:
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_start/enab
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_written/enable
cat /sys/kernel/tracing/trace_pipe
Signed-off-by: Kemeng Shi <[email protected]>
---
fs/fs-writeback.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5ab1aaf805f7..4e6166e07eaf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1561,7 +1561,8 @@ static void inode_sleep_on_writeback(struct inode *inode)
* thread's back can have unexpected consequences.
*/
static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
- struct writeback_control *wbc)
+ struct writeback_control *wbc,
+ unsigned long dirtied_before)
{
if (inode->i_state & I_FREEING)
return;
@@ -1594,7 +1595,8 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything.
*/
- if (wbc->nr_to_write <= 0) {
+ if (wbc->nr_to_write <= 0 &&
+ !inode_dirtied_after(inode, dirtied_before)) {
/* Slice used up. Queue for next turn. */
requeue_io(inode, wb);
} else {
@@ -1862,6 +1864,11 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long total_wrote = 0; /* count both pages and inodes */
+ unsigned long dirtied_before = jiffies;
+
+ if (work->for_kupdate)
+ dirtied_before = jiffies -
+ msecs_to_jiffies(dirty_expire_interval * 10);
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1967,7 +1974,7 @@ static long writeback_sb_inodes(struct super_block *sb,
spin_lock(&inode->i_lock);
if (!(inode->i_state & I_DIRTY_ALL))
total_wrote++;
- requeue_inode(inode, tmp_wb, &wbc);
+ requeue_inode(inode, tmp_wb, &wbc, dirtied_before);
inode_sync_complete(inode);
spin_unlock(&inode->i_lock);
--
2.30.0
On Wed 28-02-24 17:19:53, Kemeng Shi wrote:
> In kupdate writeback, only expired inode (have been dirty for longer than
> dirty_expire_interval) is supposed to be written back. However, kupdate
> writeback will writeback non-expired inode left in b_io or b_more_io from
> last wb_writeback. As a result, writeback will keep being triggered
> unexpected when we keep dirtying pages even dirty memory is under
> threshold and inode is not expired. To be more specific:
> Assume dirty background threshold is > 1G and dirty_expire_centisecs is
> > 60s. When we running fio -size=1G -invalidate=0 -ioengine=libaio
> --time_based -runtime=60... (keep dirtying), the writeback will keep
> being triggered as following:
> wb_workfn
> wb_do_writeback
> wb_check_background_flush
> /*
> * Wb dirty background threshold starts at 0 if device was idle and
> * grows up when bandwidth of wb is updated. So a background
> * writeback is triggered.
> */
> wb_over_bg_thresh
> /*
> * Dirtied inode will be written back and added to b_more_io list
> * after slice used up (because we keep dirtying the inode).
> */
> wb_writeback
>
> Writeback is triggered per dirty_writeback_centisecs as following:
> wb_workfn
> wb_do_writeback
> wb_check_old_data_flush
> /*
> * Write back inode left in b_io and b_more_io from last wb_writeback
> * even the inode is non-expired and it will be added to b_more_io
> * again as slice will be used up (because we keep dirtying the
> * inode)
> */
> wb_writeback
>
> Fix this by moving non-expired inode to dirty list instead of more io
> list for kupdate writeback in requeue_inode.
>
> Test as following:
> /* make it more easier to observe the issue */
> echo 300000 > /proc/sys/vm/dirty_expire_centisecs
> echo 100 > /proc/sys/vm/dirty_writeback_centisecs
> /* create a idle device */
> mkfs.ext4 -F /dev/vdb
> mount /dev/vdb /bdi1/
> /* run buffer write with fio */
> fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K \
> -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0
>
> Fio result before fix (run three tests):
> 1360MB/s
> 1329MB/s
> 1455MB/s
>
> Fio result after fix (run three tests):
> 1737MB/s
> 1729MB/s
> 1789MB/s
>
> Writeback for non-expired inode is gone as expeted. Observe this with trace
> writeback_start and writeback_written as following:
> echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_start/enab
> echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_written/enable
> cat /sys/kernel/tracing/trace_pipe
>
> Signed-off-by: Kemeng Shi <[email protected]>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/fs-writeback.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5ab1aaf805f7..4e6166e07eaf 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1561,7 +1561,8 @@ static void inode_sleep_on_writeback(struct inode *inode)
> * thread's back can have unexpected consequences.
> */
> static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
> - struct writeback_control *wbc)
> + struct writeback_control *wbc,
> + unsigned long dirtied_before)
> {
> if (inode->i_state & I_FREEING)
> return;
> @@ -1594,7 +1595,8 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
> * We didn't write back all the pages. nfs_writepages()
> * sometimes bales out without doing anything.
> */
> - if (wbc->nr_to_write <= 0) {
> + if (wbc->nr_to_write <= 0 &&
> + !inode_dirtied_after(inode, dirtied_before)) {
> /* Slice used up. Queue for next turn. */
> requeue_io(inode, wb);
> } else {
> @@ -1862,6 +1864,11 @@ static long writeback_sb_inodes(struct super_block *sb,
> unsigned long start_time = jiffies;
> long write_chunk;
> long total_wrote = 0; /* count both pages and inodes */
> + unsigned long dirtied_before = jiffies;
> +
> + if (work->for_kupdate)
> + dirtied_before = jiffies -
> + msecs_to_jiffies(dirty_expire_interval * 10);
>
> while (!list_empty(&wb->b_io)) {
> struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -1967,7 +1974,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> spin_lock(&inode->i_lock);
> if (!(inode->i_state & I_DIRTY_ALL))
> total_wrote++;
> - requeue_inode(inode, tmp_wb, &wbc);
> + requeue_inode(inode, tmp_wb, &wbc, dirtied_before);
> inode_sync_complete(inode);
> spin_unlock(&inode->i_lock);
>
> --
> 2.30.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 28-02-24 17:19:52, Kemeng Shi wrote:
> v1->v2:
> -Filter non-expired in requeue_inode in patch "fs/writeback: avoid to
> writeback non-expired inode in kupdate writeback"
> -Wrap the comment at 80 columns in patch "fs/writeback: only calculate
> dirtied_before when b_io is empty"
> -Abandon patch "fs/writeback: remove unneeded check in
> writeback_single_inode"
> -Collect RVB from Jan and Tim
Christian, the series looks good to me. Please pick it up once your tree
settles after the merge window. Thanks!
Honza
>
> Kemeng Shi (6):
> fs/writeback: avoid to writeback non-expired inode in kupdate
> writeback
> fs/writeback: bail out if there is no more inodes for IO and queued
> once
> fs/writeback: remove unused parameter wb of finish_writeback_work
> fs/writeback: only calculate dirtied_before when b_io is empty
> fs/writeback: correct comment of __wakeup_flusher_threads_bdi
> fs/writeback: remove unnecessary return in writeback_inodes_sb
>
> fs/fs-writeback.c | 57 +++++++++++++++++++++++++++--------------------
> 1 file changed, 33 insertions(+), 24 deletions(-)
>
> --
> 2.30.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, 28 Feb 2024 17:19:52 +0800, Kemeng Shi wrote:
> v1->v2:
> -Filter non-expired in requeue_inode in patch "fs/writeback: avoid to
> writeback non-expired inode in kupdate writeback"
> -Wrap the comment at 80 columns in patch "fs/writeback: only calculate
> dirtied_before when b_io is empty"
> -Abandon patch "fs/writeback: remove unneeded check in
> writeback_single_inode"
> -Collect RVB from Jan and Tim
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/6] fs/writeback: avoid to writeback non-expired inode in kupdate writeback
https://git.kernel.org/vfs/vfs/c/c66cf7bdd77c
[2/6] fs/writeback: bail out if there is no more inodes for IO and queued once
https://git.kernel.org/vfs/vfs/c/d82e51471fc3
[3/6] fs/writeback: remove unused parameter wb of finish_writeback_work
https://git.kernel.org/vfs/vfs/c/7cb6d20fc517
[4/6] fs/writeback: only calculate dirtied_before when b_io is empty
https://git.kernel.org/vfs/vfs/c/e5cb59d053c2
[5/6] fs/writeback: correct comment of __wakeup_flusher_threads_bdi
https://git.kernel.org/vfs/vfs/c/78f2b24980d8
[6/6] fs/writeback: remove unnecessary return in writeback_inodes_sb
https://git.kernel.org/vfs/vfs/c/ed9d128c0c42
> Christian, the series looks good to me. Please pick it up once your tree
> settles after the merge window. Thanks!
Thanks for the heads-up, Jan! I've picked it now so it isn't forgotten
and will rebase once -rc1 is out.