2022-04-18 15:51:14

by Zhihao Cheng

[permalink] [raw]
Subject: [PATCH v2] fs-writeback: writeback_sb_inodes :Recalculate 'wrote' according skipped pages

Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA dead lock:

wb_writeback fat_file_fsync
blk_start_plug(&plug)
for (;;) {
iter i-1: some reqs have been added into plug->mq_list // LOCK A
iter i:
progress = __writeback_inodes_wb(wb, work)
. writeback_sb_inodes // fat's bdev
. __writeback_single_inode
. . generic_writepages
. . __block_write_full_page
. . . . __generic_file_fsync
. . . . sync_inode_metadata
. . . . writeback_single_inode
. . . . __writeback_single_inode
. . . . fat_write_inode
. . . . __fat_write_inode
. . . . sync_dirty_buffer // fat's bdev
. . . . lock_buffer(bh) // LOCK B
. . . . submit_bh
. . . . blk_mq_get_tag // LOCK A
. . . trylock_buffer(bh) // LOCK B
. . . redirty_page_for_writepage
. . . wbc->pages_skipped++
. . --wbc->nr_to_write
. wrote += write_chunk - wbc.nr_to_write // wrote > 0
. requeue_inode
. redirty_tail_locked
if (progress) // progress > 0
continue;
iter i+1:
queue_io
// similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug) // flush plug won't be called

Above process triggers a hungtask like:
[ 399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[ 399.046824] Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[ 399.051539] task:bb state:D stack: 0 pid: 2607 ppid:
2426 flags:0x00004000
[ 399.051556] Call Trace:
[ 399.051570] __schedule+0x480/0x1050
[ 399.051592] schedule+0x92/0x1a0
[ 399.051602] io_schedule+0x22/0x50
[ 399.051613] blk_mq_get_tag+0x1d3/0x3c0
[ 399.051640] __blk_mq_alloc_requests+0x21d/0x3f0
[ 399.051657] blk_mq_submit_bio+0x68d/0xca0
[ 399.051674] __submit_bio+0x1b5/0x2d0
[ 399.051708] submit_bio_noacct+0x34e/0x720
[ 399.051718] submit_bio+0x3b/0x150
[ 399.051725] submit_bh_wbc+0x161/0x230
[ 399.051734] __sync_dirty_buffer+0xd1/0x420
[ 399.051744] sync_dirty_buffer+0x17/0x20
[ 399.051750] __fat_write_inode+0x289/0x310
[ 399.051766] fat_write_inode+0x2a/0xa0
[ 399.051783] __writeback_single_inode+0x53c/0x6f0
[ 399.051795] writeback_single_inode+0x145/0x200
[ 399.051803] sync_inode_metadata+0x45/0x70
[ 399.051856] __generic_file_fsync+0xa3/0x150
[ 399.051880] fat_file_fsync+0x1d/0x80
[ 399.051895] vfs_fsync_range+0x40/0xb0
[ 399.051929] __x64_sys_fsync+0x18/0x30

In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
unplug before cond_resched in writeback_sb_inodes") in function
'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
from write_cache_pages().

Fix it by correcting wrote number according number of skipped pages
in writeback_sb_inodes().

Goto Link to find a reproducer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: [email protected] # v4.3
Signed-off-by: Zhihao Cheng <[email protected]>
---
fs/fs-writeback.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 591fe9cf1659..9740c7d346aa 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1775,11 +1775,12 @@ static long writeback_sb_inodes(struct super_block *sb,
};
unsigned long start_time = jiffies;
long write_chunk;
- long wrote = 0; /* count both pages and inodes */
+ long total_wrote = 0; /* count both pages and inodes */

while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
struct bdi_writeback *tmp_wb;
+ long wrote;

if (inode->i_sb != sb) {
if (work->sb) {
@@ -1854,8 +1855,10 @@ static long writeback_sb_inodes(struct super_block *sb,
__writeback_single_inode(inode, &wbc);

wbc_detach_inode(&wbc);
- work->nr_pages -= write_chunk - wbc.nr_to_write;
- wrote += write_chunk - wbc.nr_to_write;
+ wrote = write_chunk - wbc.nr_to_write - wbc.pages_skipped;
+ wrote = wrote < 0 ? 0 : wrote;
+ work->nr_pages -= wrote;
+ total_wrote += wrote;

if (need_resched()) {
/*
@@ -1877,7 +1880,7 @@ static long writeback_sb_inodes(struct super_block *sb,
tmp_wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
if (!(inode->i_state & I_DIRTY_ALL))
- wrote++;
+ total_wrote++;
requeue_inode(inode, tmp_wb, &wbc);
inode_sync_complete(inode);
spin_unlock(&inode->i_lock);
@@ -1891,14 +1894,14 @@ static long writeback_sb_inodes(struct super_block *sb,
* bail out to wb_writeback() often enough to check
* background threshold and other termination conditions.
*/
- if (wrote) {
+ if (total_wrote) {
if (time_is_before_jiffies(start_time + HZ / 10UL))
break;
if (work->nr_pages <= 0)
break;
}
}
- return wrote;
+ return total_wrote;
}

static long __writeback_inodes_wb(struct bdi_writeback *wb,
--
2.31.1


2022-04-21 15:47:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] fs-writeback: writeback_sb_inodes:R ecalculate 'wrote' according skipped pages

[ Adding some scheduler people - the background here is a ABBA
deadlock because a plug never gets unplugged and the IO never starts
and the buffer lock thus never gets released. That's simplified, see
https://lore.kernel.org/all/[email protected]/
and
https://bugzilla.kernel.org/show_bug.cgi?id=215837
for details ]

On Mon, Apr 18, 2022 at 2:14 AM Zhihao Cheng <[email protected]> wrote:
>
> In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
> unplug before cond_resched in writeback_sb_inodes") in function
> 'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
> from write_cache_pages().

So I'm not reacting to the patch, but just to this part of the message...

I forget the exact history of plugging, but at some point (long long
ago - we're talking pre-git days) it was device-specific and always
released on a timeout (or, obviously, explicitly unplugged).

And then later it became per-process, and always released by task-work
on any schedule() call.

But over time, that "any schedule" has gone away. It did so gradually,
over time, and long ago:

73c101011926 ("block: initial patch for on-stack per-task plugging")
6631e635c65d ("block: don't flush plugged IO on forced preemtion scheduling")

And that's *mostly* perfectly fine, but the problem ends up being that
not everything necessarily triggers the flushing at all.

In fact, if you call "__schedule()" directly (rather than
"schedule()") I think you may end up avoiding flush entirely. I'm
looking at do_task_dead() and schedule_idle() and the
preempt_schedule() cases.

Similarly, tsk_is_pi_blocked() will disable the plug flush.

Back when it was a timer, the flushing was eventually guaranteed.

And then we would flush on any re-schedule, even if it was about
preemption and the process might stay on the CPU.

But these days we can be in the situation where we really don't flush
at all - the process may be scheduled away, but if it's still
runnable, the blk plug won't be flushed.

To make things *really* confusing, doing an io_schedule() will force a
plug flush, even if the process might stay runnable. So io_schedule()
has those old legacy "unconditional flush" guarantees that a normal
schedule does not any more.

Also note how the plug is per-process, so when another process *does*
block (because it's waiting for some resource), that doesn't end up
really unplugging the actual IO which was started by somebody else.
Even if that other process is using io_schedule().

Which all brings us back to how we have that hacky thing in
writeback_sb_inodes() that does

if (need_resched()) {
/*
* We're trying to balance between building up a nice
* long list of IOs to improve our merge rate, and
* getting those IOs out quickly for anyone throttling
* in balance_dirty_pages(). cond_resched() doesn't
* unplug, so get our IOs out the door before we
* give up the CPU.
*/
blk_flush_plug(current->plug, false);
cond_resched();
}

and that currently *mostly* ends up protecting us and flushing the
plug when doing big writebacks, but as you can see from the email I'm
quoting, it then doesn't always work very well, because
"need_resched()" may end up being cleared by some other scheduling
point, and is entirely meaningless when preemption is on anyway.

So I think that's basically just a random voodoo programming thing
that has protected us in the past in some situations.

Now, Zhihao has a patch that fixes the problem by limiting the
writeback by being better at accounting:

https://lore.kernel.org/all/[email protected]/

which is the email I'm answering, but I did want to bring in the
scheduler people to the discussion to see if people have ideas.

I think the writeback accounting fix is the right thing to do
regardless, but that whole need_resched() dance in
writeback_sb_inodes() is, I think, a sign that we do have real issues
here. That whole "flush plug if we need to reschedule" is simply a
fundamentally broken concept, when there are other rescheduling
points.

Comments?

The answer may just be that "the code in writeback_sb_inodes() is
fundamentally broken and should be removed".

But the fact that we have that code at all makes me quite nervous
about this. And we clearly *do* have situations where the writeback
code seems to cause nasty unplugging delays.

So I'm not convinced that "fix up the writeback accounting" is the
real and final fix.

I don't really have answers or suggestions, I just wanted people to
look at this in case they have ideas.

Linus