LinuxLists.cc - [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

2021-01-11 13:27:51

Subject: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

From: Linus Torvalds <[email protected]>

commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.

Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
logic") we've had some very occasional reports of BUG_ON(PageWriteback)
in write_cache_pages(), which we thought we already fixed in commit
073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

But syzbot just reported another one, even with that commit in place.

And it turns out that there's a simpler way to trigger the BUG_ON() than
the one Hugh found with page re-use. It all boils down to the fact that
the page writeback is ostensibly serialized by the page lock, but that
isn't actually really true.

Yes, the people _setting_ writeback all do so under the page lock, but
the actual clearing of the bit - and waking up any waiters - happens
without any page lock.

This gives us this fairly simple race condition:

CPU1 = end previous writeback
CPU2 = start new writeback under page lock
CPU3 = write_cache_pages()

CPU1 CPU2 CPU3
---- ---- ----

end_page_writeback()
test_clear_page_writeback(page)
... delayed...

lock_page();
set_page_writeback()
unlock_page()

lock_page()
wait_on_page_writeback();

wake_up_page(page, PG_writeback);
.. wakes up CPU3 ..

BUG_ON(PageWriteback(page));

where the BUG_ON() happens because we woke up the PG_writeback bit
becasue of the _previous_ writeback, but a new one had already been
started because the clearing of the bit wasn't actually atomic wrt the
actual wakeup or serialized by the page lock.

The reason this didn't use to happen was that the old logic in waiting
on a page bit would just loop if it ever saw the bit set again.

The nice proper fix would probably be to get rid of the whole "wait for
writeback to clear, and then set it" logic in the writeback path, and
replace it with an atomic "wait-to-set" (ie the same as we have for page
locking: we set the page lock bit with a single "lock_page()", not with
"wait for lock bit to clear and then set it").

However, out current model for writeback is that the waiting for the
writeback bit is done by the generic VFS code (ie write_cache_pages()),
but the actual setting of the writeback bit is done much later by the
filesystem ".writepages()" function.

IOW, to make the writeback bit have that same kind of "wait-to-set"
behavior as we have for page locking, we'd have to change our roughly
~50 different writeback functions. Painful.

Instead, just make "wait_on_page_writeback()" loop on the very unlikely
situation that the PG_writeback bit is still set, basically re-instating
the old behavior. This is very non-optimal in case of contention, but
since we only ever set the bit under the page lock, that situation is
controlled.

Reported-by: [email protected]
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Acked-by: Hugh Dickins <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2826,7 +2826,7 @@ EXPORT_SYMBOL(__test_set_page_writeback)
*/
void wait_on_page_writeback(struct page *page)
{
- if (PageWriteback(page)) {
+ while (PageWriteback(page)) {
trace_wait_on_page_writeback(page, page_mapping(page));
wait_on_page_bit(page, PG_writeback);
}

2021-01-11 17:58:56

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

On Mon, 11 Jan 2021, Greg Kroah-Hartman wrote:

> From: Linus Torvalds <[email protected]>
>
> commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.
>
> Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
> logic") we've had some very occasional reports of BUG_ON(PageWriteback)
> in write_cache_pages(), which we thought we already fixed in commit
> 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
>
> But syzbot just reported another one, even with that commit in place.
>
> And it turns out that there's a simpler way to trigger the BUG_ON() than
> the one Hugh found with page re-use. It all boils down to the fact that
> the page writeback is ostensibly serialized by the page lock, but that
> isn't actually really true.
>
> Yes, the people _setting_ writeback all do so under the page lock, but
> the actual clearing of the bit - and waking up any waiters - happens
> without any page lock.
>
> This gives us this fairly simple race condition:
>
> CPU1 = end previous writeback
> CPU2 = start new writeback under page lock
> CPU3 = write_cache_pages()
>
> CPU1 CPU2 CPU3
> ---- ---- ----
>
> end_page_writeback()
> test_clear_page_writeback(page)
> ... delayed...
>
> lock_page();
> set_page_writeback()
> unlock_page()
>
> lock_page()
> wait_on_page_writeback();
>
> wake_up_page(page, PG_writeback);
> .. wakes up CPU3 ..
>
> BUG_ON(PageWriteback(page));
>
> where the BUG_ON() happens because we woke up the PG_writeback bit
> becasue of the _previous_ writeback, but a new one had already been
> started because the clearing of the bit wasn't actually atomic wrt the
> actual wakeup or serialized by the page lock.
>
> The reason this didn't use to happen was that the old logic in waiting
> on a page bit would just loop if it ever saw the bit set again.
>
> The nice proper fix would probably be to get rid of the whole "wait for
> writeback to clear, and then set it" logic in the writeback path, and
> replace it with an atomic "wait-to-set" (ie the same as we have for page
> locking: we set the page lock bit with a single "lock_page()", not with
> "wait for lock bit to clear and then set it").
>
> However, out current model for writeback is that the waiting for the
> writeback bit is done by the generic VFS code (ie write_cache_pages()),
> but the actual setting of the writeback bit is done much later by the
> filesystem ".writepages()" function.
>
> IOW, to make the writeback bit have that same kind of "wait-to-set"
> behavior as we have for page locking, we'd have to change our roughly
> ~50 different writeback functions. Painful.
>
> Instead, just make "wait_on_page_writeback()" loop on the very unlikely
> situation that the PG_writeback bit is still set, basically re-instating
> the old behavior. This is very non-optimal in case of contention, but
> since we only ever set the bit under the page lock, that situation is
> controlled.
>
> Reported-by: [email protected]
> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: [email protected]
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>

I think it's too early to push this one through to stable:
Linus mentioned on Friday that Michael Larabel of Phoronix
has observed a performance regression from this commit.

Correctness outweighs performance of course, but I think
stable users might see the performance issue much sooner
than they would ever see the BUG fixed. Wait a bit,
while we think some more about what to try next?

Hugh

>
> ---
> mm/page-writeback.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2826,7 +2826,7 @@ EXPORT_SYMBOL(__test_set_page_writeback)
> */
> void wait_on_page_writeback(struct page *page)
> {
> - if (PageWriteback(page)) {
> + while (PageWriteback(page)) {
> trace_wait_on_page_writeback(page, page_mapping(page));
> wait_on_page_bit(page, PG_writeback);
> }

2021-01-11 19:07:10

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

On Mon, Jan 11, 2021 at 9:55 AM Hugh Dickins <[email protected]> wrote:
>
> I think it's too early to push this one through to stable:
> Linus mentioned on Friday that Michael Larabel of Phoronix
> has observed a performance regression from this commit.

That turned out to be a red herring. Yes, Michael saw a performance
regression on some machines, but the change to go back to the old
model (where the repeat was basically at wakeup time rather than in
the waiter) didn't actually make any difference.

And the issue only showed on a couple of machines, and only with
certain configurations (ie apparently switching to the performance
governor made it go away).

So it seems to have been some modal behavior about random timing
(possibly just interaction with cpufreq) rather than a real
regression.

I think the real issue is simply that some loads are very sensitive to
the exact writeback timing patterns. And I think we're making things
worse by having some of the patterns be very non-deterministic indeed.

For example, long before we actually take the page lock and then wait
for (and set) the page writeback bit, look at how we first use the
Xarray tags to turn the "page dirty" tag into "page needs writeback"
tag, and then look up an array of such writeback pages: all without
any real locking at all (apart from the xas lock itself for the
tagging op).

Making things even less deterministic, the code that doesn't do
writeback - but just _wait_ for writeback - doesn't actually serialize
with the page lock at all. It _just_ does that
"wait_for_page_writeback()", which is ambiguous when there are
consecutive writebacks happening. That's actually the case that I
think Michael would have seen - because he obviously never saw the
(very very rare) BUG_ON.

The BUG_ON() in page writeback itself is serialized by the page lock
and so there aren't really many possibilities for that to get
contention or other odd behavior (the wakeup race being the one very
very unlikely notable one). In contrast, the "wait for writeback"
isn't serialized by anything else, so that one is literally "if was at
writeback at some point, wait for it to no longer be", and then the
aggressive wakeup was good for that case, while it caused problems for
the writeback case.

Anyway, the numbers are all ambiguous, the one-liner fix is not
horrible, and the take-away from all of this is likely mostly: it
would be good to have some more clarity about the whole writeback and
wait-for-writeback thing.

In many ways it would be really line to have a sequence count rather
than just a single bit. But obviously that does not work for 'struct
page'.

Anyway, don't hold up this "get rid of BUG_ON() in writeback" patch for this.

Linus