2009-09-16 17:22:49

by Jan Kara

Subject: [PATCH] fs: Fix busyloop in wb_writeback()

If all inodes are under writeback (e.g. in case when there's only one inode
with dirty pages), wb_writeback() doing WB_SYNC_NONE work basically degrades
to busylooping until the I_SYNC flag of the inode is cleared. Fix the problem by
waiting on the I_SYNC flag of an inode on the b_more_io list in case we failed to
write anything.

CC: Wu Fengguang <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8e1e5e1..c59d673 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -706,6 +706,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	struct inode *inode;
 
 	if (wbc.for_kupdate) {
 		wbc.older_than_this = &oldest_jif;
@@ -747,8 +748,24 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * If we ran out of stuff to write, bail unless more_io got set
 		 */
 		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			if (wbc.more_io && !wbc.for_kupdate)
+			if (wbc.more_io && !wbc.for_kupdate) {
+				if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+					continue;
+				/*
+				 * Nothing written. Wait for some inode to
+				 * become available for writeback. Otherwise
+				 * we'll just busyloop.
+				 */
+				spin_lock(&inode_lock);
+				if (!list_empty(&wb->b_more_io)) {
+					inode = list_entry(
+							wb->b_more_io.prev,
+							struct inode, i_list);
+					inode_wait_for_writeback(inode);
+				}
+				spin_unlock(&inode_lock);
 				continue;
+			}
 			break;
 		}
 	}
--
1.6.0.2


2009-09-16 18:41:07

by Jens Axboe

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Wed, Sep 16 2009, Jan Kara wrote:
> If all inodes are under writeback (e.g. in case when there's only one inode
> with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> write anything.

Interesting, so this will happen if the dirtier and flush thread end up
"fighting" each other over the same inode. I'll throw this into the
testing mix.

How did you notice?

--
Jens Axboe

2009-09-17 09:09:15

by Jan Kara

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Wed 16-09-09 20:41:06, Jens Axboe wrote:
> On Wed, Sep 16 2009, Jan Kara wrote:
> > If all inodes are under writeback (e.g. in case when there's only one inode
> > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > write anything.
>
> Interesting, so this will happen if the dirtier and flush thread end up
> "fighting" each other over the same inode. I'll throw this into the
> testing mix.
Yes, exactly. When you do "dd if=/dev/zero of=/mnt/mydisk" you're quite
likely to observe that.

> How did you notice?
Well, UML Linux doesn't seem to preempt a process while it is in the
kernel, and thus such a busyloop simply deadlocks it. BTW, another similar
deadlock happens in UML Linux in balance_dirty_pages() (a process cycles
there waiting for nr_writeback to drop, but that never seems to happen). But
I'd blame that on a problem in UML scheduling...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-19 01:53:45

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

Hi Jan,

On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> If all inodes are under writeback (e.g. in case when there's only one inode
> with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> write anything.

Reviewed-by: Wu Fengguang <[email protected]>

I can confirm this works. The traces before/after the patch show that many
of the fruitless loops are eliminated. Note that the busy loops can
happen _before_ and _during_ block IO queue congestion, so it looks
like a very competitive alternative to the old congestion_wait().

Thanks,
Fengguang
---

writeback traces for

cp /dev/zero /mnt/test/zero0 &
dd if=/dev/zero of=/mnt/test/zero1 &

BEFORE
[ 315.025469] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 315.043293] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 320.065739] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 320.072397] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 391050 written 0
[ 325.081930] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 325.087874] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 328.752015] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 328.769780] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 328.795771] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 328.814094] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 330.848860] sda: sda1 sda2
[ 330.877501] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 330.895610] global dirty 1 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 330.920078] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 330.938017] global dirty 1 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 330.982888] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 331.000240] global dirty 1 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 331.009272] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 331.015196] global dirty 1 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 335.962649] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 335.980258] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 336.022632] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 336.040266] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 341.002978] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 341.011937] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 341.062940] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 341.071762] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 346.023053] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 346.030616] global dirty 3818 writeback 167 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 346.051762] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 16384
[ 346.058400] global dirty 3438 writeback 227 nfs 0 flags _M towrite 0 skipped 0 file 0 written 16384
[ 347.283567] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 16384
[ 347.290027] global dirty 13054 writeback 70 nfs 0 flags _M towrite 0 skipped 0 file 0 written 16384
[ 347.877429] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 16384
[ 347.884693] global dirty 3902 writeback 1622 nfs 0 flags _M towrite 0 skipped 0 file 0 written 16384
[ 348.124756] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 6967
[ 348.131689] global dirty 2 writeback 1702 nfs 0 flags __ towrite 9417 skipped 0 file 0 written 6967
[ 348.141112] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 348.146952] global dirty 2 writeback 1672 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 348.911029] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 15479
[ 348.918377] global dirty 1 writeback 1233 nfs 0 flags __ towrite 905 skipped 0 file 0 written 15479
[ 351.039556] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 351.045437] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 353.923671] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 353.941259] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 356.063781] fs/fs-writeback.c 776 wb_writeback: flush-0:15(1527) 0
[ 356.081537] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0
[ 358.963986] fs/fs-writeback.c 776 wb_writeback: flush-8:0(2824) 0
[ 358.981406] global dirty 0 writeback 0 nfs 0 flags __ towrite 16384 skipped 0 file 0 written 0


AFTER (with format changes and 128M MAX_WRITEBACK_PAGES)
[ 55.457583] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 55.468583] global dirty=2 writeback=0 nfs=1 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 57.835277] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=1
[ 57.845211] global dirty=2 writeback=0 nfs=1 flags=__ towrite=32767 skipped=0 file=317634 written=1
[ 60.477882] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 60.487177] global dirty=1 writeback=0 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 62.857975] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 62.865541] global dirty=1 writeback=0 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 65.497957] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 65.502201] global dirty=1 writeback=0 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 66.449865] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11437
[ 66.454802] global dirty=127 writeback=4820 nfs=2 flags=__ towrite=21331 skipped=0 file=12 written=5589
[ 67.089220] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11355
[ 67.093548] global dirty=257 writeback=3490 nfs=2 flags=__ towrite=21413 skipped=0 file=12 written=8273
[ 67.704613] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11347
[ 67.712514] global dirty=227 writeback=1810 nfs=2 flags=__ towrite=21421 skipped=0 file=12 written=8168
[ 67.874118] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 67.877341] global dirty=2337 writeback=0 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 68.417686] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11566
[ 68.421556] global dirty=357 writeback=5000 nfs=2 flags=__ towrite=21202 skipped=0 file=12 written=8074
[ 69.252307] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11994
[ 69.266007] global dirty=923 writeback=720 nfs=2 flags=__ towrite=20774 skipped=0 file=12 written=8758
[ 69.899959] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11837
[ 69.906911] global dirty=300 writeback=3156 nfs=2 flags=__ towrite=20931 skipped=0 file=12 written=8197
[ 70.553993] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11504
[ 70.560360] global dirty=280 writeback=4256 nfs=2 flags=__ towrite=21264 skipped=0 file=0 written=36
[ 70.577045] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 70.585187] global dirty=280 writeback=4126 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 71.323770] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11896
[ 71.329727] global dirty=460 writeback=10132 nfs=2 flags=__ towrite=20872 skipped=0 file=12 written=8877
[ 72.887085] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 74.189957] global dirty=9632 writeback=5243 nfs=2 flags=__ towrite=32768 skipped=0 file=314386 written=0
[ 74.367971] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11987
[ 74.396704] global dirty=298 writeback=15315 nfs=2 flags=__ towrite=20781 skipped=0 file=12 written=7758
[ 75.249214] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12148
[ 75.258805] global dirty=348 writeback=5021 nfs=2 flags=__ towrite=20620 skipped=0 file=12 written=9810
[ 76.288875] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=17368
[ 76.297156] global dirty=1792 writeback=542 nfs=2 flags=__ towrite=15400 skipped=0 file=0 written=5
[ 76.307482] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 76.328242] global dirty=1792 writeback=292 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 76.759708] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11691
[ 76.762893] global dirty=230 writeback=3874 nfs=2 flags=__ towrite=21077 skipped=0 file=12 written=7960
[ 77.261358] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11851
[ 77.267354] global dirty=279 writeback=3373 nfs=2 flags=__ towrite=20917 skipped=0 file=12 written=9484
[ 77.767085] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12188
[ 77.770236] global dirty=319 writeback=3769 nfs=2 flags=__ towrite=20580 skipped=0 file=12 written=9739
[ 78.287216] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11867
[ 78.292713] global dirty=416 writeback=2570 nfs=2 flags=__ towrite=20901 skipped=0 file=12 written=9548
[ 78.982964] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12784
[ 79.050506] global dirty=715 writeback=7312 nfs=2 flags=__ towrite=19984 skipped=0 file=12 written=10122
[ 79.217120] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 79.339208] global dirty=2627 writeback=4624 nfs=2 flags=__ towrite=32768 skipped=0 file=391050 written=0
[ 79.831759] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12303
[ 79.842709] global dirty=350 writeback=12302 nfs=2 flags=__ towrite=20465 skipped=0 file=12 written=9661
[ 81.664025] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=20207
[ 81.667844] global dirty=2342 writeback=1226 nfs=2 flags=__ towrite=12561 skipped=0 file=12 written=17537
[ 81.674537] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 81.679015] global dirty=2342 writeback=976 nfs=2 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 82.203041] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12762
[ 82.205884] global dirty=496 writeback=3170 nfs=2 flags=__ towrite=20006 skipped=0 file=12 written=9205
[ 83.153241] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=17142
[ 83.160762] global dirty=1332 writeback=3474 nfs=2 flags=__ towrite=15626 skipped=0 file=12 written=14034
[ 83.752083] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=13238
[ 83.761354] global dirty=522 writeback=10440 nfs=2 flags=__ towrite=19530 skipped=0 file=12 written=10428
[ 84.448215] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 84.514842] global dirty=85 writeback=257 nfs=2 flags=__ towrite=32768 skipped=0 file=309507 written=0
[ 84.653061] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11758
[ 84.666930] global dirty=475 writeback=1389 nfs=2 flags=__ towrite=21010 skipped=0 file=12 written=8598
[ 85.596776] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=15875
[ 85.610176] global dirty=1255 writeback=2492 nfs=1 flags=__ towrite=16893 skipped=0 file=12 written=12963
[ 86.679880] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=19187
[ 86.685391] global dirty=1941 writeback=700 nfs=1 flags=__ towrite=13581 skipped=0 file=12 written=15835
[ 86.693720] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 86.732001] global dirty=1941 writeback=440 nfs=1 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 87.220513] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12074
[ 87.226822] global dirty=280 writeback=3514 nfs=1 flags=__ towrite=20694 skipped=0 file=12 written=7892
[ 88.401825] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=18287
[ 88.521634] global dirty=1576 writeback=7698 nfs=1 flags=__ towrite=14481 skipped=0 file=12 written=15676
[ 88.545912] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=1700
[ 88.558696] global dirty=2 writeback=9392 nfs=1 flags=__ towrite=31068 skipped=0 file=12 written=131
[ 89.527762] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 89.540987] global dirty=8520 writeback=1 nfs=1 flags=__ towrite=32768 skipped=0 file=317634 written=0
[ 90.373312] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=22441
[ 90.385804] global dirty=2793 writeback=99 nfs=0 flags=__ towrite=10327 skipped=0 file=12 written=20896
[ 90.896688] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11962
[ 90.905466] global dirty=172 writeback=6032 nfs=0 flags=__ towrite=20806 skipped=0 file=0 written=4
[ 91.539929] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12300
[ 91.548079] global dirty=382 writeback=3210 nfs=0 flags=__ towrite=20468 skipped=0 file=12 written=10106
[ 92.046175] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12025
[ 92.054893] global dirty=320 writeback=8380 nfs=0 flags=__ towrite=20743 skipped=0 file=12 written=9523
[ 92.066455] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 92.075349] global dirty=320 writeback=8250 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 94.606396] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 94.620091] global dirty=18234 writeback=437 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 94.680711] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=2829 n=1536
[ 94.708757] global dirty=18234 writeback=117 nfs=0 flags=_M towrite=0 skipped=0 file=12 written=1536
[ 94.845526] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=2829 n=1536
[ 94.858110] global dirty=18740 writeback=377 nfs=0 flags=_M towrite=0 skipped=0 file=12 written=1536
[ 94.967886] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=2829 n=1536
[ 94.986908] global dirty=18232 writeback=825 nfs=0 flags=_M towrite=0 skipped=0 file=12 written=1536
[ 95.015415] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=3068
[ 95.049828] global dirty=18232 writeback=3287 nfs=0 flags=__ towrite=29700 skipped=0 file=13 written=3068
[ 95.992550] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=27637
[ 96.123581] global dirty=2 writeback=13998 nfs=0 flags=__ towrite=5131 skipped=0 file=13 written=1864
[ 96.397437] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=312
[ 96.400451] global dirty=2 writeback=6892 nfs=0 flags=__ towrite=32456 skipped=0 file=13 written=119
[ 97.104951] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12686
[ 97.193731] global dirty=364 writeback=9422 nfs=0 flags=__ towrite=20082 skipped=0 file=13 written=3420
[ 97.216739] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 97.223789] global dirty=364 writeback=9162 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 99.158402] mm/page-writeback.c +539 balance_dirty_pages(): comm=dd pid=2830 n=0
[ 99.162137] global dirty=27516 writeback=0 nfs=0 flags=_M towrite=1536 skipped=0 file=0 written=0
[ 99.185804] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=2829 n=1536
[ 99.193891] global dirty=3926 writeback=30 nfs=0 flags=_M towrite=0 skipped=0 file=13 written=1536
[ 99.284045] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=2829 n=0
[ 99.291527] global dirty=24718 writeback=0 nfs=0 flags=_M towrite=12 skipped=0 file=0 written=0
[ 99.415383] mm/page-writeback.c +539 balance_dirty_pages(): comm=dd pid=2830 n=1536
[ 99.421834] global dirty=22818 writeback=340 nfs=0 flags=_M towrite=0 skipped=0 file=13 written=1536
[ 99.587007] mm/page-writeback.c +539 balance_dirty_pages(): comm=dd pid=2830 n=1536
[ 99.598065] global dirty=18574 writeback=1130 nfs=0 flags=_M towrite=0 skipped=0 file=13 written=1536
[ 99.679940] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 99.686266] global dirty=17604 writeback=0 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 100.248582] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=32442
[ 100.441214] global dirty=2453 writeback=5776 nfs=0 flags=__ towrite=326 skipped=0 file=12 written=32442
[ 102.362534] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=22551
[ 102.501230] global dirty=4200 writeback=15770 nfs=0 flags=__ towrite=10217 skipped=0 file=12 written=19081
[ 103.255460] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=8602
[ 103.527612] global dirty=216 writeback=17970 nfs=0 flags=__ towrite=24166 skipped=0 file=12 written=4099
[ 103.541034] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 103.592828] global dirty=216 writeback=17720 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 104.599401] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11468
[ 104.717221] global dirty=169 writeback=19068 nfs=0 flags=__ towrite=21300 skipped=0 file=12 written=8234
[ 104.865264] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 104.963345] global dirty=169 writeback=18048 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 107.321816] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=32768
[ 107.334944] global dirty=2598 writeback=16741 nfs=0 flags=_M towrite=0 skipped=0 file=12 written=30241
[ 107.510672] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=3350
[ 107.524301] global dirty=2 writeback=18041 nfs=0 flags=__ towrite=29418 skipped=0 file=12 written=1072
[ 108.391365] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12939
[ 108.448731] global dirty=186 writeback=17293 nfs=0 flags=__ towrite=19829 skipped=0 file=12 written=10161
[ 108.641344] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 108.660176] global dirty=186 writeback=15887 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 109.432398] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11167
[ 109.519369] global dirty=252 writeback=16768 nfs=0 flags=__ towrite=21601 skipped=0 file=12 written=8985
[ 109.964534] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 110.202144] global dirty=6614 writeback=8453 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 111.449967] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=20458
[ 111.497377] global dirty=1244 writeback=14844 nfs=0 flags=__ towrite=12310 skipped=0 file=12 written=19047
[ 111.578967] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=1740
[ 111.689508] global dirty=4 writeback=13764 nfs=0 flags=__ towrite=31028 skipped=0 file=12 written=0
[ 112.402427] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12937
[ 112.496227] global dirty=435 writeback=18004 nfs=0 flags=__ towrite=19831 skipped=0 file=12 written=10037
[ 113.649987] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12469
[ 113.762850] global dirty=417 writeback=8854 nfs=0 flags=__ towrite=20299 skipped=0 file=12 written=9559
[ 113.777059] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 113.819819] global dirty=417 writeback=7380 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 114.469119] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=11468
[ 114.477888] global dirty=436 writeback=9377 nfs=0 flags=__ towrite=21300 skipped=0 file=12 written=8719
[ 115.205915] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=12628
[ 115.235844] global dirty=429 writeback=12730 nfs=0 flags=__ towrite=20140 skipped=0 file=12 written=9233
[ 115.286230] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 115.298967] global dirty=1119 writeback=10552 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 116.607551] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=25306
[ 116.617596] global dirty=1264 writeback=15443 nfs=0 flags=__ towrite=7462 skipped=0 file=12 written=22359
[ 116.660825] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=1491
[ 116.669616] global dirty=6 writeback=15261 nfs=0 flags=__ towrite=31277 skipped=0 file=12 written=213
[ 118.348742] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=15753
[ 118.495918] global dirty=1126 writeback=5683 nfs=0 flags=__ towrite=17015 skipped=0 file=12 written=14199
[ 119.455353] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=8122
[ 119.463974] global dirty=0 writeback=8928 nfs=0 flags=__ towrite=24646 skipped=0 file=12 written=5218
[ 119.474554] fs/fs-writeback.c +777 wb_writeback(): comm=flush-8:0 pid=2827 n=0
[ 119.516772] global dirty=0 writeback=7656 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0
[ 120.313149] fs/fs-writeback.c +777 wb_writeback(): comm=flush-0:15 pid=1149 n=0
[ 120.315446] global dirty=1 writeback=0 nfs=0 flags=__ towrite=32768 skipped=0 file=0 written=0

2009-09-20 02:35:34

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> If all inodes are under writeback (e.g. in case when there's only one inode
> with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> write anything.

Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
But inodes in b_more_io are not expected to have I_SYNC set. So your
patch looks like a big no-op?
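
For reference, inode_wait_for_writeback() in the fs/fs-writeback.c of this
period does roughly the following (a reconstructed sketch, not a verbatim
copy; the sleep_on_isync() action callback is an illustrative name): it drops
inode_lock, sleeps on the __I_SYNC bit waitqueue, retakes the lock, and loops
until I_SYNC is clear.

static int sleep_on_isync(void *word)	/* illustrative action callback */
{
	schedule();
	return 0;
}

static void inode_wait_for_writeback(struct inode *inode)
{
	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);

	do {
		/* caller holds inode_lock; drop it while sleeping */
		spin_unlock(&inode_lock);
		__wait_on_bit(wqh, &wq, sleep_on_isync, TASK_UNINTERRUPTIBLE);
		spin_lock(&inode_lock);
	} while (inode->i_state & I_SYNC);
}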

The busy loop does exist when the bdi is congested.
In this case, write_cache_pages() will refuse to write anything;
we used to call congestion_wait() to take a breath, but now
wb_writeback() has purged that call and thus created a busy loop.

So the solution (at least for this merge window)
should be to restore the congestion_wait() call.
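
A minimal sketch of that kind of change, assuming the wbc.encountered_congestion
flag and congestion_wait() API of this era (a concrete patch along these lines
appears later in the thread):

		/*
		 * Sketch: nothing was written this round; if the bdi was
		 * congested, back off briefly instead of retrying at once.
		 */
		if (wbc.encountered_congestion)
			congestion_wait(BLK_RW_ASYNC, HZ);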

Thanks,
Fengguang

> CC: Wu Fengguang <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 19 ++++++++++++++++++-
> 1 files changed, 18 insertions(+), 1 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 8e1e5e1..c59d673 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -706,6 +706,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> };
> unsigned long oldest_jif;
> long wrote = 0;
> + struct inode *inode;
>
> if (wbc.for_kupdate) {
> wbc.older_than_this = &oldest_jif;
> @@ -747,8 +748,24 @@ static long wb_writeback(struct bdi_writeback *wb,
> * If we ran out of stuff to write, bail unless more_io got set
> */
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> - if (wbc.more_io && !wbc.for_kupdate)
> + if (wbc.more_io && !wbc.for_kupdate) {
> + if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> + continue;
> + /*
> + * Nothing written. Wait for some inode to
> + * become available for writeback. Otherwise
> + * we'll just busyloop.
> + */
> + spin_lock(&inode_lock);
> + if (!list_empty(&wb->b_more_io)) {
> + inode = list_entry(
> + wb->b_more_io.prev,
> + struct inode, i_list);
> + inode_wait_for_writeback(inode);
> + }
> + spin_unlock(&inode_lock);
> continue;
> + }
> break;
> }
> }
> --
> 1.6.0.2

2009-09-20 17:43:56

by Jan Kara

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Sun 20-09-09 10:35:28, Wu Fengguang wrote:
> On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> > If all inodes are under writeback (e.g. in case when there's only one inode
> > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > write anything.
>
> Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
> But inodes in b_more_io are not expected to have I_SYNC set. So your
> patch looks like a big no-op?
Hmm, I don't think so. writeback_single_inode() does:
	if (inode->i_state & I_SYNC) {
		/*
		 * If this inode is locked for writeback and we are not doing
		 * writeback-for-data-integrity, move it to b_more_io so that
		 * writeback can proceed with the other inodes on s_io.
		 *
		 * We'll have another go at writing back this inode when we
		 * completed a full scan of b_io.
		 */
		if (!wait) {
			requeue_io(inode);
			return 0;
		}

So when we see an inode under writeback, we put it on b_more_io. So I think
my patch really fixes the issue when two threads are racing to write the
same inode.

> The busy loop does exists, when bdi is congested.
> In this case, write_cache_pages() will refuse to write anything,
> we used to be calling congestion_wait() to take a breath, but now
> wb_writeback() purged that call and thus created a busy loop.
I don't think congestion is an issue here. The device needn't be
congested for the busyloop to happen.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-21 01:09:08

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> On Sun 20-09-09 10:35:28, Wu Fengguang wrote:
> > On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > write anything.
> >
> > Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
> > But inodes in b_more_io are not expected to have I_SYNC set. So your
> > patch looks like a big no-op?
> Hmm, I don't think so. writeback_single_inode() does:
> if (inode->i_state & I_SYNC) {
> /*
> * If this inode is locked for writeback and we are not
> * doing
> * writeback-for-data-integrity, move it to b_more_io so
> * that
> * writeback can proceed with the other inodes on s_io.
> *
> * We'll have another go at writing back this inode when we
> * completed a full scan of b_io.
> */
> if (!wait) {
> requeue_io(inode);
> return 0;
> }
>
> So when we see inode under writeback, we put it to b_more_io. So I think
> my patch really fixes the issue when two threads are racing on writing the
> same inode.

Ah OK. So it busy loops when there are more syncing threads than dirty
files. For example, one bdi flush thread plus one process running
balance_dirty_pages().

> > The busy loop does exists, when bdi is congested.
> > In this case, write_cache_pages() will refuse to write anything,
> > we used to be calling congestion_wait() to take a breath, but now
> > wb_writeback() purged that call and thus created a busy loop.
> I don't think congestion is an issue here. The device needen't be
> congested for the busyloop to happen.

bdi congestion is a different case. When there is only one syncing
thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
wb_writeback() or any of its subroutines must wait/yield for a while
to avoid busy looping on congestion. Where is the wait with Jens'
new code?

Another question: why can wbc.more_io be ignored for kupdate syncs?
I guess it would lead to slow writeback of large files.

This patch reflects my concerns about the two problems.

Thanks,
Fengguang
---
fs/fs-writeback.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c 2009-09-20 10:44:25.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-09-21 08:53:09.000000000 +0800
@@ -818,8 +818,10 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
 		 */
-		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			if (wbc.more_io && !wbc.for_kupdate)
+		if (wbc.nr_to_write > 0) {
+			if (wbc.encountered_congestion)
+				congestion_wait(BLK_RW_ASYNC, HZ);
+			if (wbc.more_io)
 				continue;
 			break;
 		}

2009-09-21 13:02:12

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Thu, Sep 17, 2009 at 02:41:06AM +0800, Jens Axboe wrote:
> On Wed, Sep 16 2009, Jan Kara wrote:
> > If all inodes are under writeback (e.g. in case when there's only one inode
> > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > write anything.
>
> Interesting, so this will happen if the dirtier and flush thread end up
> "fighting" each other over the same inode. I'll throw this into the
> testing mix.
>
> How did you notice?

Jens, I found another busy loop. Not sure about the solution, but here
is the quick fact.

Tested git head is 1ef7d9aa32a8ee054c4d4fdcd2ea537c04d61b2f, which
seems to be the last writeback patch in the linux-next tree. I cannot
run the plain head of linux-next because it just refuses to boot.

On top of that, Jan Kara's I_SYNC waiting patch and the attached
debugging patch are applied.

Test commands are:

# mount /mnt/test # ext4 fs
# echo 1 > /proc/sys/fs/dirty_debug

# cp /dev/zero /mnt/test/zero0

After that the box is locked up, the system is busy doing these:

[ 54.740295] requeue_io() +457: inode=79232
[ 54.740300] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
[ 54.740303] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
[ 54.740317] requeue_io() +457: inode=79232
[ 54.740322] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
[ 54.740325] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
[ 54.740339] requeue_io() +457: inode=79232
[ 54.740344] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
[ 54.740347] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
......

Basically the traces show that balance_dirty_pages() is busy looping.
It cannot write anything because the inode is always requeued by these lines:

	if (inode->i_state & I_SYNC) {
		if (!wait) {
			requeue_io(inode);
			return 0;
		}

This seems to happen when the partition is FULL.

Thanks,
Fengguang


Attachments:
(No filename) (2.21 kB)
wb-debug.patch (5.01 kB)

2009-09-21 13:06:33

by Jens Axboe

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21 2009, Wu Fengguang wrote:
> On Thu, Sep 17, 2009 at 02:41:06AM +0800, Jens Axboe wrote:
> > On Wed, Sep 16 2009, Jan Kara wrote:
> > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > write anything.
> >
> > Interesting, so this will happen if the dirtier and flush thread end up
> > "fighting" each other over the same inode. I'll throw this into the
> > testing mix.
> >
> > How did you notice?
>
> Jens, I found another busy loop. Not sure about the solution, but here
> is the quick fact.
>
> Tested git head is 1ef7d9aa32a8ee054c4d4fdcd2ea537c04d61b2f, which
> seems to be the last writeback patch in the linux-next tree. I cannot
> run the plain head of linux-next because it just refuses boot up.
>
> On top of which Jan Kara's I_SYNC waiting patch and the attached
> debugging patch is applied.
>
> Test commands are:
>
> # mount /mnt/test # ext4 fs
> # echo 1 > /proc/sys/fs/dirty_debug
>
> # cp /dev/zero /mnt/test/zero0
>
> After that the box is locked up, the system is busy doing these:
>
> [ 54.740295] requeue_io() +457: inode=79232
> [ 54.740300] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> [ 54.740303] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> [ 54.740317] requeue_io() +457: inode=79232
> [ 54.740322] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> [ 54.740325] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> [ 54.740339] requeue_io() +457: inode=79232
> [ 54.740344] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> [ 54.740347] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> ......
>
> Basically the traces show that balance_dirty_pages() is busy looping.
> It cannot write anything because the inode always be requeued by this line:
>
> if (inode->i_state & I_SYNC) {
> if (!wait) {
> requeue_io(inode);
> return 0;
> }
>
> This seem to happen when the partition is FULL.

OK, I'll have to look into these issues soonish. I'll be travelling next
week though, so I cannot do more about it right now. If you have time to
work on tested patches, that would be MUCH appreciated!

--
Jens Axboe

2009-09-21 13:11:11

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 09:06:34PM +0800, Jens Axboe wrote:
> On Mon, Sep 21 2009, Wu Fengguang wrote:
> > On Thu, Sep 17, 2009 at 02:41:06AM +0800, Jens Axboe wrote:
> > > On Wed, Sep 16 2009, Jan Kara wrote:
> > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > write anything.
> > >
> > > Interesting, so this will happen if the dirtier and flush thread end up
> > > "fighting" each other over the same inode. I'll throw this into the
> > > testing mix.
> > >
> > > How did you notice?
> >
> > Jens, I found another busy loop. Not sure about the solution, but here
> > is the quick fact.
> >
> > Tested git head is 1ef7d9aa32a8ee054c4d4fdcd2ea537c04d61b2f, which
> > seems to be the last writeback patch in the linux-next tree. I cannot
> > run the plain head of linux-next because it just refuses boot up.
> >
> > On top of which Jan Kara's I_SYNC waiting patch and the attached
> > debugging patch is applied.
> >
> > Test commands are:
> >
> > # mount /mnt/test # ext4 fs
> > # echo 1 > /proc/sys/fs/dirty_debug
> >
> > # cp /dev/zero /mnt/test/zero0
> >
> > After that the box is locked up, the system is busy doing these:
> >
> > [ 54.740295] requeue_io() +457: inode=79232
> > [ 54.740300] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > [ 54.740303] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > [ 54.740317] requeue_io() +457: inode=79232
> > [ 54.740322] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > [ 54.740325] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > [ 54.740339] requeue_io() +457: inode=79232
> > [ 54.740344] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > [ 54.740347] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > ......
> >
> > Basically the traces show that balance_dirty_pages() is busy looping.
> > It cannot write anything because the inode always be requeued by this line:
> >
> > if (inode->i_state & I_SYNC) {
> > if (!wait) {
> > requeue_io(inode);
> > return 0;
> > }
> >
> > This seem to happen when the partition is FULL.
>
> OK, I'll have to look into these issues soonish. I'll be travelling next
> week though, so I cannot do more about it right now. If you have time to
> work on tested patches, then that would be MUCH appreciated it!

Jens, your bug fix patches arrived just before I submitted this bug report :)

Thanks,
Fengguang

2009-09-21 13:19:25

by Jan Kara

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon 21-09-09 21:01:45, Wu Fengguang wrote:
> On Thu, Sep 17, 2009 at 02:41:06AM +0800, Jens Axboe wrote:
> > On Wed, Sep 16 2009, Jan Kara wrote:
> > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > write anything.
> >
> > Interesting, so this will happen if the dirtier and flush thread end up
> > "fighting" each other over the same inode. I'll throw this into the
> > testing mix.
> >
> > How did you notice?
>
> Jens, I found another busy loop. Not sure about the solution, but here
> is the quick fact.
>
> Tested git head is 1ef7d9aa32a8ee054c4d4fdcd2ea537c04d61b2f, which
> seems to be the last writeback patch in the linux-next tree. I cannot
> run the plain head of linux-next because it just refuses boot up.
>
> On top of which Jan Kara's I_SYNC waiting patch and the attached
> debugging patch is applied.
>
> Test commands are:
>
> # mount /mnt/test # ext4 fs
> # echo 1 > /proc/sys/fs/dirty_debug
>
> # cp /dev/zero /mnt/test/zero0
>
> After that the box is locked up, the system is busy doing these:
>
> [ 54.740295] requeue_io() +457: inode=79232
> [ 54.740300] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> [ 54.740303] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> [ 54.740317] requeue_io() +457: inode=79232
> [ 54.740322] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> [ 54.740325] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> [ 54.740339] requeue_io() +457: inode=79232
> [ 54.740344] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> [ 54.740347] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> ......
>
> Basically the traces show that balance_dirty_pages() is busy looping.
> It cannot write anything because the inode always be requeued by this line:
>
> if (inode->i_state & I_SYNC) {
> if (!wait) {
> requeue_io(inode);
> return 0;
> }
>
> This seem to happen when the partition is FULL.
Hmm, are you sure my fix is applied? It should prevent exactly this busy
loop where we just requeue one inode again and again... If it really is, I
wonder why we didn't end up calling the inode_wait_for_writeback() I've added.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-21 13:28:50

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 09:19:26PM +0800, Jan Kara wrote:
> On Mon 21-09-09 21:01:45, Wu Fengguang wrote:
> > On Thu, Sep 17, 2009 at 02:41:06AM +0800, Jens Axboe wrote:
> > > On Wed, Sep 16 2009, Jan Kara wrote:
> > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > write anything.
> > >
> > > Interesting, so this will happen if the dirtier and flush thread end up
> > > "fighting" each other over the same inode. I'll throw this into the
> > > testing mix.
> > >
> > > How did you notice?
> >
> > Jens, I found another busy loop. Not sure about the solution, but here
> > is the quick fact.
> >
> > Tested git head is 1ef7d9aa32a8ee054c4d4fdcd2ea537c04d61b2f, which
> > seems to be the last writeback patch in the linux-next tree. I cannot
> > run the plain head of linux-next because it just refuses boot up.
> >
> > On top of which Jan Kara's I_SYNC waiting patch and the attached
> > debugging patch is applied.
> >
> > Test commands are:
> >
> > # mount /mnt/test # ext4 fs
> > # echo 1 > /proc/sys/fs/dirty_debug
> >
> > # cp /dev/zero /mnt/test/zero0
> >
> > After that the box is locked up, the system is busy doing these:
> >
> > [ 54.740295] requeue_io() +457: inode=79232
> > [ 54.740300] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > [ 54.740303] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > [ 54.740317] requeue_io() +457: inode=79232
> > [ 54.740322] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > [ 54.740325] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > [ 54.740339] requeue_io() +457: inode=79232
> > [ 54.740344] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > [ 54.740347] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > ......
> >
> > Basically the traces show that balance_dirty_pages() is busy looping.
> > It cannot write anything because the inode always be requeued by this line:
> >
> > if (inode->i_state & I_SYNC) {
> > if (!wait) {
> > requeue_io(inode);
> > return 0;
> > }
> >
> > This seem to happen when the partition is FULL.
> Hmm, are you sure my fix is applied? It should prevent exactly this busy
> loop when we just requeue one inode again and again... If it really is, I
> wonder why we didn't end up calling inode_wait_for_writeback I've added.

Yes it is applied, and in fact I'm checking how it helps in a real workload :)

For the bug I reported, it seems that the I_SYNC flag somehow never gets
cleared. I'm happy to confirm that Jens' schedule_timeout_interruptible()
patch fixed it. However, I don't know exactly what goes on underneath.
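
The kind of yield such a fix adds might look like the sketch below (the exact
placement and timeout used in Jens' patch are not shown in this thread); the
point is to give up the CPU briefly so whoever holds I_SYNC can make progress
instead of spinning on the requeued inode:

		/* sketch: back off for ~10ms before retrying the requeued inode */
		schedule_timeout_interruptible(msecs_to_jiffies(10));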

Thanks,
Fengguang

2009-09-21 13:40:43

by Jens Axboe

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21 2009, Wu Fengguang wrote:
> On Mon, Sep 21, 2009 at 09:06:34PM +0800, Jens Axboe wrote:
> > On Mon, Sep 21 2009, Wu Fengguang wrote:
> > > On Thu, Sep 17, 2009 at 02:41:06AM +0800, Jens Axboe wrote:
> > > > On Wed, Sep 16 2009, Jan Kara wrote:
> > > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > > write anything.
> > > >
> > > > Interesting, so this will happen if the dirtier and flush thread end up
> > > > "fighting" each other over the same inode. I'll throw this into the
> > > > testing mix.
> > > >
> > > > How did you notice?
> > >
> > > Jens, I found another busy loop. Not sure about the solution, but here
> > > is the quick fact.
> > >
> > > Tested git head is 1ef7d9aa32a8ee054c4d4fdcd2ea537c04d61b2f, which
> > > seems to be the last writeback patch in the linux-next tree. I cannot
> > > run the plain head of linux-next because it just refuses boot up.
> > >
> > > On top of which Jan Kara's I_SYNC waiting patch and the attached
> > > debugging patch is applied.
> > >
> > > Test commands are:
> > >
> > > # mount /mnt/test # ext4 fs
> > > # echo 1 > /proc/sys/fs/dirty_debug
> > >
> > > # cp /dev/zero /mnt/test/zero0
> > >
> > > After that the box is locked up, the system is busy doing these:
> > >
> > > [ 54.740295] requeue_io() +457: inode=79232
> > > [ 54.740300] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > > [ 54.740303] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > > [ 54.740317] requeue_io() +457: inode=79232
> > > [ 54.740322] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > > [ 54.740325] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > > [ 54.740339] requeue_io() +457: inode=79232
> > > [ 54.740344] mm/page-writeback.c +539 balance_dirty_pages(): comm=cp pid=3327 n=0
> > > [ 54.740347] global dirty=60345 writeback=10145 nfs=0 flags=_M towrite=1536 skipped=0
> > > ......
> > >
> > > Basically the traces show that balance_dirty_pages() is busy looping.
> > > It cannot write anything because the inode always be requeued by this line:
> > >
> > > if (inode->i_state & I_SYNC) {
> > > if (!wait) {
> > > requeue_io(inode);
> > > return 0;
> > > }
> > >
> > > This seem to happen when the partition is FULL.
> >
> > OK, I'll have to look into these issues soonish. I'll be travelling next
> > week though, so I cannot do more about it right now. If you have time to
> > work on tested patches, then that would be MUCH appreciated it!
>
> Jens, your bug fix patches arrive just before I submit this bug report :)

Good!

--
Jens Axboe

2009-09-21 13:45:14

by Jan Kara

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon 21-09-09 09:08:59, Wu Fengguang wrote:
> On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> > On Sun 20-09-09 10:35:28, Wu Fengguang wrote:
> > > On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > write anything.
> > >
> > > Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
> > > But inodes in b_more_io are not expected to have I_SYNC set. So your
> > > patch looks like a big no-op?
> > Hmm, I don't think so. writeback_single_inode() does:
> > if (inode->i_state & I_SYNC) {
> > /*
> > * If this inode is locked for writeback and we are not
> > * doing
> > * writeback-for-data-integrity, move it to b_more_io so
> > * that
> > * writeback can proceed with the other inodes on s_io.
> > *
> > * We'll have another go at writing back this inode when we
> > * completed a full scan of b_io.
> > */
> > if (!wait) {
> > requeue_io(inode);
> > return 0;
> > }
> >
> > So when we see inode under writeback, we put it to b_more_io. So I think
> > my patch really fixes the issue when two threads are racing on writing the
> > same inode.
>
> Ah OK. So it busy loops when there are more syncing threads than dirty
> files. For example, one bdi flush thread plus one process running
> balance_dirty_pages().
Yes.

> > > The busy loop does exists, when bdi is congested.
> > > In this case, write_cache_pages() will refuse to write anything,
> > > we used to be calling congestion_wait() to take a breath, but now
> > > wb_writeback() purged that call and thus created a busy loop.
> > I don't think congestion is an issue here. The device needen't be
> > congested for the busyloop to happen.
>
> bdi congestion is a different case. When there are only one syncing
> thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
> wb_writeback() or any of its sub-routines must wait/yield for a while
> to avoid busy looping on the congestion. Where is the wait with Jens'
> new code?
I agree someone must wait when we bail out due to congestion. But we bail
out only when wbc->nonblocking is set. So I feel that callers setting
this flag should handle it when we stop the writeback due to congestion.

> Another question is, why wbc.more_io can be ignored for kupdate syncs?
> I guess it would lead to slow writeback of large files.
>
> This patch reflects my concerns on the two problems.
>
> Thanks,
> Fengguang
> ---
> fs/fs-writeback.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c 2009-09-20 10:44:25.000000000 +0800
> +++ linux/fs/fs-writeback.c 2009-09-21 08:53:09.000000000 +0800
> @@ -818,8 +818,10 @@ static long wb_writeback(struct bdi_writ
> /*
> * If we ran out of stuff to write, bail unless more_io got set
> */
> - if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> - if (wbc.more_io && !wbc.for_kupdate)
> + if (wbc.nr_to_write > 0) {
> + if (wbc.encountered_congestion)
> + congestion_wait(BLK_RW_ASYNC, HZ);
> + if (wbc.more_io)
> continue;
> break;
> }
OK, this change looks reasonable but I think we'll have to revisit
the writeback logic in more detail, as we discussed in the other thread.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-21 14:12:02

by Fengguang Wu

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 09:45:11PM +0800, Jan Kara wrote:
> On Mon 21-09-09 09:08:59, Wu Fengguang wrote:
> > On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> > > On Sun 20-09-09 10:35:28, Wu Fengguang wrote:
> > > > On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> > > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > > write anything.
> > > >
> > > > Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
> > > > But inodes in b_more_io are not expected to have I_SYNC set. So your
> > > > patch looks like a big no-op?
> > > Hmm, I don't think so. writeback_single_inode() does:
> > > if (inode->i_state & I_SYNC) {
> > > /*
> > > * If this inode is locked for writeback and we are not
> > > * doing
> > > * writeback-for-data-integrity, move it to b_more_io so
> > > * that
> > > * writeback can proceed with the other inodes on s_io.
> > > *
> > > * We'll have another go at writing back this inode when we
> > > * completed a full scan of b_io.
> > > */
> > > if (!wait) {
> > > requeue_io(inode);
> > > return 0;
> > > }
> > >
> > > So when we see inode under writeback, we put it to b_more_io. So I think
> > > my patch really fixes the issue when two threads are racing on writing the
> > > same inode.
> >
> > Ah OK. So it busy loops when there are more syncing threads than dirty
> > files. For example, one bdi flush thread plus one process running
> > balance_dirty_pages().
> Yes.
>
> > > > The busy loop does exists, when bdi is congested.
> > > > In this case, write_cache_pages() will refuse to write anything,
> > > > we used to be calling congestion_wait() to take a breath, but now
> > > > wb_writeback() purged that call and thus created a busy loop.
> > > I don't think congestion is an issue here. The device needen't be
> > > congested for the busyloop to happen.
> >
> > bdi congestion is a different case. When there are only one syncing
> > thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
> > wb_writeback() or any of its sub-routines must wait/yield for a while
> > to avoid busy looping on the congestion. Where is the wait with Jens'
> > new code?
> I agree someone must wait when we bail out due to congestion. But we bail
> out only when wbc->nonblocking is set.

Here is another problem. wbc->nonblocking used to be set for kupdate
and background writebacks, but now it's gone. So they will be blocked
in get_request_wait(). That's fine, no busy loops.

However this inverts the priority. pageout() still has nonblocking=1.
So now vmscan can easily be livelocked by heavy background writebacks.

Even though pageout() is inefficient and discouraged, it could be even
more disastrous to livelock vmscan.

Jens, I'd recommend restoring the nonblocking bits for this merge
window. The per-bdi patches are such a big change, it's not a good
time to piggyback more unrelated behavior changes. I could submit
patches to revert the nonblocking and congestion wait bits.
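
To illustrate, restoring the nonblocking bit for the kupdate/background paths
would mean a writeback_control roughly like the sketch below (field values are
illustrative, not the exact revert):

	struct writeback_control wbc = {
		.bdi		= wb->bdi,
		.sync_mode	= WB_SYNC_NONE,
		.older_than_this = NULL,
		.nonblocking	= 1,	/* bail out of a congested bdi instead of blocking */
		.range_cyclic	= 1,
	};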

> So I'd feel that callers setting this flag should handle it when we
> stop the writeback due to congestion.

Hmm, but we never stop writeback and return to the caller on congestion :)
Instead, wb_writeback() will retry in a loop. And it doesn't make sense
to add duplicate retry loops in the callers.

> > Another question is, why wbc.more_io can be ignored for kupdate syncs?
> > I guess it would lead to slow writeback of large files.
> >
> > This patch reflects my concerns on the two problems.
> >
> > Thanks,
> > Fengguang
> > ---
> > fs/fs-writeback.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > --- linux.orig/fs/fs-writeback.c 2009-09-20 10:44:25.000000000 +0800
> > +++ linux/fs/fs-writeback.c 2009-09-21 08:53:09.000000000 +0800
> > @@ -818,8 +818,10 @@ static long wb_writeback(struct bdi_writ
> > /*
> > * If we ran out of stuff to write, bail unless more_io got set
> > */
> > - if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > - if (wbc.more_io && !wbc.for_kupdate)
> > + if (wbc.nr_to_write > 0) {
> > + if (wbc.encountered_congestion)
> > + congestion_wait(BLK_RW_ASYNC, HZ);
> > + if (wbc.more_io)
> > continue;
> > break;
> > }
> OK, this change looks reasonable but I think we'll have to revisit
> the writeback logic more in detail as we discussed in the other thread.

OK, and I'll check how exactly it should be combined with your patch.

Thanks,
Fengguang

2009-09-21 14:19:37

by Chris Mason

Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 10:11:09PM +0800, Wu Fengguang wrote:
> On Mon, Sep 21, 2009 at 09:45:11PM +0800, Jan Kara wrote:
> > On Mon 21-09-09 09:08:59, Wu Fengguang wrote:
> > > On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> > > > On Sun 20-09-09 10:35:28, Wu Fengguang wrote:
> > > > > On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> > > > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > > > write anything.
> > > > >
> > > > > Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
> > > > > But inodes in b_more_io are not expected to have I_SYNC set. So your
> > > > > patch looks like a big no-op?
> > > > Hmm, I don't think so. writeback_single_inode() does:
> > > > if (inode->i_state & I_SYNC) {
> > > > /*
> > > > * If this inode is locked for writeback and we are not
> > > > * doing
> > > > * writeback-for-data-integrity, move it to b_more_io so
> > > > * that
> > > > * writeback can proceed with the other inodes on s_io.
> > > > *
> > > > * We'll have another go at writing back this inode when we
> > > > * completed a full scan of b_io.
> > > > */
> > > > if (!wait) {
> > > > requeue_io(inode);
> > > > return 0;
> > > > }
> > > >
> > > > So when we see inode under writeback, we put it to b_more_io. So I think
> > > > my patch really fixes the issue when two threads are racing on writing the
> > > > same inode.
> > >
> > > Ah OK. So it busy loops when there are more syncing threads than dirty
> > > files. For example, one bdi flush thread plus one process running
> > > balance_dirty_pages().
> > Yes.
> >
> > > > > The busy loop does exists, when bdi is congested.
> > > > > In this case, write_cache_pages() will refuse to write anything,
> > > > > we used to be calling congestion_wait() to take a breath, but now
> > > > > wb_writeback() purged that call and thus created a busy loop.
> > > > I don't think congestion is an issue here. The device needen't be
> > > > congested for the busyloop to happen.
> > >
> > > bdi congestion is a different case. When there are only one syncing
> > > thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
> > > wb_writeback() or any of its sub-routines must wait/yield for a while
> > > to avoid busy looping on the congestion. Where is the wait with Jens'
> > > new code?
> > I agree someone must wait when we bail out due to congestion. But we bail
> > out only when wbc->nonblocking is set.
>
> Here is another problem. wbc->nonblocking used to be set for kupdate
> and background writebacks, but now it's gone. So they will be blocked
> in get_request_wait(). That's fine, no busy loops.
>
> However this inverts the priority. pageout() still have nonblocking=1.
> So now vmscan can easily be live locked by heavy background writebacks.

The important part of the nonblocking check for pageout is really to
make sure that it doesn't get stuck locking a buffer that is actually
under IO, which happens in ext3/reiserfs data=ordered mode.
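
(For context, that check sits in the buffer writeout path; from memory
it looks approximately like the snippet below -- take it as a sketch
of __block_write_full_page() in fs/buffer.c, not an exact quote:

	/*
	 * On a fully non-blocking write attempt, don't sleep on a
	 * buffer that is already under IO -- redirty the page instead.
	 */
	if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
		lock_buffer(bh);
	} else if (!trylock_buffer(bh)) {
		redirty_page_for_writepage(wbc, page);
		continue;
	}

Dropping the congestion checks elsewhere doesn't touch this part.)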

Having pageout wait for a request is fine. It's just as likely to wait
for a request when it does actually start the IO, regardless of the
congestion checks earlier in the call chain.

I'd drop any congestion checks in the nooks and crannies of the
writeback paths.

-chris

2009-09-21 14:31:13

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 10:19:10PM +0800, Chris Mason wrote:
> On Mon, Sep 21, 2009 at 10:11:09PM +0800, Wu Fengguang wrote:
> > On Mon, Sep 21, 2009 at 09:45:11PM +0800, Jan Kara wrote:
> > > On Mon 21-09-09 09:08:59, Wu Fengguang wrote:
> > > > On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> > > > > On Sun 20-09-09 10:35:28, Wu Fengguang wrote:
> > > > > > On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> > > > > > > If all inodes are under writeback (e.g. in case when there's only one inode
> > > > > > > with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> > > > > > > to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> > > > > > > waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> > > > > > > write anything.
> > > > > >
> > > > > > Sorry, I realized that inode_wait_for_writeback() waits for I_SYNC.
> > > > > > But inodes in b_more_io are not expected to have I_SYNC set. So your
> > > > > > patch looks like a big no-op?
> > > > > Hmm, I don't think so. writeback_single_inode() does:
> > > > > if (inode->i_state & I_SYNC) {
> > > > > /*
> > > > > * If this inode is locked for writeback and we are not
> > > > > * doing
> > > > > * writeback-for-data-integrity, move it to b_more_io so
> > > > > * that
> > > > > * writeback can proceed with the other inodes on s_io.
> > > > > *
> > > > > * We'll have another go at writing back this inode when we
> > > > > * completed a full scan of b_io.
> > > > > */
> > > > > if (!wait) {
> > > > > requeue_io(inode);
> > > > > return 0;
> > > > > }
> > > > >
> > > > > So when we see inode under writeback, we put it to b_more_io. So I think
> > > > > my patch really fixes the issue when two threads are racing on writing the
> > > > > same inode.
> > > >
> > > > Ah OK. So it busy loops when there are more syncing threads than dirty
> > > > files. For example, one bdi flush thread plus one process running
> > > > balance_dirty_pages().
> > > Yes.
> > >
> > > > > > The busy loop does exists, when bdi is congested.
> > > > > > In this case, write_cache_pages() will refuse to write anything,
> > > > > > we used to be calling congestion_wait() to take a breath, but now
> > > > > > wb_writeback() purged that call and thus created a busy loop.
> > > > > I don't think congestion is an issue here. The device needen't be
> > > > > congested for the busyloop to happen.
> > > >
> > > > bdi congestion is a different case. When there are only one syncing
> > > > thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
> > > > wb_writeback() or any of its sub-routines must wait/yield for a while
> > > > to avoid busy looping on the congestion. Where is the wait with Jens'
> > > > new code?
> > > I agree someone must wait when we bail out due to congestion. But we bail
> > > out only when wbc->nonblocking is set.
> >
> > Here is another problem. wbc->nonblocking used to be set for kupdate
> > and background writebacks, but now it's gone. So they will be blocked
> > in get_request_wait(). That's fine, no busy loops.
> >
> > However this inverts the priority. pageout() still have nonblocking=1.
> > So now vmscan can easily be live locked by heavy background writebacks.
>
> The important part of the nonblocking check for pageout is really to
> make sure that it doesn't get stuck locking a buffer that is actually
> under IO, which happens in ext3/reiserfs data=ordered mode.

OK.

> Having pageout wait for a request is fine. Its just as likely to wait
> for a request when it does actually start the IO, regardless of the
> congestion checks earlier in the call chain.

There are fundamental differences. The congestion wait can turn into a
livelock for pageout, while wait_on_page_writeback() will finish in
bounded time.

> I'd drop any congestion checks in the nooks and crannies of the
> writeback paths.

Let's work on a better solution then?

Thanks,
Fengguang

2009-09-21 14:46:10

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 10:31:07PM +0800, Wu Fengguang wrote:
> On Mon, Sep 21, 2009 at 10:19:10PM +0800, Chris Mason wrote:
> > On Mon, Sep 21, 2009 at 10:11:09PM +0800, Wu Fengguang wrote:
> > > On Mon, Sep 21, 2009 at 09:45:11PM +0800, Jan Kara wrote:
> > > > On Mon 21-09-09 09:08:59, Wu Fengguang wrote:
> > > > > On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> > > > > > So when we see inode under writeback, we put it to b_more_io. So I think
> > > > > > my patch really fixes the issue when two threads are racing on writing the
> > > > > > same inode.
> > > > >
> > > > > Ah OK. So it busy loops when there are more syncing threads than dirty
> > > > > files. For example, one bdi flush thread plus one process running
> > > > > balance_dirty_pages().
> > > > Yes.
> > > >
> > > > > > > The busy loop does exists, when bdi is congested.
> > > > > > > In this case, write_cache_pages() will refuse to write anything,
> > > > > > > we used to be calling congestion_wait() to take a breath, but now
> > > > > > > wb_writeback() purged that call and thus created a busy loop.
> > > > > > I don't think congestion is an issue here. The device needen't be
> > > > > > congested for the busyloop to happen.
> > > > >
> > > > > bdi congestion is a different case. When there are only one syncing
> > > > > thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
> > > > > wb_writeback() or any of its sub-routines must wait/yield for a while
> > > > > to avoid busy looping on the congestion. Where is the wait with Jens'
> > > > > new code?
> > > > I agree someone must wait when we bail out due to congestion. But we bail
> > > > out only when wbc->nonblocking is set.
> > >
> > > Here is another problem. wbc->nonblocking used to be set for kupdate
> > > and background writebacks, but now it's gone. So they will be blocked
> > > in get_request_wait(). That's fine, no busy loops.
> > >
> > > However this inverts the priority. pageout() still have nonblocking=1.
> > > So now vmscan can easily be live locked by heavy background writebacks.
> >
> > The important part of the nonblocking check for pageout is really to
> > make sure that it doesn't get stuck locking a buffer that is actually
> > under IO, which happens in ext3/reiserfs data=ordered mode.
>
> OK.
>
> > Having pageout wait for a request is fine. Its just as likely to wait
> > for a request when it does actually start the IO, regardless of the
> > congestion checks earlier in the call chain.
>
> There are fundamental differences. The congestion wait is live lock for
> pageout, while wait_on_page_writeback() will finish in bounded time.
>
> > I'd drop any congestion checks in the nooks and crannies of the
> > writeback paths.
>
> Let's work on a better solution then?

Today, wbc->nonblocking and congestion are checked together:

1) in writeback_inodes_wb before we call writeback_single_inode
2) in write_cache_pages, before we call writepage
3) in write_cache_pages, after we call writepage

If we delete all 3, we get rid of the livelock but keep the check that
makes sure we don't wait on locked buffers that are under IO.

If we delete #1 and #2, we'll get rid of the livelock but pageout will
still stop trying to do IO on this backing dev once it finds some
congestion.

I think either way is fine ;)

-chris

2009-09-22 09:14:25

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Mon, Sep 21, 2009 at 10:45:51PM +0800, Chris Mason wrote:
> On Mon, Sep 21, 2009 at 10:31:07PM +0800, Wu Fengguang wrote:
> > On Mon, Sep 21, 2009 at 10:19:10PM +0800, Chris Mason wrote:
> > > On Mon, Sep 21, 2009 at 10:11:09PM +0800, Wu Fengguang wrote:
> > > > On Mon, Sep 21, 2009 at 09:45:11PM +0800, Jan Kara wrote:
> > > > > On Mon 21-09-09 09:08:59, Wu Fengguang wrote:
> > > > > > On Mon, Sep 21, 2009 at 01:43:56AM +0800, Jan Kara wrote:
> > > > > > > So when we see inode under writeback, we put it to b_more_io. So I think
> > > > > > > my patch really fixes the issue when two threads are racing on writing the
> > > > > > > same inode.
> > > > > >
> > > > > > Ah OK. So it busy loops when there are more syncing threads than dirty
> > > > > > files. For example, one bdi flush thread plus one process running
> > > > > > balance_dirty_pages().
> > > > > Yes.
> > > > >
> > > > > > > > The busy loop does exists, when bdi is congested.
> > > > > > > > In this case, write_cache_pages() will refuse to write anything,
> > > > > > > > we used to be calling congestion_wait() to take a breath, but now
> > > > > > > > wb_writeback() purged that call and thus created a busy loop.
> > > > > > > I don't think congestion is an issue here. The device needen't be
> > > > > > > congested for the busyloop to happen.
> > > > > >
> > > > > > bdi congestion is a different case. When there are only one syncing
> > > > > > thread, b_more_io inodes won't have I_SYNC, so your patch is a no-op.
> > > > > > wb_writeback() or any of its sub-routines must wait/yield for a while
> > > > > > to avoid busy looping on the congestion. Where is the wait with Jens'
> > > > > > new code?
> > > > > I agree someone must wait when we bail out due to congestion. But we bail
> > > > > out only when wbc->nonblocking is set.
> > > >
> > > > Here is another problem. wbc->nonblocking used to be set for kupdate
> > > > and background writebacks, but now it's gone. So they will be blocked
> > > > in get_request_wait(). That's fine, no busy loops.
> > > >
> > > > However this inverts the priority. pageout() still have nonblocking=1.
> > > > So now vmscan can easily be live locked by heavy background writebacks.
> > >
> > > The important part of the nonblocking check for pageout is really to
> > > make sure that it doesn't get stuck locking a buffer that is actually
> > > under IO, which happens in ext3/reiserfs data=ordered mode.
> >
> > OK.
> >
> > > Having pageout wait for a request is fine. Its just as likely to wait
> > > for a request when it does actually start the IO, regardless of the
> > > congestion checks earlier in the call chain.
> >
> > There are fundamental differences. The congestion wait is live lock for
> > pageout, while wait_on_page_writeback() will finish in bounded time.

Ah, sorry for making a silly mistake! According to Jan Kara, the
livelock is not possible because pageout calls ->writepage() directly,
without any congestion wait.

> > > I'd drop any congestion checks in the nooks and crannies of the
> > > writeback paths.
> >
> > Let's work on a better solution then?
>
> Today, wbc->nonblocking and congestion are checked together:
>
> 1) in writeback_inodes_wb before we call writeback_single_inode
> 2) in write_cache_pages, before we call writepage
> 3) in write_cache_pages, after we call writepage
>
> If we delete all 3, we get rid of the livelock but keep the check that
> makes sure we don't wait on locked buffers that are under IO.
>
> If we delete #1 and #2, we'll get rid of the livelock but pageout will
> still stop trying to do IO on this backing dev once it finds some
> congestion.
>
> I think either way is fine ;)

So remove all these blocks? That looks like a good cleanup, because
nobody is passing nonblocking to these functions any more.

Thanks,
Fengguang

---
fs/fs-writeback.c | 9 +--------
mm/page-writeback.c | 11 -----------
2 files changed, 1 insertion(+), 19 deletions(-)

--- linux.orig/fs/fs-writeback.c 2009-09-22 16:29:58.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-09-22 17:09:25.000000000 +0800
@@ -566,14 +567,6 @@ rescan:
continue;
}

- if (wbc->nonblocking && bdi_write_congested(wb->bdi)) {
- wbc->encountered_congestion = 1;
- if (!is_blkdev_sb)
- break; /* Skip a congested fs */
- requeue_io(inode);
- continue; /* Skip a congested blockdev */
- }
-
if (inode_dirtied_after(inode, wbc->older_than_this)) {
if (list_empty(&wb->b_more_io))
break;
--- linux.orig/mm/page-writeback.c 2009-09-22 17:09:28.000000000 +0800
+++ linux/mm/page-writeback.c 2009-09-22 17:09:47.000000000 +0800
@@ -827,11 +827,6 @@ int write_cache_pages(struct address_spa
int range_whole = 0;
long nr_to_write = wbc->nr_to_write;

- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- return 0;
- }
-
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
writeback_index = mapping->writeback_index; /* prev offset */
@@ -950,12 +945,6 @@ continue_unlock:
break;
}
}
-
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- done = 1;
- break;
- }
}
pagevec_release(&pvec);
cond_resched();

2009-09-23 07:56:57

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] fs: Fix busyloop in wb_writeback()

On Thu, Sep 17, 2009 at 01:22:48AM +0800, Jan Kara wrote:
> If all inodes are under writeback (e.g. in case when there's only one inode
> with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
> to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by
> waiting on I_SYNC flags of an inode on b_more_io list in case we failed to
> write anything.

Hi Jan,

I reconfirmed that this patch fixed the busy loop in the wb_writeback()
path. I also noticed that balance_dirty_pages() looped a lot. Given
that balance_dirty_pages() schedules in between, it cannot be called a
busy loop, but the loop interval is about 0.01s, which is shorter than
the expected 0.1s. This is a 2-core CPU, so I guess the cp process got
rescheduled immediately, long before it timed out.

[ 491.440702] requeue_io() +457: inode=39381
[ 491.442360] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.444257] global dirty=51770 writeback=18201 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.445856] requeue_io() +457: inode=39381
[ 491.446543] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.448346] global dirty=51770 writeback=18201 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.460535] requeue_io() +457: inode=39381
[ 491.462206] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.463574] global dirty=51894 writeback=18077 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.474072] requeue_io() +457: inode=39381
[ 491.475416] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.477220] global dirty=52049 writeback=17922 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.481859] requeue_io() +457: inode=39381
[ 491.482618] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.485026] global dirty=52049 writeback=17798 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.496843] requeue_io() +457: inode=39381
[ 491.499001] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.500658] global dirty=52173 writeback=17674 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.508540] requeue_io() +457: inode=39381
[ 491.509490] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.511300] global dirty=52297 writeback=17674 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.517728] requeue_io() +457: inode=39381
[ 491.518752] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.521498] global dirty=52421 writeback=17550 nfs=0 flags=_M towrite=1536 skipped=0
[ 491.525881] requeue_io() +457: inode=39381
[ 491.527339] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3302 n=0
[ 491.528730] global dirty=52421 writeback=17550 nfs=0 flags=_M towrite=1536 skipped=0
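
If the expected 0.1s above comes from an HZ/10 sleep in
balance_dirty_pages() (my assumption, not something stated in this
thread), the short interval would be easy to explain: congestion_wait()
sleeps on a congestion waitqueue with a timeout, so the sleeper is
woken as soon as some bdi clears its congested state, well before the
timeout expires. Roughly:

	/*
	 * Hypothetical illustration of the assumed throttling sleep,
	 * not a quote from the tree under test.  clear_bdi_congested()
	 * wakes the queue early, so this often returns before HZ/10.
	 */
	congestion_wait(BLK_RW_ASYNC, HZ / 10);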

Thanks,
Fengguang


> CC: Wu Fengguang <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 19 ++++++++++++++++++-
> 1 files changed, 18 insertions(+), 1 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 8e1e5e1..c59d673 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -706,6 +706,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> };
> unsigned long oldest_jif;
> long wrote = 0;
> + struct inode *inode;
>
> if (wbc.for_kupdate) {
> wbc.older_than_this = &oldest_jif;
> @@ -747,8 +748,24 @@ static long wb_writeback(struct bdi_writeback *wb,
> * If we ran out of stuff to write, bail unless more_io got set
> */
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> - if (wbc.more_io && !wbc.for_kupdate)
> + if (wbc.more_io && !wbc.for_kupdate) {
> + if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> + continue;
> + /*
> + * Nothing written. Wait for some inode to
> + * become available for writeback. Otherwise
> + * we'll just busyloop.
> + */
> + spin_lock(&inode_lock);
> + if (!list_empty(&wb->b_more_io)) {
> + inode = list_entry(
> + wb->b_more_io.prev,
> + struct inode, i_list);
> + inode_wait_for_writeback(inode);
> + }
> + spin_unlock(&inode_lock);
> continue;
> + }
> break;
> }
> }
> --
> 1.6.0.2