2007-12-11 02:03:26

by Michael Rubin

Subject: [patch 1/1] Writeback fix for concurrent large and small file writes.

From: Michael Rubin <[email protected]>

Fix a bug where writing to large files while concurrently writing to
smaller ones creates a situation where writeback cannot keep up with
the traffic and memory balloons until we hit the threshold watermark.
This can result in surprising latency spikes when syncing; the latency
can reach minutes on large memory systems. Upon request I can provide
a test to reproduce this situation.

The only concern I have is that this makes wb_kupdate slightly more
aggressive. I am not sure it is enough to cause any problems; I think
there are enough checks to throttle the background activity.

Feng, the one-line change that you recommended here
http://marc.info/?l=linux-kernel&m=119629655402153&w=2 also had no effect.

Signed-off-by: Michael Rubin <[email protected]>
---
Index: 2624rc3_feng/fs/fs-writeback.c
===================================================================
--- 2624rc3_feng.orig/fs/fs-writeback.c 2007-11-29 14:44:24.000000000 -0800
+++ 2624rc3_feng/fs/fs-writeback.c 2007-12-10 17:21:45.000000000 -0800
@@ -408,8 +408,7 @@ sync_sb_inodes(struct super_block *sb, s
{
const unsigned long start = jiffies; /* livelock avoidance */

- if (!wbc->for_kupdate || list_empty(&sb->s_io))
- queue_io(sb, wbc->older_than_this);
+ queue_io(sb, wbc->older_than_this);

while (!list_empty(&sb->s_io)) {
struct inode *inode = list_entry(sb->s_io.prev,
Index: 2624rc3_feng/mm/page-writeback.c
===================================================================
--- 2624rc3_feng.orig/mm/page-writeback.c 2007-11-16 21:16:36.000000000 -0800
+++ 2624rc3_feng/mm/page-writeback.c 2007-12-10 17:37:17.000000000 -0800
@@ -638,7 +638,7 @@ static void wb_kupdate(unsigned long arg
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion || wbc.more_io)
+ if (wbc.encountered_congestion)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */


2007-12-12 20:56:19

by Peter Zijlstra

Subject: Re: [patch 1/1] Writeback fix for concurrent large and small file writes.


On Mon, 2007-12-10 at 18:02 -0800, Michael Rubin wrote:
> From: Michael Rubin <[email protected]>
>
> Fix a bug where writing to large files while concurrently writing to
> smaller ones creates a situation where writeback cannot keep up with
> the traffic and memory balloons until we hit the threshold watermark.
> This can result in surprising latency spikes when syncing; the latency
> can reach minutes on large memory systems. Upon request I can provide
> a test to reproduce this situation.

The part I miss here is the rationale on _how_ you solve the problem.

The patch itself is simple enough, but I've been staring at this code
for a while now, and I'm just not getting it.

> The only concern I have is that this makes wb_kupdate slightly more
> aggressive. I am not sure it is enough to cause any problems; I think
> there are enough checks to throttle the background activity.
>
> Feng, the one-line change that you recommended here
> http://marc.info/?l=linux-kernel&m=119629655402153&w=2 also had no effect.
>
> Signed-off-by: Michael Rubin <[email protected]>
> ---
> Index: 2624rc3_feng/fs/fs-writeback.c
> ===================================================================
> --- 2624rc3_feng.orig/fs/fs-writeback.c 2007-11-29 14:44:24.000000000 -0800
> +++ 2624rc3_feng/fs/fs-writeback.c 2007-12-10 17:21:45.000000000 -0800
> @@ -408,8 +408,7 @@ sync_sb_inodes(struct super_block *sb, s
> {
> const unsigned long start = jiffies; /* livelock avoidance */
>
> - if (!wbc->for_kupdate || list_empty(&sb->s_io))
> - queue_io(sb, wbc->older_than_this);
> + queue_io(sb, wbc->older_than_this);
>
> while (!list_empty(&sb->s_io)) {
> struct inode *inode = list_entry(sb->s_io.prev,
> Index: 2624rc3_feng/mm/page-writeback.c
> ===================================================================
> --- 2624rc3_feng.orig/mm/page-writeback.c 2007-11-16 21:16:36.000000000 -0800
> +++ 2624rc3_feng/mm/page-writeback.c 2007-12-10 17:37:17.000000000 -0800
> @@ -638,7 +638,7 @@ static void wb_kupdate(unsigned long arg
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> - if (wbc.encountered_congestion || wbc.more_io)
> + if (wbc.encountered_congestion)
> congestion_wait(WRITE, HZ/10);
> else
> break; /* All the old data is written */

2007-12-12 23:03:21

by Michael Rubin

Subject: Re: [patch 1/1] Writeback fix for concurrent large and small file writes.

On Dec 12, 2007 12:55 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, 2007-12-10 at 18:02 -0800, Michael Rubin wrote:
> > From: Michael Rubin <[email protected]>
> The part I miss here is the rationale on _how_ you solve the problem.
>
> The patch itself is simple enough, but I've been staring at this code
> for a while now, and I'm just not getting it.

Apologies for the lack of rationale. I have been staring at this code
for a while too, and it makes my head hurt. I have a patch coming
(hopefully today) that proposes using one data structure with a more
consistent priority scheme for 2.6.25. To me it's simpler, but I am
biased.

The problem we encounter when we append to a large file at a fast rate
while also writing to smaller files is that the wb_kupdate thread does
not keep up with disk traffic. In this workload the inodes often end
up at fs/fs-writeback.c:287 after do_writepages, because do_writepages
did not write all the pages. This can be due to congestion, but based
on what I have observed I think there are other causes as well.

The first issue is that the inode is put on the s_more_io queue. This
ensures that more_io is set at the end of sync_sb_inodes. The result is
that the wb_kupdate routine sleeps at mm/page-writeback.c:642. This
slows us down enough that wb_kupdate cannot keep up with traffic.
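
To make that concrete, the path I mean looks roughly like this (a
simplified sketch from memory of the 2.6.24-rc source, so the exact
conditions and line numbers may be off):

	/* __sync_single_inode(), fs/fs-writeback.c (simplified):
	 * do_writepages() returned with the inode still dirty, so
	 * the inode is parked on sb->s_more_io for a later pass. */
	if (wbc->nr_to_write <= 0) {
		/* slice used up before the inode came clean */
		requeue_io(inode);	/* move to sb->s_more_io */
	}

	/* sync_sb_inodes(), same file (simplified): a non-empty
	 * s_more_io is reported back through wbc->more_io, which
	 * wb_kupdate() answers with congestion_wait(WRITE, HZ/10). */
	if (!list_empty(&sb->s_more_io))
		wbc->more_io = 1;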

The other issue is that an inode placed on the s_more_io queue cannot
be processed by sync_sb_inodes until the entire s_io list is empty.
With lots of small files that are not being dirtied as quickly as the
one large inode sitting on the s_more_io queue, the inode with the most
dirty pages never gets attention, and again wb_kupdate cannot keep up.
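
The ordering problem is visible in queue_io() itself; quoting roughly
from memory of the 2.6.24-rc tree:

	static void queue_io(struct super_block *sb,
				unsigned long *older_than_this)
	{
		/* s_more_io is drained only here, and the unpatched
		 * for_kupdate check means this runs only once s_io
		 * has been emptied, so a requeued inode waits behind
		 * the entire s_io backlog of small files. */
		list_splice_init(&sb->s_more_io, sb->s_io.prev);
		move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
	}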

mrubin

2007-12-28 07:35:32

by Wu Fengguang

Subject: Re: [patch 1/1] Writeback fix for concurrent large and small file writes.

Hi Michael,

// sorry for the delay...

On Mon, Dec 10, 2007 at 06:02:55PM -0800, Michael Rubin wrote:
> From: Michael Rubin <[email protected]>
>
> Fix a bug where writing to large files while concurrently writing to
> smaller ones creates a situation where writeback cannot keep up with
> the traffic and memory balloons until we hit the threshold watermark.
> This can result in surprising latency spikes when syncing; the latency
> can reach minutes on large memory systems. Upon request I can provide
> a test to reproduce this situation.
>
> The only concern I have is that this makes wb_kupdate slightly more
> aggressive. I am not sure it is enough to cause any problems; I think
> there are enough checks to throttle the background activity.
>
> Feng, the one-line change that you recommended here
> http://marc.info/?l=linux-kernel&m=119629655402153&w=2 also had no effect.
>
> Signed-off-by: Michael Rubin <[email protected]>
> ---
> Index: 2624rc3_feng/fs/fs-writeback.c
> ===================================================================
> --- 2624rc3_feng.orig/fs/fs-writeback.c 2007-11-29 14:44:24.000000000 -0800
> +++ 2624rc3_feng/fs/fs-writeback.c 2007-12-10 17:21:45.000000000 -0800
> @@ -408,8 +408,7 @@ sync_sb_inodes(struct super_block *sb, s
> {
> const unsigned long start = jiffies; /* livelock avoidance */
>
> - if (!wbc->for_kupdate || list_empty(&sb->s_io))
> - queue_io(sb, wbc->older_than_this);
> + queue_io(sb, wbc->older_than_this);

Basically it's a workaround that changes the service priority.

Assume A to be the large file and B,C,D,E,... to be the small files.
- old behavior:
sync 4MB of A; sync B,C; congestion_wait();
sync 4MB of A; sync D,E; congestion_wait();
sync 4MB of A; sync F,G; congestion_wait();
...
- new behavior:
sync 4MB of A;
sync 4MB of A;
sync 4MB of A;
sync 4MB of A;
sync 4MB of A;
... // repeat until A is clean
sync B,C,D,E,F,G;

So the bug is gone, but now A could possibly starve other files :-(

> while (!list_empty(&sb->s_io)) {
> struct inode *inode = list_entry(sb->s_io.prev,
> Index: 2624rc3_feng/mm/page-writeback.c
> ===================================================================
> --- 2624rc3_feng.orig/mm/page-writeback.c 2007-11-16 21:16:36.000000000 -0800
> +++ 2624rc3_feng/mm/page-writeback.c 2007-12-10 17:37:17.000000000 -0800
> @@ -638,7 +638,7 @@ static void wb_kupdate(unsigned long arg
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> - if (wbc.encountered_congestion || wbc.more_io)
> + if (wbc.encountered_congestion)

No, this could make wb_kupdate() abort even when there is more data
to be synced. That will make David Chinner unhappy ;-)

> congestion_wait(WRITE, HZ/10);
> else
> break; /* All the old data is written */
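
For context, the surrounding loop reads roughly like this
(reconstructed around the hunk above, so treat it as a sketch rather
than a verbatim quote of the tree):

	/* wb_kupdate(), mm/page-writeback.c (simplified) */
	while (nr_to_write > 0) {
		wbc.more_io = 0;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			/* Budget left over: either we are congested, or
			 * inodes were deferred to s_more_io, or there is
			 * truly nothing old left to write.  Dropping the
			 * more_io test folds the second case into the
			 * third, so kupdate exits with dirty data queued. */
			if (wbc.encountered_congestion || wbc.more_io)
				congestion_wait(WRITE, HZ/10);
			else
				break;	/* All the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}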

Just a minute, I'll propose a way out of this bug :-)

Fengguang