2006-02-06 04:00:45

by David Chinner

Subject: [PATCH] Prevent large file writeback starvation

Folks,

I have recently been running some mixed workload tests on a 4p Altix,
and I came across what looks to be a lack-of-writeback problem. The
filesystem is XFS, but the problem is in the generic writeback code.

The workload involves ~16 postmark threads running in the background
each creating ~15m subdirectories of ~1m files each. The idea is
that this generates a nice, steady background file creation load.
Each file is between 1-10k in size, and it runs at 3-5k creates/s.

The disk subsystem is nowhere near I/O bound - the luns are less than 10%
busy when running this workload, and only writing about 30-40MB/s aggregate.

The problem comes when I run another thread that writes a large single
file to disk. e.g.:

# dd if=/dev/zero of=/mnt/dgc/stripe/testfile bs=1024k count=4096

to write out a 4GB file. Now this goes straight into memory (takes
about 7-8s) with some writeback occurring. The result is that approximately
2.5GB of the file is still dirty in memory.

It then takes over an hour to write the remaining data to disk. The
pattern of writeback appears to be that roughly every
dirty_expire_centisecs, a chunk of 1024 pages (16MB on Altix) is
written back for that large file in a single flush.
The inode then gets moved to the superblock dirty list, and the
next pdflush iteration of 1024 pages works on the next inodes on the
superblock I/O list.
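
Back-of-the-envelope, that writeback rate is consistent with the elapsed
time, assuming the default 30s dirty_expire_centisecs:

	~2.5GB remaining / 16MB per expiry interval  ~= 160 intervals
	160 intervals x 30s                          ~= 80 minutes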

The problem is that when you are creating thousands of files per second
with some data in them, the superblock I/O list blows out to approximately
(create rate * expiry age) inodes, and any one inode in this list will
get a maximum of 1024 pages written back per iteration on the list.

So, in a typical 30s period, the superblock dirty inode list grows to 60-70k
inodes, which pdflush then splices to the I/O list when the I/O list is
empty. We now have an empty dirty list and a really long I/O list.

In the time it takes the I/O list to be emptied, we've created many more
thousands of files, so the large file gets moved to an extremely heavily
populated dirty list after a tiny amount of writeback. Hence when pdflush
finally empties the I/O list, it splices another 60-70k inodes into the
I/O list, and we go through the cycle again.

The result is that under this sort of load we starve the large files
of I/O bandwidth and cannot keep the disk subsystem busy.

If I ran sync(1) while there was lots of dirty data from the large file
still in memory, it would take roughly 4-5s to complete the writeback
at disk bandwidth (~400MB/s).

Looking at this comment in __sync_single_inode():

196			if (wbc->for_kupdate) {
197				/*
198				 * For the kupdate function we leave the inode
199				 * at the head of sb_dirty so it will get more
200				 * writeout as soon as the queue becomes
201				 * uncongested.
202				 */
203				inode->i_state |= I_DIRTY_PAGES;
204				list_move_tail(&inode->i_list, &sb->s_dirty);
205			} else {

It appears that it is intended to handle congested devices. The thing
is, 1024 pages on writeback is not enough to congest a single disk,
let alone a RAID box 10 or 100 times faster than a single disk.
Hence we're stopping writeback long before we congest the device.
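
(For scale: 1024 x 16k pages is only 16MB, and 16MB against an array that
can sink ~400MB/s is roughly 40ms worth of I/O - nowhere near enough
outstanding data to push the queue into congestion.)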

Therefore, lets only move the inode back onto the dirty list if the device
really is congested. Patch against 2.6.15-rc2 below.

The difference is that writing back the large file takes about 2
minutes - it gets written in 3 chunks of about 70,000 pages each 30s
apart. In the test that I just ran, the create rate dips from
around 4.5k/s to 1-2k/s during each period where pdflush is flushing
the large file.

With multiple large files being written, the create rate dips lower,
for longer, due to actually getting the device to congestion levels
for periods of a few seconds, but in the time period between the
writeback of the large files, it returns to ~4.5k creates/s.

It's not perfect - the disks only reach full throughput for a
few seconds every dirty_expire_centisecs - but it's a couple of
orders of magnitude better and it scales with disk bandwidth
and queue depths.

Signed-off-by: Dave Chinner <[email protected]>

---

fs-writeback.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c 2006-01-25 13:10:17.000000000 +1100
+++ linux/fs/fs-writeback.c 2006-02-06 14:17:24.060687209 +1100
@@ -198,10 +198,15 @@
 				 * For the kupdate function we leave the inode
 				 * at the head of sb_dirty so it will get more
 				 * writeout as soon as the queue becomes
-				 * uncongested.
+				 * uncongested. Only do this move if we really
+				 * did encounter congestion so we don't starve
+				 * heavy writers and under-utilise disk
+				 * resources.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				list_move_tail(&inode->i_list, &sb->s_dirty);
+				if (wbc->encountered_congestion)
+					list_move_tail(&inode->i_list,
+							&sb->s_dirty);
 			} else {
 				/*
 				 * Otherwise fully redirty the inode so that

--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group


2006-02-06 04:27:59

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

David Chinner <[email protected]> wrote:
>
> Folks,
>
> I have recently been running some mixed workload tests on a 4p Altix,
> and I came across what looks to be a lack-of-writeback problem. The
> filesystem is XFS, but the problem is in the generic writeback code.
>
> The workload involves ~16 postmark threads running in the background
> each creating ~15m subdirectories of ~1m files each. The idea is
> that this generates a nice, steady background file creation load.
> Each file is between 1-10k in size, and it runs at 3-5k creates/s.
>
> The disk subsystem is nowhere near I/O bound - the luns are less than 10%
> busy when running this workload, and only writing about 30-40MB/s aggregate.
>
> The problem comes when I run another thread that writes a large single
> file to disk. e.g.:
>
> # dd if=/dev/zero of=/mnt/dgc/stripe/testfile bs=1024k count=4096
>
> to write out a 4GB file. Now this goes straight into memory (takes
> about 7-8s) with some writeback occurring. The result is that approximately
> 2.5GB of the file is still dirty in memory.
>
> It then takes over an hour to write the remaining data to disk. The
> pattern of writeback appears to be that roughly every
> dirty_expire_centisecs, a chunk of 1024 pages (16MB on Altix) is
> written back for that large file in a single flush.
> The inode then gets moved to the superblock dirty list, and the
> next pdflush iteration of 1024 pages works on the next inodes on the
> superblock I/O list.
>
> The problem is that when you are creating thousands of files per second
> with some data in them, the superblock I/O list blows out to approximately
> (create rate * expiry age) inodes, and any one inode in this list will
> get a maximum of 1024 pages written back per iteration on the list.

That code does so many different things it ain't funny. This is why when
one thing gets changed, something else gets broken.

The intention here is that once an inode has "expired" (dirtied_when is
more than dirty_expire_centisecs ago), the inode will get fully synced.

From a quick peek, this code:

			if (wbc->for_kupdate) {
				/*
				 * For the kupdate function we leave the inode
				 * at the head of sb_dirty so it will get more
				 * writeout as soon as the queue becomes
				 * uncongested.
				 */
				inode->i_state |= I_DIRTY_PAGES;
				list_move_tail(&inode->i_list, &sb->s_dirty);


isn't working right any more.

(aside: a "full sync" of a file is livelocky if some process is continually
writing to it. There's logic in sync_sb_inodes which tries to prevent that).
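
Roughly, quoting that logic loosely from memory: sync_sb_inodes() notes the
jiffies value when it starts and bails as soon as it meets an inode that was
dirtied after that point:

	const unsigned long start = jiffies;	/* livelock avoidance */
	...
	/* Was this inode dirtied after sync_sb_inodes was called? */
	if (time_after(inode->dirtied_when, start))
		break;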

> So, in a typical 30s period, the superblock dirty inode list grows to 60-70k
> inodes, which pdflush then splices to the I/O list when the I/O list is
> empty. We now have an empty dirty list and a really long I/O list.
>
> In the time it takes the I/O list to be emptied, we've created many more
> thousands of files, so the large file gets moved to an extremely heavily
> populated dirty list after a tiny amount of writeback. Hence when pdflush
> finally empties the I/O list, it splices another 60-70k inodes into the
> I/O list, and we go through the cycle again.
>
> The result is that under this sort of load we starve the large files
> of I/O bandwidth and cannot keep the disk subsystem busy.
>
> If I ran sync(1) while there was lots of dirty data from the large file
> still in memory, it would take roughly 4-5s to complete the writeback
> at disk bandwidth (~400MB/s).
>
> Looking at this comment in __sync_single_inode():
>
> 196 if (wbc->for_kupdate) {
> 197 /*
> 198 * For the kupdate function we leave the inode
> 199 * at the head of sb_dirty so it will get more
> 200 * writeout as soon as the queue becomes
> 201 * uncongested.
> 202 */
> 203 inode->i_state |= I_DIRTY_PAGES;
> 204 list_move_tail(&inode->i_list, &sb->s_dirty);
> 205 } else {
>
> It appears that it is intended to handle congested devices. The thing
> is, 1024 pages on writeback is not enough to congest a single disk,
> let alone a RAID box 10 or 100 times faster than a single disk.
> Hence we're stopping writeback long before we congest the device.

I think the comment is misleading. The writeout pass can terminate because
wbc->nr_to_write was satisfied, as well as for queue congestion.

I suspect what's happened here is that someone other than pdflush has tried
to do some writeback and didn't set for_kupdate, so we ended up resetting
dirtied_when.

> Therefore, lets only move the inode back onto the dirty list if the device
> really is congested. Patch against 2.6.15-rc2 below.

This'll break something else, I bet :(

I'll take a look. Another approach would be to look at nr_to_write, i.e.:

	if (wbc->for_kupdate || wbc->nr_to_write <= 0)

but it'll take half an hour's grovelling through changelogs to work out what
that'll break.


2006-02-06 05:48:34

by David Chinner

Subject: Re: [PATCH] Prevent large file writeback starvation

On Sun, Feb 05, 2006 at 08:27:33PM -0800, Andrew Morton wrote:
> David Chinner <[email protected]> wrote:
> > The problem is that when you are creating thousands of files per second
> > with some data in them, the superblock I/O list blows out to approximately
> > (create rate * expiry age) inodes, and any one inode in this list will
> > get a maximum of 1024 pages written back per iteration on the list.
>
> That code does so many different things it ain't funny. This is why when
> one thing gets changed, something else gets broken.
>
> The intention here is that once an inode has "expired" (dirtied_when is
> more than dirty_expire_centisecs ago), the inode will get fully synced.

Sure. And it works just fine when you're not creating lots of small
files at the same time because we iterate across s_io and s_dirty
very quickly.

However, you can't fully sync anything from pdflush with the
writeback parameters it uses without multiple passes through
this code.
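
Rough numbers, assuming the 1024-page chunk above is the per-pass
nr_to_write limit pdflush uses:

	~2.5GB dirty / 16k per page      ~= 160,000 pages
	160,000 / 1024 pages per pass    ~= 160 passes for that one inode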

> From a quick peek, this code:
>
> if (wbc->for_kupdate) {
> /*
> * For the kupdate function we leave the inode
> * at the head of sb_dirty so it will get more
> * writeout as soon as the queue becomes
> * uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> list_move_tail(&inode->i_list, &sb->s_dirty);
>
>
> isn't working right any more.

If the intent is to continue writing it back until fully
sync'd, then shouldn't we be moving that to the tail of I/O list so
we don't have to iterate over the dirty list again before we try to
write another chunk out?

FWIW, we've never seen this problem before with XFS because prior to
2.6.15 XFS ignored wbc and block device congestion and just wrote as
much as it could cluster into a single extent in a single
do_writepages() call. Hence it would have written the 8GB in one
hit, so we never saw this problem.

We made XFS behave nicely because it solved several problems
including preventing pdflush from sleeping on full block device
queues during writeback...

> > Looking at this comment in __sync_single_inode():
> >
> > 196 if (wbc->for_kupdate) {
> > 197 /*
> > 198 * For the kupdate function we leave the inode
> > 199 * at the head of sb_dirty so it will get more
> > 200 * writeout as soon as the queue becomes
> > 201 * uncongested.
> > 202 */
> > 203 inode->i_state |= I_DIRTY_PAGES;
> > 204 list_move_tail(&inode->i_list, &sb->s_dirty);
> > 205 } else {
> >
> > It appears that it is intended to handle congested devices. The thing
> > is, 1024 pages on writeback is not enough to congest a single disk,
> > let alone a RAID box 10 or 100 times faster than a single disk.
> > Hence we're stopping writeback long before we congest the device.
>
> I think the comment is misleading. The writeout pass can terminate because
> wbc->nr_to_write was satisfied, as well as for queue congestion.

Exactly my point and what the patch addresses - it allows writeback on
that inode to continue from where it left off if the device was not
congested.

> I suspect what's happened here is that someone other than pdflush has tried
> to do some writeback and didn't set for_kupdate, so we ended up resetting
> dirtied_when.

If it's not wb_kupdate that is trying to write it back, and we have little
memory pressure, and we completed writing the file long ago, then what behaves
exactly like wb_kupdate for hours on end apart from wb_kupdate?

> > Therefore, lets only move the inode back onto the dirty list if the device
> > really is congested. Patch against 2.6.15-rc2 below.
>
> This'll break something else, I bet :(

Wonderful. What needs testing to indicate something else hasn't broken?
Does anyone have any regression tests for this code?

> I'll take a look. Another approach would be to look at nr_to_write, i.e.:
>
> 	if (wbc->for_kupdate || wbc->nr_to_write <= 0)

I just tested this and it doesn't change the default behaviour.
After writing the 4GB file ~5 minutes ago, I've seen ~10k pages go to
disk, and I still have another 140k to go. IOWs, exactly the same
behaviour as the current code.

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-06 06:22:41

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

David Chinner <[email protected]> wrote:
>
> > From a quick peek, this code:
> >
> > if (wbc->for_kupdate) {
> > /*
> > * For the kupdate function we leave the inode
> > * at the head of sb_dirty so it will get more
> > * writeout as soon as the queue becomes
> > * uncongested.
> > */
> > inode->i_state |= I_DIRTY_PAGES;
> > list_move_tail(&inode->i_list, &sb->s_dirty);
> >
> >
> > isn't working right any more.
>
> If the intent is to continue writing it back until fully
> sync'd, then shouldn't we be moving that to the tail of I/O list so
> we don't have to iterate over the dirty list again before we try to
> write another chunk out?

Only if dirtied_when has expired. Until that's true I think it's right to
move onto other (potentially expired) inodes.

Your patch leaves these inodes on s_io, actually.

> > >
> > > It appears that it is intended to handle congested devices. The thing
> > > is, 1024 pages on writeback is not enough to congest a single disk,
> > > let alone a RAID box 10 or 100 times faster than a single disk.
> > > Hence we're stopping writeback long before we congest the device.
> >
> > I think the comment is misleading. The writeout pass can terminate because
> > wbc->nr_to_write was satisfied, as well as for queue congestion.
>
> Exactly my point and what the patch addresses - it allows writeback on
> that inode to continue from where it left off if the device was not
> congested.

But what will it do to other inodes? Say, ones which have expired? This
inode could take many minutes to write out if it's all fragmented.

s_dirty is supposed to be kept in dirtied_when order, btw.

> > I suspect what's happened here is that someone other than pdflush has tried
> > to do some writeback and didn't set for_kupdate, so we ended up resetting
> > dirtied_when.
>
> If it's not wb_kupdate that is trying to write it back, and we have little
> memory pressure, and we completed writing the file long ago, then what behaves
> exactly like wb_kupdate for hours on end apart from wb_kupdate?

Don't know. I'm not sure that we exactly know what's going on yet?

The list_move_tail is supposed to put the inode at the *head* of s_dirty.
So it's the first one which gets encountered on the next pdflush pass.

And I guess that's working OK. Except we only write 4MB of it each five
seconds. Is that the case?

If so, why would that happen? Take a look at wb_kupdate(). It's supposed
to work *continuously* on the inodes until writeback_inodes() failed to
write back enough pages. It takes this as an indication that there's no
more work to do at this time.

It'd be interesting to take a look at what's happening in wb_kupdate().
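
For reference, the loop in question looks roughly like this (2.6.15
mm/page-writeback.c, quoted loosely from memory):

	while (nr_to_write > 0) {
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			/* wrote less than a full batch */
			if (wbc.encountered_congestion)
				blk_congestion_wait(WRITE, HZ/10);
			else
				break;	/* assume all the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}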

> > > Therefore, lets only move the inode back onto the dirty list if the device
> > > really is congested. Patch against 2.6.15-rc2 below.
> >
> > This'll break something else, I bet :(
>
> Wonderful. What needs testing to indicate something else hasn't broken?

Hard.

> Does anyone have any regression tests for this code?

No.



2006-02-06 06:36:57

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

Andrew Morton <[email protected]> wrote:
>
> If so, why would that happen? Take a look at wb_kupdate(). It's supposed
> to work *continuously* on the inodes until writeback_inodes() failed to
> write back enough pages. It takes this as an indication that there's no
> more work to do at this time.
>
> It'd be interesting to take a look at what's happening in wb_kupdate().

Took a quick look at xfs_convert_page(). I don't immediately see a cause
in there, but

	if (count) {
		struct backing_dev_info *bdi;

		bdi = inode->i_mapping->backing_dev_info;
		if (bdi_write_congested(bdi)) {
			wbc->encountered_congestion = 1;
			done = 1;
		} else if (--wbc->nr_to_write <= 0) {
			done = 1;
		}
	}
	xfs_start_page_writeback(page, wbc, !page_dirty, count);

shouldn't we be decrementing wbc->nr_to_write even if the queue is congested?
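
i.e. something like this - an untested rearrangement of the snippet above,
so the congested case still sets encountered_congestion but no longer skips
the accounting:

	if (count) {
		struct backing_dev_info *bdi;

		bdi = inode->i_mapping->backing_dev_info;
		if (--wbc->nr_to_write <= 0)
			done = 1;
		if (bdi_write_congested(bdi)) {
			wbc->encountered_congestion = 1;
			done = 1;
		}
	}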

2006-02-06 11:55:18

by David Chinner

Subject: Re: [PATCH] Prevent large file writeback starvation

On Sun, Feb 05, 2006 at 10:22:15PM -0800, Andrew Morton wrote:
> David Chinner <[email protected]> wrote:
> >
> > > From a quick peek, this code:
> > >
> > > if (wbc->for_kupdate) {
> > > /*
> > > * For the kupdate function we leave the inode
> > > * at the head of sb_dirty so it will get more
> > > * writeout as soon as the queue becomes
> > > * uncongested.
> > > */
> > > inode->i_state |= I_DIRTY_PAGES;
> > > list_move_tail(&inode->i_list, &sb->s_dirty);
> > >
> > >
> > > isn't working right any more.
> >
> > If the intent is to continue writing it back until fully
> > sync'd, then shouldn't we be moving that to the tail of I/O list so
> > we don't have to iterate over the dirty list again before we try to
> > write another chunk out?
>
> Only if dirtied_when has expired. Until that's true I think it's right to
> move onto other (potentially expired) inodes.

The inode I'm seeing being starved has long since passed its dirtied_when
expiry....

> Your patch leaves these inodes on s_io, actually.

Correct and as intended, because wb_kupdate comes back to the s_io
list, not the s_dirty list.

> > > > It appears that it is intended to handle congested devices. The thing
> > > > is, 1024 pages on writeback is not enough to congest a single disk,
> > > > let alone a RAID box 10 or 100 times faster than a single disk.
> > > > Hence we're stopping writeback long before we congest the device.
> > >
> > > I think the comment is misleading. The writeout pass can terminate because
> > > wbc->nr_to_write was satisfied, as well as for queue congestion.
> >
> > Exactly my point and what the patch addresses - it allows writeback on
> > that inode to continue from where it left off if the device was not
> > congested.
>
> But what will it do to other inodes? Say, ones which have expired?

Writeback for them gets delayed for a short while because we have an
inode with large amounts of aged dirty data that needs to be flushed.

XFS has behaved like this for a long time. The only problems this
caused were I/O latency and blocking pdflush incorrectly. It
could keep pdflush flushing the one inode for minutes on end.
However, I've never seen anyone report it as a bug, and the only
side effects I noticed were under high bandwidths with many parallel
write streams.

The change I sent does effectively the same as XFS did, but without
any of the subtle side effects caused by blocking and monopolising
pdflush on one inode.

> This
> inode could take many minutes to write out if it's all fragmented.

And if it's fragmented, seeking the disks will make them become congested
far faster, and we'll move onto the next inode sooner.....

> s_dirty is supposed to be kept in dirtied_when order, btw.

Yes, but that doesn't take into account s_io.

306 static void
307 sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
308 {
309	const unsigned long start = jiffies;	/* livelock avoidance */
310
311	if (!wbc->for_kupdate || list_empty(&sb->s_io))
312		list_splice_init(&sb->s_dirty, &sb->s_io);
313
314	while (!list_empty(&sb->s_io)) {

Correct me if I'm wrong, but my reading of this is that for
wb_kupdate, we only ever move s_dirty to s_io when s_io is empty.
Then we iterate over s_io until all inodes are moved off this list
or we hit some other termination criteria. This is why I left the
large inode on the head of the s_io list until congestion was
encountered - so that wb_kupdate returned to it first in its next
pass.

So when we get to a young inode on the s_io list, we abort the
writeback loop for that filesystem with wbc->nr_to_write > 0 and
return to wb_kupdate....

However, we still have an inode with lots of dirty data on the head of
s_dirty, which we can do nothing with until s_io is emptied by
wb_kupdate.

> > > I suspect what's happened here is that someone other than pdflush has tried
> > > to do some writeback and didn't set for_kupdate, so we ended up resetting
> > > dirtied_when.
> >
> > If it's not wb_kupdate that is trying to write it back, and we have little
> > memory pressure, and we completed writing the file long ago, then what behaves
> > exactly like wb_kupdate for hours on end apart from wb_kupdate?
>
> Don't know. I'm not sure that we exactly know what's going on yet?
> The list_move_tail is supposed to put the inode at the *head* of s_dirty.
> So it's the first one which gets encountered on the next pdflush pass.
> And I guess that's working OK.

Well, no, it's not working OK because pdflush won't come back to s_dirty
unless s_io is empty.

> Except we only write 4MB of it each five
> seconds. Is that the case?

We write 1024 (16k) pages roughly every dirty_expire_centisecs, not every
dirty_writeback_centisecs.

> If so, why would that happen? Take a look at wb_kupdate().

I have.

> It's supposed
> to work *continuously* on the inodes until writeback_inodes() failed to
> write back enough pages. It takes this as an indication that there's no
> more work to do at this time.

That indication is incorrect, then.

What nr_to_write > 0 indicates is that the *s_io list* for each
filesystem has nothing more to do, not that there is nothing
left to flush. We've moved anything that has more work to do onto
s_dirty, so this indication from s_io is always going to be incorrect.

Basically, this is why I chose to leave the inode on the head
of the s_io list - that's the list pdflush returns to on its
next iteration, it's what we indicate completion from, and we've
still got work to do on this inode....

> It'd be interesting to take a look at what's happening in wb_kupdate().
> > > > Therefore, lets only move the inode back onto the dirty list if the device
> > > > really is congested. Patch against 2.6.15-rc2 below.
> > >
> > > This'll break something else, I bet :(
> >
> > Wonderful. What needs testing to indicate something else hasn't broken?
>
> Hard.

Yes. And...?

> > Does anyone have any regression tests for this code?
>
> No.

Ok.

I've done basic QA on the change - i.e. sync(1) still works, fsync(2) still
works, etc. - and I can run XFSQA over it, but without further guidance as to
what the necessary testing is, I'm just handwaving and saying it Works For Me....

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-06 11:57:43

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Prevent large file writeback starvation

On Sun, Feb 05, 2006 at 10:36:11PM -0800, Andrew Morton wrote:
> Andrew Morton <[email protected]> wrote:
> >
> > If so, why would that happen? Take a look at wb_kupdate(). It's supposed
> > to work *continuously* on the inodes until writeback_inodes() failed to
> > write back enough pages. It takes this as an indication that there's no
> > more work to do at this time.
> >
> > It'd be interesting to take a look at what's happening in wb_kupdate().
>
> Took a quick look at xfs_convert_page(). I don't immediately see a cause
> in there, but
>
> if (count) {
> struct backing_dev_info *bdi;
>
> bdi = inode->i_mapping->backing_dev_info;
> if (bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> done = 1;
> } else if (--wbc->nr_to_write <= 0) {
> done = 1;
> }
> }
> xfs_start_page_writeback(page, wbc, !page_dirty, count);
>
> shouldn't we be decrementing wbc->nr_to_write even if the queue is congested?

Yes, you are right. I'll fix it tomorrow.

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-06 14:36:12

by Mark Lord

Subject: Re: [PATCH] Prevent large file writeback starvation

I wonder if this is related to my previous observation here,
on ext3, that large file writebacks get deferred wayyyyy too long.

Original post is below:


-------- Original Message --------
Subject: 2.6.xx: dirty pages never being sync'd to disk?
Date: Mon, 14 Nov 2005 10:30:58 -0500
From: Mark Lord <[email protected]>
To: Linux Kernel <[email protected]>

Okay, this one's been nagging me since I first began using 2.6.xx.

My Notebook computer has 2GB of RAM, and the 2.6.xx kernel seems quite
happy to leave hundreds of MB of dirty unsync'd pages laying around
more or less indefinitely. This worries me, because that's a lot of data
to lose should the kernel crash (which it has once quite recently)
or the battery die.

/proc/sys/vm/dirty_expire_centisecs = 3000 (30 seconds)
/proc/sys/vm/dirty_writeback_centisecs = 500 (5 seconds)

My understanding (please correct if wrong) is that this means
that any (file data) page which is dirtied, should get flushed
back to disk after 30 seconds or so.

That doesn't happen here. Hundreds of MB of dirty pages just
hang around indefinitely, until I manually type "sync",
at which point the hard drive gets very busy for 20 seconds or so.

What's going on?

2006-02-06 14:39:59

by Mark Lord

Subject: Re: [PATCH] Prevent large file writeback starvation

And some more background on what I saw back in November
and still see today in 2.6.15.

-------- Original Message --------
Subject: Re: 2.6.xx: dirty pages never being sync'd to disk?
Date: Mon, 14 Nov 2005 10:49:15 -0500
From: Mark Lord <[email protected]>
To: Arjan van de Ven <[email protected]>
CC: Linux Kernel <[email protected]>
References: <[email protected]> <[email protected]>

Arjan van de Ven wrote:
> On Mon, 2005-11-14 at 10:30 -0500, Mark Lord wrote:
..
>>My Notebook computer has 2GB of RAM, and the 2.6.xx kernel seems quite
>>happy to leave hundreds of MB of dirty unsync'd pages laying around
..
>>/proc/sys/vm/dirty_expire_centisecs = 3000 (30 seconds)
>>/proc/sys/vm/dirty_writeback_centisecs = 500 (5 seconds)
..
> do you have laptop mode enabled? That changes the behavior bigtime in
> this regard and makes the kernel behave quite different.

No. Laptop-mode mostly just modifies the dirty_expire
and related settings, and I have them set as shown above.
But there's also this:

/proc/sys/vm/laptop_mode = 0

> also if these are files written to by mmap, the kernel only really sees
> those as dirty when the mapping gets taken down

They certainly show up in the counts in /proc/meminfo under "Dirty",
so I assumed that means the kernel knows they are dirty.

A simple test I do for this:

$ mkdir t
$ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)

In another window, I do this:

$ while (sleep 1); do echo -n "`date`: "; grep Dirty /proc/meminfo; done

And then watch the count get large, but take virtually forever
to count back down to a "safe" value.

Typing "sync" causes all the Dirty pages to immediately be flushed to disk,
as expected.

Here's what the monitoring of /proc/meminfo shows,
on an otherwise mostly idle system after having done
the big file copies noted earlier:

Mon Nov 14 10:40:22 EST 2005: Dirty: 481284 kB
Mon Nov 14 10:40:23 EST 2005: Dirty: 479680 kB
Mon Nov 14 10:40:24 EST 2005: Dirty: 480380 kB
Mon Nov 14 10:40:25 EST 2005: Dirty: 480380 kB
Mon Nov 14 10:40:26 EST 2005: Dirty: 480380 kB
Mon Nov 14 10:40:27 EST 2005: Dirty: 480380 kB
Mon Nov 14 10:40:28 EST 2005: Dirty: 480384 kB
Mon Nov 14 10:40:29 EST 2005: Dirty: 480384 kB
Mon Nov 14 10:40:30 EST 2005: Dirty: 480384 kB
Mon Nov 14 10:40:31 EST 2005: Dirty: 480384 kB
Mon Nov 14 10:40:32 EST 2005: Dirty: 480384 kB
Mon Nov 14 10:40:33 EST 2005: Dirty: 480688 kB
Mon Nov 14 10:40:34 EST 2005: Dirty: 479972 kB
Mon Nov 14 10:40:35 EST 2005: Dirty: 479972 kB
Mon Nov 14 10:40:36 EST 2005: Dirty: 479972 kB
Mon Nov 14 10:40:37 EST 2005: Dirty: 480016 kB
Mon Nov 14 10:40:38 EST 2005: Dirty: 480016 kB
Mon Nov 14 10:40:39 EST 2005: Dirty: 480016 kB
Mon Nov 14 10:40:40 EST 2005: Dirty: 480020 kB
Mon Nov 14 10:40:41 EST 2005: Dirty: 480020 kB
Mon Nov 14 10:40:42 EST 2005: Dirty: 480028 kB
Mon Nov 14 10:40:43 EST 2005: Dirty: 480028 kB
Mon Nov 14 10:40:44 EST 2005: Dirty: 475868 kB
Mon Nov 14 10:40:45 EST 2005: Dirty: 475868 kB
Mon Nov 14 10:40:46 EST 2005: Dirty: 475868 kB
Mon Nov 14 10:40:47 EST 2005: Dirty: 475868 kB
Mon Nov 14 10:40:48 EST 2005: Dirty: 475880 kB
Mon Nov 14 10:40:49 EST 2005: Dirty: 475880 kB
Mon Nov 14 10:40:50 EST 2005: Dirty: 475880 kB
Mon Nov 14 10:40:51 EST 2005: Dirty: 475880 kB
Mon Nov 14 10:40:52 EST 2005: Dirty: 475880 kB
Mon Nov 14 10:40:53 EST 2005: Dirty: 475880 kB
Mon Nov 14 10:40:54 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:40:55 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:40:57 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:40:58 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:40:59 EST 2005: Dirty: 455164 kB
Mon Nov 14 10:41:00 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:41:01 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:41:02 EST 2005: Dirty: 455160 kB
Mon Nov 14 10:41:03 EST 2005: Dirty: 455164 kB
Mon Nov 14 10:41:04 EST 2005: Dirty: 455164 kB
Mon Nov 14 10:41:05 EST 2005: Dirty: 455168 kB
Mon Nov 14 10:41:06 EST 2005: Dirty: 455168 kB
Mon Nov 14 10:41:07 EST 2005: Dirty: 455168 kB
Mon Nov 14 10:41:08 EST 2005: Dirty: 455188 kB
Mon Nov 14 10:41:09 EST 2005: Dirty: 455176 kB
Mon Nov 14 10:41:10 EST 2005: Dirty: 455176 kB
Mon Nov 14 10:41:11 EST 2005: Dirty: 455176 kB
Mon Nov 14 10:41:12 EST 2005: Dirty: 455176 kB
Mon Nov 14 10:41:13 EST 2005: Dirty: 455180 kB
Mon Nov 14 10:41:14 EST 2005: Dirty: 450972 kB
Mon Nov 14 10:41:15 EST 2005: Dirty: 450972 kB
Mon Nov 14 10:41:16 EST 2005: Dirty: 450972 kB
Mon Nov 14 10:41:17 EST 2005: Dirty: 450972 kB
Mon Nov 14 10:41:18 EST 2005: Dirty: 451016 kB
Mon Nov 14 10:41:19 EST 2005: Dirty: 430336 kB
Mon Nov 14 10:41:20 EST 2005: Dirty: 430336 kB
Mon Nov 14 10:41:21 EST 2005: Dirty: 430336 kB
Mon Nov 14 10:41:22 EST 2005: Dirty: 430336 kB
Mon Nov 14 10:41:23 EST 2005: Dirty: 430348 kB
Mon Nov 14 10:41:24 EST 2005: Dirty: 430348 kB
Mon Nov 14 10:41:25 EST 2005: Dirty: 430348 kB
Mon Nov 14 10:41:26 EST 2005: Dirty: 430348 kB
Mon Nov 14 10:41:27 EST 2005: Dirty: 430348 kB
Mon Nov 14 10:41:28 EST 2005: Dirty: 430356 kB
Mon Nov 14 10:41:29 EST 2005: Dirty: 430352 kB
Mon Nov 14 10:41:30 EST 2005: Dirty: 430352 kB
Mon Nov 14 10:41:31 EST 2005: Dirty: 430352 kB
Mon Nov 14 10:41:32 EST 2005: Dirty: 430352 kB
Mon Nov 14 10:41:33 EST 2005: Dirty: 430356 kB
Mon Nov 14 10:41:34 EST 2005: Dirty: 430356 kB
Mon Nov 14 10:41:35 EST 2005: Dirty: 430356 kB
Mon Nov 14 10:41:36 EST 2005: Dirty: 430356 kB
Mon Nov 14 10:41:37 EST 2005: Dirty: 430368 kB
Mon Nov 14 10:41:38 EST 2005: Dirty: 430364 kB
Mon Nov 14 10:41:39 EST 2005: Dirty: 430360 kB
Mon Nov 14 10:41:40 EST 2005: Dirty: 430364 kB
Mon Nov 14 10:41:41 EST 2005: Dirty: 430364 kB
Mon Nov 14 10:41:42 EST 2005: Dirty: 430368 kB
Mon Nov 14 10:41:43 EST 2005: Dirty: 430368 kB
Mon Nov 14 10:41:44 EST 2005: Dirty: 405552 kB
Mon Nov 14 10:41:45 EST 2005: Dirty: 405552 kB
Mon Nov 14 10:41:46 EST 2005: Dirty: 405552 kB
Mon Nov 14 10:41:47 EST 2005: Dirty: 405552 kB
Mon Nov 14 10:41:48 EST 2005: Dirty: 405556 kB
Mon Nov 14 10:41:49 EST 2005: Dirty: 405548 kB
Mon Nov 14 10:41:50 EST 2005: Dirty: 405548 kB
Mon Nov 14 10:41:51 EST 2005: Dirty: 405548 kB
Mon Nov 14 10:41:52 EST 2005: Dirty: 405548 kB
Mon Nov 14 10:41:53 EST 2005: Dirty: 405552 kB
Mon Nov 14 10:41:54 EST 2005: Dirty: 405492 kB
Mon Nov 14 10:41:55 EST 2005: Dirty: 405492 kB
Mon Nov 14 10:41:56 EST 2005: Dirty: 405492 kB
Mon Nov 14 10:41:57 EST 2005: Dirty: 405524 kB
Mon Nov 14 10:41:58 EST 2005: Dirty: 405528 kB
Mon Nov 14 10:41:59 EST 2005: Dirty: 405524 kB
Mon Nov 14 10:42:00 EST 2005: Dirty: 405524 kB
Mon Nov 14 10:42:01 EST 2005: Dirty: 405524 kB
Mon Nov 14 10:42:02 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:03 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:04 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:05 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:06 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:07 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:08 EST 2005: Dirty: 405536 kB
Mon Nov 14 10:42:10 EST 2005: Dirty: 405532 kB
Mon Nov 14 10:42:11 EST 2005: Dirty: 405532 kB
Mon Nov 14 10:42:12 EST 2005: Dirty: 405532 kB
Mon Nov 14 10:42:13 EST 2005: Dirty: 405532 kB
Mon Nov 14 10:42:14 EST 2005: Dirty: 405544 kB
Mon Nov 14 10:42:15 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:16 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:17 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:18 EST 2005: Dirty: 380680 kB
Mon Nov 14 10:42:19 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:20 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:21 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:22 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:23 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:24 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:25 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:26 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:27 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:28 EST 2005: Dirty: 380680 kB
Mon Nov 14 10:42:29 EST 2005: Dirty: 380668 kB
Mon Nov 14 10:42:30 EST 2005: Dirty: 380668 kB
Mon Nov 14 10:42:31 EST 2005: Dirty: 380668 kB
Mon Nov 14 10:42:32 EST 2005: Dirty: 380668 kB
Mon Nov 14 10:42:33 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:34 EST 2005: Dirty: 380628 kB
Mon Nov 14 10:42:35 EST 2005: Dirty: 380628 kB
Mon Nov 14 10:42:36 EST 2005: Dirty: 380628 kB
Mon Nov 14 10:42:37 EST 2005: Dirty: 380632 kB
Mon Nov 14 10:42:38 EST 2005: Dirty: 380672 kB
Mon Nov 14 10:42:39 EST 2005: Dirty: 380672 kB
Mon Nov 14 10:42:40 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:41 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:42 EST 2005: Dirty: 380676 kB
Mon Nov 14 10:42:43 EST 2005: Dirty: 380684 kB
Mon Nov 14 10:42:44 EST 2005: Dirty: 362476 kB
Mon Nov 14 10:42:45 EST 2005: Dirty: 362476 kB
Mon Nov 14 10:42:46 EST 2005: Dirty: 362476 kB
Mon Nov 14 10:42:47 EST 2005: Dirty: 362476 kB
Mon Nov 14 10:42:48 EST 2005: Dirty: 362476 kB
Mon Nov 14 10:42:49 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:50 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:51 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:52 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:53 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:54 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:55 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:56 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:57 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:42:58 EST 2005: Dirty: 358344 kB
Mon Nov 14 10:42:59 EST 2005: Dirty: 358344 kB
Mon Nov 14 10:43:00 EST 2005: Dirty: 358344 kB
Mon Nov 14 10:43:01 EST 2005: Dirty: 358344 kB
Mon Nov 14 10:43:02 EST 2005: Dirty: 358352 kB
Mon Nov 14 10:43:03 EST 2005: Dirty: 358352 kB
Mon Nov 14 10:43:04 EST 2005: Dirty: 358348 kB
Mon Nov 14 10:43:05 EST 2005: Dirty: 358348 kB
Mon Nov 14 10:43:06 EST 2005: Dirty: 358348 kB
Mon Nov 14 10:43:07 EST 2005: Dirty: 358348 kB
Mon Nov 14 10:43:08 EST 2005: Dirty: 358352 kB
Mon Nov 14 10:43:09 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:43:10 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:43:11 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:43:12 EST 2005: Dirty: 358340 kB
Mon Nov 14 10:43:13 EST 2005: Dirty: 358344 kB
Mon Nov 14 10:43:14 EST 2005: Dirty: 341716 kB
Mon Nov 14 10:43:15 EST 2005: Dirty: 341716 kB
Mon Nov 14 10:43:16 EST 2005: Dirty: 341716 kB
Mon Nov 14 10:43:17 EST 2005: Dirty: 341756 kB
Mon Nov 14 10:43:18 EST 2005: Dirty: 341756 kB
Mon Nov 14 10:43:19 EST 2005: Dirty: 341748 kB
Mon Nov 14 10:43:21 EST 2005: Dirty: 341748 kB
Mon Nov 14 10:43:22 EST 2005: Dirty: 341748 kB
Mon Nov 14 10:43:23 EST 2005: Dirty: 341752 kB
Mon Nov 14 10:43:24 EST 2005: Dirty: 341752 kB
Mon Nov 14 10:43:25 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:26 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:27 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:28 EST 2005: Dirty: 338276 kB
Mon Nov 14 10:43:29 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:30 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:31 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:32 EST 2005: Dirty: 338268 kB
Mon Nov 14 10:43:33 EST 2005: Dirty: 338272 kB
Mon Nov 14 10:43:34 EST 2005: Dirty: 338272 kB
Mon Nov 14 10:43:35 EST 2005: Dirty: 338272 kB
Mon Nov 14 10:43:36 EST 2005: Dirty: 338272 kB
Mon Nov 14 10:43:37 EST 2005: Dirty: 338276 kB
Mon Nov 14 10:43:38 EST 2005: Dirty: 338280 kB
Mon Nov 14 10:43:39 EST 2005: Dirty: 338276 kB
Mon Nov 14 10:43:40 EST 2005: Dirty: 338280 kB
Mon Nov 14 10:43:41 EST 2005: Dirty: 338280 kB
Mon Nov 14 10:43:42 EST 2005: Dirty: 338280 kB
Mon Nov 14 10:43:43 EST 2005: Dirty: 338288 kB
Mon Nov 14 10:43:44 EST 2005: Dirty: 321708 kB
Mon Nov 14 10:43:45 EST 2005: Dirty: 321708 kB
Mon Nov 14 10:43:46 EST 2005: Dirty: 321708 kB
Mon Nov 14 10:43:47 EST 2005: Dirty: 321708 kB
Mon Nov 14 10:43:48 EST 2005: Dirty: 321708 kB
Mon Nov 14 10:43:49 EST 2005: Dirty: 321704 kB
Mon Nov 14 10:43:50 EST 2005: Dirty: 321704 kB
Mon Nov 14 10:43:51 EST 2005: Dirty: 321704 kB
Mon Nov 14 10:43:52 EST 2005: Dirty: 321704 kB
Mon Nov 14 10:43:53 EST 2005: Dirty: 321708 kB
Mon Nov 14 10:43:54 EST 2005: Dirty: 321656 kB
Mon Nov 14 10:43:55 EST 2005: Dirty: 321656 kB
Mon Nov 14 10:43:56 EST 2005: Dirty: 321656 kB
Mon Nov 14 10:43:57 EST 2005: Dirty: 321656 kB
Mon Nov 14 10:43:58 EST 2005: Dirty: 321688 kB
Mon Nov 14 10:43:59 EST 2005: Dirty: 321684 kB
Mon Nov 14 10:44:00 EST 2005: Dirty: 321684 kB
Mon Nov 14 10:44:01 EST 2005: Dirty: 321684 kB
Mon Nov 14 10:44:02 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:03 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:04 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:05 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:06 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:07 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:08 EST 2005: Dirty: 321696 kB
Mon Nov 14 10:44:09 EST 2005: Dirty: 321692 kB
Mon Nov 14 10:44:10 EST 2005: Dirty: 321692 kB
Mon Nov 14 10:44:11 EST 2005: Dirty: 321692 kB
Mon Nov 14 10:44:12 EST 2005: Dirty: 321692 kB
Mon Nov 14 10:44:13 EST 2005: Dirty: 321692 kB
Mon Nov 14 10:44:14 EST 2005: Dirty: 317604 kB
Mon Nov 14 10:44:15 EST 2005: Dirty: 317604 kB
Mon Nov 14 10:44:16 EST 2005: Dirty: 317608 kB
Mon Nov 14 10:44:17 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:18 EST 2005: Dirty: 317616 kB
Mon Nov 14 10:44:19 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:20 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:21 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:22 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:23 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:24 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:25 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:26 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:27 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:28 EST 2005: Dirty: 317616 kB
Mon Nov 14 10:44:29 EST 2005: Dirty: 317608 kB
Mon Nov 14 10:44:30 EST 2005: Dirty: 317608 kB
Mon Nov 14 10:44:32 EST 2005: Dirty: 317608 kB
Mon Nov 14 10:44:33 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:34 EST 2005: Dirty: 317612 kB
Mon Nov 14 10:44:35 EST 2005: Dirty: 317564 kB
Mon Nov 14 10:44:36 EST 2005: Dirty: 317564 kB
Mon Nov 14 10:44:37 EST 2005: Dirty: 317568 kB
Mon Nov 14 10:44:38 EST 2005: Dirty: 317608 kB
Mon Nov 14 10:44:39 EST 2005: Dirty: 317616 kB

2006-02-06 14:53:08

by Pedro Alves

Subject: Several Hangs on different Hardware and different kernels

Hi,

I am using a Linux box as a multimedia computer to show news,
advertisements and clips. After some time the box hangs randomly.
I've changed everything I could: power source, motherboard, memory,
compact flash (I use an IDE to compact flash adapter), kernel version
(I only tried 2.6.12 to 2.6.15 versions). But nothing solves the
problem, and what drives me crazy: no log, no core dump, no clue at
all... The box hangs on the X display during a clip or advertisement
and stops responding to keyboard and network....
Can anybody help me figure out how to get a clue before a system hang?

Thank You

Pedro Alves


2006-02-06 20:12:03

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

Mark Lord <[email protected]> wrote:
>
> A simple test I do for this:
>
> $ mkdir t
> $ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)
>
> In another window, I do this:
>
> $ while (sleep 1); do echo -n "`date`: "; grep Dirty /proc/meminfo; done
>
> And then watch the count get large, but take virtually forever
> to count back down to a "safe" value.
>
> Typing "sync" causes all the Dirty pages to immediately be flushed to disk,
> as expected.

I've never seen that happen and I don't recall seeing any other reports of
it, so your machine must be doing something peculiar. I think it can
happen if, say, an inode gets itself onto the wrong inode list, or
incorrectly gets its dirty flag cleared.

Are you using any unusual mount options, or unusual combinations of
filesystems, or anything like that?

2006-02-06 23:12:33

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

David Chinner <[email protected]> wrote:
>
> 306 static void
> 307 sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
> 308 {
> 309 const unsigned long start = jiffies; /* livelock avoidance */
> 310
> 311 if (!wbc->for_kupdate || list_empty(&sb->s_io))
> 312 list_splice_init(&sb->s_dirty, &sb->s_io);
> 313
> 314 while (!list_empty(&sb->s_io)) {
>
> Correct me if I'm wrong, but my reading of this is that for
> wb_kupdate, we only ever move s_dirty to s_io when s_io is empty.
> Then we iterate over s_io until all inodes are moved off this list
> or we hit some other termination criteria. This is why I left the
> large inode on the head of the s_io list until congestion was
> encountered - so that wb_kupdate returned to it first in its next
> pass.
>
> So when we get to a young inode on the s_io list, we abort the
> writeback loop for that filesystem with wbc->nr_to_write > 0 and
> return to wb_kupdate....
>
> However, we still have an inode with lots of dirty data on the head of
> s_dirty, which we can do nothing with until s_io is emptied by
> wb_kupdate.

That sounds right. I guess what is happening is that we get into the state
where your big-dirty-file is on s_dirty and there are a few
small-dirty-files on s_io. sync_sb_inodes() writes the small files,
returns "small number of pages written" and that causes wb_kupdate() to
terminate.

I think the problem here is that the wb_kupdate() termination test is wrong
- it should just keep going.

We have another bug due to this: if you create a large number of
zero-length files on a traditional filesystem (ext2, minix, ...), the
writeout of these inodes doesn't work correctly - it takes ages. Because
the wb_kupdate logic is driven by "number of dirty pages", and all those
dirty inodes have zero dirty pages associated with them. wb_kupdate says
"oh, nothing to do" and gives up.

So to fix both these problems we need to be smarter about terminating the
wb_kupdate() loop. Something like "loop until no expired inodes have been
written".

Wildly untested patch:


diff -puN include/linux/writeback.h~wb_kupdate-fix-termination-condition include/linux/writeback.h
--- 25/include/linux/writeback.h~wb_kupdate-fix-termination-condition	Mon Feb 6 15:09:32 2006
+++ 25-akpm/include/linux/writeback.h	Mon Feb 6 15:10:33 2006
@@ -58,6 +58,7 @@ struct writeback_control {
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned for_writepages:1;	/* This is a writepages() call */
+	unsigned wrote_expired_inode:1;	/* 1 or more expired inodes written */
 };
 
 /*
diff -puN fs/fs-writeback.c~wb_kupdate-fix-termination-condition fs/fs-writeback.c
--- 25/fs/fs-writeback.c~wb_kupdate-fix-termination-condition	Mon Feb 6 15:09:32 2006
+++ 25-akpm/fs/fs-writeback.c	Mon Feb 6 15:11:58 2006
@@ -367,6 +367,8 @@ sync_sb_inodes(struct super_block *sb, s
 			__iget(inode);
 			pages_skipped = wbc->pages_skipped;
 			__writeback_single_inode(inode, wbc);
+			if (unlikely(wbc->wrote_expired_inode == 0))
+				wbc->wrote_expired_inode = 1;
 			if (wbc->sync_mode == WB_SYNC_HOLD) {
 				inode->dirtied_when = jiffies;
 				list_move(&inode->i_list, &sb->s_dirty);
diff -puN mm/page-writeback.c~wb_kupdate-fix-termination-condition mm/page-writeback.c
--- 25/mm/page-writeback.c~wb_kupdate-fix-termination-condition	Mon Feb 6 15:09:32 2006
+++ 25-akpm/mm/page-writeback.c	Mon Feb 6 15:12:43 2006
@@ -414,8 +414,9 @@ static void wb_kupdate(unsigned long arg
 	while (nr_to_write > 0) {
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.wrote_expired_inode = 0;
 		writeback_inodes(&wbc);
-		if (wbc.nr_to_write > 0) {
+		if (wbc.wrote_expired_inode == 0) {
 			if (wbc.encountered_congestion)
 				blk_congestion_wait(WRITE, HZ/10);
 			else
_

2006-02-07 00:34:26

by David Chinner

Subject: Re: [PATCH] Prevent large file writeback starvation

On Mon, Feb 06, 2006 at 03:14:35PM -0800, Andrew Morton wrote:
> David Chinner <[email protected]> wrote:
> >
> > 306 static void
> > 307 sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
> > 308 {
> > 309 const unsigned long start = jiffies; /* livelock avoidance */
> > 310
> > 311 if (!wbc->for_kupdate || list_empty(&sb->s_io))
> > 312 list_splice_init(&sb->s_dirty, &sb->s_io);
> > 313
> > 314 while (!list_empty(&sb->s_io)) {
> >
> > Correct me if I'm wrong, but my reading of this is that for
> > wb_kupdate, we only ever move s_dirty to s_io when s_io is empty.
> > Then we iterate over s_io until all inodes are moved off this list
> > or we hit some other termination criteria. This is why I left the
> > large inode on the head of the s_io list until congestion was
> > encountered - so that wb_kupdate returned to it first in its next
> > pass.
> >
> > So when we get to a young inode on the s_io list, we abort the
> > writeback loop for that filesystem with wbc->nr_to_write > 0 and
> > return to wb_kupdate....
> >
> > However, we still have an inode with lots of dirty data on the head of
> > s_dirty, which we can do nothing with until s_io is emptied by
> > wb_kupdate.
>
> That sounds right. I guess what is happening is that we get into the state
> where your big-dirty-file is on s_dirty and there are a few
> small-dirty-files on s_io. sync_sb_inodes() writes the small files,
> returns "small number of pages written" and that causes wb_kupdate() to
> terminate.
>
> I think the problem here is that the wb_kupdate() termination test is wrong
> - it should just keep going.
>
> We have another bug due to this: if you create a large number of
> zero-length files on a traditional filesystem (ext2, minix, ...), the
> writeout of these inodes doesn't work correctly - it takes ages. Because
> the wb_kupdate logic is driven by "number of dirty pages", and all those
> dirty inodes have zero dirty pages associated with them. wb_kupdate says
> "oh, nothing to do" and gives up.

OK, I can see how that would be a problem ;)

> So to fix both these problems we need to be smarter about terminating the
> wb_kupdate() loop. Something like "loop until no expired inodes have been
> written".
>
> Wildly untested patch:

Wildly untested assertion - it won't fix my case for the same reason I'm seeing
the current code not working - we abort higher up in writeback_inodes()
on the age check. All this will do is cause wb_kupdate to do one
further iteration down the stack until we hit the same young inode
on the s_io list. Because it's at the head, we return with expired
inodes zero, just like we currently return with nr_to_write > 0.
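
The age check in question being, roughly (quoting the sync_sb_inodes() loop
loosely from memory):

	/* Was this inode dirtied too recently? */
	if (wbc->older_than_this && time_after(inode->dirtied_when,
					*wbc->older_than_this))
		break;

i.e. the pass bails as soon as the inode at the head of s_io is too young,
regardless of how much expired work is queued behind it.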

I think we need to leave the inodes on which we have more work to
do on the s_io list. Alternatively, we add a new list off
the superblock (the s_more_io list) and work on that list
when we have nothing more to do or cannot do anything on
s_io or s_dirty.

That is:

splice s_dirty -> s_io

while s_io is not empty {
	if young, break
	writeback inode
	if inode needs more work, put on s_more_io
}

while s_more_io is not empty {
	writeback inode
	if inode needs more work, move to end of s_more_io
}

That way instead of pdflush going idle because s_io has not been emptied
and can't be emptied until the inodes expire, it continues to work
on the expired inodes that need more work. And it will flush out
new inodes that have expired prior to working on the inodes that require
lots of work, so the large inode writeback does not starve other,
smaller inodes being written back.

This, combined with the change you just posted, should fix both
of the conditions mentioned. Does this sound like a reasonable approach?

Cheers,

dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-07 01:02:09

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

David Chinner <[email protected]> wrote:
>
> > So to fix both these problems we need to be smarter about terminating the
> > wb_kupdate() loop. Something like "loop until no expired inodes have been
> > written".
> >
> > Wildly untested patch:
>
> Wildly untested assertion - it won't fix my case for the same reason I'm seeing
> the current code not working - we abort higher up in writeback_inodes()
> on the age check.

You mean that we're in the state

a) big-dirty-expired inode is on s_dirty

b) small-dirty-not-expired inode is on s_io

sync_sb_inodes() sees the small-dirty-not-expired inode on s_io and gives up?


In which case, yes, perhaps leaving big-dirty-expired inode on s_io is the
right thing to do. But should we be checking that it has expired before
deciding to do this? We don't want to get in a situation where continuous
overwriting of a large file causes other files on that fs to never be
written out.

2006-02-07 01:32:15

by David Chinner

Subject: Re: [PATCH] Prevent large file writeback starvation

On Mon, Feb 06, 2006 at 05:04:11PM -0800, Andrew Morton wrote:
> David Chinner <[email protected]> wrote:
> >
> > > So to fix both these problems we need to be smarter about terminating the
> > > wb_kupdate() loop. Something like "loop until no expired inodes have been
> > > written".
> > >
> > > Wildly untested patch:
> >
> > Wildly untested assertion - it won't fix my case for the same reason I'm seeing
> > the current code not working - we abort higher up in writeback_inodes()
> > on the age check.
>
> You mean that we're in the state
>
> a) big-dirty-expired inode is on s_dirty
>
> b) small-dirty-not-expired inode is on s_io
>
> sync_sb_inodes() sees the small-dirty-not-expired inode on s_io and gives up?

Yes, that's right.

> In which case, yes, perhaps leaving big-dirty-expired inode on s_io is the
> right thing to do. But should we be checking that it has expired before
> deciding to do this?

Well, we are writing it out because it has expired in the first place,
right? And it remains expired until we actually clean it, so I
don't see any need for a check such as this....

> We don't want to get in a situation where continuous
> overwriting of a large file causes other files on that fs to never be
> written out.

Agreed. That's why I proposed the s_more_io queue - it works on those inodes
that need more work only after all the other inodes have been written out.
That prevents starvation, and makes large inode flushes background work (i.e.
occur when there is nothing else to do). It will get much better disk
utilisation than the method I originally proposed, as well, because it'll keep
the disk near congestion levels until the data is written out...

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-07 05:28:17

by Andrew Morton

Subject: Re: [PATCH] Prevent large file writeback starvation

David Chinner <[email protected]> wrote:
>
> > In which case, yes, perhaps leaving big-dirty-expired inode on s_io is the
> > right thing to do. But should we be checking that it has expired before
> > deciding to do this?
>
> Well, we are writing it out because it has expired in the first place,
> right? And it remains expired until we actually clean it, so I
> don't see any need for a check such as this....

OK. I was worried about redirtyings while inode_lock is dropped, but the
I_DIRTY and _LOCK logic looks tight to me.

> > We don't want to get in a situation where continuous
> > overwriting of a large file causes other files on that fs to never be
> > written out.
>
> Agreed. That's why i proposed the s_more_io queue - it works on those inodes
> that need more work only after all the other inodes have been written out.
> That prevents starvation, and makes large inode flushes background work (i.e.
> occur when there is nothing else to do). it will get much better disk
> utilisation than the method I originally proposed, as well, because it'll keep
> the disk near congestion levels until the data is written out...
>

Yes, s_more_io does make sense. So now dirty inodes can be on one of three
lists. It'll be fun writing the changelog for this one. And we'll need a
big fat comment describing what the locks do, and the protocol for handling
them.

We need to be extra-careful to not break sys_sync(), umount, etc. I guess
if !for_kupdate, we splice s_dirty and s_more_io onto s_io and go for it.

So the protocol would be:

s_io: contains expired and non-expired dirty inodes, with expired ones at
the head. Unexpired ones (at least) are in time order.

s_more_io: contains dirty expired inodes which haven't been fully written.
Ordering doesn't matter (unless someone goes and changes
dirty_expire_centisecs - but as long as we don't do anything really bad in
response to this we'll be OK).

s_dirty: contains expired and non-expired dirty inodes. The non-expired
ones are in time-of-dirtying order.
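
Sketching just the data structure side of that (s_dirty and s_io already
exist in struct super_block; s_more_io would be the addition):

	struct super_block {
		...
		struct list_head	s_dirty;	/* dirty inodes, time-of-dirtying order */
		struct list_head	s_io;		/* queued for writeback, expired at head */
		struct list_head	s_more_io;	/* expired, but not yet fully written */
		...
	};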



I wonder if it would be saner to have separate lists for expired and
unexpired inodes. If when writing an expired inode we don't write it all
out, just rotate it to the back of the expired inode list. On entry to
sync_sb_inodes, do a walk of s_dirty, moving expired inodes onto the
expired list.

2006-02-07 07:42:32

by David Chinner

Subject: Re: [PATCH] Prevent large file writeback starvation

On Mon, Feb 06, 2006 at 09:27:50PM -0800, Andrew Morton wrote:
> David Chinner <[email protected]> wrote:
> >
> > > We don't want to get in a situation where continuous
> > > overwriting of a large file causes other files on that fs to never be
> > > written out.
> >
> > Agreed. That's why i proposed the s_more_io queue - it works on those inodes
> > that need more work only after all the other inodes have been written out.
> > That prevents starvation, and makes large inode flushes background work (i.e.
> > occur when there is nothing else to do). it will get much better disk
> > utilisation than the method I originally proposed, as well, because it'll keep
> > the disk near congestion levels until the data is written out...
> >
>
> Yes, s_more_io does make sense. So now dirty inodes can be on one of three
> lists. It'll be fun writing the changelog for this one. And we'll need a
> big fat comment describing what the locks do, and the protocol for handling
> them.
>
> We need to be extra-careful to not break sys_sync(), umount, etc. I guess
> if !for_kupdate, we splice s_dirty and s_more_io onto s_io and go for it.

I've done that slightly differently - s_more_io remains a separate list that
we don't splice - we process it after s_io using the same logic. Clunky, but
good for a test.

> So the protocol would be:
>
> s_io: contains expired and non-expired dirty inodes, with expired ones at
> the head. Unexpired ones (at least) are in time order.
>
> s_more_io: contains dirty expired inodes which haven't been fully written.
> Ordering doesn't matter (unless someone goes and changes
> dirty_expire_centisecs - but as long as we don't do anything really bad in
> response to this we'll be OK).
>
> s_dirty: contains expired and non-expired dirty inodes. The non-expired
> ones are in time-of-dirtying order.

Yup, that's pretty much it.

> I wonder if it would be saner to have separate lists for expired and
> unexpired inodes. If when writing an expired inode we don't write it all
> out, just rotate it to the back of the expired inode list. On entry to
> sync_sb_inodes, do a walk of s_dirty, moving expired inodes onto the
> expired list.

That is also a possibility - it would mean that we are only attempting to
write out inodes that have expired.

I've also discovered another wrinkle - we hit a different case if the inode
log I/O (from allocation) completes during writeback (do_writepages()). On
XFS, that calls mark_inode_dirty_sync(), so we take the I_DIRTY path in the
logic at the end of __sync_single_inode() and move the inode to the dirty
list. This is why the writeback has been terminating at 70-80k pages written
back in my testing.

Unfortunately, a quick test that leaves the inode on the s_more_io list here
as well when this happens opens us up to overwrite-leads-to-endless-writeout -
we keep dirtying the inode, it's forever expired, and it always remains on the
more_io list, so it always writes out data because it always has data to write out.

So, if we leave the inode on the "more to do" list, we need to prevent
overwrite from monopolising writeout because the only thing stopping it
now is the fact that the inode gets shoved to the dirty list by chance...

I'm going to have a bit more of a think about this. Current
patch attached below.

Signed-off-by: Dave Chinner <[email protected]>


Index: 2.6.x-xfs-new/fs/fs-writeback.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/fs-writeback.c 2006-02-07 17:27:47.200582770 +1100
+++ 2.6.x-xfs-new/fs/fs-writeback.c 2006-02-07 17:58:20.256182092 +1100
@@ -195,13 +195,12 @@ __sync_single_inode(struct inode *inode,
*/
if (wbc->for_kupdate) {
/*
- * For the kupdate function we leave the inode
- * at the head of sb_dirty so it will get more
- * writeout as soon as the queue becomes
- * uncongested.
+ * For the kupdate function we move the inode
+ * to the more_io list so that we continue to
+ * service it after writing other inodes back.
*/
inode->i_state |= I_DIRTY_PAGES;
- list_move_tail(&inode->i_list, &sb->s_dirty);
+ list_move(&inode->i_list, &sb->s_more_io);
} else {
/*
* Otherwise fully redirty the inode so that
@@ -217,9 +216,13 @@ __sync_single_inode(struct inode *inode,
} else if (inode->i_state & I_DIRTY) {
/*
* Someone redirtied the inode while were writing back
- * the pages.
+ * the pages. Do more work if it's kupdate that is
+ * writing back.
*/
- list_move(&inode->i_list, &sb->s_dirty);
+ if (wbc->for_kupdate)
+ list_move(&inode->i_list, &sb->s_more_io);
+ else
+ list_move(&inode->i_list, &sb->s_dirty);
} else if (atomic_read(&inode->i_count)) {
/*
* The inode is clean, inuse
@@ -307,12 +310,15 @@ static void
sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
{
const unsigned long start = jiffies; /* livelock avoidance */
+ struct list_head *head = &sb->s_io;

- if (!wbc->for_kupdate || list_empty(&sb->s_io))
+ if (!wbc->for_kupdate || list_empty(&sb->s_io)) {
list_splice_init(&sb->s_dirty, &sb->s_io);
+ }

- while (!list_empty(&sb->s_io)) {
- struct inode *inode = list_entry(sb->s_io.prev,
+more:
+ while (!list_empty(head)) {
+ struct inode *inode = list_entry(head->prev,
struct inode, i_list);
struct address_space *mapping = inode->i_mapping;
struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -351,8 +357,9 @@ sync_sb_inodes(struct super_block *sb, s
}

/* Was this inode dirtied after sync_sb_inodes was called? */
- if (time_after(inode->dirtied_when, start))
+ if (time_after(inode->dirtied_when, start)) {
break;
+ }

/* Was this inode dirtied too recently? */
if (wbc->older_than_this && time_after(inode->dirtied_when,
@@ -389,6 +396,12 @@ sync_sb_inodes(struct super_block *sb, s
if (wbc->nr_to_write <= 0)
break;
}
+
+ /* Do we have inodes that need more I/O? */
+ if (head == &sb->s_io && !list_empty(&sb->s_more_io)) {
+ head = &sb->s_more_io;
+ goto more;
+ }
return; /* Leave any unwritten inodes on s_io */
}

@@ -421,7 +434,9 @@ writeback_inodes(struct writeback_contro
restart:
sb = sb_entry(super_blocks.prev);
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
- if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
+ if (!list_empty(&sb->s_dirty) ||
+ !list_empty(&sb->s_io) ||
+ !list_empty(&sb->s_more_io)) {
/* we're making our own get_super here */
sb->s_count++;
spin_unlock(&sb_lock);
Index: 2.6.x-xfs-new/fs/super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/super.c 2006-02-07 17:27:47.200582770 +1100
+++ 2.6.x-xfs-new/fs/super.c 2006-02-07 17:31:24.706456190 +1100
@@ -67,6 +67,7 @@ static struct super_block *alloc_super(v
}
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
+ INIT_LIST_HEAD(&s->s_more_io);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
Index: 2.6.x-xfs-new/include/linux/fs.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/fs.h 2006-02-07 17:27:47.200582770 +1100
+++ 2.6.x-xfs-new/include/linux/fs.h 2006-02-07 17:31:24.708409044 +1100
@@ -828,6 +828,7 @@ struct super_block {
struct list_head s_inodes; /* all inodes */
struct list_head s_dirty; /* dirty inodes */
struct list_head s_io; /* parked for writeback */
+ struct list_head s_more_io; /* parked for background writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;


--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-07 07:49:28

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Prevent large file writeback starvation

On Mon, Feb 06, 2006 at 03:14:35PM -0800, Andrew Morton wrote:
>
> So to fix both these problems we need to be smarter about terminating the
> wb_kupdate() loop. Something like "loop until no expired inodes have been
> written".
>
> Wildly untested patch:

> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> + wbc.wrote_expired_inode = 0;
> writeback_inodes(&wbc);
> - if (wbc.nr_to_write > 0) {
> + if (wbc.wrote_expired_inode == 0) {
> if (wbc.encountered_congestion)
> blk_congestion_wait(WRITE, HZ/10);
> else

FWIW, there's a problem with the logic here - if we've encountered congestion,
we want to wait even if we wrote back expired inodes. Should it be:

if (!wbc.wrote_expired_inode && !wbc.encountered_congestion)
break; /* All the old data is written */
if (wbc.encountered_congestion)
blk_congestion_wait(WRITE, HZ/10);
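
For context, that termination logic would sit in the wb_kupdate() retry loop
roughly like this (a sketch assuming the wrote_expired_inode field from the
untested patch above, not actual committed code):

	while (nr_to_write > 0) {
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		wbc.wrote_expired_inode = 0;	/* hypothetical field */
		writeback_inodes(&wbc);
		if (!wbc.wrote_expired_inode && !wbc.encountered_congestion)
			break;		/* All the old data is written */
		if (wbc.encountered_congestion)
			blk_congestion_wait(WRITE, HZ/10);
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}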


Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-02-07 22:49:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Prevent large file writeback starvation

David Chinner <[email protected]> wrote:
>
> So, if we leave the inode on the "more to do" list, we need to prevent
> overwrite from monopolising writeout because the only thing stopping it
> now is the fact that the inode gets shoved to the dirty list by chance...
>
> I'm going to have a bit more of a think about this. Current
> patch attached below.

One concern I have about all this is sys_sync() and
umount->sync_inodes_sb(). Those functions _must_ write all inodes and wait
upon the result. If pdflush is concurrently moving inodes between the
various lists, we'll miss inodes.

I suspect we're wrong already. Adding more lists won't help. Possibly
adding some tests for sb->s_syncing in the right place will plug the
problem, but it'll be hard to test and won't do much to clarify things.

An alternative fix is to add locking, which might be acceptable.

2006-02-13 13:59:47

by Johannes Stezenbach

[permalink] [raw]
Subject: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

On Mon, Feb 06, 2006, Andrew Morton wrote:
> Mark Lord <[email protected]> wrote:
> >
> > A simple test I do for this:
> >
> > $ mkdir t
> > $ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)
> >
> > In another window, I do this:
> >
> > $ while (sleep 1); do echo -n "`date`: "; grep Dirty /proc/meminfo; done
> >
> > And then watch the count get large, but take virtually forever
> > to count back down to a "safe" value.
> >
> > Typing "sync" causes all the Dirty pages to immediately be flushed to disk,
> > as expected.
>
> I've never seen that happen and I don't recall seeing any other reports of
> it, so your machine must be doing something peculiar. I think it can
> happen if, say, an inode gets itself onto the wrong inode list, or
> incorrectly gets its dirty flag cleared.
>
> Are you using any unusual mount options, or unusual combinations of
> filesystems, or anything like that?

I've been seeing something like this for some time, but kept
silent as I'm forced to use vmware on my Thinkpad T42p (1G RAM,
but CONFIG_NOHIGHMEM=y).
Sometimes 'sync' takes several seconds, even when the machine
had been idle for >15mins. I don't have laptop mode enabled.
So far I've not found a deterministic way to reproduce this behaviour.

Anyway, I temporarily deinstalled vmware (deleted the kernel
modules and rebooted; kernel is still tainted because of madwifi
if that matters).
The behaviour I see with vmware (long 'sync' time) doesn't seem
to happen without it so far, but:

Now copying a 700MB file makes "Dirty" go up to 350MB. It then
slowly decreases to 325MB and stays there. However:

$ time sync

real 0m0.326s
user 0m0.000s
sys 0m0.280s

and output from the dirty monitor one-liner:

Mon Feb 13 14:31:43 CET 2006: Dirty: 325916 kB
Mon Feb 13 14:31:44 CET 2006: Dirty: 325916 kB
Mon Feb 13 14:31:45 CET 2006: Dirty: 4 kB
Mon Feb 13 14:31:46 CET 2006: Dirty: 8 kB


Clearly my notebook's hdd isn't that fast. ;-/
What does "Dirty" in /proc/meminfo really mean?

Kernel is 2.6.15, fs is ext3, .config etc. on request.


Johannes

2006-02-13 20:09:50

by Andrew Morton

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

Johannes Stezenbach <[email protected]> wrote:
>
> On Mon, Feb 06, 2006, Andrew Morton wrote:
> > Mark Lord <[email protected]> wrote:
> > >
> > > A simple test I do for this:
> > >
> > > $ mkdir t
> > > $ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)
> > >
> > > In another window, I do this:
> > >
> > > $ while (sleep 1); do echo -n "`date`: "; grep Dirty /proc/meminfo; done
> > >
> > > And then watch the count get large, but take virtually forever
> > > to count back down to a "safe" value.
> > >
> > > Typing "sync" causes all the Dirty pages to immediately be flushed to disk,
> > > as expected.
> >
> > I've never seen that happen and I don't recall seeing any other reports of
> > it, so your machine must be doing something peculiar. I think it can
> > happen if, say, an inode gets itself onto the wrong inode list, or
> > incorrectly gets its dirty flag cleared.
> >
> > Are you using any unusual mount options, or unusual combinations of
> > filesystems, or anything like that?
>
> I've been seeing something like this for some time, but kept
> silent as I'm forced to use vmware on my Thinkpad T42p (1G RAM,
> but CONFIG_NOHIGHMEM=y).
> Sometimes 'sync' takes several seconds, even when the machine
> had been idle for >15mins. I don't have laptop mode enabled.
> So far I've not found a deterministic way to reproduce this behaviour.
>
> Anyway, I temporarily deinstalled vmware (deleted the kernel
> modules and rebooted; kernel is still tainted because of madwifi
> if that matters).
> The behaviour I see with vmware (long 'sync' time) doesn't seem
> to happen without it so far, but:

VMware uses mmap a lot. Perhaps it's doing regular msyncs as well.

> Now copying a 700MB file makes "Dirty" go up to 350MB. It then
> slowly decreases to 325MB and stays there.

It shouldn't. Did you really leave it for long enough?

If you did, then why does it happen there and not here?

> However:
>
> $ time sync
>
> real 0m0.326s
> user 0m0.000s
> sys 0m0.280s
>
> and output from the dirty monitor one-liner:
>
> Mon Feb 13 14:31:43 CET 2006: Dirty: 325916 kB
> Mon Feb 13 14:31:44 CET 2006: Dirty: 325916 kB
> Mon Feb 13 14:31:45 CET 2006: Dirty: 4 kB
> Mon Feb 13 14:31:46 CET 2006: Dirty: 8 kB
>
>
> Clearly my notebook's hdd isn't that fast. ;-/
> What does "Dirty" in /proc/meminfo really mean?

The number of pages which are marked dirty, roughly.

In this case, all those pages' buffers had been written out by kjournald
behind the VM's back, so when the VM tried to write these "dirty" pages it
found that no I/O needed to be performed.

It would be nice if ext3 could clean the parent pages as it goes, but I
seem to recall deciding that this is not trivial. I guess we could get a
99.9% solution by trylocking the page.
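
Purely to illustrate the trylock idea, something along these lines (a
hypothetical helper, not real ext3 code; it also ignores the radix-tree
dirty tag and dirty-page accounting, which is part of why this isn't
trivial):

	/* After the journal has written a page's buffers, try to clear
	 * the page dirty bit so the VM no longer counts it as dirty.
	 * Give up if we can't get the page lock without blocking. */
	static void maybe_clean_page(struct page *page)
	{
		struct buffer_head *bh, *head;

		if (TestSetPageLocked(page))
			return;			/* trylock failed */
		if (page_has_buffers(page)) {
			bh = head = page_buffers(page);
			do {
				if (buffer_dirty(bh))
					goto out;	/* still has dirty data */
				bh = bh->b_this_page;
			} while (bh != head);
			ClearPageDirty(page);
		}
	out:
		unlock_page(page);
	}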

2006-02-13 22:48:58

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

On Mon, Feb 13, 2006, Andrew Morton wrote:
> Johannes Stezenbach <[email protected]> wrote:
> > Now copying a 700MB file makes "Dirty" go up to 350MB. It then
> > slowly decreases to 325MB and stays there.
>
> It shouldn't. Did you really leave it for long enough?
>
> If you did, then why does it happen there and not here?

Good question. I just repeated the exercise, rebooted and
copied a 700MB file. After ~30min "Dirty" is down to ~130MB, and
continues to decrease very slowly.

On my Desktop machine (P4 HT, 1G RAM) "Dirty" goes down near
zero after ~30sec, as expected.

Here's some output from sysctl -a (should all be default values,
I did not mess with any of those settings, but I'm not sure what
Debian does behind my back):

vm.swap_token_timeout = 300
vm.legacy_va_layout = 0
vm.vfs_cache_pressure = 100
vm.block_dump = 0
vm.laptop_mode = 0
vm.max_map_count = 65536
vm.min_free_kbytes = 3831
vm.lowmem_reserve_ratio = 256 256 32
vm.swappiness = 60
vm.nr_pdflush_threads = 2
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.page-cluster = 3
vm.overcommit_ratio = 50
vm.overcommit_memory = 0

.config and dmesg output attached.

Let me know if I can do more tests to find out what's going on.


Johannes


Attachments:
t42p-config (37.51 kB)
dmesg.txt (17.08 kB)

2006-02-13 23:06:01

by Andrew Morton

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

Johannes Stezenbach <[email protected]> wrote:
>
> On Mon, Feb 13, 2006, Andrew Morton wrote:
> > Johannes Stezenbach <[email protected]> wrote:
> > > Now copying a 700MB file makes "Dirty" go up to 350MB. It then
> > > slowly decreases to 325MB and stays there.
> >
> > It shouldn't. Did you really leave it for long enough?
> >
> > If you did, then why does it happen there and not here?
>
> Good question. I just repeated the exercise, rebooted and
> copied a 700MB file. After ~30min "Dirty" is down to ~130MB, and
> continues to decrease very slowly.
>
> On my Desktop machine (P4 HT, 1G RAM) "Dirty" goes down near
> zero after ~30sec, as expected.

Are you using any unusual mount options?

Which filesystem types are online (not that this should affect it...)

2006-02-13 23:31:32

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

On Mon, Feb 13, 2006 at 03:04:57PM -0800, Andrew Morton wrote:
> Johannes Stezenbach <[email protected]> wrote:
> >
> > On Mon, Feb 13, 2006, Andrew Morton wrote:
> > > Johannes Stezenbach <[email protected]> wrote:
> > > > Now copying a 700MB file makes "Dirty" go up to 350MB. It then
> > > > slowly decreases to 325MB and stays there.
> > >
> > > It shouldn't. Did you really leave it for long enough?
> > >
> > > If you did, then why does it happen there and not here?
> >
> > Good question. I just repeated the exercise, rebooted and
> > copied a 700MB file. After ~30min "Dirty" is down to ~130MB, and
> > continues to decrease very slowly.
> >
> > On my Desktop machine (P4 HT, 1G RAM) "Dirty" goes down near
> > zero after ~30sec, as expected.
>
> Are you using any unusual mount options?
>
> Which filesystem types are online (not that this should affect it...)

$ cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
proc /proc proc rw,nodiratime 0 0
sysfs /sys sysfs rw 0 0
usbfs /proc/bus/usb usbfs rw 0 0
/dev/root /dev/.static/dev ext3 rw,data=ordered 0 0
tmpfs /dev tmpfs rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/hda6 /home ext3 rw,data=ordered 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0
$

I found that if I copy a large number of small files (e.g. the linux
source tree), "Dirty" drops back near zero after ~30sec. It only stays
high when I copy large files.


Johannes

2006-02-13 23:52:26

by Mark Lord

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

Andrew Morton wrote:
> Johannes Stezenbach <[email protected]> wrote:
>> On Mon, Feb 06, 2006, Andrew Morton wrote:
>>> Mark Lord <[email protected]> wrote:
>>>> A simple test I do for this:
>>>>
>>>> $ mkdir t
>>>> $ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)
>>>> In another window, I do this:
>>>> $ while (sleep 1); do echo -n "`date`: "; grep Dirty /proc/meminfo; done
>>>> And then watch the count get large, but take virtually forever
>>>> to count back down to a "safe" value.
>>>> Typing "sync" causes all the Dirty pages to immediately be flushed to disk,
>>>> as expected.
...
>> I've been seeing something like this for some time, but kept
>> silent as I'm forced to use vmware on my Thinkpad T42p (1G RAM,
>> but CONFIG_NOHIGHMEM=y).
>> Sometimes 'sync' takes several seconds, even when the machine
>> had been idle for >15mins. I don't have laptop mode enabled.
>> So far I've not found a deterministic way to reproduce this behaviour.
>>
>> Anyway, I temporarily deinstalled vmware (deleted the kernel
>> modules and rebooted; kernel is still tainted because of madwifi
>> if that matters).
>> The behaviour I see with vmware (long 'sync' time) doesn't seem
>> to happen without it so far ...

Mmm.. Okay, all of my machines normally have VMWare-WS installed on them,
so that might just be the culprit.

MMMmm... isn't there an option somewhere in VMWare for lazy-writeback
or something like that, intended to speed up use of snapshots and
suspend/resume of VMs.. Ah, here is its description:

"Workstation uses a memory trimming technique to return unused virtual
machine memory to the host machine for other uses. While trimming usually
has little impact on performance and may be needed in low memory situations,
the I/O caused by memory trimming can sometimes interfere with disk-oriented
workload performance in a guest."

"Workstation uses a page sharing technique to allow guest memory pages with
identical contents to be stored as a single copy-on-write page. Page sharing
decreases host memory usage, but consumes some system resources, potentially
including I/O bandwidth. You may want to avoid this overhead for guests for
which host memory is plentiful and I/O latency is important."

Mmm.. so the intent is to affect only VMWare itself, not the rest of the
system while VMWare is dormant. I guess it's time to disable loading of
the VMWare modules and reboot. Bye bye uptime! Maybe I'll install
2.6.16-rc3 while I'm at it.

I'll follow-up with results in an hour or two.

Cheers

2006-02-14 00:50:07

by Mark Lord

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

Mark Lord wrote:
> Andrew Morton wrote:
>> Johannes Stezenbach <[email protected]> wrote:
>>> On Mon, Feb 06, 2006, Andrew Morton wrote:
...
>>> Anyway, I temporarily deinstalled vmware (deleted the kernel
>>> modules and rebooted; kernel is still tainted because of madwifi
>>> if that matters).
>>> The behaviour I see with vmware (long 'sync' time) doesn't seem
>>> to happen without it so far ...
>
> Mmm.. Okay, all of my machines normally have VMWare-WS installed on them,
> so that might just be the culprit.
...
> Mmm.. so the intent is to affect only VMWare itself, not the rest of the
> system while VMWare is dormant. I guess it's time to disable loading of
> the VMWare modules and reboot. Bye bye uptime! Maybe I'll install
> 2.6.16-rc3 while I'm at it.
>
> I'll follow-up with results in an hour or two.

Okay, results are inconclusive, because 2.6.16-rc3-git1 breaks VMWare;
the vmmon module no longer compiles. Unstable kernel non-API issue again.

So I'm back on 2.6.15 until that gets fixed.

Cheers

2006-02-14 16:32:37

by Mark Lord

[permalink] [raw]
Subject: Re: dirty pages (Was: Re: [PATCH] Prevent large file writeback starvation)

Mark Lord wrote:
> Andrew Morton wrote:
>> Johannes Stezenbach <[email protected]> wrote:
>>> On Mon, Feb 06, 2006, Andrew Morton wrote:
>>>> Mark Lord <[email protected]> wrote:
>>>>> A simple test I do for this:
>>>>>
>>>>> $ mkdir t
>>>>> $ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)
>>>>> In another window, I do this:
>>>>> $ while (sleep 1); do echo -n "`date`: "; grep Dirty
>>>>> /proc/meminfo; done
>>>>> And then watch the count get large, but take virtually forever
>>>>> to count back down to a "safe" value.
>>>>> Typing "sync" causes all the Dirty pages to immediately be flushed
>>>>> to disk,
>>>>> as expected.
> ...
>>> I've been seeing something like this for some time, but kept
>>> silent as I'm forced to use vmware on my Thinkpad T42p (1G RAM,
>>> but CONFIG_NOHIGHMEM=y).
>>> Sometimes 'sync' takes several seconds, even when the machine
>>> had been idle for >15mins. I don't have laptop mode enabled.
>>> So far I've not found a deterministic way to reproduce this behaviour.
>>>
>>> Anyway, I temporarily deinstalled vmware (deleted the kernel
>>> modules and rebooted; kernel is still tainted because of madwifi
>>> if that matters).
>>> The behaviour I see with vmware (long 'sync' time) doesn't seem
>>> to happen without it so far ...
>
> Mmm.. Okay, all of my machines normally have VMWare-WS installed on them,
> so that might just be the culprit.
>
> MMMmm... isn't there an option somewhere in VMWare for lazy-writeback
> or something like that, intended to speed up use of snapshots and
> suspend/resume of VMs.. Ah, here is its description:
>
> "Workstation uses a memory trimming technique to return unused virtual
> machine memory to the host machine for other uses. While trimming usually
> has little impact on performance and may be needed in low memory
> situations,
> the I/O caused by memory trimming can sometimes interfere with
> disk-oriented
> workload performance in a guest."
>
> "Workstation uses a page sharing technique to allow guest memory pages with
> identical contents to be stored as a single copy-on-write page. Page
> sharing
> decreases host memory usage, but consumes some system resources,
> potentially
> including I/O bandwidth. You may want to avoid this overhead for guests for
> which host memory is plentiful and I/O latency is important."
>
> Mmm.. so the intent is to affect only VMWare itself, not the rest of the
> system while VMWare is dormant. I guess it's time to disable loading of
> the VMWare modules and reboot. Bye bye uptime! Maybe I'll install
> 2.6.16-rc3 while I'm at it.
>
> I'll follow-up with results in an hour or two.

Okay, I fixed VMWare-WS-5.5.1 to work on 2.6.16-rc3, and tried my old test (above)
both *before* loading anything to do with VMWare, and again after using VMWare.

The old behaviour of keeping zillions of uncommitted dirty pages around is no
longer present, with or without VMWare. So that issue appears to have gone away.

But the Dirty value in /proc/meminfo stays in the hundreds of megabytes
until I do a "sync", at which point it almost immediately drops to zero
without actually doing hundreds of MB of writes at that point.

Weird.

Cheers

2006-03-20 22:40:26

by Alexander Bergolth

[permalink] [raw]
Subject: Re: [PATCH] Prevent large file writeback starvation

On 02/06/06 21:11, Andrew Morton wrote:
> Mark Lord <[email protected]> wrote:
>
>>A simple test I do for this:
>> $ mkdir t
>> $ cp /usr/src/*.bz2 t (about 400-500MB worth of kernel tar files)
>>
>> In another window, I do this:
>>
>> $ while (sleep 1); do echo -n "`date`: "; grep Dirty /proc/meminfo; done
>>
>> And then watch the count get large, but take virtually forever
>> to count back down to a "safe" value.
>>
>> Typing "sync" causes all the Dirty pages to immediately be flushed to disk,
>> as expected.
>
> I've never seen that happen and I don't recall seeing any other reports of
> it, so your machine must be doing something peculiar.

We are seeing the same issue on several boxes, using xfs on some of them
and ext3 on the others. Dirty pages are not periodically flushed to disk,
and even the sync command sometimes flushes only a small portion of
the dirty buffers.
The problem seems to arise after a few days of uptime; a freshly
booted system doesn't show the symptoms. (At least it doesn't show them
when I'm looking out for them. ;))

I've written a small test-script to visualize the behavior. (Attached.)

The script creates a 200MB file, monitors nr_dirty in /proc/vmstat and
executes sync after some time.

The output looks like this:

-------------------- snip! bad: --------------------
Linux slime.wu-wien.ac.at 2.6.14-1.1653_FC4smp #1 SMP Tue Dec 13
21:46:01 EST 2005 i686 i686 i386 GNU/Linux
12:22:46 up 10 days, 18:12, 37 users, load average: 0.17, 0.15, 0.09
12:22:46 start: head -c 200000000 /dev/zero
>/var/tmp/dirty-buffers.EFYFF13399 # nr_dirty 1076
12:22:46 # nr_dirty 1805
12:22:47 end: head -c 200000000 /dev/zero
>/var/tmp/dirty-buffers.EFYFF13399 # nr_dirty 31061
12:22:51 # nr_dirty 25671
12:22:56 # nr_dirty 25724
12:23:01 # nr_dirty 25724
12:23:06 # nr_dirty 25724
12:23:11 # nr_dirty 25724
12:23:16 # nr_dirty 25724
12:23:21 # nr_dirty 25724
12:23:26 # nr_dirty 25724
12:23:31 # nr_dirty 25724
12:23:36 # nr_dirty 25724
12:23:41 # nr_dirty 25724
12:23:47 # nr_dirty 25725
12:23:52 # nr_dirty 25726
12:23:57 # nr_dirty 25728
12:24:02 # nr_dirty 25728
12:24:07 # nr_dirty 25728
12:24:12 # nr_dirty 25728
12:24:12 # nr_dirty 25728
12:24:12 start: sync # nr_dirty 25728
12:24:12 end: sync # nr_dirty 23566
12:24:17 # nr_dirty 23566
12:24:22 # nr_dirty 23582
12:24:27 # nr_dirty 23583
12:24:32 # nr_dirty 23583
12:24:37 # nr_dirty 23583
12:24:42 # nr_dirty 23583
12:24:47 # nr_dirty 23583
12:24:52 # nr_dirty 23583
12:24:57 # nr_dirty 23583
-------------------- snip! --------------------

While the temp file is being written, some buffers are flushed (31061 -> 25671).
But after writing completes, the ~25000 buffers remain dirty and are
not flushed after 30 secs, as I would expect. The sync only shrinks the
dirty buffer count from 25728 to 23566, whereas I'd expect sync to bring
it down to near 0.

Here is the output of another system with a lower uptime that doesn't
show that behavior yet:

-------------------- snip! good: --------------------
Linux roaster.wu-wien.ac.at 2.6.12-1.1376_FC3.stk16smp #1 SMP Mon Aug 29
16:41:37 EDT 2005 i686 i686 i386 GNU/Linux
12:44:54 up 3 days, 1:50, 2 users, load average: 0.00, 0.16, 0.14
12:44:54 start: head -c 200000000 /dev/zero
>/tmp/dirty-buffers.cgRFjZ1720 # nr_dirty 2
12:44:54 # nr_dirty 2
12:44:55 end: head -c 200000000 /dev/zero >/tmp/dirty-buffers.cgRFjZ1720
# nr_dirty 31257
12:44:59 # nr_dirty 22239
12:45:04 # nr_dirty 22239
12:45:09 # nr_dirty 22239
12:45:14 # nr_dirty 22240
12:45:19 # nr_dirty 22240
12:45:24 # nr_dirty 22240
12:45:29 # nr_dirty 4830
12:45:34 # nr_dirty 1
12:45:39 # nr_dirty 1
12:45:44 # nr_dirty 2
12:45:49 # nr_dirty 2
12:45:54 # nr_dirty 2
12:45:59 # nr_dirty 2
12:46:04 # nr_dirty 1
12:46:09 # nr_dirty 1
12:46:14 # nr_dirty 1
12:46:19 # nr_dirty 1
12:46:19 # nr_dirty 1
12:46:19 start: sync # nr_dirty 1
12:46:19 end: sync # nr_dirty 0
12:46:24 # nr_dirty 0
12:46:29 # nr_dirty 0
12:46:34 # nr_dirty 0
12:46:39 # nr_dirty 0
12:46:44 # nr_dirty 0
12:46:49 # nr_dirty 0
12:46:54 # nr_dirty 0
12:46:59 # nr_dirty 1
12:47:04 # nr_dirty 1
-------------------- snip! --------------------

There are no special mount options used in the first example (ext3);
noatime is used in the second example (xfs).

Cheers,
--leo


Attachments:
dirty-buffers.sh (469.00 B)