2011-04-20 18:23:29

by Chris Mason

[permalink] [raw]
Subject: buffered writeback torture program

Hi everyone,

I dug out my old fsync latency tester to make sure Jens' new plugging
code hadn't caused regressions. This started off as a program Ted wrote
during the firefox dark times, and I added some more code to saturate
spindles with random IO.

The basic idea is:

1) make a nice big sequential 8GB file
2) fork a process doing random buffered writes inside that file
3) overwrite a second 4K file in a loop, doing fsyncs as you go.

The test program times how long each write and fsync take in step three.
The idea is that if we have problems with concurrent buffered writes and
fsyncs, then all of our fsyncs will get stuck behind the random IO
writeback and our latencies will be high.

For a while, xfs, btrfs and ext4 did really well at this test. Our
fsync latencies were very small and we all sent down synchronous IO that
the elevator dispatched straight to the drive.

Things have changed, though: both xfs and ext4 have grown code to do
dramatically more IO than write_cache_pages has asked for (I'm pretty
sure I told everyone this was a good idea at the time). When doing
sequential writes, this is a great idea. When doing random IO, it leads
to unbounded stalls in balance_dirty_pages.

Here's an example run on xfs:

# fsync-tester
setting up random write file
done setting up random write file
starting fsync run
starting random io!
write time 0.0009s fsync time: 2.0142s
write time 128.9305s fsync time: 2.6046s
run done 2 fsyncs total, killing random writer

In this case the 128s spent in write was on a single 4K overwrite on a
4K file.

ext4 is doing a blanket increase on wbc->nr_to_write, while xfs is
pushing down any other IO inside the same extent as the writepage it
just serviced. The test program ends up being a worst case for both of
them.

I honestly don't know what the right answer is here, but I'd say we
should be backing off our nr_to_write bumps if we notice a lot of random
IO.
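
To make that concrete, here is a toy userspace sketch of the kind of back-off
I mean. The struct, the helper and the numbers are stand-ins for illustration
(toy_wbc is not the kernel's writeback_control); the point is just that the
extra nr_to_write budget survives only while the pages being pushed stay
sequential:

/*
 * Toy model only: toy_wbc stands in for struct writeback_control, and the
 * numbers are arbitrary.  The bonus writeback budget is only spent while
 * the dirty pages we are pushing stay sequential; the first seek cuts it
 * off so random writeback can't pin down a waiting fsync for long.
 */
#include <stdio.h>

struct toy_wbc {
        long nr_to_write;       /* pages the flusher actually asked for */
        long extra_budget;      /* how far we allow ourselves to overshoot */
};

static int keep_writing(struct toy_wbc *wbc, long index, long *last_index)
{
        if (*last_index >= 0 && index != *last_index + 1)
                wbc->extra_budget = 0;          /* random IO: stop overshooting */
        *last_index = index;

        if (--wbc->nr_to_write > 0)
                return 1;                       /* still inside the original request */
        return wbc->extra_budget-- > 0;         /* overshoot only while sequential */
}

int main(void)
{
        struct toy_wbc wbc = { .nr_to_write = 4, .extra_budget = 8 };
        long dirty_pages[] = { 10, 11, 12, 13, 14, 900, 901, 17 };
        long last = -1;
        int i;

        for (i = 0; i < 8; i++) {
                int more = keep_writing(&wbc, dirty_pages[i], &last);

                printf("page %ld -> %s\n", dirty_pages[i], more ? "write" : "stop");
                if (!more)
                        break;
        }
        return 0;
}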

Btrfs does bump wbc->nr_to_write as well, but only as far as we have
sequential IO to process. We also end up throttling the random IO
harder because it is a worst case for our delayed allocation reservation
code.

The test program is below, and can also be found at:

http://oss.oracle.com/~mason/fsync-tester.c

Usage: gcc -Wall -o fsync-tester fsync-tester.c

mkfs.your_favorite /dev/xxx
mount /dev/xxx /mnt
cd /mnt
fsync-tester

It creates a single 8GB file and one 4K file, so your FS must be at
least that big.

-chris

/*
 * fsync-tester.c
 *
 * Written by Theodore Ts'o, 3/21/09. Updated by Chris Mason to include
 * the random writer thread.
 *
 * This file may be redistributed under the terms of the GNU Public
 * License, version 2.
 */
#define _FILE_OFFSET_BITS 64
#define _XOPEN_SOURCE 500
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <signal.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>

#define SIZE (32768*32)
static char bigbuf[4096];

static float timeval_subtract(struct timeval *tv1, struct timeval *tv2)
{
        return ((tv1->tv_sec - tv2->tv_sec) +
                ((float) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
}

static void random_io(int fd, loff_t total)
{
        loff_t cur = 0;
        int ret;

        /* just some constant so our runs are always the same */
        srand(4096);
        while(1) {
                /*
                 * we want a random offset into the file, but rand() only
                 * returns values up to RAND_MAX.  So we make it a random
                 * block number instead, and multiply by 4096.
                 */
                cur = rand();
                cur = (cur * 4096) % (total - 4096);

                /* align our offset to 4k */
                cur = cur / 4096;
                cur = cur * 4096;
                ret = pwrite(fd, bigbuf, 4096, cur);
                if (ret < 4096) {
                        fprintf(stderr, "short write ret %d cur %llu\n",
                                ret, (unsigned long long)cur);
                        exit(1);
                }
        }
}

int main(int argc, char **argv)
{
        int fd;
        struct timeval tv, tv2, start;
        char buf[SIZE];
        pid_t pid;
        loff_t total = ((loff_t)8) * 1024 * 1024 * 1024;
        loff_t cur = 0;
        int rand_fd;
        int ret;
        int i;
        int status;
        struct stat st;

        memset(bigbuf, 0, 4096);

        /* O_CREAT needs an explicit mode */
        rand_fd = open("fsync-tester.rnd-file", O_WRONLY|O_CREAT, 0644);
        if (rand_fd < 0) {
                perror("open");
                exit(1);
        }

        ret = fstat(rand_fd, &st);
        if (ret < 0) {
                perror("fstat");
                exit(1);
        }

        if (st.st_size < total) {
                printf("setting up random write file\n");
                while(cur < total) {
                        ret = write(rand_fd, bigbuf, 4096);
                        if (ret <= 0) {
                                fprintf(stderr, "short write\n");
                                exit(1);
                        }
                        cur += ret;
                }
                printf("done setting up random write file\n");
        }

        fd = open("fsync-tester.tst-file", O_WRONLY|O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                exit(1);
        }
        memset(buf, 'a', SIZE);

        pid = fork();
        if (pid == 0) {
                printf("starting random io!\n");
                random_io(rand_fd, total);
                exit(0);
        }

        close(rand_fd);

        gettimeofday(&start, NULL);
        printf("starting fsync run\n");
        for(i = 0; i < 60; i++) {
                float pwrite_time;

                gettimeofday(&tv2, NULL);
                pwrite(fd, buf, SIZE, 0);
                gettimeofday(&tv, NULL);

                pwrite_time = timeval_subtract(&tv, &tv2);

                fsync(fd);
                gettimeofday(&tv2, NULL);

                printf("write time: %5.4fs fsync time: %5.4fs\n", pwrite_time,
                       timeval_subtract(&tv2, &tv));

                if (timeval_subtract(&tv2, &start) > 60)
                        break;
                sleep(4);
        }
        printf("run done %d fsyncs total, killing random writer\n", i + 1);
        fflush(stdout);
        kill(pid, SIGTERM);
        wait(&status);

        return 0;
}





2011-04-20 22:06:37

by Vivek Goyal

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Wed, Apr 20, 2011 at 02:23:29PM -0400, Chris Mason wrote:
> Hi everyone,
>
> I dug out my old fsync latency tester to make sure Jens' new plugging
> code hadn't caused regressions. This started off as a program Ted wrote
> during the firefox dark times, and I added some more code to saturate
> spindles with random IO.
>
> The basic idea is:
>
> 1) make a nice big sequential 8GB file
> 2) fork a process doing random buffered writes inside that file
> 3) overwrite a second 4K file in a loop, doing fsyncs as you go.
>
> The test program times how long each write and fsync take in step three.
> The idea is that if we have problems with concurrent buffered writes and
> fsyncs, then all of our fsyncs will get stuck behind the random IO
> writeback and our latencies will be high.
>
> For a while, xfs, btrfs and ext4 did really well at this test. Our
> fsync latencies were very small and we all sent down synchronous IO that
> the elevator dispatched straight to the drive.
>
> Things have changed though, both xfs and ext4 have grown code to do
> dramatically more IO than write_cache_pages has asked for (I'm pretty
> sure I told everyone this was a good idea at the time). When doing
> sequential writes, this is a great idea. When doing random IO, it leads
> to unbound stalls in balance_dirty_pages.
>
> Here's an example run on xfs:
>
> # fsync-tester
> setting up random write file
> done setting up random write file
> starting fsync run
> starting random io!
> write time 0.0009s fsync time: 2.0142s
> write time 128.9305s fsync time: 2.6046s
> run done 2 fsyncs total, killing random writer
>
> In this case the 128s spent in write was on a single 4K overwrite on a
> 4K file.

Chris, you seem to be doing 1MB (32768*32) writes on the fsync file instead of 4K.
I changed the size to 4K; still not much difference though.
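
For reference, the size change is just this one-liner in the test program:

-#define SIZE (32768*32)
+#define SIZE 4096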

Once the program has exited because of the high write time, I restarted it and
this time I don't see high write times.

First run
---------
# ./a.out
setting up random write file
done setting up random write file
starting fsync run
starting random io!
write time: 0.0006s fsync time: 0.3400s
write time: 63.3270s fsync time: 0.3760s
run done 2 fsyncs total, killing random writer

Second run
----------
# ./a.out
starting fsync run
starting random io!
write time: 0.0006s fsync time: 0.5359s
write time: 0.0007s fsync time: 0.3559s
write time: 0.0009s fsync time: 0.3113s
write time: 0.0008s fsync time: 0.4336s
write time: 0.0009s fsync time: 0.3780s
write time: 0.0008s fsync time: 0.3114s
write time: 0.0009s fsync time: 0.3225s
write time: 0.0009s fsync time: 0.3891s
write time: 0.0009s fsync time: 0.4336s
write time: 0.0009s fsync time: 0.4225s
write time: 0.0009s fsync time: 0.4114s
write time: 0.0007s fsync time: 0.4004s

Not sure why that would happen.

I am wondering why the pwrite/fsync process was throttled. It did not have any
pages in the page cache and it shouldn't have hit the per-task dirty limits. Does
that mean the per-task dirty limit logic does not work, or am I completely missing
the root cause of the problem?

Thanks
Vivek

2011-04-21 08:32:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Wed, Apr 20, 2011 at 02:23:29PM -0400, Chris Mason wrote:
> # fsync-tester
> setting up random write file
> done setting up random write file
> starting fsync run
> starting random io!
> write time 0.0009s fsync time: 2.0142s
> write time 128.9305s fsync time: 2.6046s
> run done 2 fsyncs total, killing random writer
>
> In this case the 128s spent in write was on a single 4K overwrite on a
> 4K file.

I can't really reproduce this locally on XFS:

setting up random write file
done setting up random write file
starting fsync run
starting random io!
write time: 0.0023s fsync time: 0.5949s
write time: 0.0605s fsync time: 0.2339s
write time: 0.0018s fsync time: 0.0179s
write time: 0.0020s fsync time: 0.0201s
write time: 0.0019s fsync time: 0.0176s
write time: 0.0018s fsync time: 0.0209s
write time: 0.0025s fsync time: 0.0197s
write time: 0.0013s fsync time: 0.0183s
write time: 0.0013s fsync time: 0.0217s
write time: 0.0016s fsync time: 0.0158s
write time: 0.0022s fsync time: 0.0240s
write time: 0.0024s fsync time: 0.0190s
write time: 0.0017s fsync time: 0.0205s
write time: 0.0030s fsync time: 0.0688s
write time: 0.0045s fsync time: 0.0193s
write time: 0.0022s fsync time: 0.0356s

But given that you are able to reproduce it, does the following patch
help your latencies? Currently XFS actually does stop I/O when
nr_to_write reaches zero, but only for non-blocking I/O. This behaviour
was introduced in commit efceab1d563153a2b1a6e7d35376241a48126989

"xfs: handle negative wbc->nr_to_write during sync writeback"

and works around issues in the generic writeback code.


Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c 2011-04-21 10:20:48.303550404 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c 2011-04-21 10:20:58.203496773 +0200
@@ -765,8 +765,7 @@ xfs_convert_page(
SetPageUptodate(page);

if (count) {
- if (--wbc->nr_to_write <= 0 &&
- wbc->sync_mode == WB_SYNC_NONE)
+ if (--wbc->nr_to_write <= 0)
done = 1;
}
xfs_start_page_writeback(page, !page_dirty, count);

2011-04-21 11:09:11

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Vivek Goyal's message of 2011-04-20 18:06:26 -0400:
> On Wed, Apr 20, 2011 at 02:23:29PM -0400, Chris Mason wrote:
> > Hi everyone,
> >
> > I dug out my old fsync latency tester to make sure Jens' new plugging
> > code hadn't caused regressions. This started off as a program Ted wrote
> > during the firefox dark times, and I added some more code to saturate
> > spindles with random IO.
> >
> > The basic idea is:
> >
> > 1) make a nice big sequential 8GB file
> > 2) fork a process doing random buffered writes inside that file
> > 3) overwrite a second 4K file in a loop, doing fsyncs as you go.
> >
> > The test program times how long each write and fsync take in step three.
> > The idea is that if we have problems with concurrent buffered writes and
> > fsyncs, then all of our fsyncs will get stuck behind the random IO
> > writeback and our latencies will be high.
> >
> > For a while, xfs, btrfs and ext4 did really well at this test. Our
> > fsync latencies were very small and we all sent down synchronous IO that
> > the elevator dispatched straight to the drive.
> >
> > Things have changed though, both xfs and ext4 have grown code to do
> > dramatically more IO than write_cache_pages has asked for (I'm pretty
> > sure I told everyone this was a good idea at the time). When doing
> > sequential writes, this is a great idea. When doing random IO, it leads
> > to unbound stalls in balance_dirty_pages.
> >
> > Here's an example run on xfs:
> >
> > # fsync-tester
> > setting up random write file
> > done setting up random write file
> > starting fsync run
> > starting random io!
> > write time 0.0009s fsync time: 2.0142s
> > write time 128.9305s fsync time: 2.6046s
> > run done 2 fsyncs total, killing random writer
> >
> > In this case the 128s spent in write was on a single 4K overwrite on a
> > 4K file.
>
> Chris, You seem to be doing 1MB (32768*32) writes on fsync file instead of 4K.
> I changed the size to 4K still not much difference though.

Whoops, I had that change made locally but didn't get it copied out.

>
> Once the program has exited because of high write time, i restarted it and
> this time I don't see high write times.

I see this for some of my runs as well.

>
> First run
> ---------
> # ./a.out
> setting up random write file
> done setting up random write file
> starting fsync run
> starting random io!
> write time: 0.0006s fsync time: 0.3400s
> write time: 63.3270s fsync time: 0.3760s
> run done 2 fsyncs total, killing random writer
>
> Second run
> ----------
> # ./a.out
> starting fsync run
> starting random io!
> write time: 0.0006s fsync time: 0.5359s
> write time: 0.0007s fsync time: 0.3559s
> write time: 0.0009s fsync time: 0.3113s
> write time: 0.0008s fsync time: 0.4336s
> write time: 0.0009s fsync time: 0.3780s
> write time: 0.0008s fsync time: 0.3114s
> write time: 0.0009s fsync time: 0.3225s
> write time: 0.0009s fsync time: 0.3891s
> write time: 0.0009s fsync time: 0.4336s
> write time: 0.0009s fsync time: 0.4225s
> write time: 0.0009s fsync time: 0.4114s
> write time: 0.0007s fsync time: 0.4004s
>
> Not sure why would that happen.
>
> I am wondering why pwrite/fsync process was throttled. It did not have any
> pages in page cache and it shouldn't have hit the task dirty limits. Does that
> mean per task dirty limit logic does not work or I am completely missing
> the root cause of the problem.

I haven't traced it to see. This test box only has 1GB of ram, so the
dirty ratios can be very tight.
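For what it's worth, assuming this box is at the usual defaults
(vm.dirty_ratio=20, vm.dirty_background_ratio=10), 1GB of RAM works out to a
global dirty limit of roughly 200MB and a background threshold of roughly
100MB, and the effective per-task limit for a task that has recently dirtied
a lot of pages is lower still.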

-chris

2011-04-21 15:25:41

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Chris Mason's message of 2011-04-21 07:09:11 -0400:
> Excerpts from Vivek Goyal's message of 2011-04-20 18:06:26 -0400:
> > >
> > > In this case the 128s spent in write was on a single 4K overwrite on a
> > > 4K file.
> >
> > Chris, You seem to be doing 1MB (32768*32) writes on fsync file instead of 4K.
> > I changed the size to 4K still not much difference though.
>
> Whoops, I had that change made locally but didn't get it copied out.
>
> >
> > Once the program has exited because of high write time, i restarted it and
> > this time I don't see high write times.
>
> I see this for some of my runs as well.
>
> >
> > First run
> > ---------
> > # ./a.out
> > setting up random write file
> > done setting up random write file
> > starting fsync run
> > starting random io!
> > write time: 0.0006s fsync time: 0.3400s
> > write time: 63.3270s fsync time: 0.3760s
> > run done 2 fsyncs total, killing random writer
> >
> > Second run
> > ----------
> > # ./a.out
> > starting fsync run
> > starting random io!
> > write time: 0.0006s fsync time: 0.5359s
> > write time: 0.0007s fsync time: 0.3559s
> > write time: 0.0009s fsync time: 0.3113s
> > write time: 0.0008s fsync time: 0.4336s
> > write time: 0.0009s fsync time: 0.3780s
> > write time: 0.0008s fsync time: 0.3114s
> > write time: 0.0009s fsync time: 0.3225s
> > write time: 0.0009s fsync time: 0.3891s
> > write time: 0.0009s fsync time: 0.4336s
> > write time: 0.0009s fsync time: 0.4225s
> > write time: 0.0009s fsync time: 0.4114s
> > write time: 0.0007s fsync time: 0.4004s
> >
> > Not sure why would that happen.
> >
> > I am wondering why pwrite/fsync process was throttled. It did not have any
> > pages in page cache and it shouldn't have hit the task dirty limits. Does that
> > mean per task dirty limit logic does not work or I am completely missing
> > the root cause of the problem.
>
> I haven't traced it to see. This test box only has 1GB of ram, so the
> dirty ratios can be very tight.

Oh, I see now. The test program first creates the file with a big
streaming write. So the task doing the streaming writes gets nailed
with the per-task dirty accounting because it is making a ton of dirty
data.

Then the task forks the random writer to do all the random IO.

Then the original pid goes back to do the fsyncs on the new file.

So, in the original run, we get stuffed into balance_dirty_pages because
the per-task limits show we've done a lot of dirties.

In all later runs, the file already exists, so our fsyncing process
hasn't done much dirtying at all. Looks like the VM is doing something
sane, we just get nailed with big random IO.

-chris

2011-04-21 15:36:05

by Vivek Goyal

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Thu, Apr 21, 2011 at 11:25:41AM -0400, Chris Mason wrote:
> Excerpts from Chris Mason's message of 2011-04-21 07:09:11 -0400:
> > Excerpts from Vivek Goyal's message of 2011-04-20 18:06:26 -0400:
> > > >
> > > > In this case the 128s spent in write was on a single 4K overwrite on a
> > > > 4K file.
> > >
> > > Chris, You seem to be doing 1MB (32768*32) writes on fsync file instead of 4K.
> > > I changed the size to 4K still not much difference though.
> >
> > Whoops, I had that change made locally but didn't get it copied out.
> >
> > >
> > > Once the program has exited because of high write time, i restarted it and
> > > this time I don't see high write times.
> >
> > I see this for some of my runs as well.
> >
> > >
> > > First run
> > > ---------
> > > # ./a.out
> > > setting up random write file
> > > done setting up random write file
> > > starting fsync run
> > > starting random io!
> > > write time: 0.0006s fsync time: 0.3400s
> > > write time: 63.3270s fsync time: 0.3760s
> > > run done 2 fsyncs total, killing random writer
> > >
> > > Second run
> > > ----------
> > > # ./a.out
> > > starting fsync run
> > > starting random io!
> > > write time: 0.0006s fsync time: 0.5359s
> > > write time: 0.0007s fsync time: 0.3559s
> > > write time: 0.0009s fsync time: 0.3113s
> > > write time: 0.0008s fsync time: 0.4336s
> > > write time: 0.0009s fsync time: 0.3780s
> > > write time: 0.0008s fsync time: 0.3114s
> > > write time: 0.0009s fsync time: 0.3225s
> > > write time: 0.0009s fsync time: 0.3891s
> > > write time: 0.0009s fsync time: 0.4336s
> > > write time: 0.0009s fsync time: 0.4225s
> > > write time: 0.0009s fsync time: 0.4114s
> > > write time: 0.0007s fsync time: 0.4004s
> > >
> > > Not sure why would that happen.
> > >
> > > I am wondering why pwrite/fsync process was throttled. It did not have any
> > > pages in page cache and it shouldn't have hit the task dirty limits. Does that
> > > mean per task dirty limit logic does not work or I am completely missing
> > > the root cause of the problem.
> >
> > I haven't traced it to see. This test box only has 1GB of ram, so the
> > dirty ratios can be very tight.
>
> Oh, I see now. The test program first creates the file with a big
> streaming write. So the task doing the streaming writes gets nailed
> with the per-task dirty accounting because it is making a ton of dirty
> data.
>
> Then the task forks the random writer to do all the random IO.
>
> Then the original pid goes back to do the fsyncs on the new file.
>
> So, in the original run, we get stuffed into balance_dirty_pages because
> the per-task limits show we've done a lot of dirties.
>
> In all later runs, the file already exists, so our fsyncing process
> hasn't done much dirtying at all. Looks like the VM is doing something
> sane, we just get nailed with big random IO.

Ok, that makes sense. So the initial file creation accounted a lot of buffered
IO to this process, hence the VM thinks it has crossed its dirty limits, and later
this task comes along with a 4K write and gets throttled.

Thanks
Vivek

2011-04-21 16:55:29

by Jan Kara

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Thu 21-04-11 11:25:41, Chris Mason wrote:
> Excerpts from Chris Mason's message of 2011-04-21 07:09:11 -0400:
> > Excerpts from Vivek Goyal's message of 2011-04-20 18:06:26 -0400:
> > > >
> > > > In this case the 128s spent in write was on a single 4K overwrite on a
> > > > 4K file.
> > >
> > > Chris, You seem to be doing 1MB (32768*32) writes on fsync file instead of 4K.
> > > I changed the size to 4K still not much difference though.
> >
> > Whoops, I had that change made locally but didn't get it copied out.
> >
> > >
> > > Once the program has exited because of high write time, i restarted it and
> > > this time I don't see high write times.
> >
> > I see this for some of my runs as well.
> >
> > >
> > > First run
> > > ---------
> > > # ./a.out
> > > setting up random write file
> > > done setting up random write file
> > > starting fsync run
> > > starting random io!
> > > write time: 0.0006s fsync time: 0.3400s
> > > write time: 63.3270s fsync time: 0.3760s
> > > run done 2 fsyncs total, killing random writer
> > >
> > > Second run
> > > ----------
> > > # ./a.out
> > > starting fsync run
> > > starting random io!
> > > write time: 0.0006s fsync time: 0.5359s
> > > write time: 0.0007s fsync time: 0.3559s
> > > write time: 0.0009s fsync time: 0.3113s
> > > write time: 0.0008s fsync time: 0.4336s
> > > write time: 0.0009s fsync time: 0.3780s
> > > write time: 0.0008s fsync time: 0.3114s
> > > write time: 0.0009s fsync time: 0.3225s
> > > write time: 0.0009s fsync time: 0.3891s
> > > write time: 0.0009s fsync time: 0.4336s
> > > write time: 0.0009s fsync time: 0.4225s
> > > write time: 0.0009s fsync time: 0.4114s
> > > write time: 0.0007s fsync time: 0.4004s
> > >
> > > Not sure why would that happen.
> > >
> > > I am wondering why pwrite/fsync process was throttled. It did not have any
> > > pages in page cache and it shouldn't have hit the task dirty limits. Does that
> > > mean per task dirty limit logic does not work or I am completely missing
> > > the root cause of the problem.
> >
> > I haven't traced it to see. This test box only has 1GB of ram, so the
> > dirty ratios can be very tight.
>
> Oh, I see now. The test program first creates the file with a big
> streaming write. So the task doing the streaming writes gets nailed
> with the per-task dirty accounting because it is making a ton of dirty
> data.
>
> Then the task forks the random writer to do all the random IO.
>
> Then the original pid goes back to do the fsyncs on the new file.
>
> So, in the original run, we get stuffed into balance_dirty_pages because
> the per-task limits show we've done a lot of dirties.
>
> In all later runs, the file already exists, so our fsyncing process
> hasn't done much dirtying at all. Looks like the VM is doing something
> sane, we just get nailed with big random IO.
Ok, so there isn't a problem with fsync() as such if I understand it
right. We just block tasks in balance_dirty_pages() for a *long* time
because it takes a long time to write out that dirty IO, and we make it even
worse by trying to write out more on behalf of the throttled task. Am I
right? The IO-less throttling will solve this regardless of which patchset we
choose, so I wouldn't be too worried about the problem now.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-21 16:59:15

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Jan Kara's message of 2011-04-21 12:55:29 -0400:
> On Thu 21-04-11 11:25:41, Chris Mason wrote:
> > Excerpts from Chris Mason's message of 2011-04-21 07:09:11 -0400:
> > > Excerpts from Vivek Goyal's message of 2011-04-20 18:06:26 -0400:
> > > > >
> > > > > In this case the 128s spent in write was on a single 4K overwrite on a
> > > > > 4K file.
> > > >
> > > > Chris, You seem to be doing 1MB (32768*32) writes on fsync file instead of 4K.
> > > > I changed the size to 4K still not much difference though.
> > >
> > > Whoops, I had that change made locally but didn't get it copied out.
> > >
> > > >
> > > > Once the program has exited because of high write time, i restarted it and
> > > > this time I don't see high write times.
> > >
> > > I see this for some of my runs as well.
> > >
> > > >
> > > > First run
> > > > ---------
> > > > # ./a.out
> > > > setting up random write file
> > > > done setting up random write file
> > > > starting fsync run
> > > > starting random io!
> > > > write time: 0.0006s fsync time: 0.3400s
> > > > write time: 63.3270s fsync time: 0.3760s
> > > > run done 2 fsyncs total, killing random writer
> > > >
> > > > Second run
> > > > ----------
> > > > # ./a.out
> > > > starting fsync run
> > > > starting random io!
> > > > write time: 0.0006s fsync time: 0.5359s
> > > > write time: 0.0007s fsync time: 0.3559s
> > > > write time: 0.0009s fsync time: 0.3113s
> > > > write time: 0.0008s fsync time: 0.4336s
> > > > write time: 0.0009s fsync time: 0.3780s
> > > > write time: 0.0008s fsync time: 0.3114s
> > > > write time: 0.0009s fsync time: 0.3225s
> > > > write time: 0.0009s fsync time: 0.3891s
> > > > write time: 0.0009s fsync time: 0.4336s
> > > > write time: 0.0009s fsync time: 0.4225s
> > > > write time: 0.0009s fsync time: 0.4114s
> > > > write time: 0.0007s fsync time: 0.4004s
> > > >
> > > > Not sure why would that happen.
> > > >
> > > > I am wondering why pwrite/fsync process was throttled. It did not have any
> > > > pages in page cache and it shouldn't have hit the task dirty limits. Does that
> > > > mean per task dirty limit logic does not work or I am completely missing
> > > > the root cause of the problem.
> > >
> > > I haven't traced it to see. This test box only has 1GB of ram, so the
> > > dirty ratios can be very tight.
> >
> > Oh, I see now. The test program first creates the file with a big
> > streaming write. So the task doing the streaming writes gets nailed
> > with the per-task dirty accounting because it is making a ton of dirty
> > data.
> >
> > Then the task forks the random writer to do all the random IO.
> >
> > Then the original pid goes back to do the fsyncs on the new file.
> >
> > So, in the original run, we get stuffed into balance_dirty_pages because
> > the per-task limits show we've done a lot of dirties.
> >
> > In all later runs, the file already exists, so our fsyncing process
> > hasn't done much dirtying at all. Looks like the VM is doing something
> > sane, we just get nailed with big random IO.
> Ok, so there isn't a problem with fsync() as such if I understand it
> right. We just block tasks in balance_dirty_pages() for a *long* time
> because it takes long time to write out that dirty IO and we make it even
> worse by trying to writeout more on behalf of the throttled task. Am I
> right? The IO-less throttling will solve this regardless of patchset we
> choose so I wouldn't be too worried about the problem now.

You're right. With one small exception, we probably do want to rotor
out of the random buffered writes in hopes of finding some sequential IO
even with the IO-less dirty throttling.

But I definitely do expect this to get better with any of the IOless
code.

-chris

2011-04-21 17:36:52

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Christoph Hellwig's message of 2011-04-21 04:32:58 -0400:
> On Wed, Apr 20, 2011 at 02:23:29PM -0400, Chris Mason wrote:
> But given that you are able to reproduce it, does the following patch
> help your latencies? Currently XFS actually does stop I/O when
> nr_to_write reaches zero, but only for non-blocking I/O. This behaviour
> was introduced in commit efceab1d563153a2b1a6e7d35376241a48126989
>
> "xfs: handle negative wbc->nr_to_write during sync writeback"
>
> and works around issues in the generic writeback code.
>
>
> Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c 2011-04-21 10:20:48.303550404 +0200
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c 2011-04-21 10:20:58.203496773 +0200
> @@ -765,8 +765,7 @@ xfs_convert_page(
> SetPageUptodate(page);
>
> if (count) {
> - if (--wbc->nr_to_write <= 0 &&
> - wbc->sync_mode == WB_SYNC_NONE)
> + if (--wbc->nr_to_write <= 0)
> done = 1;
> }
> xfs_start_page_writeback(page, !page_dirty, count);

Sorry, this doesn't do it. I think that given what a strange special
case this is, we're best off waiting for the IO-less throttling, and
maybe changing the code in xfs/ext4 to be a little more seek aware. Or
maybe not, it has to get written eventually either way.

-chris

2011-04-21 17:41:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Thu, Apr 21, 2011 at 01:34:44PM -0400, Chris Mason wrote:
> Sorry, this doesn't do it. I think that given what a strange special
> case this is, we're best off waiting for the IO-less throttling, and
> maybe changing the code in xfs/ext4 to be a little more seek aware. Or
> maybe not, it has to get written eventually either way.

I'm not sure what you mean by seek aware. XFS only clusters
additional pages that are in the same extent, and in fact only does
so for asynchronous writeback. Not sure how this should be more
seek aware.


2011-04-21 17:59:39

by Andreas Dilger

[permalink] [raw]
Subject: Re: buffered writeback torture program

On 2011-04-21, at 11:41 AM, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 01:34:44PM -0400, Chris Mason wrote:
>> Sorry, this doesn't do it. I think that given what a strange special
>> case this is, we're best off waiting for the IO-less throttling, and
>> maybe changing the code in xfs/ext4 to be a little more seek aware. Or
>> maybe not, it has to get written eventually either way.
>
> I'm not sure what you mean with seek aware. XFS only clusters
> additional pages that are in the same extent, and in fact only does
> so for asynchrononous writeback. Not sure how this should be more
> seek aware.

But doesn't XFS have potentially very large extents, especially in the case of files that were fallocate()'d or linearly written? If there is a single 8GB extent, and then random writes within that extent (which seems very database-like), grouping all of the writes in the extent doesn't seem so great.


Cheers, Andreas






2011-04-21 18:02:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Thu, Apr 21, 2011 at 11:59:37AM -0600, Andreas Dilger wrote:
> But doesn't XFS have potentially very large extents, especially in the case of files that were fallocate()'d or linearly written? If there is a single 8GB extent, and then random writes within that extent (seems very database like) grouping the all of the writes in the extent doesn't seem so great.

It doesn't cluster any writes in an extent. It only writes out
additional dirty pages directly following that one we were asked to
write out. As soon as we hit a non-dirty page we give up.
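
In toy form that loop looks something like the sketch below. This is pure
illustration with made-up names, not the actual xfs_convert_page code; the
only point is the stop conditions (first clean page, or the nr_to_write
budget running out):

/*
 * Toy stand-in for the clustering described above: starting from the page
 * ->writepage was asked to write, keep taking the immediately following
 * pages while they are dirty, and stop at the first clean page or when
 * the nr_to_write budget runs out.
 */
#include <stdio.h>

static int page_dirty[16] = { 1, 1, 1, 0, 1, 1 };       /* fake dirty bitmap */

static int cluster_writeback(int start, long *nr_to_write)
{
        int idx = start;

        while (idx < 16 && page_dirty[idx]) {
                page_dirty[idx] = 0;            /* "write" the page */
                idx++;
                if (--(*nr_to_write) <= 0)      /* honor the flusher's budget */
                        break;
        }
        return idx - start;                     /* pages submitted in this cluster */
}

int main(void)
{
        long budget = 1024;

        printf("wrote %d pages starting at 0\n", cluster_writeback(0, &budget));
        printf("wrote %d pages starting at 4\n", cluster_writeback(4, &budget));
        return 0;
}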


2011-04-21 18:02:36

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Christoph Hellwig's message of 2011-04-21 13:41:21 -0400:
> On Thu, Apr 21, 2011 at 01:34:44PM -0400, Chris Mason wrote:
> > Sorry, this doesn't do it. I think that given what a strange special
> > case this is, we're best off waiting for the IO-less throttling, and
> > maybe changing the code in xfs/ext4 to be a little more seek aware. Or
> > maybe not, it has to get written eventually either way.
>
> I'm not sure what you mean with seek aware. XFS only clusters
> additional pages that are in the same extent, and in fact only does
> so for asynchrononous writeback. Not sure how this should be more
> seek aware.
>

How big are extents? fiemap tells me the file has a single 8GB extent.

There's a little room for seeking inside there.

-chris

2011-04-21 18:02:43

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Christoph Hellwig's message of 2011-04-21 14:02:13 -0400:
> On Thu, Apr 21, 2011 at 11:59:37AM -0600, Andreas Dilger wrote:
> > But doesn't XFS have potentially very large extents, especially in the case of files that were fallocate()'d or linearly written? If there is a single 8GB extent, and then random writes within that extent (seems very database like) grouping the all of the writes in the extent doesn't seem so great.
>
> It doesn't cluster any writes in an extent. It only writes out
> additional dirty pages directly following that one we were asked to
> write out. As soon as we hit a non-dirty page we give up.

For this program, they are almost all dirty pages.

I tried patching it to give up if we seek but it is still pretty slow.
There's something else going on in addition to the xfs clustering being
too aggressive.

I'll try dropping the clustering completely.

-chris

2011-04-21 18:08:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Thu, Apr 21, 2011 at 02:02:43PM -0400, Chris Mason wrote:
> For this program, they are almost all dirty pages.
>
> I tried patching it to give up if we seek but it is still pretty slow.
> There's something else going on in addition to the xfs clustering being
> too aggressive.

I'm not sure where you get this being too aggressive from - it's doing
exactly the same amount of I/O as a filesystem writing out a single
page from ->writepage or using write_cache_pages (either directly
or as a copy) as ->writepages. The only thing special compared to
the no ->writepages case is that it's submitting a large I/O
from the first ->writepage call.


2011-04-21 18:29:37

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Christoph Hellwig's message of 2011-04-21 14:08:05 -0400:
> On Thu, Apr 21, 2011 at 02:02:43PM -0400, Chris Mason wrote:
> > For this program, they are almost all dirty pages.
> >
> > I tried patching it to give up if we seek but it is still pretty slow.
> > There's something else going on in addition to the xfs clustering being
> > too aggressive.
>
> I'm not sure where you this beeing to agressive from - it's doing
> exactly the same amount of I/O as a filesystem writing out a single
> page from ->writepage or using write_cache_pages (either directly
> or as a copy) as ->writepages. The only thing special compared to
> the no ->writepages case is that it's submitting a large I/O
> from the first ->writepage call.
>

Ok, I see what you mean. The clustering code stops once it hits
nr_to_write, I missed that. So we shouldn't be doing more than a single
writepages call.

-chris

2011-04-21 18:43:58

by Andreas Dilger

[permalink] [raw]
Subject: Re: buffered writeback torture program

On 2011-04-21, at 12:29 PM, Chris Mason wrote:
> Excerpts from Christoph Hellwig's message of 2011-04-21 14:08:05 -0400:
>> On Thu, Apr 21, 2011 at 02:02:43PM -0400, Chris Mason wrote:
>>> For this program, they are almost all dirty pages.
>>>
>>> I tried patching it to give up if we seek but it is still pretty slow.
>>> There's something else going on in addition to the xfs clustering being
>>> too aggressive.
>>
>> I'm not sure where you this beeing to agressive from - it's doing
>> exactly the same amount of I/O as a filesystem writing out a single
>> page from ->writepage or using write_cache_pages (either directly
>> or as a copy) as ->writepages. The only thing special compared to
>> the no ->writepages case is that it's submitting a large I/O
>> from the first ->writepage call.
>
> Ok, I see what you mean. The clustering code stops once it hits
> nr_to_write, I missed that. So we shouldn't be doing more than a single
> writepages call.

I wonder if it makes sense to disentangle the two processes' state in the kernel by forking the fsync thread before any writes are done. That would avoid penalizing the random writer in the VM/VFS, but it means there needs to be some coordination between the threads (e.g. polling for a sentinel file written when the sequential phase is complete).
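
A rough sketch of that restructuring against the posted tester is below; the
sentinel file name and the 100ms poll interval are arbitrary choices, and the
real setup/random-write/fsync work is only marked with printfs:

#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define SENTINEL "fsync-tester.setup-done"      /* made-up name */

static void wait_for_sentinel(void)
{
        struct stat st;

        while (stat(SENTINEL, &st) < 0)
                usleep(100 * 1000);             /* poll every 100ms */
}

int main(void)
{
        pid_t pid = fork();                     /* fork before any writes happen */

        if (pid == 0) {
                /*
                 * Child: does the timed pwrite+fsync loop, but only after the
                 * sequential setup in the parent is finished, so none of that
                 * dirtying is accounted to this task.
                 */
                wait_for_sentinel();
                printf("fsync child: timed pwrite+fsync loop would run here\n");
                exit(0);
        }

        /* parent: the big sequential setup from the original tester goes here */
        printf("parent: 8GB sequential setup would run here\n");
        close(open(SENTINEL, O_WRONLY | O_CREAT, 0644));        /* signal setup done */

        /* ...then the parent would run random_io(), as in the original */
        wait(NULL);
        unlink(SENTINEL);
        return 0;
}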

Cheers, Andreas






2011-04-21 18:49:25

by Chris Mason

[permalink] [raw]
Subject: Re: buffered writeback torture program

Excerpts from Andreas Dilger's message of 2011-04-21 14:43:47 -0400:
> On 2011-04-21, at 12:29 PM, Chris Mason wrote:
> > Excerpts from Christoph Hellwig's message of 2011-04-21 14:08:05 -0400:
> >> On Thu, Apr 21, 2011 at 02:02:43PM -0400, Chris Mason wrote:
> >>> For this program, they are almost all dirty pages.
> >>>
> >>> I tried patching it to give up if we seek but it is still pretty slow.
> >>> There's something else going on in addition to the xfs clustering being
> >>> too aggressive.
> >>
> >> I'm not sure where you this beeing to agressive from - it's doing
> >> exactly the same amount of I/O as a filesystem writing out a single
> >> page from ->writepage or using write_cache_pages (either directly
> >> or as a copy) as ->writepages. The only thing special compared to
> >> the no ->writepages case is that it's submitting a large I/O
> >> from the first ->writepage call.
> >
> > Ok, I see what you mean. The clustering code stops once it hits
> > nr_to_write, I missed that. So we shouldn't be doing more than a single
> > writepages call.
>
> I wonder if it makes sense to disentangle the two processes state in the kernel, by forking the fsync thread before any writes are done. That would avoid penalizing the random writer in the VM/VFS, but means there needs to be some coordination between the threads (e.g. polling for a sentinel file written when the sequential phase is complete).
>

The test itself may not be realistic, but I actually think it's a feature
that we end up stuck doing the random buffered IO. Somehow it is much
slower than it should be on my box on xfs. This isn't universal; other
machines seem to be working fine.

-chris

2011-04-21 20:44:36

by Jan Kara

[permalink] [raw]
Subject: Re: buffered writeback torture program

On Thu 21-04-11 12:57:17, Chris Mason wrote:
> Excerpts from Jan Kara's message of 2011-04-21 12:55:29 -0400:
> > On Thu 21-04-11 11:25:41, Chris Mason wrote:
> > > Excerpts from Chris Mason's message of 2011-04-21 07:09:11 -0400:
> > > > Excerpts from Vivek Goyal's message of 2011-04-20 18:06:26 -0400:
> > > > > >
> > > > > > In this case the 128s spent in write was on a single 4K overwrite on a
> > > > > > 4K file.
> > > > >
> > > > > Chris, You seem to be doing 1MB (32768*32) writes on fsync file instead of 4K.
> > > > > I changed the size to 4K still not much difference though.
> > > >
> > > > Whoops, I had that change made locally but didn't get it copied out.
> > > >
> > > > >
> > > > > Once the program has exited because of high write time, i restarted it and
> > > > > this time I don't see high write times.
> > > >
> > > > I see this for some of my runs as well.
> > > >
> > > > >
> > > > > First run
> > > > > ---------
> > > > > # ./a.out
> > > > > setting up random write file
> > > > > done setting up random write file
> > > > > starting fsync run
> > > > > starting random io!
> > > > > write time: 0.0006s fsync time: 0.3400s
> > > > > write time: 63.3270s fsync time: 0.3760s
> > > > > run done 2 fsyncs total, killing random writer
> > > > >
> > > > > Second run
> > > > > ----------
> > > > > # ./a.out
> > > > > starting fsync run
> > > > > starting random io!
> > > > > write time: 0.0006s fsync time: 0.5359s
> > > > > write time: 0.0007s fsync time: 0.3559s
> > > > > write time: 0.0009s fsync time: 0.3113s
> > > > > write time: 0.0008s fsync time: 0.4336s
> > > > > write time: 0.0009s fsync time: 0.3780s
> > > > > write time: 0.0008s fsync time: 0.3114s
> > > > > write time: 0.0009s fsync time: 0.3225s
> > > > > write time: 0.0009s fsync time: 0.3891s
> > > > > write time: 0.0009s fsync time: 0.4336s
> > > > > write time: 0.0009s fsync time: 0.4225s
> > > > > write time: 0.0009s fsync time: 0.4114s
> > > > > write time: 0.0007s fsync time: 0.4004s
> > > > >
> > > > > Not sure why would that happen.
> > > > >
> > > > > I am wondering why pwrite/fsync process was throttled. It did not have any
> > > > > pages in page cache and it shouldn't have hit the task dirty limits. Does that
> > > > > mean per task dirty limit logic does not work or I am completely missing
> > > > > the root cause of the problem.
> > > >
> > > > I haven't traced it to see. This test box only has 1GB of ram, so the
> > > > dirty ratios can be very tight.
> > >
> > > Oh, I see now. The test program first creates the file with a big
> > > streaming write. So the task doing the streaming writes gets nailed
> > > with the per-task dirty accounting because it is making a ton of dirty
> > > data.
> > >
> > > Then the task forks the random writer to do all the random IO.
> > >
> > > Then the original pid goes back to do the fsyncs on the new file.
> > >
> > > So, in the original run, we get stuffed into balance_dirty_pages because
> > > the per-task limits show we've done a lot of dirties.
> > >
> > > In all later runs, the file already exists, so our fsyncing process
> > > hasn't done much dirtying at all. Looks like the VM is doing something
> > > sane, we just get nailed with big random IO.
> > Ok, so there isn't a problem with fsync() as such if I understand it
> > right. We just block tasks in balance_dirty_pages() for a *long* time
> > because it takes long time to write out that dirty IO and we make it even
> > worse by trying to writeout more on behalf of the throttled task. Am I
> > right? The IO-less throttling will solve this regardless of patchset we
> > choose so I wouldn't be too worried about the problem now.
>
> You're right. With one small exception, we probably do want to rotor
> out of the random buffered writes in hopes of finding some sequential IO
> even with the IO-less dirty throttling.
The flusher thread should do this - it writes at most 1024 pages (as much as a
->writepages call does) from the big file, then puts the inode on the b_more_io
queue and goes on to the next dirty inode. So if there is some sequential
IO as well, we should get to it...
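
In toy form (the 1024-page chunk is the one mentioned above, everything else
is made up for illustration and is not the real flusher code):

/*
 * Toy model of the round-robin described above: the flusher writes at most
 * 1024 pages of one dirty inode, then moves on to the next one, so an inode
 * with sequential dirty pages still gets a turn.
 */
#include <stdio.h>

struct toy_inode {
        const char *name;
        long dirty_pages;
};

static void flush_round(struct toy_inode *inodes, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                long chunk = inodes[i].dirty_pages < 1024 ?
                             inodes[i].dirty_pages : 1024;

                inodes[i].dirty_pages -= chunk;
                printf("%s: wrote %ld pages, %ld still dirty\n",
                       inodes[i].name, chunk, inodes[i].dirty_pages);
        }
}

int main(void)
{
        struct toy_inode inodes[] = {
                { "random-write-file", 200000 },        /* the big 8GB file */
                { "some-sequential-file", 3000 },       /* another dirty inode */
        };

        flush_round(inodes, 2);         /* both inodes make progress each pass */
        flush_round(inodes, 2);
        return 0;
}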

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR