2010-08-30 03:17:50

by Bill Fink

Subject: [RFC PATCH] ext4: fix 50% disk write performance regression

A 50% ext4 disk write performance regression was introduced
in 2.6.32 and still exists in 2.6.35, although somewhat improved
from 2.6.32. (Read performance was not affected.)

2.6.31 disk write performance (RAID5 with 8 disks):

i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 49.7106 s, 691 MB/s

2.6.32 disk write performance (RAID5 with 8 disks):

i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 100.395 s, 342 MB/s

2.6.35 disk write performance (RAID5 with 8 disks):

i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 75.7265 s, 454 MB/s

A git bisect targeted commit 55138e0bc29c0751e2152df9ad35deea542f29b3
(ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks).
Specifically the performance issue is caused by the use of the function
ext4_num_dirty_pages.

The included patch avoids calling ext4_num_dirty_pages
(and removes its definition) by unconditionally setting
desired_nr_to_write to wbc->nr_to_write * 8.

With the patch, the disk write performance is back to
approximately 2.6.31 performance levels.

2.6.35+patch disk write performance (RAID5 with 8 disks):

i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 50.7234 s, 677 MB/s

Since I'm no expert in this area, I'm submitting this
RFC patch against 2.6.35. I'm not sure what all the
ramifications of my suggested change would be. However,
to my admittedly novice eyes, it doesn't seem to be an
unreasonable change. Also, subjectively from building
kernels on a RAID5 ext4 filesystem using the patched
2.6.35 kernel (via make -j 8), I didn't notice any issues,
and it actually seemed more responsive than when using
the unpatched 2.6.35 kernel.

-Bill

P.S. I am not subscribed to the linux-ext4 e-mail list,
plus this is my very first attempted linux kernel
patch submission.



Partially revert 55138e0bc29c0751e2152df9ad35deea542f29b3
(ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks)
to fix a 50% ext4 disk write performance regression introduced
between 2.6.31 and 2.6.32.

Signed-off-by: Bill Fink <[email protected]>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..f6e639b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1143,64 +1143,6 @@ static int check_block_validity(struct inode *inode, const char *func,
}

 /*
- * Return the number of contiguous dirty pages in a given inode
- * starting at page frame idx.
- */
-static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
-                                    unsigned int max_pages)
-{
-        struct address_space *mapping = inode->i_mapping;
-        pgoff_t index;
-        struct pagevec pvec;
-        pgoff_t num = 0;
-        int i, nr_pages, done = 0;
-
-        if (max_pages == 0)
-                return 0;
-        pagevec_init(&pvec, 0);
-        while (!done) {
-                index = idx;
-                nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-                                              PAGECACHE_TAG_DIRTY,
-                                              (pgoff_t)PAGEVEC_SIZE);
-                if (nr_pages == 0)
-                        break;
-                for (i = 0; i < nr_pages; i++) {
-                        struct page *page = pvec.pages[i];
-                        struct buffer_head *bh, *head;
-
-                        lock_page(page);
-                        if (unlikely(page->mapping != mapping) ||
-                            !PageDirty(page) ||
-                            PageWriteback(page) ||
-                            page->index != idx) {
-                                done = 1;
-                                unlock_page(page);
-                                break;
-                        }
-                        if (page_has_buffers(page)) {
-                                bh = head = page_buffers(page);
-                                do {
-                                        if (!buffer_delay(bh) &&
-                                            !buffer_unwritten(bh))
-                                                done = 1;
-                                        bh = bh->b_this_page;
-                                } while (!done && (bh != head));
-                        }
-                        unlock_page(page);
-                        if (done)
-                                break;
-                        idx++;
-                        num++;
-                        if (num >= max_pages)
-                                break;
-                }
-                pagevec_release(&pvec);
-        }
-        return num;
-}
-
-/*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
  *
@@ -2972,15 +2914,10 @@ static int ext4_da_writepages(struct address_space *mapping,
          * contiguous. Unfortunately this brings us to the second
          * stupidity, which is that ext4's mballoc code only allocates
          * at most 2048 blocks. So we force contiguous writes up to
-         * the number of dirty blocks in the inode, or
-         * sbi->max_writeback_mb_bump whichever is smaller.
+         * sbi->max_writeback_mb_bump
          */
         max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
-        if (!range_cyclic && range_whole)
-                desired_nr_to_write = wbc->nr_to_write * 8;
-        else
-                desired_nr_to_write = ext4_num_dirty_pages(inode, index,
-                                                           max_pages);
+        desired_nr_to_write = wbc->nr_to_write * 8;
         if (desired_nr_to_write > max_pages)
                 desired_nr_to_write = max_pages;


2010-08-30 17:05:45

by Eric Sandeen

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:
> A 50% ext4 disk write performance regression was introduced
> in 2.6.32 and still exists in 2.6.35, although somewhat improved
> from 2.6.32. (Read performance was not affected.)
>
> 2.6.31 disk write performance (RAID5 with 8 disks):
>
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 49.7106 s, 691 MB/s
>
> 2.6.32 disk write performance (RAID5 with 8 disks):
>
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 100.395 s, 342 MB/s
>
> 2.6.35 disk write performance (RAID5 with 8 disks):
>
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 75.7265 s, 454 MB/s
>
> A git bisect targeted commit 55138e0bc29c0751e2152df9ad35deea542f29b3
> (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks).
> Specifically the performance issue is caused by the use of the function
> ext4_num_dirty_pages.
>
> The included patch avoids calling ext4_num_dirty_pages
> (and removes its definition) by unconditionally setting
> desired_nr_to_write to wbc->nr_to_write * 8.
>
> With the patch, the disk write performance is back to
> approximately 2.6.31 performance levels.

Firstly, thanks very much for tracking that down. I've had various &
sundry reports of slowdowns but I'd never really gotten to the bottom
of it with a simple testcase somehow.

When I get some time (soon I hope) I'll look into the ramifications
of this change (i.e. what if wbc->nr_to_write * 8 is more than the dirty
pages, do things work out ok?) but it seems pretty reasonable.

Since the commit was Ted's originally, perhaps he has some more
immediate comments.

Thanks a ton!

-Eric

> 2.6.35+patch disk write performance (RAID5 with 8 disks):
>
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 50.7234 s, 677 MB/s
>
> Since I'm no expert in this area, I'm submitting this
> RFC patch against 2.6.35. I'm not sure what all the
> ramifications of my suggested change would be. However,
> to my admittedly novice eyes, it doesn't seem to be an
> unreasonable change. Also, subjectively from building
> kernels on a RAID5 ext4 filesystem using the patched
> 2.6.35 kernel (via make -j 8), I didn't notice any issues,
> and it actually seemed more responsive than when using
> the unpatched 2.6.35 kernel.
>
> -Bill
>
> P.S. I am not subscribed to the linux-ext4 e-mail list,
> plus this is my very first attempted linux kernel
> patch submission.
>
>
>
> Partially revert 55138e0bc29c0751e2152df9ad35deea542f29b3
> (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks)
> to fix a 50% ext4 disk write performance regression introduced
> between 2.6.31 and 2.6.32.
>
> Signed-off-by: Bill Fink <[email protected]>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 42272d6..f6e639b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1143,64 +1143,6 @@ static int check_block_validity(struct inode *inode, const char *func,
> }
>
> /*
> - * Return the number of contiguous dirty pages in a given inode
> - * starting at page frame idx.
> - */
> -static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> - unsigned int max_pages)
> -{
> - struct address_space *mapping = inode->i_mapping;
> - pgoff_t index;
> - struct pagevec pvec;
> - pgoff_t num = 0;
> - int i, nr_pages, done = 0;
> -
> - if (max_pages == 0)
> - return 0;
> - pagevec_init(&pvec, 0);
> - while (!done) {
> - index = idx;
> - nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> - PAGECACHE_TAG_DIRTY,
> - (pgoff_t)PAGEVEC_SIZE);
> - if (nr_pages == 0)
> - break;
> - for (i = 0; i < nr_pages; i++) {
> - struct page *page = pvec.pages[i];
> - struct buffer_head *bh, *head;
> -
> - lock_page(page);
> - if (unlikely(page->mapping != mapping) ||
> - !PageDirty(page) ||
> - PageWriteback(page) ||
> - page->index != idx) {
> - done = 1;
> - unlock_page(page);
> - break;
> - }
> - if (page_has_buffers(page)) {
> - bh = head = page_buffers(page);
> - do {
> - if (!buffer_delay(bh) &&
> - !buffer_unwritten(bh))
> - done = 1;
> - bh = bh->b_this_page;
> - } while (!done && (bh != head));
> - }
> - unlock_page(page);
> - if (done)
> - break;
> - idx++;
> - num++;
> - if (num >= max_pages)
> - break;
> - }
> - pagevec_release(&pvec);
> - }
> - return num;
> -}
> -
> -/*
> * The ext4_map_blocks() function tries to look up the requested blocks,
> * and returns if the blocks are already mapped.
> *
> @@ -2972,15 +2914,10 @@ static int ext4_da_writepages(struct address_space *mapping,
> * contiguous. Unfortunately this brings us to the second
> * stupidity, which is that ext4's mballoc code only allocates
> * at most 2048 blocks. So we force contiguous writes up to
> - * the number of dirty blocks in the inode, or
> - * sbi->max_writeback_mb_bump whichever is smaller.
> + * sbi->max_writeback_mb_bump
> */
> max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> - if (!range_cyclic && range_whole)
> - desired_nr_to_write = wbc->nr_to_write * 8;
> - else
> - desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> - max_pages);
> + desired_nr_to_write = wbc->nr_to_write * 8;
> if (desired_nr_to_write > max_pages)
> desired_nr_to_write = max_pages;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2010-08-30 17:40:04

by Theodore Ts'o

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
> A 50% ext4 disk write performance regression was introduced
> in 2.6.32 and still exists in 2.6.35, although somewhat improved
> from 2.6.32. (Read performance was not affected.)

Thanks for reporting it. I'm going to have to take a closer look at
why this makes a difference. I'm going to guess though that what's
going on is that we're posting writes in such a way that they're no
longer aligned or ending at the end of a RAID5 stripe, causing a
read-modify-write pass. That would easily explain the write
performance regression.

The interesting thing is that we don't actually do anything in
ext4_da_writepages() to assure that our writes are appropriately
aligned and sized. We do pay attention to make sure they
are aligned correctly in the allocator, but _not_ in the writepages
code. So the fact that apparently things were well aligned in 2.6.32
seems to be luck... (or maybe the writes are perfectly aligned in
2.6.32; they're just much worse with 2.6.35, and with explicit
attention paid to the RAID stripe size, we could do even better :-)

If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with
your patch, that would be really helpful to confirm my hypothesis. Is
that something that wouldn't be too much trouble?

Thanks, regards,

- Ted

2010-08-30 19:36:02

by Eric Sandeen

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:
> On Mon, 30 Aug 2010, Eric Sandeen wrote:

...

>> When I get some time (soon I hope) I'll look into the ramifications
>> of this change (i.e. what if wbc->nr_to_write * 8 is more than the dirty
>> pages, do things work out ok?) but it seems pretty reasonable.
>
> In thinking about that issue, my non-expert thought was that
> if desired_nr_to_write was larger, it hopefully wouldn't be an
> issue since presumably it's only going to actually write at most
> the number of dirty pages anyway. And on the flip side, there
> doesn't seem to be an issue with desired_nr_to_write possibly
> being smaller than it was without the patch, since the empirical
> evidence is that the performance significantly improved with the
> patch, and the effect of a lower value would presumably be
> reduced performance.
>
> -Bill

Ted's suggestion of capturing blktrace data would be a really great
step towards understanding what changed, too.

Thanks!

-Eric

2010-08-30 19:55:42

by Bill Fink

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Mon, 30 Aug 2010, Eric Sandeen wrote:

> Bill Fink wrote:
> > A 50% ext4 disk write performance regression was introduced
> > in 2.6.32 and still exists in 2.6.35, although somewhat improved
> > from 2.6.32. (Read performance was not affected.)
> >
> > 2.6.31 disk write performance (RAID5 with 8 disks):
> >
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 49.7106 s, 691 MB/s
> >
> > 2.6.32 disk write performance (RAID5 with 8 disks):
> >
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 100.395 s, 342 MB/s
> >
> > 2.6.35 disk write performance (RAID5 with 8 disks):
> >
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 75.7265 s, 454 MB/s
> >
> > A git bisect targeted commit 55138e0bc29c0751e2152df9ad35deea542f29b3
> > (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks).
> > Specifically the performance issue is caused by the use of the function
> > ext4_num_dirty_pages.
> >
> > The included patch avoids calling ext4_num_dirty_pages
> > (and removes its definition) by unconditionally setting
> > desired_nr_to_write to wbc->nr_to_write * 8.
> >
> > With the patch, the disk write performance is back to
> > approximately 2.6.31 performance levels.
>
> Firstly, thanks very much for tracking that down. I've had various &
> sundry reports of slowdowns but I'd never really gotten to the bottom
> of it with a simple testcase somehow.
>
> When I get some time (soon I hope) I'll look into the ramifications
> of this change (i.e. what if wbc->nr_to_write * 8 is more than the dirty
> pages, do things work out ok?) but it seems pretty reasonable.

In thinking about that issue, my non-expert thought was that
if desired_nr_to_write was larger, it hopefully wouldn't be an
issue since presumably it's only going to actually write at most
the number of dirty pages anyway. And on the flip side, there
doesn't seem to be an issue with desired_nr_to_write possibly
being smaller than it was without the patch, since the empirical
evidence is that the performance significantly improved with the
patch, and the effect of a lower value would presumably be
reduced performance.

-Bill



> Since the commit was Ted's originally, perhaps he has some more
> immediate comments.
>
> Thanks a ton!
>
> -Eric
>
> > 2.6.35+patch disk write performance (RAID5 with 8 disks):
> >
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 50.7234 s, 677 MB/s
> >
> > Since I'm no expert in this area, I'm submitting this
> > RFC patch against 2.6.35. I'm not sure what all the
> > ramifications of my suggested change would be. However,
> > to my admittedly novice eyes, it doesn't seem to be an
> > unreasonable change. Also, subjectively from building
> > kernels on a RAID5 ext4 filesystem using the patched
> > 2.6.35 kernel (via make -j 8), I didn't notice any issues,
> > and it actually seemed more responsive than when using
> > the unpatched 2.6.35 kernel.
> >
> > -Bill
> >
> > P.S. I am not subscribed to the linux-ext4 e-mail list,
> > plus this is my very first attempted linux kernel
> > patch submission.
> >
> >
> >
> > Partially revert 55138e0bc29c0751e2152df9ad35deea542f29b3
> > (ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks)
> > to fix a 50% ext4 disk write performance regression introduced
> > between 2.6.31 and 2.6.32.
> >
> > Signed-off-by: Bill Fink <[email protected]>
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 42272d6..f6e639b 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1143,64 +1143,6 @@ static int check_block_validity(struct inode *inode, const char *func,
> > }
> >
> > /*
> > - * Return the number of contiguous dirty pages in a given inode
> > - * starting at page frame idx.
> > - */
> > -static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> > - unsigned int max_pages)
> > -{
> > - struct address_space *mapping = inode->i_mapping;
> > - pgoff_t index;
> > - struct pagevec pvec;
> > - pgoff_t num = 0;
> > - int i, nr_pages, done = 0;
> > -
> > - if (max_pages == 0)
> > - return 0;
> > - pagevec_init(&pvec, 0);
> > - while (!done) {
> > - index = idx;
> > - nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> > - PAGECACHE_TAG_DIRTY,
> > - (pgoff_t)PAGEVEC_SIZE);
> > - if (nr_pages == 0)
> > - break;
> > - for (i = 0; i < nr_pages; i++) {
> > - struct page *page = pvec.pages[i];
> > - struct buffer_head *bh, *head;
> > -
> > - lock_page(page);
> > - if (unlikely(page->mapping != mapping) ||
> > - !PageDirty(page) ||
> > - PageWriteback(page) ||
> > - page->index != idx) {
> > - done = 1;
> > - unlock_page(page);
> > - break;
> > - }
> > - if (page_has_buffers(page)) {
> > - bh = head = page_buffers(page);
> > - do {
> > - if (!buffer_delay(bh) &&
> > - !buffer_unwritten(bh))
> > - done = 1;
> > - bh = bh->b_this_page;
> > - } while (!done && (bh != head));
> > - }
> > - unlock_page(page);
> > - if (done)
> > - break;
> > - idx++;
> > - num++;
> > - if (num >= max_pages)
> > - break;
> > - }
> > - pagevec_release(&pvec);
> > - }
> > - return num;
> > -}
> > -
> > -/*
> > * The ext4_map_blocks() function tries to look up the requested blocks,
> > * and returns if the blocks are already mapped.
> > *
> > @@ -2972,15 +2914,10 @@ static int ext4_da_writepages(struct address_space *mapping,
> > * contiguous. Unfortunately this brings us to the second
> > * stupidity, which is that ext4's mballoc code only allocates
> > * at most 2048 blocks. So we force contiguous writes up to
> > - * the number of dirty blocks in the inode, or
> > - * sbi->max_writeback_mb_bump whichever is smaller.
> > + * sbi->max_writeback_mb_bump
> > */
> > max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> > - if (!range_cyclic && range_whole)
> > - desired_nr_to_write = wbc->nr_to_write * 8;
> > - else
> > - desired_nr_to_write = ext4_num_dirty_pages(inode, index,
> > - max_pages);
> > + desired_nr_to_write = wbc->nr_to_write * 8;
> > if (desired_nr_to_write > max_pages)
> > desired_nr_to_write = max_pages;
> >
>

2010-08-30 20:50:05

by Bill Fink

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Mon, 30 Aug 2010, Ted Ts'o wrote:

> On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
> > A 50% ext4 disk write performance regression was introduced
> > in 2.6.32 and still exists in 2.6.35, although somewhat improved
> > from 2.6.32. (Read performance was not affected.)
>
> Thanks for reporting it. I'm going to have to take a closer look at
> why this makes a difference. I'm going to guess though that what's
> going on is that we're posting writes in such a way that they're no
> longer aligned or ending at the end of a RAID5 stripe, causing a
> read-modify-write pass. That would easily explain the write
> performance regression.

I'm not sure I understand. How could calling or not calling
ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
affect the write alignment?

I was wondering if the locking being done in ext4_num_dirty_pages()
could somehow be affecting the performance. I did notice from top
that in the patched 2.6.35 kernel, the I/O wait time was generally
in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
at a higher 75-80% range. However, I don't know if that's just a
result of the lower performance, or a possible clue to its cause.

> The interesting thing is that we don't actually do anything in
> ext4_da_writepages() to assure that our writes are appropriately
> aligned and sized. We do pay attention to make sure they
> are aligned correctly in the allocator, but _not_ in the writepages
> code. So the fact that apparently things were well aligned in 2.6.32
> seems to be luck... (or maybe the writes are perfectly aligned in
> 2.6.32; they're just much worse with 2.6.35, and with explicit
> attention paid to the RAID stripe size, we could do even better :-)

It was 2.6.31 that was good. The regression was in 2.6.32. And again
how does the write alignment get modified simply by whether or not
ext4_num_dirty_pages() is called?

> If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with
> your patch, that would be really helpful to confirm my hypothesis. Is
> that something that wouldn't be too much trouble?

I'd be glad to if you explain how one runs blktraces.

-Thanks

-Bill

2010-08-30 21:05:19

by Eric Sandeen

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:
> On Mon, 30 Aug 2010, Ted Ts'o wrote:
>
>> On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
>>> A 50% ext4 disk write performance regression was introduced
>>> in 2.6.32 and still exists in 2.6.35, although somewhat improved
>>> from 2.6.32. (Read performance was not affected.)
>> Thanks for reporting it. I'm going to have to take a closer look at
>> why this makes a difference. I'm going to guess though that what's
>> going on is that we're posting writes in such a way that they're no
>> longer aligned or ending at the end of a RAID5 stripe, causing a
>> read-modify-write pass. That would easily explain the write
>> performance regression.
>
> I'm not sure I understand. How could calling or not calling
> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> affect the write alignment?
>
> I was wondering if the locking being done in ext4_num_dirty_pages()
> could somehow be affecting the performance. I did notice from top
> that in the patched 2.6.35 kernel, the I/O wait time was generally
> in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
> at a higher 75-80% range. However, I don't know if that's just a
> result of the lower performance, or a possible clue to its cause.

Using oprofile might also show you how much time is getting spent there..
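
(Untested sketch with oprofile's legacy opcontrol interface, assuming
a vmlinux with symbols is available:

# opcontrol --vmlinux=/path/to/vmlinux
# opcontrol --start
# dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
# opcontrol --stop
# opreport -l | head -20

If ext4_num_dirty_pages shows up near the top of the symbol report,
that would settle it.)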

>> The interesting thing is that we don't actually do anything in
>> ext4_da_writepages() to assure that our writes are appropriately
>> aligned and sized. We do pay attention to make sure they
>> are aligned correctly in the allocator, but _not_ in the writepages
>> code. So the fact that apparently things were well aligned in 2.6.32
>> seems to be luck... (or maybe the writes are perfectly aligned in
>> 2.6.32; they're just much worse with 2.6.35, and with explicit
>> attention paid to the RAID stripe size, we could do even better :-)
>
> It was 2.6.31 that was good. The regression was in 2.6.32. And again
> how does the write alignment get modified simply by whether or not
> ext4_num_dirty_pages() is called?

writeback is full of deep mysteries ... :)

>> If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with
>> your patch, that would be really helpful to confirm my hypothesis. Is
>> that something that wouldn't be too much trouble?
>
> I'd be glad to if you explain how one runs blktraces.

Probably the easiest thing to do is to use seekwatcher to invoke blktrace,
if it's easily available for your distro. Then it's just mount debugfs on
/sys/kernel/debug, and:

# seekwatcher -d /dev/whatever -t tracename -o tracename.png -p "your dd command"

It'll leave tracename.* blktrace files, and generate a graph of the IO
in the PNG file.

(this causes an abbreviated trace, but it's probably enough to see what
boundaries the IO was issued on)

Thanks!
-Eric

> -Thanks
>
> -Bill


2010-08-30 23:53:27

by Eric Sandeen

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:
> On Mon, 30 Aug 2010, Eric Sandeen wrote:
>
>> Bill Fink wrote:
>>> On Mon, 30 Aug 2010, Ted Ts'o wrote:
>>>
>>>> On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
>>>>> A 50% ext4 disk write performance regression was introduced
>>>>> in 2.6.32 and still exists in 2.6.35, although somewhat improved
>>>>> from 2.6.32. (Read performance was not affected.)
>>>> Thanks for reporting it. I'm going to have to take a closer look at
>>>> why this makes a difference. I'm going to guess though that what's
>>>> going on is that we're posting writes in such a way that they're no
>>>> longer aligned or ending at the end of a RAID5 stripe, causing a
>>>> read-modify-write pass. That would easily explain the write
>>>> performance regression.
>>> I'm not sure I understand. How could calling or not calling
>>> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
>>> affect the write alignment?
>>>
>>> I was wondering if the locking being done in ext4_num_dirty_pages()
>>> could somehow be affecting the performance. I did notice from top
>>> that in the patched 2.6.35 kernel, the I/O wait time was generally
>>> in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
>>> at a higher 75-80% range. However, I don't know if that's just a
>>> result of the lower performance, or a possible clue to its cause.
>> Using oprofile might also show you how much time is getting spent there..
>>
>>>> The interesting thing is that we don't actually do anything in
>>>> ext4_da_writepages() to assure that our writes are appropriately
>>>> aligned and sized. We do pay attention to make sure they
>>>> are aligned correctly in the allocator, but _not_ in the writepages
>>>> code. So the fact that apparently things were well aligned in 2.6.32
>>>> seems to be luck... (or maybe the writes are perfectly aligned in
>>>> 2.6.32; they're just much worse with 2.6.35, and with explicit
>>>> attention paid to the RAID stripe size, we could do even better :-)
>>> It was 2.6.31 that was good. The regression was in 2.6.32. And again
>>> how does the write alignment get modified simply by whether or not
>>> ext4_num_dirty_pages() is called?
>> writeback is full of deep mysteries ... :)
>>
>>>> If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with
>>>> your patch, that would be really helpful to confirm my hypothesis. Is
>>>> that something that wouldn't be too much trouble?
>>> I'd be glad to if you explain how one runs blktraces.
>> Probably the easiest thing to do is to use seekwatcher to invoke blktrace,
>> if it's easily available for your distro. Then it's just mount debugfs on
>> /sys/kernel/debug, and:
>>
>> # seekwatcher -d /dev/whatever -t tracename -o tracename.png -p "your dd command"
>>
>> It'll leave tracename.* blktrace files, and generate a graph of the IO
>> in the PNG file.
>>
>> (this causes an abbreviated trace, but it's probably enough to see what
>> boundaries the IO was issued on)
>
> Thanks for the info. How would you like me to send the blktraces?
> Even using bzip2 they're 2.6 MB. I could send them to you and Ted
> via private e-mail or I can hunt around and try and find somewhere
> I can post them. I'm attaching the PNG files (2.6.35 is unpatched
> and 2.6.35+ is patched).

Private email is fine I think, I don't mind a 2.6MB attachment and doubt
Ted would either. :)

I keep meaning to patch seekwatcher to color unaligned IOs differently,
but without that we need the blktrace data to know if that's what's going
on.

It's interesting that the patched run is starting at block 0 while
unpatched is starting further in (which would be a little slower at least)

was there a fresh mkfs in between?

Thanks!

-Eric


2010-08-31 00:37:19

by Theodore Ts'o

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Mon, Aug 30, 2010 at 04:49:58PM -0400, Bill Fink wrote:
> > Thanks for reporting it. I'm going to have to take a closer look at
> > why this makes a difference. I'm going to guess though that what's
> > going on is that we're posting writes in such a way that they're no
> > longer aligned or ending at the end of a RAID5 stripe, causing a
> > read-modify-write pass. That would easily explain the write
> > performance regression.
>
> I'm not sure I understand. How could calling or not calling
> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> affect the write alignment?

Suppose you have 8 disks, with stripe size of 16k. Assuming that
you're only using one parity disk (i.e., RAID 5) and no spare disks,
that means the optimal I/O size is 7*16k == 112k. If we do a write
which is smaller than 112k, or which is not a multiple of 112k, then
the RAID subsystem will need to do a read-modify-write to update the
parity disk. Furthermore, the write had better be aligned on a 112k
byte boundary. The block allocator will guarantee that block #0 is
aligned on a 112k block, but writes also have to be the right size in
order to avoid the read-modify-write.

If we end up doing very small writes, then it can end up being quite
disastrous for write performance.
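
(To make the arithmetic concrete, here is a trivial user-space sketch
of the full-stripe check, using the 8-disk/16k numbers above as
stand-ins for whatever geometry the array really has:

#include <stdio.h>

int main(void)
{
        const unsigned long chunk = 16 * 1024;     /* per-disk chunk size */
        const unsigned long data_disks = 8 - 1;    /* RAID5: one disk of parity */
        const unsigned long stripe = chunk * data_disks; /* 112k full stripe */
        unsigned long offset = 0, len = 100 * 1024;      /* a sample write */

        /* Only a write that starts on a stripe boundary and covers
         * whole stripes avoids the parity read-modify-write. */
        if (offset % stripe == 0 && len % stripe == 0)
                printf("full-stripe write: no read-modify-write\n");
        else
                printf("partial stripe: parity read-modify-write\n");
        return 0;
}

The 100k sample write fails the check, so the RAID layer would have to
read the old data and parity back in before it could write.)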

> I was wondering if the locking being done in ext4_num_dirty_pages()
> could somehow be affecting the performance. I did notice from top
> that in the patched 2.6.35 kernel, the I/O wait time was generally
> in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
> at a higher 75-80% range. However, I don't know if that's just a
> result of the lower performance, or a possible clue to its cause.

I/O wait time would tend to imply that the raid controller is taking
longer to do the write updates, which would tend to confirm that we're
doing more read-modify-write cycles. If we were hitting spinlock
contention, this would show up as more system CPU time consumed.
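
(An easy way to watch that distinction is to leave "vmstat 1" running
during the dd and compare the "wa" column (I/O wait) against "sy"
(system time); spinlock contention would inflate "sy", while extra
read-modify-write cycles show up as "wa".)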

- Ted

2010-08-31 00:51:39

by Justin Maggard

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Mon, Aug 30, 2010 at 5:37 PM, Ted Ts'o <[email protected]> wrote:
> On Mon, Aug 30, 2010 at 04:49:58PM -0400, Bill Fink wrote:
>> > Thanks for reporting it. I'm going to have to take a closer look at
>> > why this makes a difference. I'm going to guess though that what's
>> > going on is that we're posting writes in such a way that they're no
>> > longer aligned or ending at the end of a RAID5 stripe, causing a
>> > read-modify-write pass. That would easily explain the write
>> > performance regression.
>>
>> I'm not sure I understand. How could calling or not calling
>> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
>> affect the write alignment?
>
> Suppose you have 8 disks, with stripe size of 16k. Assuming that
> you're only using one parity disk (i.e., RAID 5) and no spare disks,
> that means the optimal I/O size is 7*16k == 112k. If we do a write
> which is smaller than 112k, or which is not a multiple of 112k, then
> the RAID subsystem will need to do a read-modify-write to update the
> parity disk. Furthermore, the write had better be aligned on a 112k
> byte boundary. The block allocator will guarantee that block #0 is
> aligned on a 112k block, but writes also have to be the right size in
> order to avoid the read-modify-write.
>
> If we end up doing very small writes, then it can end up being quite
> disastrous for write performance.

I'd have to agree that this is likely the case. Just to add a little
more data here, I tried the same 32GB dd test against a 12-disk MD
RAID 6 64k chunk array today with and without the patch (although
against a 2.6.33.7 kernel), and my write performance dropped from
~420MB/sec down to 350MB/sec when I used the patched kernel.

-Justin

2010-08-31 01:14:54

by Bill Fink

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Mon, 30 Aug 2010, Ted Ts'o wrote:

> On Mon, Aug 30, 2010 at 04:49:58PM -0400, Bill Fink wrote:
> > > Thanks for reporting it. I'm going to have to take a closer look at
> > > why this makes a difference. I'm going to guess though that what's
> > > going on is that we're posting writes in such a way that they're no
> > > longer aligned or ending at the end of a RAID5 stripe, causing a
> > > read-modify-write pass. That would easily explain the write
> > > performance regression.
> >
> > I'm not sure I understand. How could calling or not calling
> > ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> > affect the write alignment?
>
> Suppose you have 8 disks, with stripe size of 16k. Assuming that
> you're only using one parity disk (i.e., RAID 5) and no spare disks,
> that means the optimal I/O size is 7*16k == 112k. If we do a write
> which is smaller than 112k, or which is not a multiple of 112k, then
> the RAID subsystem will need to do a read-modify-write to update the
> parity disk. Furthermore, the write had better be aligned on a 112k
> byte boundary. The block allocator will guarantee that block #0 is
> aligned on a 112k block, but writes also have to be the right size in
> order to avoid the read-modify-write.
>
> If we end up doing very small writes, then it can end up being quite
> disastrous for write performance.

I understand how unaligned writes can be very bad for performance.
That makes perfect sense. What I don't understand is how just
calling or not calling ext4_num_dirty_pages() can affect the
write alignment, and that's the only difference between the
unpatched and patched 2.6.35 kernels. I thought the only thing
ext4_num_dirty_pages does is to count the number of ext4 dirty
pages. How can that counting affect the write alignment? I
guess there must be some subtle side effect of ext4_num_dirty_pages
that I'm not getting.

> > I was wondering if the locking being done in ext4_num_dirty_pages()
> > could somehow be affecting the performance. I did notice from top
> > that in the patched 2.6.35 kernel, the I/O wait time was generally
> > in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
> > at a higher 75-80% range. However, I don't know if that's just a
> > result of the lower performance, or a possible clue to its cause.
>
> I/O wait time would tend to imply that the raid controller is taking
> longer to do the write updates, which would tend to confirm that we're
> doing more read-modify-write cycles. If we were hitting spinlock
> contention, this would show up as more system CPU time consumed.

OK. There wasn't more CPU utilization. It was about proportionally
less in the bad case as the reduced level of performance.

-Thanks

-Bill

2010-08-31 01:44:23

by Bill Fink

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

On Mon, 30 Aug 2010, Justin Maggard wrote:

> On Mon, Aug 30, 2010 at 5:37 PM, Ted Ts'o <[email protected]> wrote:
> > On Mon, Aug 30, 2010 at 04:49:58PM -0400, Bill Fink wrote:
> >> > Thanks for reporting it. I'm going to have to take a closer look at
> >> > why this makes a difference. I'm going to guess though that what's
> >> > going on is that we're posting writes in such a way that they're no
> >> > longer aligned or ending at the end of a RAID5 stripe, causing a
> >> > read-modify-write pass. That would easily explain the write
> >> > performance regression.
> >>
> >> I'm not sure I understand. How could calling or not calling
> >> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> >> affect the write alignment?
> >
> > Suppose you have 8 disks, with stripe size of 16k. Assuming that
> > you're only using one parity disk (i.e., RAID 5) and no spare disks,
> > that means the optimal I/O size is 7*16k == 112k. If we do a write
> > which is smaller than 112k, or which is not a multiple of 112k, then
> > the RAID subsystem will need to do a read-modify-write to update the
> > parity disk. Furthermore, the write had better be aligned on a 112k
> > byte boundary. The block allocator will guarantee that block #0 is
> > aligned on a 112k block, but writes also have to be the right size in
> > order to avoid the read-modify-write.
> >
> > If we end up doing very small writes, then it can end up being quite
> > disastrous for write performance.
>
> I'd have to agree that this is likely the case. Just to add a little
> more data here, I tried the same 32GB dd test against a 12-disk MD
> RAID 6 64k chunk array today with and without the patch (although
> against a 2.6.33.7 kernel), and my write performance dropped from
> ~420MB/sec down to 350MB/sec when I used the patched kernel.

I'm curious. Since you're using 12 disks where I was only
using 8, I'm wondering what performance you would get if you
changed the multiplier to say 16, i.e.

desired_nr_to_write = wbc->nr_to_write * 16;

It seems you should be getting better than 420 MB/sec on a
12-disk raid, although perhaps the overhead of doing RAID6
is an issue. I use md RAID0 to combine 2 of the hardware
RAID5 arrays (total of 16 disks), and I'm seeing (with my
patch) 1.3 GB/sec write performance.

-Bill

2010-08-31 03:27:21

by Bill Fink

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

[adding linux-ext4 back in]

On Mon, 30 Aug 2010, Eric Sandeen wrote:

> Bill Fink wrote:
> > On Mon, 30 Aug 2010, Eric Sandeen wrote:
> >
> >> Bill Fink wrote:
> >>> On Mon, 30 Aug 2010, Eric Sandeen wrote:
> >>>
> >>>> Bill Fink wrote:
> >>>>> On Mon, 30 Aug 2010, Ted Ts'o wrote:
> >>>>>
> >>>>>> On Sun, Aug 29, 2010 at 11:11:26PM -0400, Bill Fink wrote:
> >>>>>>> A 50% ext4 disk write performance regression was introduced
> >>>>>>> in 2.6.32 and still exists in 2.6.35, although somewhat improved
> >>>>>>> from 2.6.32. (Read performance was not affected.)
> >>>>>> Thanks for reporting it. I'm going to have to take a closer look at
> >>>>>> why this makes a difference. I'm going to guess though that what's
> >>>>>> going on is that we're posting writes in such a way that they're no
> >>>>>> longer aligned or ending at the end of a RAID5 stripe, causing a
> >>>>>> read-modify-write pass. That would easily explain the write
> >>>>>> performance regression.
> >>>>> I'm not sure I understand. How could calling or not calling
> >>>>> ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> >>>>> affect the write alignment?
> >>>>>
> >>>>> I was wondering if the locking being done in ext4_num_dirty_pages()
> >>>>> could somehow be affecting the performance. I did notice from top
> >>>>> that in the patched 2.6.35 kernel, the I/O wait time was generally
> >>>>> in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
> >>>>> at a higher 75-80% range. However, I don't know if that's just a
> >>>>> result of the lower performance, or a possible clue to its cause.
> >>>> Using oprofile might also show you how much time is getting spent there..
> >>>>
> >>>>>> The interesting thing is that we don't actually do anything in
> >>>>>> ext4_da_writepages() to assure that our writes are appropriately
> >>>>>> aligned and sized. We do pay attention to make sure they
> >>>>>> are aligned correctly in the allocator, but _not_ in the writepages
> >>>>>> code. So the fact that apparently things were well aligned in 2.6.32
> >>>>>> seems to be luck... (or maybe the writes are perfectly aligned in
> >>>>>> 2.6.32; they're just much worse with 2.6.35, and with explicit
> >>>>>> attention paid to the RAID stripe size, we could do even better :-)
> >>>>> It was 2.6.31 that was good. The regression was in 2.6.32. And again
> >>>>> how does the write alignment get modified simply by whether or not
> >>>>> ext4_num_dirty_pages() is called?
> >>>> writeback is full of deep mysteries ... :)
> >>>>
> >>>>>> If you could run blktraces on 2.6.32, 2.6.35 stock, and 2.6.35 with
> >>>>>> your patch, that would be really helpful to confirm my hypothesis. Is
> >>>>>> that something that wouldn't be too much trouble?
> >>>>> I'd be glad to if you explain how one runs blktraces.
> >>>> Probably the easiest thing to do is to use seekwatcher to invoke blktrace,
> >>>> if it's easily available for your distro. Then it's just mount debugfs on
> >>>> /sys/kernel/debug, and:
> >>>>
> >>>> # seekwatcher -d /dev/whatever -t tracename -o tracename.png -p "your dd command"
> >>>>
> >>>> It'll leave tracename.* blktrace files, and generate a graph of the IO
> >>>> in the PNG file.
> >>>>
> >>>> (this causes an abbreviated trace, but it's probably enough to see what
> >>>> boundaries the IO was issued on)
> >>> Thanks for the info. How would you like me to send the blktraces?
> >>> Even using bzip2 they're 2.6 MB. I could send them to you and Ted
> >>> via private e-mail or I can hunt around and try and find somewhere
> >>> I can post them. I'm attaching the PNG files (2.6.35 is unpatched
> >>> and 2.6.35+ is patched).
> >> Private email is fine I think, I don't mind a 2.6MB attachment and doubt
> >> Ted would either. :)
> >
> > OK. It's two such attachments (2.6.35 is unpatched and 2.6.35+
> > is patched).
>
> (fwiw I had to create *.0 and *.1 files to make blktrace happy, it didn't
> like starting with *.2 ...? oh well)

The .0 and .1 files were 0 bytes. There was also a 56 byte .3 file
in each case. Did you also want to see those?

> [sandeen@sandeen tmp]$ blkparse ext4dd-2.6.35-trace | tail -n 13
> CPU2 (ext4dd-2.6.35-trace):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 249, 996KiB Writes Completed: 270,621, 30,544MiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 0
> IO unplugs: 0 Timer unplugs: 0
>
> Throughput (R/W): 13KiB/s / 409,788KiB/s
> Events (ext4dd-2.6.35-trace): 270,870 entries
> Skips: 0 forward (0 - 0.0%)
> Input file ext4dd-2.6.35-trace.blktrace.2 added
>
>
> [sandeen@sandeen tmp]$ blkparse ext4dd-2.6.35+-trace | tail -n 13
> CPU2 (ext4dd-2.6.35+-trace):
> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
> Reads Requeued: 0 Writes Requeued: 0
> Reads Completed: 504, 2,016KiB Writes Completed: 246,500, 30,610MiB
> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
> Read depth: 0 Write depth: 0
> IO unplugs: 0 Timer unplugs: 0
>
> Throughput (R/W): 38KiB/s / 590,917KiB/s
> Events (ext4dd-2.6.35+-trace): 247,004 entries
> Skips: 0 forward (0 - 0.0%)
> Input file ext4dd-2.6.35+-trace.blktrace.2 added
>
>
> Ok not a -huge- difference in the overall stats, though the unpatched version did
> 24,000 more writes, which is 10% ...
>
> At the extremes in both cases there are 8-block and 256-block writes.
>
> 2.6.35:
>
> nr wrt size
> 25256 8
> 1701 16
> 1646 24
> 297 248
> ...
> 232657 256
>
>
> 2.6.35+:
>
> nr wrt size
> 4785 8
> 1732 16
> 357 24
> ...
> 50 248
> 237907 256
>
> So not a huge difference in the distribution really, though there were
> 20,000 more 1-block (8-sector) writes in the unpatched version.
>
> So a lot of 32-block writes turned into 1-block writes and smaller, I guess.
>
> To know for sure about alignment, what is your raid stripe unit, and is this
> filesystem on a partition? If so at what offset?

Would the stripe unit be the Block size in the following?

[root@i7test7 linux-git]# hptrc -s1 query arrays
ID Capacity(GB) Type Status Block Sector Cache Name
-------------------------------------------------------------------------------
1 500.00 RAID5 NORMAL 256k 512B WB RAID50_1
2 250.00 RAID5 NORMAL 256k 512B WB RAID500_1
3 2350.00 RAID5 NORMAL 256k 512B WB RAID5_1

The testing was done on the first 500 GB array.

And here's how the ext4 filesystem was created (using the
entire device):

[root@i7test7 linux-git]# mkfs.ext4 /dev/sde
mke2fs 1.41.10 (10-Feb-2009)
/dev/sde is entire device, not just one partition!
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
30523392 inodes, 122070144 blocks
6103507 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3726 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
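
(One thing I notice in that output: "Stride=0 blocks, Stripe width=0
blocks" means the filesystem was never told the RAID geometry. If I
read the mke2fs man page right, then for a 256k chunk and 7 data disks
(8-disk RAID5) with 4k blocks, the geometry-aware invocation would be
something like:

mkfs.ext4 -E stride=64,stripe-width=448 /dev/sde

i.e. stride = 256k/4k = 64 blocks per chunk, and stripe-width =
64 * 7 = 448 blocks per full stripe, so the allocator can try to keep
extents stripe-aligned.)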

-Thanks

-Bill



> -Eric
>
> >> I keep meaning to patch seekwatcher to color unaligned IOs differently,
> >> but without that we need the blktrace data to know if that's what's going
> >> on.
> >>
> >> It's interesting that the patched run is starting at block 0 while
> >> unpatched is starting further in (which would be a little slower at least)
> >>
> >> was there a fresh mkfs in between?
> >
> > No mkfs in between, and the original mkfs.ext4 was done without
> > any special options. I am using the noauto_da_alloc option on the
> > mount to work around another 9% performance hit between 2.6.31 and
> > 2.6.32, introduced by 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10
> > (ext4: Fix the alloc on close after a truncate hueristic). That
> > one only affected already existing files.
> >
> >> Thanks!
> >>
> >> -Eric

2010-08-31 03:29:35

by Eric Sandeen

Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:

> The .0 and .1 files were 0 bytes. There was also a 56 byte .3 file
> in each case. Did you also want to see those?

nah it's ok I filled it in.

I think I see at least 1 thing wrong in the writeback path, I may have
a patch for you yet tonight.

-Eric




2010-08-31 03:43:55

by Eric Sandeen

Subject: [PATCH] ext4: fix 50% disk write performance regression

Can you give this a shot?

The first hunk is, I think, the biggest problem. Even if
we get the max number of pages we need, we keep scanning forward
until "done" without doing any more actual, useful work.

The 2nd hunk is an oddity, some places assign nr_to_write
to LONG_MAX, and we get here and multiply -that- by 8... giving
us "-8" for nr_to_write, that can't help things when we
do later comparisons on that number...
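
(Strictly speaking that multiplication is signed overflow, which C
leaves undefined, but on two's-complement hardware it wraps exactly as
described; a throwaway demo:

#include <limits.h>
#include <stdio.h>

int main(void)
{
        long nr_to_write = LONG_MAX;

        /* LONG_MAX * 8 wraps around to -8, so every later comparison
         * against desired_nr_to_write goes the wrong way. */
        printf("%ld * 8 = %ld\n", nr_to_write, nr_to_write * 8);
        return 0;
}

prints -8 for the product on x86_64.)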

I also see us asking to find pages starting at "idx" and
the first dirty page we find is well ahead of that,
I'm not sure if that's indicative of a problem or not.

Anyway, want to give this a shot, in place of the patch you sent,
and see how it fares compared to stock and/or with your patch?

It's build-and-sanity tested but not really performance tested here.

Thanks,
-Eric

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4b8debe..33c2167 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
                                 break;
                         idx++;
                         num++;
-                        if (num >= max_pages)
-                                break;
+                        if (num >= max_pages) {
+                                pagevec_release(&pvec);
+                                return num;
+                        }
                 }
                 pagevec_release(&pvec);
         }
@@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
          * sbi->max_writeback_mb_bump whichever is smaller.
          */
         max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
-        if (!range_cyclic && range_whole)
+        if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
                 desired_nr_to_write = wbc->nr_to_write * 8;
         else
                 desired_nr_to_write = ext4_num_dirty_pages(inode, index,



2010-08-31 04:26:48

by Eric Sandeen

Subject: Re: [PATCH] ext4: fix 50% disk write performance regression

Eric Sandeen wrote:
> Can you give this a shot?
>
> The first hunk is, I think, the biggest problem. Even if
> we get the max number of pages we need, we keep scanning forward
> until "done" without doing any more actual, useful work.
>
> The 2nd hunk is an oddity, some places assign nr_to_write
> to LONG_MAX, and we get here and multiply -that- by 8... giving
> us "-8" for nr_to_write, that can't help things when we
> do later comparisons on that number...
>
> I also see us asking to find pages starting at "idx" and
> the first dirty page we find is well ahead of that,
> I'm not sure if that's indicative of a problem or not.
>
> Anyway, want to give this a shot, in place of the patch you sent,
> and see how it fares compared to stock and/or with your patch?
>
> It's build-and-sanity tested but not really performance tested here.
>
> Thanks,
> -Eric
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4b8debe..33c2167 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> break;
> idx++;
> num++;
> - if (num >= max_pages)
> - break;
> + if (num >= max_pages) {
> + pagevec_release(&pvec);
> + return num;
> + }
> }
> pagevec_release(&pvec);
> }
> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
> * sbi->max_writeback_mb_bump whichever is smaller.
> */
> max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> - if (!range_cyclic && range_whole)
> + if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
> desired_nr_to_write = wbc->nr_to_write * 8;

sorry no, this isn't right, we should just leave it at nr_to_write for the
LONG_MAX case, not go counting pages. And something odd is going on where we
are looking for dirty pages starting at an index we've already written out.

Maybe:

        if (!range_cyclic && range_whole) {
                if (wbc->nr_to_write != LONG_MAX)
                        desired_nr_to_write = wbc->nr_to_write * 8;
                else
                        desired_nr_to_write = wbc->nr_to_write;
        }

I'll have to look at this more when I'm not quite so sleepy, sorry. :)

-Eric

> else
> desired_nr_to_write = ext4_num_dirty_pages(inode, index,
>

2010-08-31 04:53:54

by Bill Fink

Subject: Re: [PATCH] ext4: fix 50% disk write performance regression

On Mon, 30 Aug 2010, Eric Sandeen wrote:

> Can you give this a shot?
>
> The first hunk is, I think, the biggest problem. Even if
> we get the max number of pages we need, we keep scanning forward
> until "done" without doing any more actual, useful work.
>
> The 2nd hunk is an oddity, some places assign nr_to_write
> to LONG_MAX, and we get here and multiply -that- by 8... giving
> us "-8" for nr_to_write, that can't help things when we
> do later comparisons on that number...
>
> I also see us asking to find pages starting at "idx" and
> the first dirty page we find is well ahead of that,
> I'm not sure if that's indicative of a problem or not.
>
> Anyway, want to give this a shot, in place of the patch you sent,
> and see how it fares compared to stock and/or with your patch?
>
> It's build-and-sanity tested but not really performance tested here.
>
> Thanks,
> -Eric

Great! It looks like that does the trick.

2.6.35 + your patch:

i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s

That's the same performance as with my patch, and pretty darn
close to the original 2.6.31 performance.

-Thanks a bunch

-Bill



> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4b8debe..33c2167 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> break;
> idx++;
> num++;
> - if (num >= max_pages)
> - break;
> + if (num >= max_pages) {
> + pagevec_release(&pvec);
> + return num;
> + }
> }
> pagevec_release(&pvec);
> }
> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
> * sbi->max_writeback_mb_bump whichever is smaller.
> */
> max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
> - if (!range_cyclic && range_whole)
> + if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
> desired_nr_to_write = wbc->nr_to_write * 8;
> else
> desired_nr_to_write = ext4_num_dirty_pages(inode, index,

2010-08-31 05:05:59

by Eric Sandeen

Subject: Re: [PATCH] ext4: fix 50% disk write performance regression

Bill Fink wrote:
> On Mon, 30 Aug 2010, Eric Sandeen wrote:
>
>> Can you give this a shot?
>>
>> The first hunk is, I think, the biggest problem. Even if
>> we get the max number of pages we need, we keep scanning forward
>> until "done" without doing any more actual, useful work.
>>
>> The 2nd hunk is an oddity, some places assign nr_to_write
>> to LONG_MAX, and we get here and multiply -that- by 8... giving
>> us "-8" for nr_to_write, that can't help things when we
>> do later comparisons on that number...
>>
>> I also see us asking to find pages starting at "idx" and
>> the first dirty page we find is well ahead of that,
>> I'm not sure if that's indicative of a problem or not.
>>
>> Anyway, want to give this a shot, in place of the patch you sent,
>> and see how it fares compared to stock and/or with your patch?
>>
>> It's build-and-sanity tested but not really performance tested here.
>>
>> Thanks,
>> -Eric
>
> Great! It looks like that does the trick.
>
> 2.6.35 + your patch:
>
> i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> 32768+0 records in
> 32768+0 records out
> 34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s
>
> That's the same performance as with my patch, and pretty darn
> close to the original 2.6.31 performance.

hah, that's good esp. considering my followup email that found
what I think is a problem with my patch. ;)

What happens if you change:

        if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
                desired_nr_to_write = wbc->nr_to_write * 8;
        else
                desired_nr_to_write = ext4_num_dirty_pages(inode, index,

to:

        if (!range_cyclic && range_whole) {
                if (wbc->nr_to_write != LONG_MAX)
                        desired_nr_to_write = wbc->nr_to_write * 8;
                else
                        desired_nr_to_write = wbc->nr_to_write;
        } else
                desired_nr_to_write = ext4_num_dirty_pages(inode, index,

and see how that fares? I think that makes a little more sense; if we
got there with LONG_MAX, that means "write everything" and there's no need
to bump it up or to go counting pages. It may not make any real difference.

But I'm seeing really weird behavior in writeback; it starts out nicely
writing 32768 pages at a time, and then goes all wonky, revisiting pages
it's already done and doing IO in little chunks. This is going to take
some staring I think.

-Eric



> -Thanks a bunch
>
> -Bill
>
>
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 4b8debe..33c2167 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1207,8 +1207,10 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>> break;
>> idx++;
>> num++;
>> - if (num >= max_pages)
>> - break;
>> + if (num >= max_pages) {
>> + pagevec_release(&pvec);
>> + return num;
>> + }
>> }
>> pagevec_release(&pvec);
>> }
>> @@ -3002,7 +3004,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>> * sbi->max_writeback_mb_bump whichever is smaller.
>> */
>> max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
:


2010-08-31 05:31:46

by Bill Fink

Subject: Re: [PATCH] ext4: fix 50% disk write performance regression

On Tue, 31 Aug 2010, Eric Sandeen wrote:

> Bill Fink wrote:
> >
> > Great! It looks like that does the trick.
> >
> > 2.6.35 + your patch:
> >
> > i7test7% dd if=/dev/zero of=/i7raid/bill/testfile1 bs=1M count=32768
> > 32768+0 records in
> > 32768+0 records out
> > 34359738368 bytes (34 GB) copied, 50.6702 s, 678 MB/s
> >
> > That's the same performance as with my patch, and pretty darn
> > close to the original 2.6.31 performance.
>
> Hah, that's good, especially considering my follow-up email found
> what I think is a problem with my patch. ;)
>
> What happens if you change:
>
>         if (!range_cyclic && range_whole && wbc->nr_to_write != LONG_MAX)
>                 desired_nr_to_write = wbc->nr_to_write * 8;
>         else
>                 desired_nr_to_write = ext4_num_dirty_pages(inode, index,
>
> to:
>
>         if (!range_cyclic && range_whole) {
>                 if (wbc->nr_to_write != LONG_MAX)
>                         desired_nr_to_write = wbc->nr_to_write * 8;
>                 else
>                         desired_nr_to_write = wbc->nr_to_write;
>         } else
>                 desired_nr_to_write = ext4_num_dirty_pages(inode, index,
>
> and see how that fares? I think that makes a little more sense: if we
> got there with LONG_MAX, that means "write everything", and there's no
> need to bump it up or to go counting pages. It may not make any real
> difference.

That's also fine.

-Bill




2010-09-09 00:33:51

by Daniel Taylor

[permalink] [raw]
Subject: RE: [PATCH] ext4: fix 50% disk write performance regression

Just wondering if this patch is adequate or if there's more to come.

I want to put a fix into our 2.6.32 kernel.

Thanks.


2010-09-09 03:29:48

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix 50% disk write performance regression

Daniel Taylor wrote:
> Just wondering if this patch is adequate or if there's more to come.
>
> I want to put a fix into our 2.6.32 kernel.
>
> Thanks.
>
>
I'm working on a couple more things, but I think these two fixes are
pretty straightforward if you really want to throw something in ahead
of upstream. (I think this applies; it's hand-edited from my full
series a bit.)

I've got a couple other things that seem to keep the writeback
index marching forward properly...

Index: build/fs/ext4/inode.c
===================================================================
--- build.orig/fs/ext4/inode.c
+++ build/fs/ext4/inode.c
@@ -1230,8 +1233,10 @@ static pgoff_t ext4_num_dirty_pages(stru
                                 break;
                         idx++;
                         num++;
-                        if (num >= max_pages)
+                        if (num >= max_pages) {
+                                done = 1;
                                 break;
+                        }
                 }
                 pagevec_release(&pvec);
         }
@@ -2912,9 +2909,13 @@ static int ext4_da_writepages(struct add
          * sbi->max_writeback_mb_bump whichever is smaller.
          */
         max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
-        if (!range_cyclic && range_whole)
-                desired_nr_to_write = wbc->nr_to_write * 8;
-        else
+
+        if (!range_cyclic && range_whole) {
+                if (wbc->nr_to_write == LONG_MAX)
+                        desired_nr_to_write = wbc->nr_to_write;
+                else
+                        desired_nr_to_write = wbc->nr_to_write * 8;
+        } else
                 desired_nr_to_write = ext4_num_dirty_pages(inode, index,
                                                            max_pages);
         if (desired_nr_to_write > max_pages)
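
To spell out the first hunk outside kernel context: breaking out of the
inner for loop only ends the current batch, so unless "done" is set
first, the outer while loop keeps fetching and scanning batches all the
way to the end of the mapping after the count is already known. A
self-contained sketch of the same loop shape (all names and numbers
here are hypothetical stand-ins, not kernel APIs):

#include <stdio.h>

#define BATCH_SIZE  16
#define TOTAL_DIRTY 1000        /* pretend dirty-page population */

/* hypothetical stand-in for pagevec_lookup_tag(): hands back up to
 * BATCH_SIZE "dirty pages" per call until TOTAL_DIRTY are consumed */
static int next_batch(int *cursor)
{
        int left = TOTAL_DIRTY - *cursor;
        int n = left < BATCH_SIZE ? left : BATCH_SIZE;

        *cursor += n;
        return n;
}

static int count_dirty(int max_pages, int *batches_scanned)
{
        int cursor = 0, num = 0, done = 0;

        while (!done) {
                int n = next_batch(&cursor);

                if (n == 0)
                        break;
                (*batches_scanned)++;
                for (int i = 0; i < n; i++) {
                        num++;
                        if (num >= max_pages) {
                                done = 1;       /* stop the outer loop too */
                                break;
                        }
                }
                /* per-batch cleanup (pagevec_release() in the real
                 * code) still runs on this early-exit path */
        }
        return num;
}

int main(void)
{
        int batches = 0;
        int num = count_dirty(32, &batches);

        /* with done = 1 this scans 2 batches; with a bare break the
         * outer loop would keep fetching all 63 batches */
        printf("counted %d pages in %d batches\n", num, batches);
        return 0;
}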

-Eric
