2007-11-12 18:25:56

by Alexey Dobriyan

[permalink] [raw]
Subject: Major mke2fs slowdown (reproducible, bisected)

Cross-compile farm here migrated to .ccache and build dir on separate
disks and now I have a way to blow up .ccache without waiting half an
hour for rm(1) to finish. It's called mke2fs(8).

However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
several fat cross-compile builds. Normally it takes ~11 seconds to
finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
"Bias the placement of kernel pages at lower PFNs" it takes several
minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
I'm pretty sure bisection wasn't screwed up.


Details:

Core 2 Duo on x86_64, no debugging
4G RAM
30G ext2 .ccache partition with noatime
100G ext2 partition with source tree and build dirs with noatime

I build alpha-allnoconfig, alpha-defconfig and 4 allmodconfigs
(SMP=y/n x DEBUG_KERNEL=y/n).

Right after compilation finishes, free(1) reports more or less the same
picture (VM hackers, please, tell me which info you need):

             total       used       free     shared    buffers     cached
Mem:       4032320    2802604    1229716          0      97160    2424816
-/+ buffers/cache:     280628    3751692
Swap:      7823644          0    7823644

Last steps of build script are:

umount /home/ad/.ccache
sudo mkfs.ext2 -m 0 <=== this is slow

I can prepare a standalone script with more affordable x86_64 configs if
needed.


commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be
Author: Mel Gorman <[email protected]>
Date: Tue Oct 16 01:25:54 2007 -0700

Bias the placement of kernel pages at lower PFNs

This patch chooses blocks with lower PFNs when placing kernel allocations.
This is particularly important during fallback in low memory situations to
stop unmovable pages being placed throughout the entire address space.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 676aec9..e1d87ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -765,6 +765,23 @@ int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
 	return move_freepages(zone, start_page, end_page, migratetype);
 }
 
+/* Return the page with the lowest PFN in the list */
+static struct page *min_page(struct list_head *list)
+{
+	unsigned long min_pfn = -1UL;
+	struct page *min_page = NULL, *page;
+
+	list_for_each_entry(page, list, lru) {
+		unsigned long pfn = page_to_pfn(page);
+		if (pfn < min_pfn) {
+			min_pfn = pfn;
+			min_page = page;
+		}
+	}
+
+	return min_page;
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 						int start_migratetype)
@@ -795,8 +812,11 @@ retry:
 		if (list_empty(&area->free_list[migratetype]))
 			continue;
 
+		/* Bias kernel allocations towards low pfns */
 		page = list_entry(area->free_list[migratetype].next,
 					struct page, lru);
+		if (unlikely(start_migratetype != MIGRATE_MOVABLE))
+			page = min_page(&area->free_list[migratetype]);
 		area->nr_free--;
 
 		/*


2007-11-12 18:39:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)



On Mon, 12 Nov 2007, Alexey Dobriyan wrote:
>
> Cross-compile farm here migrated to .ccache and build dir on separate
> disks and now I have a way to blow up .ccache without waiting half an
> hour for rm(1) to finish. It's called mke2fs(8).
>
> However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
> several fat cross-compile builds. Normally it takes ~11 seconds to
> finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
> "Bias the placement of kernel pages at lower PFNs" it takes several
> minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
> I'm pretty sure bisection wasn't screwed up.

Can you (just to make sure) do a "git revert" of this commit on top of the
current tree, and verify that that makes it all work fine again too? If
so, let's just revert it.

I just want to make sure that there isn't some subtle interaction with
anything else in there.

Linus

2007-11-12 21:34:53

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)

On Mon, Nov 12, 2007 at 10:39:15AM -0800, Linus Torvalds wrote:
> On Mon, 12 Nov 2007, Alexey Dobriyan wrote:
> >
> > Cross-compile farm here migrated to .ccache and build dir on separate
> > disks and now I have a way to blow up .ccache without waiting half an
> > hour for rm(1) to finish. It's called mke2fs(8).
> >
> > However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
> > several fat cross-compile builds. Normally it takes ~11 seconds to
> > finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
> > "Bias the placement of kernel pages at lower PFNs" it takes several
> > minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
> > I'm pretty sure bisection wasn't screwed up.
>
> Can you (just to make sure) do a "git revert" of this commit on top of the
> current tree, and verify that that makes it all work fine again too? If
> so, let's just revert it.
>
> I just want to make sure that there isn't some subtle interaction with
> anything else in there.

OK, with 2.6.24-rc2-6e800af233e0bdf108efb7bd23c11ea6fa34cdeb
mkfs took 4m6.915s. With just the "lower PFNs" patch reverted it's
back to 10 seconds.

2007-11-12 22:15:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)



On Tue, 13 Nov 2007, Alexey Dobriyan wrote:
>
> OK, with 2.6.24-rc2-6e800af233e0bdf108efb7bd23c11ea6fa34cdeb
> mkfs took 4m6.915s. With just the "lower PFNs" patch reverted it's
> back to 10 seconds.

Ok, I reverted it. Thanks for double-checking.

Mel, I guess it's in your court.

Linus

2007-11-13 16:25:25

by Andi Kleen

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)

Alexey Dobriyan <[email protected]> writes:
>
> +/* Return the page with the lowest PFN in the list */
> +static struct page *min_page(struct list_head *list)
> +{
> +	unsigned long min_pfn = -1UL;
> +	struct page *min_page = NULL, *page;
> +
> +	list_for_each_entry(page, list, lru) {
> +		unsigned long pfn = page_to_pfn(page);
> +		if (pfn < min_pfn) {
> +			min_pfn = pfn;
> +			min_page = page;
> +		}
> +	}
> +
> +	return min_page;
> +}
> +
>  /* Remove an element from the buddy allocator from the fallback list */
>  static struct page *__rmqueue_fallback(struct zone *zone, int order,
>  						int start_migratetype)
> @@ -795,8 +812,11 @@ retry:
>  		if (list_empty(&area->free_list[migratetype]))
>  			continue;
>  
> +		/* Bias kernel allocations towards low pfns */
>  		page = list_entry(area->free_list[migratetype].next,
>  					struct page, lru);
> +		if (unlikely(start_migratetype != MIGRATE_MOVABLE))
> +			page = min_page(&area->free_list[migratetype]);

Do I misread this, or does it really turn the O(1) buddy allocation into
a "search whole free list" algorithm? Even as fallback that looks like
a quite extreme thing to do.

-Andi

2007-11-13 16:54:25

by mel

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)

Hi Alexey,

On (12/11/07 21:25), Alexey Dobriyan didst pronounce:
> Cross-compile farm here migrated to .ccache and build dir on separate
> disks and now I have a way to blow up .ccache without waiting half an
> hour for rm(1) to finish. It's called mke2fs(8).
>
> However, in e.g. 2.6.24-rc2 mke2fs is amazingly slow if done right after
> several fat cross-compile builds. Normally it takes ~11 seconds to
> finish. After commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be aka
> "Bias the placement of kernel pages at lower PFNs" it takes several
> minutes. 2.6.24-rc2 without this patch also gives normal mkfs speeds.
> I'm pretty sure bisection wasn't screwed up.
>
>
> Details:
>
> Core 2 Duo on x86_64, no debugging
> 4G RAM
> 30G ext2 .ccache partition with noatime
> 100G ext2 partition with source tree and build dirs with noatime
>
> I build alpha-allnoconfig, alpha-defconfig and 4 allmodconfigs
> (SMP=y/n x DEBUG_KERNEL=y/n).
>
> Right after compilation finishes, free(1) reports more or less the same
> picture (VM hackers, please, tell me which info you need):
>
>              total       used       free     shared    buffers     cached
> Mem:       4032320    2802604    1229716          0      97160    2424816
> -/+ buffers/cache:     280628    3751692
> Swap:      7823644          0    7823644
>
> Last steps of build script are:
>
> umount /home/ad/.ccache
> sudo mkfs.ext2 -m 0 <=== this is slow
>

Thanks very much for the report and the bisect. I spent the day trying
to reproduce it but I'm having trouble seeing the same problem using just
mke2fs. I've tried

Pentium III x86 machine with 1GB of RAM, 9GB partition
4-way Opteron with 8GB RAM, 10GB partition
2-way Opteron with 2GB RAM, 10GB partition
Pentium D (dual core) with 2GB RAM, 128GB partition

In all cases, the comparison between 2.6.23, latest git, and latest git
with the patch reverted came out the same. For example, on the Pentium D, I got

2.6.23: 95.672 real, 0.068 user, 10.334 sys
2.6.24-rc2-git: 96.112 real, 0.08 user, 10.664 sys
2.6.24-rc2-revert: 96.182 real, 0.072 user, 10.602 sys

This is an average of 5 runs on a 128GB partition. Somewhat unexpectedly,
the revert was fractionally slower. The deviation between runs was around
the 0.4 second mark so the differences appear to be in the noise.

On the other machines, the reverted version was slightly faster, but the
difference was about 0.5% of overall running time, nothing like the massive
slowdown you were seeing. Clearly there is still a problem, because reverting
the patch fixes it for you. As I write this, it occurs to me that it might be
because your compile job has created very long free lists and searching them
is causing problems.

Can you post the contents of /proc/buddyinfo before and after you run
mke2fs? It will give an indication of how long the linked lists being
searched are. After I push send here, I'll be trying the tests after
running compile jobs similar to yours to see if that reproduces the problem.

Here are some other questions I hope you can answer just to eliminate them
as possibilities. Can you tell me what sort of disk driver you are using
(results here are for sata_nv)? Are you using RAID or MD? Is anything running
in the background while mke2fs is running? What is the output of mke2fs -V,
gcc --version and ld -v? Finally, can you mail me your .config and I'll try
it on my machine here.

In the meantime, it is safe to revert this patch. Andy Whitcroft tested
the behaviour of anti-fragmentation on a number of machines and while
hugepage allocation success rates are adversely affected without it, they
are still pretty decent. We will investigate a less expensive way of
achieving the same effect as the patch without the potentially long searches.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-11-14 11:17:53

by mel

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)

On (13/11/07 16:54), Mel Gorman didst pronounce:

> fixes your problem. As I write this, it occurs to me that it might be
> because your compile job has created very long free lists and searching them
> is causing problems.

This indeed did appear to be the problem. When a basic compile-job was
run first, the mke2fs times on a vanilla kernel were

119.96 real, 0.08 user, 19.85 sys
194.91 real, 0.11 user, 24.71 sys
102.24 real, 0.08 user, 10.47 sys
104.66 real, 0.08 user, 10.45 sys
100.77 real, 0.13 user, 10.54 sys

and with the patch reverted was

121.83 real, 0.15 user, 11.16 sys
126.68 real, 0.10 user, 10.78 sys
104.47 real, 0.09 user, 10.48 sys
104.75 real, 0.10 user, 10.66 sys
106.06 real, 0.09 user, 10.55 sys

The high sys times in the first runs are due to the number of fallbacks
that take place: creating a filesystem pins an unusually large number of
pages for a short period of time. The search times, normally unmeasurable,
then show up in the profiles.

Reverting the patch is still the right solution. We will find an alternative
way of biasing the placement of pages without the expensive search.

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-11-14 11:25:42

by mel

[permalink] [raw]
Subject: Re: Major mke2fs slowdown (reproducible, bisected)

On (13/11/07 17:25), Andi Kleen didst pronounce:
> Alexey Dobriyan <[email protected]> writes:
> >
> > +/* Return the page with the lowest PFN in the list */
> > +static struct page *min_page(struct list_head *list)
> > +{
> > +	unsigned long min_pfn = -1UL;
> > +	struct page *min_page = NULL, *page;
> > +
> > +	list_for_each_entry(page, list, lru) {
> > +		unsigned long pfn = page_to_pfn(page);
> > +		if (pfn < min_pfn) {
> > +			min_pfn = pfn;
> > +			min_page = page;
> > +		}
> > +	}
> > +
> > +	return min_page;
> > +}
> > +
> >  /* Remove an element from the buddy allocator from the fallback list */
> >  static struct page *__rmqueue_fallback(struct zone *zone, int order,
> >  						int start_migratetype)
> > @@ -795,8 +812,11 @@ retry:
> >  		if (list_empty(&area->free_list[migratetype]))
> >  			continue;
> >  
> > +		/* Bias kernel allocations towards low pfns */
> >  		page = list_entry(area->free_list[migratetype].next,
> >  					struct page, lru);
> > +		if (unlikely(start_migratetype != MIGRATE_MOVABLE))
> > +			page = min_page(&area->free_list[migratetype]);
>
> Do I misread this, or does it really turn the O(1) buddy allocation into
> a "search whole free list" algorithm? Even as fallback that looks like
> a quite extreme thing to do.
>

It's extreme, but not *quite* as extreme as you imply. The whole set of
free lists is not searched, just one list at a specific order, so it is
"search a portion of the free lists". It happens only for non-movable
allocations (usually the minority), and even then only in fallback (itself
quite rare in almost all cases I've seen).

I did not detect the problem earlier because triggering it is not just a
matter of creating a large number of pinned allocations; it also depends on
the type of workload preceding it. If mke2fs were long-lived, it might not
even have been noticed. When it is run more than once, the fallbacks have
all been dealt with and times go back to normal.

The patch is now reverted and I don't expect to try bringing it back.
There are ways to bias the placement of pages as the patch intended without
doing an expensive search.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab