LinuxLists.cc - Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Mon, Mar 02, 2009 at 11:21:22AM +0000, Mel Gorman wrote:
> (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
>
> On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > can see how time is being spent and why it might have gotten worse?
> >
> > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > patches applied to 2.6.29-rc6.
> > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > line with addr2line.
> >
> > You can download the oprofile data and vmlinux from below link,
> > http://www.filefactory.com/file/af2330b/
> >
>
> Perfect, thanks a lot for profiling this. It is a big help in figuring out
> how the allocator is actually being used for your workloads.
>
> The OLTP results had the following things to say about the page allocator.

Is this OLTP, or UDP-U-4K?

> Samples in the free path
> vanilla: 6207
> mg-v2: 4911
> Samples in the allocation path
> vanilla 19948
> mg-v2: 14238
>
> This is based on glancing at the following graphs and not counting the VM
> counters as it can't be determined which samples are due to the allocator
> and which are due to the rest of the VM accounting.
>
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
>
> So the path costs are reduced in both cases. Whatever caused the regression
> there doesn't appear to be in time spent in the allocator but due to
> something else I haven't imagined yet. Other oddness
>
> o According to the profile, something like 45% of time is spent entering
> the __alloc_pages_nodemask() function. Function entry costs but not
> that much. Another significant part appears to be in checking a simple
> mask. That doesn't make much sense to me so I don't know what to do with
> that information yet.
>
> o In get_page_from_freelist(), 9% of the time is spent deleting a page
> from the freelist.
>
> Neither of these make sense, we're not spending time where I would expect
> to at all. One of two things are happening. Something like cache misses or
> bounces are dominating for some reason that is specific to this machine. Cache
> misses are one possibility that I'll check out. The other is that the sample
> rate is too low and the profile counts are hence misleading.
>
> Question 1: Would it be possible to increase the sample rate and track cache
> misses as well please?

If the events are constantly biased, I don't think sample rate will
help. I don't know how the internals of profiling counters work exactly,
but you would expect yes cache misses, and stalls from any number of
different resources could put results in funny places.

Intel's OLTP workload is very sensitive to cacheline footprint of the
kernel, and if you touch some extra cachelines at point A, it can just
result in profile hits getting distributed all over the place. Profiling
cache misses might help, but probably see a similar phenomenon.

I can't remember, does your latest patchset include any patches that change
the possible order in which pages move around? Or is it just made up of
straight-line performance improvement of existing implementation?

2009-03-02 12:16:45

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> On Mon, Mar 02, 2009 at 11:21:22AM +0000, Mel Gorman wrote:
> > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> >
> > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > can see how time is being spent and why it might have gotten worse?
> > >
> > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > patches applied to 2.6.29-rc6.
> > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > line with addr2line.
> > >
> > > You can download the oprofile data and vmlinux from below link,
> > > http://www.filefactory.com/file/af2330b/
> > >
> >
> > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > how the allocator is actually being used for your workloads.
> >
> > The OLTP results had the following things to say about the page allocator.
>
> Is this OLTP, or UDP-U-4K?
>

OLTP. I didn't do a comparison for UDP due to uncertainity of what I was
looking at other than to note that high-order allocations may be a
bigger deal there.

>
> > Samples in the free path
> > vanilla: 6207
> > mg-v2: 4911
> > Samples in the allocation path
> > vanilla 19948
> > mg-v2: 14238
> >
> > This is based on glancing at the following graphs and not counting the VM
> > counters as it can't be determined which samples are due to the allocator
> > and which are due to the rest of the VM accounting.
> >
> > http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> > http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
> >
> > So the path costs are reduced in both cases. Whatever caused the regression
> > there doesn't appear to be in time spent in the allocator but due to
> > something else I haven't imagined yet. Other oddness
> >
> > o According to the profile, something like 45% of time is spent entering
> > the __alloc_pages_nodemask() function. Function entry costs but not
> > that much. Another significant part appears to be in checking a simple
> > mask. That doesn't make much sense to me so I don't know what to do with
> > that information yet.
> >
> > o In get_page_from_freelist(), 9% of the time is spent deleting a page
> > from the freelist.
> >
> > Neither of these make sense, we're not spending time where I would expect
> > to at all. One of two things are happening. Something like cache misses or
> > bounces are dominating for some reason that is specific to this machine. Cache
> > misses are one possibility that I'll check out. The other is that the sample
> > rate is too low and the profile counts are hence misleading.
> >
> > Question 1: Would it be possible to increase the sample rate and track cache
> > misses as well please?
>
> If the events are constantly biased, I don't think sample rate will
> help. I don't know how the internals of profiling counters work exactly,
> but you would expect yes cache misses, and stalls from any number of
> different resources could put results in funny places.
>

Ok, if it's stalls that are the real factor then yes, increasing the
sample rate might not help. However, the same rates for instructions
were so low, I thought it might be a combination of both low sample
count and stalls happening at particular places. A profile of cache
misses will still be useful as it'll say in general if there is a marked
increase overall or not.

> Intel's OLTP workload is very sensitive to cacheline footprint of the
> kernel, and if you touch some extra cachelines at point A, it can just
> result in profile hits getting distributed all over the place. Profiling
> cache misses might help, but probably see a similar phenomenon.
>

Interesting, this might put a hole in replacing the gfp_zone() with a
version that uses an additional (or maybe two depending on alignment)
cacheline.

> I can't remember, does your latest patchset include any patches that change
> the possible order in which pages move around? Or is it just made up of
> straight-line performance improvement of existing implementation?
>

It shouldn't affect order. I did a test a while ago to make sure pages
were still coming back in contiguous order as some IO cards depend on this
behaviour for performance. The intention for the first pass is a straight-line
performance improvement.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-03 04:42:56

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Mon, Mar 02, 2009 at 12:16:33PM +0000, Mel Gorman wrote:
> On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > how the allocator is actually being used for your workloads.
> > >
> > > The OLTP results had the following things to say about the page allocator.
> >
> > Is this OLTP, or UDP-U-4K?
> >
>
> OLTP. I didn't do a comparison for UDP due to uncertainity of what I was
> looking at other than to note that high-order allocations may be a
> bigger deal there.

OK.

> > > Question 1: Would it be possible to increase the sample rate and track cache
> > > misses as well please?
> >
> > If the events are constantly biased, I don't think sample rate will
> > help. I don't know how the internals of profiling counters work exactly,
> > but you would expect yes cache misses, and stalls from any number of
> > different resources could put results in funny places.
> >
>
> Ok, if it's stalls that are the real factor then yes, increasing the
> sample rate might not help. However, the same rates for instructions
> were so low, I thought it might be a combination of both low sample
> count and stalls happening at particular places. A profile of cache
> misses will still be useful as it'll say in general if there is a marked
> increase overall or not.

OK.

> > Intel's OLTP workload is very sensitive to cacheline footprint of the
> > kernel, and if you touch some extra cachelines at point A, it can just
> > result in profile hits getting distributed all over the place. Profiling
> > cache misses might help, but probably see a similar phenomenon.
> >
>
> Interesting, this might put a hole in replacing the gfp_zone() with a
> version that uses an additional (or maybe two depending on alignment)
> cacheline.

Well... I still think it is probably a good idea. Firstly is that
it probably saves a line of icache too. Secondly, I guess adding a
*single* extra readonly cacheline is probably not such a problem
even for this workload. I was more thinking of if you changed the
pattern in which pages are allocated (ie. like the hot/cold thing),
or if some change resulted in more cross-cpu operations then it
could result in worse cache efficiency.

But you never know, it might be one patch to look at.

> > I can't remember, does your latest patchset include any patches that change
> > the possible order in which pages move around? Or is it just made up of
> > straight-line performance improvement of existing implementation?
> >
>
> It shouldn't affect order. I did a test a while ago to make sure pages
> were still coming back in contiguous order as some IO cards depend on this
> behaviour for performance. The intention for the first pass is a straight-line
> performance improvement.

OK, but the dynamic behaviour too. Free page A, free page B, allocate page
A allocate page B etc.

The hot/cold removal would be an obvious example of what I mean, although
that wasn't included in this recent patchset anyway.

2009-03-03 08:25:29

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Tue, Mar 03, 2009 at 05:42:40AM +0100, Nick Piggin wrote:
> On Mon, Mar 02, 2009 at 12:16:33PM +0000, Mel Gorman wrote:
> > On Mon, Mar 02, 2009 at 12:39:36PM +0100, Nick Piggin wrote:
> > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > how the allocator is actually being used for your workloads.
> > > >
> > > > The OLTP results had the following things to say about the page allocator.
> > >
> > > Is this OLTP, or UDP-U-4K?
> > >
> >
> > OLTP. I didn't do a comparison for UDP due to uncertainity of what I was
> > looking at other than to note that high-order allocations may be a
> > bigger deal there.
>
> OK.
>
> > > > Question 1: Would it be possible to increase the sample rate and track cache
> > > > misses as well please?
> > >
> > > If the events are constantly biased, I don't think sample rate will
> > > help. I don't know how the internals of profiling counters work exactly,
> > > but you would expect yes cache misses, and stalls from any number of
> > > different resources could put results in funny places.
> > >
> >
> > Ok, if it's stalls that are the real factor then yes, increasing the
> > sample rate might not help. However, the same rates for instructions
> > were so low, I thought it might be a combination of both low sample
> > count and stalls happening at particular places. A profile of cache
> > misses will still be useful as it'll say in general if there is a marked
> > increase overall or not.
>
> OK.
>

As it turns out, my own tests here are showing increased cache misses so
I'm checking out why. One possibility is that the per-cpu structures are
increased in size to avoid a list search during allocation.

>
> > > Intel's OLTP workload is very sensitive to cacheline footprint of the
> > > kernel, and if you touch some extra cachelines at point A, it can just
> > > result in profile hits getting distributed all over the place. Profiling
> > > cache misses might help, but probably see a similar phenomenon.
> > >
> >
> > Interesting, this might put a hole in replacing the gfp_zone() with a
> > version that uses an additional (or maybe two depending on alignment)
> > cacheline.
>
> Well... I still think it is probably a good idea. Firstly is that
> it probably saves a line of icache too. Secondly, I guess adding a
> *single* extra readonly cacheline is probably not such a problem
> even for this workload. I was more thinking of if you changed the
> pattern in which pages are allocated (ie. like the hot/cold thing),

I need to think about it again, but I think the allocation/free pattern
should be more or less the same.

> or if some change resulted in more cross-cpu operations then it
> could result in worse cache efficiency.
>

It occured to me before sleeping last night that there could be a lot
of cross-cpu operations taking place in the buddy allocator itself. When
bulk-freeing pages, we have to examine all the buddies and merge them. In
the case of a freshly booted system, many of the pages of interest will be
within the same MAX_ORDER blocks. If multiple CPUs bulk free their pages,
they'll bounce the struct pages between each other a lot as we are writing
those cache lines. However, this would be incurring with or without my patches.

It's an old observation of buddy that it can spend its time merging and
splitting buddies but it was in plain computational cost. I'm not sure if
the potential SMP-badness of it was also considered.

> But you never know, it might be one patch to look at.
>

I'm shuffling the patches that might affect cache like this towards the
end of the end where they'll be easier to bisect.

> > > I can't remember, does your latest patchset include any patches that change
> > > the possible order in which pages move around? Or is it just made up of
> > > straight-line performance improvement of existing implementation?
> > >
> >
> > It shouldn't affect order. I did a test a while ago to make sure pages
> > were still coming back in contiguous order as some IO cards depend on this
> > behaviour for performance. The intention for the first pass is a straight-line
> > performance improvement.
>
> OK, but the dynamic behaviour too. Free page A, free page B, allocate page
> A allocate page B etc.
>
> The hot/cold removal would be an obvious example of what I mean, although
> that wasn't included in this recent patchset anyway.
>

I get your point though, I'll keep it in mind. I've gone from plain
"reduce the clock cycles" to "reduce the cache misses" as if OLTP is
sensitive to this it has to be addressed as well.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-03 09:05:35

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Tue, Mar 03, 2009 at 08:25:12AM +0000, Mel Gorman wrote:
> On Tue, Mar 03, 2009 at 05:42:40AM +0100, Nick Piggin wrote:
> > or if some change resulted in more cross-cpu operations then it
> > could result in worse cache efficiency.
> >
>
> It occured to me before sleeping last night that there could be a lot
> of cross-cpu operations taking place in the buddy allocator itself. When
> bulk-freeing pages, we have to examine all the buddies and merge them. In
> the case of a freshly booted system, many of the pages of interest will be
> within the same MAX_ORDER blocks. If multiple CPUs bulk free their pages,
> they'll bounce the struct pages between each other a lot as we are writing
> those cache lines. However, this would be incurring with or without my patches.

Oh yes it would definitely be a factor I think.

> > OK, but the dynamic behaviour too. Free page A, free page B, allocate page
> > A allocate page B etc.
> >
> > The hot/cold removal would be an obvious example of what I mean, although
> > that wasn't included in this recent patchset anyway.
> >
>
> I get your point though, I'll keep it in mind. I've gone from plain
> "reduce the clock cycles" to "reduce the cache misses" as if OLTP is
> sensitive to this it has to be addressed as well.

OK cool. The patchset did look pretty good for reducing clock cycles
though, so hopefully it turns out to be something simple.

2009-03-03 13:51:19

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Tue, Mar 03, 2009 at 10:04:42AM +0100, Nick Piggin wrote:
> On Tue, Mar 03, 2009 at 08:25:12AM +0000, Mel Gorman wrote:
> > On Tue, Mar 03, 2009 at 05:42:40AM +0100, Nick Piggin wrote:
> > > or if some change resulted in more cross-cpu operations then it
> > > could result in worse cache efficiency.
> > >
> >
> > It occured to me before sleeping last night that there could be a lot
> > of cross-cpu operations taking place in the buddy allocator itself. When
> > bulk-freeing pages, we have to examine all the buddies and merge them. In
> > the case of a freshly booted system, many of the pages of interest will be
> > within the same MAX_ORDER blocks. If multiple CPUs bulk free their pages,
> > they'll bounce the struct pages between each other a lot as we are writing
> > those cache lines. However, this would be incurring with or without my patches.
>
> Oh yes it would definitely be a factor I think.
>

It's on the list for a second or third pass to investigate.

>
> > > OK, but the dynamic behaviour too. Free page A, free page B, allocate page
> > > A allocate page B etc.
> > >
> > > The hot/cold removal would be an obvious example of what I mean, although
> > > that wasn't included in this recent patchset anyway.
> > >
> >
> > I get your point though, I'll keep it in mind. I've gone from plain
> > "reduce the clock cycles" to "reduce the cache misses" as if OLTP is
> > sensitive to this it has to be addressed as well.
>
> OK cool. The patchset did look pretty good for reducing clock cycles
> though, so hopefully it turns out to be something simple.
>

I'm hoping it is. I noticed a few oddities where we use more cache than we
need to that I cleaned up. However, the strongest possibility of being a
problem is actually the patch that removes the list-search for a page of a
given migratetype in the allocation path. The fix simplifies the allocation
path but increases the complexity of the bulk-free path by quite a bit and
increases the number of cache lines that are accessed. Worse, the fix grows
the per-cpu structure from one cache line to two on x86-64 NUMA machines
which I think is significant. I'm testing that at the moment but I might
end up dropping the patch from the first pass as a result and confine
the set to "obvious" wins.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-03 16:42:58

by Christoph Lameter

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Mon, 2 Mar 2009, Mel Gorman wrote:

> Going by the vanilla kernel, a *large* amount of time is spent doing
> high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> pages are required for the packets? That means high-order allocations and
> high contention on the zone-list. That is bad obviously and has implications
> for the SLUB-passthru patch because whether 8K allocations are handled by
> SL*B or the page allocator has a big impact on locking.
>
> Next, a little over 50% of the cost get_page_from_freelist() is being spent
> acquiring the zone spinlock. The implication is that the SL*B allocators
> passing in order-1 allocations to the page allocator are currently going to
> hit scalability problems in a big way. The solution may be to extend the
> per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> check it out.

Then we are increasing the number of queues dramatically in the page
allocator. More of a memory sink. Less cache hotness.

2009-03-03 21:48:55

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Tue, Mar 03, 2009 at 11:31:46AM -0500, Christoph Lameter wrote:
> On Mon, 2 Mar 2009, Mel Gorman wrote:
>
> > Going by the vanilla kernel, a *large* amount of time is spent doing
> > high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> > the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> > pages are required for the packets? That means high-order allocations and
> > high contention on the zone-list. That is bad obviously and has implications
> > for the SLUB-passthru patch because whether 8K allocations are handled by
> > SL*B or the page allocator has a big impact on locking.
> >
> > Next, a little over 50% of the cost get_page_from_freelist() is being spent
> > acquiring the zone spinlock. The implication is that the SL*B allocators
> > passing in order-1 allocations to the page allocator are currently going to
> > hit scalability problems in a big way. The solution may be to extend the
> > per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> > check it out.
>
> Then we are increasing the number of queues dramatically in the page
> allocator. More of a memory sink. Less cache hotness.
>

It doesn't have to be more queues and networking is doing order-1 allocations
based on a quick instrumentation so we might be justified in doing this to
avoid contending excessively on the zone lock.

Without the patchset, we do a search of the pcp lists for a page of the
appropriate migrate type. There is a patch that removes this search at
the cost of one cache line per CPU and it works reasonably well.

However, if the search was left in place, you can add pages of other orders
and just search for those which should be a lot less costly. Yes, the search
is unfortunate but you avoid acquiring the zone lock without increasing
the size of the per-cpu structure. The search will require cache lines it's
probably less than acquiring teh zone lock and going through the whole buddy
allocator for order-1 pages.

2009-03-04 02:05:44

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
>
> On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > can see how time is being spent and why it might have gotten worse?
> >
> > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > patches applied to 2.6.29-rc6.
> > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > line with addr2line.
> >
> > You can download the oprofile data and vmlinux from below link,
> > http://www.filefactory.com/file/af2330b/
> >
>
> Perfect, thanks a lot for profiling this. It is a big help in figuring out
> how the allocator is actually being used for your workloads.
>
> The OLTP results had the following things to say about the page allocator.
In case we might mislead you guys, I want to clarify that here OLTP is
sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
memory.

Ma Chinang, another Intel guy, does work on the famous OLTP running.

>
> Samples in the free path
> vanilla: 6207
> mg-v2: 4911
> Samples in the allocation path
> vanilla 19948
> mg-v2: 14238
>
> This is based on glancing at the following graphs and not counting the VM
> counters as it can't be determined which samples are due to the allocator
> and which are due to the rest of the VM accounting.
>
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-vanilla-oltp.png
> http://www.csn.ul.ie/~mel/postings/lin-20090228/free_pages-mgv2-oltp.png
>
> So the path costs are reduced in both cases. Whatever caused the regression
> there doesn't appear to be in time spent in the allocator but due to
> something else I haven't imagined yet. Other oddness
>
> o According to the profile, something like 45% of time is spent entering
> the __alloc_pages_nodemask() function. Function entry costs but not
> that much. Another significant part appears to be in checking a simple
> mask. That doesn't make much sense to me so I don't know what to do with
> that information yet.
>
> o In get_page_from_freelist(), 9% of the time is spent deleting a page
> from the freelist.
>
> Neither of these make sense, we're not spending time where I would expect
> to at all. One of two things are happening. Something like cache misses or
> bounces are dominating for some reason that is specific to this machine. Cache
> misses are one possibility that I'll check out. The other is that the sample
> rate is too low and the profile counts are hence misleading.
>
> Question 1: Would it be possible to increase the sample rate and track cache
> misses as well please?
I will try to capture cache miss with oprofile.

>
> Another interesting fact is that we are spending about 15% of the overall
> time is spent in tg_shares_up() for both kernels but the vanilla kernel
> recorded 977348 samples and the patched kernel recorded 514576 samples. We
> are spending less time in the kernel and it's not obvious why or if that is
> a good thing or not. You'd think less time in kernel is good but it might
> mean we are doing less work overall.
>
> Total aside from the page allocator, I checked what we were doing
> in tg_shares_up where the vast amount of time is being spent. This has
> something to do with CONFIG_FAIR_GROUP_SCHED.
>
> Question 2: Scheduler guys, can you think of what it means to be spending
> less time in tg_shares_up please?
>
> I don't know enough of how it works to guess why we are in there. FWIW,
> we are appear to be spending the most time in the following lines
>
> weight = tg->cfs_rq[i]->load.weight;
> if (!weight)
> weight = NICE_0_LOAD;
>
> tg->cfs_rq[i]->rq_weight = weight;
> rq_weight += weight;
> shares += tg->cfs_rq[i]->shares;
>
> So.... cfs_rq is SMP aligned, but we iterate though it with for_each_cpu()
> and we're writing to it. How often is this function run by multiple CPUs? If
> the answer is "lots", does that not mean we are cache line bouncing in
> here like mad? Another crazy amount of time is spent accessing tg->se when
> validating. Basically, any access of the task_group appears to incur huge
> costs and cache line bounces would be the obvious explanation.
FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
We did find it takes lots of time to check/update the share weight which might create
lots of cache ping-pang. With sysbench(oltp)+mysql, that becomes more severe because
mysql runs as user mysql and sysbench runs as another regular user. When starting
the testing with 1 thread in command line, there are 2 mysql threads and 1 sysbench
thread are proactive.

>
> More stupid poking around. We appear to update these share things on each
> fork().
>
> Question 3: Scheduler guys, If the database or clients being used for OLTP is
> fork-based instead of thread-based, then we are going to be balancing a lot,
> right? What does that mean, how can it be avoided?
>
> Question 4: Lin, this is unrelated to the page allocator but do you know
> what the performance difference between vanilla-with-group-sched and
> vanilla-without-group-sched is?
When FAIR_GROUP_SCHED appeared in kernel at the first time, we did many such testing.
There is another thread to discuss it at http://lkml.org/lkml/2008/9/10/214.

set sched_shares_ratelimit to a large value could reduce the regression.

Scheduler guys keep improving it.

>
> The UDP results are screwy as the profiles are not matching up to the
> images. For example
Mostly, it's caused by not cleaning up old oprofile data when starting
new sampling.

I will retry.

>
> oltp.oprofile.2.6.29-rc6: ffffffff802808a0 11022 0.1727 get_page_from_freelist
> oltp.oprofile.2.6.29-rc6-mg-v2: ffffffff80280610 7958 0.2403 get_page_from_freelist
> UDP-U-4K.oprofile.2.6.29-rc6: ffffffff802808a0 29914 1.2866 get_page_from_freelist
> UDP-U-4K.oprofile.2.6.29-rc6-mg-v2: ffffffff802808a0 28153 1.1708 get_page_from_freelist
>
> Look at the addresses. UDP-U-4K.oprofile.2.6.29-rc6-mg-v2 has the address
> for UDP-U-4K.oprofile.2.6.29-rc6 so I have no idea what I'm looking at here
> for the patched kernel :(.
>
> Question 5: Lin, would it be possible to get whatever script you use for
> running netperf so I can try reproducing it?
Below is a simple script. As for formal testing, we add parameter "-i 50,3 -I" 99,5"
to get a more stable result.

PROG_DIR=/home/ymzhang/test/netperf/src
taskset -c 0 ${PROG_DIR}/netserver
sleep 2
taskset -c 7 ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
killall netserver

Basically, we start 1 client and bind client/server to different physical cpu.

>
> Going by the vanilla kernel, a *large* amount of time is spent doing
> high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> pages are required for the packets? That means high-order allocations and
> high contention on the zone-list. That is bad obviously and has implications
> for the SLUB-passthru patch because whether 8K allocations are handled by
> SL*B or the page allocator has a big impact on locking.
>
> Next, a little over 50% of the cost get_page_from_freelist() is being spent
> acquiring the zone spinlock. The implication is that the SL*B allocators
> passing in order-1 allocations to the page allocator are currently going to
> hit scalability problems in a big way. The solution may be to extend the
> per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> check it out.
>

2009-03-04 07:24:27

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Wed, 2009-03-04 at 10:05 +0800, Zhang, Yanmin wrote:
> FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
> We did find it takes lots of time to check/update the share weight which might create
> lots of cache ping-pang. With sysbench(oltp)+mysql, that becomes more severe because
> mysql runs as user mysql and sysbench runs as another regular user. When starting
> the testing with 1 thread in command line, there are 2 mysql threads and 1 sysbench
> thread are proactive.

cgroup based group scheduling doesn't bother with users. So unless you
create sched-cgroups your should all be in the same (root) group.

2009-03-04 08:32:27

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Wed, 2009-03-04 at 08:23 +0100, Peter Zijlstra wrote:
> On Wed, 2009-03-04 at 10:05 +0800, Zhang, Yanmin wrote:
> > FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
> > We did find it takes lots of time to check/update the share weight which might create
> > lots of cache ping-pang. With sysbench(oltp)+mysql, that becomes more severe because
> > mysql runs as user mysql and sysbench runs as another regular user. When starting
> > the testing with 1 thread in command line, there are 2 mysql threads and 1 sysbench
> > thread are proactive.
>
> cgroup based group scheduling doesn't bother with users. So unless you
> create sched-cgroups your should all be in the same (root) group.

I disable CGROUP, but enable GROUP_SCHED and USER_SCHED. My config inherits from old config
files.

CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set

I check defconfig on x86-64 of 2.6.28 and it does enable CGROUP and disable USER_SCHED.

Perhaps I need change my latest config file to the default on sched options.

2009-03-04 09:08:08

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

2009-03-04 18:04:42

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> >
> > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > can see how time is being spent and why it might have gotten worse?
> > >
> > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > patches applied to 2.6.29-rc6.
> > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > line with addr2line.
> > >
> > > You can download the oprofile data and vmlinux from below link,
> > > http://www.filefactory.com/file/af2330b/
> > >
> >
> > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > how the allocator is actually being used for your workloads.
> >
> > The OLTP results had the following things to say about the page allocator.
>
> In case we might mislead you guys, I want to clarify that here OLTP is
> sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> memory.
>

Ah good. I'm testing with sysbench+postgres and I've seen similar
regressions on some machines so I have something to investigate.

> Ma Chinang, another Intel guy, does work on the famous OLTP running.
>

Good to know. It's too early to test remotely near there but when this
is ready for merging a run on that setup would be really nice time was
available.

> > <SNIP>
> > Question 1: Would it be possible to increase the sample rate and track cache
> > misses as well please?
>
> I will try to capture cache miss with oprofile.
>

Great, thanks. I did a cache miss capture for one of the machines and
noted cache misses increased but it'd still good to know.

> > Another interesting fact is that we are spending about 15% of the overall
> > time is spent in tg_shares_up() for both kernels but the vanilla kernel
> > recorded 977348 samples and the patched kernel recorded 514576 samples. We
> > are spending less time in the kernel and it's not obvious why or if that is
> > a good thing or not. You'd think less time in kernel is good but it might
> > mean we are doing less work overall.
> >
> > Total aside from the page allocator, I checked what we were doing
> > in tg_shares_up where the vast amount of time is being spent. This has
> > something to do with CONFIG_FAIR_GROUP_SCHED.
> >
> > Question 2: Scheduler guys, can you think of what it means to be spending
> > less time in tg_shares_up please?
> >
> > I don't know enough of how it works to guess why we are in there. FWIW,
> > we are appear to be spending the most time in the following lines
> >
> > weight = tg->cfs_rq[i]->load.weight;
> > if (!weight)
> > weight = NICE_0_LOAD;
> >
> > tg->cfs_rq[i]->rq_weight = weight;
> > rq_weight += weight;
> > shares += tg->cfs_rq[i]->shares;
> >
> > So.... cfs_rq is SMP aligned, but we iterate though it with for_each_cpu()
> > and we're writing to it. How often is this function run by multiple CPUs? If
> > the answer is "lots", does that not mean we are cache line bouncing in
> > here like mad? Another crazy amount of time is spent accessing tg->se when
> > validating. Basically, any access of the task_group appears to incur huge
> > costs and cache line bounces would be the obvious explanation.
>
> ???FAIR_GROUP_SCHED is a feature to support configurable cpu weight for different users.
> We did find it takes lots of time to check/update the share weight which might create
> lots of cache ping-pang. With sysbench(oltp)+mysql, that becomes more severe because
> mysql runs as user mysql and sysbench runs as another regular user. When starting
> the testing with 1 thread in command line, there are 2 mysql threads and 1 sysbench
> thread are proactive.
>

Very interesting, I don't think this will affect the page allocator but
I'll keep it in mind when worrying about the workload as a whole instead
of just one corner of it.

> >
> >
> > More stupid poking around. We appear to update these share things on each
> > fork().
> >
> > Question 3: Scheduler guys, If the database or clients being used for OLTP is
> > fork-based instead of thread-based, then we are going to be balancing a lot,
> > right? What does that mean, how can it be avoided?
> >
> > Question 4: Lin, this is unrelated to the page allocator but do you know
> > what the performance difference between vanilla-with-group-sched and
> > vanilla-without-group-sched is?
>
> When ???FAIR_GROUP_SCHED appeared in kernel at the first time, we did many such testing.
> There is another thread to discuss it at http://lkml.org/lkml/2008/9/10/214.
>
> set s???ched_shares_ratelimit to a large value could reduce the regression.
>
> Scheduler guys keep improving it.
>

Good to know. I haven't read the thread yet but it's now on my TODO
list.

> > The UDP results are screwy as the profiles are not matching up to the
> > images. For example
> Mostly, it's caused by not cleaning up old oprofile data when starting
> new sampling.
>
> I will retry.
>

Thanks
> >
> > oltp.oprofile.2.6.29-rc6: ffffffff802808a0 11022 0.1727 get_page_from_freelist
> > oltp.oprofile.2.6.29-rc6-mg-v2: ffffffff80280610 7958 0.2403 get_page_from_freelist
> > UDP-U-4K.oprofile.2.6.29-rc6: ffffffff802808a0 29914 1.2866 get_page_from_freelist
> > UDP-U-4K.oprofile.2.6.29-rc6-mg-v2: ffffffff802808a0 28153 1.1708 get_page_from_freelist
> >
> > Look at the addresses. UDP-U-4K.oprofile.2.6.29-rc6-mg-v2 has the address
> > for UDP-U-4K.oprofile.2.6.29-rc6 so I have no idea what I'm looking at here
> > for the patched kernel :(.
> >
> > Question 5: Lin, would it be possible to get whatever script you use for
> > running netperf so I can try reproducing it?

> Below is a simple script. As for formal testing, we add parameter "-i 50,3 -I" 99,5"
> to get a more stable result.
>
> PROG_DIR=/home/ymzhang/test/netperf/src
> taskset -c 0 ${PROG_DIR}/netserver
> sleep 2
> taskset -c 7 ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> killall netserver
>

Thanks, simple is good enough to start with. Just have to get around to
wrapping the automation around it.

> Basically, we start 1 client and bind client/server to different physical cpu.
>
> >
> > Going by the vanilla kernel, a *large* amount of time is spent doing
> > high-order allocations. Over 25% of the cost of buffered_rmqueue() is in
> > the branch dealing with high-order allocations. Does UDP-U-4K mean that 8K
> > pages are required for the packets? That means high-order allocations and
> > high contention on the zone-list. That is bad obviously and has implications
> > for the SLUB-passthru patch because whether 8K allocations are handled by
> > SL*B or the page allocator has a big impact on locking.
> >
> > Next, a little over 50% of the cost get_page_from_freelist() is being spent
> > acquiring the zone spinlock. The implication is that the SL*B allocators
> > passing in order-1 allocations to the page allocator are currently going to
> > hit scalability problems in a big way. The solution may be to extend the
> > per-cpu allocator to handle magazines up to PAGE_ALLOC_COSTLY_ORDER. I'll
> > check it out.
> >
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-05 01:57:19

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > >
> > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > can see how time is being spent and why it might have gotten worse?
> > > >
> > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > patches applied to 2.6.29-rc6.
> > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > line with addr2line.
> > > >
> > > > You can download the oprofile data and vmlinux from below link,
> > > > http://www.filefactory.com/file/af2330b/
> > > >
> > >
> > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > how the allocator is actually being used for your workloads.
> > >
> > > The OLTP results had the following things to say about the page allocator.
> > In case we might mislead you guys, I want to clarify that here OLTP is
> > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > memory.
> >
> > Ma Chinang, another Intel guy, does work on the famous OLTP running.
>
> OK, so my comments WRT cache sensitivity probably don't apply here,
> but probably cache hotness of pages coming out of the allocator
> might still be important for this one.
Yes. We need check it.

>
> How many runs are you doing of these tests?
We start sysbench with different thread number, for example, 8 12 16 32 64 128 for
4*4 tigerton, then get an average value in case there might be a scalability issue.

As for this sysbench oltp testing, we reran it for 7 times on tigerton this week and
found the results have fluctuations. Now we could only say there is a trend that
the result with the pathces is a little worse than the one without the patches.

> Do you have a fairly high
> confidence that the changes are significant?
2% isn't significant on sysbench oltp.

yanmin

2009-03-05 10:34:56

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

* Zhang, Yanmin <[email protected]> wrote:

> On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> > On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > > >
> > > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > > can see how time is being spent and why it might have gotten worse?
> > > > >
> > > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > > patches applied to 2.6.29-rc6.
> > > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > > line with addr2line.
> > > > >
> > > > > You can download the oprofile data and vmlinux from below link,
> > > > > http://www.filefactory.com/file/af2330b/
> > > > >
> > > >
> > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > how the allocator is actually being used for your workloads.
> > > >
> > > > The OLTP results had the following things to say about the page allocator.
> > > In case we might mislead you guys, I want to clarify that here OLTP is
> > > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > > memory.
> > >
> > > Ma Chinang, another Intel guy, does work on the famous OLTP running.
> >
> > OK, so my comments WRT cache sensitivity probably don't apply here,
> > but probably cache hotness of pages coming out of the allocator
> > might still be important for this one.
> Yes. We need check it.
>
> >
> > How many runs are you doing of these tests?
> We start sysbench with different thread number, for example, 8 12 16 32 64 128 for
> 4*4 tigerton, then get an average value in case there might be a scalability issue.
>
> As for this sysbench oltp testing, we reran it for 7 times on
> tigerton this week and found the results have fluctuations.
> Now we could only say there is a trend that the result with
> the pathces is a little worse than the one without the
> patches.

Could you try "perfstat -s" perhaps and see whether any other of
the metrics outside of tx/sec has less natural noise?

I think a more invariant number might be the ratio of "LLC
cachemisses" divided by "CPU migrations".

The fluctuation in tx/sec comes from threads bouncing - but you
can normalize that away by using the cachemisses/migrations
ration.

Perhaps. It's definitely a difficult thing to measure.

Ingo

2009-03-06 08:39:20

by Lin Ming

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Thu, 2009-03-05 at 18:34 +0800, Ingo Molnar wrote:
> * Zhang, Yanmin <[email protected]> wrote:
>
> > On Wed, 2009-03-04 at 10:07 +0100, Nick Piggin wrote:
> > > On Wed, Mar 04, 2009 at 10:05:07AM +0800, Zhang, Yanmin wrote:
> > > > On Mon, 2009-03-02 at 11:21 +0000, Mel Gorman wrote:
> > > > > (Added Ingo as a second scheduler guy as there are queries on tg_shares_up)
> > > > >
> > > > > On Fri, Feb 27, 2009 at 04:44:43PM +0800, Lin Ming wrote:
> > > > > > On Thu, 2009-02-26 at 19:22 +0800, Mel Gorman wrote:
> > > > > > > In that case, Lin, could I also get the profiles for UDP-U-4K please so I
> > > > > > > can see how time is being spent and why it might have gotten worse?
> > > > > >
> > > > > > I have done the profiling (oltp and UDP-U-4K) with and without your v2
> > > > > > patches applied to 2.6.29-rc6.
> > > > > > I also enabled CONFIG_DEBUG_INFO so you can translate address to source
> > > > > > line with addr2line.
> > > > > >
> > > > > > You can download the oprofile data and vmlinux from below link,
> > > > > > http://www.filefactory.com/file/af2330b/
> > > > > >
> > > > >
> > > > > Perfect, thanks a lot for profiling this. It is a big help in figuring out
> > > > > how the allocator is actually being used for your workloads.
> > > > >
> > > > > The OLTP results had the following things to say about the page allocator.
> > > > In case we might mislead you guys, I want to clarify that here OLTP is
> > > > sysbench (oltp)+mysql, not the famous OLTP which needs lots of disks and big
> > > > memory.
> > > >
> > > > Ma Chinang, another Intel guy, does work on the famous OLTP running.
> > >
> > > OK, so my comments WRT cache sensitivity probably don't apply here,
> > > but probably cache hotness of pages coming out of the allocator
> > > might still be important for this one.
> > Yes. We need check it.
> >
> > >
> > > How many runs are you doing of these tests?
> > We start sysbench with different thread number, for example, 8 12 16 32 64 128 for
> > 4*4 tigerton, then get an average value in case there might be a scalability issue.
> >
> > As for this sysbench oltp testing, we reran it for 7 times on
> > tigerton this week and found the results have fluctuations.
> > Now we could only say there is a trend that the result with
> > the pathces is a little worse than the one without the
> > patches.
>
> Could you try "perfstat -s" perhaps and see whether any other of
> the metrics outside of tx/sec has less natural noise?

Thanks, I have used "perfstat -s" to collect cache misses data.

2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core

I collected 5 times netperf UDP-U-4k data with and without mg-v2 patches
applied to tip/perfcounters/core on a 4p quad-core tigerton machine, as
below
"value" means UDP-U-4k test result.

2.6.29-rc7-tip
---------------
value cache misses CPU migrations cachemisses/migrations
5329.71 391094656 1710 228710
5641.59 239552767 2138 112045
5580.87 132474745 2172 60992
5547.19 86911457 2099 41406
5626.38 196751217 2050 95976

2.6.29-rc7-tip-mg2
-------------------
value cache misses CPU migrations cachemisses/migrations
4749.80 649929463 1132 574142
4327.06 484100170 1252 386661
4649.51 374201508 1489 251310
5655.82 405511551 1848 219432
5571.58 90222256 2159 41788

Lin Ming

>
> I think a more invariant number might be the ratio of "LLC
> cachemisses" divided by "CPU migrations".
>
> The fluctuation in tx/sec comes from threads bouncing - but you
> can normalize that away by using the cachemisses/migrations
> ration.
>
> Perhaps. It's definitely a difficult thing to measure.
>
> Ingo

2009-03-06 09:39:56

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

* Lin Ming <[email protected]> wrote:

> Thanks, I have used "perfstat -s" to collect cache misses
> data.
>
> 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
>
> I collected 5 times netperf UDP-U-4k data with and without
> mg-v2 patches applied to tip/perfcounters/core on a 4p
> quad-core tigerton machine, as below "value" means UDP-U-4k
> test result.
>
> 2.6.29-rc7-tip
> ---------------
> value cache misses CPU migrations cachemisses/migrations
> 5329.71 391094656 1710 228710
> 5641.59 239552767 2138 112045
> 5580.87 132474745 2172 60992
> 5547.19 86911457 2099 41406
> 5626.38 196751217 2050 95976
>
> 2.6.29-rc7-tip-mg2
> -------------------
> value cache misses CPU migrations cachemisses/migrations
> 4749.80 649929463 1132 574142
> 4327.06 484100170 1252 386661
> 4649.51 374201508 1489 251310
> 5655.82 405511551 1848 219432
> 5571.58 90222256 2159 41788
>
> Lin Ming

Hm, these numbers look really interesting and give us insight
into this workload. The workload is fluctuating but by measuring
3 metrics at once instead of just one we see the following
patterns:

- Less CPU migrations means more cache misses and less
performance.

The lowest-score runs had the lowest CPU migrations count,
coupled with a high amount of cachemisses.

This _probably_ means that in this workload migrations are
desired: the sooner two related tasks migrate to the same CPU
the better. If they stay separate (migration count is low) then
they interact with each other from different CPUs, creating a
lot of cachemisses and reducing performance.

You can reduce the migration barrier of the system by enabling
CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:

echo 0 > /proc/sys/kernel/sched_migration_cost

This will hurt other workloads - but if this improves the
numbers then it proves that what this particular workload wants
is easy migrations.

Now the question is, why does the mg2 patchset reduce the number
of migrations? It might not be an inherent property of the mg2
patches: maybe just unlucky timings push the workload across
sched_migration_cost.

Setting sched_migration_cost to either zero or to a very high
value and repeating the test will eliminate this source of noise
and will tell us about other properties of the mg2 patchset.

There might be other effects i'm missing. For example what kind
of UDP transport is used - localhost networking? That means that
sender and receiver really wants to be coupled strongly and what
controls this workload is whether such a 'pair' of tasks can
properly migrate to the same CPU.

Ingo

2009-03-06 13:03:32

[permalink] [raw]

Subject: Re: [RFC PATCH 00/19] Cleanup and optimise the page allocator V2

On Fri, Mar 06, 2009 at 10:39:18AM +0100, Ingo Molnar wrote:
>
> * Lin Ming <[email protected]> wrote:
>
> > Thanks, I have used "perfstat -s" to collect cache misses
> > data.
> >
> > 2.6.29-rc7-tip: tip/perfcounters/core (b5e8acf)
> > 2.6.29-rc7-tip-mg2: v2 patches applied to tip/perfcounters/core
> >
> > I collected 5 times netperf UDP-U-4k data with and without
> > mg-v2 patches applied to tip/perfcounters/core on a 4p
> > quad-core tigerton machine, as below "value" means UDP-U-4k
> > test result.
> >
> > 2.6.29-rc7-tip
> > ---------------
> > value cache misses CPU migrations cachemisses/migrations
> > 5329.71 391094656 1710 228710
> > 5641.59 239552767 2138 112045
> > 5580.87 132474745 2172 60992
> > 5547.19 86911457 2099 41406
> > 5626.38 196751217 2050 95976
> >
> > 2.6.29-rc7-tip-mg2
> > -------------------
> > value cache misses CPU migrations cachemisses/migrations
> > 4749.80 649929463 1132 574142
> > 4327.06 484100170 1252 386661
> > 4649.51 374201508 1489 251310
> > 5655.82 405511551 1848 219432
> > 5571.58 90222256 2159 41788
> >
> > Lin Ming
>
> Hm, these numbers look really interesting and give us insight
> into this workload. The workload is fluctuating but by measuring
> 3 metrics at once instead of just one we see the following
> patterns:
>
> - Less CPU migrations means more cache misses and less
> performance.
>

I also happen to know that V2 was cache unfriendly in a number of
respects. I've been trying to address it in V3 but still the netperf
performance in general is being very tricky even though profiles tell me
the page allocator is lighter and incurring fewer cache misses.

(aside, thanks for saying how you were running netperf. It allowed me to
take shortcuts writing the automation as I knew what parameters to use)

Here is the results from one x86-64 machine running an unreleased version
of the patchset

Netperf UDP_STREAM Comparison
----------------------------
clean opt-palloc diff
UDP_STREAM-64 68.63 73.15 6.18%
UDP_STREAM-128 149.77 144.33 -3.77%
UDP_STREAM-256 264.06 280.18 5.75%
UDP_STREAM-1024 1037.81 1058.61 1.96%
UDP_STREAM-2048 1790.33 1906.53 6.09%
UDP_STREAM-3312 2671.34 2744.38 2.66%
UDP_STREAM-4096 2722.92 2910.65 6.45%
UDP_STREAM-8192 4280.14 4314.00 0.78%
UDP_STREAM-16384 5384.13 5606.83 3.97%
Netperf TCP_STREAM Comparison
----------------------------
clean opt-palloc diff
TCP_STREAM-64 180.09 204.59 11.98%
TCP_STREAM-128 297.45 812.22 63.38%
TCP_STREAM-256 1315.20 1432.74 8.20%
TCP_STREAM-1024 2544.73 3043.22 16.38%
TCP_STREAM-2048 4157.76 4351.28 4.45%
TCP_STREAM-3312 4254.53 4790.56 11.19%
TCP_STREAM-4096 4773.22 4932.61 3.23%
TCP_STREAM-8192 4937.03 5453.58 9.47%
TCP_STREAM-16384 6003.46 6183.74 2.92%

WOooo, more or less awesome. Then here are the results of a second x86-64
machine

Netperf UDP_STREAM Comparison
----------------------------
clean opt-palloc diff
UDP_STREAM-64 106.50 106.98 0.45%
UDP_STREAM-128 216.39 212.48 -1.84%
UDP_STREAM-256 425.29 419.12 -1.47%
UDP_STREAM-1024 1433.21 1449.20 1.10%
UDP_STREAM-2048 2569.67 2503.73 -2.63%
UDP_STREAM-3312 3685.30 3603.15 -2.28%
UDP_STREAM-4096 4019.05 4252.53 5.49%
UDP_STREAM-8192 6278.44 6315.58 0.59%
UDP_STREAM-16384 7389.78 7162.91 -3.17%
Netperf TCP_STREAM Comparison
----------------------------
clean opt-palloc diff
TCP_STREAM-64 694.90 674.47 -3.03%
TCP_STREAM-128 1160.13 1159.26 -0.08%
TCP_STREAM-256 2016.35 2018.03 0.08%
TCP_STREAM-1024 4619.41 4562.86 -1.24%
TCP_STREAM-2048 5001.08 5096.51 1.87%
TCP_STREAM-3312 5235.22 5276.18 0.78%
TCP_STREAM-4096 5832.15 5844.42 0.21%
TCP_STREAM-8192 6247.71 6287.93 0.64%
TCP_STREAM-16384 7987.68 7896.17 -1.16%

Much less awesome and the cause of much frowny face and contemplation as to
whether I'd be much better off hitting the bar for a tasty beverage or 10.

I'm trying to pin down why there are such large differences between machines
but it's something with the machine themselves as the results between runs
is fairly consistent. Annoyingly, the second machine showed good results
for kernbench (allocator heavy), sysbench (not allocator heavy), was more
or less the same for hackbench but regressed tbench and netperf even though
the page allocator overhead was less. I'm doing something screwy with cache
but don't know what it is yet.

netperf is being run on different CPUs and is possibly maximising the amount
of cache bounces incurred by the page allocator as it splits and merges
buddies. I'm experimenting with the idea of delaying bulk PCP frees but it's
also possible the network layer is having trouble with cache line bounces when
the workload is run over localhost and my modifications are changing timings.

> The lowest-score runs had the lowest CPU migrations count,
> coupled with a high amount of cachemisses.
>
> This _probably_ means that in this workload migrations are
> desired: the sooner two related tasks migrate to the same CPU
> the better. If they stay separate (migration count is low) then
> they interact with each other from different CPUs, creating a
> lot of cachemisses and reducing performance.
>
> You can reduce the migration barrier of the system by enabling
> CONFIG_SCHED_DEBUG=y and setting sched_migration_cost to zero:
>
> echo 0 > /proc/sys/kernel/sched_migration_cost
>
> This will hurt other workloads - but if this improves the
> numbers then it proves that what this particular workload wants
> is easy migrations.
>
> Now the question is, why does the mg2 patchset reduce the number
> of migrations? It might not be an inherent property of the mg2
> patches: maybe just unlucky timings push the workload across
> sched_migration_cost.
>
> Setting sched_migration_cost to either zero or to a very high
> value and repeating the test will eliminate this source of noise
> and will tell us about other properties of the mg2 patchset.
>
> There might be other effects i'm missing. For example what kind
> of UDP transport is used - localhost networking? That means that
> sender and receiver really wants to be coupled strongly and what
> controls this workload is whether such a 'pair' of tasks can
> properly migrate to the same CPU.
>
> Ingo
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-09 01:50:50