2014-01-08 13:53:05

by Mel Gorman

Subject: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

Adding LKML to the list as this -stable snifftest has identified an
upstream regression.

On Wed, Jan 08, 2014 at 10:43:40AM +0000, Mel Gorman wrote:
> On Tue, Jan 07, 2014 at 08:30:12PM +0000, Mel Gorman wrote:
> > On Tue, Jan 07, 2014 at 10:54:40AM -0800, Greg KH wrote:
> > > On Tue, Jan 07, 2014 at 06:17:15AM -0800, Greg KH wrote:
> > > > On Tue, Jan 07, 2014 at 02:00:35PM +0000, Mel Gorman wrote:
> > > > > A number of NUMA balancing patches were tagged for -stable but I got a
> > > > > number of rejected mails from either Greg or his robot minion. The list
> > > > > of relevant patches is
> > > > >
> > > > > FAILED: patch "[PATCH] mm: numa: serialise parallel get_user_page against THP"
> > > > > FAILED: patch "[PATCH] mm: numa: call MMU notifiers on THP migration"
> > > > > MERGED: Patch "mm: clear pmd_numa before invalidating"
> > > > > FAILED: patch "[PATCH] mm: numa: do not clear PMD during PTE update scan"
> > > > > FAILED: patch "[PATCH] mm: numa: do not clear PTE for pte_numa update"
> > > > > MERGED: Patch "mm: numa: ensure anon_vma is locked to prevent parallel THP splits"
> > > > > MERGED: Patch "mm: numa: avoid unnecessary work on the failure path"
> > > > > MERGED: Patch "sched: numa: skip inaccessible VMAs"
> > > > > FAILED: patch "[PATCH] mm: numa: clear numa hinting information on mprotect"
> > > > > FAILED: patch "[PATCH] mm: numa: avoid unnecessary disruption of NUMA hinting during"
> > > > > Patch "mm: fix TLB flush race between migration, and change_protection_range"
> > > > > Patch "mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates"
> > > > > FAILED: patch "[PATCH] mm: numa: defer TLB flush for THP migration as long as"
> > > > >
> > > > > Fixing the rejects one at a time may cause other conflicts due to ordering
> > > > > issues. Instead, this patch series against 3.12.6 is the full list of
> > > > > backported patches in the expected order. Greg, unfortunately this means
> > > > > you may have to drop some patches already in your stable tree and reapply
> > > > > but on the plus side they should be then in the correct order for bisection
> > > > > purposes and you'll know I've tested this combination of patches.
> > > >
> > > > Many thanks for these, I'll go queue them up in a bit and drop the
> > > > others to ensure I got all of this correct.
> > >
> > > Ok, I've now queued all of these up, in this order, so we should be
> > > good.
> > >
> > > I'll do a -rc2 in a bit as it needs some testing.
> > >
> >
> > Thanks a million. I should be cc'd on some of those so I'll pick up the
> > final result and run it through the same tests just to be sure.
> >
>
> Ok, tests completed and look more or less as expected. This is not to
> say the performance results are *good* as such. Workloads that normally
> demonstrate automatic numa balancing suffered because of other patches that
> were merged (primarily fair zone allocation policy) that had interesting
> side-effects. However, it now does not crash under heavy stress and I
> prefer working a little slowly than crashing fast. NAS at least looks
> better.
>
> Other workloads like kernel builds, page fault microbench looked good as
> expected from the fair zone allocation policy fixes.
>
> Big downside is that ebizzy performance is *destroyed* in that RC2 patch
> somewhere
>
> ebizzy
> 3.12.6 3.12.6 3.12.7-rc2
> vanilla backport-v1r2 stablerc2
> Mean 1 3278.67 ( 0.00%) 3180.67 ( -2.99%) 3212.00 ( -2.03%)
> Mean 2 2322.67 ( 0.00%) 2294.67 ( -1.21%) 1839.00 (-20.82%)
> Mean 3 2257.00 ( 0.00%) 2218.67 ( -1.70%) 1664.00 (-26.27%)
> Mean 4 2268.00 ( 0.00%) 2224.67 ( -1.91%) 1629.67 (-28.15%)
> Mean 5 2247.67 ( 0.00%) 2255.67 ( 0.36%) 1582.33 (-29.60%)
> Mean 6 2263.33 ( 0.00%) 2251.33 ( -0.53%) 1547.67 (-31.62%)
> Mean 7 2273.67 ( 0.00%) 2222.67 ( -2.24%) 1545.67 (-32.02%)
> Mean 8 2254.67 ( 0.00%) 2232.33 ( -0.99%) 1535.33 (-31.90%)
> Mean 12 2237.67 ( 0.00%) 2266.33 ( 1.28%) 1543.33 (-31.03%)
> Mean 16 2201.33 ( 0.00%) 2252.67 ( 2.33%) 1540.33 (-30.03%)
> Mean 20 2205.67 ( 0.00%) 2229.33 ( 1.07%) 1537.33 (-30.30%)
> Mean 24 2162.33 ( 0.00%) 2168.67 ( 0.29%) 1535.33 (-29.00%)
> Mean 28 2139.33 ( 0.00%) 2107.67 ( -1.48%) 1535.00 (-28.25%)
> Mean 32 2084.67 ( 0.00%) 2089.00 ( 0.21%) 1537.33 (-26.26%)
> Mean 36 2002.00 ( 0.00%) 2020.00 ( 0.90%) 1530.33 (-23.56%)
> Mean 40 1972.67 ( 0.00%) 1978.67 ( 0.30%) 1530.33 (-22.42%)
> Mean 44 1951.00 ( 0.00%) 1953.67 ( 0.14%) 1531.00 (-21.53%)
> Mean 48 1931.67 ( 0.00%) 1930.67 ( -0.05%) 1526.67 (-20.97%)
>
> Figures are records/sec, more is better for increasing numbers of threads
> up to 48 which is the number of logical CPUs in the machine. Three kernels
> tested
>
> 3.12.6 is self-explanatory
> backport-v1r2 is the backported series I sent you
> stablerc2 is the rc2 patch I pulled from kernel.org
>
> I'm not that familiar with the stable workflow but stable-queue.git looked
> like it had the correct quilt tree so bisection is in progress. If I had
> to bet money on it, I'd bet it's going to be scheduler or power management
> related mostly because problems in both of those areas have tended to
> screw ebizzy recently.
>

I was not far off. Bisection identified the following commit

3d97ea0816589c818ac62fb401e61c3b6a59f351 is the first bad commit
commit 3d97ea0816589c818ac62fb401e61c3b6a59f351
Author: Len Brown <[email protected]>
Date: Wed Dec 18 16:44:57 2013 -0500

x86 idle: Repair large-server 50-watt idle-power regression

commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream.

Linux 3.10 changed the timing of how thread_info->flags is touched:

x86: Use generic idle loop
(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.

Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.

Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.

While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.

Link: http://lkml.kernel.org/r/CAJvTdK=%[email protected]
Signed-off-by: Len Brown <[email protected]>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Signed-off-by: H. Peter Anvin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
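
For context, the change in question boils down to issuing CLFLUSH on the
monitored cache line before arming MONITOR/MWAIT in the idle entry path,
gated on a feature bit set for the affected models. The sketch below is a
simplified paraphrase of that path in intel_idle.c, not a copy of the
patch, so treat the function name and surrounding structure as approximate.

/*
 * Approximate sketch of the MWAIT idle entry path the patch modifies.
 * The new behaviour is the clflush() of the monitored line (the task's
 * thread_info->flags) before MONITOR, applied only when the CPU model
 * has X86_FEATURE_CLFLUSH_MONITOR set, to work around the errata that
 * cause immediate MWAIT break wakeups on NHM-EX/WSM-EX.
 */
#include <linux/sched.h>
#include <linux/thread_info.h>
#include <asm/mwait.h>
#include <asm/cpufeature.h>
#include <asm/special_insns.h>

static void mwait_idle_sketch(unsigned long eax, unsigned long ecx)
{
	if (need_resched())
		return;

	/* Errata workaround: flush the monitored line before MONITOR */
	if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
		clflush((void *)&current_thread_info()->flags);

	__monitor((void *)&current_thread_info()->flags, 0, 0);
	smp_mb();
	if (!need_resched())
		__mwait(eax, ecx);
}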

Len, HPA, the x86 idle regression fix fubars ebizzy as a consequence and I
don't know why. I know the workload is not that important (and I expected
ebizzy to be unaffected in this test) but it is probably indicative of
other performance regressions hiding in there. It was caught via -stable
testing by accident but I checked and upstream is also affected. This is
a snippet from the bisection log

Wed 8 Jan 09:53:59 GMT 2014 compass ebizzy v3.12.6 mean-4:2317 good
Wed 8 Jan 10:13:04 GMT 2014 compass ebizzy v3.12.7-rc2 mean-4:1631 bad
Wed 8 Jan 10:27:45 GMT 2014 compass ebizzy a202b4808e500f4fd53b6cec150c8fe214c70183 mean-4:1620 bad
Wed 8 Jan 10:41:36 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2290 good
Wed 8 Jan 10:55:14 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2266 good
Wed 8 Jan 11:09:04 GMT 2014 compass ebizzy c62a6f8a28bf8897ba0903cf332d761c1132e48d mean-4:1624 bad
Wed 8 Jan 11:22:46 GMT 2014 compass ebizzy 346679aad15c3608844f6b433b8d8ba56ad03802 mean-4:2280 good
Wed 8 Jan 11:36:32 GMT 2014 compass ebizzy 36b9512dc19b535d72c1035048a95ec1c765d403 mean-4:1641 bad
Wed 8 Jan 11:50:22 GMT 2014 compass ebizzy 1a82fc9ab8bb6b4a5ee5cd32d570d6ff0b77efb2 mean-4:1627 bad
Wed 8 Jan 12:04:15 GMT 2014 compass ebizzy 3d97ea0816589c818ac62fb401e61c3b6a59f351 mean-4:1619 bad
Wed 8 Jan 13:10:03 GMT 2014 compass ebizzy v3.13-rc7 mean-4:1619 bad
Wed 8 Jan 13:39:19 GMT 2014 compass ebizzy v3.12.7-rc2-revert mean-4:2276 good

mean-4 figures are records/sec as recorded by the bisection test. The
bisection points are based on the -stable quilt tree so the commit ids are
meaningless but you can see good/bad figures are relatively stable leading
me to conclude the bisection is valid.

v3.12.6 was 2317 records/second and considered "good". The 3.12.7-rc2
stable candidate and 3.13-rc7 are both "bad". Reverting the single patch
from v3.12.7-rc2 restores performance.

Greg, this does not affect your -stable release as such because upstream is
also affected. If you release with the patch merged then the upstream fix
(whatever that is) will also need to be included in -stable later. If you
release without the patch then both upstream fixes will be later required
and some Intel machines will continue to consume excessive amounts of
power in the meantime.

--
Mel Gorman
SUSE Labs


2014-01-09 04:16:44

by Greg Kroah-Hartman

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Wed, Jan 08, 2014 at 01:48:58PM +0000, Mel Gorman wrote:
> Adding LKML to the list as this -stable snifftest has identified an
> upstream regression.
>
> On Wed, Jan 08, 2014 at 10:43:40AM +0000, Mel Gorman wrote:
> > On Tue, Jan 07, 2014 at 08:30:12PM +0000, Mel Gorman wrote:
> > > On Tue, Jan 07, 2014 at 10:54:40AM -0800, Greg KH wrote:
> > > > On Tue, Jan 07, 2014 at 06:17:15AM -0800, Greg KH wrote:
> > > > > On Tue, Jan 07, 2014 at 02:00:35PM +0000, Mel Gorman wrote:
> > > > > > A number of NUMA balancing patches were tagged for -stable but I got a
> > > > > > number of rejected mails from either Greg or his robot minion. The list
> > > > > > of relevant patches is
> > > > > >
> > > > > > FAILED: patch "[PATCH] mm: numa: serialise parallel get_user_page against THP"
> > > > > > FAILED: patch "[PATCH] mm: numa: call MMU notifiers on THP migration"
> > > > > > MERGED: Patch "mm: clear pmd_numa before invalidating"
> > > > > > FAILED: patch "[PATCH] mm: numa: do not clear PMD during PTE update scan"
> > > > > > FAILED: patch "[PATCH] mm: numa: do not clear PTE for pte_numa update"
> > > > > > MERGED: Patch "mm: numa: ensure anon_vma is locked to prevent parallel THP splits"
> > > > > > MERGED: Patch "mm: numa: avoid unnecessary work on the failure path"
> > > > > > MERGED: Patch "sched: numa: skip inaccessible VMAs"
> > > > > > FAILED: patch "[PATCH] mm: numa: clear numa hinting information on mprotect"
> > > > > > FAILED: patch "[PATCH] mm: numa: avoid unnecessary disruption of NUMA hinting during"
> > > > > > Patch "mm: fix TLB flush race between migration, and change_protection_range"
> > > > > > Patch "mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates"
> > > > > > FAILED: patch "[PATCH] mm: numa: defer TLB flush for THP migration as long as"
> > > > > >
> > > > > > Fixing the rejects one at a time may cause other conflicts due to ordering
> > > > > > issues. Instead, this patch series against 3.12.6 is the full list of
> > > > > > backported patches in the expected order. Greg, unfortunately this means
> > > > > > you may have to drop some patches already in your stable tree and reapply
> > > > > > but on the plus side they should be then in the correct order for bisection
> > > > > > purposes and you'll know I've tested this combination of patches.
> > > > >
> > > > > Many thanks for these, I'll go queue them up in a bit and drop the
> > > > > others to ensure I got all of this correct.
> > > >
> > > > Ok, I've now queued all of these up, in this order, so we should be
> > > > good.
> > > >
> > > > I'll do a -rc2 in a bit as it needs some testing.
> > > >
> > >
> > > Thanks a million. I should be cc'd on some of those so I'll pick up the
> > > final result and run it through the same tests just to be sure.
> > >
> >
> > Ok, tests completed and look more or less as expected. This is not to
> > say the performance results are *good* as such. Workloads that normally
> > demonstrate automatic numa balancing suffered because of other patches that
> > were merged (primarily fair zone allocation policy) that had interesting
> > side-effects. However, it now does not crash under heavy stress and I
> > prefer working a little slowly than crashing fast. NAS at least looks
> > better.
> >
> > Other workloads like kernel builds, page fault microbench looked good as
> > expected from the fair zone allocation policy fixes.
> >
> > Big downside is that ebizzy performance is *destroyed* in that RC2 patch
> > somewhere
> >
> > ebizzy
> > 3.12.6 3.12.6 3.12.7-rc2
> > vanilla backport-v1r2 stablerc2
> > Mean 1 3278.67 ( 0.00%) 3180.67 ( -2.99%) 3212.00 ( -2.03%)
> > Mean 2 2322.67 ( 0.00%) 2294.67 ( -1.21%) 1839.00 (-20.82%)
> > Mean 3 2257.00 ( 0.00%) 2218.67 ( -1.70%) 1664.00 (-26.27%)
> > Mean 4 2268.00 ( 0.00%) 2224.67 ( -1.91%) 1629.67 (-28.15%)
> > Mean 5 2247.67 ( 0.00%) 2255.67 ( 0.36%) 1582.33 (-29.60%)
> > Mean 6 2263.33 ( 0.00%) 2251.33 ( -0.53%) 1547.67 (-31.62%)
> > Mean 7 2273.67 ( 0.00%) 2222.67 ( -2.24%) 1545.67 (-32.02%)
> > Mean 8 2254.67 ( 0.00%) 2232.33 ( -0.99%) 1535.33 (-31.90%)
> > Mean 12 2237.67 ( 0.00%) 2266.33 ( 1.28%) 1543.33 (-31.03%)
> > Mean 16 2201.33 ( 0.00%) 2252.67 ( 2.33%) 1540.33 (-30.03%)
> > Mean 20 2205.67 ( 0.00%) 2229.33 ( 1.07%) 1537.33 (-30.30%)
> > Mean 24 2162.33 ( 0.00%) 2168.67 ( 0.29%) 1535.33 (-29.00%)
> > Mean 28 2139.33 ( 0.00%) 2107.67 ( -1.48%) 1535.00 (-28.25%)
> > Mean 32 2084.67 ( 0.00%) 2089.00 ( 0.21%) 1537.33 (-26.26%)
> > Mean 36 2002.00 ( 0.00%) 2020.00 ( 0.90%) 1530.33 (-23.56%)
> > Mean 40 1972.67 ( 0.00%) 1978.67 ( 0.30%) 1530.33 (-22.42%)
> > Mean 44 1951.00 ( 0.00%) 1953.67 ( 0.14%) 1531.00 (-21.53%)
> > Mean 48 1931.67 ( 0.00%) 1930.67 ( -0.05%) 1526.67 (-20.97%)
> >
> > Figures are records/sec, more is better for increasing numbers of threads
> > up to 48 which is the number of logical CPUs in the machine. Three kernels
> > tested
> >
> > 3.12.6 is self-explanatory
> > backport-v1r2 is the backported series I sent you
> > stablerc2 is the rc2 patch I pulled from kernel.org
> >
> > I'm not that familiar with the stable workflow but stable-queue.git looked
> > like it had the correct quilt tree so bisection is in progress. If I had
> > to bet money on it, I'd bet it's going to be scheduler or power management
> > related mostly because problems in both of those areas have tended to
> > screw ebizzy recently.
> >
>
> I was not far off. Bisection identified the following commit
>
> 3d97ea0816589c818ac62fb401e61c3b6a59f351 is the first bad commit
> commit 3d97ea0816589c818ac62fb401e61c3b6a59f351
> Author: Len Brown <[email protected]>
> Date: Wed Dec 18 16:44:57 2013 -0500
>
> x86 idle: Repair large-server 50-watt idle-power regression
>
> commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream.
>
> Linux 3.10 changed the timing of how thread_info->flags is touched:
>
> x86: Use generic idle loop
> (7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
>
> This caused Intel NHM-EX and WSM-EX servers to experience a large number
> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
> from deep C-states to shallow C-states, which caused these platforms
> to experience a significant increase in idle power.
>
> Note that this issue was already present before the commit above,
> however, it wasn't seen often enough to be noticed in power measurements.
>
> Here we extend an errata workaround from the Core2 EX "Dunnington"
> to extend to NHM-EX and WSM-EX, to prevent these immediate
> returns from MWAIT, reducing idle power on these platforms.
>
> While only acpi_idle ran on Dunnington, intel_idle
> may also run on these two newer systems.
> As of today, there are no other models that are known
> to need this tweak.
>
> Link: http://lkml.kernel.org/r/CAJvTdK=%[email protected]
> Signed-off-by: Len Brown <[email protected]>
> Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
> Signed-off-by: H. Peter Anvin <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> Len, HPA, the x86 idle regression fix fubars ebizzy as a consequence, I
> don't know why. I know the workload is not that important (and I expected
> ebizzy to be unaffected in this test) but it is probably indicative of
> other performance regressions hiding in there. It was caught via -stable
> testing by accident but I checked and upstream is also affected. This is
> a snippet from the bisection log
>
> Wed 8 Jan 09:53:59 GMT 2014 compass ebizzy v3.12.6 mean-4:2317 good
> Wed 8 Jan 10:13:04 GMT 2014 compass ebizzy v3.12.7-rc2 mean-4:1631 bad
> Wed 8 Jan 10:27:45 GMT 2014 compass ebizzy a202b4808e500f4fd53b6cec150c8fe214c70183 mean-4:1620 bad
> Wed 8 Jan 10:41:36 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2290 good
> Wed 8 Jan 10:55:14 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2266 good
> Wed 8 Jan 11:09:04 GMT 2014 compass ebizzy c62a6f8a28bf8897ba0903cf332d761c1132e48d mean-4:1624 bad
> Wed 8 Jan 11:22:46 GMT 2014 compass ebizzy 346679aad15c3608844f6b433b8d8ba56ad03802 mean-4:2280 good
> Wed 8 Jan 11:36:32 GMT 2014 compass ebizzy 36b9512dc19b535d72c1035048a95ec1c765d403 mean-4:1641 bad
> Wed 8 Jan 11:50:22 GMT 2014 compass ebizzy 1a82fc9ab8bb6b4a5ee5cd32d570d6ff0b77efb2 mean-4:1627 bad
> Wed 8 Jan 12:04:15 GMT 2014 compass ebizzy 3d97ea0816589c818ac62fb401e61c3b6a59f351 mean-4:1619 bad
> Wed 8 Jan 13:10:03 GMT 2014 compass ebizzy v3.13-rc7 mean-4:1619 bad
> Wed 8 Jan 13:39:19 GMT 2014 compass ebizzy v3.12.7-rc2-revert mean-4:2276 good
>
> mean-4 figures are records/sec as recorded by the bisection test. The
> bisection points are based on the -stable quilt tree so the commit ids are
> meaningless but you can see good/bad figures are relatively stable leading
> me to conclude the bisection is valid.
>
> v3.12.6 was 2317 records/second and considered "good". The 3.12.7-rc2
> stable candidate and 3.13-rc7 are both "bad". Reverting the single patch
> from v3.12.7-rc2 restores performance.
>
> Greg, this does not affect your -stable release as such because upstream is
> also affected. If you release with the patch merged then the upstream fix
> (whatever that is) will also need to be included in -stable later. If you
> release without the patch then both upstream fixes will be later required
> and some Intel machines will continue to consume excessive amounts of
> power in the meantime.

Thanks, I'll just leave -stable as-is, and pick up the fix from upstream
when it hits there.

greg k-h

2014-01-09 20:07:07

by Len Brown

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

Hi Mel,
Thanks for the bisect.
What is the cpuid of the machine that sees the regression?

thanks,
-Len


On Wed, Jan 8, 2014 at 8:48 AM, Mel Gorman <[email protected]> wrote:
> Adding LKML to the list as this -stable snifftest has identified an
> upstream regression.
>
> On Wed, Jan 08, 2014 at 10:43:40AM +0000, Mel Gorman wrote:
>> On Tue, Jan 07, 2014 at 08:30:12PM +0000, Mel Gorman wrote:
>> > On Tue, Jan 07, 2014 at 10:54:40AM -0800, Greg KH wrote:
>> > > On Tue, Jan 07, 2014 at 06:17:15AM -0800, Greg KH wrote:
>> > > > On Tue, Jan 07, 2014 at 02:00:35PM +0000, Mel Gorman wrote:
>> > > > > A number of NUMA balancing patches were tagged for -stable but I got a
>> > > > > number of rejected mails from either Greg or his robot minion. The list
>> > > > > of relevant patches is
>> > > > >
>> > > > > FAILED: patch "[PATCH] mm: numa: serialise parallel get_user_page against THP"
>> > > > > FAILED: patch "[PATCH] mm: numa: call MMU notifiers on THP migration"
>> > > > > MERGED: Patch "mm: clear pmd_numa before invalidating"
>> > > > > FAILED: patch "[PATCH] mm: numa: do not clear PMD during PTE update scan"
>> > > > > FAILED: patch "[PATCH] mm: numa: do not clear PTE for pte_numa update"
>> > > > > MERGED: Patch "mm: numa: ensure anon_vma is locked to prevent parallel THP splits"
>> > > > > MERGED: Patch "mm: numa: avoid unnecessary work on the failure path"
>> > > > > MERGED: Patch "sched: numa: skip inaccessible VMAs"
>> > > > > FAILED: patch "[PATCH] mm: numa: clear numa hinting information on mprotect"
>> > > > > FAILED: patch "[PATCH] mm: numa: avoid unnecessary disruption of NUMA hinting during"
>> > > > > Patch "mm: fix TLB flush race between migration, and change_protection_range"
>> > > > > Patch "mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates"
>> > > > > FAILED: patch "[PATCH] mm: numa: defer TLB flush for THP migration as long as"
>> > > > >
>> > > > > Fixing the rejects one at a time may cause other conflicts due to ordering
>> > > > > issues. Instead, this patch series against 3.12.6 is the full list of
>> > > > > backported patches in the expected order. Greg, unfortunately this means
>> > > > > you may have to drop some patches already in your stable tree and reapply
>> > > > > but on the plus side they should be then in the correct order for bisection
>> > > > > purposes and you'll know I've tested this combination of patches.
>> > > >
>> > > > Many thanks for these, I'll go queue them up in a bit and drop the
>> > > > others to ensure I got all of this correct.
>> > >
>> > > Ok, I've now queued all of these up, in this order, so we should be
>> > > good.
>> > >
>> > > I'll do a -rc2 in a bit as it needs some testing.
>> > >
>> >
>> > Thanks a million. I should be cc'd on some of those so I'll pick up the
>> > final result and run it through the same tests just to be sure.
>> >
>>
>> Ok, tests completed and look more or less as expected. This is not to
>> say the performance results are *good* as such. Workloads that normally
>> demonstrate automatic numa balancing suffered because of other patches that
>> were merged (primarily fair zone allocation policy) that had interesting
>> side-effects. However, it now does not crash under heavy stress and I
>> prefer working a little slowly than crashing fast. NAS at least looks
>> better.
>>
>> Other workloads like kernel builds, page fault microbench looked good as
>> expected from the fair zone allocation policy fixes.
>>
>> Big downside is that ebizzy performance is *destroyed* in that RC2 patch
>> somewhere
>>
>> ebizzy
>> 3.12.6 3.12.6 3.12.7-rc2
>> vanilla backport-v1r2 stablerc2
>> Mean 1 3278.67 ( 0.00%) 3180.67 ( -2.99%) 3212.00 ( -2.03%)
>> Mean 2 2322.67 ( 0.00%) 2294.67 ( -1.21%) 1839.00 (-20.82%)
>> Mean 3 2257.00 ( 0.00%) 2218.67 ( -1.70%) 1664.00 (-26.27%)
>> Mean 4 2268.00 ( 0.00%) 2224.67 ( -1.91%) 1629.67 (-28.15%)
>> Mean 5 2247.67 ( 0.00%) 2255.67 ( 0.36%) 1582.33 (-29.60%)
>> Mean 6 2263.33 ( 0.00%) 2251.33 ( -0.53%) 1547.67 (-31.62%)
>> Mean 7 2273.67 ( 0.00%) 2222.67 ( -2.24%) 1545.67 (-32.02%)
>> Mean 8 2254.67 ( 0.00%) 2232.33 ( -0.99%) 1535.33 (-31.90%)
>> Mean 12 2237.67 ( 0.00%) 2266.33 ( 1.28%) 1543.33 (-31.03%)
>> Mean 16 2201.33 ( 0.00%) 2252.67 ( 2.33%) 1540.33 (-30.03%)
>> Mean 20 2205.67 ( 0.00%) 2229.33 ( 1.07%) 1537.33 (-30.30%)
>> Mean 24 2162.33 ( 0.00%) 2168.67 ( 0.29%) 1535.33 (-29.00%)
>> Mean 28 2139.33 ( 0.00%) 2107.67 ( -1.48%) 1535.00 (-28.25%)
>> Mean 32 2084.67 ( 0.00%) 2089.00 ( 0.21%) 1537.33 (-26.26%)
>> Mean 36 2002.00 ( 0.00%) 2020.00 ( 0.90%) 1530.33 (-23.56%)
>> Mean 40 1972.67 ( 0.00%) 1978.67 ( 0.30%) 1530.33 (-22.42%)
>> Mean 44 1951.00 ( 0.00%) 1953.67 ( 0.14%) 1531.00 (-21.53%)
>> Mean 48 1931.67 ( 0.00%) 1930.67 ( -0.05%) 1526.67 (-20.97%)
>>
>> Figures are records/sec, more is better for increasing numbers of threads
>> up to 48 which is the number of logical CPUs in the machine. Three kernels
>> tested
>>
>> 3.12.6 is self-explanatory
>> backport-v1r2 is the backported series I sent you
>> stablerc2 is the rc2 patch I pulled from kernel.org
>>
>> I'm not that familiar with the stable workflow but stable-queue.git looked
>> like it had the correct quilt tree so bisection is in progress. If I had
>> to bet money on it, I'd bet it's going to be scheduler or power management
>> related mostly because problems in both of those areas have tended to
>> screw ebizzy recently.
>>
>
> I was not far off. Bisection identified the following commit
>
> 3d97ea0816589c818ac62fb401e61c3b6a59f351 is the first bad commit
> commit 3d97ea0816589c818ac62fb401e61c3b6a59f351
> Author: Len Brown <[email protected]>
> Date: Wed Dec 18 16:44:57 2013 -0500
>
> x86 idle: Repair large-server 50-watt idle-power regression
>
> commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream.
>
> Linux 3.10 changed the timing of how thread_info->flags is touched:
>
> x86: Use generic idle loop
> (7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
>
> This caused Intel NHM-EX and WSM-EX servers to experience a large number
> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
> from deep C-states to shallow C-states, which caused these platforms
> to experience a significant increase in idle power.
>
> Note that this issue was already present before the commit above,
> however, it wasn't seen often enough to be noticed in power measurements.
>
> Here we extend an errata workaround from the Core2 EX "Dunnington"
> to extend to NHM-EX and WSM-EX, to prevent these immediate
> returns from MWAIT, reducing idle power on these platforms.
>
> While only acpi_idle ran on Dunnington, intel_idle
> may also run on these two newer systems.
> As of today, there are no other models that are known
> to need this tweak.
>
> Link: http://lkml.kernel.org/r/CAJvTdK=%[email protected]
> Signed-off-by: Len Brown <[email protected]>
> Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
> Signed-off-by: H. Peter Anvin <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> Len, HPA, the x86 idle regression fix fubars ebizzy as a consequence, I
> don't know why. I know the workload is not that important (and I expected
> ebizzy to be unaffected in this test) but it is probably indicative of
> other performance regressions hiding in there. It was caught via -stable
> testing by accident but I checked and upstream is also affected. This is
> a snippet from the bisection log
>
> Wed 8 Jan 09:53:59 GMT 2014 compass ebizzy v3.12.6 mean-4:2317 good
> Wed 8 Jan 10:13:04 GMT 2014 compass ebizzy v3.12.7-rc2 mean-4:1631 bad
> Wed 8 Jan 10:27:45 GMT 2014 compass ebizzy a202b4808e500f4fd53b6cec150c8fe214c70183 mean-4:1620 bad
> Wed 8 Jan 10:41:36 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2290 good
> Wed 8 Jan 10:55:14 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2266 good
> Wed 8 Jan 11:09:04 GMT 2014 compass ebizzy c62a6f8a28bf8897ba0903cf332d761c1132e48d mean-4:1624 bad
> Wed 8 Jan 11:22:46 GMT 2014 compass ebizzy 346679aad15c3608844f6b433b8d8ba56ad03802 mean-4:2280 good
> Wed 8 Jan 11:36:32 GMT 2014 compass ebizzy 36b9512dc19b535d72c1035048a95ec1c765d403 mean-4:1641 bad
> Wed 8 Jan 11:50:22 GMT 2014 compass ebizzy 1a82fc9ab8bb6b4a5ee5cd32d570d6ff0b77efb2 mean-4:1627 bad
> Wed 8 Jan 12:04:15 GMT 2014 compass ebizzy 3d97ea0816589c818ac62fb401e61c3b6a59f351 mean-4:1619 bad
> Wed 8 Jan 13:10:03 GMT 2014 compass ebizzy v3.13-rc7 mean-4:1619 bad
> Wed 8 Jan 13:39:19 GMT 2014 compass ebizzy v3.12.7-rc2-revert mean-4:2276 good
>
> mean-4 figures are records/sec as recorded by the bisection test. The
> bisection points are based on the -stable quilt tree so the commit ids are
> meaningless but you can see good/bad figures are relatively stable leading
> me to conclude the bisection is valid.
>
> v3.12.6 was 2317 records/second and considered "good". The 3.12.7-rc2
> stable candidate and 3.13-rc7 are both "bad". Reverting the single patch
> from v3.12.7-rc2 restores performance.
>
> Greg, this does not affect your -stable release as such because upstream is
> also affected. If you release with the patch merged then the upstream fix
> (whatever that is) will also need to be included in -stable later. If you
> release without the patch then both upstream fixes will be later required
> and some Intel machines will continue to consume excessive amounts of
> power in the meantime.
>
> --
> Mel Gorman
> SUSE Labs



--
Len Brown, Intel Open Source Technology Center

2014-01-10 10:14:44

by Mel Gorman

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Thu, Jan 09, 2014 at 03:07:00PM -0500, Len Brown wrote:
> Hi Mel,
> Thanks for the bisect.
> What is the cpuid of the machine that sees the regression?
>

cpuid information for CPU 0. Machine is 4 socket, 48 threads in total.

CPU 0:
vendor_id = "GenuineIntel"
version information (1/eax):
processor type = primary processor (0)
family = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
model = 0xf (15)
stepping id = 0x2 (2)
extended family = 0x0 (0)
extended model = 0x2 (2)
(simple synth) = Intel Xeon E7-8800 / Xeon E7-4800 / Xeon E7-2800 (Westmere-EX A2), 32nm
miscellaneous (1/ebx):
process local APIC physical ID = 0x0 (0)
cpu count = 0x40 (64)
CLFLUSH line size = 0x8 (8)
brand index = 0x0 (0)
brand id = 0x00 (0): unknown
feature information (1/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSENTER and SYSEXIT = true
memory type range registers = true
PTE global bit = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
processor serial number = false
CLFLUSH instruction = true
debug store = true
thermal monitor and clock ctrl = true
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = true
SSE2 extensions = true
self snoop = true
hyper-threading / multi-core supported = true
therm. monitor = true
IA64 = false
pending break event = true
feature information (1/ecx):
PNI/SSE3: Prescott New Instructions = true
PCLMULDQ instruction = true
64-bit debug store = true
MONITOR/MWAIT = true
CPL-qualified debug store = true
VMX: virtual machine extensions = true
SMX: safer mode extensions = true
Enhanced Intel SpeedStep Technology = true
thermal monitor 2 = true
SSSE3 extensions = true
context ID: adaptive or shared L1 data = false
FMA instruction = false
CMPXCHG16B instruction = true
xTPR disable = true
perfmon and debug = true
process context identifiers = true
direct cache access = true
SSE4.1 extensions = true
SSE4.2 extensions = true
extended xAPIC support = true
MOVBE instruction = false
POPCNT instruction = true
time stamp counter deadline = false
AES instruction = true
XSAVE/XSTOR states = false
OS-enabled XSAVE/XSTOR = false
AVX: advanced vector extensions = false
F16C half-precision convert instruction = false
RDRAND instruction = false
hypervisor guest status = false
cache and TLB information (2):
0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x55: instruction TLB: 2M/4M pages, fully, 7 entries
0xeb: L3 cache: 18M, 24-way, 64 byte lines
0xb2: instruction TLB: 4K, 4-way, 64 entries
0xf0: 64 byte prefetching
0x2c: L1 data cache: 32K, 8-way, 64 byte lines
0x21: L2 cache: 256K MLC, 8-way, 64 byte lines
0xca: L2 TLB: 4K, 4-way, 512 entries
0x09: L1 instruction cache: 32K, 4-way, 64-byte lines
processor serial number: 0002-06F2-0000-0000-0000-0000
deterministic cache parameters (4):
--- cache 0 ---
cache type = data cache (1)
cache level = 0x1 (1)
self-initializing cache level = true
fully associative cache = false
extra threads sharing this cache = 0x1 (1)
extra processor cores on this die = 0x1f (31)
system coherency line size = 0x3f (63)
physical line partitions = 0x0 (0)
ways of associativity = 0x7 (7)
WBINVD/INVD behavior on lower caches = false
inclusive to lower caches = false
complex cache indexing = false
number of sets - 1 (s) = 63
--- cache 1 ---
cache type = instruction cache (2)
cache level = 0x1 (1)
self-initializing cache level = true
fully associative cache = false
extra threads sharing this cache = 0x1 (1)
extra processor cores on this die = 0x1f (31)
system coherency line size = 0x3f (63)
physical line partitions = 0x0 (0)
ways of associativity = 0x3 (3)
WBINVD/INVD behavior on lower caches = false
inclusive to lower caches = false
complex cache indexing = false
number of sets - 1 (s) = 127
--- cache 2 ---
cache type = unified cache (3)
cache level = 0x2 (2)
self-initializing cache level = true
fully associative cache = false
extra threads sharing this cache = 0x1 (1)
extra processor cores on this die = 0x1f (31)
system coherency line size = 0x3f (63)
physical line partitions = 0x0 (0)
ways of associativity = 0x7 (7)
WBINVD/INVD behavior on lower caches = false
inclusive to lower caches = false
complex cache indexing = false
number of sets - 1 (s) = 511
--- cache 3 ---
cache type = unified cache (3)
cache level = 0x3 (3)
self-initializing cache level = true
fully associative cache = false
extra threads sharing this cache = 0x3f (63)
extra processor cores on this die = 0x1f (31)
system coherency line size = 0x3f (63)
physical line partitions = 0x0 (0)
ways of associativity = 0x17 (23)
WBINVD/INVD behavior on lower caches = false
inclusive to lower caches = true
complex cache indexing = true
number of sets - 1 (s) = 12287
MONITOR/MWAIT (5):
smallest monitor-line size (bytes) = 0x40 (64)
largest monitor-line size (bytes) = 0x40 (64)
enum of Monitor-MWAIT exts supported = true
supports intrs as break-event for MWAIT = true
number of C0 sub C-states using MWAIT = 0x0 (0)
number of C1 sub C-states using MWAIT = 0x2 (2)
number of C2 sub C-states using MWAIT = 0x1 (1)
number of C3/C6 sub C-states using MWAIT = 0x1 (1)
number of C4/C7 sub C-states using MWAIT = 0x0 (0)
Thermal and Power Management Features (6):
digital thermometer = true
Intel Turbo Boost Technology = false
ARAT always running APIC timer = true
PLN power limit notification = false
ECMD extended clock modulation duty = false
PTM package thermal management = false
digital thermometer thresholds = 0x1 (1)
ACNT/MCNT supported performance measure = true
ACNT2 available = false
performance-energy bias capability = false
extended feature flags (7):
FSGSBASE instructions = false
BMI instruction = false
SMEP support = false
enhanced REP MOVSB/STOSB = false
INVPCID instruction = false
Direct Cache Access Parameters (9):
PLATFORM_DCA_CAP MSR bits = 0
Architecture Performance Monitoring Features (0xa/eax):
version ID = 0x3 (3)
number of counters per logical processor = 0x4 (4)
bit width of counter = 0x30 (48)
length of EBX bit vector = 0x7 (7)
Architecture Performance Monitoring Features (0xa/ebx):
core cycle event not available = false
instruction retired event not available = false
reference cycles event not available = true
last-level cache ref event not available = false
last-level cache miss event not avail = false
branch inst retired event not available = false
branch mispred retired event not avail = false
Architecture Performance Monitoring Features (0xa/edx):
number of fixed counters = 0x3 (3)
bit width of fixed counters = 0x30 (48)
x2APIC features / processor topology (0xb):
--- level 0 (thread) ---
bits to shift APIC ID to get next = 0x1 (1)
logical processors at this level = 0x2 (2)
level number = 0x0 (0)
level type = thread (1)
extended APIC ID = 0
--- level 1 (core) ---
bits to shift APIC ID to get next = 0x6 (6)
logical processors at this level = 0xc (12)
level number = 0x1 (1)
level type = core (2)
extended APIC ID = 0
extended feature flags (0x80000001/edx):
SYSCALL and SYSRET instructions = true
execution disable = true
1-GB large page support = true
RDTSCP = true
64-bit extensions technology available = true
Intel feature flags (0x80000001/ecx):
LAHF/SAHF supported in 64-bit mode = true
brand = " Intel(R) Xeon(R) CPU E7- 4807 @ 1.87GHz"
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
instruction # entries = 0x0 (0)
instruction associativity = 0x0 (0)
data # entries = 0x0 (0)
data associativity = 0x0 (0)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
instruction # entries = 0x0 (0)
instruction associativity = 0x0 (0)
data # entries = 0x0 (0)
data associativity = 0x0 (0)
L1 data cache information (0x80000005/ecx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = 0x0 (0)
size (Kb) = 0x0 (0)
L1 instruction cache information (0x80000005/edx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = 0x0 (0)
size (Kb) = 0x0 (0)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x0 (0)
associativity = 8-way (6)
size (Kb) = 0x100 (256)
L3 cache information (0x80000006/edx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = L2 off (0)
size (in 512Kb units) = 0x0 (0)
Advanced Power Management Features (0x80000007/edx):
temperature sensing diode = false
frequency ID (FID) control = false
voltage ID (VID) control = false
thermal trip (TTP) = false
thermal monitor (TM) = false
software thermal control (STC) = false
100 MHz multiplier control = false
hardware P-State control = false
TscInvariant = true
Physical Address and Linear Address Size (0x80000008/eax):
maximum physical address bits = 0x2c (44)
maximum linear (virtual) address bits = 0x30 (48)
maximum guest physical address bits = 0x0 (0)
Logical CPU cores (0x80000008/ecx):
number of CPU cores - 1 = 0x0 (0)
ApicIdCoreIdSize = 0x0 (0)
(multi-processing synth): multi-core (c=6), hyper-threaded (t=2)
(multi-processing method): Intel leaf 0xb
(APIC widths synth): CORE_width=6 SMT_width=1
(APIC synth): PKG_ID=0 CORE_ID=0 SMT_ID=0
(synth) = Intel Xeon E7-8800 / Xeon E7-4800 / Xeon E7-2800 (Westmere-EX A2), 32nm

--
Mel Gorman
SUSE Labs

2014-01-10 10:26:23

by Mel Gorman

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Fri, Jan 10, 2014 at 01:04:55AM -0500, Len Brown wrote:
> Hi Mel,
>
> I downloaded ebizzy and ran on an 80-thread WSM-EX.

Default parameters? If so, the default is 2xNR_CPUs. My initial tests only
ran up to NR_CPUs but I was seeing regressions throughout so I doubt it
levelled out for higher numbers of clients.

I used mmtests to run ebizzy based on the
configs/config-global-dhp__pagealloc-performance config file with the
following relevant lines changed just for the bisection itself

export MMTESTS="ebizzy"
export EBIZZY_MAX_THREADS=5
export EBIZZY_DURATION=20
export EBIZZY_ITERATIONS=3

Even though the test ran up to 5 threads, I was only using the result
for 4 threads for the bisection.

> But I got quite different number than you, so I'm wondering if there is
> something
> special I need to get the same results you see. I generally see scores
> around 6900 - 7000.
> my reference kernel is built on top of
> b0031f227e47919797dc0e1c1990f3ef151ff0cc
> which is upstream on 12/17, which is when i wrote that patch -- if it
> matters.
>
> But worse, I don't see any difference in ebizzy performance with/without
> the CLFLUSH patch.
>
> Please let me know what I can do to reproduce the results you see.
>

You could try running within mmtests and see what falls out? I don't think
I am doing anything weird in there but it wouldn't be the first time there
was a mistake in testing methodology that led to inconsistent results
between testers.

git clone https://github.com/gormanm/mmtests
cd mmtests
vi configs/config-global-dhp__pagealloc-performance
# edit file to set the lines above to match my bisection
./run-mmtests.sh --no-monitor --config configs/config-global-dhp__pagealloc-performance baseline
# boot new kernel
./run-mmtests.sh --no-monitor --config configs/config-global-dhp__pagealloc-performance patched
cd work/log
../../compare-kernels.sh

Of course, we could also be differing on kernel config in some relevant
way or it might be some other unfortunate timing issue.

> Also, can you try this attached incremental patch to see if it helps?

I'll fire it up after pushing send on this mail.

Thanks

--
Mel Gorman
SUSE Labs

2014-01-10 14:38:50

by Mel Gorman

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Fri, Jan 10, 2014 at 10:26:17AM +0000, Mel Gorman wrote:
> > Also, can you try this attached incremental patch to see if it helps?
>
> I'll fire it up after pushing send on this mail.
>

Relevant parts of the mmtests config file used

export MMTESTS="ebizzy"
export EBIZZY_MAX_THREADS=$((NUM_CPU*2))
export EBIZZY_DURATION=10
export EBIZZY_ITERATIONS=10

Three kernels tested -- vanilla kernel, fixidle is your patch and revert is a
revert of 40e2d7f9b5dae048789c64672bf3027fbb663ffa

ebizzy
3.13.0-rc7 3.13.0-rc7 3.13.0-rc7
vanilla fixidle-v1r1 revert-v1r1
Mean 1 3153.70 ( 0.00%) 3271.40 ( 3.73%) 3170.40 ( 0.53%)
Mean 2 1725.90 ( 0.00%) 1714.90 ( -0.64%) 2366.90 ( 37.14%)
Mean 3 1659.70 ( 0.00%) 1654.30 ( -0.33%) 2301.10 ( 38.65%)
Mean 4 1636.10 ( 0.00%) 1638.20 ( 0.13%) 2287.30 ( 39.80%)
Mean 5 1586.90 ( 0.00%) 1598.80 ( 0.75%) 2312.10 ( 45.70%)
Mean 6 1544.00 ( 0.00%) 1555.00 ( 0.71%) 2264.00 ( 46.63%)
Mean 7 1543.10 ( 0.00%) 1543.70 ( 0.04%) 2296.50 ( 48.82%)
Mean 8 1541.70 ( 0.00%) 1548.40 ( 0.43%) 2284.70 ( 48.19%)
Mean 12 1542.50 ( 0.00%) 1550.80 ( 0.54%) 2268.20 ( 47.05%)
Mean 16 1543.00 ( 0.00%) 1546.10 ( 0.20%) 2261.70 ( 46.58%)
Mean 20 1541.60 ( 0.00%) 1551.20 ( 0.62%) 2262.60 ( 46.77%)
Mean 24 1548.00 ( 0.00%) 1546.20 ( -0.12%) 2240.30 ( 44.72%)
Mean 28 1537.00 ( 0.00%) 1544.20 ( 0.47%) 2172.40 ( 41.34%)
Mean 32 1542.70 ( 0.00%) 1552.70 ( 0.65%) 2118.80 ( 37.34%)
Mean 36 1538.70 ( 0.00%) 1548.80 ( 0.66%) 2074.40 ( 34.82%)
Mean 40 1536.40 ( 0.00%) 1539.90 ( 0.23%) 2041.20 ( 32.86%)
Mean 44 1535.00 ( 0.00%) 1542.90 ( 0.51%) 2011.60 ( 31.05%)
Mean 48 1534.90 ( 0.00%) 1544.00 ( 0.59%) 2002.60 ( 30.47%)
Mean 52 1530.40 ( 0.00%) 1531.90 ( 0.10%) 1994.80 ( 30.35%)
Mean 56 1531.50 ( 0.00%) 1527.10 ( -0.29%) 1980.90 ( 29.34%)
Mean 60 1528.90 ( 0.00%) 1527.40 ( -0.10%) 1995.60 ( 30.53%)
Mean 64 1527.10 ( 0.00%) 1526.50 ( -0.04%) 1985.50 ( 30.02%)
Mean 68 1527.80 ( 0.00%) 1522.50 ( -0.35%) 1983.70 ( 29.84%)
Mean 72 1524.50 ( 0.00%) 1523.50 ( -0.07%) 1976.70 ( 29.66%)
Mean 76 1520.80 ( 0.00%) 1525.20 ( 0.29%) 1964.10 ( 29.15%)
Mean 80 1522.30 ( 0.00%) 1519.30 ( -0.20%) 1966.20 ( 29.16%)
Mean 84 1522.60 ( 0.00%) 1520.00 ( -0.17%) 1948.30 ( 27.96%)
Mean 88 1521.40 ( 0.00%) 1521.40 ( 0.00%) 1949.00 ( 28.11%)
Mean 92 1515.80 ( 0.00%) 1517.10 ( 0.09%) 1938.00 ( 27.85%)
Mean 96 1516.00 ( 0.00%) 1517.40 ( 0.09%) 1930.50 ( 27.34%)

The latest patch makes little difference. Reverting makes a massive
difference. I didn't include the standard deviations but they are very
small and the performance gain from the revert is far outside the noise.

ebizzy Thread spread
3.13.0-rc7 3.13.0-rc7 3.13.0-rc7
vanilla fixidle-v1r1 revert-v1r1
Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Mean 2 39.80 ( 0.00%) 203.40 (-411.06%) 0.10 ( 99.75%)
Mean 3 90.90 ( 0.00%) 70.10 ( 22.88%) 0.30 ( 99.67%)
Mean 4 83.00 ( 0.00%) 79.30 ( 4.46%) 0.30 ( 99.64%)
Mean 5 16.20 ( 0.00%) 37.90 (-133.95%) 0.40 ( 97.53%)
Mean 6 54.40 ( 0.00%) 45.40 ( 16.54%) 0.50 ( 99.08%)
Mean 7 45.90 ( 0.00%) 37.90 ( 17.43%) 0.30 ( 99.35%)
Mean 8 36.30 ( 0.00%) 43.90 (-20.94%) 0.50 ( 98.62%)
Mean 12 31.80 ( 0.00%) 29.80 ( 6.29%) 0.60 ( 98.11%)
Mean 16 26.20 ( 0.00%) 26.70 ( -1.91%) 0.70 ( 97.33%)
Mean 20 20.70 ( 0.00%) 19.00 ( 8.21%) 1.00 ( 95.17%)
Mean 24 20.10 ( 0.00%) 18.00 ( 10.45%) 1.40 ( 93.03%)
Mean 28 17.50 ( 0.00%) 15.80 ( 9.71%) 3.50 ( 80.00%)
Mean 32 15.50 ( 0.00%) 16.20 ( -4.52%) 4.20 ( 72.90%)
Mean 36 14.60 ( 0.00%) 14.60 ( 0.00%) 3.80 ( 73.97%)
Mean 40 13.40 ( 0.00%) 12.50 ( 6.72%) 3.70 ( 72.39%)
Mean 44 12.20 ( 0.00%) 13.50 (-10.66%) 3.20 ( 73.77%)
Mean 48 11.80 ( 0.00%) 13.00 (-10.17%) 2.70 ( 77.12%)
Mean 52 11.10 ( 0.00%) 11.20 ( -0.90%) 2.60 ( 76.58%)
Mean 56 10.00 ( 0.00%) 10.50 ( -5.00%) 2.10 ( 79.00%)
Mean 60 10.00 ( 0.00%) 10.00 ( 0.00%) 2.30 ( 77.00%)
Mean 64 9.30 ( 0.00%) 9.30 ( 0.00%) 2.60 ( 72.04%)
Mean 68 9.80 ( 0.00%) 9.70 ( 1.02%) 2.00 ( 79.59%)
Mean 72 9.80 ( 0.00%) 9.00 ( 8.16%) 2.00 ( 79.59%)
Mean 76 8.80 ( 0.00%) 9.60 ( -9.09%) 2.00 ( 77.27%)
Mean 80 8.20 ( 0.00%) 8.40 ( -2.44%) 2.00 ( 75.61%)
Mean 84 8.30 ( 0.00%) 8.00 ( 3.61%) 2.10 ( 74.70%)
Mean 88 8.20 ( 0.00%) 7.90 ( 3.66%) 2.00 ( 75.61%)
Mean 92 8.40 ( 0.00%) 7.50 ( 10.71%) 1.90 ( 77.38%)
Mean 96 8.10 ( 0.00%) 7.60 ( 6.17%) 2.20 ( 72.84%)

This shows the difference in performance between threads. It's
interesting to note that reverting the patch gives almost equal
performance to each thread.

It's worth noting that automatic NUMA balancing is enabled and active
during these tests which would be one large potential difference between
our configs. I do not think it would be enough to explain the large
performance differences though.

--
Mel Gorman
SUSE Labs

2014-01-13 19:24:18

by Mel Gorman

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Wed, Jan 08, 2014 at 01:48:58PM +0000, Mel Gorman wrote:
> Adding LKML to the list as this -stable snifftest has identified an
> upstream regression.
>

This is a false alarm.

The test machine in question was originally installed with a beta version
of openSUSE 13.1. It included a package by default that set malloc debugging
parameters I was not aware of. Normally the package is there to catch bugs
during beta testing and is removed before a GA release, but it is left in
place if a user does a distribution update.

With the debugging RPM installed, the free paths contended on a global
mutex in glibc. Ebizzy had been classified as a CPU-intensive and
malloc/free-intensive benchmark (not that common), but turbostat showed that
the CPUs were in C6 over 95% of the time and mpstat verified that the CPUs
were mostly idle. It did not take long to see that everything was blocked
waiting on a futex and to identify where it was in glibc. It is only a
factor when malloc debugging is enabled, so normally people would not see it.

The "regression" is because CPUs are reaching C6 as they should and there
is a delay when exiting it. This is behaving as designed and fixing this
would involve doing something stupid. Once the problem RPM was removed
ebizzy performed as expected. 3.13-rc7, the revert and forcing max_cstate=1
all have similar performance.
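
To illustrate the failure mode, here is a hypothetical stand-in for the
allocate/touch/free pattern ebizzy exercises; it is not ebizzy itself, and
the chunk size and thread count are made up. With the beta package's malloc
debugging parameters in place, the free() calls serialize on a global glibc
mutex, the worker threads block, and the CPUs spend most of their time in
C6, which is exactly what turbostat and mpstat showed.

/*
 * Hypothetical stand-in for the ebizzy-style allocation pattern described
 * above: each thread loops allocating a chunk, touching it and freeing it.
 * With glibc malloc debugging active, the free() path contends on a global
 * mutex, the threads sleep and the CPUs sit idle in deep C-states.
 * Build with: gcc -O2 -pthread
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS	4
#define CHUNK_SIZE	(256 * 1024)

static volatile unsigned long records[NTHREADS];

static void *worker(void *arg)
{
	unsigned long id = (unsigned long)arg;

	for (;;) {
		char *chunk = malloc(CHUNK_SIZE);

		if (!chunk)
			break;
		memset(chunk, 0xaa, CHUNK_SIZE);	/* touch the memory */
		free(chunk);				/* contended with debug hooks */
		records[id]++;
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	for (unsigned long i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);

	sleep(10);	/* run for a fixed interval, then report */
	for (int i = 0; i < NTHREADS; i++)
		printf("thread %d: %lu loops\n", i, records[i]);

	return 0;	/* exiting main terminates the workers */
}

The per-thread loop counts would be expected to collapse once the allocator
lock, rather than the CPU, becomes the bottleneck.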

Sorry about the noise.

--
Mel Gorman
SUSE Labs

2014-01-13 21:11:32

by Greg Kroah-Hartman

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Mon, Jan 13, 2014 at 07:24:06PM +0000, Mel Gorman wrote:
> On Wed, Jan 08, 2014 at 01:48:58PM +0000, Mel Gorman wrote:
> > Adding LKML to the list as this -stable snifftest has identified an
> > upstream regression.
> >
>
> This is a false alarm.

Thanks for tracking this down and letting us know.

greg k-h

2014-01-14 07:31:24

by Len Brown

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

> This is a false alarm.

Thanks for the follow-up, Mel.

Agreed, it makes no sense for ebizzy to measure 'throughput' when a library
debug bottleneck prevents it from scaling past 3% CPU utilization.

Still, the broken configuration did find a difference due to the addition
of CLFLUSH on this box. It makes me wonder if we will find issues on
workloads that may depend on the latency of idle entry/exit, or perhaps
sensitivity to the state of the cache line containing thread_info->flags.

If somebody runs into such a workload, please try changing this one line of
intel_idle.c to limit the CLFLUSH to C-states deeper than C1E, and let me
know what you see.

-	if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+	if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
 		clflush((void *)&current_thread_info()->flags);

thanks,
Len Brown, Intel Open Source Technology Center

2014-01-14 08:01:35

by Mike Galbraith

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Tue, 2014-01-14 at 02:31 -0500, Len Brown wrote:
> > This is a false alarm.
>
> Thanks for the follow-up, Mel.
>
> > Agreed, it makes no sense for ebizzy to measure 'throughput' when a
> library debug bottleneck
> prevents it from scaling past 3% CPU utilization.
>
> Still, the broken configuration did find a difference due to the
> addition of CLFLUSH on this box.
> It makes me wonder if we will find issues on workloads that may depend
> on the latency
> of idle entry/exit, or perhaps sensitivity to the state of the cache
> line containing thread_info->flags.
>
> If somebody runs into such a workload, please try changing this 1 line
> of intel_idle.c to limit
> the CLFLUSH to C-states deeper than C1E, and let me know what you see.
>
> - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> + if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> clflush((void *)&current_thread_info()->flags);

Hm, seems any high frequency switcher scheduling cross-core (pipe-test,
or maybe a tbench pair) should show the cost to an affected box.

-Mike
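
For reference, a minimal sketch of the kind of cross-core pipe ping-pong
test being suggested: two processes bounce a byte over a pair of pipes, so
every iteration pays for a wakeup and an idle entry/exit on the peer CPU.
Pinning the two halves to different cores (for example with taskset) is
assumed; this is an illustration, not a benchmark taken from the thread.

/*
 * Minimal pipe ping-pong sketch: a parent and child bounce one byte over
 * two pipes for a fixed number of iterations and the parent reports round
 * trips per second.  Run each half on a different core (e.g. under taskset)
 * to exercise cross-core wakeup and idle entry/exit latency.
 */
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 100000

int main(void)
{
	int ping[2], pong[2];
	char byte = 0;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: echo every byte straight back */
		for (int i = 0; i < ITERATIONS; i++) {
			if (read(ping[0], &byte, 1) != 1)
				break;
			if (write(pong[1], &byte, 1) != 1)
				break;
		}
		_exit(0);
	}

	struct timeval start, end;

	gettimeofday(&start, NULL);
	for (int i = 0; i < ITERATIONS; i++) {
		write(ping[1], &byte, 1);	/* wake the peer */
		read(pong[0], &byte, 1);	/* wait for the echo */
	}
	gettimeofday(&end, NULL);
	wait(NULL);

	double secs = (end.tv_sec - start.tv_sec) +
		      (end.tv_usec - start.tv_usec) / 1e6;
	printf("%.0f round trips/sec\n", ITERATIONS / secs);
	return 0;
}

On an affected box, the expectation would be a measurable drop in round
trips per second when the peer CPU is entering idle, which is the cost being
pointed at above.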

2014-01-14 08:24:34

by Mike Galbraith

Subject: Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)

On Tue, 2014-01-14 at 09:01 +0100, Mike Galbraith wrote:
> On Tue, 2014-01-14 at 02:31 -0500, Len Brown wrote:
> > > This is a false alarm.
> >
> > Thanks for the follow-up, Mel.
> >
> > > Agreed, it makes no sense for ebizzy to measure 'throughput' when a
> > library debug bottleneck
> > prevents it from scaling past 3% CPU utilization.
> >
> > Still, the broken configuration did find a difference due to the
> > addition of CLFLUSH on this box.
> > It makes me wonder if we will find issues on workloads that may depend
> > on the latency
> > of idle entry/exit, or perhaps sensitivity to the state of the cache
> > line containing thread_info->flags.
> >
> > If somebody runs into such a workload, please try changing this 1 line
> > of intel_idle.c to limit
> > the CLFLUSH to C-states deeper than C1E, and let me know what you see.
> >
> > - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > + if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > clflush((void *)&current_thread_info()->flags);
>
> Hm, seems any high frequency switcher scheduling cross-core (pipe-test,
> or maybe a tbench pair) should show the cost to an affected box.

Oh yeah.. :) unless of course it's a Q6600 (poke poke).