Hello,
We believe we've identified a particular commit to the cpuidle code that
seems to be impacting the performance of a variety of workloads. The
simplest way to reproduce it is with the netperf TCP_RR test, so that's
what we're using, on a pair of Sandy Bridge based servers. We also have
data from a large database setup where performance is measurably (and
positively) impacted, though that test data isn't easily shareable.
Included below are test results from 3 test kernels:
kernel        reverts
-----------------------------------------------------------
1) vanilla    upstream (no reverts)
2) perfteam2  e11538d1f03914eb92af5a1a378375c05ae8520c
3) test       69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
              e11538d1f03914eb92af5a1a378375c05ae8520c
In summary, netperf TCP_RR numbers improve by approximately 4% after
reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
seems to get above 40%. Taking that patch out gets C0 near 100% quite
often, and performance increases.
The data below are histograms of %c0 residency, sampled at 1-second
intervals (using turbostat) while under netperf load.
- In the first 4 histograms, %c0 residency falls almost entirely in the
  30-40% bin.
- The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
  shows %c0 mostly in the 80-90% and 90-100% bins.
Below each kernel name are the netperf TCP_RR trans/s numbers for that
kernel that can be disclosed publicly, comparing the 3 test kernels. We
also ran a 4th test with the vanilla kernel where we set
/dev/cpu_dma_latency to 0, which boosts single-threaded TCP_RR
performance more than 11% above baseline.
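(/dev/cpu_dma_latency is the PM QoS interface: a process writes a 32-bit
target latency in microseconds, and the request stays in effect only
while the file descriptor is held open. For anyone reproducing this, a
minimal Python sketch of holding such a request:)

#!/usr/bin/env python
# Minimal sketch: hold the PM QoS cpu_dma_latency target at 0 us.
# The request lasts only while the fd is open, so keep this process
# alive for the duration of the benchmark (needs root).
import signal
import struct

latency = open("/dev/cpu_dma_latency", "wb", buffering=0)
latency.write(struct.pack("=i", 0))  # 32-bit latency target, microseconds

signal.pause()                       # hold the request until interrupted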
3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
TCP_RR trans/s 54323.78
-----------------------------------------------------------
3.10-rc2 vanilla RX (no reverts)
TCP_RR trans/s 48192.47
Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 59]: ***********************************************************
40.0000 - 50.0000 [ 1]: *
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 11]: ***********
40.0000 - 50.0000 [ 49]: *************************************************
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
-----------------------------------------------------------
3.10-rc2 perfteam2 RX (reverts commit
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 49698.69
Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 1]: *
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 59]: ***********************************************************
40.0000 - 50.0000 [ 0]:
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 2]: **
40.0000 - 50.0000 [ 58]: **********************************************************
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 0]:
-----------------------------------------------------------
3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 47766.95
Receiver %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 1]: *
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 27]: ***************************
40.0000 - 50.0000 [ 2]: **
50.0000 - 60.0000 [ 0]:
60.0000 - 70.0000 [ 2]: **
70.0000 - 80.0000 [ 0]:
80.0000 - 90.0000 [ 0]:
90.0000 - 100.0000 [ 28]: ****************************
Sender %c0
0.0000 - 10.0000 [ 1]: *
10.0000 - 20.0000 [ 0]:
20.0000 - 30.0000 [ 0]:
30.0000 - 40.0000 [ 11]: ***********
40.0000 - 50.0000 [ 0]:
50.0000 - 60.0000 [ 1]: *
60.0000 - 70.0000 [ 0]:
70.0000 - 80.0000 [ 3]: ***
80.0000 - 90.0000 [ 7]: *******
90.0000 - 100.0000 [ 38]: **************************************
These results demonstrate that reverting commit
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 restores the CPU's tendency to
stay in the more responsive, performant C-states, and thus yields
measurably better performance.
Even taking into account the changing landscape of CPU governors and
both P- and C-states, we think that a single thread should still be able
to achieve maximum performance. With the current upstream code base,
workloads with a low number of "hot" threads cannot achieve maximum
performance "out of the box".
Also, Intel's LAD recently posted upstream performance results that
include an interesting column in their table of results. See upstream
commit 0a4db187a999, column #3 within the "Performance numbers" table.
It seems to be known, even within Intel, that the deeper C-states incur
a cost too high to bear, as they've explicitly tested restricting the
CPU to the shallow C0/C1 states.
-- Jeremy Eder
On 07/26/2013 01:33 PM, Jeremy Eder wrote:
> [...]
> In summary, netperf TCP_RR numbers improve by approximately 4% after
> reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When
> 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
> seems to get above 40%. Taking that patch out gets C0 near 100% quite
> often, and performance increases.
Could you try running the tests with just the repeat mode
stuff from commit 69a37bea excluded, but leaving the common
infrastructure and commit e11538?
--
All rights reversed
On 7/26/2013 11:13 AM, Rik van Riel wrote:
>
> Could you try running the tests with just the repeat mode
> stuff from commit 69a37bea excluded, but leaving the common
> infrastructure and commit e11538?
>
personally I think we should go the other way around.
revert the set entirely first, and now, and get our performance back
to what it should be
and then see what we can add back without causing the regressions.
this may take longer, or be done in steps, and that's ok.
the end point may well be the same... but we can then evaluate in the right
direction.
On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
> [...]
> personally I think we should go the other way around.
> revert the set entirely first, and now, and get our performance back
> to what it should be
>
> and then see what we can add back without causing the regressions.
> this may take longer, or be done in steps, and that's ok.
>
> the end point may well be the same... but we can then evaluate in the right
> direction.
Works for me. I have no objection to reverting both patches,
if the people planning to fix the code prefer that :)
--
All rights reversed
On Friday, July 26, 2013 02:29:40 PM Rik van Riel wrote:
> [...]
> Works for me. I have no objection to reverting both patches,
> if the people planning to fix the code prefer that :)
OK, I'll queue up the reverts as fixes for 3.11-rc4.
Thanks,
Rafael
On Friday, July 26, 2013 11:48:36 PM Rafael J. Wysocki wrote:
> [...]
> OK, I'll queue up the reverts as fixes for 3.11-rc4.
So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
can access at
http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
However, they are not simple reverts, as we've had some non-trivial
changes on top of those commits already, so I'd appreciate it a lot if
somebody could double-check that I didn't break anything in them.
They are based on top of my master branch for now, but I'll rebase them on
3.11-rc3 when it's out.
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
>> OK, I'll queue up the reverts as fixes for 3.11-rc4.
>
> So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
> can access at
>
> http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
>
> However, they are not simple reverts, as we've had some non-trivial
> changes on top of those commits already, so I'd appreciate it a lot if
> somebody could double-check that I didn't break anything in them.
I've verified that the reverts improve netperf TCP_RR performance.
Here I've got two Xeons, slightly different, so I ran the test in both
directions. I also ran it two ways -- out of the box, plus with cpufreq
set to max frequency. The reverts improve all 4 cases:

                                            JKT → IVT    IVT → JKT

3.11.0-rc2 baseline w/o revert, out of box
                                            20420        19963
                                            20658        19915
                                            20298        20320

3.11.0-rc2 baseline w/o revert, max freq
                                            59658        51427
                                            59663        51503
                                            59416        51343

3.11.0-rc2-00002-g74bce39 (linux-pm “fixes-next” branch), out of box
                                            23227        22056
                                            23306        22125
                                            23387        40226 *
                                                         21991
    * the 40k result saw 2.2 GHz where other runs saw 1.2 GHz

3.11.0-rc2-00002-g74bce39, max freq
                                            67240        57645
                                            64880        56764
                                            65924        57435
Tested-by: Len Brown <[email protected]>
thanks,
Len Brown, Intel Open Source Technology Center
On 07/26/2013 07:33 PM, Jeremy Eder wrote:
> Hello,
>
> We believe we've identified a particular commit to the cpuidle code that
> seems to be impacting the performance of a variety of workloads. The
> simplest way to reproduce it is with the netperf TCP_RR test, so that's
> what we're using, on a pair of Sandy Bridge based servers. We also have
> data from a large database setup where performance is measurably (and
> positively) impacted, though that test data isn't easily shareable.
>
> Included below are test results from 3 test kernels:
Is the system tickless or running with a periodic tick?
> [...]
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
On 07/26/2013 08:29 PM, Rik van Riel wrote:
> [...]
> Works for me. I have no objection to reverting both patches,
> if the people planning to fix the code prefer that :)
I am in favor of reverting these patches too, because they don't solve
the root problem of reducing prediction failures.
The menu governor tries to deduce the next wakeup, but based on per-cpu
events. That means that if a task with a specific behavior is migrated
across cpus, the statistics will be wrong. IMHO, the menu governor has
too many heuristics in there, and maybe we should have the scheduler
pass more information to the governor and remove some of those
heuristics.
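(To illustrate that point with a toy model -- this is not the actual
menu governor code, just a sketch of per-cpu interval statistics going
stale when a task migrates:)

# Toy sketch (not the real menu governor, which lives in
# drivers/cpuidle/governors/menu.c): each cpu predicts its next idle
# duration from its own recent history, so a workload that migrates
# leaves its old cpu predicting intervals that no longer occur.
class CpuIdleStats(object):
    def __init__(self):
        self.intervals = []                # recent idle durations (us)

    def observe(self, us):
        self.intervals = (self.intervals + [us])[-8:]  # keep last 8

    def predict(self):
        # crude stand-in for the governor's heuristics: the average of
        # this cpu's recent observations
        if not self.intervals:
            return None
        return sum(self.intervals) / len(self.intervals)

cpu0, cpu1 = CpuIdleStats(), CpuIdleStats()
for _ in range(8):
    cpu0.observe(100)      # a 100 us ping-pong workload runs on cpu0...
# ...then the scheduler migrates it: new wakeups land on cpu1, while
# cpu0's history (and prediction) still describes the departed task.
print(cpu0.predict(), cpu1.predict())      # -> 100.0 None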
Besides that, on ARM the cpus can be coupled, and the timer used to
detect the prediction failure will wake up the entire cluster, making
the power saving less efficient and skewing the statistics of the other
cpus.
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
On Saturday, July 27, 2013 02:22:53 AM Len Brown wrote:
> [...]
>
> I've verified that the reverts improve netperf TCP_RR performance.
>
> [...]
>
> Tested-by: Len Brown <[email protected]>
Many thanks Len!
Rafael
>
> Besides that, on ARM the cpus can be coupled, and the timer used to
> detect the prediction failure will wake up the entire cluster, making
> the power saving less efficient and skewing the statistics of the other
> cpus.
an ARM-specific governor would be quite appropriate in that case.
>
> The menu governor tries to deduce the next wakeup, but based on per-cpu
> events. That means that if a task with a specific behavior is migrated
> across cpus, the statistics will be wrong.
btw this is largely a misunderstanding;
tasks are not the issue; tasks use timers and those are perfectly predictable.
It's interrupts that are not and the heuristics are for that.
Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
then the menu governor logic is not good for you; rather than looking at the next
timer on the current cpu you need to look at the earliest timer on the set of bundled
cpus as the upper bound of the next wake event.
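(A minimal sketch of that upper-bound rule; next_timer_event() below is
a purely hypothetical per-cpu helper, not an existing kernel interface:)

# Hypothetical sketch: on a wake-all bundle, any cpu's timer wakes
# everyone, so the earliest timer across the bundle -- not just this
# cpu's own next timer -- bounds how long this cpu can expect to sleep.
def next_wake_upper_bound(bundle_cpus, next_timer_event):
    # next_timer_event(cpu) -> microseconds until that cpu's next timer
    # (assumed helper; a real implementation would query per-cpu tick data)
    return min(next_timer_event(cpu) for cpu in bundle_cpus)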
And maybe even more special casing is needed... but I doubt it.
On Mon, Jul 29, 2013 at 02:12:58PM +0100, Arjan van de Ven wrote:
> > [...]
>
> btw this is largely a misunderstanding;
> tasks are not the issue; tasks use timers and those are perfectly predictable.
> It's interrupts that are not and the heuristics are for that.
>
> Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
> then the menu governor logic is not good for you; rather than looking at the next
> timer on the current cpu you need to look at the earliest timer on the set of bundled
> cpus as the upper bound of the next wake event.
Yes, that's true and we have to look into this properly, but certainly
a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs
either, or am I missing something?
Even if the wake-up interrupt just powers up one of the CPUs in a
package and leaves the other(s) alone, all HW state shared by those CPUs
(i.e. caches) must be turned on. What I am asking is: is this bundled
next event a concept that should apply to x86 CPUs too, or is it
entirely managed in FW/HW so that the kernel just should not care?
I still do not understand how this "bundled" next event is managed on
x86 with the menu governor, or rather why it is not managed at all,
given the importance of package C-states.
> And maybe even more special casing is needed... but I doubt it.
I lost you here, can you elaborate please?
Thanks a lot,
Lorenzo
On 7/29/2013 7:14 AM, Lorenzo Pieralisi wrote:
>> [...]
>
> Yes, that's true and we have to look into this properly, but certainly
> a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs
> either, or am I missing something?
a CPU core isn't in a package C state, the system is.
(in a core C state the whole core is already powered down completely; a package C state
just also turns off the memory controller/etc)
package C states are global on x86 (not just per package); there's nothing one
can do there in terms of grouping/etc.
> Even if the wake-up interrupt just powers up one of the CPUs in a
> package and leaves the other(s) alone, all HW state shared by those CPUs
> (i.e. caches) must be turned on. What I am asking is: is this bundled
> next event a concept that should apply to x86 CPUs too, or is it
> entirely managed in FW/HW so that the kernel just should not care?
on Intel x86 cpus, there's not really a bundled concept; or rather, there is only 1 bundle
(which amounts to the same thing).
Yes in a multi-package setup there are some cache power effects... but there's
not a lot one can do there.
The other cores don't wake up, so they still make their own correct decisions.
> I still do not understand how this "bundled" next event is managed on
> x86 with the menu governor, or rather why it is not managed at all,
> given the importance of package C-states.
package C states on x86 are basically OS invisible. The OS manages core level C states,
the hardware manages the rest.
The bundle part hurts you on a "one wakes all" system,
not because of package level power effects, but because others wake up prematurely
(compared to what they expected), which causes them to think future wakeups will also
be earlier. All because they get the "what is the next known event" wrong,
and start correcting for known events instead of only for 'unpredictable' interrupts.
Things will go very wonky if you do that for sure.
(I've seen various simulation data on that, and the menu governor indeed acts quite poorly
for that)
>> And maybe even more special casing is needed... but I doubt it.
>
> I lost you here, can you elaborate please?
well.. just looking at the earliest timer might not be enough; that timer might be on a different
core that's still active, and may change after the current cpu has gone into an idle state.
Fun.
Coupled C states on this level are a PAIN in many ways, and tend to totally suck for power
due to this and the general "too much is active" reasons.
On Mon, Jul 29, 2013 at 03:29:20PM +0100, Arjan van de Ven wrote:
> On 7/29/2013 7:14 AM, Lorenzo Pieralisi wrote:
> > [...]
> > Yes, that's true and we have to look into this properly, but certainly
> > a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs
> > either, or am I missing something?
>
> a CPU core isn't in a package C state, the system is.
> (in a core C state the whole core is already powered down completely; a package C state
> just also turns off the memory controller/etc)
>
> package C states are global on x86 (not just per package); there's nothing one
> can do there in terms of grouping/etc.
So you are saying that package states are system states on x86, right?
Now things are a bit clearer. I was a bit baffled at first when you
mentioned that package C-states allow PM to turn off the DRAM
controller; the concept of a package C-state is a bit misleading and
does not much resemble what cluster states currently are on ARM. That's
why I asked in the first place, thank you.
> > Even if the wake-up interrupt just powers up one of the CPUs in a
> > package and leaves the other(s) alone, all HW state shared by those CPUs
> > (i.e. caches) must be turned on. What I am asking is: is this bundled
> > next event a concept that should apply to x86 CPUs too, or is it
> > entirely managed in FW/HW so that the kernel just should not care?
>
> on Intel x86 cpus, there's not really a bundled concept; or rather, there is only 1 bundle
> (which amounts to the same thing).
> Yes in a multi-package setup there are some cache power effects... but there's
> not a lot one can do there.
On ARM there is; some basic optimizations like avoiding cleaning the
caches if an IRQ is pending or a next event is due shortly, for
instance. The difference is that on ARM cache management is done in SW,
under kernel or FW control (which in a way is closer to what x86 does,
even though I think on x86 most power control is offloaded to HW).
Given what you said above, I understand that even on a multi-package
system, package C-states are global, not per-package. That's a pivotal
point.
> The other cores don't wake up, so they still make their own correct decisions.
>
> > I still do not understand how this "bundled" next event is managed on
> > x86 with the menu governor, or rather why it is not managed at all,
> > given the importance of package C-states.
>
> package C states on x86 are basically OS invisible. The OS manages core level C states,
> the hardware manages the rest.
> The bundle part hurts you on a "one wakes all" system,
> not because of package level power effects, but because others wake up prematurely
> (compared to what they expected), which causes them to think future wakeups will also
> be earlier. All because they get the "what is the next known event" wrong,
> and start correcting for known events instead of only for 'unpredictable' interrupts.
Well, reality is a bit more complex and probably less drastic: cores
that are woken up spuriously can be put back to sleep without going
through the governor again. But your point is taken.
> Things will go very wonky if you do that for sure.
> (I've seen various simulation data on that, and the menu governor indeed acts quite poorly
> for that)
That's something we have been benchmarking, yes.
> >> And maybe even more special casing is needed... but I doubt it.
> >
> > I lost you here, can you elaborate please?
>
> well.. just looking at the earliest timer might not be enough; that timer might be on a different
> core that's still active, and may change after the current cpu has gone into an idle state.
Yes, I have worked on this, and certainly next events must be tied to
the state a core is in; the bare next event does not help.
> Fun.
Big :D
> Coupled C states on this level are a PAIN in many ways, and tend to totally suck for power
> due to this and the general "too much is active" reasons.
I think the trend is moving towards core gating, which closely resembles
what the x86 world does today. Still, the interaction between the menu
governor and cluster states has to be characterized, and that's what we
are doing at the moment.
Thank you very much,
Lorenzo
On 7/29/2013 9:01 AM, Lorenzo Pieralisi wrote:
>
>> Coupled C states on this level are a PAIN in many ways, and tend to totally suck for power
>> due to this and the general "too much is active" reasons.
>
> I think the trend is moving towards core gating, which closely resembles
> what the x86 world does today. Still, the interaction between the menu
> governor and cluster states has to be characterized, and that's what we
> are doing at the moment.
I suspect that we (Linux) need a different governor than menu for the
coupled case. Fundamentally you're going to need different algorithms
that are much more closely tied to the behavior of the specific
hardware; the current menu governor uses an abstract and simple hardware
model (fully independent cpus). For specific hardware that deviates from
the model, something different is clearly needed.
It's an open question for me whether just the predictor is the part that
gets split out, or whether the whole governor needs to be split out.
(but frankly, if you take the predictor out of the menu governor, not a
whole lot is left)
Hi Jeremy,
I am trying to reproduce your result so that I can fix the issue, but I
have not been able to reproduce it yet.
I run netperf-2.6.0 with one machine as the server (netserver) and the
other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60. The target
machine is used as both the client and the server. I do not see the
performance drop. I also notice that the result is not stable; sometimes
it is high, sometimes it is low. In summary, it is hard to get a
definitive result.
Can you tell me how to reproduce the issue? How do you get the C0
data?
What's your kernel config? Do you enable CONFIG_NO_HZ_FULL=y or
only CONFIG_NO_HZ=y?
Thanks
-Youquan
On 130729 23:57:31, Youquan Song wrote:
> [...]
>
> Can you tell me how to reproduce the issue? How do you get the C0
> data?
>
> What's your kernel config? Do you enable CONFIG_NO_HZ_FULL=y or
> only CONFIG_NO_HZ=y?
Hi,
To answer both your and Daniel's question, those results used only
CONFIG_NO_HZ=y.
These network latency benchmarks are fickle creatures and need careful
tuning to become reproducible. Plus there are BIOS implications, and
tuning varies by vendor.
Anyway, it's probably not stable for you because in order to get any
sort of reproducibility between runs you need to do at least these
steps:
- ensure as little as possible is running in userspace
- determine the PCI affinity for the NIC
- on both machines, isolate the socket connected to the NIC from
  userspace tasks
- turn off irqbalance and bind all IRQs for that NIC to a single core on
  the same socket as the NIC
- run netperf with -TX,Y, where X,Y are the core numbers you wish
  netperf/netserver to run on, respectively
For example, if your NIC is attached to socket 0 and socket 0 cores are
enumerated 0-7, then (a scripted version of these two steps is sketched
below):
- set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
  NIC
- nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
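(Here's a rough sketch of those two steps as a script; IFACE, the core
numbers, and SERVER_IP are assumptions you'll need to adapt to your NIC
naming and socket topology:)

#!/usr/bin/env python
# Rough sketch of the pinning steps above (run as root, with irqbalance
# already stopped).
import subprocess

IFACE = "eth0"          # NIC name as it appears in /proc/interrupts
IRQ_CORE = "6"          # core that takes all of the NIC's interrupts
NETPERF_CORE = "4"      # core that runs netperf
SERVER_IP = "192.168.1.2"

# Bind every interrupt vector belonging to the NIC to one core.
for line in open("/proc/interrupts"):
    irq = line.split(":")[0].strip()
    if IFACE in line and irq.isdigit():
        with open("/proc/irq/%s/smp_affinity_list" % irq, "w") as f:
            f.write(IRQ_CORE)

# Kick off the pinned, niced netperf run (same flags as above).
subprocess.call(["nice", "-n", "-20", "netperf", "-t", "TCP_RR",
                 "-H", SERVER_IP, "-l", "60",
                 "-T", "%s,%s" % (NETPERF_CORE, NETPERF_CORE),
                 "-s", "2"])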
That should get you most of the way there. The -s 2 makes netperf
connect and then wait 2 seconds; I found this helps with the first few
seconds' worth of data. Alternatively, you could just toss the first 2
seconds' worth, as it seems to take that long to stabilize. What I mean
is, if you're not using the -D1,1 option to netperf, you might not have
noticed that netperf tests seem to take a few seconds to stabilize even
when properly tuned.
I got the C0 data by running turbostat in parallel with each benchmark
run, then grabbing the C-state data for the cores relevant to the test.
In my case that was cores 4 and 6: core 4 was where I put
netperf/netserver, and core 6 was where I put the NIC IRQs. Then I
parsed that output into a format that this script could interpret:
https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
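(For reference, a sketch of that parsing step; turbostat's column layout
differs between versions, so the "CPU"/"%c0" header names and the
c0-samples.py name here are assumptions to adjust for your setup:)

#!/usr/bin/env python
# Sketch: extract the %c0 column for one core from a saved turbostat
# log (e.g. "turbostat -i 1 > c0.log" run alongside netperf) and print
# one value per line, ready to pipe into histogram.py.
import sys

CORE = "4"                     # the core netperf was pinned to
cpu_col = c0_col = None

for line in open(sys.argv[1]):
    fields = line.split()
    if "CPU" in fields and "%c0" in fields:   # header repeats per interval
        cpu_col = fields.index("CPU")
        c0_col = fields.index("%c0")
    elif cpu_col is not None and len(fields) > max(cpu_col, c0_col) \
            and fields[cpu_col] == CORE:
        print(fields[c0_col])                 # one %c0 sample per interval

Something like: python c0-samples.py c0.log | histogram.py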
I'm building a kernel from Rafael's tree and will try to confirm what Len
already sent. Thanks everyone for looking into it.
On 130729 12:59:47, Jeremy Eder wrote:
> [...]
Hi, sorry for the delay. In addition to the results I initially posted,
the results below confirm my initial data, plus what Len sent:
3.11-rc2 w/reverts
TCP_RR trans/s 54454.13
3.11-rc2 w/reverts + c0 lock
TCP_RR trans/s 55088.11