Hello,
I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
idle state before stopping the tick") causes severe performance drop
for systems using pcc-cpufreq driver. Depending on the number of CPUs
the system might be almost unusable. The OS jitter for 4.17.y and
4.18.-rcx kernels is off the charts, you can even spot it with top
command (issued when the system is supposedly idle), e.g.
top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
When I apply below patch (trying to revert essential parts of commit
554c8aa8ecad) behaviour seems back to normal.
I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
to cpufreq drivers and you better not use it. But I wonder whether
commit 554c8aa8ecad ("sched: idle: Select idle state before stopping
the tick") introduced bad behaviour for other cases as well.
I'll send some performance results to illustrate the issue asap. I've
also tried to modify pcc-cpufreq to reduce the amount of frequency
changes triggered by this driver but this does not help for kernels
where commit 554c8aa8ecad is applied.
Andreas
---
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bddd17b..377a62ec475c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -185,18 +185,13 @@ static void cpuidle_idle_call(void)
} else {
bool stop_tick = true;
+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
next_state = cpuidle_select(drv, dev, &stop_tick);
- if (stop_tick)
- tick_nohz_idle_stop_tick();
- else
- tick_nohz_idle_retain_tick();
-
- rcu_idle_enter();
-
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Hi,
Thanks for your report!
On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
> Hello,
>
> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> idle state before stopping the tick") causes severe performance drop
> for systems using pcc-cpufreq driver. Depending on the number of CPUs
> the system might be almost unusable. The OS jitter for 4.17.y and
> 4.18.-rcx kernels is off the charts, you can even spot it with top
> command (issued when the system is supposedly idle), e.g.
>
> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
>
> When I apply below patch (trying to revert essential parts of commit
> 554c8aa8ecad) behaviour seems back to normal.
Well, that basically defeats the purpose of the change in commit
554c8aa8ecad, so it's not what I'd like to do to fix this problem.
Also it would be good to understand what actually happens.
> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
> to cpufreq drivers and you better not use it.
That's exactly right.
> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
> stopping the tick") introduced bad behaviour for other cases as well.
It has been tested quite extensively in that respect, although
admittedly not with the pcc-cpufreq driver.
Nothing bad related to it has been has been reported so far, FWIW.
> I'll send some performance results to illustrate the issue asap. I've
> also tried to modify pcc-cpufreq to reduce the amount of frequency
> changes triggered by this driver but this does not help for kernels
> where commit 554c8aa8ecad is applied.
Can you replace pcc-cpufreq with a different cpufreq driver on the
affected systems? If so, do performance numbers look bad after that
too?
Thanks,
Rafael
On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <[email protected]> wrote:
> Hi,
>
> Thanks for your report!
>
> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
>> Hello,
>>
>> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> idle state before stopping the tick") causes severe performance drop
>> for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> the system might be almost unusable. The OS jitter for 4.17.y and
>> 4.18.-rcx kernels is off the charts, you can even spot it with top
>> command (issued when the system is supposedly idle), e.g.
>>
>> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
>> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
>> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
>> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
>> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
>> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
>> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
>> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
>> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
>> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
>> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
>> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
>> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
>> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
>> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
>> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
>> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
>> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
>> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
>> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
>> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
>> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
>> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
>> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
>> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
>> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
>> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
>> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
>> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
>> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
>> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
>> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
>>
>> When I apply below patch (trying to revert essential parts of commit
>> 554c8aa8ecad) behaviour seems back to normal.
>
> Well, that basically defeats the purpose of the change in commit
> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>
> Also it would be good to understand what actually happens.
>
>> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> to cpufreq drivers and you better not use it.
>
> That's exactly right.
>
>> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> stopping the tick") introduced bad behaviour for other cases as well.
>
> It has been tested quite extensively in that respect, although
> admittedly not with the pcc-cpufreq driver.
>
> Nothing bad related to it has been has been reported so far, FWIW.
>
>> I'll send some performance results to illustrate the issue asap. I've
>> also tried to modify pcc-cpufreq to reduce the amount of frequency
>> changes triggered by this driver but this does not help for kernels
>> where commit 554c8aa8ecad is applied.
>
> Can you replace pcc-cpufreq with a different cpufreq driver on the
> affected systems? If so, do performance numbers look bad after that
> too?
Also, what cpufreq governor do you use with pcc-cpufreq? Does
changing it to something like "performance" improve things?
On 17/07/2018 09:33, Rafael J. Wysocki wrote:
> Hi,
>
> Thanks for your report!
>
> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
>> Hello,
>>
>> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> idle state before stopping the tick") causes severe performance drop
>> for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> the system might be almost unusable. The OS jitter for 4.17.y and
>> 4.18.-rcx kernels is off the charts, you can even spot it with top
>> command (issued when the system is supposedly idle), e.g.
>>
>> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
>> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
>> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
>> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
>> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
>> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
>> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
>> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
>> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
>> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
>> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
>> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
>> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
>> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
>> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
>> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
>> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
>> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
>> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
>> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
>> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
>> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
>> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
>> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
>> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
>> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
>> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
>> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
>> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
>> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
>> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
>> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
>>
>> When I apply below patch (trying to revert essential parts of commit
>> 554c8aa8ecad) behaviour seems back to normal.
>
> Well, that basically defeats the purpose of the change in commit
> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>
> Also it would be good to understand what actually happens.
>
>> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> to cpufreq drivers and you better not use it.
>
> That's exactly right.
>
>> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> stopping the tick") introduced bad behaviour for other cases as well.
>
> It has been tested quite extensively in that respect, although
> admittedly not with the pcc-cpufreq driver.
>
> Nothing bad related to it has been has been reported so far, FWIW.
I confirm we benchmarked on ARM64 and we did not notice a big
performance drop expect a few percent due to more cluster idle states
which is logical and compensated by a big energy improvement.
>> I'll send some performance results to illustrate the issue asap. I've
>> also tried to modify pcc-cpufreq to reduce the amount of frequency
>> changes triggered by this driver but this does not help for kernels
>> where commit 554c8aa8ecad is applied.
>
> Can you replace pcc-cpufreq with a different cpufreq driver on the
> affected systems? If so, do performance numbers look bad after that
> too?
>
> Thanks,
> Rafael
>
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
On Tue, Jul 17, 2018 at 08:50:48AM +0200, Andreas Herrmann wrote:
> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> idle state before stopping the tick") causes severe performance drop
> for systems using pcc-cpufreq driver. Depending on the number of CPUs
> the system might be almost unusable.
Have you looked at that driver? It features a global lock which is taken
for all cpufreq operations. Once you get past a handful of cpus that
will constrict your system exactly as you say.
What kind of machine are you running this on, and can you configure the
thing to use anything else? If so, do so and never look back.
On Tue, Jul 17, 2018 at 09:33:53AM +0200, Rafael J. Wysocki wrote:
> Hi,
>
> Thanks for your report!
>
> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
> > Hello,
> >
> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> > idle state before stopping the tick") causes severe performance drop
> > for systems using pcc-cpufreq driver. Depending on the number of CPUs
> > the system might be almost unusable. The OS jitter for 4.17.y and
> > 4.18.-rcx kernels is off the charts, you can even spot it with top
> > command (issued when the system is supposedly idle), e.g.
> >
> > top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
> > Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
> > KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
> > KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
> > 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
> > 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
> > 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
> > 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
> > 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
> > 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
> > 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
> > 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
> > 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
> > 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
> > 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
> > 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
> > 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
> > 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
> > 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
> > 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
> > 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
> > 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
> > 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
> > 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
> > 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
> > 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
> > 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
> > 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
> > 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
> > 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
> > 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
> >
> > When I apply below patch (trying to revert essential parts of commit
> > 554c8aa8ecad) behaviour seems back to normal.
>
> Well, that basically defeats the purpose of the change in commit
> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>
> Also it would be good to understand what actually happens.
>
> > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
> > to cpufreq drivers and you better not use it.
>
> That's exactly right.
>
> > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
> > stopping the tick") introduced bad behaviour for other cases as well.
>
> It has been tested quite extensively in that respect, although
> admittedly not with the pcc-cpufreq driver.
>
> Nothing bad related to it has been has been reported so far, FWIW.
>
> > I'll send some performance results to illustrate the issue asap. I've
> > also tried to modify pcc-cpufreq to reduce the amount of frequency
> > changes triggered by this driver but this does not help for kernels
> > where commit 554c8aa8ecad is applied.
>
> Can you replace pcc-cpufreq with a different cpufreq driver on the
> affected systems? If so, do performance numbers look bad after that
> too?
I have no performance numbers yet for other cpufreq drivers on this
system (checking this commit). But I'll look it at next.
Andreas
On Tue, Jul 17, 2018 at 10:03:41AM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <[email protected]> wrote:
> > Hi,
> >
> > Thanks for your report!
> >
> > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
> >> Hello,
> >>
> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> >> idle state before stopping the tick") causes severe performance drop
> >> for systems using pcc-cpufreq driver. Depending on the number of CPUs
> >> the system might be almost unusable. The OS jitter for 4.17.y and
> >> 4.18.-rcx kernels is off the charts, you can even spot it with top
> >> command (issued when the system is supposedly idle), e.g.
> >>
> >> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
> >> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
> >> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
> >> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
> >> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
> >> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
> >> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
> >> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
> >> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
> >> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
> >> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
> >> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
> >> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
> >> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
> >> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
> >> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
> >> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
> >> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
> >> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
> >> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
> >> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
> >> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
> >> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
> >> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
> >> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
> >> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
> >> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
> >> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
> >> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
> >> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
> >> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
> >> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
> >>
> >> When I apply below patch (trying to revert essential parts of commit
> >> 554c8aa8ecad) behaviour seems back to normal.
> >
> > Well, that basically defeats the purpose of the change in commit
> > 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
> >
> > Also it would be good to understand what actually happens.
> >
> >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
> >> to cpufreq drivers and you better not use it.
> >
> > That's exactly right.
> >
> >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
> >> stopping the tick") introduced bad behaviour for other cases as well.
> >
> > It has been tested quite extensively in that respect, although
> > admittedly not with the pcc-cpufreq driver.
> >
> > Nothing bad related to it has been has been reported so far, FWIW.
> >
> >> I'll send some performance results to illustrate the issue asap. I've
> >> also tried to modify pcc-cpufreq to reduce the amount of frequency
> >> changes triggered by this driver but this does not help for kernels
> >> where commit 554c8aa8ecad is applied.
> >
> > Can you replace pcc-cpufreq with a different cpufreq driver on the
> > affected systems? If so, do performance numbers look bad after that
> > too?
>
> Also, what cpufreq governor do you use with pcc-cpufreq?
Ondemand governor. Which triggers a lot of PCC related platform calls.
And as Peter noticed already the driver has a severe bottleneck (lock
protecting shared memory used for all CPUs to pass data to/from
platform for PCC calls).
> Does changing it to something like "performance" improve things?
With performance governor above mentioned bottleneck is no issue.
On balance before this commit users could use pcc-cpufreq but had
already suboptimal performance (compared to say intel_pstate driver
which can be used changing BIOS options). Starting with this commit
systems using pcc-cpufreq are unusable with high number of CPUs (top
output above is for system with 120 CPUs).
So should the driver be removed (sooner or later), or this behaviour
be documented somewhere, or just leave it as is.
Andreas
On Tue, Jul 17, 2018 at 10:36 AM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 09:33:53AM +0200, Rafael J. Wysocki wrote:
>> Hi,
>>
>> Thanks for your report!
>>
>> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
>> > Hello,
>> >
>> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> > idle state before stopping the tick") causes severe performance drop
>> > for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> > the system might be almost unusable. The OS jitter for 4.17.y and
>> > 4.18.-rcx kernels is off the charts, you can even spot it with top
>> > command (issued when the system is supposedly idle), e.g.
>> >
>> > top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
>> > Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
>> > %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
>> > KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
>> > KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
>> >
>> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> > 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
>> > 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
>> > 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
>> > 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
>> > 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
>> > 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
>> > 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
>> > 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
>> > 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
>> > 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
>> > 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
>> > 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
>> > 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
>> > 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
>> > 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
>> > 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
>> > 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
>> > 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
>> > 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
>> > 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
>> > 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
>> > 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
>> > 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
>> > 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
>> > 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
>> > 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
>> > 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
>> > 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
>> >
>> > When I apply below patch (trying to revert essential parts of commit
>> > 554c8aa8ecad) behaviour seems back to normal.
>>
>> Well, that basically defeats the purpose of the change in commit
>> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>>
>> Also it would be good to understand what actually happens.
>>
>> > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> > to cpufreq drivers and you better not use it.
>>
>> That's exactly right.
>>
>> > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> > stopping the tick") introduced bad behaviour for other cases as well.
>>
>> It has been tested quite extensively in that respect, although
>> admittedly not with the pcc-cpufreq driver.
>>
>> Nothing bad related to it has been has been reported so far, FWIW.
>>
>> > I'll send some performance results to illustrate the issue asap. I've
>> > also tried to modify pcc-cpufreq to reduce the amount of frequency
>> > changes triggered by this driver but this does not help for kernels
>> > where commit 554c8aa8ecad is applied.
>>
>> Can you replace pcc-cpufreq with a different cpufreq driver on the
>> affected systems? If so, do performance numbers look bad after that
>> too?
>
> I have no performance numbers yet for other cpufreq drivers on this
> system (checking this commit). But I'll look it at next.
Thanks!
Generally speaking, pcc-cpufreq is fundamentally not scalable, so the
additional concurrency brought in by the commit in question may have
exposed that weakness if that driver is run on a system with multiple
logical CPUs.
On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 10:03:41AM +0200, Rafael J. Wysocki wrote:
>> On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <[email protected]> wrote:
>> > Hi,
>> >
>> > Thanks for your report!
>> >
>> > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <[email protected]> wrote:
>> >> Hello,
>> >>
>> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> >> idle state before stopping the tick") causes severe performance drop
>> >> for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> >> the system might be almost unusable. The OS jitter for 4.17.y and
>> >> 4.18.-rcx kernels is off the charts, you can even spot it with top
>> >> command (issued when the system is supposedly idle), e.g.
>> >>
>> >> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38
>> >> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie
>> >> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st
>> >> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers
>> >> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem
>> >>
>> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> >> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet
>> >> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0
>> >> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0
>> >> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0
>> >> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0
>> >> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0
>> >> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0
>> >> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1
>> >> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0
>> >> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1
>> >> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1
>> >> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1
>> >> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1
>> >> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1
>> >> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1
>> >> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1
>> >> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1
>> >> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1
>> >> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1
>> >> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1
>> >> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18
>> >> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2
>> >> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0
>> >> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top
>> >> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1
>> >> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1
>> >> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1
>> >> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1
>> >>
>> >> When I apply below patch (trying to revert essential parts of commit
>> >> 554c8aa8ecad) behaviour seems back to normal.
>> >
>> > Well, that basically defeats the purpose of the change in commit
>> > 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>> >
>> > Also it would be good to understand what actually happens.
>> >
>> >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> >> to cpufreq drivers and you better not use it.
>> >
>> > That's exactly right.
>> >
>> >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> >> stopping the tick") introduced bad behaviour for other cases as well.
>> >
>> > It has been tested quite extensively in that respect, although
>> > admittedly not with the pcc-cpufreq driver.
>> >
>> > Nothing bad related to it has been has been reported so far, FWIW.
>> >
>> >> I'll send some performance results to illustrate the issue asap. I've
>> >> also tried to modify pcc-cpufreq to reduce the amount of frequency
>> >> changes triggered by this driver but this does not help for kernels
>> >> where commit 554c8aa8ecad is applied.
>> >
>> > Can you replace pcc-cpufreq with a different cpufreq driver on the
>> > affected systems? If so, do performance numbers look bad after that
>> > too?
>>
>> Also, what cpufreq governor do you use with pcc-cpufreq?
>
> Ondemand governor. Which triggers a lot of PCC related platform calls.
> And as Peter noticed already the driver has a severe bottleneck (lock
> protecting shared memory used for all CPUs to pass data to/from
> platform for PCC calls).
>
>> Does changing it to something like "performance" improve things?
>
> With performance governor above mentioned bottleneck is no issue.
OK
> On balance before this commit users could use pcc-cpufreq but had
> already suboptimal performance (compared to say intel_pstate driver
> which can be used changing BIOS options). Starting with this commit
> systems using pcc-cpufreq are unusable with high number of CPUs (top
> output above is for system with 120 CPUs).
I see. :-)
> So should the driver be removed (sooner or later), or this behaviour
> be documented somewhere, or just leave it as is.
At least it should be documented in the driver that it is not scalable
and not for use on many-CPU systems (with "many" meaning anything
greater than 4 probably).
Or we could just make the driver not load if the number of CPUs in the
system is greater than 4 or similar.
On Tue, Jul 17, 2018 at 10:15:07AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 08:50:48AM +0200, Andreas Herrmann wrote:
> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> > idle state before stopping the tick") causes severe performance drop
> > for systems using pcc-cpufreq driver. Depending on the number of CPUs
> > the system might be almost unusable.
>
> Have you looked at that driver? It features a global lock which is taken
> for all cpufreq operations. Once you get past a handful of cpus that
> will constrict your system exactly as you say.
Yes, I figured that already and you better use another cpufreq driver
on those systems. (But that such systems with this driver are almost
unusable is a new feature ;-)
> What kind of machine are you running this on, and can you configure the
> thing to use anything else?
Its an older system.
HP ProLiant DL580 Gen8, CPUs are Intel(R) Xeon(R) CPU E7-4890
4 nodes, With HT enabled 120 logical CPUs.
Yes you can configure it using something else.
For the record. On this and similar systems pcc-cpufreq driver is used
when BIOS setting for power management option "Power Regulator" is set
to
"Dynamic Power Savings Mode"
AFAIK this is the default setting for this option. When changing this
setting to "OS Control Mode" intel_pstate driver should be loaded.
> If so, do so and never look back.
Ok, I am fine with this.
Thanks,
Andreas
On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
[cut]
>
> On balance before this commit users could use pcc-cpufreq but had
> already suboptimal performance (compared to say intel_pstate driver
> which can be used changing BIOS options).
BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
It should be initialized before pcc-cpufreq (according to their
respective initcall levels), so in theory intel_pstate should be used
by default on the affected systems anyway.
What BIOS settings need to be changed for that?
On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
>
> [cut]
>
> >
> > On balance before this commit users could use pcc-cpufreq but had
> > already suboptimal performance (compared to say intel_pstate driver
> > which can be used changing BIOS options).
>
> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
I think this is because of (in intel_pstate_init()):
/*
* The Intel pstate driver will be ignored if the platform
* firmware has its own power management modes.
*/
if (intel_pstate_platform_pwr_mgmt_exists())
return -ENODEV;
> It should be initialized before pcc-cpufreq (according to their
> respective initcall levels), so in theory intel_pstate should be used
> by default on the affected systems anyway.
> What BIOS settings need to be changed for that?
Already answered in other mail.
Andreas
On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
>> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
>>
>> [cut]
>>
>> >
>> > On balance before this commit users could use pcc-cpufreq but had
>> > already suboptimal performance (compared to say intel_pstate driver
>> > which can be used changing BIOS options).
>>
>> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
>
> I think this is because of (in intel_pstate_init()):
>
> /*
> * The Intel pstate driver will be ignored if the platform
> * firmware has its own power management modes.
> */
> if (intel_pstate_platform_pwr_mgmt_exists())
> return -ENODEV;
>
OK, because of the "Proliant" entry, right?
So it looks like we have an issue there. We find the entry and we
look for _PSS. It is not there, so we assume that the firmware is
expected to control performance, which is not the case. It looks like
we also should look for the presence of the PCC interface in there.
I can provide a patch for that, will you be able to test it?
>> It should be initialized before pcc-cpufreq (according to their
>> respective initcall levels), so in theory intel_pstate should be used
>> by default on the affected systems anyway.
>
>> What BIOS settings need to be changed for that?
>
> Already answered in other mail.
Indeed.
On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
> > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
> >>
> >> [cut]
> >>
> >> >
> >> > On balance before this commit users could use pcc-cpufreq but had
> >> > already suboptimal performance (compared to say intel_pstate driver
> >> > which can be used changing BIOS options).
> >>
> >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> >
> > I think this is because of (in intel_pstate_init()):
> >
> > /*
> > * The Intel pstate driver will be ignored if the platform
> > * firmware has its own power management modes.
> > */
> > if (intel_pstate_platform_pwr_mgmt_exists())
> > return -ENODEV;
> >
>
> OK, because of the "Proliant" entry, right?
>
> So it looks like we have an issue there. We find the entry and we
> look for _PSS. It is not there, so we assume that the firmware is
> expected to control performance, which is not the case. It looks like
> we also should look for the presence of the PCC interface in there.
>
> I can provide a patch for that, will you be able to test it?
Yes, I can test it.
> >> It should be initialized before pcc-cpufreq (according to their
> >> respective initcall levels), so in theory intel_pstate should be used
> >> by default on the affected systems anyway.
> >
> >> What BIOS settings need to be changed for that?
> >
> > Already answered in other mail.
>
> Indeed.
Andreas
On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
> > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
> > >>
> > >> [cut]
> > >>
> > >> >
> > >> > On balance before this commit users could use pcc-cpufreq but had
> > >> > already suboptimal performance (compared to say intel_pstate driver
> > >> > which can be used changing BIOS options).
> > >>
> > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> > >
> > > I think this is because of (in intel_pstate_init()):
> > >
> > > /*
> > > * The Intel pstate driver will be ignored if the platform
> > > * firmware has its own power management modes.
> > > */
> > > if (intel_pstate_platform_pwr_mgmt_exists())
> > > return -ENODEV;
> > >
> >
> > OK, because of the "Proliant" entry, right?
> >
> > So it looks like we have an issue there. We find the entry and we
> > look for _PSS. It is not there, so we assume that the firmware is
> > expected to control performance, which is not the case.
FYI, there is another BIOS setting on those systems. It's called
"Collaborative Power Control" (AFAIK enabled by default).
Only if this is disabled, firmware is (alone) in control of
performance. (And of course in this case neither pcc-cpufreq nor
intel_pstate will be loaded).
> > It looks like we also should look for the presence of the PCC
> > interface in there.
> >
> > I can provide a patch for that, will you be able to test it?
>
> Yes, I can test it.
>
> > >> It should be initialized before pcc-cpufreq (according to their
> > >> respective initcall levels), so in theory intel_pstate should be used
> > >> by default on the affected systems anyway.
> > >
> > >> What BIOS settings need to be changed for that?
> > >
> > > Already answered in other mail.
> >
> > Indeed.
>
>
> Andreas
>
On Tuesday, July 17, 2018 11:36:20 AM CEST Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
> > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
> > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
> > > >>
> > > >> [cut]
> > > >>
> > > >> >
> > > >> > On balance before this commit users could use pcc-cpufreq but had
> > > >> > already suboptimal performance (compared to say intel_pstate driver
> > > >> > which can be used changing BIOS options).
> > > >>
> > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> > > >
> > > > I think this is because of (in intel_pstate_init()):
> > > >
> > > > /*
> > > > * The Intel pstate driver will be ignored if the platform
> > > > * firmware has its own power management modes.
> > > > */
> > > > if (intel_pstate_platform_pwr_mgmt_exists())
> > > > return -ENODEV;
> > > >
> > >
> > > OK, because of the "Proliant" entry, right?
> > >
> > > So it looks like we have an issue there. We find the entry and we
> > > look for _PSS. It is not there, so we assume that the firmware is
> > > expected to control performance, which is not the case.
>
> FYI, there is another BIOS setting on those systems. It's called
> "Collaborative Power Control" (AFAIK enabled by default).
>
> Only if this is disabled, firmware is (alone) in control of
> performance. (And of course in this case neither pcc-cpufreq nor
> intel_pstate will be loaded).
OK, the patch is below.
First, I hope that if "Collaborative Power Control" is disabled, it will
simply hide the PCCH object and so intel_pstate will still not load then.
The main question basically is what the OS is expected to do if "Dynamic
Power Savings Mode" is set. If we are *expected* to use the PCC interface
then, intel_pstate may not work in that case, but I suspect that the PCC
interface allows extra energy to be saved over what is possible without it.
---
drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
return true;
}
+static bool __init intel_pstate_no_acpi_pcch(void)
+{
+ acpi_status status;
+ acpi_handle handle;
+
+ status = acpi_get_handle(NULL, "\\_SB", &handle);
+ if (ACPI_FAILURE(status))
+ return true;
+
+ return !acpi_has_method(handle, "PCCH");
+}
+
static bool __init intel_pstate_has_acpi_ppc(void)
{
int i;
@@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
switch (plat_info[idx].data) {
case PSS:
- return intel_pstate_no_acpi_pss();
+ if (!intel_pstate_no_acpi_pss())
+ return false;
+
+ return intel_pstate_no_acpi_pcch();
case PPC:
return intel_pstate_has_acpi_ppc() && !force_load;
}
On Tue, Jul 17, 2018 at 11:36:20AM +0200, Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
> > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
> > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
> > > >>
> > > >> [cut]
> > > >>
> > > >> >
> > > >> > On balance before this commit users could use pcc-cpufreq but had
> > > >> > already suboptimal performance (compared to say intel_pstate driver
> > > >> > which can be used changing BIOS options).
> > > >>
> > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> > > >
> > > > I think this is because of (in intel_pstate_init()):
> > > >
> > > > /*
> > > > * The Intel pstate driver will be ignored if the platform
> > > > * firmware has its own power management modes.
> > > > */
> > > > if (intel_pstate_platform_pwr_mgmt_exists())
> > > > return -ENODEV;
> > > >
> > >
> > > OK, because of the "Proliant" entry, right?
> > >
> > > So it looks like we have an issue there. We find the entry and we
> > > look for _PSS. It is not there, so we assume that the firmware is
> > > expected to control performance, which is not the case.
> FYI, there is another BIOS setting on those systems. It's called
> "Collaborative Power Control" (AFAIK enabled by default).
>
> Only if this is disabled, firmware is (alone) in control of
> performance. (And of course in this case neither pcc-cpufreq nor
> intel_pstate will be loaded).
To clarify:
(i) When "Dynamic Power Savings Mode" is enabled _PSS objects are
missing (in fact they seem to be renamed to "XPSS").
pcc-cpufreq driver evaluates PCCH header and loads.
Same when "Collaborative Power Control" is disabled. No _PSS but
XPSS objects. pcc-cpufreq driver fails to evaluate PCCH header,
no cpufreq driver is loaded.
(ii) There are _PSS objects when "OS Control Mode" is selected.
(But I think when "Collaborative Power Control" is disabled there
are no _PSS objects either and the "OS Control Mode" selection
does not matter.)
> > > It looks like we also should look for the presence of the PCC
> > > interface in there.
> > >
> > > I can provide a patch for that, will you be able to test it?
> >
> > Yes, I can test it.
> >
> > > >> It should be initialized before pcc-cpufreq (according to their
> > > >> respective initcall levels), so in theory intel_pstate should be used
> > > >> by default on the affected systems anyway.
> > > >
> > > >> What BIOS settings need to be changed for that?
> > > >
> > > > Already answered in other mail.
> > >
> > > Indeed.
> >
> >
> > Andreas
> >
On Tue, Jul 17, 2018 at 12:09:21PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, July 17, 2018 11:36:20 AM CEST Andreas Herrmann wrote:
> > On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
> > > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> > > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
> > > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> > > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
> > > > >>
> > > > >> [cut]
> > > > >>
> > > > >> >
> > > > >> > On balance before this commit users could use pcc-cpufreq but had
> > > > >> > already suboptimal performance (compared to say intel_pstate driver
> > > > >> > which can be used changing BIOS options).
> > > > >>
> > > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> > > > >
> > > > > I think this is because of (in intel_pstate_init()):
> > > > >
> > > > > /*
> > > > > * The Intel pstate driver will be ignored if the platform
> > > > > * firmware has its own power management modes.
> > > > > */
> > > > > if (intel_pstate_platform_pwr_mgmt_exists())
> > > > > return -ENODEV;
> > > > >
> > > >
> > > > OK, because of the "Proliant" entry, right?
> > > >
> > > > So it looks like we have an issue there. We find the entry and we
> > > > look for _PSS. It is not there, so we assume that the firmware is
> > > > expected to control performance, which is not the case.
> >
> > FYI, there is another BIOS setting on those systems. It's called
> > "Collaborative Power Control" (AFAIK enabled by default).
> >
> > Only if this is disabled, firmware is (alone) in control of
> > performance. (And of course in this case neither pcc-cpufreq nor
> > intel_pstate will be loaded).
>
> OK, the patch is below.
>
> First, I hope that if "Collaborative Power Control" is disabled, it will
> simply hide the PCCH object and so intel_pstate will still not load then.
PCCH is hidden in that case.
> The main question basically is what the OS is expected to do if
> "Dynamic Power Savings Mode" is set. If we are *expected* to use
> the PCC interface then, intel_pstate may not work in that case, but
> I suspect that the PCC interface allows extra energy to be saved
> over what is possible without it.
I'll test it and see what happens.
Andreas
> ---
> drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
> return true;
> }
>
> +static bool __init intel_pstate_no_acpi_pcch(void)
> +{
> + acpi_status status;
> + acpi_handle handle;
> +
> + status = acpi_get_handle(NULL, "\\_SB", &handle);
> + if (ACPI_FAILURE(status))
> + return true;
> +
> + return !acpi_has_method(handle, "PCCH");
> +}
> +
> static bool __init intel_pstate_has_acpi_ppc(void)
> {
> int i;
> @@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
>
> switch (plat_info[idx].data) {
> case PSS:
> - return intel_pstate_no_acpi_pss();
> + if (!intel_pstate_no_acpi_pss())
> + return false;
> +
> + return intel_pstate_no_acpi_pcch();
> case PPC:
> return intel_pstate_has_acpi_ppc() && !force_load;
> }
>
>
On Tue, Jul 17, 2018 at 12:21 PM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 12:09:21PM +0200, Rafael J. Wysocki wrote:
>> On Tuesday, July 17, 2018 11:36:20 AM CEST Andreas Herrmann wrote:
>> > On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
>> > > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
>> > > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <[email protected]> wrote:
>> > > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
>> > > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <[email protected]> wrote:
>> > > > >>
>> > > > >> [cut]
>> > > > >>
>> > > > >> >
>> > > > >> > On balance before this commit users could use pcc-cpufreq but had
>> > > > >> > already suboptimal performance (compared to say intel_pstate driver
>> > > > >> > which can be used changing BIOS options).
>> > > > >>
>> > > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
>> > > > >
>> > > > > I think this is because of (in intel_pstate_init()):
>> > > > >
>> > > > > /*
>> > > > > * The Intel pstate driver will be ignored if the platform
>> > > > > * firmware has its own power management modes.
>> > > > > */
>> > > > > if (intel_pstate_platform_pwr_mgmt_exists())
>> > > > > return -ENODEV;
>> > > > >
>> > > >
>> > > > OK, because of the "Proliant" entry, right?
>> > > >
>> > > > So it looks like we have an issue there. We find the entry and we
>> > > > look for _PSS. It is not there, so we assume that the firmware is
>> > > > expected to control performance, which is not the case.
>> >
>> > FYI, there is another BIOS setting on those systems. It's called
>> > "Collaborative Power Control" (AFAIK enabled by default).
>> >
>> > Only if this is disabled, firmware is (alone) in control of
>> > performance. (And of course in this case neither pcc-cpufreq nor
>> > intel_pstate will be loaded).
>>
>> OK, the patch is below.
>>
>> First, I hope that if "Collaborative Power Control" is disabled, it will
>> simply hide the PCCH object and so intel_pstate will still not load then.
>
> PCCH is hidden in that case.
OK
>> The main question basically is what the OS is expected to do if
>> "Dynamic Power Savings Mode" is set. If we are *expected* to use
>> the PCC interface then, intel_pstate may not work in that case, but
>> I suspect that the PCC interface allows extra energy to be saved
>> over what is possible without it.
>
> I'll test it and see what happens.
Thanks!
From: Rafael J. Wysocki <[email protected]>
The firmware interface used by the pcc-cpufreq driver is
fundamentally not scalable and using it for dynamic CPU performance
scaling on systems with many CPUs leads to degraded performance.
For this reason, disable dynamic CPU performance scaling on systems
with pcc-cpufreq where the number of CPUs present at the driver init
time is greater than 4. Also make the driver print corresponding
complaints to the kernel log.
Reported-by: Andreas Herrmann <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/pcc-cpufreq.c | 8 ++++++++
1 file changed, 8 insertions(+)
Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
+++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
@@ -589,6 +589,14 @@ static int __init pcc_cpufreq_init(void)
return ret;
}
+ if (num_present_cpus() > 4) {
+ pcc_cpufreq_driver.flags |= CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING;
+ pr_err("%s: Unsuitable system, dynamic performance scaling disabled\n",
+ __func__);
+ pr_err("%s: Change BIOS settings and complain to the hardware vendor\n",
+ __func__);
+ }
+
ret = cpufreq_register_driver(&pcc_cpufreq_driver);
return ret;
On Tue, Jul 17, 2018 at 12:21:36PM +0200, Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 12:09:21PM +0200, Rafael J. Wysocki wrote:
---8<---
> > OK, the patch is below.
> >
> > First, I hope that if "Collaborative Power Control" is disabled, it will
> > simply hide the PCCH object and so intel_pstate will still not load then.
>
> PCCH is hidden in that case.
>
> > The main question basically is what the OS is expected to do if
> > "Dynamic Power Savings Mode" is set. If we are *expected* to use
> > the PCC interface then, intel_pstate may not work in that case, but
> > I suspect that the PCC interface allows extra energy to be saved
> > over what is possible without it.
>
> I'll test it and see what happens.
I've tested it on top of v4.18-rc5-36-g30b06abfb92b. intel_pstate now
loads instead of pcc-cpufreq and system looks stable.
When disabling "Collaborative Power Control" no cpufreq driver is loaded
(as expected).
Performance (with kernbench) is as expected (always better than any
brew of pcc-cpufreq + misc modifications to this driver + partial
rollback of commit 554c8aa8ecad).
If you like you can add either Tested-by or
Reviewed-by: Andreas Herrmann <[email protected]>
I think this patch should be tagged for 4.17-stable.
Thanks,
Andreas
> > ---
> > drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> > @@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
> > return true;
> > }
> >
> > +static bool __init intel_pstate_no_acpi_pcch(void)
> > +{
> > + acpi_status status;
> > + acpi_handle handle;
> > +
> > + status = acpi_get_handle(NULL, "\\_SB", &handle);
> > + if (ACPI_FAILURE(status))
> > + return true;
> > +
> > + return !acpi_has_method(handle, "PCCH");
> > +}
> > +
> > static bool __init intel_pstate_has_acpi_ppc(void)
> > {
> > int i;
> > @@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
> >
> > switch (plat_info[idx].data) {
> > case PSS:
> > - return intel_pstate_no_acpi_pss();
> > + if (!intel_pstate_no_acpi_pss())
> > + return false;
> > +
> > + return intel_pstate_no_acpi_pcch();
> > case PPC:
> > return intel_pstate_has_acpi_ppc() && !force_load;
> > }
> >
> >
On Tue, Jul 17, 2018 at 4:03 PM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 12:21:36PM +0200, Andreas Herrmann wrote:
>> On Tue, Jul 17, 2018 at 12:09:21PM +0200, Rafael J. Wysocki wrote:
>
> ---8<---
>
>> > OK, the patch is below.
>> >
>> > First, I hope that if "Collaborative Power Control" is disabled, it will
>> > simply hide the PCCH object and so intel_pstate will still not load then.
>>
>> PCCH is hidden in that case.
>>
>> > The main question basically is what the OS is expected to do if
>> > "Dynamic Power Savings Mode" is set. If we are *expected* to use
>> > the PCC interface then, intel_pstate may not work in that case, but
>> > I suspect that the PCC interface allows extra energy to be saved
>> > over what is possible without it.
>>
>> I'll test it and see what happens.
>
> I've tested it on top of v4.18-rc5-36-g30b06abfb92b. intel_pstate now
> loads instead of pcc-cpufreq and system looks stable.
>
> When disabling "Collaborative Power Control" no cpufreq driver is loaded
> (as expected).
>
> Performance (with kernbench) is as expected (always better than any
> brew of pcc-cpufreq + misc modifications to this driver + partial
> rollback of commit 554c8aa8ecad).
>
> If you like you can add either Tested-by or
> Reviewed-by: Andreas Herrmann <[email protected]>
>
> I think this patch should be tagged for 4.17-stable.
OK, thank you for testing!
From: Rafael J. Wysocki <[email protected]>
Currently, intel_pstate doesn't load if _PSS is not present on
HP Proliant systems, because it expects the firmware to take over
CPU performance scaling in that case. However, if ACPI PCCH is
present, the firmware expects the kernel to use it for CPU
performance scaling and the pcc-cpufreq driver is loaded for that.
Unfortunately, the firmware interface used by that driver is not
scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
on systems with more than just a few CPUs. In fact, it is better to
avoid using it at all.
For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
is not present and load if it is there.
Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power mgmt option)
Reported-by: Andreas Herrmann <[email protected]>
Tested-by: Andreas Herrmann <[email protected]>
Reviewed-by: Andreas Herrmann <[email protected]>
Cc: 4.17+ <[email protected]> # 4.17+
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
return true;
}
+static bool __init intel_pstate_no_acpi_pcch(void)
+{
+ acpi_status status;
+ acpi_handle handle;
+
+ status = acpi_get_handle(NULL, "\\_SB", &handle);
+ if (ACPI_FAILURE(status))
+ return true;
+
+ return !acpi_has_method(handle, "PCCH");
+}
+
static bool __init intel_pstate_has_acpi_ppc(void)
{
int i;
@@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
switch (plat_info[idx].data) {
case PSS:
- return intel_pstate_no_acpi_pss();
+ if (!intel_pstate_no_acpi_pss())
+ return false;
+
+ return intel_pstate_no_acpi_pcch();
case PPC:
return intel_pstate_has_acpi_ppc() && !force_load;
}
From: Rafael J. Wysocki <[email protected]>
The firmware interface used by the pcc-cpufreq driver is
fundamentally not scalable and using it for dynamic CPU performance
scaling on systems with many CPUs leads to degraded performance.
For this reason, disable dynamic CPU performance scaling on systems
with pcc-cpufreq where the number of CPUs present at the driver init
time is greater than 4. Also make the driver print corresponding
complaints to the kernel log.
Reported-by: Andreas Herrmann <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
-> v2: Rework the messages printed in the problematic case.
---
drivers/cpufreq/pcc-cpufreq.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
+++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
@@ -589,6 +589,15 @@ static int __init pcc_cpufreq_init(void)
return ret;
}
+ if (num_present_cpus() > 4) {
+ pcc_cpufreq_driver.flags |= CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING;
+ pr_err("%s: Too many CPUs, dynamic performance scaling disabled\n",
+ __func__);
+ pr_err("%s: Try to enable a different scaling driver through BIOS settings\n",
+ __func__);
+ pr_err("%s: and complain to the system vendor\n", __func__);
+ }
+
ret = cpufreq_register_driver(&pcc_cpufreq_driver);
return ret;
On Tue, 2018-07-17 at 18:13 +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Currently, intel_pstate doesn't load if _PSS is not present on
> HP Proliant systems, because it expects the firmware to take over
> CPU performance scaling in that case. However, if ACPI PCCH is
> present, the firmware expects the kernel to use it for CPU
> performance scaling and the pcc-cpufreq driver is loaded for that.
>
> Unfortunately, the firmware interface used by that driver is not
> scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
> on systems with more than just a few CPUs. In fact, it is better to
> avoid using it at all.
>
> For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
> is not present and load if it is there.
>
> Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power
> mgmt option)
> Reported-by: Andreas Herrmann <[email protected]>
> Tested-by: Andreas Herrmann <[email protected]>
> Reviewed-by: Andreas Herrmann <[email protected]>
> Cc: 4.17+ <[email protected]> # 4.17+
> Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Srinivas Pandruvada <[email protected]>
But do we need a change as done by the following commit in in pcc-
cpufreq.c?
"
commit 8a61e12e84597b5f8155ac91b44dea866ccfaac2
Author: Yinghai Lu <[email protected]>
Date: Fri Sep 20 10:43:56 2013 -0700
acpi-cpufreq: skip loading acpi_cpufreq after intel_pstate
"
Thanks,
Srinivas
> ---
> drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
> return true;
> }
>
> +static bool __init intel_pstate_no_acpi_pcch(void)
> +{
> + acpi_status status;
> + acpi_handle handle;
> +
> + status = acpi_get_handle(NULL, "\\_SB", &handle);
> + if (ACPI_FAILURE(status))
> + return true;
> +
> + return !acpi_has_method(handle, "PCCH");
> +}
> +
> static bool __init intel_pstate_has_acpi_ppc(void)
> {
> int i;
> @@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
>
> switch (plat_info[idx].data) {
> case PSS:
> - return intel_pstate_no_acpi_pss();
> + if (!intel_pstate_no_acpi_pss())
> + return false;
> +
> + return intel_pstate_no_acpi_pcch();
> case PPC:
> return intel_pstate_has_acpi_ppc() && !force_load;
> }
>
On Tue, Jul 17, 2018 at 7:23 PM, Srinivas Pandruvada
<[email protected]> wrote:
> On Tue, 2018-07-17 at 18:13 +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> Currently, intel_pstate doesn't load if _PSS is not present on
>> HP Proliant systems, because it expects the firmware to take over
>> CPU performance scaling in that case. However, if ACPI PCCH is
>> present, the firmware expects the kernel to use it for CPU
>> performance scaling and the pcc-cpufreq driver is loaded for that.
>>
>> Unfortunately, the firmware interface used by that driver is not
>> scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
>> on systems with more than just a few CPUs. In fact, it is better to
>> avoid using it at all.
>>
>> For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
>> is not present and load if it is there.
>>
>> Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power
>> mgmt option)
>> Reported-by: Andreas Herrmann <[email protected]>
>> Tested-by: Andreas Herrmann <[email protected]>
>> Reviewed-by: Andreas Herrmann <[email protected]>
>> Cc: 4.17+ <[email protected]> # 4.17+
>> Signed-off-by: Rafael J. Wysocki <[email protected]>
> Acked-by: Srinivas Pandruvada <[email protected]>
>
> But do we need a change as done by the following commit in in pcc-
> cpufreq.c?
>
> "
> commit 8a61e12e84597b5f8155ac91b44dea866ccfaac2
> Author: Yinghai Lu <[email protected]>
> Date: Fri Sep 20 10:43:56 2013 -0700
>
> acpi-cpufreq: skip loading acpi_cpufreq after intel_pstate
>
> "
Quite likely yes, good point!
>> ---
>> drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> ===================================================================
>> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> @@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
>> return true;
>> }
>>
>> +static bool __init intel_pstate_no_acpi_pcch(void)
>> +{
>> + acpi_status status;
>> + acpi_handle handle;
>> +
>> + status = acpi_get_handle(NULL, "\\_SB", &handle);
>> + if (ACPI_FAILURE(status))
>> + return true;
>> +
>> + return !acpi_has_method(handle, "PCCH");
>> +}
>> +
>> static bool __init intel_pstate_has_acpi_ppc(void)
>> {
>> int i;
>> @@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
>>
>> switch (plat_info[idx].data) {
>> case PSS:
>> - return intel_pstate_no_acpi_pss();
>> + if (!intel_pstate_no_acpi_pss())
>> + return false;
>> +
>> + return intel_pstate_no_acpi_pcch();
>> case PPC:
>> return intel_pstate_has_acpi_ppc() && !force_load;
>> }
>>
From: Rafael J. Wysocki <[email protected]>
Currently, intel_pstate doesn't load if _PSS is not present on
HP Proliant systems, because it expects the firmware to take over
CPU performance scaling in that case. However, if ACPI PCCH is
present, the firmware expects the kernel to use it for CPU
performance scaling and the pcc-cpufreq driver is loaded for that.
Unfortunately, the firmware interface used by that driver is not
scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
on systems with more than just a few CPUs. In fact, it is better to
avoid using it at all.
For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
is not present and register if it is there. Also prevent the
pcc-cpufreq driver from trying to initialize if intel_pstate has
been registered already.
Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power mgmt option)
Reported-by: Andreas Herrmann <[email protected]>
Reviewed-by: Andreas Herrmann <[email protected]>
Acked-by: Srinivas Pandruvada <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
This is a replacement for https://patchwork.kernel.org/patch/10530091/
In addition to the intel_pstate changes in the above, it also
prevents pcc-cpufreq from trying to initialize (which will fail
ultimately, but may confuse the firmware in the meantime).
Andreas, please test this one and let me know if it still works for you.
Thanks,
Rafael
---
drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
drivers/cpufreq/pcc-cpufreq.c | 4 ++++
2 files changed, 20 insertions(+), 1 deletion(-)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
return true;
}
+static bool __init intel_pstate_no_acpi_pcch(void)
+{
+ acpi_status status;
+ acpi_handle handle;
+
+ status = acpi_get_handle(NULL, "\\_SB", &handle);
+ if (ACPI_FAILURE(status))
+ return true;
+
+ return !acpi_has_method(handle, "PCCH");
+}
+
static bool __init intel_pstate_has_acpi_ppc(void)
{
int i;
@@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
switch (plat_info[idx].data) {
case PSS:
- return intel_pstate_no_acpi_pss();
+ if (!intel_pstate_no_acpi_pss())
+ return false;
+
+ return intel_pstate_no_acpi_pcch();
case PPC:
return intel_pstate_has_acpi_ppc() && !force_load;
}
Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
+++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
@@ -580,6 +580,10 @@ static int __init pcc_cpufreq_init(void)
{
int ret;
+ /* Skip initialization if another cpufreq driver is there. */
+ if (cpufreq_get_current_driver())
+ return 0;
+
if (acpi_disabled)
return 0;
On Tue, Jul 17, 2018 at 06:14:58PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The firmware interface used by the pcc-cpufreq driver is
> fundamentally not scalable and using it for dynamic CPU performance
> scaling on systems with many CPUs leads to degraded performance.
>
> For this reason, disable dynamic CPU performance scaling on systems
> with pcc-cpufreq where the number of CPUs present at the driver init
> time is greater than 4. Also make the driver print corresponding
> complaints to the kernel log.
>
> Reported-by: Andreas Herrmann <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> -> v2: Rework the messages printed in the problematic case.
I've tested this patch. Effect is as expected: driver loads but use of
ondemand governor is not allowed. Sample output:
[ 40.757519] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
[ 40.831705] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
[ 40.898353] pcc_cpufreq_init: Try to enable a different scaling driver through BIOS settings
[ 40.972327] pcc_cpufreq_init: and complain to the system vendor
[ 41.025620] cpufreq: Can't use ondemand governor as dynamic switching is disallowed. Fallback to performance governor
...
[ 41.187928] cpufreq: Can't use ondemand governor as dynamic switching is disallowed. Fallback to performance governor
Last message is shown for each online CPU in the system (ie. 120x).
Looks good to me.
Andreas
> ---
> drivers/cpufreq/pcc-cpufreq.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
> +++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
> @@ -589,6 +589,15 @@ static int __init pcc_cpufreq_init(void)
> return ret;
> }
>
> + if (num_present_cpus() > 4) {
> + pcc_cpufreq_driver.flags |= CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING;
> + pr_err("%s: Too many CPUs, dynamic performance scaling disabled\n",
> + __func__);
> + pr_err("%s: Try to enable a different scaling driver through BIOS settings\n",
> + __func__);
> + pr_err("%s: and complain to the system vendor\n", __func__);
> + }
> +
> ret = cpufreq_register_driver(&pcc_cpufreq_driver);
>
> return ret;
>
>
On Tue, Jul 17, 2018 at 10:13 PM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 06:14:58PM +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> The firmware interface used by the pcc-cpufreq driver is
>> fundamentally not scalable and using it for dynamic CPU performance
>> scaling on systems with many CPUs leads to degraded performance.
>>
>> For this reason, disable dynamic CPU performance scaling on systems
>> with pcc-cpufreq where the number of CPUs present at the driver init
>> time is greater than 4. Also make the driver print corresponding
>> complaints to the kernel log.
>>
>> Reported-by: Andreas Herrmann <[email protected]>
>> Signed-off-by: Rafael J. Wysocki <[email protected]>
>> ---
>>
>> -> v2: Rework the messages printed in the problematic case.
>
> I've tested this patch. Effect is as expected: driver loads but use of
> ondemand governor is not allowed. Sample output:
>
> [ 40.757519] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
> [ 40.831705] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
> [ 40.898353] pcc_cpufreq_init: Try to enable a different scaling driver through BIOS settings
> [ 40.972327] pcc_cpufreq_init: and complain to the system vendor
> [ 41.025620] cpufreq: Can't use ondemand governor as dynamic switching is disallowed. Fallback to performance governor
> ...
> [ 41.187928] cpufreq: Can't use ondemand governor as dynamic switching is disallowed. Fallback to performance governor
>
> Last message is shown for each online CPU in the system (ie. 120x).
>
> Looks good to me.
Thanks a lot!
Please also try https://patchwork.kernel.org/patch/10530321/
Cheers,
Rafael
On Tue, Jul 17, 2018 at 10:13:23PM +0200, Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 06:14:58PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > The firmware interface used by the pcc-cpufreq driver is
> > fundamentally not scalable and using it for dynamic CPU performance
> > scaling on systems with many CPUs leads to degraded performance.
> >
> > For this reason, disable dynamic CPU performance scaling on systems
> > with pcc-cpufreq where the number of CPUs present at the driver init
> > time is greater than 4. Also make the driver print corresponding
> > complaints to the kernel log.
> >
> > Reported-by: Andreas Herrmann <[email protected]>
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> >
> > -> v2: Rework the messages printed in the problematic case.
>
> I've tested this patch. Effect is as expected: driver loads but use of
> ondemand governor is not allowed. Sample output:
>
> [ 40.757519] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
> [ 40.831705] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
> [ 40.898353] pcc_cpufreq_init: Try to enable a different scaling driver through BIOS settings
BTW, Andreas, is that BIOS option available through the normal BIOS
settings, or it is in the "secret" BIOS menu that HP has? If it is in
the "secret" one (^A IIRC) then we might want to explicitly mention
that.
On Wed, Jul 18, 2018 at 10:23:52AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 10:13:23PM +0200, Andreas Herrmann wrote:
> > On Tue, Jul 17, 2018 at 06:14:58PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > >
> > > The firmware interface used by the pcc-cpufreq driver is
> > > fundamentally not scalable and using it for dynamic CPU performance
> > > scaling on systems with many CPUs leads to degraded performance.
> > >
> > > For this reason, disable dynamic CPU performance scaling on systems
> > > with pcc-cpufreq where the number of CPUs present at the driver init
> > > time is greater than 4. Also make the driver print corresponding
> > > complaints to the kernel log.
> > >
> > > Reported-by: Andreas Herrmann <[email protected]>
> > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > ---
> > >
> > > -> v2: Rework the messages printed in the problematic case.
> >
> > I've tested this patch. Effect is as expected: driver loads but use of
> > ondemand governor is not allowed. Sample output:
> >
> > [ 40.757519] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
> > [ 40.831705] pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
> > [ 40.898353] pcc_cpufreq_init: Try to enable a different scaling driver through BIOS settings
>
> BTW, Andreas, is that BIOS option available through the normal BIOS
> settings, or it is in the "secret" BIOS menu that HP has? If it is in
> the "secret" one (^A IIRC) then we might want to explicitly mention
> that.
The relevant options are not in a secret section (so far I was not
even aware of a secret section). But on those systems some stuff is
available via iLO interface (browser) and some only via "BIOS/Platform
Configuration (RBSU)". The iLO interface has a couple of options that
are only visible with an advanced ILO license.
The options affecting Linux cpufreq subsystem are
(a) available only via "BIOS/Platform Configuration (RBSU)"
[Power Management]->
[Advanced Power Options]->
[Collaborative Power Control] == "enabled" | "disabled"
If set to "disabled": No cpufreq driver will load and platform is
solely responsible for CPU frequency adaptions (whatever that means).
(b) available via "BIOS/Platform Configuration (RBSU)" and iLO interface
[Power Management]->
[Power Regulator] == "Dynamic Power Savings Mode" | "OS Control Mode" |
"Static Low Power Mode" | "Static High Performance Mode"
"Dynamic Power Savings Mode" allows pcc-cpufreq to load and "OS
Control Mode" allows intel-pstate to be loaded. We now change it such
that also with "Dynamic Power Savings Mode" intel-pstate is loaded
(if available; if not, pcc-cpufreq will still be loaded but it now
emits a warning and disallows use of ondemand governor if too many
CPUs are in use).
AFAIK the other (static) modes have no real meaning when
"Collaborative Power Control" is enabled because as soon as a cpufreq
driver is loaded frequency will be adapted by OS. I can't remember
which one (pcc-cpufreq or intel-pstate) would be loaded, I've just
tried it once or twice.
Note that the above is valid for a DL580 Gen8 with Intel CPUs. It
might be slightly different on Gen9/Gen10 and/or other models. Ie. for
models based on AMD CPUs we just have added a new warning in
pcc-cpufreq and disabled use of ondemand governor when the system has
more than 4 CPUs.
Andreas
On Tue, Jul 17, 2018 at 08:06:54PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Currently, intel_pstate doesn't load if _PSS is not present on
> HP Proliant systems, because it expects the firmware to take over
> CPU performance scaling in that case. However, if ACPI PCCH is
> present, the firmware expects the kernel to use it for CPU
> performance scaling and the pcc-cpufreq driver is loaded for that.
>
> Unfortunately, the firmware interface used by that driver is not
> scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
> on systems with more than just a few CPUs. In fact, it is better to
> avoid using it at all.
>
> For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
> is not present and register if it is there. Also prevent the
> pcc-cpufreq driver from trying to initialize if intel_pstate has
> been registered already.
>
> Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power mgmt option)
> Reported-by: Andreas Herrmann <[email protected]>
> Reviewed-by: Andreas Herrmann <[email protected]>
> Acked-by: Srinivas Pandruvada <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> This is a replacement for https://patchwork.kernel.org/patch/10530091/
>
> In addition to the intel_pstate changes in the above, it also
> prevents pcc-cpufreq from trying to initialize (which will fail
> ultimately, but may confuse the firmware in the meantime).
>
> Andreas, please test this one and let me know if it still works for you.
Done that (with system in "Dynamic Power Savings Mode"). It still
works and system was stable.
FYI, as a sniff test I've run a kernbench test. I now repeat the test
with system switched to "OS control mode". Just in case -- if there
was interference with platform code while system was in "Dynamic Power
Savings Mode" (significantly affecting performance) I might spot it
this way.
Andreas
> Thanks,
> Rafael
>
> ---
> drivers/cpufreq/intel_pstate.c | 17 ++++++++++++++++-
> drivers/cpufreq/pcc-cpufreq.c | 4 ++++
> 2 files changed, 20 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -2391,6 +2391,18 @@ static bool __init intel_pstate_no_acpi_
> return true;
> }
>
> +static bool __init intel_pstate_no_acpi_pcch(void)
> +{
> + acpi_status status;
> + acpi_handle handle;
> +
> + status = acpi_get_handle(NULL, "\\_SB", &handle);
> + if (ACPI_FAILURE(status))
> + return true;
> +
> + return !acpi_has_method(handle, "PCCH");
> +}
> +
> static bool __init intel_pstate_has_acpi_ppc(void)
> {
> int i;
> @@ -2450,7 +2462,10 @@ static bool __init intel_pstate_platform
>
> switch (plat_info[idx].data) {
> case PSS:
> - return intel_pstate_no_acpi_pss();
> + if (!intel_pstate_no_acpi_pss())
> + return false;
> +
> + return intel_pstate_no_acpi_pcch();
> case PPC:
> return intel_pstate_has_acpi_ppc() && !force_load;
> }
> Index: linux-pm/drivers/cpufreq/pcc-cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/pcc-cpufreq.c
> +++ linux-pm/drivers/cpufreq/pcc-cpufreq.c
> @@ -580,6 +580,10 @@ static int __init pcc_cpufreq_init(void)
> {
> int ret;
>
> + /* Skip initialization if another cpufreq driver is there. */
> + if (cpufreq_get_current_driver())
> + return 0;
> +
> if (acpi_disabled)
> return 0;
>
>
>
On Wed, Jul 18, 2018 at 12:43 PM, Andreas Herrmann <[email protected]> wrote:
> On Tue, Jul 17, 2018 at 08:06:54PM +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> Currently, intel_pstate doesn't load if _PSS is not present on
>> HP Proliant systems, because it expects the firmware to take over
>> CPU performance scaling in that case. However, if ACPI PCCH is
>> present, the firmware expects the kernel to use it for CPU
>> performance scaling and the pcc-cpufreq driver is loaded for that.
>>
>> Unfortunately, the firmware interface used by that driver is not
>> scalable for fundamental reasons, so pcc-cpufreq is way suboptimal
>> on systems with more than just a few CPUs. In fact, it is better to
>> avoid using it at all.
>>
>> For this reason, modify intel_pstate to look for ACPI PCCH if _PSS
>> is not present and register if it is there. Also prevent the
>> pcc-cpufreq driver from trying to initialize if intel_pstate has
>> been registered already.
>>
>> Fixes: fbbcdc0744da (intel_pstate: skip the driver if ACPI has power mgmt option)
>> Reported-by: Andreas Herrmann <[email protected]>
>> Reviewed-by: Andreas Herrmann <[email protected]>
>> Acked-by: Srinivas Pandruvada <[email protected]>
>> Signed-off-by: Rafael J. Wysocki <[email protected]>
>> ---
>>
>> This is a replacement for https://patchwork.kernel.org/patch/10530091/
>>
>> In addition to the intel_pstate changes in the above, it also
>> prevents pcc-cpufreq from trying to initialize (which will fail
>> ultimately, but may confuse the firmware in the meantime).
>>
>> Andreas, please test this one and let me know if it still works for you.
>
> Done that (with system in "Dynamic Power Savings Mode"). It still
> works and system was stable.
Thanks!
> FYI, as a sniff test I've run a kernbench test. I now repeat the test
> with system switched to "OS control mode". Just in case -- if there
> was interference with platform code while system was in "Dynamic Power
> Savings Mode" (significantly affecting performance) I might spot it
> this way.
I'd rather not expect performance to be affected by this, but power
very well may be affected.
Anyway, good idea!
Cheers,
Rafael
I think I still owe some performance numbers to show what is wrong
with systems using pcc-cpufreq with Linux after commit 554c8aa8ecad.
Following are results for kernbench tests (from MMTests test suite).
That's just a kernel compile with different number of compile jobs.
As result the time is measured, 5 runs are done for each configuration and
average values calculated.
I've restricted maximum number of jobs to 30. That means that tests
were done for 2, 4, 8, 16, and 30 compile jobs. I had bound all tests
to node 0. (I've used something like "numactl -N 0 ./run-mmtests.sh
--run-monitor <test_name>" to start those tests.)
Tests were done with kernel 4.18.0-rc3 on an HP DL580 Gen8 with Intel
Xeon CPU E7-4890 with latest BIOS installed. System had 4 nodes, 15
CPUs per node (30 logical CPUs with HT enabled). pcc-cpufreq was
active and ondemand governor in use.
I've tested with different number of online CPUs which better
illustrates how idle online CPUs interfere with compile load on node 0
(due to the jitter caused by pcc-cpufreq and its locking).
Average mean for user/system/elapsed time and standard deviation for each
subtest (=number of compile jobs) are as follows:
(Nodes) N0 N01 N0 N01 N0123
(CPUs) 15CPUs 30CPUs 30CPUs 60CPUs 120CPUs
Amean user-2 640.82 (0.00%) 675.90 (-5.47%) 789.03 (-23.13%) 1448.58 (-126.05%) 3575.79 (-458.01%)
Amean user-4 652.18 (0.00%) 689.12 (-5.67%) 868.19 (-33.12%) 1846.66 (-183.15%) 5437.37 (-733.73%)
Amean user-8 695.00 (0.00%) 732.22 (-5.35%) 1138.30 (-63.78%) 2598.74 (-273.92%) 7413.43 (-966.67%)
Amean user-16 653.94 (0.00%) 772.48 (-18.13%) 1734.80 (-165.29%) 2699.65 (-312.83%) 9224.47 (-1310.61%)
Amean user-30 634.91 (0.00%) 701.11 (-10.43%) 1197.37 (-88.59%) 1360.02 (-114.21%) 3732.34 (-487.85%)
Amean syst-2 235.45 (0.00%) 235.68 (-0.10%) 321.99 (-36.76%) 574.44 (-143.98%) 869.35 (-269.23%)
Amean syst-4 239.34 (0.00%) 243.09 (-1.57%) 345.07 (-44.18%) 621.00 (-159.47%) 1145.13 (-378.46%)
Amean syst-8 246.51 (0.00%) 254.83 (-3.37%) 387.49 (-57.19%) 786.63 (-219.10%) 1406.17 (-470.42%)
Amean syst-16 110.85 (0.00%) 122.21 (-10.25%) 408.25 (-268.31%) 644.41 (-481.36%) 1513.04 (-1264.99%)
Amean syst-30 82.74 (0.00%) 94.07 (-13.69%) 155.38 (-87.80%) 207.03 (-150.22%) 547.73 (-562.01%)
Amean elsp-2 625.33 (0.00%) 724.51 (-15.86%) 792.47 (-26.73%) 1537.44 (-145.86%) 3510.22 (-461.34%)
Amean elsp-4 482.02 (0.00%) 568.26 (-17.89%) 670.26 (-39.05%) 1257.34 (-160.85%) 3120.89 (-547.46%)
Amean elsp-8 267.75 (0.00%) 337.88 (-26.19%) 430.56 (-60.80%) 978.47 (-265.44%) 2321.91 (-767.18%)
Amean elsp-16 63.55 (0.00%) 71.79 (-12.97%) 224.83 (-253.79%) 403.94 (-535.65%) 1121.04 (-1664.09%)
Amean elsp-30 56.76 (0.00%) 62.82 (-10.69%) 66.50 (-17.16%) 124.20 (-118.84%) 303.47 (-434.70%)
Stddev user-2 1.36 (0.00%) 1.94 (-42.57%) 16.17 (-1090.46%) 119.09 (-8669.75%) 382.74 (-28085.60%)
Stddev user-4 2.81 (0.00%) 5.08 (-80.78%) 4.88 (-73.66%) 252.56 (-8881.80%) 1133.02 (-40193.16%)
Stddev user-8 2.30 (0.00%) 15.58 (-578.28%) 30.60 (-1232.63%) 279.35 (-12064.01%) 1050.00 (-45621.61%)
Stddev user-16 6.76 (0.00%) 25.52 (-277.80%) 78.44 (-1060.97%) 118.29 (-1650.94%) 724.11 (-10617.95%)
Stddev user-30 0.51 (0.00%) 1.80 (-249.13%) 12.63 (-2354.11%) 25.82 (-4915.43%) 1098.82 (-213365.28%)
Stddev syst-2 1.52 (0.00%) 2.76 (-81.04%) 3.98 (-161.58%) 36.35 (-2287.16%) 59.09 (-3781.09%)
Stddev syst-4 2.39 (0.00%) 1.55 (35.25%) 3.24 ( -35.92%) 51.51 (-2057.65%) 175.75 (-7262.43%)
Stddev syst-8 1.08 (0.00%) 3.70 (-241.40%) 6.83 (-531.33%) 65.80 (-5977.97%) 151.17 (-13864.10%)
Stddev syst-16 3.78 (0.00%) 5.58 (-47.53%) 4.63 ( -22.44%) 47.90 (-1167.18%) 99.94 (-2543.88%)
Stddev syst-30 0.31 (0.00%) 0.38 (-22.41%) 3.01 (-862.79%) 27.45 (-8688.85%) 137.94 (-44072.77%)
Stddev elsp-2 55.14 (0.00%) 55.04 (0.18%) 95.33 ( -72.90%) 103.91 (-88.45%) 302.31 (-448.29%)
Stddev elsp-4 60.90 (0.00%) 84.42 (-38.62%) 18.92 ( 68.94%) 197.60 (-224.46%) 323.53 (-431.24%)
Stddev elsp-8 16.77 (0.00%) 30.77 (-83.47%) 49.57 (-195.57%) 79.02 (-371.16%) 261.85 (-1461.28%)
Stddev elsp-16 1.99 (0.00%) 2.88 (-44.60%) 28.11 (-1311.79%) 101.81 (-5012.88%) 62.29 (-3028.36%)
Stddev elsp-30 0.65 (0.00%) 1.04 (-59.06%) 1.64 (-151.81%) 41.84 (-6308.81%) 75.37 (-11445.61%)
Overall test time for each mmtests invocation was as follows (this is
also given for number-of-cpu configs for which I did not provide
details above).
N0 N01 N0 N012 N0123 N01 N0123 N0123 N012 N0123 N0123
15CPUs 30CPUs 30CPUs 45CPUs 60CPUs 60CPUs 75CPUs 90CPUs 90CPUs 105CPUs 120CPUs
User 17196.67 18714.36 30105.65 19239.27 19505.35 53089.39 22690.33 26731.06 38131.74 47627.61 153424.99
System 4807.98 4970.89 8533.95 5136.97 5184.24 16351.67 6135.29 7152.66 10920.76 12362.39 32129.74
Elapsed 7796.46 9166.55 11518.51 9274.77 9030.39 25465.38 9361.60 10677.63 15633.49 18900.46 60908.28
The results given for 120 online CPUs on nodes 0-3 indicate what I
meant with the "system being almost unusable". When trying to gather
results with kernel 4.17.5 and 120 CPUs, one iteration of kernbench (1
kernel compile) with 2 jobs even took about 6 hours. Maybe it was an
extreme outlier but I dismissed to further use that kernel (w/o
modifications) for further tests.
Andreas
On Wed, Jul 18, 2018 at 05:25:56PM +0200, Andreas Herrmann wrote:
...
> Average mean for user/system/elapsed time and standard deviation for each
Oops, of course I meant arithmetic mean.
...
For the sake of completeness following are given the remaining sets of
kernbench results related to this thread.
Setup for kernbench test is as described in previous mails but now all
120 logical CPUs were online in all tests. Test runs were still pinned
to node 0.
Common legend for below tables is:
OSCM: "OS Control Mode"
DPSM: "Dynamic Power Savings Mode"
idle_rb: partial rollback of 554c8aa8ecad ("sched: idle: Select idle
state before stopping the tick") as described in initial mail
of this thread
(A) intel_pstate (in powersave mode) performance wrt effect of commit
554c8aa8ecad and wrt to potential interference from platform code
Kernel v4.18-rc5-36-g30b06abfb92b + patch for intel_pstate to load it
instead of pcc-cpufreq when system is in DPSM.
Detailed results for each number of compile jobs:
(OSCM is baseline, values in parenthesis show comparison to baseline)
OSCM OSCM DPSM DPSM
idle_rb idle_rb
Amean user-2 600.58 596.38 ( 0.70%) 685.94 ( -14.21%) 688.78 ( -14.69%)
Amean user-4 583.90 586.34 ( -0.42%) 626.37 ( -7.27%) 622.17 ( -6.55%)
Amean user-8 584.78 581.52 ( 0.56%) 600.89 ( -2.75%) 595.53 ( -1.84%)
Amean user-16 705.07 688.62 ( 2.33%) 705.16 ( -0.01%) 682.44 ( 3.21%)
Amean user-30 1017.25 1022.39 ( -0.51%) 1025.23 ( -0.78%) 1022.61 ( -0.53%)
Amean syst-2 172.17 174.08 ( -1.11%) 184.73 ( -7.30%) 186.13 ( -8.11%)
Amean syst-4 183.88 180.44 ( 1.87%) 191.70 ( -4.25%) 192.24 ( -4.54%)
Amean syst-8 193.40 193.81 ( -0.21%) 198.01 ( -2.38%) 193.96 ( -0.29%)
Amean syst-16 183.97 180.40 ( 1.94%) 184.00 ( -0.01%) 182.10 ( 1.02%)
Amean syst-30 122.36 122.08 ( 0.23%) 122.53 ( -0.14%) 122.17 ( 0.15%)
Amean elsp-2 610.90 634.64 ( -3.89%) 667.67 ( -9.29%) 661.81 ( -8.33%)
Amean elsp-4 413.54 488.02 ( -18.01%) 433.79 ( -4.90%) 407.30 ( 1.51%)
Amean elsp-8 261.85 218.25 ( 16.65%) 246.62 ( 5.82%) 219.55 ( 16.15%)
Amean elsp-16 89.27 99.36 ( -11.30%) 92.74 ( -3.89%) 102.74 ( -15.09%)
Amean elsp-30 47.07 47.04 ( 0.08%) 48.82 ( -3.72%) 48.28 ( -2.57%)
Stddev user-2 6.06 7.53 ( -24.21%) 31.88 (-425.98%) 25.79 (-325.57%)
Stddev user-4 7.05 14.48 (-105.40%) 11.82 ( -67.63%) 12.14 ( -72.22%)
Stddev user-8 5.69 1.18 ( 79.28%) 18.75 (-229.45%) 7.03 ( -23.51%)
Stddev user-16 6.41 15.74 (-145.55%) 12.87 (-100.75%) 10.59 ( -65.19%)
Stddev user-30 2.62 2.80 ( -6.56%) 2.92 ( -11.31%) 2.45 ( 6.52%)
Stddev syst-2 3.48 2.81 ( 19.28%) 2.27 ( 34.73%) 1.47 ( 57.83%)
Stddev syst-4 4.04 4.69 ( -16.03%) 2.16 ( 46.42%) 0.84 ( 79.32%)
Stddev syst-8 3.96 1.42 ( 64.11%) 2.34 ( 40.98%) 1.93 ( 51.24%)
Stddev syst-16 2.01 2.33 ( -15.76%) 1.33 ( 33.89%) 1.94 ( 3.74%)
Stddev syst-30 0.76 0.38 ( 50.10%) 0.91 ( -19.48%) 0.17 ( 77.86%)
Stddev elsp-2 44.55 58.37 ( -31.01%) 110.11 (-147.15%) 82.81 ( -85.88%)
Stddev elsp-4 62.39 109.75 ( -75.90%) 48.32 ( 22.56%) 47.10 ( 24.52%)
Stddev elsp-8 59.01 25.95 ( 56.02%) 71.44 ( -21.07%) 37.83 ( 35.89%)
Stddev elsp-16 10.47 23.88 (-128.08%) 11.98 ( -14.41%) 15.42 ( -47.32%)
Stddev elsp-30 0.26 0.64 (-142.06%) 0.39 ( -46.53%) 0.44 ( -66.71%)
Overall test time:
OSCM OSCM DPSM DPSM
idle_rb idle_rb
User 18681.59 18599.99 19450.38 19289.33
System 4487.76 4458.55 4620.80 4595.13
Elapsed 7407.07 7725.86 7765.91 7502.72
Overall test run-time is comparable. Commit 554c8aa8ecad does not
seem to have a significant impact on performance (I don't have
numbers for power consumption). Comparing OSCM vs. DPSM: it seems
that its better to switch system into OSCM.
(B) performance of intel_pstate (in powersave mode and system in DPSM)
vs. pcc-cpufreq (with ondemand governor)
Results for pcc-cpufreq were obtained with v4.17.5+misc modifications.
intel_pstate results were obtained with v4.18-rc5-36-g30b06abfb92b +
patch for intel_pstate to load it instead of pcc-cpufreq when system
is in DPSM.
So strictly speaking this is no correct comparison but at least it
gives an idea where the limits are with pcc-cpufreq and why its
better to just switch to intel_pstate.
pcc-cpufreq driver modifications were
freqtable: pcc-cpufreq modified to use fixed table of 4 frequencies
deadband: pcc-cpufreq modified to re-introduce so called deadband
effect which keeps CPU at minimum frequency if target
frequency would be in the calculated deadband
intel_pstate pcc-cpufreq pcc-cpufreq pcc-cpufreq
DPSM idle_rb idle_rb+freqtable idle_rb+deadband
Amean user-2 685.94 834.15 ( -21.61%) 648.68 ( 5.43%) 636.63 ( 7.19%)
Amean user-4 626.37 902.09 ( -44.02%) 657.43 ( -4.96%) 615.49 ( 1.74%)
Amean user-8 600.89 1078.37 ( -79.46%) 723.05 ( -20.33%) 646.23 ( -7.55%)
Amean user-16 705.16 1640.89 (-132.70%) 1096.61 ( -55.51%) 904.17 ( -28.22%)
Amean user-30 1025.23 1463.90 ( -42.79%) 1156.17 ( -12.77%) 1151.40 ( -12.31%)
Amean syst-2 184.73 232.17 ( -25.68%) 178.24 ( 3.51%) 172.09 ( 6.84%)
Amean syst-4 191.70 257.22 ( -34.18%) 194.16 ( -1.29%) 188.10 ( 1.88%)
Amean syst-8 198.01 313.67 ( -58.41%) 228.34 ( -15.31%) 206.99 ( -4.53%)
Amean syst-16 184.00 393.92 (-114.09%) 279.89 ( -52.12%) 241.83 ( -31.43%)
Amean syst-30 122.53 185.98 ( -51.79%) 143.28 ( -16.94%) 140.45 ( -14.62%)
Amean elsp-2 667.67 769.28 ( -15.22%) 635.68 ( 4.79%) 651.51 ( 2.42%)
Amean elsp-4 433.79 614.27 ( -41.60%) 440.45 ( -1.53%) 392.80 ( 9.45%)
Amean elsp-8 246.62 397.54 ( -61.19%) 252.27 ( -2.29%) 239.21 ( 3.01%)
Amean elsp-16 92.74 207.43 (-123.68%) 138.00 ( -48.81%) 119.98 ( -29.37%)
Amean elsp-30 48.82 72.66 ( -48.83%) 55.95 ( -14.60%) 54.32 ( -11.27%)
Stddev user-2 31.88 15.22 ( 52.26%) 7.77 ( 75.63%) 6.63 ( 79.21%)
Stddev user-4 11.82 32.20 (-172.49%) 3.37 ( 71.44%) 6.44 ( 45.49%)
Stddev user-8 18.75 33.99 ( -81.29%) 6.96 ( 62.86%) 5.82 ( 68.97%)
Stddev user-16 12.87 70.72 (-449.46%) 31.19 (-142.30%) 28.88 (-124.40%)
Stddev user-30 2.92 26.08 (-792.64%) 6.16 (-110.99%) 10.90 (-273.16%)
Stddev syst-2 2.27 4.44 ( -95.54%) 4.15 ( -82.48%) 2.09 ( 8.11%)
Stddev syst-4 2.16 8.46 (-290.74%) 3.71 ( -71.58%) 2.45 ( -12.99%)
Stddev syst-8 2.34 10.73 (-359.70%) 3.98 ( -70.62%) 4.39 ( -87.80%)
Stddev syst-16 1.33 11.44 (-759.46%) 2.14 ( -60.49%) 2.93 (-120.24%)
Stddev syst-30 0.91 4.88 (-436.79%) 1.37 ( -50.11%) 2.36 (-159.71%)
Stddev elsp-2 110.11 85.53 ( 22.32%) 87.11 ( 20.89%) 37.33 ( 66.10%)
Stddev elsp-4 48.32 130.17 (-169.39%) 59.81 ( -23.79%) 26.15 ( 45.88%)
Stddev elsp-8 71.44 86.47 ( -21.03%) 12.87 ( 81.98%) 43.88 ( 38.58%)
Stddev elsp-16 11.98 13.63 ( -13.82%) 8.94 ( 25.35%) 5.97 ( 50.15%)
Stddev elsp-30 0.39 2.64 (-582.23%) 0.62 ( -58.97%) 0.95 (-144.47%)
intel_pstate pcc-cpufreq pcc-cpufreq pcc-cpufreq
DPSM idle_rb idle_rb+ idle_rb+
freqtable deadband
User 19450.38 31273.96 22689.14 21050.35
System 4620.80 7327.67 5364.63 4984.36
Elapsed 7765.91 10997.49 7935.53 7593.74
Again I have no numbers for power consumption.
Note that I've stopped an attempt to collect results for pcc-cpufreq
with unmodififed v4.17.5 (ie. w/o idle_rb) after the first iteration
(compiling kernel with 2 jobs) took several hours.
Andreas