On recent kernels the tick remains active on idle adaptive-tick CPUs when a small value is written to /dev/cpu_dma_latency to restrict the highest C-state. Before the idle loop redesign in 4.17, idle CPUs had the tick disabled even when C-states were restricted. Is this change intentional or a regression?
I use an x86_64 system built with CONFIG_NO_HZ_FULL that I recently upgraded to the 4.19 series from the 4.14 series. I noticed that adaptive-tick CPUs (nohz_full=1-7) still fire timer interrupts about 1000 times/s (CONFIG_HZ_1000=y) even when they are mostly idle. Some debugging showed that this only happens when a program is writing to /dev/cpu_dma_latency to restrict C-states. The old 4.14 kernel only has around 10 timer interrupts per second on an idle adaptive-tick CPU even when C-states are restricted that way.
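For reference, the kind of program I mean boils down to something like this sketch (the 20 is just an example latency in microseconds; the constraint is held for as long as the file descriptor stays open):

/* sketch: hold a cpu_dma_latency constraint until killed */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int32_t latency_us = 20;        /* example value in microseconds */
        int fd = open("/dev/cpu_dma_latency", O_WRONLY);

        if (fd < 0) {
                perror("open /dev/cpu_dma_latency");
                return 1;
        }
        /* the kernel accepts a raw 32-bit value (or an ASCII hex string) */
        if (write(fd, &latency_us, sizeof(latency_us)) < 0) {
                perror("write");
                return 1;
        }
        pause();        /* the constraint is dropped when the fd is closed */
        return 0;
}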
I would expect an adaptive-tick CPU to turn off the tick when it has 0 or 1 processes to run and enable the tick for 2 or more processes. Kernels after 4.17 instead keep the tick on with 0 processes or with 2 or more, and turn it off only in the 1-process case. Since the tick is off when a single process is running, that workload isn't directly harmed by the change, but if the CPU uses hyperthreading the constant wakeups on an idle HT sibling will reduce performance on the other sibling.
The way I look for timer interrupts is by comparing the LOC line in /proc/interrupts or using the hrtimer_expire_entry tracepoint with function=tick_sched_timer. Both methods seem to give the same results.
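In case it helps, the /proc/interrupts comparison is basically this sketch: print the LOC line, wait a second, print it again and compare the per-CPU counts:

/* sketch: show the local timer interrupt counts a second apart */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void print_loc(void)
{
        char line[1024];
        FILE *f = fopen("/proc/interrupts", "r");

        if (!f)
                return;
        while (fgets(line, sizeof(line), f))
                if (strstr(line, "LOC:"))
                        fputs(line, stdout);
        fclose(f);
}

int main(void)
{
        print_loc();
        sleep(1);       /* idle CPUs should accumulate few ticks in this window */
        print_loc();
        return 0;
}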
I can reproduce the problem using an i7-4790K CPU with /sys/devices/system/cpu/cpuidle/current_driver:intel_idle. I can also reproduce the problem on an old core2duo laptop with current_driver:acpi_idle, but I can't reproduce the problem in a virtual machine where current_driver:none. I also can't reproduce the problem if C-states are restricted using the intel_idle.max_cstate=1 kernel argument instead of /dev/cpu_dma_latency.
The commit that introduced the change is 554c8aa8ec "sched: idle: Select idle state before stopping the tick" in v4.17 and the problem exists at least up to kernel 5.1 using the menu cpuidle governor.
On Fri, Jul 5, 2019 at 7:22 PM Thomas Lindroth
<[email protected]> wrote:
>
> On recent kernels the tick remains active on idle adaptive-tick CPUs when a small value is written to /dev/cpu_dma_latency to restrict the highest C-state. Before the idle loop redesign in 4.17, idle CPUs had the tick disabled even when C-states were restricted. Is this change intentional or a regression?
It was intentional, but this is kind of a gray area.
At least for the menu governor you may argue that the decision on
whether or not to stop the tick should be based on the predicted idle
duration.
> I use an x86_64 system built with CONFIG_NO_HZ_FULL that I recently upgraded to the 4.19 series from the 4.14 series. I noticed that adaptive-tick CPUs (nohz_full=1-7) still fire timer interrupts about 1000 times/s (CONFIG_HZ_1000=y) even when they are mostly idle. Some debugging showed that this only happens when a program is writing to /dev/cpu_dma_latency to restrict C-states. The old 4.14 kernel only has around 10 timer interrupts per second on an idle adaptive-tick CPU even when C-states are restricted that way.
>
> I would expect an adaptive-tick CPU to turn off the tick when it has 0 or 1 processes to run and enable the tick for 2 or more processes. Kernels after 4.17 instead keep the tick on with 0 processes or with 2 or more, and turn it off only in the 1-process case. Since the tick is off when a single process is running, that workload isn't directly harmed by the change, but if the CPU uses hyperthreading the constant wakeups on an idle HT sibling will reduce performance on the other sibling.
>
> The way I look for timer interrupts is by comparing the LOC line in /proc/interrupts or using the hrtimer_expire_entry tracepoint with function=tick_sched_timer. Both methods seem to give the same results.
>
> I can reproduce the problem using an i7-4790K CPU with /sys/devices/system/cpu/cpuidle/current_driver:intel_idle. I can also reproduce the problem on an old core2duo laptop with current_driver:acpi_idle, but I can't reproduce the problem in a virtual machine where current_driver:none. I also can't reproduce the problem if C-states are restricted using the intel_idle.max_cstate=1 kernel argument instead of /dev/cpu_dma_latency.
>
> The commit that introduced the change is 554c8aa8ec "sched: idle: Select idle state before stopping the tick" in v4.17 and the problem exists at least up to kernel 5.1 using the menu cpuidle governor.
Restoring the previous behavior in this case should be relatively
straightforward. I'll send you a patch to do that later.
On Saturday, July 6, 2019 10:17:15 AM CEST Rafael J. Wysocki wrote:
> On Fri, Jul 5, 2019 at 7:22 PM Thomas Lindroth
> <[email protected]> wrote:
> >
> > On recent kernels the tick remains active on idle adaptive-tick CPUs when a small
> > value is written to /dev/cpu_dma_latency to restrict the highest C-state. Before the
> > idle loop redesign in 4.17, idle CPUs had the tick disabled even when C-states were
> > restricted. Is this change intentional or a regression?
>
> It was intentional, but this is kind of a gray area.
>
> At least for the menu governor you may argue that the decision on
> whether or not to stop the tick should be based on the predicted idle
> duration.
But also see below.
> > I use an x86_64 system built with CONFIG_NO_HZ_FULL that I recently upgraded to the 4.19 series from the 4.14 series.
> > I noticed that adaptive-tick CPUs (nohz_full=1-7) still fire timer interrupts about 1000 times/s (CONFIG_HZ_1000=y) even
> > when they are mostly idle. Some debugging showed that this only happens when a program is writing to
> > /dev/cpu_dma_latency to restrict C-states. The old 4.14 kernel only has around 10 timer interrupts per second on an
> > idle adaptive-tick CPU even when C-states are restricted that way.
> >
> > I would expect an adaptive-tick CPU to turn off the tick when it has 0 or 1 processes to run and enable the tick for 2 or
> > more processes. Kernels after 4.17 instead keep the tick on with 0 processes or with 2 or more, and turn it off only in the
> > 1-process case. Since the tick is off when a single process is running, that workload isn't directly harmed by the change,
> > but if the CPU uses hyperthreading the constant wakeups on an idle HT sibling will reduce performance on the other sibling.
So it looks like the idle loop needs a special case for adaptive-tick CPUs.
> >
> > The way I look for timer interrupts is by comparing the LOC line in /proc/interrupts or using the hrtimer_expire_entry
> > tracepoint with function=tick_sched_timer. Both methods seem to give the same results.
> >
> > I can reproduce the problem using an i7-4790K CPU with /sys/devices/system/cpu/cpuidle/current_driver:intel_idle. I can
> > also reproduce the problem on an old core2duo laptop with current_driver:acpi_idle, but I can't reproduce the problem
> > in a virtual machine where current_driver:none. I also can't reproduce the problem if C-states are restricted using the
> > intel_idle.max_cstate=1 kernel argument instead of /dev/cpu_dma_latency.
> >
> > The commit that introduced the change is 554c8aa8ec "sched: idle: Select idle state before stopping the tick" in v4.17
> > and the problem exists at least up to kernel 5.1 using the menu cpuidle governor.
>
> Restoring the previous behavior in this case should be relatively
> straightforward. I'll send you a patch to do that later.
The patch is below, but note that it adds the tick-stopping overhead to the idle loop
for CPUs that are not adaptive-tick when PM QoS latency constraints are used,
which is not desirable in general.
Please test it, but as I said above, the real solution appears to be to treat adaptive-tick
CPUs in a special way in the idle loop.
---
drivers/cpuidle/governors/menu.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -302,9 +302,10 @@ static int menu_select(struct cpuidle_dr
!drv->states[0].disabled && !dev->states_usage[0].disable)) {
/*
* In this case state[0] will be used no matter what, so return
- * it right away and keep the tick running.
+ * it right away and keep the tick running if state[0] is a
+ * polling one.
*/
- *stop_tick = false;
+ *stop_tick = !!(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
return 0;
}
@@ -395,16 +396,9 @@ static int menu_select(struct cpuidle_dr
return idx;
}
- if (s->exit_latency > latency_req) {
- /*
- * If we break out of the loop for latency reasons, use
- * the target residency of the selected state as the
- * expected idle duration so that the tick is retained
- * as long as that target residency is low enough.
- */
- predicted_us = drv->states[idx].target_residency;
+ if (s->exit_latency > latency_req)
break;
- }
+
idx = i;
}
On 7/6/19 1:06 PM, Rafael J. Wysocki wrote:
> The patch is below, but note that it adds the tick-stopping overhead to the idle loop
> for CPUs that are not adaptive-tick when PM QoS latency constraints are used,
> which is not desirable in general.
>
> Please test it, but as I said above, the real solution appears to be to treat adaptive-tick
> CPUs in a special way in the idle loop.
>
> ---
> drivers/cpuidle/governors/menu.c | 16 +++++-----------
> 1 file changed, 5 insertions(+), 11 deletions(-)
>
> Index: linux-pm/drivers/cpuidle/governors/menu.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> +++ linux-pm/drivers/cpuidle/governors/menu.c
> @@ -302,9 +302,10 @@ static int menu_select(struct cpuidle_dr
> !drv->states[0].disabled && !dev->states_usage[0].disable)) {
> /*
> * In this case state[0] will be used no matter what, so return
> - * it right away and keep the tick running.
> + * it right away and keep the tick running if state[0] is a
> + * polling one.
> */
> - *stop_tick = false;
> + *stop_tick = !!(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
> return 0;
> }
>
> @@ -395,16 +396,9 @@ static int menu_select(struct cpuidle_dr
>
> return idx;
> }
> - if (s->exit_latency > latency_req) {
> - /*
> - * If we break out of the loop for latency reasons, use
> - * the target residency of the selected state as the
> - * expected idle duration so that the tick is retained
> - * as long as that target residency is low enough.
> - */
> - predicted_us = drv->states[idx].target_residency;
> + if (s->exit_latency > latency_req)
> break;
> - }
> +
> idx = i;
> }
I tested the patch and it appears to work. Idle CPUs now have ticks disabled even
when /dev/cpu_dma_latency is used.
I also want to thank you for your work on the idle loop redesign. Overall it works
much better than before. I used to have a problem where idle CPUs would end up
doing C0 polling for a long time, resulting in a big performance drop on the HT
sibling. When benchmarking I always had to offline siblings to get consistent
results. That problem was fixed in the redesign.
On Saturday, July 6, 2019 3:02:11 PM CEST Thomas Lindroth wrote:
> On 7/6/19 1:06 PM, Rafael J. Wysocki wrote:
> > The patch is below, but note that it adds the tick-stopping overhead to the idle loop
> > for CPUs that are not adaptive-tick when PM QoS latency constraints are used,
> > which is not desirable in general.
> >
> > Please test it, but as I said above, the real solution appears to be to treat adaptive-tick
> > CPUs in a special way in the idle loop.
> >
> > ---
> > drivers/cpuidle/governors/menu.c | 16 +++++-----------
> > 1 file changed, 5 insertions(+), 11 deletions(-)
> >
> > Index: linux-pm/drivers/cpuidle/governors/menu.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/governors/menu.c
> > +++ linux-pm/drivers/cpuidle/governors/menu.c
> > @@ -302,9 +302,10 @@ static int menu_select(struct cpuidle_dr
> > !drv->states[0].disabled && !dev->states_usage[0].disable)) {
> > /*
> > * In this case state[0] will be used no matter what, so return
> > - * it right away and keep the tick running.
> > + * it right away and keep the tick running if state[0] is a
> > + * polling one.
> > */
> > - *stop_tick = false;
> > + *stop_tick = !!(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
> > return 0;
> > }
> >
> > @@ -395,16 +396,9 @@ static int menu_select(struct cpuidle_dr
> >
> > return idx;
> > }
> > - if (s->exit_latency > latency_req) {
> > - /*
> > - * If we break out of the loop for latency reasons, use
> > - * the target residency of the selected state as the
> > - * expected idle duration so that the tick is retained
> > - * as long as that target residency is low enough.
> > - */
> > - predicted_us = drv->states[idx].target_residency;
> > + if (s->exit_latency > latency_req)
> > break;
> > - }
> > +
> > idx = i;
> > }
>
> I tested the patch and it appears to work. Idle CPUs now have ticks disabled even
> when /dev/cpu_dma_latency is used.
OK, thanks, but as I said previously, you'd see the problem again with the PM QoS
latency constraint set to 0, which is somewhat inconsistent. Also, this fix is
specific to the menu governor and the behavior should not depend on the
governor here IMO, so I have another patch to try (appended).
Please test it (instead of the previous one) and report back.
> I also want to thank you for your work on the idle loop redesign. Overall it works
> much better than before. I used to have a problem where idle CPUs would end up
> doing C0 polling for a long time, resulting in a big performance drop on the HT
> sibling. When benchmarking I always had to offline siblings to get consistent
> results. That problem was fixed in the redesign.
>
Thank you, much appreciated.
---
kernel/sched/idle.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -191,7 +191,8 @@ static void cpuidle_idle_call(void)
*/
next_state = cpuidle_select(drv, dev, &stop_tick);
- if (stop_tick || tick_nohz_tick_stopped())
+ if (stop_tick || tick_nohz_tick_stopped() ||
+ !housekeeping_cpu(dev->cpu, HK_FLAG_TICK))
tick_nohz_idle_stop_tick();
else
tick_nohz_idle_retain_tick();
On 7/10/19 12:22 PM, Rafael J. Wysocki wrote:
> On Saturday, July 6, 2019 3:02:11 PM CEST Thomas Lindroth wrote:
>> On 7/6/19 1:06 PM, Rafael J. Wysocki wrote:
>>> The patch is below, but note that it adds the tick-stopping overhead to the idle loop
>>> for CPUs that are not adaptive-tick when PM QoS latency constraints are used,
>>> which is not desirable in general.
>>>
>>> Please test it, but as I said above, the real solution appears to be to treat adaptive-tick
>>> CPUs in a special way in the idle loop.
>>>
>>> ---
>>> drivers/cpuidle/governors/menu.c | 16 +++++-----------
>>> 1 file changed, 5 insertions(+), 11 deletions(-)
>>>
>>> Index: linux-pm/drivers/cpuidle/governors/menu.c
>>> ===================================================================
>>> --- linux-pm.orig/drivers/cpuidle/governors/menu.c
>>> +++ linux-pm/drivers/cpuidle/governors/menu.c
>>> @@ -302,9 +302,10 @@ static int menu_select(struct cpuidle_dr
>>> !drv->states[0].disabled && !dev->states_usage[0].disable)) {
>>> /*
>>> * In this case state[0] will be used no matter what, so return
>>> - * it right away and keep the tick running.
>>> + * it right away and keep the tick running if state[0] is a
>>> + * polling one.
>>> */
>>> - *stop_tick = false;
>>> + *stop_tick = !!(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
>>> return 0;
>>> }
>>>
>>> @@ -395,16 +396,9 @@ static int menu_select(struct cpuidle_dr
>>>
>>> return idx;
>>> }
>>> - if (s->exit_latency > latency_req) {
>>> - /*
>>> - * If we break out of the loop for latency reasons, use
>>> - * the target residency of the selected state as the
>>> - * expected idle duration so that the tick is retained
>>> - * as long as that target residency is low enough.
>>> - */
>>> - predicted_us = drv->states[idx].target_residency;
>>> + if (s->exit_latency > latency_req)
>>> break;
>>> - }
>>> +
>>> idx = i;
>>> }
>>
>> I tested the patch and it appears to work. Idle CPUs now have ticks disabled even
>> when /dev/cpu_dma_latency is used.
>
> OK, thanks, but as I said previously, you'd see the problem again with the PM QoS
> latency constraint set to 0, which is somewhat inconsistent. Also, this fix is
> specific to the menu governor and the behavior should not depend on the
> governor here IMO, so I have another patch to try (appended).
>
> Please test it (instead of the previous one) and report back.
>
> ---
> kernel/sched/idle.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -191,7 +191,8 @@ static void cpuidle_idle_call(void)
> */
> next_state = cpuidle_select(drv, dev, &stop_tick);
>
> - if (stop_tick || tick_nohz_tick_stopped())
> + if (stop_tick || tick_nohz_tick_stopped() ||
> + !housekeeping_cpu(dev->cpu, HK_FLAG_TICK))
> tick_nohz_idle_stop_tick();
> else
> tick_nohz_idle_retain_tick();
>
I tested this patch and it seems to work fine using the menu governor
and PM QoS latency constraints matching each C-state, including 0.
I didn't test the TEO governor.
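For reference, I stepped through the constraint values with something like this sketch; it takes each exit latency from cpu0's cpuidle states (state0's is 0) and holds it while I watch /proc/interrupts. The 10 second window is just what I happened to use:

/* sketch: hold each C-state's exit latency as the cpu_dma_latency constraint */
#include <fcntl.h>
#include <glob.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        glob_t g;
        size_t i;

        if (glob("/sys/devices/system/cpu/cpu0/cpuidle/state*/latency",
                 0, NULL, &g))
                return 1;

        for (i = 0; i < g.gl_pathc; i++) {
                int latency_us;         /* written as the raw 32-bit value the kernel expects */
                FILE *f = fopen(g.gl_pathv[i], "r");
                int fd;

                if (!f || fscanf(f, "%d", &latency_us) != 1)
                        return 1;
                fclose(f);

                fd = open("/dev/cpu_dma_latency", O_WRONLY);
                if (fd < 0 || write(fd, &latency_us, sizeof(latency_us)) < 0)
                        return 1;
                printf("holding %d us, check /proc/interrupts now\n", latency_us);
                sleep(10);              /* window to watch the tick behaviour */
                close(fd);              /* closing the fd drops the constraint */
        }
        globfree(&g);
        return 0;
}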