Date: Mon, 9 Feb 2015 10:14:36 -0800
From: Steven Noonan
To: Preeti U Murthy
Cc: Peter Zijlstra, Thomas Gleixner, jacob.jun.pan@intel.com, Arjan van de Ven, Linux Kernel mailing List, Frédéric Weisbecker, frederic@kernel.org, Daniel Lezcano, Amit Kucheria, Eduardo Valentin, Viresh Kumar, rui.zhang@intel.com
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism
Message-ID: <20150209181436.GA29721@rincewind.us-west-2.compute.internal>
References: <20150209044852.6231.66456.stgit@preeti.in.ibm.com> <54D89F46.4030803@linux.vnet.ibm.com>

On Mon, Feb 9, 2015 at 9:56 AM, Steven Noonan wrote:
> On Mon, Feb 9, 2015 at 3:51 AM, Preeti U Murthy
> wrote:
>> Hi Steven,
>>
>> On 02/09/2015 01:02 PM, Steven Noonan wrote:
>>> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
>>> wrote:
>>>> The powerclamp driver injects idle periods to stay within the thermal constraints.
>>>> The driver does a fake idle by spawning per-cpu threads that call the mwait
>>>> instruction. This behavior of fake idle can confuse the other kernel subsystems.
>>>> For instance it calls into the nohz tick handlers, which are meant to be called
>>>> only by the idle thread. It sets the state of the tick in each cpu to idle and
>>>> stops the tick, when there are tasks on the runqueue. As a result the callers of
>>>> idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the
>>>> former thinks that the cpu is busy, the latter thinks that it is idle. The outcome
>>>> may be inconsistency in the scheduler/nohz states which can lead to serious
>>>> consequences. One of them was reported on this thread:
>>>> https://lkml.org/lkml/2014/12/11/365.
>>>>
>>>> Thomas posted out a patch to disable the powerclamp driver from calling into the
>>>> tick nohz code which has taken care of the above regression for the moment. However
>>>> powerclamp driver as a result, will not be able to inject idle periods due to the
>>>> presence of periodic ticks. With the current design of fake idle, we cannot move
>>>> towards a better solution.
>>>> https://lkml.org/lkml/2014/12/18/169
>>>>
>>>> This patch aims at removing the concept of fake idle and instead makes the cpus
>>>> truly idle by throttling the runqueues during the idle injection periods. The situation
>>>> is in fact very similar to throttling of cfs_rqs when they exceed their bandwidths.
>>>> The idle injection metrics can be mapped to the bandwidth control metrics 'quota' and
>>>> 'period' to achieve the same result. When the powerclamping is begun or when the
>>>> clamping controls have been modified, the bandwidth for the root task group is set.
>>>> The 'quota' will be the amount of time that the system needs to be busy and 'period'
>>>> will be the sum of this busy duration and the idle duration as calculated by the driver.
>>>> This gets rid of per-cpu kthreads, control cpu, hotplug notifiers and clamping mask since
>>>> the thread starting powerclamping will set the bandwidth and throttling of all cpus will
>>>> automatically fall in place. None of the other cpus need be bothered about this. This
>>>> simplifies the design of the driver.
>>>>
>>>> Of course this is only if the idle injection metrics can be conveniently transformed
>>>> into bandwidth control metrics. There are a couple of other primary concerns around if
>>>> doing the below two in this patch is valid.
>>>> a. This patch exports the functions to set the quota and period of task groups.
>>>> b. This patch removes the constraint of not being able to set the root task grp's bandwidth.
>>>>
>>>> Signed-off-by: Preeti U Murthy
>>>
>>> This doesn't compile.
>>
>> Thanks for reporting this! I realized that I had not compiled in the powerclamp driver
>> as a module while compile testing it. I was focusing on the issues with the design and
>> failed to cross verify this. Apologies for the inconvenience.
>>
>> Find the diff compile tested below.
>>
>> I also realized that clamp_cpus() that sets the bandwidth cannot be called from
>> multiple places. Currently I am calling it from end_powerclamp(), when the user changes the
>> idle clamping duration and from a queued timer. This will require synchronization between
>> callers which is not really called for. The queued wakeup_timer alone can re-evaluate the
>> clamping metrics after every throttle-unthrottle period and this should suffice as far
>> as I can see. Thoughts ?
>
> Hmm, I've had two system lockups so far while running a kernel with
> intel_powerclamp loaded. Both times it slowly ground to a halt and
> processes piled up...

I let the hung system sit for a while and it eventually partially
recovered and came back with this in dmesg (9:24 is around system boot
time, 9:51 is around when it went out to lunch, and 10:01 is when it
came back):

Feb 09 09:24:16 osprey kernel: wlp2s0: authenticate with e0:3f:49:28:3b:8c
Feb 09 09:24:16 osprey kernel: wlp2s0: send auth to e0:3f:49:28:3b:8c (try 1/3)
Feb 09 09:24:16 osprey kernel: wlp2s0: authenticated
Feb 09 09:24:16 osprey kernel: wlp2s0: associate with e0:3f:49:28:3b:8c (try 1/3)
Feb 09 09:24:16 osprey kernel: wlp2s0: RX AssocResp from e0:3f:49:28:3b:8c (capab=0x1011 status=0 aid=4)
Feb 09 09:24:16 osprey kernel: wlp2s0: associated
Feb 09 09:50:19 osprey kernel: intel_powerclamp: Start idle injection to reduce power
Feb 09 09:50:23 osprey kernel: intel_powerclamp: Stop forced idle injection
Feb 09 09:50:27 osprey kernel: intel_powerclamp: Start idle injection to reduce power
Feb 09 09:50:31 osprey kernel: intel_powerclamp: Stop forced idle injection
Feb 09 09:50:31 osprey kernel: intel_powerclamp: Start idle injection to reduce power
Feb 09 09:50:57 osprey kernel: Watchdog[1271]: segfault at 0 ip 00007f580734ad7b sp 00007f57f04175a0 error 6 in chromium[7f580310e000+66c6000]
Feb 09 09:51:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU
Feb 09 09:51:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/8abc, nonlazy_posted: 0, L.
Feb 09 09:51:46 osprey kernel: (t=60000 jiffies g=40078 c=40077 q=69418) Feb 09 09:51:46 osprey kernel: Task dump for CPU 0: Feb 09 09:51:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008 Feb 09 09:51:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000 Feb 09 09:51:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300 Feb 09 09:51:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300 Feb 09 09:51:46 osprey kernel: Call Trace: Feb 09 09:51:46 osprey kernel: [] sched_show_task+0xb6/0x120 Feb 09 09:51:46 osprey kernel: [] dump_cpu_task+0x3d/0x50 Feb 09 09:51:46 osprey kernel: [] rcu_dump_cpu_stacks+0x91/0xd0 Feb 09 09:51:46 osprey kernel: [] rcu_check_callbacks+0x4d2/0x7e0 Feb 09 09:51:46 osprey kernel: [] ? __this_cpu_preempt_check+0x13/0x20 Feb 09 09:51:46 osprey kernel: [] ? hrtimer_run_queues+0x36/0x110 Feb 09 09:51:46 osprey kernel: [] update_process_times+0x4b/0x80 Feb 09 09:51:46 osprey kernel: [] tick_sched_handle.isra.13+0x25/0x60 Feb 09 09:51:46 osprey kernel: [] tick_sched_timer+0x45/0x80 Feb 09 09:51:46 osprey kernel: [] __run_hrtimer+0x77/0x250 Feb 09 09:51:46 osprey kernel: [] ? tick_sched_handle.isra.13+0x60/0x60 Feb 09 09:51:46 osprey kernel: [] hrtimer_interrupt+0x107/0x250 Feb 09 09:51:46 osprey kernel: [] local_apic_timer_interrupt+0x3b/0x70 Feb 09 09:51:46 osprey kernel: [] smp_apic_timer_interrupt+0x43/0x60 Feb 09 09:51:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 09:51:46 osprey kernel: [] ? 0xffffffffa0b0728b Feb 09 09:51:46 osprey kernel: [] ? 0xffffffffa0b0718d Feb 09 09:51:46 osprey kernel: [] ? 0xffffffffa0b075d0 Feb 09 09:51:46 osprey kernel: [] 0xffffffffa0b075e5 Feb 09 09:51:46 osprey kernel: [] call_timer_fn+0x3a/0x160 Feb 09 09:51:46 osprey kernel: [] ? 
0xffffffffa0b075d0 Feb 09 09:51:46 osprey kernel: [] run_timer_softirq+0x264/0x310 Feb 09 09:51:46 osprey kernel: [] __do_softirq+0xf6/0x2e0 Feb 09 09:51:46 osprey kernel: [] irq_exit+0x8e/0xb0 Feb 09 09:51:46 osprey kernel: [] smp_apic_timer_interrupt+0x48/0x60 Feb 09 09:51:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 09:51:46 osprey kernel: [] ? cpuidle_enter_state+0x65/0x1a0 Feb 09 09:51:46 osprey kernel: [] ? cpuidle_enter_state+0x51/0x1a0 Feb 09 09:51:46 osprey kernel: [] cpuidle_enter+0x17/0x20 Feb 09 09:51:46 osprey kernel: [] cpu_startup_entry+0x3f4/0x460 Feb 09 09:51:46 osprey kernel: [] rest_init+0x85/0x90 Feb 09 09:51:46 osprey kernel: [] start_kernel+0x445/0x452 Feb 09 09:51:46 osprey kernel: [] ? early_idt_handlers+0x120/0x120 Feb 09 09:51:46 osprey kernel: [] x86_64_start_reservations+0x2a/0x2c Feb 09 09:51:46 osprey kernel: [] x86_64_start_kernel+0x143/0x152 Feb 09 09:54:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU Feb 09 09:54:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/49df, nonlazy_posted: 0, L. Feb 09 09:54:46 osprey kernel: (t=240003 jiffies g=40078 c=40077 q=72706) Feb 09 09:54:46 osprey kernel: Task dump for CPU 0: Feb 09 09:54:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008 Feb 09 09:54:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000 Feb 09 09:54:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300 Feb 09 09:54:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300 Feb 09 09:54:46 osprey kernel: Call Trace: Feb 09 09:54:46 osprey kernel: [] sched_show_task+0xb6/0x120 Feb 09 09:54:46 osprey kernel: [] dump_cpu_task+0x3d/0x50 Feb 09 09:54:46 osprey kernel: [] rcu_dump_cpu_stacks+0x91/0xd0 Feb 09 09:54:46 osprey kernel: [] rcu_check_callbacks+0x4d2/0x7e0 Feb 09 09:54:46 osprey kernel: [] ? 
__this_cpu_preempt_check+0x13/0x20 Feb 09 09:54:46 osprey kernel: [] ? hrtimer_run_queues+0x36/0x110 Feb 09 09:54:46 osprey kernel: [] update_process_times+0x4b/0x80 Feb 09 09:54:46 osprey kernel: [] tick_sched_handle.isra.13+0x25/0x60 Feb 09 09:54:46 osprey kernel: [] tick_sched_timer+0x45/0x80 Feb 09 09:54:46 osprey kernel: [] __run_hrtimer+0x77/0x250 Feb 09 09:54:46 osprey kernel: [] ? tick_sched_handle.isra.13+0x60/0x60 Feb 09 09:54:46 osprey kernel: [] hrtimer_interrupt+0x107/0x250 Feb 09 09:54:46 osprey kernel: [] local_apic_timer_interrupt+0x3b/0x70 Feb 09 09:54:46 osprey kernel: [] smp_apic_timer_interrupt+0x43/0x60 Feb 09 09:54:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 09:54:46 osprey kernel: [] ? 0xffffffffa0b0726d Feb 09 09:54:46 osprey kernel: [] ? 0xffffffffa0b0718d Feb 09 09:54:46 osprey kernel: [] ? 0xffffffffa0b075d0 Feb 09 09:54:46 osprey kernel: [] 0xffffffffa0b075e5 Feb 09 09:54:46 osprey kernel: [] call_timer_fn+0x3a/0x160 Feb 09 09:54:46 osprey kernel: [] ? 0xffffffffa0b075d0 Feb 09 09:54:46 osprey kernel: [] run_timer_softirq+0x264/0x310 Feb 09 09:54:46 osprey kernel: [] __do_softirq+0xf6/0x2e0 Feb 09 09:54:46 osprey kernel: [] irq_exit+0x8e/0xb0 Feb 09 09:54:46 osprey kernel: [] smp_apic_timer_interrupt+0x48/0x60 Feb 09 09:54:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 09:54:46 osprey kernel: [] ? cpuidle_enter_state+0x65/0x1a0 Feb 09 09:54:46 osprey kernel: [] ? cpuidle_enter_state+0x51/0x1a0 Feb 09 09:54:46 osprey kernel: [] cpuidle_enter+0x17/0x20 Feb 09 09:54:46 osprey kernel: [] cpu_startup_entry+0x3f4/0x460 Feb 09 09:54:46 osprey kernel: [] rest_init+0x85/0x90 Feb 09 09:54:46 osprey kernel: [] start_kernel+0x445/0x452 Feb 09 09:54:46 osprey kernel: [] ? 
early_idt_handlers+0x120/0x120 Feb 09 09:54:46 osprey kernel: [] x86_64_start_reservations+0x2a/0x2c Feb 09 09:54:46 osprey kernel: [] x86_64_start_kernel+0x143/0x152 Feb 09 09:57:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU Feb 09 09:57:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/0902, nonlazy_posted: 0, L. Feb 09 09:57:46 osprey kernel: (t=420006 jiffies g=40078 c=40077 q=76104) Feb 09 09:57:46 osprey kernel: Task dump for CPU 0: Feb 09 09:57:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008 Feb 09 09:57:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000 Feb 09 09:57:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300 Feb 09 09:57:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300 Feb 09 09:57:46 osprey kernel: Call Trace: Feb 09 09:57:46 osprey kernel: [] sched_show_task+0xb6/0x120 Feb 09 09:57:46 osprey kernel: [] dump_cpu_task+0x3d/0x50 Feb 09 09:57:46 osprey kernel: [] rcu_dump_cpu_stacks+0x91/0xd0 Feb 09 09:57:46 osprey kernel: [] rcu_check_callbacks+0x4d2/0x7e0 Feb 09 09:57:46 osprey kernel: [] ? __this_cpu_preempt_check+0x13/0x20 Feb 09 09:57:46 osprey kernel: [] ? hrtimer_run_queues+0x36/0x110 Feb 09 09:57:46 osprey kernel: [] update_process_times+0x4b/0x80 Feb 09 09:57:46 osprey kernel: [] tick_sched_handle.isra.13+0x25/0x60 Feb 09 09:57:46 osprey kernel: [] tick_sched_timer+0x45/0x80 Feb 09 09:57:46 osprey kernel: [] __run_hrtimer+0x77/0x250 Feb 09 09:57:46 osprey kernel: [] ? tick_sched_handle.isra.13+0x60/0x60 Feb 09 09:57:46 osprey kernel: [] hrtimer_interrupt+0x107/0x250 Feb 09 09:57:46 osprey kernel: [] local_apic_timer_interrupt+0x3b/0x70 Feb 09 09:57:46 osprey kernel: [] smp_apic_timer_interrupt+0x43/0x60 Feb 09 09:57:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 09:57:46 osprey kernel: [] ? 0xffffffffa0b0726d Feb 09 09:57:46 osprey kernel: [] ? 
0xffffffffa0b0718d Feb 09 09:57:46 osprey kernel: [] ? 0xffffffffa0b075d0 Feb 09 09:57:46 osprey kernel: [] 0xffffffffa0b075e5 Feb 09 09:57:46 osprey kernel: [] call_timer_fn+0x3a/0x160 Feb 09 09:57:46 osprey kernel: [] ? 0xffffffffa0b075d0 Feb 09 09:57:46 osprey kernel: [] run_timer_softirq+0x264/0x310 Feb 09 09:57:46 osprey kernel: [] __do_softirq+0xf6/0x2e0 Feb 09 09:57:46 osprey kernel: [] irq_exit+0x8e/0xb0 Feb 09 09:57:46 osprey kernel: [] smp_apic_timer_interrupt+0x48/0x60 Feb 09 09:57:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 09:57:46 osprey kernel: [] ? cpuidle_enter_state+0x65/0x1a0 Feb 09 09:57:46 osprey kernel: [] ? cpuidle_enter_state+0x51/0x1a0 Feb 09 09:57:46 osprey kernel: [] cpuidle_enter+0x17/0x20 Feb 09 09:57:46 osprey kernel: [] cpu_startup_entry+0x3f4/0x460 Feb 09 09:57:46 osprey kernel: [] rest_init+0x85/0x90 Feb 09 09:57:46 osprey kernel: [] start_kernel+0x445/0x452 Feb 09 09:57:46 osprey kernel: [] ? early_idt_handlers+0x120/0x120 Feb 09 09:57:46 osprey kernel: [] x86_64_start_reservations+0x2a/0x2c Feb 09 09:57:46 osprey kernel: [] x86_64_start_kernel+0x143/0x152 Feb 09 10:00:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU Feb 09 10:00:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/c825, nonlazy_posted: 0, L. 
Feb 09 10:00:46 osprey kernel: (t=600009 jiffies g=40078 c=40077 q=79433) Feb 09 10:00:46 osprey kernel: Task dump for CPU 0: Feb 09 10:00:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008 Feb 09 10:00:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000 Feb 09 10:00:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300 Feb 09 10:00:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300 Feb 09 10:00:46 osprey kernel: Call Trace: Feb 09 10:00:46 osprey kernel: [] sched_show_task+0xb6/0x120 Feb 09 10:00:46 osprey kernel: [] dump_cpu_task+0x3d/0x50 Feb 09 10:00:46 osprey kernel: [] rcu_dump_cpu_stacks+0x91/0xd0 Feb 09 10:00:46 osprey kernel: [] rcu_check_callbacks+0x4d2/0x7e0 Feb 09 10:00:46 osprey kernel: [] ? __this_cpu_preempt_check+0x13/0x20 Feb 09 10:00:46 osprey kernel: [] ? hrtimer_run_queues+0x36/0x110 Feb 09 10:00:46 osprey kernel: [] update_process_times+0x4b/0x80 Feb 09 10:00:46 osprey kernel: [] tick_sched_handle.isra.13+0x25/0x60 Feb 09 10:00:46 osprey kernel: [] tick_sched_timer+0x45/0x80 Feb 09 10:00:46 osprey kernel: [] __run_hrtimer+0x77/0x250 Feb 09 10:00:46 osprey kernel: [] ? tick_sched_handle.isra.13+0x60/0x60 Feb 09 10:00:46 osprey kernel: [] hrtimer_interrupt+0x107/0x250 Feb 09 10:00:46 osprey kernel: [] local_apic_timer_interrupt+0x3b/0x70 Feb 09 10:00:46 osprey kernel: [] smp_apic_timer_interrupt+0x43/0x60 Feb 09 10:00:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80 Feb 09 10:00:46 osprey kernel: [] ? 0xffffffffa0b0726d Feb 09 10:00:46 osprey kernel: [] ? 0xffffffffa0b0718d Feb 09 10:00:46 osprey kernel: [] ? 0xffffffffa0b075d0 Feb 09 10:00:46 osprey kernel: [] 0xffffffffa0b075e5 Feb 09 10:00:46 osprey kernel: [] call_timer_fn+0x3a/0x160 Feb 09 10:00:46 osprey kernel: [] ? 
0xffffffffa0b075d0
Feb 09 10:00:46 osprey kernel: [] run_timer_softirq+0x264/0x310
Feb 09 10:00:46 osprey kernel: [] __do_softirq+0xf6/0x2e0
Feb 09 10:00:46 osprey kernel: [] irq_exit+0x8e/0xb0
Feb 09 10:00:46 osprey kernel: [] smp_apic_timer_interrupt+0x48/0x60
Feb 09 10:00:46 osprey kernel: [] apic_timer_interrupt+0x6d/0x80
Feb 09 10:00:46 osprey kernel: [] ? cpuidle_enter_state+0x65/0x1a0
Feb 09 10:00:46 osprey kernel: [] ? cpuidle_enter_state+0x51/0x1a0
Feb 09 10:00:46 osprey kernel: [] cpuidle_enter+0x17/0x20
Feb 09 10:00:46 osprey kernel: [] cpu_startup_entry+0x3f4/0x460
Feb 09 10:00:46 osprey kernel: [] rest_init+0x85/0x90
Feb 09 10:00:46 osprey kernel: [] start_kernel+0x445/0x452
Feb 09 10:00:46 osprey kernel: [] ? early_idt_handlers+0x120/0x120
Feb 09 10:00:46 osprey kernel: [] x86_64_start_reservations+0x2a/0x2c
Feb 09 10:00:46 osprey kernel: [] x86_64_start_kernel+0x143/0x152

I couldn't do much at this point. While I had some interactivity,
processes were failing to spawn, so I just had to run
'echo s > /proc/sysrq-trigger; echo b > /proc/sysrq-trigger'.

>> ----------------------------------------------------------------------------------
>>
>> V2 of intel_powerclamp driver
>>
>> From: Preeti U Murthy
>>
>>
>> ---
>> drivers/thermal/Kconfig | 1
>> drivers/thermal/intel_powerclamp.c | 301 ++++++++++--------------------------
>> include/linux/sched.h | 9 +
>> kernel/sched/core.c | 6 -
>> kernel/sched/sched.h | 5 -
>> 5 files changed, 95 insertions(+), 227 deletions(-)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index af40db0..4b7cd02 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -233,6 +233,7 @@ config INTEL_POWERCLAMP
>> depends on THERMAL
>> depends on X86
>> depends on CPU_SUP_INTEL
>> + select CFS_BANDWIDTH
>> help
>> Enable this to enable Intel PowerClamp idle injection driver. This
>> enforce idle time which results in more package C-state residency.
The >> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c >> index 6ceebd6..4bd07bb 100644 >> --- a/drivers/thermal/intel_powerclamp.c >> +++ b/drivers/thermal/intel_powerclamp.c >> @@ -51,6 +51,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -78,20 +79,18 @@ static struct dentry *debug_dir; >> static unsigned int set_target_ratio; >> static unsigned int current_ratio; >> static bool should_skip; >> +static unsigned int count = 0; >> static bool reduce_irq; >> static atomic_t idle_wakeup_counter; >> -static unsigned int control_cpu; /* The cpu assigned to collect stat and update >> - * control parameters. default to BSP but BSP >> - * can be offlined. >> - */ >> static bool clamping; >> >> +/* Timer that evaluates bandwidth reset if clamping knobs have changed */ >> +static void clamp_timer_fn(unsigned long foo); >> +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0); >> + >> +static void clamp_cpus(void); >> >> -static struct task_struct * __percpu *powerclamp_thread; >> static struct thermal_cooling_device *cooling_dev; >> -static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu >> - * clamping thread >> - */ >> >> static unsigned int duration; >> static unsigned int pkg_cstate_ratio_cur; >> @@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp) >> smp_mb(); >> >> exit_win: >> - >> + clamp_cpus(); >> return ret; >> } >> >> @@ -256,10 +255,6 @@ static u64 pkg_state_counter(void) >> return count; >> } >> >> -static void noop_timer(unsigned long foo) >> -{ >> - /* empty... 
just the fact that we get the interrupt wakes us up */ >> -} >> >> static unsigned int get_compensation(int ratio) >> { >> @@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio, >> return set_target_ratio + guard <= current_ratio; >> } >> >> -static int clamp_thread(void *arg) >> +static void clamp_cpus(void) >> { >> - int cpunr = (unsigned long)arg; >> - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0); >> - static const struct sched_param param = { >> - .sched_priority = MAX_USER_RT_PRIO/2, >> - }; >> - unsigned int count = 0; >> unsigned int target_ratio; >> + u64 quota, period; >> + int sleeptime; >> + unsigned long target_jiffies; >> + unsigned int guard; >> + unsigned int compensation; >> + int interval; /* jiffies to sleep for each attempt */ >> + unsigned int duration_jiffies; >> + unsigned int window_size_now; >> + >> +again: >> + if (clamping == false) >> + goto out; >> + >> + quota = RUNTIME_INF; >> + period = NSEC_PER_SEC; >> + compensation = 0; >> + duration_jiffies = msecs_to_jiffies(duration); >> + /* >> + * make sure user selected ratio does not take effect until >> + * the next round. adjust target_ratio if user has changed >> + * target such that we can converge quickly. >> + */ >> + target_ratio = set_target_ratio; >> + guard = 1 + target_ratio/20; >> + window_size_now = window_size; >> >> - set_bit(cpunr, cpu_clamping_mask); >> - set_freezable(); >> - init_timer_on_stack(&wakeup_timer); >> - sched_setscheduler(current, SCHED_FIFO, ¶m); >> - >> - while (true == clamping && !kthread_should_stop() && >> - cpu_online(cpunr)) { >> - int sleeptime; >> - unsigned long target_jiffies; >> - unsigned int guard; >> - unsigned int compensation = 0; >> - int interval; /* jiffies to sleep for each attempt */ >> - unsigned int duration_jiffies = msecs_to_jiffies(duration); >> - unsigned int window_size_now; >> - >> - try_to_freeze(); >> - /* >> - * make sure user selected ratio does not take effect until >> - * the next round. 
adjust target_ratio if user has changed >> - * target such that we can converge quickly. >> - */ >> - target_ratio = set_target_ratio; >> - guard = 1 + target_ratio/20; >> - window_size_now = window_size; >> - count++; >> - >> - /* >> - * systems may have different ability to enter package level >> - * c-states, thus we need to compensate the injected idle ratio >> - * to achieve the actual target reported by the HW. >> - */ >> - compensation = get_compensation(target_ratio); >> - interval = duration_jiffies*100/(target_ratio+compensation); >> - >> - /* align idle time */ >> - target_jiffies = roundup(jiffies, interval); >> - sleeptime = target_jiffies - jiffies; >> - if (sleeptime <= 0) >> - sleeptime = 1; >> - schedule_timeout_interruptible(sleeptime); >> - /* >> - * only elected controlling cpu can collect stats and update >> - * control parameters. >> - */ >> - if (cpunr == control_cpu && !(count%window_size_now)) { >> - should_skip = >> - powerclamp_adjust_controls(target_ratio, >> - guard, window_size_now); >> - smp_mb(); >> - } >> - >> - if (should_skip) >> - continue; >> - >> - target_jiffies = jiffies + duration_jiffies; >> - mod_timer(&wakeup_timer, target_jiffies); >> - if (unlikely(local_softirq_pending())) >> - continue; >> - /* >> - * stop tick sched during idle time, interrupts are still >> - * allowed. thus jiffies are updated properly. >> - */ >> - preempt_disable(); >> - /* mwait until target jiffies is reached */ >> - while (time_before(jiffies, target_jiffies)) { >> - unsigned long ecx = 1; >> - unsigned long eax = target_mwait; >> - >> - /* >> - * REVISIT: may call enter_idle() to notify drivers who >> - * can save power during cpu idle. 
same for exit_idle() >> - */ >> - local_touch_nmi(); >> - stop_critical_timings(); >> - mwait_idle_with_hints(eax, ecx); >> - start_critical_timings(); >> - atomic_inc(&idle_wakeup_counter); >> - } >> - preempt_enable(); >> + /* >> + * systems may have different ability to enter package level >> + * c-states, thus we need to compensate the injected idle ratio >> + * to achieve the actual target reported by the HW. >> + */ >> + compensation = get_compensation(target_ratio); >> + interval = duration_jiffies*100/(target_ratio+compensation); >> + >> + /* align idle time */ >> + target_jiffies = roundup(jiffies, interval); >> + sleeptime = target_jiffies - jiffies; >> + if (sleeptime <= 0) >> + sleeptime = 1; >> + >> + if (!(count%window_size_now)) { >> + should_skip = >> + powerclamp_adjust_controls(target_ratio, >> + guard, window_size_now); >> + smp_mb(); >> } >> - del_timer_sync(&wakeup_timer); >> - clear_bit(cpunr, cpu_clamping_mask); >> >> - return 0; >> + if (should_skip) >> + goto again; >> + >> + target_jiffies = jiffies + sleeptime + duration_jiffies; >> + mod_timer(&wakeup_timer, target_jiffies); >> + if (unlikely(local_softirq_pending())) >> + goto again; >> + >> + quota = jiffies_to_usecs(sleeptime); >> + period = jiffies_to_usecs(sleeptime + duration_jiffies); >> + >> +out: >> + tg_set_cfs_quota(&root_task_group, quota); >> + tg_set_cfs_period(&root_task_group, period); >> +} >> + >> +static void clamp_timer_fn(unsigned long foo) >> +{ >> + /* Evaluate to see if clamping controls need to be adjusted */ >> + count++; >> + clamp_cpus(); >> } >> >> /* >> @@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy) >> >> static int start_power_clamp(void) >> { >> - unsigned long cpu; >> - struct task_struct *thread; >> + clamping = true; >> >> /* check if pkg cstate counter is completely 0, abort in this case */ >> if (!has_pkg_state_counter()) { >> @@ -511,108 +482,21 @@ static int start_power_clamp(void) >> } >> >> set_target_ratio = 
clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1); >> - /* prevent cpu hotplug */ >> - get_online_cpus(); >> - >> - /* prefer BSP */ >> - control_cpu = 0; >> - if (!cpu_online(control_cpu)) >> - control_cpu = smp_processor_id(); >> >> - clamping = true; >> schedule_delayed_work(&poll_pkg_cstate_work, 0); >> - >> - /* start one thread per online cpu */ >> - for_each_online_cpu(cpu) { >> - struct task_struct **p = >> - per_cpu_ptr(powerclamp_thread, cpu); >> - >> - thread = kthread_create_on_node(clamp_thread, >> - (void *) cpu, >> - cpu_to_node(cpu), >> - "kidle_inject/%ld", cpu); >> - /* bind to cpu here */ >> - if (likely(!IS_ERR(thread))) { >> - kthread_bind(thread, cpu); >> - wake_up_process(thread); >> - *p = thread; >> - } >> - >> - } >> - put_online_cpus(); >> + clamp_cpus(); >> >> return 0; >> } >> >> static void end_power_clamp(void) >> { >> - int i; >> - struct task_struct *thread; >> - >> clamping = false; >> - /* >> - * make clamping visible to other cpus and give per cpu clamping threads >> - * sometime to exit, or gets killed later. 
>> - */ >> - smp_mb(); >> - msleep(20); >> - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) { >> - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) { >> - pr_debug("clamping thread for cpu %d alive, kill\n", i); >> - thread = *per_cpu_ptr(powerclamp_thread, i); >> - kthread_stop(thread); >> - } >> - } >> -} >> >> -static int powerclamp_cpu_callback(struct notifier_block *nfb, >> - unsigned long action, void *hcpu) >> -{ >> - unsigned long cpu = (unsigned long)hcpu; >> - struct task_struct *thread; >> - struct task_struct **percpu_thread = >> - per_cpu_ptr(powerclamp_thread, cpu); >> - >> - if (false == clamping) >> - goto exit_ok; >> - >> - switch (action) { >> - case CPU_ONLINE: >> - thread = kthread_create_on_node(clamp_thread, >> - (void *) cpu, >> - cpu_to_node(cpu), >> - "kidle_inject/%lu", cpu); >> - if (likely(!IS_ERR(thread))) { >> - kthread_bind(thread, cpu); >> - wake_up_process(thread); >> - *percpu_thread = thread; >> - } >> - /* prefer BSP as controlling CPU */ >> - if (cpu == 0) { >> - control_cpu = 0; >> - smp_mb(); >> - } >> - break; >> - case CPU_DEAD: >> - if (test_bit(cpu, cpu_clamping_mask)) { >> - pr_err("cpu %lu dead but powerclamping thread is not\n", >> - cpu); >> - kthread_stop(*percpu_thread); >> - } >> - if (cpu == control_cpu) { >> - control_cpu = smp_processor_id(); >> - smp_mb(); >> - } >> - } >> - >> -exit_ok: >> - return NOTIFY_OK; >> + clamp_cpus(); >> + del_timer_sync(&wakeup_timer); >> } >> >> -static struct notifier_block powerclamp_cpu_notifier = { >> - .notifier_call = powerclamp_cpu_callback, >> -}; >> - >> static int powerclamp_get_max_state(struct thermal_cooling_device *cdev, >> unsigned long *state) >> { >> @@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev, >> } >> >> exit_set: >> + clamp_cpus(); >> return ret; >> } >> >> @@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused) >> { >> int i = 0; >> >> - seq_printf(m, 
"controlling cpu: %d\n", control_cpu); >> seq_printf(m, "pct confidence steady dynamic (compensation)\n"); >> for (i = 0; i < MAX_TARGET_RATIO; i++) { >> seq_printf(m, "%d\t%lu\t%lu\t%lu\n", >> @@ -762,33 +646,20 @@ file_error: >> static int powerclamp_init(void) >> { >> int retval; >> - int bitmap_size; >> - >> - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long); >> - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL); >> - if (!cpu_clamping_mask) >> - return -ENOMEM; >> >> /* probe cpu features and ids here */ >> retval = powerclamp_probe(); >> if (retval) >> - goto exit_free; >> + goto exit; >> >> /* set default limit, maybe adjusted during runtime based on feedback */ >> window_size = 2; >> - register_hotcpu_notifier(&powerclamp_cpu_notifier); >> - >> - powerclamp_thread = alloc_percpu(struct task_struct *); >> - if (!powerclamp_thread) { >> - retval = -ENOMEM; >> - goto exit_unregister; >> - } >> >> cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL, >> &powerclamp_cooling_ops); >> if (IS_ERR(cooling_dev)) { >> retval = -ENODEV; >> - goto exit_free_thread; >> + goto exit; >> } >> >> if (!duration) >> @@ -798,23 +669,15 @@ static int powerclamp_init(void) >> >> return 0; >> >> -exit_free_thread: >> - free_percpu(powerclamp_thread); >> -exit_unregister: >> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); >> -exit_free: >> - kfree(cpu_clamping_mask); >> +exit: >> return retval; >> } >> module_init(powerclamp_init); >> >> static void powerclamp_exit(void) >> { >> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); >> end_power_clamp(); >> - free_percpu(powerclamp_thread); >> thermal_cooling_device_unregister(cooling_dev); >> - kfree(cpu_clamping_mask); >> >> cancel_delayed_work_sync(&poll_pkg_cstate_work); >> debugfs_remove_recursive(debug_dir); >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 8db31ef..2493942 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -163,6 
+163,11 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift); >> load += n*(FIXED_1-exp); \ >> load >>= FSHIFT; >> >> +/* >> + * single value that denotes runtime == period, ie unlimited time. >> + */ >> +#define RUNTIME_INF ((u64)~0ULL) >> + >> extern unsigned long total_forks; >> extern int nr_threads; >> DECLARE_PER_CPU(unsigned long, process_counts); >> @@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask); >> >> #ifdef CONFIG_CGROUP_SCHED >> extern struct task_group root_task_group; >> +#ifdef CONFIG_CFS_BANDWIDTH >> +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us); >> +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us); >> +#endif /* CONFIG_CFS_BANDWIDTH */ >> #endif /* CONFIG_CGROUP_SCHED */ >> >> extern int task_can_switch_user(struct user_struct *up, >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index e628cb1..7471b06 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr) >> * Every task in system belongs to this group at bootup. >> */ >> struct task_group root_task_group; >> +EXPORT_SYMBOL_GPL(root_task_group); >> LIST_HEAD(task_groups); >> #endif >> >> @@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) >> int i, ret = 0, runtime_enabled, runtime_was_enabled; >> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; >> >> - if (tg == &root_task_group) >> - return -EINVAL; >> - >> /* >> * Ensure we have at some amount of bandwidth every period. 
This is
>> * to prevent reaching a state of large arrears when throttled via
>> @@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>>
>> return tg_set_cfs_bandwidth(tg, period, quota);
>> }
>> +EXPORT_SYMBOL_GPL(tg_set_cfs_quota);
>>
>> long tg_get_cfs_quota(struct task_group *tg)
>> {
>> @@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>>
>> return tg_set_cfs_bandwidth(tg, period, quota);
>> }
>> +EXPORT_SYMBOL_GPL(tg_set_cfs_period);
>>
>> long tg_get_cfs_period(struct task_group *tg)
>> {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 9a2a45c..20493e4 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq);
>> * These are the 'tuning knobs' of the scheduler:
>> */
>>
>> -/*
>> - * single value that denotes runtime == period, ie unlimited time.
>> - */
>> -#define RUNTIME_INF ((u64)~0ULL)
>> -
>> static inline int fair_policy(int policy)
>> {
>> return policy == SCHED_NORMAL || policy == SCHED_BATCH;
>>
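
For anyone following along, the quota/period mapping described in the
changelog ('quota' is the busy time, 'period' is busy time plus injected
idle time) can be sketched in plain userspace C. This is only an
illustration under stated assumptions: struct clamp_bandwidth and
idle_to_bandwidth() are hypothetical names that do not appear in the
patch, and the sketch ignores the jiffies alignment and the compensation
term that clamp_cpus() applies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two bandwidth values, in microseconds. */
struct clamp_bandwidth {
	uint64_t quota_us;	/* time the CPUs may stay busy per period */
	uint64_t period_us;	/* busy time plus injected idle time */
};

/*
 * Idealized mapping of one idle-injection round to quota/period.
 *
 * idle_ms:  injected idle duration per round ('duration' in the driver)
 * idle_pct: target idle ratio in percent, assumed to be in 1..99
 *
 * Mirrors interval = duration * 100 / target_ratio from the driver,
 * with compensation assumed to be zero.
 */
static struct clamp_bandwidth idle_to_bandwidth(unsigned int idle_ms,
						unsigned int idle_pct)
{
	struct clamp_bandwidth bw;
	uint64_t idle_us = (uint64_t)idle_ms * 1000;

	assert(idle_pct > 0 && idle_pct < 100);

	/* The whole round is long enough that idle_us makes up idle_pct of it. */
	bw.period_us = idle_us * 100 / idle_pct;
	/* Whatever is not idle time is the allowed busy time. */
	bw.quota_us = bw.period_us - idle_us;
	return bw;
}
```

With a 6 ms idle duration and a 25% target ratio, this yields a 24000 us
period and an 18000 us quota. The patch itself derives quota from
sleeptime (the time until the next aligned interval), so the values it
actually passes to tg_set_cfs_quota()/tg_set_cfs_period() can differ
from this idealized version.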