Subject: Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
From: Yang Zhang
To: "Michael S. Tsirkin"
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, pbonzini@redhat.com,
 x86@kernel.org, corbet@lwn.net, tony.luck@intel.com, bp@alien8.de,
 peterz@infradead.org, mchehab@kernel.org, akpm@linux-foundation.org,
 krzk@kernel.org, jpoimboe@redhat.com, luto@kernel.org,
 borntraeger@de.ibm.com, thgarnie@google.com, rgerst@gmail.com,
 minipli@googlemail.com, douly.fnst@cn.fujitsu.com, nicstange@gmail.com,
 fweisbec@gmail.com, dvlasenk@redhat.com, bristot@redhat.com,
 yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com,
 yu.c.chen@intel.com, aaron.lu@intel.com, rostedt@goodmis.org,
 me@kylehuey.com, len.brown@intel.com, prarit@redhat.com,
 hidehiro.kawai.ez@hitachi.com, fengtiantian@huawei.com, pmladek@suse.com,
 jeyu@redhat.com, Larry.Finger@lwfinger.net, zijun_hu@htc.com,
 luisbg@osg.samsung.com, johannes.berg@intel.com,
 niklas.soderlund+renesas@ragnatech.se, zlpnobody@gmail.com,
 adobriyan@gmail.com, fgao@48lvckh6395k16k5.yundunddos.com,
 ebiederm@xmission.com, subashab@codeaurora.org, arnd@arndb.de,
 matt@codeblueprint.co.uk, mgorman@techsingularity.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-edac@vger.kernel.org, kvm@vger.kernel.org
Date: Thu, 17 Aug 2017 15:29:16 +0800
Message-ID: <4f244651-49df-26ee-961a-8fb782a920b1@gmail.com>
In-Reply-To: <20170816070305-mutt-send-email-mst@kernel.org>
References: <1498130534-26568-1-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal>
 <1498130534-26568-2-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal>
 <20170816070305-mutt-send-email-mst@kernel.org>

On 2017/8/16 12:04, Michael S. Tsirkin wrote:
> On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
>> From: Yang Zhang
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering idle state.
>>
>> David has a topic at KVM Forum describing the problem that current KVM
>> VMs have when running some message-passing workloads. There is also
>> some work to improve the performance in KVM, like halt polling in KVM.
>> But we still have 4 MSR writes and a HLT vmexit when going into halt
>> idle, which introduce a lot of latency.
>>
>> Halt polling in KVM provides the capability to not schedule out a VCPU
>> when it is the only task on its pCPU. Unlike that, this patch lets the
>> VCPU poll for a while when there is no work inside the VCPU, to
>> eliminate the heavy vmexits when entering and leaving idle. The
>> potential impact is that it costs more CPU cycles, since we are
>> polling, and may affect other tasks waiting on the same physical CPU
>> in the host.
>
> I wonder whether you considered doing this in an idle driver.
> I have a prototype patch combining this with mwait within guest -
> I can post it if you are interested.

I am not sure mwait can solve this problem. But yes, if you have a
prototype patch, I can test it. Also, I am working on the next version
with a better approach. I will post it ASAP.

>
>
>> Here is the data i get when running benchmark contextswitch
>> (https://github.com/tsuna/contextswitch)
>>
>> before patch:
>> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>>
>> after patch:
>> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>>
>> Signed-off-by: Yang Zhang
>> ---
>>  Documentation/sysctl/kernel.txt | 10 ++++++++++
>>  arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
>>  include/linux/kernel.h          |  3 +++
>>  kernel/sched/idle.c             |  3 +++
>>  kernel/sysctl.c                 |  9 +++++++++
>>  5 files changed, 46 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index bac23c1..4e71bfe 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>>  - perf_event_max_stack
>>  - perf_event_max_contexts_per_stack
>>  - pid_max
>> +- poll_threshold_ns [ X86 only ]
>>  - powersave-nap [ PPC only ]
>>  - printk
>>  - printk_delay
>> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>>
>>  ==============================================================
>>
>> +poll_threshold_ns: (X86 only)
>> +
>> +This parameter used to control the max wait time to poll before going
>> +into real idle state. By default, the values is 0 means don't poll.
>> +It is recommended to change the value to non-zero if running latency-bound
>> +workloads in VM.
>> +
>> +==============================================================
>> +
>>  powersave-nap: (PPC only)
>>
>>  If set, Linux-PPC will use the 'nap' mode of powersaving,
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 0bb8842..6361783 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -39,6 +39,10 @@
>>  #include
>>  #include
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +unsigned long poll_threshold_ns;
>> +#endif
>> +
>>  /*
>>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
>> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +void arch_cpu_idle_poll(void)
>> +{
>> +	ktime_t start, cur, stop;
>> +
>> +	if (poll_threshold_ns) {
>> +		start = cur = ktime_get();
>> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
>> +		do {
>> +			if (need_resched())
>> +				break;
>> +			cur = ktime_get();
>> +		} while (ktime_before(cur, stop));
>> +	}
>> +}
>> +#endif
>> +
>>  void arch_cpu_idle_enter(void)
>>  {
>>  	tsc_verify_tsc_adjust(false);
>> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
>> index 13bc08a..04cf774 100644
>> --- a/include/linux/kernel.h
>> +++ b/include/linux/kernel.h
>> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>>  extern int sysctl_panic_on_stackoverflow;
>>
>>  extern bool crash_kexec_post_notifiers;
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +extern unsigned long poll_threshold_ns;
>> +#endif
>>
>>  /*
>>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 2a25a9e..e789f99 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>>  }
>>
>>  /* Weak implementations for optional arch specific functions */
>> +void __weak arch_cpu_idle_poll(void) { }
>>  void __weak arch_cpu_idle_prepare(void) { }
>>  void __weak arch_cpu_idle_enter(void) { }
>>  void __weak arch_cpu_idle_exit(void) { }
>> @@ -219,6 +220,8 @@ static void do_idle(void)
>>  	 */
>>
>>  	__current_set_polling();
>> +	arch_cpu_idle_poll();
>> +
>>  	tick_nohz_idle_enter();
>>
>>  	while (!need_resched()) {
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 4dfba1a..9174d57 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>>  		.extra2 = &one,
>>  	},
>>  #endif
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +	{
>> +		.procname = "halt_poll_threshold",
>> +		.data = &poll_threshold_ns,
>> +		.maxlen = sizeof(unsigned long),
>> +		.mode = 0644,
>> +		.proc_handler = proc_dointvec,
>> +	},
>> +#endif
>>  	{ }
>>  };
>>
>> --
>> 1.8.3.1

-- 
Yang
Alibaba Cloud Computing
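
For readers who want to try the idea outside the kernel, the bounded poll in the quoted arch_cpu_idle_poll() can be approximated in userspace. This is only a rough sketch under stated assumptions, not the patch itself: work_pending, now_ns() and poll_before_idle() are hypothetical names made up for the illustration, clock_gettime(CLOCK_MONOTONIC) stands in for ktime_get(), and an atomic flag stands in for need_resched(). As in the patch, a threshold of 0 means "do not poll". The sketch leaves out the start variable, which the quoted code assigns but never reads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the kernel's need_resched() condition. */
atomic_bool work_pending;

/* Monotonic clock in nanoseconds (userspace substitute for ktime_get()). */
uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Spin for at most threshold_ns waiting for work_pending.
 * Returns true if work showed up while polling, false if the window
 * expired or polling is disabled (threshold_ns == 0).
 */
bool poll_before_idle(uint64_t threshold_ns)
{
	uint64_t stop;

	if (threshold_ns == 0)
		return false;

	stop = now_ns() + threshold_ns;
	do {
		if (atomic_load(&work_pending))
			return true;
	} while (now_ns() < stop);

	return false;
}

#ifdef POLL_SKETCH_DEMO
/* Build with -DPOLL_SKETCH_DEMO to run this small demonstration. */
int main(void)
{
	/* Nothing pending: the 1 ms window expires. */
	assert(!poll_before_idle(1000000));

	/* Work already pending: the poll returns on its first check. */
	atomic_store(&work_pending, true);
	assert(poll_before_idle(1000000));

	printf("poll sketch ok\n");
	return 0;
}
#endif
```

A caller would fall through to its real sleep path (the equivalent of HLT in the guest) only when poll_before_idle() returns false, which is where the trade-off described in the commit message shows up: the spin burns CPU time in exchange for skipping the expensive idle entry and exit.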