Date: Wed, 16 Aug 2017 07:04:53 +0300
From: "Michael S. Tsirkin"
To: root
Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, pbonzini@redhat.com,
    x86@kernel.org, corbet@lwn.net, tony.luck@intel.com, bp@alien8.de,
    peterz@infradead.org, mchehab@kernel.org, akpm@linux-foundation.org,
    krzk@kernel.org, jpoimboe@redhat.com, luto@kernel.org,
    borntraeger@de.ibm.com, thgarnie@google.com, rgerst@gmail.com,
    minipli@googlemail.com, douly.fnst@cn.fujitsu.com, nicstange@gmail.com,
    fweisbec@gmail.com, dvlasenk@redhat.com, bristot@redhat.com,
    yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com,
    yu.c.chen@intel.com, aaron.lu@intel.com, rostedt@goodmis.org,
    me@kylehuey.com, len.brown@intel.com, prarit@redhat.com,
    hidehiro.kawai.ez@hitachi.com, fengtiantian@huawei.com, pmladek@suse.com,
    jeyu@redhat.com, Larry.Finger@lwfinger.net, zijun_hu@htc.com,
    luisbg@osg.samsung.com, johannes.berg@intel.com,
    niklas.soderlund+renesas@ragnatech.se, zlpnobody@gmail.com,
    adobriyan@gmail.com, fgao@48lvckh6395k16k5.yundunddos.com,
    ebiederm@xmission.com, subashab@codeaurora.org, arnd@arndb.de,
    matt@codeblueprint.co.uk, mgorman@techsingularity.net,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-edac@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
Message-ID: <20170816070305-mutt-send-email-mst@kernel.org>
In-Reply-To: <1498130534-26568-2-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal>

On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
> From: Yang Zhang
>
> This patch introduce a new mechanism to poll for a while before
> entering idle state.
>
> David has a topic in KVM forum to describe the problem on current KVM VM
> when running some message passing workload in KVM forum. Also, there
> are some work to improve the performance in KVM, like halt polling in KVM.
> But we still has 4 MSR wirtes and HLT vmexit when going into halt idle
> which introduce lot of latency.
>
> Halt polling in KVM provide the capbility to not schedule out VCPU when
> it is the only task in this pCPU. Unlike it, this patch will let VCPU polls
> for a while if there is no work inside VCPU to elimiate heavy vmexit during
> in/out idle. The potential impact is it will cost more CPU cycle since we
> are doing polling and may impact other task which waiting on the same
> physical CPU in host.

I wonder whether you considered doing this in an idle driver.
I have a prototype patch combining this with mwait within guest -
I can post it if you are interested.

> Here is the data i get when running benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
>
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>
> Signed-off-by: Yang Zhang
> ---
>  Documentation/sysctl/kernel.txt | 10 ++++++++++
>  arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
>  include/linux/kernel.h          |  3 +++
>  kernel/sched/idle.c             |  3 +++
>  kernel/sysctl.c                 |  9 +++++++++
>  5 files changed, 46 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..4e71bfe 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_threshold_ns [ X86 only ]
>  - powersave-nap [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>
>  ==============================================================
>
> +poll_threshold_ns: (X86 only)
> +
> +This parameter used to control the max wait time to poll before going
> +into real idle state. By default, the values is 0 means don't poll.
> +It is recommended to change the value to non-zero if running latency-bound
> +workloads in VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0bb8842..6361783 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include
>  #include
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif
> +
>  void arch_cpu_idle_enter(void)
>  {
>  	tsc_verify_tsc_adjust(false);
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 13bc08a..04cf774 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>  extern int sysctl_panic_on_stackoverflow;
>
>  extern bool crash_kexec_post_notifiers;
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +extern unsigned long poll_threshold_ns;
> +#endif
>
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 2a25a9e..e789f99 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  }
>
>  /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
>  void __weak arch_cpu_idle_exit(void) { }
> @@ -219,6 +220,8 @@ static void do_idle(void)
>  	 */
>
>  	__current_set_polling();
> +	arch_cpu_idle_poll();
> +
>  	tick_nohz_idle_enter();
>
>  	while (!need_resched()) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 4dfba1a..9174d57 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2 = &one,
>  	},
>  #endif
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +	{
> +		.procname = "halt_poll_threshold",
> +		.data = &poll_threshold_ns,
> +		.maxlen = sizeof(unsigned long),
> +		.mode = 0644,
> +		.proc_handler = proc_dointvec,
> +	},
> +#endif
>  	{ }
>  };
>
> --
> 1.8.3.1
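
For anyone following the thread outside the kernel tree, the poll-before-idle
idea in the quoted patch can be approximated in userspace roughly as below.
This is only a sketch under stated assumptions, not the patch's code: now_ns(),
poll_before_idle(), always_pending() and never_pending() are hypothetical
names, clock_gettime(CLOCK_MONOTONIC) stands in for ktime_get(), and the
work_pending callback stands in for need_resched().

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for ktime_get(): monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Spin for at most threshold_ns. Returns true as soon as work_pending()
 * reports work (the analogue of need_resched()), false once the budget
 * is used up, at which point a real idle path would go on to halt.
 * A threshold of 0 means "do not poll", mirroring the patch's default.
 */
static bool poll_before_idle(uint64_t threshold_ns, bool (*work_pending)(void))
{
	uint64_t stop;

	assert(work_pending != NULL);
	if (threshold_ns == 0)
		return false;

	stop = now_ns() + threshold_ns;
	do {
		if (work_pending())
			return true;
	} while (now_ns() < stop);

	return false;
}

/* Hypothetical callbacks used to exercise the two outcomes. */
static bool always_pending(void) { return true; }
static bool never_pending(void) { return false; }
```

Calling poll_before_idle(1000000, always_pending) returns true on the first
check, while poll_before_idle(1000000, never_pending) burns about 1 ms of CPU
and then returns false. That second case is the cost the patch description
mentions: cycles spent spinning that another task on the same physical CPU
could have used.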