Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751386AbdH2RUw (ORCPT ); Tue, 29 Aug 2017 13:20:52 -0400 Received: from mx2.suse.de ([195.135.220.15]:56386 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751233AbdH2RUt (ORCPT ); Tue, 29 Aug 2017 13:20:49 -0400 Date: Tue, 29 Aug 2017 19:20:41 +0200 From: "Luis R. Rodriguez" To: Yang Zhang Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, wanpeng.li@hotmail.com, mst@redhat.com, pbonzini@redhat.com, tglx@linutronix.de, rkrcmar@redhat.com, dmatlack@google.com, agraf@suse.de, peterz@infradead.org, linux-doc@vger.kernel.org, Quan Xu , Jonathan Corbet , Juergen Gross , Jeremy Fitzhardinge , Chris Wright , Alok Kataria , Rusty Russell , Ingo Molnar , "H. Peter Anvin" , x86@kernel.org, "Luis R. Rodriguez" , Kees Cook , Mauro Carvalho Chehab , Krzysztof Kozlowski , Josh Poimboeuf , Andrew Morton , Petr Mladek , Jessica Yu , Larry Finger , zijun_hu , Baoquan He , Johannes Berg , Ian Abbott , virtualization@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org Subject: Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Message-ID: <20170829172041.GH27873@wotan.suse.de> References: <1504007201-12904-1-git-send-email-yang.zhang.wz@gmail.com> <1504007201-12904-6-git-send-email-yang.zhang.wz@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1504007201-12904-6-git-send-email-yang.zhang.wz@gmail.com> User-Agent: Mutt/1.6.0 (2016-04-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4998 Lines: 147 On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote: > To reduce the cost of poll, we introduce three sysctl to control the > poll time. This commit does not describe in any way the fact that these knobs are all for and only for PARAVIRT. > diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt > index bac23c1..67447b8 100644 > --- a/Documentation/sysctl/kernel.txt > +++ b/Documentation/sysctl/kernel.txt > @@ -63,6 +63,9 @@ show up in /proc/sys/kernel: > - perf_event_max_stack > - perf_event_max_contexts_per_stack > - pid_max > +- poll_grow [ X86 only ] > +- poll_shrink [ X86 only ] > +- poll_threshold_ns [ X86 only ] How about paravirt_ prefix? > - powersave-nap [ PPC only ] > - printk > - printk_delay > @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one. > > ============================================================== > > +poll_grow: (X86 only) > + > +This parameter is multiplied in the grow_poll_ns() to increase the poll time. > +By default, the values is 2. > + > +============================================================== > +poll_shrink: (X86 only) > + > +This parameter is divided in the shrink_poll_ns() to reduce the poll time. > +By default, the values is 2. These don't say anything about this being related to paravirt. > + > +============================================================== > +poll_threshold_ns: (X86 only) > + > +This parameter controls the max poll time before entering real idle path. > +This parameter is expected to take effect only when running inside a VM. > +It would make no sense to turn on it in bare mental. turn it on? Or modify, or use it? > +By default, the values is 0 means don't poll. It is recommended to change > +the value to non-zero if running latency-bound workloads inside VM. > + > +============================================================== > + > powersave-nap: (PPC only) > > If set, Linux-PPC will use the 'nap' mode of powersaving, > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index a11b2c2..0b92f8f 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = { > .update = paravirt_nop, > }; > > +unsigned long poll_threshold_ns; > +unsigned int poll_shrink = 2; > +unsigned int poll_grow = 2; > + > __visible struct pv_irq_ops pv_irq_ops = { > .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl), > .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), > diff --git a/include/linux/kernel.h b/include/linux/kernel.h > index bd6d96c..6cb2820 100644 > --- a/include/linux/kernel.h > +++ b/include/linux/kernel.h > @@ -462,6 +462,12 @@ extern __scanf(2, 0) > > extern bool crash_kexec_post_notifiers; > > +#ifdef CONFIG_PARAVIRT > +extern unsigned long poll_threshold_ns; > +extern unsigned int poll_shrink; > +extern unsigned int poll_grow; > +#endif What are these if we are on bare metal but support paravirt? The docs are not clear in any way about it. > + > /* > * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It > * holds a CPU number which is executing panic() currently. A value of > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 6648fbb..9b86a8f 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, > .extra2 = &one, > }, > #endif > +#ifdef CONFIG_PARAVIRT > + { > + .procname = "halt_poll_threshold", > + .data = &poll_threshold_ns, > + .maxlen = sizeof(unsigned long), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, proc_doulongvec_minmax() may be more appropriate here? What is the range? Also what user did you see used proc_dointvec() but had unsigned long? If the test below reveal this is invalid we would need to go fix them as well. > + { > + .procname = "halt_poll_grow", > + .data = &poll_grow, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, > + { > + .procname = "halt_poll_shrink", > + .data = &poll_shrink, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec, We have now a much more appropriate proc_douintvec() proc_douintvec_minmax(). It seems you want to support only unsigned int for two of these so that would be more appropriate. To test things out you can replicate your values using or extending the test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of values and then fuzz testing them with arbitrary values to ensure you get only as input what you really think these values should get. Once done please extend the script: tools/testing/selftests/sysctl/sysctl.sh to cover the different tests you have run, we want tests to be generic. Luis