From: Chris Metcalf
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, "Paul E. McKenney", Christoph Lameter, Viresh Kumar
CC: Chris Metcalf
Subject: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
Date: Mon, 13 Jul 2015 15:57:57 -0400
Message-ID: <1436817481-8732-2-git-send-email-cmetcalf@ezchip.com>
X-Mailer: git-send-email 2.1.2
In-Reply-To: <1436817481-8732-1-git-send-email-cmetcalf@ezchip.com>
References: <1436817481-8732-1-git-send-email-cmetcalf@ezchip.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The existing nohz_full mode makes tradeoffs to minimize userspace
interruptions while still attempting to avoid overheads in the
kernel entry/exit path, to provide 100% kernel semantics, etc.

However, some applications require a stronger commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications to elect
to have the stronger semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The "cpu_isolated" state is indicated by setting a new task struct
field, cpu_isolated_flags, to the value passed by prctl().
When the _ENABLE bit is set for a task, and it is returning to
userspace on a nohz_full core, it calls the new
tick_nohz_cpu_isolated_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only two actions taken. First, the task
calls lru_add_drain() to prevent being interrupted by a subsequent
lru_add_drain_all() call on another core. Then, the code checks for
pending timer interrupts and quiesces until they are no longer pending.
As a result, sys calls (and page faults, etc.) can be inordinately slow.
However, this quiescing guarantees that no unexpected interrupts will
occur, even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf
---
 arch/tile/kernel/process.c |  9 ++++++++
 include/linux/sched.h      |  3 +++
 include/linux/tick.h       | 10 ++++++++
 include/uapi/linux/prctl.h |  5 ++++
 kernel/context_tracking.c  |  3 +++
 kernel/sys.c               |  8 +++++++
 kernel/time/tick-sched.c   | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 95 insertions(+)

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..3625e839ad62 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+void tick_nohz_cpu_isolated_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f1591615..f350b0c20bbc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,6 +1778,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned int	cpu_isolated_flags;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 3741ba1a652c..cb5569181359 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -144,11 +145,18 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
 		cpumask_or(mask, mask, tick_nohz_full_mask);
 }
 
+static inline bool tick_nohz_is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
 extern void __tick_nohz_full_check(void);
 extern void tick_nohz_full_kick(void);
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void tick_nohz_full_kick_all(void);
 extern void __tick_nohz_task_switch(struct task_struct *tsk);
+extern void tick_nohz_cpu_isolated_enter(void);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -158,6 +166,8 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void tick_nohz_full_kick(void) { }
 static inline void tick_nohz_full_kick_all(void) { }
 static inline void __tick_nohz_task_switch(struct task_struct *tsk) { }
+static inline bool tick_nohz_is_cpu_isolated(void) { return false; }
+static inline void tick_nohz_cpu_isolated_enter(void) { }
 #endif
 
 static inline bool is_housekeeping_cpu(int cpu)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..f9de3ee12723 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 #define CREATE_TRACE_POINTS
 #include
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (tick_nohz_is_cpu_isolated())
+					tick_nohz_cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..36eb9a839f1f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NO_HZ_FULL
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c792429e98c6..4cf093c012d1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -389,6 +390,62 @@ void __init tick_nohz_init(void)
 	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
 		cpumask_pr_args(tick_nohz_full_mask));
 }
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak tick_nohz_cpu_isolated_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In "cpu_isolated" mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two "cpu_isolated" processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void tick_nohz_cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		tick_nohz_cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
+
 #endif
 
 /*
-- 
2.1.2
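
Illustrative note (not part of the patch): the flow described in the
commit message can be modelled in plain userspace C. The sketch below
is only a model under stated assumptions. It is not kernel code and
makes no real prctl() call, since the option numbers above exist only
on a kernel carrying this patch. Every name prefixed with model_ or
MODEL_ is hypothetical; the three PR_* constants are copied from the
prctl.h hunk. It shows how the flags stored by PR_SET_CPU_ISOLATED
gate the return-to-userspace check, and how the wait loop ends either
when no timer event is pending or when a signal is pending. The
5-second warning and the should_resched() step are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values proposed by this patch; not assumed to exist in any installed header. */
#define PR_SET_CPU_ISOLATED    47
#define PR_GET_CPU_ISOLATED    48
#define PR_CPU_ISOLATED_ENABLE (1 << 0)

/* Hypothetical stand-in for KTIME_MAX ("no timer event armed"). */
#define MODEL_KTIME_MAX INT64_MAX

/* Hypothetical stand-in for the task and per-CPU state the patch consults. */
struct model_task {
	unsigned int cpu_isolated_flags; /* mirrors the new task_struct field */
	bool on_nohz_full_cpu;           /* models tick_nohz_full_cpu(smp_processor_id()) */
	bool signal_pending;             /* models test_thread_flag(TIF_SIGPENDING) */
};

/* Mirrors the two new cases added to the prctl() switch in kernel/sys.c. */
long model_prctl(struct model_task *me, int option, unsigned long arg2)
{
	switch (option) {
	case PR_SET_CPU_ISOLATED:
		me->cpu_isolated_flags = (unsigned int)arg2;
		return 0;
	case PR_GET_CPU_ISOLATED:
		return (long)me->cpu_isolated_flags;
	default:
		return -1; /* the real syscall reports -EINVAL here */
	}
}

/* Mirrors tick_nohz_is_cpu_isolated(): nohz_full CPU and _ENABLE bit set. */
bool model_is_cpu_isolated(const struct model_task *me)
{
	return me->on_nohz_full_cpu &&
	       (me->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
}

/*
 * Mirrors the wait loop in tick_nohz_cpu_isolated_enter(). next_event[]
 * is a scripted sequence of values the loop would read from
 * dev->next_event, one per iteration. Returns how many times the model
 * would have called tick_nohz_cpu_isolated_wait().
 */
int model_quiesce(const struct model_task *me,
		  const int64_t *next_event, int n)
{
	int waits = 0;

	assert(next_event != NULL || n == 0);
	for (int i = 0; i < n && next_event[i] != MODEL_KTIME_MAX; i++) {
		if (me->signal_pending)
			break;  /* a pending signal ends the wait early */
		waits++;        /* kernel: tick_nohz_cpu_isolated_wait() */
	}
	return waits;
}
```

In this model, a task that sets PR_CPU_ISOLATED_ENABLE while on a
nohz_full CPU passes the model_is_cpu_isolated() check. It then waits
once per scripted timer value until it sees MODEL_KTIME_MAX. With
signal_pending set it does not wait at all.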