From: Chris Metcalf
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, Paul E. McKenney, Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon
CC: Chris Metcalf
Subject: [PATCH v5 2/6] cpu_isolated: add initial support
Date: Tue, 28 Jul 2015 15:49:36 -0400
Message-ID: <1438112980-9981-3-git-send-email-cmetcalf@ezchip.com>
X-Mailer: git-send-email 2.1.2
In-Reply-To: <1438112980-9981-1-git-send-email-cmetcalf@ezchip.com>
References: <1438112980-9981-1-git-send-email-cmetcalf@ezchip.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new CPU_ISOLATED Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "cpu_isolated" state is then
indicated by setting a new task struct field, cpu_isolated_flags,
to the value passed by prctl().
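
As an illustration of the opt-in described above, here is a minimal
userspace sketch. The helper names cpu_isolated_set() and
cpu_isolated_get() are hypothetical and not part of this patch; the
option numbers are copied from the prctl.h hunk below and are only
meaningful on a kernel built with this series. An unpatched kernel may
reject them or interpret them as a different prctl option.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values copied from this patch's include/uapi/linux/prctl.h hunk. */
#ifndef PR_SET_CPU_ISOLATED
#define PR_SET_CPU_ISOLATED	47
#define PR_GET_CPU_ISOLATED	48
#define PR_CPU_ISOLATED_ENABLE	(1 << 0)
#endif

/*
 * Hypothetical helper: store 'flags' in the calling task's
 * cpu_isolated_flags.  Returns 0 on success, -errno on failure.
 */
int cpu_isolated_set(unsigned long flags)
{
	if (prctl(PR_SET_CPU_ISOLATED, flags, 0, 0, 0) < 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: read back the current flags.
 * Returns the flags (>= 0) on success, -errno on failure.
 */
int cpu_isolated_get(void)
{
	int ret = prctl(PR_GET_CPU_ISOLATED, 0, 0, 0, 0);

	return ret < 0 ? -errno : ret;
}
```

A task would typically pin itself to a nohz_full core first (for
example with sched_setaffinity()) and then call
cpu_isolated_set(PR_CPU_ISOLATED_ENABLE), since the mode only takes
effect when returning to userspace on such a core.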
When the _ENABLE bit is set for a task, and it is returning to
userspace on a nohz_full core, it calls the new cpu_isolated_enter()
routine to take additional actions to help the task avoid being
interrupted in the future.

Initially, there are only three actions taken.  First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core.  Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt.  Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending.  As a result, sys
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf
---
 arch/tile/kernel/process.c   |  9 ++++++
 include/linux/cpu_isolated.h | 24 +++++++++++++++
 include/linux/sched.h        |  3 ++
 include/uapi/linux/prctl.h   |  5 ++++
 kernel/context_tracking.c    |  3 ++
 kernel/sys.c                 |  8 +++++
 kernel/time/Kconfig          | 20 +++++++++++++
 kernel/time/Makefile         |  1 +
 kernel/time/cpu_isolated.c   | 71 ++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 144 insertions(+)
 create mode 100644 include/linux/cpu_isolated.h
 create mode 100644 kernel/time/cpu_isolated.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..7db6f8386417 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_CPU_ISOLATED
+void cpu_isolated_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/cpu_isolated.h b/include/linux/cpu_isolated.h
new file mode 100644
index 000000000000..a3d17360f7ae
--- /dev/null
+++ b/include/linux/cpu_isolated.h
@@ -0,0 +1,24 @@
+/*
+ * CPU isolation related global functions
+ */
+#ifndef _LINUX_CPU_ISOLATED_H
+#define _LINUX_CPU_ISOLATED_H
+
+#include
+#include
+
+#ifdef CONFIG_CPU_ISOLATED
+static inline bool is_cpu_isolated(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->cpu_isolated_flags & PR_CPU_ISOLATED_ENABLE);
+}
+
+extern void cpu_isolated_enter(void);
+extern void cpu_isolated_wait(void);
+#else
+static inline bool is_cpu_isolated(void) { return false; }
+static inline void cpu_isolated_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..0bb248385d88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_CPU_ISOLATED
+	unsigned int	cpu_isolated_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..edb40b6b84db 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* Enable/disable or query cpu_isolated mode for NO_HZ_FULL kernels. */
+#define PR_SET_CPU_ISOLATED	47
+#define PR_GET_CPU_ISOLATED	48
+# define PR_CPU_ISOLATED_ENABLE	(1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..36b6509c3e2a 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 #define CREATE_TRACE_POINTS
 #include
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 			 * on the tick.
 			 */
 			if (state == CONTEXT_USER) {
+				if (is_cpu_isolated())
+					cpu_isolated_enter();
 				trace_user_enter(0);
 				vtime_user_enter(current);
 			}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c68417ff4800 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_CPU_ISOLATED
+	case PR_SET_CPU_ISOLATED:
+		me->cpu_isolated_flags = arg2;
+		break;
+	case PR_GET_CPU_ISOLATED:
+		error = me->cpu_isolated_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 579ce1b929af..141969149994 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -195,5 +195,25 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config CPU_ISOLATED
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	 Allow userspace processes to place themselves on nohz_full
+	 cores and run prctl(PR_SET_CPU_ISOLATED) to "isolate"
+	 themselves from the kernel.  On return to userspace,
+	 cpu-isolated tasks will first arrange that no future kernel
+	 activity will interrupt the task while the task is running
+	 in userspace.  This "hard" isolation from the kernel is
+	 required for userspace tasks that are running hard real-time
+	 tasks in userspace, such as a 10 Gbit network driver in userspace.
+
+	 Without this option, but with NO_HZ_FULL enabled, the kernel
+	 will make a best-faith, "soft" effort to shield a single userspace
+	 process from interrupts, but makes no guarantees.
+
+	 You should say "N" unless you are intending to run a
+	 high-performance userspace driver or similar task.
+
 endmenu
 endif
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 49eca0beed32..984081cce974 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o
 obj-$(CONFIG_TIMER_STATS) += timer_stats.o
 obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
 obj-$(CONFIG_TEST_UDELAY) += test_udelay.o
+obj-$(CONFIG_CPU_ISOLATED) += cpu_isolated.o
diff --git a/kernel/time/cpu_isolated.c b/kernel/time/cpu_isolated.c
new file mode 100644
index 000000000000..e27259f30caf
--- /dev/null
+++ b/kernel/time/cpu_isolated.c
@@ -0,0 +1,71 @@
+/*
+ * linux/kernel/time/cpu_isolated.c
+ *
+ * Implementation for cpu isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include
+#include
+#include
+#include
+#include "tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ */
+void __weak cpu_isolated_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In cpu_isolated mode we wait until no more interrupts are
+ * pending.  Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two cpu_isolated processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually.  Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void cpu_isolated_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		cpu_isolated_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
-- 
2.1.2
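
The control flow of the quiescing loop in cpu_isolated_enter() may be
easier to follow in a small userspace model. This is only a sketch: the
model_* names, MODEL_KTIME_MAX and the ticks_until_idle counter are
hypothetical stand-ins, and the warning and rescheduling steps of the
kernel loop are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stands in for the kernel's KTIME_MAX ("no timer event pending"). */
#define MODEL_KTIME_MAX INT64_MAX

/* Hypothetical stand-in for the per-cpu clock event device. */
struct model_clock_dev {
	int64_t next_event;	/* MODEL_KTIME_MAX means nothing is pending */
	int ticks_until_idle;	/* waits left before the timer goes away */
};

/* Stand-in for cpu_isolated_wait(): one "interrupt" arrives per call. */
void model_wait(struct model_clock_dev *dev)
{
	if (dev->ticks_until_idle > 0 && --dev->ticks_until_idle == 0)
		dev->next_event = MODEL_KTIME_MAX;
}

/*
 * Model of the loop in cpu_isolated_enter(): keep waiting while a
 * timer event is pending, but give up early if a signal is pending.
 * Returns the number of waits performed.
 */
int model_enter(struct model_clock_dev *dev, const bool *sigpending)
{
	int waits = 0;

	while (dev->next_event != MODEL_KTIME_MAX) {
		if (*sigpending)
			break;
		model_wait(dev);
		waits++;
	}
	return waits;
}
```

In this model a device that needs three more interrupts before its
timer is cancelled makes model_enter() wait three times, while a
pending signal makes it return immediately with the event still
pending, mirroring the TIF_SIGPENDING check in the patch.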