From: Chris Metcalf
To: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, "Rik van Riel", Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, "Paul E. McKenney", Christoph Lameter, Viresh Kumar,
    Catalin Marinas, Will Deacon
CC: Chris Metcalf
Subject: [PATCH v6 2/6] task_isolation: add initial support
Date: Tue, 25 Aug 2015 15:55:51 -0400
Message-ID: <1440532555-15492-3-git-send-email-cmetcalf@ezchip.com>
X-Mailer: git-send-email 2.1.2
In-Reply-To: <1440532555-15492-1-git-send-email-cmetcalf@ezchip.com>
References: <1440532555-15492-1-git-send-email-cmetcalf@ezchip.com>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device
driver style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
nohz_full=CPULIST boot argument.  The "task_isolation" state is then
indicated by setting a new task struct field, task_isolation_flags,
to the value passed by prctl().  When the _ENABLE bit is set for a
task, and it is returning to userspace on a nohz_full core, it calls
the new task_isolation_enter() routine to take additional actions
to help the task avoid being interrupted in the future.

Initially, there are only three actions taken.  First, the
task calls lru_add_drain() to prevent being interrupted by a
subsequent lru_add_drain_all() call on another core.  Then, it calls
quiet_vmstat() to quieten the vmstat worker to avoid a follow-on
interrupt.  Finally, the code checks for pending timer interrupts
and quiesces until they are no longer pending.  As a result, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Signed-off-by: Chris Metcalf
---
 arch/tile/kernel/process.c |  9 ++++++
 include/linux/isolation.h  | 24 +++++++++++++++
 include/linux/sched.h      |  3 ++
 include/uapi/linux/prctl.h |  5 ++++
 init/Kconfig               | 20 +++++++++++++
 kernel/Makefile            |  1 +
 kernel/context_tracking.c  |  3 ++
 kernel/isolation.c         | 75 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |  8 +++++
 9 files changed, 148 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index e036c0aa9792..1d9bd2320a50 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,15 @@ void arch_cpu_idle(void)
 	_cpu_idle();
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_wait(void)
+{
+	set_current_state(TASK_INTERRUPTIBLE);
+	_cpu_idle();
+	set_current_state(TASK_RUNNING);
+}
+#endif
+
 /*
  * Release a thread_info structure
  */
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..fd04011b1c1e
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,24 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include
+#include
+
+#ifdef CONFIG_TASK_ISOLATION
+static inline bool task_isolation_enabled(void)
+{
+	return tick_nohz_full_cpu(smp_processor_id()) &&
+		(current->task_isolation_flags & PR_TASK_ISOLATION_ENABLE);
+}
+
+extern void task_isolation_enter(void);
+extern void task_isolation_wait(void);
+#else
+static inline bool task_isolation_enabled(void) { return false; }
+static inline void task_isolation_enter(void) { }
+#endif
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04b5ada460b4..2acb618189d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1776,6 +1776,9 @@ struct task_struct {
 	unsigned long task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned int task_isolation_flags;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..79da784fe17a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,9 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR (1 << 0) /* 64b FP registers */
 # define PR_FP_MODE_FRE (1 << 1) /* 32b compatibility */
 
+/* Enable/disable or query task_isolation mode for NO_HZ_FULL kernels. */
+#define PR_SET_TASK_ISOLATION 47
+#define PR_GET_TASK_ISOLATION 48
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index af09b4fb43d2..82d313cbd70f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -795,6 +795,26 @@ config RCU_EXPEDITE_BOOT
 
 endmenu # "RCU Subsystem"
 
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL
+	help
+	  Allow userspace processes to place themselves on nohz_full
+	  cores and run prctl(PR_SET_TASK_ISOLATION) to "isolate"
+	  themselves from the kernel. On return to userspace,
+	  isolated tasks will first arrange that no future kernel
+	  activity will interrupt the task while the task is running
+	  in userspace. This "hard" isolation from the kernel is
+	  required for userspace tasks that are running hard real-time
+	  tasks in userspace, such as a 10 Gbit network driver in userspace.
+
+	  Without this option, but with NO_HZ_FULL enabled, the kernel
+	  will make a best-faith, "soft" effort to shield a single userspace
+	  process from interrupts, but makes no guarantees.
+
+	  You should say "N" unless you are intending to run a
+	  high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index 43c4c920f30a..9ffb5c021767 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..c57c99f5c4d7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 
 #define CREATE_TRACE_POINTS
 #include
@@ -99,6 +100,8 @@ void context_tracking_enter(enum ctx_state state)
 		 * on the tick.
 		 */
 		if (state == CONTEXT_USER) {
+			if (task_isolation_enabled())
+				task_isolation_enter();
 			trace_user_enter(0);
 			vtime_user_enter(current);
 		}
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..d4618cd9e23d
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,75 @@
+/*
+ * linux/kernel/isolation.c
+ *
+ * Implementation for task isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include
+#include
+#include
+#include
+#include "time/tick-sched.h"
+
+/*
+ * Rather than continuously polling for the next_event in the
+ * tick_cpu_device, architectures can provide a method to save power
+ * by sleeping until an interrupt arrives.
+ *
+ * Note that it must be guaranteed for a particular architecture
+ * that if next_event is not KTIME_MAX, then a timer interrupt will
+ * occur, otherwise the sleep may never awaken.
+ */
+void __weak task_isolation_wait(void)
+{
+	cpu_relax();
+}
+
+/*
+ * We normally return immediately to userspace.
+ *
+ * In task_isolation mode we wait until no more interrupts are
+ * pending. Otherwise we nap with interrupts enabled and wait for the
+ * next interrupt to fire, then loop back and retry.
+ *
+ * Note that if you schedule two task_isolation processes on the same
+ * core, neither will ever leave the kernel, and one will have to be
+ * killed manually. Otherwise in situations where another process is
+ * in the runqueue on this cpu, this task will just wait for that
+ * other task to go idle before returning to user space.
+ */
+void task_isolation_enter(void)
+{
+	struct clock_event_device *dev =
+		__this_cpu_read(tick_cpu_device.evtdev);
+	struct task_struct *task = current;
+	unsigned long start = jiffies;
+	bool warned = false;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	/* Quieten the vmstat worker so it won't interrupt us. */
+	quiet_vmstat();
+
+	while (READ_ONCE(dev->next_event.tv64) != KTIME_MAX) {
+		if (!warned && (jiffies - start) >= (5 * HZ)) {
+			pr_warn("%s/%d: cpu %d: task_isolation task blocked for %ld seconds\n",
+				task->comm, task->pid, smp_processor_id(),
+				(jiffies - start) / HZ);
+			warned = true;
+		}
+		if (should_resched())
+			schedule();
+		if (test_thread_flag(TIF_SIGPENDING))
+			break;
+		task_isolation_wait();
+	}
+	if (warned) {
+		pr_warn("%s/%d: cpu %d: task_isolation task unblocked after %ld seconds\n",
+			task->comm, task->pid, smp_processor_id(),
+			(jiffies - start) / HZ);
+		dump_stack();
+	}
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 259fda25eb6b..c7024be2d79b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2267,6 +2267,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_TASK_ISOLATION
+	case PR_SET_TASK_ISOLATION:
+		me->task_isolation_flags = arg2;
+		break;
+	case PR_GET_TASK_ISOLATION:
+		error = me->task_isolation_flags;
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
-- 
2.1.2
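For reference, a userspace caller of the prctl() interface described in
the commit message might look roughly like the sketch below. It is only an
illustration, not part of the patch. The three PR_* values are copied from
the prctl.h hunk above. The helper names task_isolation_request(),
task_isolation_query() and task_isolation_is_enabled(), and the
TASK_ISOLATION_DEMO_MAIN guard, are made up for this example. It assumes a
kernel built from this series with CONFIG_TASK_ISOLATION=y; on any other
kernel the same option numbers may be unassigned or mean something
unrelated, so the wrappers should not be called there.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * Values copied from the include/uapi/linux/prctl.h hunk in this patch.
 * Assumption: the running kernel was built from this series. Elsewhere
 * these numbers may be unassigned or belong to a different prctl option.
 */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION		47
#define PR_GET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

/* Hypothetical helper: set the isolation flags; returns 0 or -errno. */
static inline int task_isolation_request(unsigned long flags)
{
	return prctl(PR_SET_TASK_ISOLATION, flags, 0, 0, 0) < 0 ? -errno : 0;
}

/* Hypothetical helper: read the flags back; returns flags or -errno. */
static inline int task_isolation_query(void)
{
	int flags = prctl(PR_GET_TASK_ISOLATION, 0, 0, 0, 0);

	return flags < 0 ? -errno : flags;
}

/* Pure decode step: true only for a non-error value with _ENABLE set. */
static inline int task_isolation_is_enabled(int flags)
{
	return flags >= 0 && (flags & PR_TASK_ISOLATION_ENABLE) != 0;
}

#ifdef TASK_ISOLATION_DEMO_MAIN	/* hypothetical guard for the demo */
int main(void)
{
	int ret, flags;

	/* Sanity-check the decode helper before touching the kernel. */
	assert(task_isolation_is_enabled(PR_TASK_ISOLATION_ENABLE));

	ret = task_isolation_request(PR_TASK_ISOLATION_ENABLE);
	if (ret < 0) {
		/* With this patch, -EINVAL means CONFIG_TASK_ISOLATION is off. */
		fprintf(stderr, "PR_SET_TASK_ISOLATION: %s\n", strerror(-ret));
		return 1;
	}

	flags = task_isolation_query();
	printf("task isolation %s\n",
	       task_isolation_is_enabled(flags) ? "enabled" : "not enabled");
	return 0;
}
#endif
```

Building with -DTASK_ISOLATION_DEMO_MAIN gives a small standalone test
program. As the commit message notes, the flag only has an effect when the
task returns to userspace on a nohz_full core, so a real caller would
first pin itself to such a core (for example with sched_setaffinity())
before setting PR_TASK_ISOLATION_ENABLE.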