Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757408AbcCCTwd (ORCPT ); Thu, 3 Mar 2016 14:52:33 -0500 Received: from mail-am1on0054.outbound.protection.outlook.com ([157.56.112.54]:28080 "EHLO emea01-am1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752839AbcCCTwb (ORCPT ); Thu, 3 Mar 2016 14:52:31 -0500 Authentication-Results: infradead.org; dkim=none (message not signed) header.d=none;infradead.org; dmarc=none action=none header.from=mellanox.com; Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality To: Andy Lutomirski References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com> <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> CC: Thomas Gleixner , Christoph Lameter , Andrew Morton , Viresh Kumar , Ingo Molnar , Steven Rostedt , Tejun Heo , Gilad Ben Yossef , Will Deacon , Rik van Riel , Frederic Weisbecker , "Paul E. McKenney" , "linux-kernel@vger.kernel.org" , X86 ML , "H. Peter Anvin" , Catalin Marinas , Peter Zijlstra From: Chris Metcalf Message-ID: <56D895EA.1060301@mellanox.com> Date: Thu, 3 Mar 2016 14:52:10 -0500 User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [12.216.194.146] X-ClientProxiedBy: SN1PR16CA0001.namprd16.prod.outlook.com (25.169.34.11) To VI1PR05MB1694.eurprd05.prod.outlook.com (25.165.235.156) X-MS-Office365-Filtering-Correlation-Id: c5481b09-0e98-4940-5400-08d3439d58af X-Microsoft-Exchange-Diagnostics: 1;VI1PR05MB1694;2:92WwNkQ7IumRw9E50oVYjaabgbCEjmY3NRxoo52b93605vAFA9a7G8ppNFRe9bdVlXRAMhSNghZvEcD1EH1HsJZfIc84TDXwMHwjHobccr2iqQshTjhUOUCP/LvIvmrKM4OMiz5izPGpgQYxIQSoI2G697+LQMdVT2fJsbLN4Rj26fbrEOE8vjmmKFsFkMDY;3:cuUJ2U0bm0ofA5imocMpW2fj5+8CgbLD0J6o0NnHZ8e6CcEx8kTix2xaKAti8cj5/1BwbSVqc1rVDYKIFZbJ/sHYbCYlBQrVX4P8yYLcl6opdvkrv56B4i7fcU+TQ3Ys;25:kGDJZjK32i8fxhsQCdkmXNBXYiEW8JCxAZX8VFYyIzffGvwEa3vGm8rYsl0q+QWx/rJTQy/M98W2Md3ImfTxPL4/NkgHGg9ymhyNg9rnQjCu6U51wYtt6rC/YbARi+h8qtTbIcsZu3kMsWngfXZTsna43VMxEZiZGtPnQgG9qeYu4lvGi59wK0gwdOB+jZGB0x0jZDSZro7ZCIordAE5LtL6EsSl8pswtzNIBY2IChZRV1GFin2BJxHEhVtKYgviMFAwclfId/66tuSU9ajUNWi4HMk7WH4yySfvPIm46NVUAonNP+tD8gPtFl69mTqtxj1vDEi7Ks+yyAj+gEs65Q== X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:VI1PR05MB1694; X-MLNXRule-EZCH-Linux: Rule triggered X-Microsoft-Exchange-Diagnostics: 1;VI1PR05MB1694;20:TF5KVJkyuqhi4hhf7t/8EApeBM/rvjnj/elwVw5eNNAgSbXmswONrkYw7iyC4JATUcJMM1UVd2n9kOsS/gAxR14RR8o5tu9JxzC5bE3TFiSUAPDMnNk8DykJSUgcuYn8jQjv4zLypD4UO5fGldmM02xfO3QJrW8ACAYfZW9WiwpQ6r56TYbr4vaqUUtYhN1/XnL/f5TATH+yugmA0PS0vl3C0bq4CJqHkoX4PLLPND8M4ljXt3gROr2sc8fcbQruciVRUmfgr7OFFjmqtEjSIjGmr8bxzLX33MaWdw1lyQx45U0dPumVOi7N6r7an4NsR2wHuPgRxvh+i7lQAmZrDy+UuIwtLyWWJ2vLrQUShb9QE7doS/oS2C3KEchLI8iePx1FiRziJ/I3SrLNfxvhA2iS7wwRyaDaiyafxZVzSkCubOKzo/FFY+M5L6TlvfqAKLfj/3FcNbJpHl0w2NT9yFkSUOy67NXG0PBaLnFuu56yGy/IX0IQ6ZzUMgyVsE2g;4:mJYzYK7893zvinwJQip1hoHPzSsEGjn2i90D9TYCkyBr65gYNHU7nmQMn2dn16PkeImwotTJPxBVJz0qfTg8yhmapA9noTNS86AhVDReosa6rdRI11e0KtgyTGgQGIAjCmdygCA0rrr3bL6Wajzk1cBn1eF4ldHBjzY9nZhBfehGPGGCMRzLrdFa+C6KKuYtZekPO7qc/BlqcFwxarM3lrRBK2aRhNqViWi4xcFdaW9DWUBp3Vpg83jjwc9Z+vTxA5+lq5SY4xGPauXNPd/ap9ZN9GtGhO1ED7/3ucfPru3zU5tetoFMLz7QZ8aOlxvSuywlCERk0XWLaCntW+2AALaGsP+Tsyl6aU/ZdBLm+On18Yq0JhEvaH0NFX3EzsWv X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-Test: UriScan:; X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(601004)(2401047)(5005006)(8121501046)(3002001)(10201501046);SRVR:VI1PR05MB1694;BCL:0;PCL:0;RULEID:;SRVR:VI1PR05MB1694; X-Forefront-PRVS: 0870212862 X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10009020)(4630300001)(6009001)(6049001)(479174004)(24454002)(377454003)(4326007)(586003)(5008740100001)(19580405001)(77096005)(6116002)(3846002)(110136002)(1096002)(65806001)(50466002)(66066001)(65956001)(2906002)(122386002)(19580395003)(87976001)(2950100001)(23676002)(64126003)(83506001)(40100003)(86362001)(15975445007)(47776003)(50986999)(4001350100001)(76176999)(54356999)(36756003)(59896002)(5004730100002)(87266999)(92566002)(65816999)(230700001)(42186005)(189998001)(33656002)(81166005)(21314002)(18886065003);DIR:OUT;SFP:1101;SCL:1;SRVR:VI1PR05MB1694;H:[10.15.7.41];FPR:;SPF:None;MLV:sfv;LANG:en; X-Microsoft-Exchange-Diagnostics: =?utf-8?B?MTtWSTFQUjA1TUIxNjk0OzIzOjVzajE4TVJsd0hsbWhrVHdTb1J6YW5Gejkx?= =?utf-8?B?WTZVcDlQMUFiUENwL1BEZXBOM3hLb2tyMFo0TEl2UDF5Z1ZBcXAwNk5FMGVO?= =?utf-8?B?ZXQwQmZIcUFNai93U2NZblBOZEwzNVFxMlhHNE5mb04xMzlnMGtEell3TnMw?= =?utf-8?B?dkg3c1EvTzRiK0twbmQ3ODAydnl5anRqTnRndnlVdXdTbENpbk9TTWNsYllz?= =?utf-8?B?UEtvQ1BScGFwZWRkVklCWkdMZWRkelZJTFpMc2NtNktvdkFGbXhrYWtIZDhk?= =?utf-8?B?L01NdzhGMXR1TUl2b005MlZwYVlMOUxNOGpYZ2dBL0F2UGl6disxd1JETUIv?= =?utf-8?B?VUNlemdaNDRJenBVQXE2VlRHNGlPNVhKOUIycXhlUkFsaWpvTkxCNEllMGE1?= =?utf-8?B?RHdXTGxPRytjTnd5UE4wTVBzVUZ2eTRTQjlrZGFEdXh6NlprNFVtYng4RmtH?= =?utf-8?B?MzBxdktrQmNkY0dKRnVscXR6Q2E1cGYvOS80bjJlaGF1N1NkM2xrd0Uvdm9V?= =?utf-8?B?SHFwQzVuQTgyamJZZ0NBeXFCaytIdXE3N0hFYXU2NERHdVk3YWc5eWk0aXVl?= =?utf-8?B?dGVHWnJsUHVxbmtUTnlRZ25kYms4SkVJMlYwTWhGRUtZZVFFQjYvRFJyZzN6?= =?utf-8?B?MzliWUZOQWRaQWJWUVg0akExK2Z4T0VzL3JvNis0V0lsY2pyNFRQSWxrM1FX?= =?utf-8?B?bk9YeTBnbFZsRzJWNG1ac1JmcWp1c0pHQVZrYmpvdUZuTE1tWHRJM2Y5Nkxm?= =?utf-8?B?Tm55ZGxqckg2QkExRUlHZ0svWTBrVFBURFcycExEd3pLS0IzbGREdXE2U0Vy?= =?utf-8?B?dUlKK0ZLOUZZWkczYWJHU3QwK0txNVBQTjlCTTcrd25SSWNSTEhqM0FYMldF?= =?utf-8?B?RGlEOG0zUUt3RXNXWXlEMk9UajI1cHgvbW9zaDdmd0JDOFU0YU9BSXMybTdt?= =?utf-8?B?b01GMW1kRG9oaVo0ZEJLZ05SWWFYQnhRdXhnZ1p2QnNkbjgwVEZ1Kzc0cGZS?= =?utf-8?B?dXJmNUpZM1N2SDU3bjB4TzlnSGZUWjg0VzZPMTdqcURFQXBMbEkzanJaRkVM?= =?utf-8?B?NzdxK3A1WGM3bVBMZjRtN2VwRnE4VzJYZWJZM3ZnTldUODd0cVNUQjBVeTN4?= =?utf-8?B?TEw1SkhtRFZsUDBYSlFiWjFzeUZEbjB4cjZzK3pzNVZCcUIvWWRjVXhML2JY?= =?utf-8?B?a1RiRkM1Ykl6dFJLakdMTVVxYTAvbUY2a1RuN0h1ZnUwdnVjcGtiMHk5R2pM?= =?utf-8?B?V0lsVFpvR2tNeGZtM0twcHBmTUZoM2UrdTJsbURjbFYwb2twMnVUUkVvdThL?= =?utf-8?B?ZTVUNk1jeGRtcGxET0RlNTJPaEF2b2NPUWhZYWdpbnRvM05sNWNBdU5CYmw5?= =?utf-8?B?SGpLSS9EM2cxYlJGWHJSV3lidkl6SWZkV0Z3eC9wak9FSm0xNVB6T3ArM2V3?= =?utf-8?B?OC81R0xQNVJmem5pcEVzRFBtWUFnd1B5ZzJBUGdqZVhVREJRU1dkcUJHMitW?= =?utf-8?B?emo4dEprRG42ZUxBS1Z0cVRXMCtYbEVWY0xvOHNHOEZybGNRTmNQUVJwNkxi?= =?utf-8?B?a3ljY0dSOGQ4NTZCaXdBRUNYZ2svSzRvSEliam9mOUlPV3lDY0lDZkp6dE00?= =?utf-8?B?bXluVXNza3Rjd25iMm9UeExLc2RkeVZpVGtMTisrQzNOU1Q5US9BR3MvTE84?= =?utf-8?Q?KTd39ecGhm+GBCP/Kk=3D?= X-Microsoft-Exchange-Diagnostics: 1;VI1PR05MB1694;5:xNvi4CLv3g9S2x24CAs3Wd0cyTPdYM+d55ZcIo+qM/AgZGQhzIvP3+Gru2+wBdf8Afkee1yreOLi/szcK/BR5/BJA4wHMOk4eebgOB6BluCwR+i997cG/KRJMFVfDW6D1p4oG+SKrRd1yZ463rPD4A==;24:u4ZZ4l61bDD18S2IypkVJv/79VEamh6duEntcVDqo1OJEUEhyrrH/juBw/ADi4oHeUD16QDjZ8F2VWIu9RWd1niraEwR89vQlhTDT6OZmIU= X-OriginatorOrg: Mellanox.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 03 Mar 2016 19:52:22.3056 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-Transport-CrossTenantHeadersStamped: VI1PR05MB1694 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6650 Lines: 155 On 03/02/2016 07:36 PM, Andy Lutomirski wrote: > On Mar 2, 2016 12:10 PM, "Chris Metcalf" wrote: >> In prepare_exit_to_usermode(), call task_isolation_ready() >> when we are checking the thread-info flags, and after we've handled >> the other work, call task_isolation_enter() unconditionally. >> >> In syscall_trace_enter_phase1(), we add the necessary support for >> strict-mode detection of syscalls. >> >> We add strict reporting for the kernel exception types that do >> not result in signals, namely non-signalling page faults and >> non-signalling MPX fixups. >> >> Signed-off-by: Chris Metcalf >> --- >> arch/x86/entry/common.c | 18 ++++++++++++++++-- >> arch/x86/kernel/traps.c | 2 ++ >> arch/x86/mm/fault.c | 2 ++ >> 3 files changed, 20 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >> index 03663740c866..27c71165416b 100644 >> --- a/arch/x86/entry/common.c >> +++ b/arch/x86/entry/common.c >> @@ -21,6 +21,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch) >> */ >> if (work & _TIF_NOHZ) { >> enter_from_user_mode(); >> + if (task_isolation_check_syscall(regs->orig_ax)) { >> + regs->orig_ax = -1; >> + return 0; >> + } > This needs a comment indicating the intended semantics. > And I've still heard no explanation of why this part can't use seccomp. Here's an excerpt from my earlier reply to you from: https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com Admittedly this patch series has been moving very slowly through review, so it's not surprising we have to revisit some things! On 07/21/2015 03:34 PM, Chris Metcalf wrote: > On 07/13/2015 05:47 PM, Andy Lutomirski wrote: >> If a user wants a syscall to kill them, use >> seccomp. The kernel isn't at fault if the user does a syscall when it >> didn't want to enter the kernel. > > Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT > was to what I wanted here. One concern is that there doesn't seem > to be a way to "escape" from seccomp strict mode, i.e. you can't > call seccomp() again to turn it off - which makes sense for seccomp > since it's a security issue, but not so much sense with cpu_isolated. > > So, do you think there's a good role for the seccomp() API to play > in achieving this goal? It's certainly not a question of "the kernel at > fault" but rather "asking the kernel to help catch user mistakes" > (typically third-party libraries in our customers' experience). You > could imagine a SECCOMP_SET_MODE_ISOLATED or something. > > Alternatively, we could stick with the API proposed in my patch > series, or something similar, and just try to piggy-back on the seccomp > internals to make it happen. It would require Kconfig to ensure > that SECCOMP was enabled though, which obviously isn't currently > required to do cpu isolation. On looking at this again just now, one thing that strikes me is that it may not be necessary to forbid the syscall like seccomp does. It may be sufficient just to trigger the task isolation strict signal and then allow the syscall to complete. After all, we don't "fail" any of the other things that upset strict mode, like page faults; we let them complete, but add a signal. So for consistency, I think it may in fact make sense to simply trigger the signal but let the syscall do its thing. After all, perhaps the signal is handled and logged and we don't mind having the application continue; the signal handler can certainly choose to fail hard, or in the usual case of no signal handler, that kills the task just fine too. Allowing the syscall to complete is really kind of incidental. After the changes proposed lower down in this email, this call site will end up looking like: /* Generate a task isolation strict signal if necessary. */ if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict()) task_isolation_syscall(regs->orig_ax); But do let me know if you think more specifically that there is a way seccomp can be helpful. >> work &= ~_TIF_NOHZ; >> } >> #endif >> @@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags) >> if (cached_flags & _TIF_USER_RETURN_NOTIFY) >> fire_user_return_notifiers(); >> >> + task_isolation_enter(); >> + >> /* Disable IRQs and retry */ >> local_irq_disable(); >> >> cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags); >> >> - if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) >> + if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) && >> + task_isolation_ready()) >> break; >> >> } >> } >> >> +#ifdef CONFIG_TASK_ISOLATION >> +# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_NOHZ) >> +#else >> +# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS >> +#endif >> + > TIF_NOHZ is already a hack and IMO this just makes it worse. At the > very least this should have a comment. It really ought to be either a > static_branch or a flag that's actually switched per cpu. > But this is also a confusing change. Don't override the definition > here -- stick it in the header where it belongs. The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header. The more I look at this, though, the more I think it would be cleanest to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have enabled task isolation. That can be used unconditionally to check to see if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(), and also means that non-task-isolation tasks don't need to go into the exit loop unconditionally, which is obviously a performance drag for them. So I will move the EXIT_TO_USERMODE_FLAGS definition to immediately follow the EXIT_TO_USERMODE_LOOP_FLAGS definition and write it this way: /* * Task-isolated tasks have to enter exit_to_usermode_loop unconditionally, * but the state isn't cleared in the loop. */ #define EXIT_TO_USERMODE_FLAGS \ (EXIT_TO_USERMODE_LOOP_FLAGS | _TIF_TASK_ISOLATION) Thanks for the feedback! -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com