Subject: Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality
From: Chris Metcalf
To: Andy Lutomirski
CC: Thomas Gleixner, Christoph Lameter, Andrew Morton, Viresh Kumar, Ingo Molnar, Steven Rostedt, Tejun Heo, Gilad Ben Yossef, Will Deacon, Rik van Riel, Frederic Weisbecker, "Paul E. McKenney", "linux-kernel@vger.kernel.org", X86 ML, "H. Peter Anvin", Catalin Marinas, Peter Zijlstra
Date: Mon, 7 Mar 2016 15:51:21 -0500
Message-ID: <56DDE9C9.5060900@mellanox.com>
References: <1456949376-4910-1-git-send-email-cmetcalf@ezchip.com> <1456949376-4910-10-git-send-email-cmetcalf@ezchip.com> <56D895EA.1060301@mellanox.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/03/2016 06:46 PM, Andy Lutomirski wrote:
> On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf wrote:
>> On 03/02/2016 07:36 PM, Andy Lutomirski wrote:
>>> On Mar 2, 2016 12:10 PM, "Chris Metcalf" wrote:
>>>> In prepare_exit_to_usermode(), call task_isolation_ready()
>>>> when we are checking the thread-info flags, and after we've handled
>>>> the other work, call task_isolation_enter() unconditionally.
>>>>
>>>> In syscall_trace_enter_phase1(), we add the necessary support for
>>>> strict-mode detection of syscalls.
>>>> [...]
>>>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>>>           */
>>>>          if (work & _TIF_NOHZ) {
>>>>                  enter_from_user_mode();
>>>> +               if (task_isolation_check_syscall(regs->orig_ax)) {
>>>> +                       regs->orig_ax = -1;
>>>> +                       return 0;
>>>> +               }
>>> This needs a comment indicating the intended semantics.
>>> And I've still heard no explanation of why this part can't use seccomp.
>>
>> Here's an excerpt from my earlier reply to you from:
>>
>> https://lkml.kernel.org/r/55AE9EAC.4010202@ezchip.com
>>
>> Admittedly this patch series has been moving very slowly through
>> review, so it's not surprising we have to revisit some things!
>>
>> On 07/21/2015 03:34 PM, Chris Metcalf wrote:
>>> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>>> If a user wants a syscall to kill them, use
>>>> seccomp. The kernel isn't at fault if the user does a syscall when it
>>>> didn't want to enter the kernel.
>>>
>>> Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
>>> was to what I wanted here. One concern is that there doesn't seem
>>> to be a way to "escape" from seccomp strict mode, i.e. you can't
>>> call seccomp() again to turn it off - which makes sense for seccomp
>>> since it's a security issue, but not so much sense with cpu_isolated.
>>>
>>> So, do you think there's a good role for the seccomp() API to play
>>> in achieving this goal? It's certainly not a question of "the kernel at
>>> fault" but rather "asking the kernel to help catch user mistakes"
>>> (typically third-party libraries in our customers' experience). You
>>> could imagine a SECCOMP_SET_MODE_ISOLATED or something.
>>>
>>> Alternatively, we could stick with the API proposed in my patch
>>> series, or something similar, and just try to piggy-back on the seccomp
>>> internals to make it happen. It would require Kconfig to ensure
>>> that SECCOMP was enabled though, which obviously isn't currently
>>> required to do cpu isolation.
>>
>> On looking at this again just now, one thing that strikes me is that
>> it may not be necessary to forbid the syscall like seccomp does.
>> It may be sufficient just to trigger the task isolation strict signal
>> and then allow the syscall to complete. After all, we don't "fail"
>> any of the other things that upset strict mode, like page faults; we
>> let them complete, but add a signal. So for consistency, I think it
>> may in fact make sense to simply trigger the signal but let the
>> syscall do its thing.
>> After all, perhaps the signal is handled
>> and logged and we don't mind having the application continue; the
>> signal handler can certainly choose to fail hard, or in the usual
>> case of no signal handler, that kills the task just fine too.
>> Allowing the syscall to complete is really kind of incidental.
> No, don't do that. First, if you have a signal pending, a lot of
> syscalls will abort with -EINTR. Second, if you fire a signal on
> entry via sigreturn, you're not going to like the results.

OK, you've convinced me to stick with the previous model of just
forbidding the syscall in this case.

> Let task isolation users who want to detect when they screw up and do
> a syscall do it with seccomp.

Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.

Or were you thinking about a new seccomp API that allows this?

Or were you thinking that I could just use seccomp internals, i.e.
allow the prctl() to set a special SECCOMP_MODE_TASK_ISOLATION and
handle it appropriately in seccomp_phase1(), maybe? But, not touch
the actual seccomp() API?

I'm happy to spec something out, but I'd definitely benefit from some
sense from you as to what you think is the better approach.

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
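
[Illustrative note] The difference argued above, between a syscall prohibition the task can lift voluntarily and seccomp strict mode, which cannot be relaxed once set, can be modelled in a few lines of user-space C. This is only a sketch and not kernel code. Every `model_*` name, both structs, and the exemption of prctl() from the isolation check are assumptions made for illustration; only the "set orig_ax to -1" step is taken from the quoted hunk. The syscall numbers are the x86-64 ones and serve purely as labels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
struct model_regs {
    long orig_ax;            /* syscall number on entry, as in the quoted hunk */
};

struct model_task {
    bool isolation_strict;   /* toggled voluntarily by the task itself */
    bool seccomp_strict;     /* one-way: can be set, never cleared */
};

/* x86-64 syscall numbers, used here only as labels for the model. */
enum {
    MODEL_NR_READ = 0,
    MODEL_NR_WRITE = 1,
    MODEL_NR_OPEN = 2,
    MODEL_NR_RT_SIGRETURN = 15,
    MODEL_NR_EXIT = 60,
    MODEL_NR_PRCTL = 157,
};

/* Models the prctl() that turns the isolation prohibition on or off. */
static void model_set_isolation_strict(struct model_task *t, bool on)
{
    assert(t != NULL);
    t->isolation_strict = on;
}

/*
 * Assumption of this sketch: prctl() is exempt, so the task can remove
 * the prohibition again. Returns true if the syscall is forbidden.
 */
static bool model_isolation_check_syscall(const struct model_task *t, long nr)
{
    return t->isolation_strict && nr != MODEL_NR_PRCTL;
}

/* Same shape as the quoted hunk: forbid by rewriting orig_ax to -1. */
static bool model_syscall_entry(const struct model_task *t,
                                struct model_regs *regs)
{
    assert(t != NULL && regs != NULL);
    if (model_isolation_check_syscall(t, regs->orig_ax)) {
        regs->orig_ax = -1;
        return true;         /* syscall was forbidden */
    }
    return false;            /* syscall proceeds normally */
}

/* seccomp strict mode permits only read, write, _exit and sigreturn. */
static bool model_seccomp_allows(const struct model_task *t, long nr)
{
    if (!t->seccomp_strict)
        return true;
    return nr == MODEL_NR_READ || nr == MODEL_NR_WRITE ||
           nr == MODEL_NR_EXIT || nr == MODEL_NR_RT_SIGRETURN;
}

static void model_seccomp_set_strict(struct model_task *t)
{
    assert(t != NULL);
    t->seccomp_strict = true;
}

/*
 * Relaxing strict mode would itself require a syscall (modelled here as
 * prctl) that strict mode does not permit, so the request cannot succeed.
 */
static int model_seccomp_relax(struct model_task *t)
{
    assert(t != NULL);
    if (t->seccomp_strict && !model_seccomp_allows(t, MODEL_NR_PRCTL))
        return -1;
    t->seccomp_strict = false;
    return 0;
}
```

In this model a task can call model_set_isolation_strict(t, true), have a stray open() rejected by model_syscall_entry(), and later switch the mode off again. After model_seccomp_set_strict(t), model_seccomp_relax(t) always fails. That is the "less permissive to more permissive" gap described in the message.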