Subject: Re: [PATCH v4 1/5] nohz_full: add support for "cpu_isolated" mode
To: Andy Lutomirski
CC: Gilad Ben Yossef, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
    Andrew Morton, Rik van Riel, Tejun Heo, Frederic Weisbecker,
    Thomas Gleixner, "Paul E. McKenney", Christoph Lameter, Viresh Kumar,
    "linux-doc@vger.kernel.org", Linux API, "linux-kernel@vger.kernel.org"
From: Chris Metcalf
Date: Tue, 21 Jul 2015 15:10:54 -0400
Message-ID: <55AE993E.6040501@ezchip.com>
References: <1436817481-8732-1-git-send-email-cmetcalf@ezchip.com>
    <1436817481-8732-2-git-send-email-cmetcalf@ezchip.com>
    <55A4271B.9040506@ezchip.com>

Sorry for the delay in responding; some other priorities came up
internally.

On 07/13/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Jul 13, 2015 at 2:01 PM, Chris Metcalf wrote:
>> On 07/13/2015 04:40 PM, Andy Lutomirski wrote:
>>> On Mon, Jul 13, 2015 at 12:57 PM, Chris Metcalf wrote:
>>>> The existing nohz_full mode makes tradeoffs to minimize userspace
>>>> interruptions while still attempting to avoid overheads in the
>>>> kernel entry/exit path, to provide 100% kernel semantics, etc.
>>>>
>>>> However, some applications require a stronger commitment from the
>>>> kernel to avoid interruptions, in particular userspace device
>>>> driver style applications, such as high-speed networking code.
>>>>
>>>> This change introduces a framework to allow applications to elect
>>>> to have the stronger semantics as needed, specifying
>>>> prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE) to do so.
>>>> Subsequent commits will add additional flags and additional
>>>> semantics.
>>> I thought the general consensus was that this should be the default
>>> behavior and that any associated bugs should be fixed.
>>
>> I think it comes down to dividing the set of use cases in two:
>>
>> - "Regular" nohz_full, as used to improve performance and limit
>> interruptions, possibly for power benefits, etc. But, stray
>> interrupts are not particularly bad, and you don't want to take
>> extreme measures to avoid them.
>>
>> - What I'm calling "cpu_isolated" mode where when you return to
>> userspace, you expect that by God, the kernel doesn't interrupt you
>> again, and if it does, it's a flat-out bug.
>>
>> There are a few things that cpu_isolated mode currently does to
>> accomplish its goals that are pretty heavy-weight:
>>
>> Processes are held in kernel space until ticks are quiesced; this is
>> not necessarily what every nohz_full task wants. If a task makes a
>> kernel call, there may well be arbitrary timer fallout, and having a
>> way to select whether or not you are willing to take a timer tick after
>> return to userspace is pretty important.
> Then shouldn't deferred work be done immediately in nohz_full mode
> regardless? What is this delayed work that's being done?

I'm thinking of things like needing to wait for an RCU quiesce
period to complete.

In the current version, there's also the vmstat_update() that
may schedule delayed work and interrupt the core again
shortly before realizing that there are no more counter updates
happening, at which point it quiesces. Currently we handle
this in cpu_isolated mode simply by spinning and waiting for
the timer interrupts to complete.

>> Likewise, there are things that you may want to do on return to
>> userspace that are designed to prevent further interruptions in
>> cpu_isolated mode, even at a possible future performance cost if and
>> when you return to the kernel, such as flushing the per-cpu free page
>> list so that you won't be interrupted by an IPI to flush it later.
> Why not just kick the per-cpu free page over to whatever cpu is
> monitoring your RCU state, etc? That should be very quick.

So just for the sake of precision, the thing I'm talking about
is the lru_add_drain() call on kernel exit. Are you proposing
that we call that for every nohz_full core on kernel exit?
I'm not opposed to this, but I don't know if other nohz
developers feel like this is the right tradeoff.

Similarly, addressing the vmstat_update() issue above, in
cpu_isolated mode we might want to have a follow-on
patch that forces the vmstat system into quiesced state
on return to userspace. We would need to do this
unconditionally on all nohz_full cores if we tried to combine
the current nohz_full with my proposed cpu_isolated
functionality. Again, I'm not necessarily opposed, but
I suspect other nohz developers might not want this.
(I didn't want to introduce such a patch as part of this
series since it pulls in even more interested parties, and
it gets harder and harder to get to consensus.)

>> If you're arguing that the cpu_isolated semantic is really the only
>> one that makes sense for nohz_full, my sense is that it might be
>> surprising to many of the folks who do nohz_full work. But, I'm happy
>> to be wrong on this point, and maybe all the nohz_full community is
>> interested in making the same tradeoffs for nohz_full generally that
>> I've proposed in this patch series just for cpu_isolated?
> nohz_full is currently dog slow for no particularly good reasons. I
> suspect that the interrupts you're seeing are also there for no
> particularly good reasons as well.
>
> Let's fix them instead of adding new ABIs to work around them.

Well, in principle if we accepted my proposed patch series
and then over time came to decide that it was reasonable
for nohz_full to have these complete cpu isolation
semantics, the one proposed ABI simply becomes a no-op.
So it's not as problematic an ABI as some.

My issue is this: I'm totally happy with submitting a revised
patch series that does all the stuff for pure nohz_full that
I'm currently proposing for cpu_isolated.
But, is it what the community wants? Should I propose it
and see?

Frederic, do you have any insight here? Thanks!

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
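
For readers following the thread, here is a rough sketch of what the
application-side opt-in quoted above,
prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE), might look like.
It rests on assumptions: the series is only a proposal, so neither
constant is in mainline headers, and the numeric values below are
arbitrary placeholders rather than the numbers used by the patches.
enable_cpu_isolated() is a hypothetical helper name. On a kernel
without the series the call is expected to fail, most likely with
EINVAL.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: these constants come from the proposed patch series and
 * are not in mainline. The values are placeholders invented for this
 * sketch; headers from a patched kernel, if present, take precedence.
 */
#ifndef PR_SET_CPU_ISOLATED
#define PR_SET_CPU_ISOLATED	0x4349534f
#endif
#ifndef PR_CPU_ISOLATED_ENABLE
#define PR_CPU_ISOLATED_ENABLE	1UL
#endif

/*
 * Hypothetical helper: ask the kernel for the stricter cpu_isolated
 * semantics for the calling task.
 * Returns 0 on success, -1 on failure with errno as set by prctl().
 */
int enable_cpu_isolated(void)
{
	int rc = prctl(PR_SET_CPU_ISOLATED, PR_CPU_ISOLATED_ENABLE,
		       0UL, 0UL, 0UL);

	if (rc < 0) {
		int saved_errno = errno;	/* fprintf() may clobber errno */

		fprintf(stderr, "cpu_isolated not enabled: %s\n",
			strerror(saved_errno));
		errno = saved_errno;
		return -1;
	}
	return 0;
}
```

A task would presumably make this call once, before entering its
userspace polling loop, and fall back to plain nohz_full behavior if
it fails.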