Subject: Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count
From: Chris Metcalf
To: Peter Zijlstra
CC: John Stultz, Thomas Gleixner, Salman Qazi, Paul Turner, Tony Lindgren, Steven Miao, lkml
Date: Thu, 17 Nov 2016 15:00:14 -0500
References: <1479324933-8161-1-git-send-email-cmetcalf@mellanox.com> <20161117095343.GF3142@twins.programming.kicks-ass.net>
In-Reply-To: <20161117095343.GF3142@twins.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/17/2016 4:53 AM, Peter Zijlstra wrote:
> On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote:
>> PeterZ (cc'ed) then improved it to use __int128 math via
>> mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
>> instead of two, but the multiply is handled by an out-of-line call to
>> __multi3, and the sched_clock() function ends up about 2.5x slower as
>> a result.
>
> Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces
> to 2 32x32->64 multiplications, of which one is conditional on there
> actually being bits set in the high word of the u64 argument.

I didn't notice that. It took me down an interesting rathole.

Obviously the branch optimization won't help on cycle counter values,
since we blow out of the low 32 bits in the first few seconds of
uptime. So the conditional test won't help, but the 32x32 multiply
optimizations should.

However, I was surprised to discover that the compiler doesn't always
catch the 32x32 case. It does for simple cases on gcc 4.4, but if you
change the compiler version or the complexity of the code, it can lose
sight of the optimization opportunity. That is what happens in
mul_u64_u32_shr(), where we get 64x64 multiplies. I passed this along
to our compiler team as an optimization bug.

Given that, it turns out it's always faster to do the unconditional
path on tilegx. The basic machine instruction is a 32x32
multiply-accumulate, but unlike most tilegx instructions, it causes a
1-cycle RAW hazard stall if you try to use the result in the next
instruction. Similarly, mispredicted branches take a 1-cycle stall.
The unconditional code pipelines the multiplies completely and runs in
10 cycles; the conditional code has two RAW hazard stalls and a branch
stall, so it takes 12 cycles even when it skips the second multiply.

Working around the missed compiler optimization by taking the existing
mul_u64_u32_shr() and replacing "*" with calls to __insn_mul_lu_lu()
to use the compiler builtin gives a 10-cycle function (assuming we
have to do both multiplies). So this is the same performance as the
pipelined mult_frac() that does the overlapping 64x64 multiplies.
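
For readers following along, the two shapes compared above can be
sketched in portable userspace C. This is only an illustration under
stated assumptions, not the kernel's code: the names
mul_shr_conditional() and mul_shr_unconditional() are hypothetical,
plain "*" on 32-bit operands stands in for the tilegx builtin, and
shift is assumed to be between 0 and 32. The cycle counts quoted in
this message apply to the tilegx code, not to this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical name. Conditional form: two 32x32->64 multiplies, with
 * the second one skipped when the high word of 'a' is zero.
 * Assumes 0 <= shift <= 32.
 */
static inline uint64_t mul_shr_conditional(uint64_t a, uint32_t mul,
					   unsigned int shift)
{
	uint32_t al = (uint32_t)a;         /* low 32 bits of a */
	uint32_t ah = (uint32_t)(a >> 32); /* high 32 bits of a */
	uint64_t ret = ((uint64_t)al * mul) >> shift;

	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}

/*
 * Hypothetical name. Unconditional form: always does both multiplies,
 * so there is no branch to mispredict. Assumes 0 <= shift <= 32.
 */
static inline uint64_t mul_shr_unconditional(uint64_t a, uint32_t mul,
					     unsigned int shift)
{
	uint32_t al = (uint32_t)a;
	uint32_t ah = (uint32_t)(a >> 32);

	return (((uint64_t)al * mul) >> shift) +
	       (((uint64_t)ah * mul) << (32 - shift));
}
```

Both forms return the same value for any input in range. For a cycle
counter the high word is almost always nonzero, so the branch in the
conditional form is nearly always taken and saves no multiply.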

We can do better by providing a completely hand-rolled version of the
function, either using "*" if the compiler optimization is fixed, or
__insn_mul_lu_lu() if it isn't, that doesn't do a conditional branch:

static inline u64 mul_u64_u32_shr(u64 a, u64 mul, unsigned int shift)
{
    return (__insn_mul_lu_lu(a, mul) >> shift) +
        (__insn_mul_lu_lu(a >> 32, mul) << (32 - shift));
}

This compiles down to 5 cycles with no hazard stalls. It's not
completely clear where I'd put this to override the version;
presumably in ? Of course I'd then also have to make it conditional on
__tilegx__, since tilepro has a different set of multiply
instructions, as it's an ILP32 ISA.

I'm a little dubious that it's worth the investment in build
scaffolding to do this to save 5 cycles, so I think for now I will
just keep the mult_frac() version, and push it to stable to fix the
bug with the cycle counter overflowing. Depending on what/when I hear
back from the compiler team, I will think about saving those few extra
cycles with a custom mul_u64_u32_shr().

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com