Message-ID: <1498150967.2503.4.camel@redhat.com>
Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
From: Paolo Abeni
To: Ingo Molnar
Cc: x86@kernel.org, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Al Viro, Kees Cook, Hannes Frederic Sowa, linux-kernel@vger.kernel.org, Linus Torvalds, Andy Lutomirski
Date: Thu, 22 Jun 2017 19:02:47 +0200
In-Reply-To: <20170622084732.xjnd5gx77ftaozem@gmail.com>
References: <63d913f28bc64bd4ea66a39a532f0b59ee015382.1498039056.git.pabeni@redhat.com> <20170622084732.xjnd5gx77ftaozem@gmail.com>

On Thu, 2017-06-22 at 10:47 +0200, Ingo Molnar wrote:
> * Paolo Abeni wrote:
>
> > The 'rep' prefix suffers for a relevant "setup cost"; as a result
> > string copies with unrolled loops are faster than even
> > optimized string copy using 'rep' variant, for short string.
> >
> > This change updates __copy_user_generic() to use the unrolled
> > version for small string length. The threshold length for short
> > string - 64 - has been selected with empirical measures as the
> > larger value that still ensure a measurable gain.
> >
> > A micro-benchmark of __copy_from_user() with different lengths shows
> > the following:
> >
> > string len  vanilla  patched  delta
> > bytes       ticks    ticks    tick(%)
> >
> > 0           58       26       32(55%)
> > 1           49       29       20(40%)
> > 2           49       31       18(36%)
> > 3           49       32       17(34%)
> > 4           50       34       16(32%)
> > 5           49       35       14(28%)
> > 6           49       36       13(26%)
> > 7           49       38       11(22%)
> > 8           50       31       19(38%)
> > 9           51       33       18(35%)
> > 10          52       36       16(30%)
> > 11          52       37       15(28%)
> > 12          52       38       14(26%)
> > 13          52       40       12(23%)
> > 14          52       41       11(21%)
> > 15          52       42       10(19%)
> > 16          51       34       17(33%)
> > 17          51       35       16(31%)
> > 18          52       37       15(28%)
> > 19          51       38       13(25%)
> > 20          52       39       13(25%)
> > 21          52       40       12(23%)
> > 22          51       42       9(17%)
> > 23          51       46       5(9%)
> > 24          52       35       17(32%)
> > 25          52       37       15(28%)
> > 26          52       38       14(26%)
> > 27          52       39       13(25%)
> > 28          52       40       12(23%)
> > 29          53       42       11(20%)
> > 30          52       43       9(17%)
> > 31          52       44       8(15%)
> > 32          51       36       15(29%)
> > 33          51       38       13(25%)
> > 34          51       39       12(23%)
> > 35          51       41       10(19%)
> > 36          52       41       11(21%)
> > 37          52       43       9(17%)
> > 38          51       44       7(13%)
> > 39          52       46       6(11%)
> > 40          51       37       14(27%)
> > 41          50       38       12(24%)
> > 42          50       39       11(22%)
> > 43          50       40       10(20%)
> > 44          50       42       8(16%)
> > 45          50       43       7(14%)
> > 46          50       43       7(14%)
> > 47          50       45       5(10%)
> > 48          50       37       13(26%)
> > 49          49       38       11(22%)
> > 50          50       40       10(20%)
> > 51          50       42       8(16%)
> > 52          50       42       8(16%)
> > 53          49       46       3(6%)
> > 54          50       46       4(8%)
> > 55          49       48       1(2%)
> > 56          50       39       11(22%)
> > 57          50       40       10(20%)
> > 58          49       42       7(14%)
> > 59          50       42       8(16%)
> > 60          50       46       4(8%)
> > 61          50       47       3(6%)
> > 62          50       48       2(4%)
> > 63          50       48       2(4%)
> > 64          51       38       13(25%)
> >
> > Above 64 bytes the gain fades away.
> >
> > Very similar values are collectd for __copy_to_user().
> > UDP receive performances under flood with small packets using recvfrom()
> > increase by ~5%.
>
> What CPU model(s) were used for the performance testing and was it performance
> tested on several different types of CPUs?
>
> Please add a comment here:
>
> +	if (len <= 64)
> +		return copy_user_generic_unrolled(to, from, len);
> +
>
> ... because it's not obvious at all that this is a performance optimization, not a
> correctness issue. Also explain that '64' is a number that we got from performance
> measurements.
>
> But in general I like the change - as long as it was measured on reasonably modern
> x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thank you for reviewing this.

I'll add a hopefully descriptive comment in v2.

The above figures are for an Intel Xeon E5-2690 v4.

I see similar data points with an i7-6500U CPU, while an i7-4810MQ
shows slightly better improvements.

I'm in the process of collecting more figures for AMD processors, which
I don't have so handy - it may take some time.

Thanks,

Paolo
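
For illustration only, the kind of comment requested above might look like
the following userspace sketch. It is not the v2 patch. The names
copy_dispatch(), copy_unrolled(), copy_bulk() and SHORT_COPY_THRESHOLD are
hypothetical stand-ins, memcpy() takes the place of the 'rep'-based kernel
routine, and none of the fault handling that the real uaccess helpers
perform is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name; 64 is the value quoted in the thread. */
#define SHORT_COPY_THRESHOLD 64

/* Stand-in for the unrolled copy: 8-byte chunks, then a byte tail. */
static size_t copy_unrolled(void *to, const void *from, size_t len)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	while (len >= 8) {
		memcpy(d, s, 8);
		d += 8;
		s += 8;
		len -= 8;
	}
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
	return 0;	/* bytes left uncopied; this sketch never faults */
}

/* Stand-in for the 'rep'-based bulk copy. */
static size_t copy_bulk(void *to, const void *from, size_t len)
{
	memcpy(to, from, len);
	return 0;
}

static size_t copy_dispatch(void *to, const void *from, size_t len)
{
	assert(to != NULL && from != NULL);

	/*
	 * Performance optimization, not a correctness requirement:
	 * the 'rep' prefix has a noticeable setup cost, so short copies
	 * go through the unrolled loop instead. The threshold is an
	 * empirical value taken from micro-benchmarks, chosen as the
	 * largest length that still showed a measurable gain.
	 */
	if (len <= SHORT_COPY_THRESHOLD)
		return copy_unrolled(to, from, len);

	return copy_bulk(to, from, len);
}
```

Both paths copy the same bytes and return the same value, so the branch
only selects which routine runs; that is the property the comment is meant
to make obvious to a reader.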