DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 1A8B57C82F
Message-ID: <1498150967.2503.4.camel@redhat.com>
Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
From: Paolo Abeni <pabeni@redhat.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: x86@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
        Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
        Al Viro <viro@zeniv.linux.org.uk>, Kees Cook <keescook@chromium.org>,
        Hannes Frederic Sowa <hannes@stressinduktion.org>,
        linux-kernel@vger.kernel.org,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>
Date: Thu, 22 Jun 2017 19:02:47 +0200
In-Reply-To: <20170622084732.xjnd5gx77ftaozem@gmail.com>
References: <63d913f28bc64bd4ea66a39a532f0b59ee015382.1498039056.git.pabeni@redhat.com>
         <20170622084732.xjnd5gx77ftaozem@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3447
Lines: 122

On Thu, 2017-06-22 at 10:47 +0200, Ingo Molnar wrote:
> * Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > The 'rep' prefix suffers from a significant "setup cost"; as a result,
> > string copies with unrolled loops are faster than even the
> > optimized 'rep'-based string copy for short strings.
> > 
> > This change updates __copy_user_generic() to use the unrolled
> > version for small string lengths. The threshold length for short
> > strings - 64 - has been selected with empirical measurements as the
> > largest value that still ensures a measurable gain.
> > 
> > A micro-benchmark of __copy_from_user() with different lengths shows
> > the following:
> > 
> > string len	vanilla		patched 	delta
> > bytes		ticks		ticks		tick(%)
> > 
> > 0		58		26		32(55%)
> > 1		49		29		20(40%)
> > 2		49		31		18(36%)
> > 3		49		32		17(34%)
> > 4		50		34		16(32%)
> > 5		49		35		14(28%)
> > 6		49		36		13(26%)
> > 7		49		38		11(22%)
> > 8		50		31		19(38%)
> > 9		51		33		18(35%)
> > 10		52		36		16(30%)
> > 11		52		37		15(28%)
> > 12		52		38		14(26%)
> > 13		52		40		12(23%)
> > 14		52		41		11(21%)
> > 15		52		42		10(19%)
> > 16		51		34		17(33%)
> > 17		51		35		16(31%)
> > 18		52		37		15(28%)
> > 19		51		38		13(25%)
> > 20		52		39		13(25%)
> > 21		52		40		12(23%)
> > 22		51		42		9(17%)
> > 23		51		46		5(9%)
> > 24		52		35		17(32%)
> > 25		52		37		15(28%)
> > 26		52		38		14(26%)
> > 27		52		39		13(25%)
> > 28		52		40		12(23%)
> > 29		53		42		11(20%)
> > 30		52		43		9(17%)
> > 31		52		44		8(15%)
> > 32		51		36		15(29%)
> > 33		51		38		13(25%)
> > 34		51		39		12(23%)
> > 35		51		41		10(19%)
> > 36		52		41		11(21%)
> > 37		52		43		9(17%)
> > 38		51		44		7(13%)
> > 39		52		46		6(11%)
> > 40		51		37		14(27%)
> > 41		50		38		12(24%)
> > 42		50		39		11(22%)
> > 43		50		40		10(20%)
> > 44		50		42		8(16%)
> > 45		50		43		7(14%)
> > 46		50		43		7(14%)
> > 47		50		45		5(10%)
> > 48		50		37		13(26%)
> > 49		49		38		11(22%)
> > 50		50		40		10(20%)
> > 51		50		42		8(16%)
> > 52		50		42		8(16%)
> > 53		49		46		3(6%)
> > 54		50		46		4(8%)
> > 55		49		48		1(2%)
> > 56		50		39		11(22%)
> > 57		50		40		10(20%)
> > 58		49		42		7(14%)
> > 59		50		42		8(16%)
> > 60		50		46		4(8%)
> > 61		50		47		3(6%)
> > 62		50		48		2(4%)
> > 63		50		48		2(4%)
> > 64		51		38		13(25%)
> > 
> > Above 64 bytes the gain fades away.
> > 
> > Very similar values are collected for __copy_to_user().
> > UDP receive performance under flood with small packets using recvfrom()
> > increases by ~5%.
> 
> What CPU model(s) were used for the performance testing and was it performance 
> tested on several different types of CPUs?
> 
> Please add a comment here:
> 
> +       if (len <= 64)
> +               return copy_user_generic_unrolled(to, from, len);
> +
> 
> ... because it's not obvious at all that this is a performance optimization, not a 
> correctness issue. Also explain that '64' is a number that we got from performance 
> measurements.
> 
> But in general I like the change - as long as it was measured on reasonably modern 
> x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thank you for reviewing this.

I'll add a hopefully descriptive comment in v2.
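
Roughly along these lines. This is only a userspace sketch to show the
wording of the comment, not the actual v2 hunk: SHORT_COPY_THRESHOLD,
copy_user_sketch() and the two *_stub() helpers are made-up names, and
the stubs just call memcpy(), while the real copy routines are assembly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the cutoff; the patch hard-codes 64. */
#define SHORT_COPY_THRESHOLD 64

/* Records which path was taken, for illustration only. */
enum copy_path { COPY_PATH_UNROLLED, COPY_PATH_REP };
static enum copy_path last_path;

/* Userspace stand-in for the unrolled copy routine (assumption). */
static unsigned long copy_unrolled_stub(void *to, const void *from, size_t len)
{
	last_path = COPY_PATH_UNROLLED;
	memcpy(to, from, len);
	return 0;	/* number of bytes NOT copied */
}

/* Userspace stand-in for the 'rep'-based copy routine (assumption). */
static unsigned long copy_rep_stub(void *to, const void *from, size_t len)
{
	last_path = COPY_PATH_REP;
	memcpy(to, from, len);
	return 0;	/* number of bytes NOT copied */
}

static unsigned long copy_user_sketch(void *to, const void *from, size_t len)
{
	/*
	 * Performance optimization, not a correctness fix: the 'rep'
	 * prefix has a noticeable setup cost, so short copies are
	 * faster with the unrolled loop. The 64 byte threshold comes
	 * from micro-benchmark measurements: it is the largest length
	 * that still showed a measurable gain.
	 */
	if (len <= SHORT_COPY_THRESHOLD)
		return copy_unrolled_stub(to, from, len);

	return copy_rep_stub(to, from, len);
}
```

With this, a 64 byte copy takes the unrolled path and a 65 byte copy
falls back to the 'rep' path.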

The above figures are for an Intel Xeon E5-2690 v4.

I see similar data points with an i7-6500U CPU, while an i7-4810MQ
shows slightly larger improvements.

I'm in the process of collecting more figures for AMD processors, which
I don't have readily at hand - it may take some time.

Thanks,

Paolo