Date: Thu, 22 Jun 2017 10:47:32 +0200
From: Ingo Molnar
To: Paolo Abeni
Cc: x86@kernel.org, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    Al Viro, Kees Cook, Hannes Frederic Sowa, linux-kernel@vger.kernel.org,
    Linus Torvalds, Andy Lutomirski
Subject: Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
Message-ID: <20170622084732.xjnd5gx77ftaozem@gmail.com>
In-Reply-To: <63d913f28bc64bd4ea66a39a532f0b59ee015382.1498039056.git.pabeni@redhat.com>

* Paolo Abeni wrote:

> The 'rep' prefix suffers from a significant setup cost; as a result,
> for short strings, copies with unrolled loops are faster than even
> the optimized string copy using the 'rep' variant.
>
> This change updates __copy_user_generic() to use the unrolled
> version for small string lengths. The threshold length for short
> strings - 64 - has been selected empirically as the largest value
> that still ensures a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> string len    vanilla    patched    delta
>      bytes      ticks      ticks    tick(%)
>
>          0         58         26    32(55%)
>          1         49         29    20(40%)
>          2         49         31    18(36%)
>          3         49         32    17(34%)
>          4         50         34    16(32%)
>          5         49         35    14(28%)
>          6         49         36    13(26%)
>          7         49         38    11(22%)
>          8         50         31    19(38%)
>          9         51         33    18(35%)
>         10         52         36    16(30%)
>         11         52         37    15(28%)
>         12         52         38    14(26%)
>         13         52         40    12(23%)
>         14         52         41    11(21%)
>         15         52         42    10(19%)
>         16         51         34    17(33%)
>         17         51         35    16(31%)
>         18         52         37    15(28%)
>         19         51         38    13(25%)
>         20         52         39    13(25%)
>         21         52         40    12(23%)
>         22         51         42     9(17%)
>         23         51         46     5(9%)
>         24         52         35    17(32%)
>         25         52         37    15(28%)
>         26         52         38    14(26%)
>         27         52         39    13(25%)
>         28         52         40    12(23%)
>         29         53         42    11(20%)
>         30         52         43     9(17%)
>         31         52         44     8(15%)
>         32         51         36    15(29%)
>         33         51         38    13(25%)
>         34         51         39    12(23%)
>         35         51         41    10(19%)
>         36         52         41    11(21%)
>         37         52         43     9(17%)
>         38         51         44     7(13%)
>         39         52         46     6(11%)
>         40         51         37    14(27%)
>         41         50         38    12(24%)
>         42         50         39    11(22%)
>         43         50         40    10(20%)
>         44         50         42     8(16%)
>         45         50         43     7(14%)
>         46         50         43     7(14%)
>         47         50         45     5(10%)
>         48         50         37    13(26%)
>         49         49         38    11(22%)
>         50         50         40    10(20%)
>         51         50         42     8(16%)
>         52         50         42     8(16%)
>         53         49         46     3(6%)
>         54         50         46     4(8%)
>         55         49         48     1(2%)
>         56         50         39    11(22%)
>         57         50         40    10(20%)
>         58         49         42     7(14%)
>         59         50         42     8(16%)
>         60         50         46     4(8%)
>         61         50         47     3(6%)
>         62         50         48     2(4%)
>         63         50         48     2(4%)
>         64         51         38    13(25%)
>
> Above 64 bytes the gain fades away.
>
> Very similar values are collected for __copy_to_user().
> UDP receive performance under flood with small packets using
> recvfrom() increases by ~5%.

What CPU model(s) were used for the performance testing, and was it
tested on several different types of CPUs?

Please add a comment here:

+	if (len <= 64)
+		return copy_user_generic_unrolled(to, from, len);
+

... because it's not obvious at all that this is a performance
optimization, not a correctness issue.
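Something like this perhaps - just a sketch, the exact wording is of
course up to you:

+	/*
+	 * Performance optimization, not a correctness issue: for short
+	 * copies the fixed setup cost of the 'rep' prefixed variants
+	 * dominates, so the unrolled copy is faster.
+	 */
+	if (len <= 64)
+		return copy_user_generic_unrolled(to, from, len);
+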
Also explain that '64' is a number that we got from performance
measurements.

But in general I like the change - as long as it was measured on
reasonably modern x86 CPUs. I.e. it should not regress on modern Intel
or AMD CPUs.

Thanks,

	Ingo