Hi Boris,
I've noticed that clear_user() is slower than it need be:
dd if=/dev/zero of=/dev/null bs=1M count=1M
1099511627776 bytes (1.1 TB) copied, 45.9641 s, 23.9 GB/s
whereas with the hacked patch below
1099511627776 bytes (1.1 TB) copied, 33.4 s, 32.9 GB/s
That was on some Intel machine: IIRC an AMD went faster.
It's because clear_user() lacks alternatives, and uses a
nowadays suboptimal implementation; whereas clear_page()
and copy_user() do support alternatives.
I came across this because reading a sparse file on tmpfs uses
copy_page_to_iter() from the ZERO_PAGE(), and I wanted to change that
to iov_iter_zero(), which sounds obviously faster - but in fact slower.
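Roughly, the choice I'm describing in the tmpfs read path looks like
this (just a sketch of the shape, not the exact shmem_file_read_iter()
code):

	/*
	 * Sketch only.  When the read hits a hole, copying zeroes out of
	 * ZERO_PAGE(0) goes through the alternatives-patched copy_user
	 * path, whereas iov_iter_zero() ends up in the current
	 * unoptimized __clear_user().
	 */
	if (page) {
		ret = copy_page_to_iter(page, offset, nr, to);	/* real data */
	} else {
		ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
		/* what I wanted to use instead, but it turned out slower: */
		/* ret = iov_iter_zero(nr, to); */
	}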
I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
tmpfs, are not prime candidates for optimization; and I've no idea how
much clear_user() normally gets used for long clears.
If I were capable of compiler asm, with alternatives, and knew at what
length ERMS becomes advantageous when clearing, I would be sending you
a proper patch. As it is, I'm hoping to tempt someone else to do the
work! Or reject it as too niche to bother with.
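For anyone wanting to poke at the crossover length in userspace first,
something like this throwaway comparison would be a start (illustrative
only: names and sizes are arbitrary, the compiler may well recognize the
plain loop as memset(), and none of it says anything about
kernel-context behaviour):

	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	/* The "rep stosb" clear that the hacked patch below relies on. */
	static void clear_stosb(void *dst, size_t len)
	{
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
	}

	/* A plain 8-bytes-per-iteration clear, roughly the shape of the
	 * existing __clear_user() loop (tail handling omitted). */
	static void clear_loop(void *dst, size_t len)
	{
		uint64_t *p = dst;

		for (size_t i = 0; i < len / 8; i++)
			p[i] = 0;
	}

	static double bench(void (*fn)(void *, size_t), void *buf,
			    size_t len, int iters)
	{
		struct timespec a, b;

		clock_gettime(CLOCK_MONOTONIC, &a);
		for (int i = 0; i < iters; i++)
			fn(buf, len);
		clock_gettime(CLOCK_MONOTONIC, &b);
		return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	}

	int main(void)
	{
		static char buf[1 << 20];	/* 1M, matching the dd bs above */

		for (size_t len = 64; len <= sizeof(buf); len <<= 2) {
			int iters = (16 << 20) / len;

			printf("%8zu bytes: stosb %.3fs  loop %.3fs\n", len,
			       bench(clear_stosb, buf, len, iters),
			       bench(clear_loop, buf, len, iters));
		}
		return 0;
	}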
Thanks,
Hugh
Hacked patch against 5.16 (5.17-rc is a little different there):
arch/x86/lib/copy_user_64.S | 26 ++++++++++++++++++++++++++
arch/x86/lib/usercopy_64.c | 36 +-----------------------------------
2 files changed, 27 insertions(+), 35 deletions(-)
--- 5.16/arch/x86/lib/copy_user_64.S 2022-01-09 14:55:34.000000000 -0800
+++ erms/arch/x86/lib/copy_user_64.S 2022-01-22 16:57:21.156968363 -0800
@@ -392,3 +392,29 @@ SYM_FUNC_START(__copy_user_nocache)
_ASM_EXTABLE_CPY(41b, .L_fixup_1b_copy)
SYM_FUNC_END(__copy_user_nocache)
EXPORT_SYMBOL(__copy_user_nocache)
+
+/*
+ * Recent CPUs have added enhanced REP MOVSB/STOSB instructions.
+ * It's recommended to use enhanced REP MOVSB/STOSB if it's enabled.
+ * Assume that's best for __clear_user(), until alternatives are provided
+ * (though would be better to avoid REP STOSB for short clears, if no FSRM).
+ *
+ * Input:
+ * rdi destination
+ * rsi count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+SYM_FUNC_START(__clear_user)
+ ASM_STAC
+ movl %esi,%ecx
+ xorq %rax,%rax
+1: rep stosb
+2: movl %ecx,%eax
+ ASM_CLAC
+ ret
+
+ _ASM_EXTABLE_UA(1b, 2b)
+SYM_FUNC_END(__clear_user)
+EXPORT_SYMBOL(__clear_user)
--- 5.16/arch/x86/lib/usercopy_64.c 2020-12-13 14:41:30.000000000 -0800
+++ erms/arch/x86/lib/usercopy_64.c 2022-01-20 18:40:04.125752474 -0800
@@ -13,43 +13,9 @@
/*
* Zero Userspace
*/
-
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
- long __d0;
- might_fault();
- /* no memory constraint because it doesn't change any memory gcc knows
- about */
- stac();
- asm volatile(
- " testq %[size8],%[size8]\n"
- " jz 4f\n"
- " .align 16\n"
- "0: movq $0,(%[dst])\n"
- " addq $8,%[dst]\n"
- " decl %%ecx ; jnz 0b\n"
- "4: movq %[size1],%%rcx\n"
- " testl %%ecx,%%ecx\n"
- " jz 2f\n"
- "1: movb $0,(%[dst])\n"
- " incq %[dst]\n"
- " decl %%ecx ; jnz 1b\n"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: lea 0(%[size1],%[size8],8),%[size8]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE_UA(0b, 3b)
- _ASM_EXTABLE_UA(1b, 2b)
- : [size8] "=&c"(size), [dst] "=&D" (__d0)
- : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
- clac();
- return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
unsigned long clear_user(void __user *to, unsigned long n)
{
+ might_fault();
if (access_ok(to, n))
return __clear_user(to, n);
return n;
Hi Hugh,
On Mon, Feb 07, 2022 at 09:45:36PM -0800, Hugh Dickins wrote:
> I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
> tmpfs, are not prime candidates for optimization; and I've no idea how
> much clear_user() normally gets used for long clears.
Right, we usually don't take such "optimizations" because the folks who
send them always either come up with microbenchmarks or test only on a
single machine.
> If I were capable of compiler asm, with alternatives, and knew at what
> length ERMS becomes advantageous when clearing, I would be sending you
> a proper patch. As it is, I'm hoping to tempt someone else to do the
> work! Or reject it as too niche to bother with.
Yap, looking at arch/x86/lib/clear_page_64.S - that's straight-forward
asm without special-cases noodles like __copy_user_nocache, for example,
so I wouldn't be opposed to someone
- remodelling it so that you can have clear_user* variants there too,
with the length supplied so that you can call a common function with
arbitrary length and clear_page* can call it too. And then call them in
a clear_user() version just like the clear_page() one which selects the
proper target based on CPU feature flags.
- testing this on a bunch of modern machines with, say, a kernel build or
some sensible benchmark so that we at least have some coverage
If the numbers are worth it - and judging by your quick testing above
they should be - then I don't mind taking that at all.
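To be concrete about the first point above - purely a sketch with
made-up *_sketch names, and the real thing would use the alternatives
machinery rather than a runtime branch, just like clear_page() does:

	/*
	 * Not real code, only the shape of the dispatch: the helpers
	 * don't exist, and the selection would really be patched in at
	 * boot via alternatives.
	 */
	unsigned long __clear_user_sketch(void __user *to, unsigned long len)
	{
		if (cpu_feature_enabled(X86_FEATURE_ERMS))
			return clear_user_erms_sketch(to, len);	/* rep stosb */
		if (cpu_feature_enabled(X86_FEATURE_REP_GOOD))
			return clear_user_rep_sketch(to, len);	/* rep stosq + tail */
		return clear_user_orig_sketch(to, len);		/* open-coded loop */
	}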
If only someone would have the time and diligence to do it properly...
:-)
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov
> Sent: 08 February 2022 12:53
>
> On Mon, Feb 07, 2022 at 09:45:36PM -0800, Hugh Dickins wrote:
> > I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
> > tmpfs, are not prime candidates for optimization; and I've no idea how
> > much clear_user() normally gets used for long clears.
>
> Right, we usually don't take such "optimizations" because the folks who
> send them always either come up with microbenchmarks or test only on a
> single machine.
I think it is reasonable to use microbenchmarks to see how fast
a particular loop runs on a specific cpu - since you can work out
the number of clocks per iteration and directly compare different
versions.
The problem with microbenchmarks is when they get used on unrolled loops.
Run something 10000 times with hot-cache data and unrolled loops
almost always seem best.
Cold cache or the effects of displacing other code from the I-cache
can make loop unrolling a big loss.
Especially on current CPUs that execute multiple instructions per clock,
out of order.
Loops only need unrolling just enough to hit some architectural limit.
For memset() (using normal instructions) that is 1 write per clock.
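For example (illustrative C only, not tuned for any particular core),
unrolling a clear loop by just two 8-byte stores per iteration is
already enough to get the loop overhead out of the way of the stores:

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Two stores per iteration so the index/compare/branch overhead
	 * runs alongside the stores rather than limiting them; unrolling
	 * much further mostly just eats I-cache.
	 */
	static void clear_qwords(uint64_t *p, size_t qwords)
	{
		size_t i;

		for (i = 0; i + 2 <= qwords; i += 2) {
			p[i] = 0;
			p[i + 1] = 0;
		}
		if (i < qwords)
			p[i] = 0;
	}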
A classic example has to be the sparc divide remainder function
in the original CY7C600 book (which I happen to still have).
I'm not sure how many instructions it is - but it is printed
on 4 pages. It just has to blow anything else out of the
small I-cache those CPUs had.
It might be faster - but only if you are doing a lot of divides.
David
From: Hugh Dickins
> Sent: 08 February 2022 05:46
>
> I've noticed that clear_user() is slower than it need be:
>
> dd if=/dev/zero of=/dev/null bs=1M count=1M
> 1099511627776 bytes (1.1 TB) copied, 45.9641 s, 23.9 GB/s
> whereas with the hacked patch below
> 1099511627776 bytes (1.1 TB) copied, 33.4 s, 32.9 GB/s
>
> That was on some Intel machine: IIRC an AMD went faster.
>
> It's because clear_user() lacks alternatives, and uses a
> nowadays suboptimal implementation; whereas clear_page()
> and copy_user() do support alternatives.
>
...
> +SYM_FUNC_START(__clear_user)
> + ASM_STAC
> + movl %esi,%ecx
> + xorq %rax,%rax
> +1: rep stosb
> +2: movl %ecx,%eax
> + ASM_CLAC
> + ret
You only want to even consider that version for long copies
(and possibly only for aligned ones).
The existing code (I've not quoted) does look sub-optimal though.
It should be easy to obtain a write every clock.
But I suspect the loop is too long.
The code gcc generates might even be better!
Note that for copies longer than 8 bytes 'odd' lengths can
be handled by a single misaligned write to the end of the buffer.
No need for a byte copy loop.
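In C terms the trick looks like this (sketch only; memcpy() of a zero
word is just the portable way of spelling the possibly-misaligned
store):

	#include <stdint.h>
	#include <string.h>

	/*
	 * For len >= 8: 8-byte stores for the bulk, then one (possibly
	 * misaligned, possibly overlapping) 8-byte store ending exactly
	 * at buf + len - no byte loop for the odd tail.
	 */
	static void clear_with_tail(unsigned char *buf, size_t len)
	{
		static const uint64_t zero;
		size_t i;

		if (len < 8) {		/* only truly short clears need a byte path */
			memset(buf, 0, len);
			return;
		}
		for (i = 0; i + 8 <= len; i += 8)
			memcpy(buf + i, &zero, 8);
		memcpy(buf + len - 8, &zero, 8);
	}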
I've not experimented with misaligned writes - they might take two clocks.
So it might be worth aligning them - but they may not happen often
enough for it to be an overall gain.
Misaligned reads usually don't make any difference.
David
On Tue, 8 Feb 2022, Borislav Petkov wrote:
> Hi Hugh,
>
> On Mon, Feb 07, 2022 at 09:45:36PM -0800, Hugh Dickins wrote:
> > I realize that dd'ing from /dev/zero to /dev/null, and sparse files on
> > tmpfs, are not prime candidates for optimization; and I've no idea how
> > much clear_user() normally gets used for long clears.
>
> Right, we usually don't take such "optimizations" because the folks who
> send them always either come up with microbenchmarks or test only on a
> single machine.
>
> > If I were capable of compiler asm, with alternatives, and knew at what
> > length ERMS becomes advantageous when clearing, I would be sending you
> > a proper patch. As it is, I'm hoping to tempt someone else to do the
> > work! Or reject it as too niche to bother with.
>
> Yap, looking at arch/x86/lib/clear_page_64.S - that's straight-forward
> asm without special-cases noodles like __copy_user_nocache, for example,
> so I wouldn't be opposed to someone
>
> - remodelling it so that you can have clear_user* variants there too,
> with the length supplied so that you can call a common function with
> arbitrary length and clear_page* can call it too. And then call them in
> a clear_user() version just like the clear_page() one which selects the
> proper target based on CPU feature flags.
>
> - testing this on a bunch of modern machines with, say, a kernel build or
> some sensible benchmark so that we at least have some coverage
>
> If the numbers are worth it - and judging by your quick testing above
> they should be - then I don't mind taking that at all.
>
> If only someone would have the time and diligence to do it properly...
Thanks a lot for setting out what's needed. Remodelling so there can be
commonality, that's attractive and intriguing. But I'm disqualified in
several ways (even if we tactfully leave out questions of my competence,
I'm still lacking time, a wide machine pool, and expertise with the latter):
so I'm responding mainly to make it clear that I shall not be pursuing
this, but hope somebody can eventually.
Thanks,
Hugh
>
> :-)
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Hey folks,
Just a reminder that Samuel already did a lot of work and benchmarking
and made this a lot faster:
https://lore.kernel.org/lkml/[email protected]/
As far as I can tell, there was a lot of nitpicking, some of which might
have been pointless, and the author understandably didn't follow up to
finish it off.
But the code and analysis are there, and maybe it'd be worthwhile for
somebody here to pick it up again?
Jason