2020-06-30 21:02:59

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines

Hi,

This contains an update to the cortex-strings patchset: the
correctness of the fixup routines is improved, with the aim being to
return the exact number of remaining bytes for all copy sizes.
Ensuring they are exact - the current fixups are off by a few bytes
for some copy sizes - is an extension of the original intention of
fixing an issue reported by an LTP run last year, where the fixup
routine in v2 of this patchset (which was importing the cortex-strings
memcpy implementation) would over-report the number of bytes that were
successfully copied.
Version 3 addressed this, but further testing revealed some remaining
issues with the fixup correctness; those fixups have been partially
re-written here, and some other behaviours of the copy algorithm are
now accounted for.

Comments welcome,

Thanks
Oliver

v1: https://lore.kernel.org/linux-arm-kernel/[email protected]/
v2: https://lore.kernel.org/linux-arm-kernel/[email protected]/
v3: https://lore.kernel.org/linux-arm-kernel/[email protected]/

Changes since v3:
* Improves the accuracy of the fixups in response to issues that
arose during further testing
* Accounts for faults on store instructions on systems with UAO
enabled
* Expands on comments detailing the implementation

Changes since v2:
* Adds Robin's separate patch that fixes a compilation issue with
KProbes fixup [1]
* Imports the most recent memcpy implementation by updating Sam's
patch (and moves this patch to occur after the cortex-strings
importing so that it's closer to the patches containing its
corresponding fixups)
* Uses the stack to preserve the initial parameters
* Replaces the usercopy fixup routine in v2 with multiple longer
fixups that each make use of the fault address to return the exact
number of bytes that haven't yet been copied.

[1] https://lore.kernel.org/linux-arm-kernel/e70f7b9de7e601b9e4a6fedad8eaf64d304b1637.1571326276.git.robin.murphy@arm.com/

Oliver Swede (5):
arm64: Store the arguments to copy_*_user on the stack
arm64: Use additional memcpy macros and fixups
arm64: Add fixup routines for usercopy load exceptions
arm64: Add fixup routines for usercopy store exceptions
arm64: Improve accuracy of fixup for UAO cases

Robin Murphy (2):
arm64: kprobes: Drop open-coded exception fixup
arm64: Tidy up _asm_extable_faultaddr usage

Sam Tebbs (7):
arm64: Allow passing fault address to fixup handlers
arm64: Import latest version of Cortex Strings' memcmp
arm64: Import latest version of Cortex Strings' memmove
arm64: Import latest version of Cortex Strings' strcmp
arm64: Import latest version of Cortex Strings' strlen
arm64: Import latest version of Cortex Strings' strncmp
arm64: Import latest optimization of memcpy

arch/arm64/include/asm/alternative.h | 36 ---
arch/arm64/include/asm/assembler.h | 13 +
arch/arm64/include/asm/extable.h | 10 +-
arch/arm64/kernel/probes/kprobes.c | 7 -
arch/arm64/lib/copy_from_user.S | 272 +++++++++++++++--
arch/arm64/lib/copy_in_user.S | 287 ++++++++++++++++--
arch/arm64/lib/copy_template.S | 377 +++++++++++++----------
arch/arm64/lib/copy_template_user.S | 50 ++++
arch/arm64/lib/copy_to_user.S | 273 +++++++++++++++--
arch/arm64/lib/copy_user_fixup.S | 433 +++++++++++++++++++++++++++
arch/arm64/lib/memcmp.S | 333 ++++++++------------
arch/arm64/lib/memcpy.S | 127 ++++++--
arch/arm64/lib/memmove.S | 232 +++++---------
arch/arm64/lib/strcmp.S | 272 +++++++----------
arch/arm64/lib/strlen.S | 247 ++++++++++-----
arch/arm64/lib/strncmp.S | 363 ++++++++++------------
arch/arm64/mm/extable.c | 13 +-
arch/arm64/mm/fault.c | 2 +-
18 files changed, 2228 insertions(+), 1119 deletions(-)
create mode 100644 arch/arm64/lib/copy_template_user.S
create mode 100644 arch/arm64/lib/copy_user_fixup.S

--
2.17.1


2020-06-30 21:03:27

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 01/14] arm64: Allow passing fault address to fixup handlers

From: Sam Tebbs <[email protected]>

Extend fixup_exception() to optionally place the faulting address in a
register when returning to a fixup handler. Since A64 instructions must
be 4-byte-aligned, we can mimic the IA-64 implementation and encode a
flag in the lower bits of the offset field to indicate handlers which
expect an address. This will allow us to use more efficient offset
addressing modes in usercopy routines, rather than updating the base
register on every access just for the sake of inferring where a fault
occurred in order to compute the return value upon failure.

The choice of x15 is somewhat arbitrary, but with the consideration that
as the highest-numbered temporary register with no possible 'special'
role in the ABI, it is most likely not used by hand-written assembly
code, and thus a minimally-invasive option for imported routines.

Signed-off-by: Sam Tebbs <[email protected]>
[ rm: split into separate patch, use UL(), expand commit message ]
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Oliver Swede <[email protected]>
---
arch/arm64/include/asm/assembler.h | 9 +++++++++
arch/arm64/include/asm/extable.h | 10 +++++++++-
arch/arm64/mm/extable.c | 13 +++++++++----
arch/arm64/mm/fault.c | 2 +-
4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 54d181177656..438382a277c8 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -18,6 +18,7 @@
#include <asm/cpufeature.h>
#include <asm/cputype.h>
#include <asm/debug-monitors.h>
+#include <asm/extable.h>
#include <asm/page.h>
#include <asm/pgtable-hwdef.h>
#include <asm/ptrace.h>
@@ -129,6 +130,14 @@ alternative_endif
.popsection
.endm

+/*
+ * Emit an entry into the exception table.
+ * The fixup handler will receive the faulting address in x15
+ */
+ .macro _asm_extable_faultaddr, from, to
+ _asm_extable \from, \to + FIXUP_WITH_ADDR
+ .endm
+
#define USER(l, x...) \
9999: x; \
_asm_extable 9999b, l
diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index 56a4f68b262e..4c4955f2bb44 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -2,6 +2,12 @@
#ifndef __ASM_EXTABLE_H
#define __ASM_EXTABLE_H

+#include <linux/const.h>
+
+#define FIXUP_WITH_ADDR UL(1)
+
+#ifndef __ASSEMBLY__
+
/*
* The exception table consists of pairs of relative offsets: the first
* is the relative offset to an instruction that is allowed to fault,
@@ -22,5 +28,7 @@ struct exception_table_entry

#define ARCH_HAS_RELATIVE_EXTABLE

-extern int fixup_exception(struct pt_regs *regs);
+extern int fixup_exception(struct pt_regs *regs, unsigned long addr);
+
+#endif
#endif
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 81e694af5f8c..e6578c2814b5 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -6,13 +6,18 @@
#include <linux/extable.h>
#include <linux/uaccess.h>

-int fixup_exception(struct pt_regs *regs)
+int fixup_exception(struct pt_regs *regs, unsigned long addr)
{
const struct exception_table_entry *fixup;

fixup = search_exception_tables(instruction_pointer(regs));
- if (fixup)
- regs->pc = (unsigned long)&fixup->fixup + fixup->fixup;
-
+ if (fixup) {
+ unsigned long offset = fixup->fixup;
+ if (offset & FIXUP_WITH_ADDR) {
+ regs->regs[15] = addr;
+ offset &= ~FIXUP_WITH_ADDR;
+ }
+ regs->pc = (unsigned long)&fixup->fixup + offset;
+ }
return fixup != NULL;
}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8afb238ff335..f80e299dc91b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -303,7 +303,7 @@ static void __do_kernel_fault(unsigned long addr, unsigned int esr,
* Are we prepared to handle this kernel fault?
* We are almost certainly not prepared to handle instruction faults.
*/
- if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
+ if (!is_el1_instruction_abort(esr) && fixup_exception(regs, addr))
return;

if (WARN_RATELIMIT(is_spurious_el1_translation_fault(addr, esr, regs),
--
2.17.1

2020-06-30 21:03:42

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 05/14] arm64: Import latest version of Cortex Strings' strcmp

From: Sam Tebbs <[email protected]>

Import the latest version of Cortex Strings' strcmp function.

The upstream source is src/aarch64/strcmp.S as of commit 90b61261ceb4
in https://git.linaro.org/toolchain/cortex-strings.git.

Signed-off-by: Sam Tebbs <[email protected]>
[ rm: update attribution, expand commit message ]
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Oliver Swede <[email protected]>
---
arch/arm64/lib/strcmp.S | 272 +++++++++++++++++-----------------------
1 file changed, 113 insertions(+), 159 deletions(-)

diff --git a/arch/arm64/lib/strcmp.S b/arch/arm64/lib/strcmp.S
index 4e79566726c8..e00ff46c4ffc 100644
--- a/arch/arm64/lib/strcmp.S
+++ b/arch/arm64/lib/strcmp.S
@@ -1,13 +1,11 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
- * Copyright (C) 2013 ARM Ltd.
- * Copyright (C) 2013 Linaro.
+ * Copyright (c) 2012,2018 Linaro Limited. All rights reserved.
*
- * This code is based on glibc cortex strings work originally authored by Linaro
- * be found @
+ * This code is based on glibc Cortex Strings work originally authored by
+ * Linaro, found at:
*
- * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
- * files/head:/src/aarch64/
+ * https://git.linaro.org/toolchain/cortex-strings.git
*/

#include <linux/linkage.h>
@@ -25,60 +23,106 @@
* or be greater than s2.
*/

+#define L(label) .L ## label
+
#define REP8_01 0x0101010101010101
#define REP8_7f 0x7f7f7f7f7f7f7f7f
#define REP8_80 0x8080808080808080

/* Parameters and result. */
-src1 .req x0
-src2 .req x1
-result .req x0
+#define src1 x0
+#define src2 x1
+#define result x0

/* Internal variables. */
-data1 .req x2
-data1w .req w2
-data2 .req x3
-data2w .req w3
-has_nul .req x4
-diff .req x5
-syndrome .req x6
-tmp1 .req x7
-tmp2 .req x8
-tmp3 .req x9
-zeroones .req x10
-pos .req x11
-
+#define data1 x2
+#define data1w w2
+#define data2 x3
+#define data2w w3
+#define has_nul x4
+#define diff x5
+#define syndrome x6
+#define tmp1 x7
+#define tmp2 x8
+#define tmp3 x9
+#define zeroones x10
+#define pos x11
+
+ /* Start of performance-critical section -- one 64B cache line. */
SYM_FUNC_START_WEAK_PI(strcmp)
eor tmp1, src1, src2
mov zeroones, #REP8_01
tst tmp1, #7
- b.ne .Lmisaligned8
+ b.ne L(misaligned8)
ands tmp1, src1, #7
- b.ne .Lmutual_align
-
- /*
- * NUL detection works on the principle that (X - 1) & (~X) & 0x80
- * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
- * can be done in parallel across the entire word.
- */
-.Lloop_aligned:
+ b.ne L(mutual_align)
+ /* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ can be done in parallel across the entire word. */
+L(loop_aligned):
ldr data1, [src1], #8
ldr data2, [src2], #8
-.Lstart_realigned:
+L(start_realigned):
sub tmp1, data1, zeroones
orr tmp2, data1, #REP8_7f
eor diff, data1, data2 /* Non-zero if differences found. */
bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
orr syndrome, diff, has_nul
- cbz syndrome, .Lloop_aligned
- b .Lcal_cmpresult
-
-.Lmutual_align:
- /*
- * Sources are mutually aligned, but are not currently at an
- * alignment boundary. Round down the addresses and then mask off
- * the bytes that preceed the start point.
- */
+ cbz syndrome, L(loop_aligned)
+ /* End of performance-critical section -- one 64B cache line. */
+
+L(end):
+CPU_LE(rev syndrome, syndrome)
+CPU_LE(rev data1, data1)
+ /* The MS-non-zero bit of the syndrome marks either the first bit
+ that is different, or the top bit of the first zero byte.
+ Shifting left now will bring the critical information into the
+ top bits. */
+CPU_LE(clz pos, syndrome)
+CPU_LE(rev data2, data2)
+CPU_LE(lsl data1, data1, pos)
+CPU_LE(lsl data2, data2, pos)
+ /* But we need to zero-extend (char is unsigned) the value and then
+ perform a signed 32-bit subtraction. */
+CPU_LE(lsr data1, data1, #56)
+CPU_LE(sub result, data1, data2, lsr #56)
+CPU_LE(ret)
+ /* For big-endian we cannot use the trick with the syndrome value
+ as carry-propagation can corrupt the upper bits if the trailing
+ bytes in the string contain 0x01. */
+ /* However, if there is no NUL byte in the dword, we can generate
+ the result directly. We can't just subtract the bytes as the
+ MSB might be significant. */
+CPU_BE(cbnz has_nul, 1f)
+CPU_BE(cmp data1, data2)
+CPU_BE(cset result, ne)
+CPU_BE(cneg result, result, lo)
+CPU_BE(ret)
+1:
+ /* Re-compute the NUL-byte detection, using a byte-reversed value. */
+CPU_BE(rev tmp3, data1)
+CPU_BE(sub tmp1, tmp3, zeroones)
+CPU_BE(orr tmp2, tmp3, #REP8_7f)
+CPU_BE(bic has_nul, tmp1, tmp2)
+CPU_BE(rev has_nul, has_nul)
+CPU_BE(orr syndrome, diff, has_nul)
+CPU_BE(clz pos, syndrome)
+ /* The MS-non-zero bit of the syndrome marks either the first bit
+ that is different, or the top bit of the first zero byte.
+ Shifting left now will bring the critical information into the
+ top bits. */
+CPU_BE(lsl data1, data1, pos)
+CPU_BE(lsl data2, data2, pos)
+ /* But we need to zero-extend (char is unsigned) the value and then
+ perform a signed 32-bit subtraction. */
+CPU_BE(lsr data1, data1, #56)
+CPU_BE(sub result, data1, data2, lsr #56)
+CPU_BE(ret)
+
+L(mutual_align):
+ /* Sources are mutually aligned, but are not currently at an
+ alignment boundary. Round down the addresses and then mask off
+ the bytes that preceed the start point. */
bic src1, src1, #7
bic src2, src2, #7
lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */
@@ -87,137 +131,47 @@ SYM_FUNC_START_WEAK_PI(strcmp)
ldr data2, [src2], #8
mov tmp2, #~0
/* Big-endian. Early bytes are at MSB. */
-CPU_BE( lsl tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */
+CPU_BE(lsl tmp2, tmp2, tmp1) /* Shift (tmp1 & 63). */
/* Little-endian. Early bytes are at LSB. */
-CPU_LE( lsr tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */
-
+CPU_LE(lsr tmp2, tmp2, tmp1) /* Shift (tmp1 & 63). */
orr data1, data1, tmp2
orr data2, data2, tmp2
- b .Lstart_realigned
-
-.Lmisaligned8:
- /*
- * Get the align offset length to compare per byte first.
- * After this process, one string's address will be aligned.
- */
- and tmp1, src1, #7
- neg tmp1, tmp1
- add tmp1, tmp1, #8
- and tmp2, src2, #7
- neg tmp2, tmp2
- add tmp2, tmp2, #8
- subs tmp3, tmp1, tmp2
- csel pos, tmp1, tmp2, hi /*Choose the maximum. */
-.Ltinycmp:
+ b L(start_realigned)
+
+L(misaligned8):
+ /* Align SRC1 to 8 bytes and then compare 8 bytes at a time, always
+ checking to make sure that we don't access beyond page boundary in
+ SRC2. */
+ tst src1, #7
+ b.eq L(loop_misaligned)
+L(do_misaligned):
ldrb data1w, [src1], #1
ldrb data2w, [src2], #1
- subs pos, pos, #1
- ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
- ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
- b.eq .Ltinycmp
- cbnz pos, 1f /*find the null or unequal...*/
cmp data1w, #1
- ccmp data1w, data2w, #0, cs
- b.eq .Lstart_align /*the last bytes are equal....*/
-1:
- sub result, data1, data2
- ret
-
-.Lstart_align:
- ands xzr, src1, #7
- b.eq .Lrecal_offset
- /*process more leading bytes to make str1 aligned...*/
- add src1, src1, tmp3
- add src2, src2, tmp3
- /*load 8 bytes from aligned str1 and non-aligned str2..*/
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.ne L(done)
+ tst src1, #7
+ b.ne L(do_misaligned)
+
+L(loop_misaligned):
+ /* Test if we are within the last dword of the end of a 4K page. If
+ yes then jump back to the misaligned loop to copy a byte at a time. */
+ and tmp1, src2, #0xff8
+ eor tmp1, tmp1, #0xff8
+ cbz tmp1, L(do_misaligned)
ldr data1, [src1], #8
ldr data2, [src2], #8

sub tmp1, data1, zeroones
orr tmp2, data1, #REP8_7f
- bic has_nul, tmp1, tmp2
- eor diff, data1, data2 /* Non-zero if differences found. */
- orr syndrome, diff, has_nul
- cbnz syndrome, .Lcal_cmpresult
- /*How far is the current str2 from the alignment boundary...*/
- and tmp3, tmp3, #7
-.Lrecal_offset:
- neg pos, tmp3
-.Lloopcmp_proc:
- /*
- * Divide the eight bytes into two parts. First,backwards the src2
- * to an alignment boundary,load eight bytes from the SRC2 alignment
- * boundary,then compare with the relative bytes from SRC1.
- * If all 8 bytes are equal,then start the second part's comparison.
- * Otherwise finish the comparison.
- * This special handle can garantee all the accesses are in the
- * thread/task space in avoid to overrange access.
- */
- ldr data1, [src1,pos]
- ldr data2, [src2,pos]
- sub tmp1, data1, zeroones
- orr tmp2, data1, #REP8_7f
- bic has_nul, tmp1, tmp2
- eor diff, data1, data2 /* Non-zero if differences found. */
- orr syndrome, diff, has_nul
- cbnz syndrome, .Lcal_cmpresult
-
- /*The second part process*/
- ldr data1, [src1], #8
- ldr data2, [src2], #8
- sub tmp1, data1, zeroones
- orr tmp2, data1, #REP8_7f
- bic has_nul, tmp1, tmp2
- eor diff, data1, data2 /* Non-zero if differences found. */
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
orr syndrome, diff, has_nul
- cbz syndrome, .Lloopcmp_proc
+ cbz syndrome, L(loop_misaligned)
+ b L(end)

-.Lcal_cmpresult:
- /*
- * reversed the byte-order as big-endian,then CLZ can find the most
- * significant zero bits.
- */
-CPU_LE( rev syndrome, syndrome )
-CPU_LE( rev data1, data1 )
-CPU_LE( rev data2, data2 )
-
- /*
- * For big-endian we cannot use the trick with the syndrome value
- * as carry-propagation can corrupt the upper bits if the trailing
- * bytes in the string contain 0x01.
- * However, if there is no NUL byte in the dword, we can generate
- * the result directly. We cannot just subtract the bytes as the
- * MSB might be significant.
- */
-CPU_BE( cbnz has_nul, 1f )
-CPU_BE( cmp data1, data2 )
-CPU_BE( cset result, ne )
-CPU_BE( cneg result, result, lo )
-CPU_BE( ret )
-CPU_BE( 1: )
- /*Re-compute the NUL-byte detection, using a byte-reversed value. */
-CPU_BE( rev tmp3, data1 )
-CPU_BE( sub tmp1, tmp3, zeroones )
-CPU_BE( orr tmp2, tmp3, #REP8_7f )
-CPU_BE( bic has_nul, tmp1, tmp2 )
-CPU_BE( rev has_nul, has_nul )
-CPU_BE( orr syndrome, diff, has_nul )
-
- clz pos, syndrome
- /*
- * The MS-non-zero bit of the syndrome marks either the first bit
- * that is different, or the top bit of the first zero byte.
- * Shifting left now will bring the critical information into the
- * top bits.
- */
- lsl data1, data1, pos
- lsl data2, data2, pos
- /*
- * But we need to zero-extend (char is unsigned) the value and then
- * perform a signed 32-bit subtraction.
- */
- lsr data1, data1, #56
- sub result, data1, data2, lsr #56
+L(done):
+ sub result, data1, data2
ret
SYM_FUNC_END_PI(strcmp)
EXPORT_SYMBOL_NOKASAN(strcmp)
--
2.17.1

2020-06-30 21:03:43

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 07/14] arm64: Import latest version of Cortex Strings' strncmp

From: Sam Tebbs <[email protected]>

Import the latest version of Cortex Strings' strncmp function.

The upstream source is src/aarch64/strncmp.S as of commit 071fe283b28d
in https://git.linaro.org/toolchain/cortex-strings.git.

Signed-off-by: Sam Tebbs <[email protected]>
[ rm: update attribution, expand commit message ]
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Oliver Swede <[email protected]>
---
arch/arm64/lib/strncmp.S | 363 ++++++++++++++++++---------------------
1 file changed, 163 insertions(+), 200 deletions(-)

diff --git a/arch/arm64/lib/strncmp.S b/arch/arm64/lib/strncmp.S
index 2a7ee949ed47..b954e0fd93be 100644
--- a/arch/arm64/lib/strncmp.S
+++ b/arch/arm64/lib/strncmp.S
@@ -1,13 +1,11 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
- * Copyright (C) 2013 ARM Ltd.
- * Copyright (C) 2013 Linaro.
+ * Copyright (c) 2013,2018 Linaro Limited. All rights reserved.
*
- * This code is based on glibc cortex strings work originally authored by Linaro
- * be found @
+ * This code is based on glibc Cortex Strings work originally authored by
+ * Linaro, found at:
*
- * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
- * files/head:/src/aarch64/
+ * https://git.linaro.org/toolchain/cortex-strings.git
*/

#include <linux/linkage.h>
@@ -30,49 +28,49 @@
#define REP8_80 0x8080808080808080

/* Parameters and result. */
-src1 .req x0
-src2 .req x1
-limit .req x2
-result .req x0
+#define src1 x0
+#define src2 x1
+#define limit x2
+#define result x0

/* Internal variables. */
-data1 .req x3
-data1w .req w3
-data2 .req x4
-data2w .req w4
-has_nul .req x5
-diff .req x6
-syndrome .req x7
-tmp1 .req x8
-tmp2 .req x9
-tmp3 .req x10
-zeroones .req x11
-pos .req x12
-limit_wd .req x13
-mask .req x14
-endloop .req x15
+#define data1 x3
+#define data1w w3
+#define data2 x4
+#define data2w w4
+#define has_nul x5
+#define diff x6
+#define syndrome x7
+#define tmp1 x8
+#define tmp2 x9
+#define tmp3 x10
+#define zeroones x11
+#define pos x12
+#define limit_wd x13
+#define mask x14
+#define endloop x15
+#define count mask

+ .p2align 6
+ .rep 7
+ nop /* Pad so that the loop below fits a cache line. */
+ .endr
SYM_FUNC_START_WEAK_PI(strncmp)
cbz limit, .Lret0
eor tmp1, src1, src2
mov zeroones, #REP8_01
tst tmp1, #7
+ and count, src1, #7
b.ne .Lmisaligned8
- ands tmp1, src1, #7
- b.ne .Lmutual_align
+ cbnz count, .Lmutual_align
/* Calculate the number of full and partial words -1. */
- /*
- * when limit is mulitply of 8, if not sub 1,
- * the judgement of last dword will wrong.
- */
- sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
- lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */

- /*
- * NUL detection works on the principle that (X - 1) & (~X) & 0x80
- * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
- * can be done in parallel across the entire word.
- */
+ /* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ can be done in parallel across the entire word. */
+ /* Start of performance-critical section -- one 64B cache line. */
.Lloop_aligned:
ldr data1, [src1], #8
ldr data2, [src2], #8
@@ -80,23 +78,24 @@ SYM_FUNC_START_WEAK_PI(strncmp)
subs limit_wd, limit_wd, #1
sub tmp1, data1, zeroones
orr tmp2, data1, #REP8_7f
- eor diff, data1, data2 /* Non-zero if differences found. */
- csinv endloop, diff, xzr, pl /* Last Dword or differences.*/
- bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, pl /* Last Dword or differences. */
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
ccmp endloop, #0, #0, eq
b.eq .Lloop_aligned
+ /* End of performance-critical section -- one 64B cache line. */

- /*Not reached the limit, must have found the end or a diff. */
+ /* Not reached the limit, must have found the end or a diff. */
tbz limit_wd, #63, .Lnot_limit

/* Limit % 8 == 0 => all bytes significant. */
ands limit, limit, #7
b.eq .Lnot_limit

- lsl limit, limit, #3 /* Bits -> bytes. */
+ lsl limit, limit, #3 /* Bits -> bytes. */
mov mask, #~0
-CPU_BE( lsr mask, mask, limit )
-CPU_LE( lsl mask, mask, limit )
+CPU_BE(lsr mask, mask, limit)
+CPU_LE(lsl mask, mask, limit)
bic data1, data1, mask
bic data2, data2, mask

@@ -105,192 +104,156 @@ CPU_LE( lsl mask, mask, limit )

.Lnot_limit:
orr syndrome, diff, has_nul
- b .Lcal_cmpresult
+
+ CPU_LE(rev syndrome, syndrome)
+ CPU_LE(rev data1, data1)
+ /* The MS-non-zero bit of the syndrome marks either the first bit
+ that is different, or the top bit of the first zero byte.
+ Shifting left now will bring the critical information into the
+ top bits. */
+ CPU_LE(clz pos, syndrome)
+ CPU_LE(rev data2, data2)
+ CPU_LE(lsl data1, data1, pos)
+ CPU_LE(lsl data2, data2, pos)
+ /* But we need to zero-extend (char is unsigned) the value and then
+ perform a signed 32-bit subtraction. */
+ CPU_LE(lsr data1, data1, #56)
+ CPU_LE(sub result, data1, data2, lsr #56)
+ CPU_LE(ret)
+ /* For big-endian we cannot use the trick with the syndrome value
+ as carry-propagation can corrupt the upper bits if the trailing
+ bytes in the string contain 0x01. */
+ /* However, if there is no NUL byte in the dword, we can generate
+ the result directly. We can't just subtract the bytes as the
+ MSB might be significant. */
+ CPU_BE(cbnz has_nul, 1f)
+ CPU_BE(cmp data1, data2)
+ CPU_BE(cset result, ne)
+ CPU_BE(cneg result, result, lo)
+ CPU_BE(ret)
+1:
+ /* Re-compute the NUL-byte detection, using a byte-reversed value. */
+ CPU_BE(rev tmp3, data1)
+ CPU_BE(sub tmp1, tmp3, zeroones)
+ CPU_BE(orr tmp2, tmp3, #REP8_7f)
+ CPU_BE(bic has_nul, tmp1, tmp2)
+ CPU_BE(rev has_nul, has_nul)
+ CPU_BE(orr syndrome, diff, has_nul)
+ CPU_BE(clz pos, syndrome)
+ /* The MS-non-zero bit of the syndrome marks either the first bit
+ that is different, or the top bit of the first zero byte.
+ Shifting left now will bring the critical information into the
+ top bits. */
+ CPU_BE(lsl data1, data1, pos)
+ CPU_BE(lsl data2, data2, pos)
+ /* But we need to zero-extend (char is unsigned) the value and then
+ perform a signed 32-bit subtraction. */
+ CPU_BE(lsr data1, data1, #56)
+ CPU_BE(sub result, data1, data2, lsr #56)
+ CPU_BE(ret)

.Lmutual_align:
- /*
- * Sources are mutually aligned, but are not currently at an
- * alignment boundary. Round down the addresses and then mask off
- * the bytes that precede the start point.
- * We also need to adjust the limit calculations, but without
- * overflowing if the limit is near ULONG_MAX.
- */
+ /* Sources are mutually aligned, but are not currently at an
+ alignment boundary. Round down the addresses and then mask off
+ the bytes that precede the start point.
+ We also need to adjust the limit calculations, but without
+ overflowing if the limit is near ULONG_MAX. */
bic src1, src1, #7
bic src2, src2, #7
ldr data1, [src1], #8
- neg tmp3, tmp1, lsl #3 /* 64 - bits(bytes beyond align). */
+ neg tmp3, count, lsl #3 /* 64 - bits(bytes beyond align). */
ldr data2, [src2], #8
mov tmp2, #~0
- sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
/* Big-endian. Early bytes are at MSB. */
-CPU_BE( lsl tmp2, tmp2, tmp3 ) /* Shift (tmp1 & 63). */
+ CPU_BE(lsl tmp2, tmp2, tmp3) /* Shift (count & 63). */
/* Little-endian. Early bytes are at LSB. */
-CPU_LE( lsr tmp2, tmp2, tmp3 ) /* Shift (tmp1 & 63). */
-
+ CPU_LE(lsr tmp2, tmp2, tmp3) /* Shift (count & 63). */
and tmp3, limit_wd, #7
lsr limit_wd, limit_wd, #3
- /* Adjust the limit. Only low 3 bits used, so overflow irrelevant.*/
- add limit, limit, tmp1
- add tmp3, tmp3, tmp1
+ /* Adjust the limit. Only low 3 bits used, so overflow irrelevant. */
+ add limit, limit, count
+ add tmp3, tmp3, count
orr data1, data1, tmp2
orr data2, data2, tmp2
add limit_wd, limit_wd, tmp3, lsr #3
b .Lstart_realigned

-/*when src1 offset is not equal to src2 offset...*/
+ .p2align 6
+ /* Don't bother with dwords for up to 16 bytes. */
.Lmisaligned8:
- cmp limit, #8
- b.lo .Ltiny8proc /*limit < 8... */
- /*
- * Get the align offset length to compare per byte first.
- * After this process, one string's address will be aligned.*/
- and tmp1, src1, #7
- neg tmp1, tmp1
- add tmp1, tmp1, #8
- and tmp2, src2, #7
- neg tmp2, tmp2
- add tmp2, tmp2, #8
- subs tmp3, tmp1, tmp2
- csel pos, tmp1, tmp2, hi /*Choose the maximum. */
- /*
- * Here, limit is not less than 8, so directly run .Ltinycmp
- * without checking the limit.*/
- sub limit, limit, pos
-.Ltinycmp:
+ cmp limit, #16
+ b.hs .Ltry_misaligned_words
+
+.Lbyte_loop:
+ /* Perhaps we can do better than this. */
ldrb data1w, [src1], #1
ldrb data2w, [src2], #1
- subs pos, pos, #1
- ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
- ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
- b.eq .Ltinycmp
- cbnz pos, 1f /*find the null or unequal...*/
- cmp data1w, #1
- ccmp data1w, data2w, #0, cs
- b.eq .Lstart_align /*the last bytes are equal....*/
-1:
+ subs limit, limit, #1
+ ccmp data1w, #1, #0, hi /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Lbyte_loop
+.Ldone:
sub result, data1, data2
ret
-
-.Lstart_align:
+ /* Align the SRC1 to a dword by doing a bytewise compare and then do
+ the dword loop. */
+.Ltry_misaligned_words:
lsr limit_wd, limit, #3
- cbz limit_wd, .Lremain8
- /*process more leading bytes to make str1 aligned...*/
- ands xzr, src1, #7
- b.eq .Lrecal_offset
- add src1, src1, tmp3 /*tmp3 is positive in this branch.*/
- add src2, src2, tmp3
- ldr data1, [src1], #8
- ldr data2, [src2], #8
+ cbz count, .Ldo_misaligned

- sub limit, limit, tmp3
+ neg count, count
+ and count, count, #7
+ sub limit, limit, count
lsr limit_wd, limit, #3
- subs limit_wd, limit_wd, #1

- sub tmp1, data1, zeroones
- orr tmp2, data1, #REP8_7f
- eor diff, data1, data2 /* Non-zero if differences found. */
- csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
- bics has_nul, tmp1, tmp2
- ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
- b.ne .Lunequal_proc
- /*How far is the current str2 from the alignment boundary...*/
- and tmp3, tmp3, #7
-.Lrecal_offset:
- neg pos, tmp3
-.Lloopcmp_proc:
- /*
- * Divide the eight bytes into two parts. First,backwards the src2
- * to an alignment boundary,load eight bytes from the SRC2 alignment
- * boundary,then compare with the relative bytes from SRC1.
- * If all 8 bytes are equal,then start the second part's comparison.
- * Otherwise finish the comparison.
- * This special handle can garantee all the accesses are in the
- * thread/task space in avoid to overrange access.
- */
- ldr data1, [src1,pos]
- ldr data2, [src2,pos]
- sub tmp1, data1, zeroones
- orr tmp2, data1, #REP8_7f
- bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
- eor diff, data1, data2 /* Non-zero if differences found. */
- csinv endloop, diff, xzr, eq
- cbnz endloop, .Lunequal_proc
+.Lpage_end_loop:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ cmp data1w, #1
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.ne .Ldone
+ subs count, count, #1
+ b.hi .Lpage_end_loop
+
+.Ldo_misaligned:
+ /* Prepare ourselves for the next page crossing. Unlike the aligned
+ loop, we fetch 1 less dword because we risk crossing bounds on
+ SRC2. */
+ mov count, #8
+ subs limit_wd, limit_wd, #1
+ b.lo .Ldone_loop
+.Lloop_misaligned:
+ and tmp2, src2, #0xff8
+ eor tmp2, tmp2, #0xff8
+ cbz tmp2, .Lpage_end_loop

- /*The second part process*/
ldr data1, [src1], #8
ldr data2, [src2], #8
- subs limit_wd, limit_wd, #1
sub tmp1, data1, zeroones
orr tmp2, data1, #REP8_7f
- eor diff, data1, data2 /* Non-zero if differences found. */
- csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
- bics has_nul, tmp1, tmp2
- ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
- b.eq .Lloopcmp_proc
-
-.Lunequal_proc:
- orr syndrome, diff, has_nul
- cbz syndrome, .Lremain8
-.Lcal_cmpresult:
- /*
- * reversed the byte-order as big-endian,then CLZ can find the most
- * significant zero bits.
- */
-CPU_LE( rev syndrome, syndrome )
-CPU_LE( rev data1, data1 )
-CPU_LE( rev data2, data2 )
- /*
- * For big-endian we cannot use the trick with the syndrome value
- * as carry-propagation can corrupt the upper bits if the trailing
- * bytes in the string contain 0x01.
- * However, if there is no NUL byte in the dword, we can generate
- * the result directly. We can't just subtract the bytes as the
- * MSB might be significant.
- */
-CPU_BE( cbnz has_nul, 1f )
-CPU_BE( cmp data1, data2 )
-CPU_BE( cset result, ne )
-CPU_BE( cneg result, result, lo )
-CPU_BE( ret )
-CPU_BE( 1: )
- /* Re-compute the NUL-byte detection, using a byte-reversed value.*/
-CPU_BE( rev tmp3, data1 )
-CPU_BE( sub tmp1, tmp3, zeroones )
-CPU_BE( orr tmp2, tmp3, #REP8_7f )
-CPU_BE( bic has_nul, tmp1, tmp2 )
-CPU_BE( rev has_nul, has_nul )
-CPU_BE( orr syndrome, diff, has_nul )
- /*
- * The MS-non-zero bit of the syndrome marks either the first bit
- * that is different, or the top bit of the first zero byte.
- * Shifting left now will bring the critical information into the
- * top bits.
- */
- clz pos, syndrome
- lsl data1, data1, pos
- lsl data2, data2, pos
- /*
- * But we need to zero-extend (char is unsigned) the value and then
- * perform a signed 32-bit subtraction.
- */
- lsr data1, data1, #56
- sub result, data1, data2, lsr #56
- ret
-
-.Lremain8:
- /* Limit % 8 == 0 => all bytes significant. */
- ands limit, limit, #7
- b.eq .Lret0
-.Ltiny8proc:
- ldrb data1w, [src1], #1
- ldrb data2w, [src2], #1
- subs limit, limit, #1
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ ccmp diff, #0, #0, eq
+ b.ne .Lnot_limit
+ subs limit_wd, limit_wd, #1
+ b.pl .Lloop_misaligned

- ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
- ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
- b.eq .Ltiny8proc
- sub result, data1, data2
- ret
+.Ldone_loop:
+ /* We found a difference or a NULL before the limit was reached. */
+ and limit, limit, #7
+ cbz limit, .Lnot_limit
+ /* Read the last word. */
+ sub src1, src1, 8
+ sub src2, src2, 8
+ ldr data1, [src1, limit]
+ ldr data2, [src2, limit]
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ ccmp diff, #0, #0, eq
+ b.ne .Lnot_limit

.Lret0:
mov result, #0
--
2.17.1

2020-06-30 21:04:30

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 13/14] arm64: Add fixup routines for usercopy store exceptions

This adds the fixup routines for exceptions that occur on store
operations while copying: they provide the calling code with a more
accurate value for the number of bytes that failed to copy, rather
than returning the full buffer width.

The three routines for store exceptions work together to analyse
the position of the fault relative to the start or the end of the
buffer, and backtrack through the optimized memcpy algorithm to
determine how many bytes have already been successfully copied.

The store operations occur mostly in-order, with the exception of
a few copy size ranges - this is specific to the new copy template,
which uses the latest memcpy implementation.

Signed-off-by: Oliver Swede <[email protected]>
---
arch/arm64/lib/copy_user_fixup.S | 217 ++++++++++++++++++++++++++++++-
1 file changed, 215 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/lib/copy_user_fixup.S b/arch/arm64/lib/copy_user_fixup.S
index ae94b492129d..37ca3d99a02a 100644
--- a/arch/arm64/lib/copy_user_fixup.S
+++ b/arch/arm64/lib/copy_user_fixup.S
@@ -169,12 +169,225 @@ addr .req x15
*/
b L(end_fixup)

+/*
+ * The following three routines are directed to from faults
+ * on store instructions.
+ */
9996:
+ /*
+ * This routine is reached from faults on store instructions
+ * where the target address has been specified relative to the
+ * start of the user space memory buffer, and is also not
+ * guaranteed to be 16-byte aligned.
+ *
+ * For copy sizes <= 128 bytes, the stores occur in-order,
+ * so it has copied up to (addr-dst)&~15.
+ * For copy sizes > 128 bytes, this should only be directed
+ * to from a fault on the first store of the long copy, before
+ * the algorithm aligns dst to 16B, so no bytes have copied at
+ * this point.
+ */
+
+ /* Retrieve original params from stack */
+ ldr dst, [sp], #16 // dst: x3, src: x1
+ ldr count, [sp], #16 // count: x2
+ add srcend, src, count
+ add dstend, dst, count
+
+ /* count <= 3 -> count - (addr-dst) */
+ cmp count, 3
+ sub x0, addr, dst // relative fault offset in buffer
+ sub x0, count, x0 // bytes yet to copy
+ b.le L(end_fixup)
+ /* 3 < count <= 32 -> count - (addr-dst) */
+ cmp count, 32
+ b.le L(none_copied)
+ /* 32 < count < 128 -> count - ((addr-dst)&15) */
+ cmp count, 128
+ sub x0, addr, dst // relative fault offset
+ bic x0, x0, 15 // bytes already copied (steps of 16B stores)
+ sub x0, count, x0 // bytes yet to copy
+ b.le L(end_fixup)
+ /* 128 < count -> count */
+ b L(none_copied)
+
9997:
+ /*
+ * This routine is reached from faults on store instructions
+ * where the target address has been specified relative to
+ * the end of the user space memory buffer and is also not
+ * guaranteed to be 16-byte aligned.
+ *
+ * In this scenario, the copy is close to completion and
+ * has occurred in-order, so it is straightforward to
+ * calculate the remaining bytes.
+ *
+ * The algorithm increments dst by 64B for each loop64()
+ * subroutine, and tmp1 stores its latest value, which
+ * allows for the calculation of a threshold that it has
+ * copied up to.
+ *
+ * To account for faults on data that has already been copied
+ * (which can occur due to the way the algorithm uses
+ * overlapping copies within a buffer), this threshold can be
+ * subtracted from the copy size and the result compared
+ * against the remaining bytes after the fault offset in the
+ * last 64B; the minimum of the two can then be taken for the
+ * return value.
+ */
+
+ /* Save the current (adjusted) dst for later restoration */
+ mov tmp1, dst
+
+ /* Retrieve original params from stack */
+ ldp dst, src, [sp], #16 // dst: x3, src: x1
+ ldr count, [sp], #16 // count: x2
+ add srcend, src, count
+ add dstend, dst, count
+
+ /*
+ * Overlapping buffers:
+ * (src <= dst && dst < srcend) || (dst <= src && src < dstend)
+ */
+ cmp src, dst
+ ccmp dst, srcend, #0, le
+ b.lt L(none_copied)
+ cmp dst, src
+ ccmp src, dstend, #0, le
+ b.lt L(none_copied)
+
+ /*
+ * For copy size >128 bytes, select max of
+ * { tmp1-dst+80, ((addr-dstend+64)&15) }
+ */
+ sub tmp1, tmp1, dst // relative position of new dst
+ add tmp1, tmp1, 80 // copied up to here
+ sub tmp1, count, tmp1 // remaining bytes after non-overlapping section
+ sub x0, dstend, 64
+ sub x0, addr, x0
+ bic x0, x0, 15 // fault offset within dest. buffer
+ add x0, dstend, x0
+ sub x0, x0, 64
+ sub x0, dstend, x0 // remaining bytes in final (overlapping) 64B
+ cmp x0, tmp1
+ csel x0, x0, tmp1, lt
+ cmp count, 128
+ b.gt L(end_fixup)
+
+ cmp count, 2
+ b.le L(none_copied)
+ cmp count, 3 // one byte left
+ mov x0, 1
+ b.eq L(end_fixup)
+ cmp count, 7 // four bytes left
+ sub x0, count, 4
+ b.le L(end_fixup)
+ cmp count, 15 // eight bytes left
+ sub x0, count, 8
+ b.le L(end_fixup)
+ cmp count, 32 // 16 bytes left
+ sub x0, count, 16
+ b.le L(end_fixup)
+ /*
+ * For copy size 33..64 select min of
+ * {(32 - fault_offset), (count-32)}
+ * as copy may overlap
+ */
+ sub tmp1, dstend, 32
+ sub tmp1, addr, tmp1
+ bic tmp1, tmp1, 15
+ mov x0, 32
+ sub tmp1, x0, tmp1
+ sub x0, count, 32
+ cmp x0, tmp1
+ csel x0, x0, tmp1, lt
+ cmp count, 64
+ b.le L(end_fixup)
+ /* For copy size 65..96 select min of
+ * {(count - fault_offset), (count-64)}
+ * as copy may overlap
+ */
+ sub tmp1, dstend, 32
+ sub tmp1, addr, tmp1
+ bic tmp1, tmp1, 15
+ mov x0, 32
+ sub tmp1, x0, tmp1
+ sub x0, count, 64
+ cmp x0, tmp1
+ csel x0, x0, tmp1, lt
+ cmp count, 96
+ b.lt L(end_fixup)
+ /*
+ * For copy size 97..128 same as above, but account for
+ * out-of-order initial stores, where no bytes have been
+ * copied on those faults
+ */
+ sub tmp1, dstend, 64
+ sub tmp1, addr, tmp1
+ bic tmp1, tmp1, 15
+ mov x0, 64
+ sub tmp1, x0, tmp1
+ cmp count, 128
+ mov x0, 32
+ ccmp tmp1, x0, #0, le
+ b.gt L(none_copied) // none copied if faults in first or second chunk
+ sub x0, count, 64
+ cmp x0, tmp1
+ csel x0, x0, tmp1, lt
+ cmp count, 128
+ b.le L(end_fixup)
+
+ b L(none_copied)
+
9998:
- /* Retrieve info from stack */
+ /*
+ * This routine is reached from faults on store instructions
+ * where the target address has been specified relative to the
+ * start of the user space memory buffer, and is also
+ * guaranteed to be 16-byte aligned.
+ *
+ * These instrucions occur after the algorithm aligns dst
+ * to 16B, which is after the very first store in a long copy.
+ * It then continues copying from dst+16 onwards.
+ *
+ * This could result in the second store attempting to copy
+ * data that has already been copied, as there would be an
+ * overlap in the buffer if the original dst is not 16 bytes
+ * aligned. In this case we report that 16 bytes has already
+ * already been copied, so it must take the minimum of the
+ * two values.
+ */
+
+ /* Retrieve original params from stack */
+ ldp dst, src, [sp], #16 // dst: x3, src: x1
ldr count, [sp], #16 // count: x2
- add sp, sp, 32
+ add srcend, src, count
+ add dstend, dst, count
+
+ /*
+ * Overlapping buffers:
+ * (src <= dst && dst < srcend) || (dst <= src && src < dstend)
+ */
+ cmp src, dst
+ ccmp dst, srcend, #0, le
+ b.lt L(none_copied)
+ cmp dst, src
+ ccmp src, dstend, #0, le
+ b.lt L(none_copied)
+
+ /* Take the min from {16,(fault_addr&15)-(dst&15)}
+ * and subtract from count to obtain the return value */
+ bic tmp1, dst, 15 // aligned dst
+ bic x0, addr, 15
+ sub x0, x0, tmp1 // relative fault offset
+ cmp x0, 16
+ bic x0, addr, 15
+ sub x0, x0, dst
+ sub x0, count, x0
+ b.gt L(end_fixup)
+ sub x0, count, 16
+ b L(end_fixup) // initial unaligned chunk copied
+
L(none_copied):
/*
* Return the initial count as the number
--
2.17.1

2020-06-30 21:05:44

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 02/14] arm64: kprobes: Drop open-coded exception fixup

From: Robin Murphy <[email protected]>

The short-circuit call to fixup_exception() from kprobe_fault_handler()
poses a problem now that the former wants to consume the fault address
too, since the common kprobes API offers us no way to pass it through.
Fortunately, however, it works out to be unnecessary:

- uaccess instructions themselves are not probeable, so at most we
should only ever expect to take a fixable fault from the pre or post
handlers.
- the pre and post handler run with preemption disabled, thus for any
fault they may cause, an unhandled return from kprobe_page_fault()
will proceed directly to __do_kernel_fault() thanks to the
faulthandler_disabled() check.
- __do_kernel_fault() will immediately call fixup_exception() unless
we're in an EL1 instruction abort, and if we've somehow taken one of
those on what we think is the middle of a uaccess routine, then the
world is already very on fire.

Thus we can reasonably drop the call from kprobe_fault_handler() and
leave uaccess fixups to the regular flow.

Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Oliver Swede <[email protected]>
---
arch/arm64/kernel/probes/kprobes.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index d1c95dcf1d78..771635360110 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -334,13 +334,6 @@ int __kprobes kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr)
*/
if (cur->fault_handler && cur->fault_handler(cur, regs, fsr))
return 1;
-
- /*
- * In case the user-specified fault handler returned
- * zero, try to fix up.
- */
- if (fixup_exception(regs))
- return 1;
}
return 0;
}
--
2.17.1

2020-06-30 21:06:09

by Oliver Swede

[permalink] [raw]
Subject: [PATCH v4 03/14] arm64: Import latest version of Cortex Strings' memcmp

From: Sam Tebbs <[email protected]>

Import the latest version of Cortex Strings' memcmp function.

The upstream source is src/aarch64/memcmp.S as of commit f77e4c932b4f
in https://git.linaro.org/toolchain/cortex-strings.git.

Signed-off-by: Sam Tebbs <[email protected]>
[ rm: update attribution, expand commit message ]
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Oliver Swede <[email protected]>
---
arch/arm64/lib/memcmp.S | 333 ++++++++++++++--------------------------
1 file changed, 117 insertions(+), 216 deletions(-)

diff --git a/arch/arm64/lib/memcmp.S b/arch/arm64/lib/memcmp.S
index c0671e793ea9..580dd0b12ccb 100644
--- a/arch/arm64/lib/memcmp.S
+++ b/arch/arm64/lib/memcmp.S
@@ -1,13 +1,12 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
- * Copyright (C) 2013 ARM Ltd.
- * Copyright (C) 2013 Linaro.
+ * Copyright (c) 2013, 2018 Linaro Limited. All rights reserved.
+ * Copyright (c) 2017 ARM Ltd. All rights reserved.
*
- * This code is based on glibc cortex strings work originally authored by Linaro
- * be found @
+ * This code is based on glibc Cortex Strings work originally authored by
+ * Linaro, found at:
*
- * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
- * files/head:/src/aarch64/
+ * https://git.linaro.org/toolchain/cortex-strings.git
*/

#include <linux/linkage.h>
@@ -25,223 +24,125 @@
* x0 - a compare result, maybe less than, equal to, or greater than ZERO
*/

+#define L(l) .L ## l
+
/* Parameters and result. */
-src1 .req x0
-src2 .req x1
-limit .req x2
-result .req x0
+#define src1 x0
+#define src2 x1
+#define limit x2
+#define result w0

/* Internal variables. */
-data1 .req x3
-data1w .req w3
-data2 .req x4
-data2w .req w4
-has_nul .req x5
-diff .req x6
-endloop .req x7
-tmp1 .req x8
-tmp2 .req x9
-tmp3 .req x10
-pos .req x11
-limit_wd .req x12
-mask .req x13
+#define data1 x3
+#define data1w w3
+#define data1h x4
+#define data2 x5
+#define data2w w5
+#define data2h x6
+#define tmp1 x7
+#define tmp2 x8

SYM_FUNC_START_WEAK_PI(memcmp)
- cbz limit, .Lret0
- eor tmp1, src1, src2
- tst tmp1, #7
- b.ne .Lmisaligned8
- ands tmp1, src1, #7
- b.ne .Lmutual_align
- sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
- lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
- /*
- * The input source addresses are at alignment boundary.
- * Directly compare eight bytes each time.
- */
-.Lloop_aligned:
- ldr data1, [src1], #8
- ldr data2, [src2], #8
-.Lstart_realigned:
- subs limit_wd, limit_wd, #1
- eor diff, data1, data2 /* Non-zero if differences found. */
- csinv endloop, diff, xzr, cs /* Last Dword or differences. */
- cbz endloop, .Lloop_aligned
-
- /* Not reached the limit, must have found a diff. */
- tbz limit_wd, #63, .Lnot_limit
-
- /* Limit % 8 == 0 => the diff is in the last 8 bytes. */
- ands limit, limit, #7
- b.eq .Lnot_limit
- /*
- * The remained bytes less than 8. It is needed to extract valid data
- * from last eight bytes of the intended memory range.
- */
- lsl limit, limit, #3 /* bytes-> bits. */
- mov mask, #~0
-CPU_BE( lsr mask, mask, limit )
-CPU_LE( lsl mask, mask, limit )
- bic data1, data1, mask
- bic data2, data2, mask
-
- orr diff, diff, mask
- b .Lnot_limit
-
-.Lmutual_align:
- /*
- * Sources are mutually aligned, but are not currently at an
- * alignment boundary. Round down the addresses and then mask off
- * the bytes that precede the start point.
- */
- bic src1, src1, #7
- bic src2, src2, #7
- ldr data1, [src1], #8
- ldr data2, [src2], #8
- /*
- * We can not add limit with alignment offset(tmp1) here. Since the
- * addition probably make the limit overflown.
- */
- sub limit_wd, limit, #1/*limit != 0, so no underflow.*/
- and tmp3, limit_wd, #7
- lsr limit_wd, limit_wd, #3
- add tmp3, tmp3, tmp1
- add limit_wd, limit_wd, tmp3, lsr #3
- add limit, limit, tmp1/* Adjust the limit for the extra. */
-
- lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
- neg tmp1, tmp1/* Bits to alignment -64. */
- mov tmp2, #~0
- /*mask off the non-intended bytes before the start address.*/
-CPU_BE( lsl tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
- /* Little-endian. Early bytes are at LSB. */
-CPU_LE( lsr tmp2, tmp2, tmp1 )
-
- orr data1, data1, tmp2
- orr data2, data2, tmp2
- b .Lstart_realigned
-
- /*src1 and src2 have different alignment offset.*/
-.Lmisaligned8:
- cmp limit, #8
- b.lo .Ltiny8proc /*limit < 8: compare byte by byte*/
-
- and tmp1, src1, #7
- neg tmp1, tmp1
- add tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
- and tmp2, src2, #7
- neg tmp2, tmp2
- add tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
- subs tmp3, tmp1, tmp2
- csel pos, tmp1, tmp2, hi /*Choose the maximum.*/
-
- sub limit, limit, pos
- /*compare the proceeding bytes in the first 8 byte segment.*/
-.Ltinycmp:
- ldrb data1w, [src1], #1
- ldrb data2w, [src2], #1
- subs pos, pos, #1
- ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
- b.eq .Ltinycmp
- cbnz pos, 1f /*diff occurred before the last byte.*/
- cmp data1w, data2w
- b.eq .Lstart_align
-1:
- sub result, data1, data2
- ret
-
-.Lstart_align:
- lsr limit_wd, limit, #3
- cbz limit_wd, .Lremain8
-
- ands xzr, src1, #7
- b.eq .Lrecal_offset
- /*process more leading bytes to make src1 aligned...*/
- add src1, src1, tmp3 /*backwards src1 to alignment boundary*/
- add src2, src2, tmp3
- sub limit, limit, tmp3
- lsr limit_wd, limit, #3
- cbz limit_wd, .Lremain8
- /*load 8 bytes from aligned SRC1..*/
- ldr data1, [src1], #8
- ldr data2, [src2], #8
-
- subs limit_wd, limit_wd, #1
- eor diff, data1, data2 /*Non-zero if differences found.*/
- csinv endloop, diff, xzr, ne
- cbnz endloop, .Lunequal_proc
- /*How far is the current SRC2 from the alignment boundary...*/
- and tmp3, tmp3, #7
-
-.Lrecal_offset:/*src1 is aligned now..*/
- neg pos, tmp3
-.Lloopcmp_proc:
- /*
- * Divide the eight bytes into two parts. First,backwards the src2
- * to an alignment boundary,load eight bytes and compare from
- * the SRC2 alignment boundary. If all 8 bytes are equal,then start
- * the second part's comparison. Otherwise finish the comparison.
- * This special handle can garantee all the accesses are in the
- * thread/task space in avoid to overrange access.
- */
- ldr data1, [src1,pos]
- ldr data2, [src2,pos]
- eor diff, data1, data2 /* Non-zero if differences found. */
- cbnz diff, .Lnot_limit
-
- /*The second part process*/
- ldr data1, [src1], #8
- ldr data2, [src2], #8
- eor diff, data1, data2 /* Non-zero if differences found. */
- subs limit_wd, limit_wd, #1
- csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
- cbz endloop, .Lloopcmp_proc
-.Lunequal_proc:
- cbz diff, .Lremain8
-
-/* There is difference occurred in the latest comparison. */
-.Lnot_limit:
-/*
-* For little endian,reverse the low significant equal bits into MSB,then
-* following CLZ can find how many equal bits exist.
-*/
-CPU_LE( rev diff, diff )
-CPU_LE( rev data1, data1 )
-CPU_LE( rev data2, data2 )
-
- /*
- * The MS-non-zero bit of DIFF marks either the first bit
- * that is different, or the end of the significant data.
- * Shifting left now will bring the critical information into the
- * top bits.
- */
- clz pos, diff
- lsl data1, data1, pos
- lsl data2, data2, pos
- /*
- * We need to zero-extend (char is unsigned) the value and then
- * perform a signed subtraction.
- */
- lsr data1, data1, #56
- sub result, data1, data2, lsr #56
+ subs limit, limit, 8
+ b.lo L(less8)
+
+ ldr data1, [src1], 8
+ ldr data2, [src2], 8
+ cmp data1, data2
+ b.ne L(return)
+
+ subs limit, limit, 8
+ b.gt L(more16)
+
+ ldr data1, [src1, limit]
+ ldr data2, [src2, limit]
+ b L(return)
+
+L(more16):
+ ldr data1, [src1], 8
+ ldr data2, [src2], 8
+ cmp data1, data2
+ bne L(return)
+
+ /* Jump directly to comparing the last 16 bytes for 32 byte (or less)
+ strings. */
+ subs limit, limit, 16
+ b.ls L(last_bytes)
+
+ /* We overlap loads between 0-32 bytes at either side of SRC1 when we
+ try to align, so limit it only to strings larger than 128 bytes. */
+ cmp limit, 96
+ b.ls L(loop16)
+
+ /* Align src1 and adjust src2 with bytes not yet done. */
+ and tmp1, src1, 15
+ add limit, limit, tmp1
+ sub src1, src1, tmp1
+ sub src2, src2, tmp1
+
+ /* Loop performing 16 bytes per iteration using aligned src1.
+ Limit is pre-decremented by 16 and must be larger than zero.
+ Exit if <= 16 bytes left to do or if the data is not equal. */
+ .p2align 4
+L(loop16):
+ ldp data1, data1h, [src1], 16
+ ldp data2, data2h, [src2], 16
+ subs limit, limit, 16
+ ccmp data1, data2, 0, hi
+ ccmp data1h, data2h, 0, eq
+ b.eq L(loop16)
+
+ cmp data1, data2
+ bne L(return)
+ mov data1, data1h
+ mov data2, data2h
+ cmp data1, data2
+ bne L(return)
+
+ /* Compare last 1-16 bytes using unaligned access. */
+L(last_bytes):
+ add src1, src1, limit
+ add src2, src2, limit
+ ldp data1, data1h, [src1]
+ ldp data2, data2h, [src2]
+ cmp data1, data2
+ bne L(return)
+ mov data1, data1h
+ mov data2, data2h
+ cmp data1, data2
+
+ /* Compare data bytes and set return value to 0, -1 or 1. */
+L(return):
+#ifndef __AARCH64EB__
+ rev data1, data1
+ rev data2, data2
+#endif
+ cmp data1, data2
+L(ret_eq):
+ cset result, ne
+ cneg result, result, lo
ret

-.Lremain8:
- /* Limit % 8 == 0 =>. all data are equal.*/
- ands limit, limit, #7
- b.eq .Lret0
-
-.Ltiny8proc:
- ldrb data1w, [src1], #1
- ldrb data2w, [src2], #1
- subs limit, limit, #1
-
- ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
- b.eq .Ltiny8proc
- sub result, data1, data2
- ret
-.Lret0:
- mov result, #0
+ .p2align 4
+ /* Compare up to 8 bytes. Limit is [-8..-1]. */
+L(less8):
+ adds limit, limit, 4
+ b.lo L(less4)
+ ldr data1w, [src1], 4
+ ldr data2w, [src2], 4
+ cmp data1w, data2w
+ b.ne L(return)
+ sub limit, limit, 4
+L(less4):
+ adds limit, limit, 4
+ beq L(ret_eq)
+L(byte_loop):
+ ldrb data1w, [src1], 1
+ ldrb data2w, [src2], 1
+ subs limit, limit, 1
+ ccmp data1w, data2w, 0, ne /* NZCV = 0b0000. */
+ b.eq L(byte_loop)
+ sub result, data1w, data2w
ret
SYM_FUNC_END_PI(memcmp)
EXPORT_SYMBOL_NOKASAN(memcmp)
--
2.17.1

2020-07-01 08:16:42

by Oliver Swede

[permalink] [raw]
Subject: Re: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines

Hi all,

(Apologies if you have already received this follow-up; please ignore
any previous duplicates. I am re-sending this with the lists in the CC
as the formatting filters caught me out :) Thanks)

> Version 3 addressed this but I later found some issues with the fixup
> correctness after further testing, and have partially re-written them
> here, and addressed some other behaviours of the copy algorithm.

Further to the above description from the cover letter, I have expanded
on this below.

The approach introduced in this version redirects a faulting
instruction to one of multiple fixup subroutines, depending on
properties of that instance of the instruction in the copy template:
whether it specifies the target address relative to the start or end
of the buffer, and whether the algorithm has reached a stage where the
destination address has been set to be 16-byte aligned (this is more
relevant for longer copies).

These two properties both relate to the copy algorithm's aim of
reducing the total number of instructions by making accesses that may
overlap, and thus copy the same data more than once. This is in
contrast to the current algorithm, which may need to branch to
subroutines that complete a residual number of bytes that is not a
multiple of the unit size it initially branched to. E.g. with the new
algorithm, a 17-byte copy will complete with two load and two store
instructions, by copying the first 16 bytes from the start of the
buffer, then the last 16 bytes ending at the end of the buffer.
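
To illustrate the technique in C (a rough sketch only, using the
kernel's unaligned-access helpers rather than the actual LDP/STP
sequences; the helper name and the 8-byte unit are illustrative):

    #include <linux/types.h>
    #include <asm/unaligned.h>

    /*
     * Copy 9..16 bytes with exactly two loads and two stores, letting
     * the second pair overlap the first.
     */
    static void copy_9_to_16(char *dst, const char *src, size_t count)
    {
            u64 a = get_unaligned((const u64 *)src);
            u64 b = get_unaligned((const u64 *)(src + count - 8));

            put_unaligned(a, (u64 *)dst);               /* bytes [0, 8) */
            put_unaligned(b, (u64 *)(dst + count - 8)); /* bytes [count-8, count),
                                                           may overlap the first store */
    }

The real routine does the same with 16-byte LDP/STP pairs, so the
17-byte case above is covered by the ranges [0, 16) and
[count-16, count).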

There appear to be two types of overlapping copies within a source or
destination buffer:
1. If src is initially non-16-byte aligned, then it is re-set to
(src & ~15), and the copying then resumes from aligned_src+16. In this
scenario any data between aligned_src+16 and src+16 could repeat in the
overlapping instruction.
2. If the target address is specified relative to the end of the buffer
(e.g. [srcend, -16]) then an overlap is taking place, so the fixup
should check the fault address against the threshold up to which data
is known to have already been copied.

In particular, encapsulating these properties through separate fixups
allows for:

a) Accurate rounding-down to the base of the block within which the
fault occurred.
The assumption here is that load and store instructions are
single-copy atomic, so a fault in one instruction will not result in
any of its other bytes succeeding. For instance, the copy unit size
could be 1, 4, 8, or 16, depending on the instruction (ldrb/strb,
ldr/str, or ldp/stp), but any threshold prior to which copying may have
succeeded will be the target address, i.e. the base of this block.
If the fixup routine itself implies certain information about the
target address, then we should never end up in a situation where an
address faults but the backtracking to the target is inaccurate (as a
result of copy overlaps within the buffer). A rough C sketch of this
calculation, and of the one for point b) below, follows after b).

b) All data copied up to some address to be taken into account, in the
case where a page is swapped out immediately before the same address is
accessed as a result of an overlap.
This is somewhat related to the above, but addresses a specific
situation, where a page could be swapped out mid-way through a copy and
result in the same address faulting on its second copy, even if the
algorithm has succeeded up to a threshold address beyond the fault.
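
To make a) and b) concrete, the arithmetic reduces to roughly the
following. This is only a C sketch of the calculations under the
assumptions above (the helper, its parameters and the 'threshold' and
'end_relative' names are illustrative); the real fixups are the
assembly routines in copy_user_fixup.S:

    #include <linux/kernel.h>
    #include <linux/types.h>

    /* addr is the faulting target address passed to the fixup in x15. */
    static size_t bytes_remaining(unsigned long addr, unsigned long dst,
                                  unsigned long dstend, size_t count,
                                  size_t threshold, bool end_relative)
    {
            if (!end_relative)
                    /*
                     * a) In-order 16-byte stores addressed relative to
                     *    the start of the buffer: round the fault
                     *    offset down to the base of its block.
                     */
                    return count - ((addr - dst) & ~15UL);

            /*
             * b) Stores addressed relative to dstend, where the final
             *    64-byte chunk may overlap data that has already been
             *    copied: take the minimum of the bytes left after the
             *    known-good threshold and the bytes left from the
             *    faulting block to the end of the buffer.
             */
            return min_t(size_t, count - threshold,
                         64 - ((addr - (dstend - 64)) & ~15UL));
    }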

Within each fixup, the calculation of the number of bytes remaining is
performed with knowledge of the stage the algorithm has reached and of
any steps completed before the fault, together with the information
retrieved from the stack, which contains the original arguments to the
usercopy function. From this it is able to determine the accurate
return value. For systems implementing UAO, because the ldp/stp
instructions are substituted with 2x ldtr/sttr instructions, a fault in
the second half of the doubleword needs to take any completion of the
first half into account, and there are in-line modifications for this
(they are not necessary for the load instruction fixups).
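
For illustration, here is a hedged sketch in C of the kind of
arithmetic a store-side fixup performs; the names (dst, count, faddr,
access_size, copied_threshold) are mine and do not correspond to labels
in copy_user_fixup.S, and the real routines do this in assembly with
the original arguments reloaded from the stack:

#include <stddef.h>

/* Sketch only: bytes not yet copied, given the original destination and
 * size (recovered from the stack), the faulting address, the width of
 * the faulting access (1, 4, 8 or 16 bytes), and a threshold up to
 * which data is already known to have been copied (for the overlapping
 * re-fault case in (b)). */
static size_t bytes_not_copied(size_t dst, size_t count, size_t faddr,
			       size_t access_size, size_t copied_threshold)
{
	size_t dstend = dst + count;

	/* (a): round down to the base of the faulting access; the access
	 * is single-copy atomic, so none of its bytes were written */
	size_t fault_base = faddr & ~(access_size - 1);

	/* (b): credit anything already known to have been copied, even
	 * if the fault address lies below that threshold */
	size_t done_to = fault_base > copied_threshold ? fault_base
						       : copied_threshold;

	return dstend - done_to;
}

On a UAO system, where an stp is issued as two sttr accesses, a fault
on the second access would additionally credit the first 8 bytes, which
is what the in-line modifications mentioned above account for.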

The categorization of the fixups based on the algorithm's behaviour may
also provide a general approach to replacing the fixups for future copy
imports, as they are fairly modular. However, there is an open question
as to whether it would be easy to replace them given this framework, as
the fixup subroutines are relatively long for the store instructions. I
have added comments to explain the calculations, but I am wondering if
it would be possible to divert the fixup logic to make it easier to
modify, if the aim is to reduce the coupling between the fixup routines
and the copy algorithm.

I am waiting on access to the relevant machine before posting the
benchmark results for this optimized memcpy, but Sam reported the
following with the similar (but now slightly older) cortex-strings version:
* copy_from_user: 13.17%
* copy_to_user: 4.8%
* memcpy: 27.88%
* copy_in_user: Didn't appear in the test results.
This machine will also be used to check the fixups are accurate on a
system with UAO - they appear to be exact on a non-UAO system with PAN
that I've been working on locally.

I should also mention that the correctness of these routines was tested
using a selftest module akin to lib/test_user_copy.c (whose usercopy
checks these patches do pass), but which is more specific to the fixup
accuracy, in that it compares the return value with the true number of
bytes remaining in the destination buffer at the point of a fault.
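
As a rough, hedged sketch of the property being checked (the module is
being posted separately; these names are mine, and the real module
drives the copies across a deliberately faulting user mapping):

#include <stddef.h>

/* Sketch only: after a faulting copy into 'dst' returns 'ret' bytes
 * remaining, find how much of the destination was genuinely written.
 * 'dst' is pre-filled with a poison value that cannot occur in 'ref',
 * and the successfully copied bytes are assumed to form a contiguous
 * prefix. */
static size_t bytes_actually_remaining(const char *dst, const char *ref,
				       size_t size)
{
	size_t i;

	for (i = 0; i < size; i++)
		if (dst[i] != ref[i])
			break;
	return size - i;
}

/* an exact fixup means ret == bytes_actually_remaining(dst, ref, size)
 * for every copy size and fault offset exercised */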

Thanks in advance,
Oliver

2020-09-07 10:11:16

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines

Hi Oli,

Thanks for this. Just a few high-level comments below.

On Wed, Jul 01, 2020 at 09:12:49AM +0100, Oli Swede wrote:
> > Version 3 addressed this but I later found some issues with the fixup
> > correctness after further testing, and have partially re-written them
> > here, and addressed some other behaviours of the copy algorithm.

[...]

> I am waiting on access to the relevant machine before posting the benchmark
> results for this optimized memcpy, but Sam reported the following with the
> similar (but now slightly older) cortex-strings version:
> * copy_from_user: 13.17%
> * copy_to_user: 4.8%
> * memcpy: 27.88%
> * copy_in_user: Didn't appear in the test results.
> This machine will also be used to check the fixups are accurate on a system
> with UAO - they appear to be exact on a non-UAO system with PAN that I've
> been working on locally.

I'm inclined to say that cortex-strings is probably not a good basis for
our uaccess routines. The code needs to be adapted in a non-straightforward
way so that we lose pretty much all of the benefits we'd usually get from
adopting an existing implementation; we can't pull in fixes or improvements
without a lot of manual effort, we can't reuse existing testing infrastructure
(see below) and we end up being a "second-class" user of the routines
because of the discrepancies in implementation.

So why don't we use cortex-strings as a basis for the in-kernel routines
only, preferably in a form where the code can be used directly and updated
with a script (e.g. similar to how we pull in arch/arm64/crypto routines
from OpenSSL). We can then roll our own uaccess routines, using a slightly
more straight-forward implementation which is more amenable to handling
user faults and doesn't do things like over copying.

> I should also mention that the correctness of these routines were tested
> using a selftest test module akin to lib/test_user_copy.c (whose usercopy
> functionality checks these patches do pass) but which is more specific to
> the fixup accuracy, in that it compares the return value with the true
> number of bytes remaining in the destination buffer at the point of a fault.

Can we put this test module into the kernel source tree, please, maybe as
part of lkdtm? Given the control flow of these optimised functions, I think
we absolutely need targeted testing to make sure we're getting complete
coverage.

Will

2020-09-11 11:36:21

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines

On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
> On Wed, Jul 01, 2020 at 09:12:49AM +0100, Oli Swede wrote:
> > > Version 3 addressed this but I later found some issues with the fixup
> > > correctness after further testing, and have partially re-written them
> > > here, and addressed some other behaviours of the copy algorithm.
>
> [...]
>
> > I am waiting on access to the relevant machine before posting the benchmark
> > results for this optimized memcpy, but Sam reported the following with the
> > similar (but now slightly older) cortex-strings version:
> > * copy_from_user: 13.17%
> > * copy_to_user: 4.8%
> > * memcpy: 27.88%
> > * copy_in_user: Didn't appear in the test results.
> > This machine will also be used to check the fixups are accurate on a system
> > with UAO - they appear to be exact on a non-UAO system with PAN that I've
> > been working on locally.
>
> I'm inclined to say that cortex-strings is probably not a good basis for
> our uaccess routines. The code needs to be adapted in a non-straightforward
> way so that we lose pretty much all of the benefits we'd usually get from
> adopting an existing implementation; we can't pull in fixes or improvements
> without a lot of manual effort, we can't reuse existing testing infrastructure
> (see below) and we end up being a "second-class" user of the routines
> because of the discrepancies in implementation.

I was a bit more optimistic about this series until I saw the
copy_user_fixup.S changes (patches 12 to 14). I have a suspicion it's
only Oli (and maybe Robin) who understands it, so from a maintainer's
perspective it doesn't really scale. It's also very fragile with any
minor tweak in the actual copy routine potentially breaking the fixup
code.

> So why don't we use cortex-strings as a basis for the in-kernel routines
> only, preferably in a form where the code can be used directly and updated
> with a script (e.g. similar to how we pull in arch/arm64/crypto routines
> from OpenSSL). We can then roll our own uaccess routines, using a slightly
> more straight-forward implementation which is more amenable to handling
> user faults and doesn't do things like over copying.

I think that's probably the best option. I wouldn't mind replacing the
in-kernel memcpy/strcpy etc. with these patches since the work was done
already but definitely not for the uaccess and fixup routines (we still
have the original implementation in the git log).

A script would work even better. Do we have any issue with licensing
though? Cortex Strings is BSD (3-clause IIRC) and copyright owned by
Linaro. I got them to officially agree with relicensing (dual license)
the latest copy under GPLv2 so that we can contribute it to the kernel.
But since the project license is still BSD, any future updates in there
are BSD-only.

Maybe someone who understands this stuff can confirm that it's ok to
regularly grab the Cortex Strings files into a GPLv2 codebase without
asking for Linaro's permission.

BTW, you could pick the kprobes patch in here, that explicit fixup call
is not necessary.

--
Catalin

2020-09-11 15:22:02

by Oliver Swede

[permalink] [raw]
Subject: Re: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines

Hi Will and Catalin,

Thank you both for having a look at this. The coupling between the
usercopy routines and the fixup routines was my main concern as well,
and I understand the desire to take an approach that avoids this
dependency when it comes to importing new copy routines.

On 9/11/20 12:29 PM, Catalin Marinas wrote:
>
> I was a bit more optimistic about this series until I saw the
> copy_user_fixup.S changes (patches 12 to 14). I have a suspicion it's
> only Oli (and maybe Robin) who understands it, so from a maintainer's
> perspective it doesn't really scale. It's also very fragile with any
> minor tweak in the actual copy routine potentially breaking the fixup
> code.
>

I was wondering if I could check whether an alternative expression of
the logic involved (e.g. diverted to a C function, if technically
feasible, that would be easier to understand and modify) would not
really be desirable either?

On 9/11/20 12:29 PM, Catalin Marinas wrote:
> On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
>>
>> I'm inclined to say that cortex-strings is probably not a good basis for
>> our uaccess routines. The code needs to be adapted in a non-straightforward
>> way so that we lose pretty much all of the benefits we'd usually get from
>> adopting an existing implementation; we can't pull in fixes or improvements
>> without a lot of manual effort, we can't reuse existing testing infrastructure
>> (see below) and we end up being a "second-class" user of the routines
>> because of the discrepancies in implementation.
>
> I was a bit more optimistic about this series until I saw the
> copy_user_fixup.S changes (patches 12 to 14). I have a suspicion it's
> only Oli (and maybe Robin) who understands it, so from a maintainer's
> perspective it doesn't really scale. It's also very fragile with any
> minor tweak in the actual copy routine potentially breaking the fixup
> code.
>

An alternative approach to the fixup routines came up recently in
discussions; it could potentially allow the copy template to be
preserved in its current form and allow the uaccess routines to re-use
a shared copy algorithm (including cortex-strings and/or future
optimized out-of-order algorithms). It is based on one of Robin's
suggestions from a few weeks ago (I found some time to prototype this
locally the other day). The algorithm-specific backtracking would be
dropped in favour of first- and second-stage fixups, with each of these
being relatively short and straightforward, and (together) likely
compatible with future copy routines.
More specifically, within the fixup path invoked after the initial
fault, the routine could jump back N bytes and begin an in-order,
byte-wise copy from within the first fixup, and then intentionally
fault for a second time at either the same or another address. As this
intermediate scan-copy continues right up to the 'true' fault address,
and ignores any out-of-order techniques used by the optimized
algorithm, the fixup for the second fault can then make an accurate,
conclusive calculation of the return value (sketched below).
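
To make the idea concrete, here is a hedged sketch in C (the real thing
would be assembly in the fixup path; N, the function names and the
clamping detail are assumptions on my part):

#include <stddef.h>

/* Second-stage fixup: the in-order, byte-wise re-copy faulted at
 * 'faddr', so exactly the bytes from 'faddr' up to the end of the
 * buffer remain uncopied. */
static size_t second_stage_fixup(size_t bufend, size_t faddr)
{
	return bufend - faddr;
}

/* First-stage fixup: back off N bytes from the first fault address
 * (clamped to the start of the buffer the faulting instruction was
 * accessing) and retry in order, one byte at a time. Any further fault
 * is resolved conclusively by second_stage_fixup(). */
static size_t first_stage_fixup(unsigned char *dst, const unsigned char *src,
				size_t count, size_t faddr, size_t N,
				size_t buf_start)
{
	size_t resume = (faddr > buf_start + N) ? faddr - N : buf_start;
	size_t off;

	for (off = resume - buf_start; off < count; off++)
		dst[off] = src[off];	/* may fault again -> second stage */

	return 0;			/* re-copy completed: nothing remains */
}

The key point is that neither fixup depends on the access pattern of
the optimized copy algorithm, only on N.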

A technique that would need to be carried forward from patch 11 in v4
is to redirect the PC to a different 'intermediate' fixup (though only
two choices this time) depending on whether the faulting instruction is
a load or a store, so that it can establish whether the minimum address
to jump back to should be src or buf when comparing against (faddr-N).
Additionally, the parameters to the usercopy functions would still need
to be stored on the stack (patch 10 in v4). Otherwise, it would involve
a slight rewrite of v4 that could provide a more maintainable approach;
the same fixups would be shared for all copy sizes and fault addresses
(unlike in v4), and the only variable subject to re-evaluation with new
imports might be N (currently, for copy sizes over 64B it is guaranteed
that anything below the previous 64B will have been copied). It could
also remove the need to special-case UAO systems compared to v4.

If it's possible for pages to be swapped out in the middle of a
usercopy function, then this method might not provide the extra
implicit knowledge used by the fixups in v4 about the stage the
algorithm has reached (deduced from the fixup that has been routed to,
as implied by the fault-instruction-to-fixup mapping) to compensate for
this, so it could under-report the bytes copied if it faults at an
earlier address during the in-order (secondary) copy. This could be
outweighed by the potential advantages mentioned above; I mainly raise
it because v4 does account for this scenario (but perhaps
unnecessarily, if the scenario isn't possible or is too unlikely :) ),
and as you rightly point out it would be tiresome to re-write these
routines.

On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
>
> So why don't we use cortex-strings as a basis for the in-kernel routines
> only, preferably in a form where the code can be used directly and updated
> with a script (e.g. similar to how we pull in arch/arm64/crypto routines
> from OpenSSL). We can then roll our own uaccess routines, using a slightly
> more straight-forward implementation which is more amenable to handling
> user faults and doesn't do things like over copying.

I was wondering if the proposal described above is something that you
would want to look into, or if the use of separate copy routines
(script-based or otherwise) for in-kernel/uaccess is the preferred approach?

On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
>
> Can we put this test module into the kernel source tree, please, maybe as
> part of lkdtm? Given the control flow of these optimised functions, I think
> we absolutely need targeted testing to make sure we're getting complete
> coverage.

I will post the standalone module in its current state later today or
early next week. The tests are exhaustive, but if it would be useful to
include a mechanism for specifying certain ranges of faults/copy sizes,
or to improve the error reporting, I'd be happy to add these features
(and/or integrate it into lkdtm).

Many thanks,
Oli

