2021-01-28 00:08:16

by Pasha Tatashin

Subject: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

Changelog:
v11:
- Fixed missing KEXEC_CORE dependency for trans_pgd.c
- Removed useless "if(rc) return rc" statement (thank you Tyler Hicks)
- Another 12 patches were accepted into the maintainer's tree.
Re-based patches against:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
Branch: for-next/kexec
v10:
- Addressed a lot of comments from James Morse and from Marc Zyngier
- Added review-by's
- Synchronized with mainline

v9:
- 9 patches from the previous series landed upstream, so the series is
now smaller
- Added two patches from James Morse to address idmap issues for machines
with high physical addresses.
- Addressed comments from Selin Dag about compilation issues. He also tested
my series and got similar performance results: ~60 ms instead of ~580 ms
with an initramfs size of ~120MB.
v8:
- Synced with mainline to keep series up-to-date
v7:
- Addressed comments from James Morse
- arm64: hibernate: pass the allocated pgdp to ttbr0
Removed "Fixes" tag, and added Added Reviewed-by: James Morse
- arm64: hibernate: check pgd table allocation
Sent out as a standalone patch so it can be sent to stable
Series applies on mainline + this patch
- arm64: hibernate: add trans_pgd public functions
Remove second allocation of tmp_pg_dir in swsusp_arch_resume
Added Reviewed-by: James Morse <[email protected]>
- arm64: kexec: move relocation function setup and clean up
Fixed typo in commit log
Changed kern_reloc to phys_addr_t types.
Added explanation why kern_reloc is needed.
Split into four patches:
arm64: kexec: make dtb_mem always enabled
arm64: kexec: remove unnecessary debug prints
arm64: kexec: call kexec_image_info only once
arm64: kexec: move relocation function setup
- arm64: kexec: add expandable argument to relocation function
Changed types of new arguments from unsigned long to phys_addr_t.
Changed offset prefix to KEXEC_*
Split into four patches:
arm64: kexec: cpu_soft_restart change argument types
arm64: kexec: arm64_relocate_new_kernel clean-ups
arm64: kexec: arm64_relocate_new_kernel don't use x0 as temp
arm64: kexec: add expandable argument to relocation function
- arm64: kexec: configure trans_pgd page table for kexec
Added invalid entries into EL2 vector table
Removed KEXEC_EL2_VECTOR_TABLE_SIZE and KEXEC_EL2_VECTOR_TABLE_OFFSET
Copy relocation functions and table into separate pages
Changed types in kern_reloc_arg.
Split into three patches:
arm64: kexec: offset for relocation function
arm64: kexec: kexec EL2 vectors
arm64: kexec: configure trans_pgd page table for kexec
- arm64: kexec: enable MMU during kexec relocation
Split into two patches:
arm64: kexec: enable MMU during kexec relocation
arm64: kexec: remove head from relocation argument
v6:
- Sync with mainline tip
- Added Acked's from Dave Young
v5:
- Addressed comments from Matthias Brugger: added review-by's, improved
comments, and made cleanups to swsusp_arch_resume() in addition to
create_safe_exec_page().
- Synced with mainline tip.
v4:
- Addressed comments from James Morse.
- Split "check pgd table allocation" into two patches, and moved to
the beginning of series for simpler backport of the fixes.
Added "Fixes:" tags to commit logs.
- Changed "arm64, hibernate:" to "arm64: hibernate:"
- Added Reviewed-by's
- Moved "add PUD_SECT_RDONLY" earlier in series to be with other
clean-ups
- Added "Derived from:" to arch/arm64/mm/trans_pgd.c
- Removed "flags" from trans_info
- Changed .trans_alloc_page assumption to return zeroed page.
- Simplify changes to trans_pgd_map_page(), by keeping the old
code.
- Simplify changes to trans_pgd_create_copy, by keeping the old
code.
- Removed: "add trans_pgd_create_empty"
- replace init_mm with NULL, and keep using non "__" version of
populate functions.
v3:
- Split changes to create_safe_exec_page() into several patches for
easier review as requested by Mark Rutland. This is why this series
has 3 more patches.
- Renamed trans_table to trans_pgd as agreed with Mark. The header
comment in trans_pgd.c explains that trans stands for
transitional page tables, meaning they are used in the transition
between two kernels.
v2:
- Fixed hibernate bug reported by James Morse
- Addressed comments from James Morse:
* More incremental changes to trans_table
* Removed TRANS_FORCEMAP
* Added kexec reboot data for image with 380M in size.

Enable MMU during kexec relocation in order to improve reboot performance.

If kexec functionality is used for a fast system update, with minimal
downtime, the relocation of kernel + initramfs takes a significant portion
of the reboot time.

The relocation is slow because it is done with the MMU disabled, and thus
does not benefit from the D-cache.

Performance data
----------------
For this experiment, the size of kernel plus initramfs is small, only 25M.
If the initramfs were larger, the improvement would be greater, as the time
spent in relocation is proportional to the amount of data relocated.

Previously:
kernel shutdown 0.022131328s
relocation 0.440510736s
kernel startup 0.294706768s

Relocation was taking: 58.2% of reboot time

Now:
kernel shutdown 0.032066576s
relocation 0.022158152s
kernel startup 0.296055880s

Now: Relocation takes 6.3% of reboot time

Total reboot is 2.16x faster.
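(For reference, these figures follow directly from the numbers above:
 before: 0.0221 + 0.4405 + 0.2947 = 0.757s total; 0.4405 / 0.757 = 58.2%
 after:  0.0321 + 0.0222 + 0.2961 = 0.350s total; 0.0222 / 0.350 = 6.3%
 speedup: 0.757 / 0.350 = 2.16x)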

With a bigger userland (fitImage 380M), the reboot time is improved by 3.57s,
and is reduced from 3.9s down to 0.33s.

Previous approaches and discussions
-----------------------------------
v10: https://lore.kernel.org/linux-arm-kernel/[email protected]
v9: https://lore.kernel.org/lkml/[email protected]
v8: https://lore.kernel.org/lkml/[email protected]
v7: https://lore.kernel.org/lkml/[email protected]
v6: https://lore.kernel.org/lkml/[email protected]
v5: https://lore.kernel.org/lkml/[email protected]
v4: https://lore.kernel.org/lkml/[email protected]
v3: https://lore.kernel.org/lkml/[email protected]
v2: https://lore.kernel.org/lkml/[email protected]
v1: https://lore.kernel.org/lkml/[email protected]

Older approaches:
https://lore.kernel.org/lkml/[email protected]
Reserve space for kexec to avoid relocation; involves changes to generic code
to optimize a problem that exists on arm64 only.

https://lore.kernel.org/lkml/[email protected]
The first attempt to enable the MMU; it had bugs that prevented the
performance improvement. The page tables unnecessarily configured an idmap
for the whole physical space.

https://lore.kernel.org/lkml/[email protected]
No linear copy, bug with EL2 reboots.

Pavel Tatashin (6):
arm64: kexec: add expandable argument to relocation function
arm64: kexec: use ld script for relocation function
arm64: kexec: kexec may require EL2 vectors
arm64: kexec: configure trans_pgd page table for kexec
arm64: kexec: enable MMU during kexec relocation
arm64: kexec: remove head from relocation argument

arch/arm64/Kconfig | 2 +-
arch/arm64/include/asm/kexec.h | 37 ++++++
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/kernel/asm-offsets.c | 15 +++
arch/arm64/kernel/cpu-reset.S | 11 +-
arch/arm64/kernel/cpu-reset.h | 8 +-
arch/arm64/kernel/machine_kexec.c | 139 ++++++++++++++++++--
arch/arm64/kernel/relocate_kernel.S | 190 ++++++++++++++++++----------
arch/arm64/kernel/vmlinux.lds.S | 19 +++
9 files changed, 332 insertions(+), 90 deletions(-)

--
2.25.1


2021-01-28 00:09:24

by Pasha Tatashin

Subject: [PATCH v11 6/6] arm64: kexec: remove head from relocation argument

Now that relocation is done using virtual addresses, reloc_arg->head
is no longer needed.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm64/include/asm/kexec.h | 2 --
arch/arm64/kernel/asm-offsets.c | 1 -
arch/arm64/kernel/machine_kexec.c | 1 -
3 files changed, 4 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 049cde429b1b..2fa4109bd582 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -97,7 +97,6 @@ extern const char arm64_kexec_el2_vectors[];

/*
* kern_reloc_arg is passed to kernel relocation function as an argument.
- * head kimage->head, allows to traverse through relocation segments.
* entry_addr kimage->start, where to jump from relocation function (new
* kernel, or purgatory entry address).
* kern_arg0 first argument to kernel is its dtb address. The other
@@ -113,7 +112,6 @@ extern const char arm64_kexec_el2_vectors[];
* copy_len Number of bytes that need to be copied
*/
struct kern_reloc_arg {
- phys_addr_t head;
phys_addr_t entry_addr;
phys_addr_t kern_arg0;
phys_addr_t kern_arg1;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 06278611451d..94f050ad6471 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -153,7 +153,6 @@ int main(void)
BLANK();
#endif
#ifdef CONFIG_KEXEC_CORE
- DEFINE(KEXEC_KRELOC_HEAD, offsetof(struct kern_reloc_arg, head));
DEFINE(KEXEC_KRELOC_ENTRY_ADDR, offsetof(struct kern_reloc_arg, entry_addr));
DEFINE(KEXEC_KRELOC_KERN_ARG0, offsetof(struct kern_reloc_arg, kern_arg0));
DEFINE(KEXEC_KRELOC_KERN_ARG1, offsetof(struct kern_reloc_arg, kern_arg1));
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 9588c91f67c6..07da8d623d8e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -166,7 +166,6 @@ int machine_kexec_post_load(struct kimage *kimage)
memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
kimage->arch.kern_reloc = __pa(reloc_code) + func_offset;
kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
- kern_reloc_arg->head = kimage->head;
kern_reloc_arg->entry_addr = kimage->start;
kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;

--
2.25.1

2021-01-28 00:09:39

by Pasha Tatashin

Subject: [PATCH v11 4/6] arm64: kexec: configure trans_pgd page table for kexec

Configure a page table located in kexec-safe memory that has
the following mappings:

1. identity mapping of the relocation function's text, with executable
permission.
2. VA mappings for all source ranges.
3. VA mappings for all destination ranges.
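
With these mappings in place, the relocation routine reduces to a single
linear copy; roughly (a sketch using the field names introduced in this
patch):

	/* TTBR0: idmap of the relocation code (VA == PA, executable)   */
	/* TTBR1: [src_addr, src_addr + copy_len) -> source pages       */
	/*        [dst_addr, dst_addr + copy_len) -> destination pages  */
	memcpy((void *)dst_addr, (void *)src_addr, copy_len);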

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm64/Kconfig | 2 +-
arch/arm64/include/asm/kexec.h | 12 +++++
arch/arm64/kernel/asm-offsets.c | 6 +++
arch/arm64/kernel/machine_kexec.c | 89 ++++++++++++++++++++++++++++++-
4 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fc0ed9d6e011..440abd0c0ee1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1134,7 +1134,7 @@ config CRASH_DUMP

config TRANS_TABLE
def_bool y
- depends on HIBERNATION
+ depends on HIBERNATION || KEXEC_CORE

config XEN_DOM0
def_bool y
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index b96d8a6aac80..049cde429b1b 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -105,6 +105,12 @@ extern const char arm64_kexec_el2_vectors[];
* el2_vector If present means that relocation routine will go to EL1
* from EL2 to do the copy, and then back to EL2 to do the jump
* to new world.
+ * trans_ttbr0 idmap for relocation function and its argument
+ * trans_ttbr1 map for source/destination addresses.
+ * trans_t0sz t0sz for idmap page in trans_ttbr0
+ * src_addr start address for source pages.
+ * dst_addr start address for destination pages.
+ * copy_len Number of bytes that need to be copied
*/
struct kern_reloc_arg {
phys_addr_t head;
@@ -114,6 +120,12 @@ struct kern_reloc_arg {
phys_addr_t kern_arg2;
phys_addr_t kern_arg3;
phys_addr_t el2_vector;
+ phys_addr_t trans_ttbr0;
+ phys_addr_t trans_ttbr1;
+ unsigned long trans_t0sz;
+ unsigned long src_addr;
+ unsigned long dst_addr;
+ unsigned long copy_len;
};

#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 8a9475be1b62..06278611451d 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -160,6 +160,12 @@ int main(void)
DEFINE(KEXEC_KRELOC_KERN_ARG2, offsetof(struct kern_reloc_arg, kern_arg2));
DEFINE(KEXEC_KRELOC_KERN_ARG3, offsetof(struct kern_reloc_arg, kern_arg3));
DEFINE(KEXEC_KRELOC_EL2_VECTOR, offsetof(struct kern_reloc_arg, el2_vector));
+ DEFINE(KEXEC_KRELOC_TRANS_TTBR0, offsetof(struct kern_reloc_arg, trans_ttbr0));
+ DEFINE(KEXEC_KRELOC_TRANS_TTBR1, offsetof(struct kern_reloc_arg, trans_ttbr1));
+ DEFINE(KEXEC_KRELOC_TRANS_T0SZ, offsetof(struct kern_reloc_arg, trans_t0sz));
+ DEFINE(KEXEC_KRELOC_SRC_ADDR, offsetof(struct kern_reloc_arg, src_addr));
+ DEFINE(KEXEC_KRELOC_DST_ADDR, offsetof(struct kern_reloc_arg, dst_addr));
+ DEFINE(KEXEC_KRELOC_COPY_LEN, offsetof(struct kern_reloc_arg, copy_len));
#endif
return 0;
}
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 41d1e3ca13f8..9588c91f67c6 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -21,6 +21,7 @@
#include <asm/mmu_context.h>
#include <asm/page.h>
#include <asm/sections.h>
+#include <asm/trans_pgd.h>

#include "cpu-reset.h"

@@ -71,11 +72,89 @@ static void *kexec_page_alloc(void *arg)
return page_address(page);
}

+/*
+ * Map source segments starting from src_va, and map destination
+ * segments starting from dst_va, and return size of copy in
+ * *copy_len argument.
+ * Relocation function essentially needs to do:
+ * memcpy(dst_va, src_va, copy_len);
+ */
+static int map_segments(struct kimage *kimage, pgd_t *pgdp,
+ struct trans_pgd_info *info,
+ unsigned long src_va,
+ unsigned long dst_va,
+ unsigned long *copy_len)
+{
+ unsigned long *ptr = 0;
+ unsigned long dest = 0;
+ unsigned long len = 0;
+ unsigned long entry, addr;
+ int rc;
+
+ for (entry = kimage->head; !(entry & IND_DONE); entry = *ptr++) {
+ addr = entry & PAGE_MASK;
+
+ switch (entry & IND_FLAGS) {
+ case IND_DESTINATION:
+ dest = addr;
+ break;
+ case IND_INDIRECTION:
+ ptr = __va(addr);
+ break;
+ case IND_SOURCE:
+ rc = trans_pgd_map_page(info, pgdp, __va(addr),
+ src_va, PAGE_KERNEL);
+ if (rc)
+ return rc;
+ rc = trans_pgd_map_page(info, pgdp, __va(dest),
+ dst_va, PAGE_KERNEL);
+ if (rc)
+ return rc;
+ dest += PAGE_SIZE;
+ src_va += PAGE_SIZE;
+ dst_va += PAGE_SIZE;
+ len += PAGE_SIZE;
+ }
+ }
+ *copy_len = len;
+
+ return 0;
+}
+
+static int mmu_relocate_setup(struct kimage *kimage, void *reloc_code,
+ struct kern_reloc_arg *kern_reloc_arg)
+{
+ struct trans_pgd_info info = {
+ .trans_alloc_page = kexec_page_alloc,
+ .trans_alloc_arg = kimage,
+ };
+ pgd_t *trans_pgd = kexec_page_alloc(kimage);
+ int rc;
+
+ if (!trans_pgd)
+ return -ENOMEM;
+
+ /* idmap relocation function */
+ rc = trans_pgd_idmap_page(&info, &kern_reloc_arg->trans_ttbr0,
+ &kern_reloc_arg->trans_t0sz, reloc_code);
+ if (rc)
+ return rc;
+
+ kern_reloc_arg->src_addr = _PAGE_OFFSET(VA_BITS_MIN);
+ kern_reloc_arg->dst_addr = _PAGE_OFFSET(VA_BITS_MIN - 1);
+ kern_reloc_arg->trans_ttbr1 = phys_to_ttbr(__pa(trans_pgd));
+
+ rc = map_segments(kimage, trans_pgd, &info, kern_reloc_arg->src_addr,
+ kern_reloc_arg->dst_addr, &kern_reloc_arg->copy_len);
+ return rc;
+}
+
int machine_kexec_post_load(struct kimage *kimage)
{
void *reloc_code = page_to_virt(kimage->control_code_page);
struct kern_reloc_arg *kern_reloc_arg = kexec_page_alloc(kimage);
long func_offset, vector_offset, reloc_size;
+ int rc = 0;

if (!kern_reloc_arg)
return -ENOMEM;
@@ -95,6 +174,14 @@ int machine_kexec_post_load(struct kimage *kimage)
if (is_hyp_mode_available() && !is_kernel_in_hyp_mode())
kern_reloc_arg->el2_vector = __pa(reloc_code) + vector_offset;

+ /*
+ * If relocation is not needed, we do not need to enable MMU in
+ * relocation routine, therefore do not create page tables for
+ * scenarios such as crash kernel
+ */
+ if (!(kimage->head & IND_DONE))
+ rc = mmu_relocate_setup(kimage, reloc_code, kern_reloc_arg);
+
kexec_image_info(kimage);

/* Flush the reloc_code in preparation for its execution. */
@@ -103,7 +190,7 @@ int machine_kexec_post_load(struct kimage *kimage)
reloc_size);
__flush_dcache_area(kern_reloc_arg, sizeof(struct kern_reloc_arg));

- return 0;
+ return rc;
}

/**
--
2.25.1

2021-01-28 00:09:52

by Pasha Tatashin

Subject: [PATCH v11 2/6] arm64: kexec: use ld script for relocation function

Currently, the relocation code declares start and end variables
which are used to compute its size.

A better way to do this is to use the linker script instead, and put the
relocation function in its own section.

Soon, the relocation function will share the same page with the EL2 vectors,
so proper marking is needed.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm64/include/asm/kexec.h | 4 ++++
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/kernel/machine_kexec.c | 17 ++++++++---------
arch/arm64/kernel/relocate_kernel.S | 15 ++-------------
arch/arm64/kernel/vmlinux.lds.S | 19 +++++++++++++++++++
5 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 990185744148..7f4f9abdf049 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,6 +90,10 @@ static inline void crash_prepare_suspend(void) {}
static inline void crash_post_resume(void) {}
#endif

+#if defined(CONFIG_KEXEC_CORE)
+extern const char arm64_relocate_new_kernel[];
+#endif
+
/*
* kern_reloc_arg is passed to kernel relocation function as an argument.
* head kimage->head, allows to traverse through relocation segments.
diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
index 8ff579361731..ae873eb22205 100644
--- a/arch/arm64/include/asm/sections.h
+++ b/arch/arm64/include/asm/sections.h
@@ -19,5 +19,6 @@ extern char __exittext_begin[], __exittext_end[];
extern char __irqentry_text_start[], __irqentry_text_end[];
extern char __mmuoff_data_start[], __mmuoff_data_end[];
extern char __entry_tramp_text_start[], __entry_tramp_text_end[];
+extern char __relocate_new_kernel_start[], __relocate_new_kernel_end[];

#endif /* __ASM_SECTIONS_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 679db3f1e0c5..361a4d082093 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -20,13 +20,10 @@
#include <asm/mmu.h>
#include <asm/mmu_context.h>
#include <asm/page.h>
+#include <asm/sections.h>

#include "cpu-reset.h"

-/* Global variables for the arm64_relocate_new_kernel routine. */
-extern const unsigned char arm64_relocate_new_kernel[];
-extern const unsigned long arm64_relocate_new_kernel_size;
-
/**
* kexec_image_info - For debugging output.
*/
@@ -78,13 +75,15 @@ int machine_kexec_post_load(struct kimage *kimage)
{
void *reloc_code = page_to_virt(kimage->control_code_page);
struct kern_reloc_arg *kern_reloc_arg = kexec_page_alloc(kimage);
+ long func_offset, reloc_size;

if (!kern_reloc_arg)
return -ENOMEM;

- memcpy(reloc_code, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
- kimage->arch.kern_reloc = __pa(reloc_code);
+ func_offset = arm64_relocate_new_kernel - __relocate_new_kernel_start;
+ reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start;
+ memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
+ kimage->arch.kern_reloc = __pa(reloc_code) + func_offset;
kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
kern_reloc_arg->head = kimage->head;
kern_reloc_arg->entry_addr = kimage->start;
@@ -92,9 +91,9 @@ int machine_kexec_post_load(struct kimage *kimage)
kexec_image_info(kimage);

/* Flush the reloc_code in preparation for its execution. */
- __flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
+ __flush_dcache_area(reloc_code, reloc_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
- arm64_relocate_new_kernel_size);
+ reloc_size);
__flush_dcache_area(kern_reloc_arg, sizeof(struct kern_reloc_arg));

return 0;
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index c92228aeddca..d2a4a0b0d76b 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -14,6 +14,7 @@
#include <asm/page.h>
#include <asm/sysreg.h>

+.pushsection ".kexec_relocate.text", "ax"
/*
* arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
*
@@ -75,16 +76,4 @@ SYM_CODE_START(arm64_relocate_new_kernel)
ldr x0, [x0, #KEXEC_KRELOC_KERN_ARG0] /* x0 = dtb address */
br x4
SYM_CODE_END(arm64_relocate_new_kernel)
-
-.align 3 /* To keep the 64-bit values below naturally aligned. */
-
-.Lcopy_end:
-.org KEXEC_CONTROL_PAGE_SIZE
-
-/*
- * arm64_relocate_new_kernel_size - Number of bytes to copy to the
- * control_code_page.
- */
-.globl arm64_relocate_new_kernel_size
-arm64_relocate_new_kernel_size:
- .quad .Lcopy_end - arm64_relocate_new_kernel
+.popsection
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 4c0b0c89ad59..33b0d3c9fd3b 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -12,6 +12,7 @@
#include <asm/cache.h>
#include <asm/hyp_image.h>
#include <asm/kernel-pgtable.h>
+#include <asm/kexec.h>
#include <asm/memory.h>
#include <asm/page.h>

@@ -82,6 +83,16 @@ jiffies = jiffies_64;
#define HIBERNATE_TEXT
#endif

+#ifdef CONFIG_KEXEC_CORE
+#define KEXEC_TEXT \
+ . = ALIGN(SZ_4K); \
+ __relocate_new_kernel_start = .; \
+ *(.kexec_relocate.text) \
+ __relocate_new_kernel_end = .;
+#else
+#define KEXEC_TEXT
+#endif
+
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
#define TRAMP_TEXT \
. = ALIGN(PAGE_SIZE); \
@@ -142,6 +153,7 @@ SECTIONS
HYPERVISOR_TEXT
IDMAP_TEXT
HIBERNATE_TEXT
+ KEXEC_TEXT
TRAMP_TEXT
*(.fixup)
*(.gnu.warning)
@@ -316,3 +328,10 @@ ASSERT((__entry_tramp_text_end - __entry_tramp_text_start) == PAGE_SIZE,
* If padding is applied before .head.text, virt<->phys conversions will fail.
*/
ASSERT(_text == KIMAGE_VADDR, "HEAD is misaligned")
+
+#ifdef CONFIG_KEXEC_CORE
+/* kexec relocation code should fit into one KEXEC_CONTROL_PAGE_SIZE */
+ASSERT(__relocate_new_kernel_end - (__relocate_new_kernel_start & ~(SZ_4K - 1))
+ <= SZ_4K, "kexec relocation code is too big or misaligned")
+ASSERT(KEXEC_CONTROL_PAGE_SIZE >= SZ_4K, "KEXEC_CONTROL_PAGE_SIZE is broken")
+#endif
--
2.25.1

2021-01-28 00:10:45

by Pasha Tatashin

Subject: [PATCH v11 5/6] arm64: kexec: enable MMU during kexec relocation

Now that we have transitional page tables configured, temporarily enable the
MMU to allow faster relocation of segments to their final destination.

Performance data: for a moderately sized kernel + initramfs (25M), the
relocation was taking 0.382s; with the MMU enabled it now takes only 0.019s,
a ~20x improvement.

The time is proportional to the size of the relocation, so with a larger
initramfs (e.g. 100M) it could take over a second.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm64/kernel/relocate_kernel.S | 131 ++++++++++++++++++----------
1 file changed, 87 insertions(+), 44 deletions(-)

diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index c6178b1a4e60..9c60981a6911 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -4,6 +4,8 @@
*
* Copyright (C) Linaro.
* Copyright (C) Huawei Futurewei Technologies.
+ * Copyright (C) 2020, Microsoft Corporation.
+ * Pavel Tatashin <[email protected]>
*/

#include <linux/kexec.h>
@@ -14,6 +16,54 @@
#include <asm/page.h>
#include <asm/sysreg.h>

+.macro tlb_invalidate
+ dsb sy
+ dsb ish
+ tlbi vmalle1
+ dsb ish
+ isb
+.endm
+
+.macro turn_off_mmu tmp1, tmp2
+ mrs \tmp1, sctlr_el1
+ mov_q \tmp2, SCTLR_ELx_FLAGS
+ bic \tmp1, \tmp1, \tmp2
+ pre_disable_mmu_workaround
+ msr sctlr_el1, \tmp1
+ isb
+.endm
+
+.macro turn_on_mmu tmp1, tmp2
+ mrs \tmp1, sctlr_el1
+ mov_q \tmp2, SCTLR_ELx_FLAGS
+ orr \tmp1, \tmp1, \tmp2
+ msr sctlr_el1, \tmp1
+ ic iallu
+ dsb nsh
+ isb
+.endm
+
+/*
+ * Set ttbr0 and ttbr1, called while MMU is disabled, so no need to temporarily
+ * set zero_page table. Invalidate TLB after new tables are set.
+ */
+.macro set_ttbr arg, tmp1, tmp2
+ ldr \tmp1, [\arg, #KEXEC_KRELOC_TRANS_TTBR0]
+ msr ttbr0_el1, \tmp1
+ ldr \tmp1, [\arg, #KEXEC_KRELOC_TRANS_TTBR1]
+ offset_ttbr1 \tmp1, \tmp2
+ msr ttbr1_el1, \tmp1
+ isb
+.endm
+
+/* Set T0SZ to match the requirements of idmap page */
+.macro set_tcr_t0sz arg, tmp1, tmp2
+ ldr \tmp2, [\arg, #KEXEC_KRELOC_TRANS_T0SZ]
+ mrs \tmp1, tcr_el1
+ bfi \tmp1, \tmp2, TCR_T0SZ_OFFSET, TCR_TxSZ_WIDTH
+ msr tcr_el1, \tmp1
+.endm
+
.macro el1_sync_64
.align 7
br x4 /* Jump to new world from el2 */
@@ -36,56 +86,49 @@
* symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end. The
* machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
* safe memory that has been set up to be preserved during the copy operation.
+ *
+ * This function temporarily enables MMU if kernel relocation is needed.
+ * Also, if we enter this function at EL2 on non-VHE kernel, we temporarily go
+ * to EL1 to enable MMU, and escalate back to EL2 at the end to do the jump to
+ * the new kernel. This is determined by presence of el2_vector.
*/
SYM_CODE_START(arm64_relocate_new_kernel)
- /* Check if the new image needs relocation. */
- ldr x16, [x0, #KEXEC_KRELOC_HEAD] /* x16 = kimage_head */
- tbnz x16, IND_DONE_BIT, .Ldone
- raw_dcache_line_size x15, x1 /* x15 = dcache line size */
-.Lloop:
- and x12, x16, PAGE_MASK /* x12 = addr */
-
- /* Test the entry flags. */
-.Ltest_source:
- tbz x16, IND_SOURCE_BIT, .Ltest_indirection
-
- /* Invalidate dest page to PoC. */
- mov x2, x13
- add x20, x2, #PAGE_SIZE
- sub x1, x15, #1
- bic x2, x2, x1
-2: dc ivac, x2
- add x2, x2, x15
- cmp x2, x20
- b.lo 2b
- dsb sy
-
- copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
- b .Lnext
-.Ltest_indirection:
- tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
- mov x14, x12 /* ptr = addr */
- b .Lnext
-.Ltest_destination:
- tbz x16, IND_DESTINATION_BIT, .Lnext
- mov x13, x12 /* dest = addr */
-.Lnext:
- ldr x16, [x14], #8 /* entry = *ptr++ */
- tbz x16, IND_DONE_BIT, .Lloop /* while (!(entry & DONE)) */
-.Ldone:
- /* wait for writes from copy_page to finish */
- dsb nsh
- ic iallu
- dsb nsh
- isb
-
- /* Start new image. */
- ldr x4, [x0, #KEXEC_KRELOC_ENTRY_ADDR] /* x4 = kimage_start */
+ mov x20, xzr /* x20 will hold vector value */
+ ldr x11, [x0, #KEXEC_KRELOC_COPY_LEN]
+ cbz x11, 5f /* Check if need to relocate */
+ ldr x20, [x0, #KEXEC_KRELOC_EL2_VECTOR]
+ cbz x20, 2f /* need to reduce to EL1? */
+ msr vbar_el2, x20 /* el2_vector present, means */
+ adr x1, 2f /* we will do copy in el1 but */
+ msr elr_el2, x1 /* do final jump from el2 */
+ eret /* Reduce to EL1 */
+2: set_tcr_t0sz x0, x1, x2 /* Set t0sz for idmaped page */
+ set_ttbr x0, x1, x2 /* Set our page tables */
+ tlb_invalidate
+ ldr x1, [x0, #KEXEC_KRELOC_DST_ADDR]; /* arg is not idmapped so */
+ ldr x2, [x0, #KEXEC_KRELOC_SRC_ADDR]; /* read before MMU is on */
+ turn_on_mmu x3, x4 /* Turn MMU back on */
+ mov x12, x1 /* x12 dst backup */
+3: copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
+ sub x11, x11, #PAGE_SIZE
+ cbnz x11, 3b /* page copy loop */
+ raw_dcache_line_size x2, x3 /* x2 = dcache line size */
+ sub x3, x2, #1 /* x3 = dcache_size - 1 */
+ bic x12, x12, x3
+4: dc cvau, x12 /* Flush D-cache */
+ add x12, x12, x2
+ cmp x12, x1 /* Compare to dst + len */
+ b.ne 4b /* D-cache flush loop */
+ turn_off_mmu x1, x2 /* Turn off MMU */
+ tlb_invalidate /* Invalidate TLB */
+5: ldr x4, [x0, #KEXEC_KRELOC_ENTRY_ADDR] /* x4 = kimage_start */
ldr x3, [x0, #KEXEC_KRELOC_KERN_ARG3]
ldr x2, [x0, #KEXEC_KRELOC_KERN_ARG2]
ldr x1, [x0, #KEXEC_KRELOC_KERN_ARG1]
ldr x0, [x0, #KEXEC_KRELOC_KERN_ARG0] /* x0 = dtb address */
- br x4
+ cbnz x20, 6f /* need to escalate to el2? */
+ br x4 /* Jump to new world */
+6: hvc #0 /* enters kexec_el1_sync */
SYM_CODE_END(arm64_relocate_new_kernel)

/* el2 vectors - switch el2 here while we restore the memory image. */
--
2.25.1

2021-01-28 00:10:51

by Pasha Tatashin

Subject: [PATCH v11 1/6] arm64: kexec: add expandable argument to relocation function

Currently, the kexec relocation function (arm64_relocate_new_kernel) accepts
the following arguments:

head: start of array that contains relocation information.
entry: entry point for new kernel or purgatory.
dtb_mem: first and only argument to entry.

The number of arguments cannot be easily expanded, because this
function is also called from HVC_SOFT_RESTART, which preserves only
three arguments (hypervisor ABI). Also, arm64_relocate_new_kernel is
written in assembly and called without a stack, so there is no place to
move extra arguments into free registers.

Soon, we will need to pass more arguments: once we enable the MMU we
will need to pass information about the page tables.

Add a new struct, kern_reloc_arg, and place it in a kexec-safe page (i.e.
memory that is not overwritten during relocation).
Thus, arm64_relocate_new_kernel takes only one argument, which contains
all the needed information.

Note:
Another benefit of allowing this function to accept more arguments is that
the kernel can actually accept up to 4 arguments (x0-x3). Currently only
one is used, but if in the future we need more (for example, to pass
information about when the previous kernel exited, to get a precise
measurement of the time spent in purgatory), we would not easily be able
to do that if arm64_relocate_new_kernel could not accept more arguments.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm64/include/asm/kexec.h | 18 ++++++++++++++++++
arch/arm64/kernel/asm-offsets.c | 9 +++++++++
arch/arm64/kernel/cpu-reset.S | 11 +++--------
arch/arm64/kernel/cpu-reset.h | 8 +++-----
arch/arm64/kernel/machine_kexec.c | 27 +++++++++++++++++++++++++--
arch/arm64/kernel/relocate_kernel.S | 21 ++++++++-------------
6 files changed, 66 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 9befcd87e9a8..990185744148 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -90,12 +90,30 @@ static inline void crash_prepare_suspend(void) {}
static inline void crash_post_resume(void) {}
#endif

+/*
+ * kern_reloc_arg is passed to kernel relocation function as an argument.
+ * head kimage->head, allows to traverse through relocation segments.
+ * entry_addr kimage->start, where to jump from relocation function (new
+ * kernel, or purgatory entry address).
+ * kern_arg0 first argument to kernel is its dtb address. The other
+ * arguments are currently unused, and must be set to 0
+ */
+struct kern_reloc_arg {
+ phys_addr_t head;
+ phys_addr_t entry_addr;
+ phys_addr_t kern_arg0;
+ phys_addr_t kern_arg1;
+ phys_addr_t kern_arg2;
+ phys_addr_t kern_arg3;
+};
+
#define ARCH_HAS_KIMAGE_ARCH

struct kimage_arch {
void *dtb;
phys_addr_t dtb_mem;
phys_addr_t kern_reloc;
+ phys_addr_t kern_reloc_arg;
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_mem;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 301784463587..6067a288f568 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -23,6 +23,7 @@
#include <asm/suspend.h>
#include <linux/kbuild.h>
#include <linux/arm-smccc.h>
+#include <linux/kexec.h>

int main(void)
{
@@ -150,6 +151,14 @@ int main(void)
DEFINE(PTRAUTH_USER_KEY_APGA, offsetof(struct ptrauth_keys_user, apga));
DEFINE(PTRAUTH_KERNEL_KEY_APIA, offsetof(struct ptrauth_keys_kernel, apia));
BLANK();
+#endif
+#ifdef CONFIG_KEXEC_CORE
+ DEFINE(KEXEC_KRELOC_HEAD, offsetof(struct kern_reloc_arg, head));
+ DEFINE(KEXEC_KRELOC_ENTRY_ADDR, offsetof(struct kern_reloc_arg, entry_addr));
+ DEFINE(KEXEC_KRELOC_KERN_ARG0, offsetof(struct kern_reloc_arg, kern_arg0));
+ DEFINE(KEXEC_KRELOC_KERN_ARG1, offsetof(struct kern_reloc_arg, kern_arg1));
+ DEFINE(KEXEC_KRELOC_KERN_ARG2, offsetof(struct kern_reloc_arg, kern_arg2));
+ DEFINE(KEXEC_KRELOC_KERN_ARG3, offsetof(struct kern_reloc_arg, kern_arg3));
#endif
return 0;
}
diff --git a/arch/arm64/kernel/cpu-reset.S b/arch/arm64/kernel/cpu-reset.S
index 37721eb6f9a1..bbf70db43744 100644
--- a/arch/arm64/kernel/cpu-reset.S
+++ b/arch/arm64/kernel/cpu-reset.S
@@ -16,14 +16,11 @@
.pushsection .idmap.text, "awx"

/*
- * __cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2) - Helper for
- * cpu_soft_restart.
+ * __cpu_soft_restart(el2_switch, entry, arg) - Helper for cpu_soft_restart.
*
* @el2_switch: Flag to indicate a switch to EL2 is needed.
* @entry: Location to jump to for soft reset.
- * arg0: First argument passed to @entry. (relocation list)
- * arg1: Second argument passed to @entry.(physical kernel entry)
- * arg2: Third argument passed to @entry. (physical dtb address)
+ * arg: Entry argument
*
* Put the CPU into the same state as it would be if it had been reset, and
* branch to what would be the reset vector. It must be executed with the
@@ -47,9 +44,7 @@ SYM_CODE_START(__cpu_soft_restart)
hvc #0 // no return

1: mov x8, x1 // entry
- mov x0, x2 // arg0
- mov x1, x3 // arg1
- mov x2, x4 // arg2
+ mov x0, x2 // arg
br x8
SYM_CODE_END(__cpu_soft_restart)

diff --git a/arch/arm64/kernel/cpu-reset.h b/arch/arm64/kernel/cpu-reset.h
index ed50e9587ad8..7a8720ff186f 100644
--- a/arch/arm64/kernel/cpu-reset.h
+++ b/arch/arm64/kernel/cpu-reset.h
@@ -11,12 +11,10 @@
#include <asm/virt.h>

void __cpu_soft_restart(unsigned long el2_switch, unsigned long entry,
- unsigned long arg0, unsigned long arg1, unsigned long arg2);
+ unsigned long arg);

static inline void __noreturn cpu_soft_restart(unsigned long entry,
- unsigned long arg0,
- unsigned long arg1,
- unsigned long arg2)
+ unsigned long arg)
{
typeof(__cpu_soft_restart) *restart;

@@ -25,7 +23,7 @@ static inline void __noreturn cpu_soft_restart(unsigned long entry,
restart = (void *)__pa_symbol(__cpu_soft_restart);

cpu_install_idmap();
- restart(el2_switch, entry, arg0, arg1, arg2);
+ restart(el2_switch, entry, arg);
unreachable();
}

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 90a335c74442..679db3f1e0c5 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -43,6 +43,7 @@ static void _kexec_image_info(const char *func, int line,
pr_debug(" head: %lx\n", kimage->head);
pr_debug(" nr_segments: %lu\n", kimage->nr_segments);
pr_debug(" kern_reloc: %pa\n", &kimage->arch.kern_reloc);
+ pr_debug(" kern_reloc_arg: %pa\n", &kimage->arch.kern_reloc_arg);

for (i = 0; i < kimage->nr_segments; i++) {
pr_debug(" segment[%lu]: %016lx - %016lx, 0x%lx bytes, %lu pages\n",
@@ -59,19 +60,42 @@ void machine_kexec_cleanup(struct kimage *kimage)
/* Empty routine needed to avoid build errors. */
}

+/* Allocates pages for kexec page table */
+static void *kexec_page_alloc(void *arg)
+{
+ struct kimage *kimage = (struct kimage *)arg;
+ struct page *page = kimage_alloc_control_pages(kimage, 0);
+
+ if (!page)
+ return NULL;
+
+ memset(page_address(page), 0, PAGE_SIZE);
+
+ return page_address(page);
+}
+
int machine_kexec_post_load(struct kimage *kimage)
{
void *reloc_code = page_to_virt(kimage->control_code_page);
+ struct kern_reloc_arg *kern_reloc_arg = kexec_page_alloc(kimage);
+
+ if (!kern_reloc_arg)
+ return -ENOMEM;

memcpy(reloc_code, arm64_relocate_new_kernel,
arm64_relocate_new_kernel_size);
kimage->arch.kern_reloc = __pa(reloc_code);
+ kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
+ kern_reloc_arg->head = kimage->head;
+ kern_reloc_arg->entry_addr = kimage->start;
+ kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
kexec_image_info(kimage);

/* Flush the reloc_code in preparation for its execution. */
__flush_dcache_area(reloc_code, arm64_relocate_new_kernel_size);
flush_icache_range((uintptr_t)reloc_code, (uintptr_t)reloc_code +
arm64_relocate_new_kernel_size);
+ __flush_dcache_area(kern_reloc_arg, sizeof(struct kern_reloc_arg));

return 0;
}
@@ -192,8 +216,7 @@ void machine_kexec(struct kimage *kimage)
* userspace (kexec-tools).
* In kexec_file case, the kernel starts directly without purgatory.
*/
- cpu_soft_restart(kimage->arch.kern_reloc, kimage->head, kimage->start,
- kimage->arch.dtb_mem);
+ cpu_soft_restart(kimage->arch.kern_reloc, kimage->arch.kern_reloc_arg);

BUG(); /* Should never get here. */
}
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index b78ea5de97a4..c92228aeddca 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -8,7 +8,7 @@

#include <linux/kexec.h>
#include <linux/linkage.h>
-
+#include <asm/asm-offsets.h>
#include <asm/assembler.h>
#include <asm/kexec.h>
#include <asm/page.h>
@@ -26,13 +26,8 @@
* safe memory that has been set up to be preserved during the copy operation.
*/
SYM_CODE_START(arm64_relocate_new_kernel)
- /* Setup the list loop variables. */
- mov x18, x2 /* x18 = dtb address */
- mov x17, x1 /* x17 = kimage_start */
- mov x16, x0 /* x16 = kimage_head */
- mov x14, xzr /* x14 = entry ptr */
- mov x13, xzr /* x13 = copy dest */
/* Check if the new image needs relocation. */
+ ldr x16, [x0, #KEXEC_KRELOC_HEAD] /* x16 = kimage_head */
tbnz x16, IND_DONE_BIT, .Ldone
raw_dcache_line_size x15, x1 /* x15 = dcache line size */
.Lloop:
@@ -73,12 +68,12 @@ SYM_CODE_START(arm64_relocate_new_kernel)
isb

/* Start new image. */
- mov x0, x18
- mov x1, xzr
- mov x2, xzr
- mov x3, xzr
- br x17
-
+ ldr x4, [x0, #KEXEC_KRELOC_ENTRY_ADDR] /* x4 = kimage_start */
+ ldr x3, [x0, #KEXEC_KRELOC_KERN_ARG3]
+ ldr x2, [x0, #KEXEC_KRELOC_KERN_ARG2]
+ ldr x1, [x0, #KEXEC_KRELOC_KERN_ARG1]
+ ldr x0, [x0, #KEXEC_KRELOC_KERN_ARG0] /* x0 = dtb address */
+ br x4
SYM_CODE_END(arm64_relocate_new_kernel)

.align 3 /* To keep the 64-bit values below naturally aligned. */
--
2.25.1

2021-01-28 00:11:48

by Pasha Tatashin

Subject: [PATCH v11 3/6] arm64: kexec: kexec may require EL2 vectors

If we have EL2 without VHE, the EL2 vectors are needed in order
to switch to EL2 and jump to the new world with hypervisor privileges.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm64/include/asm/kexec.h | 5 +++++
arch/arm64/kernel/asm-offsets.c | 1 +
arch/arm64/kernel/machine_kexec.c | 9 +++++++-
arch/arm64/kernel/relocate_kernel.S | 35 +++++++++++++++++++++++++++++
4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 7f4f9abdf049..b96d8a6aac80 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -92,6 +92,7 @@ static inline void crash_post_resume(void) {}

#if defined(CONFIG_KEXEC_CORE)
extern const char arm64_relocate_new_kernel[];
+extern const char arm64_kexec_el2_vectors[];
#endif

/*
@@ -101,6 +102,9 @@ extern const char arm64_relocate_new_kernel[];
* kernel, or purgatory entry address).
* kern_arg0 first argument to kernel is its dtb address. The other
* arguments are currently unused, and must be set to 0
+ * el2_vector If present means that relocation routine will go to EL1
+ * from EL2 to do the copy, and then back to EL2 to do the jump
+ * to new world.
*/
struct kern_reloc_arg {
phys_addr_t head;
@@ -109,6 +113,7 @@ struct kern_reloc_arg {
phys_addr_t kern_arg1;
phys_addr_t kern_arg2;
phys_addr_t kern_arg3;
+ phys_addr_t el2_vector;
};

#define ARCH_HAS_KIMAGE_ARCH
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 6067a288f568..8a9475be1b62 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -159,6 +159,7 @@ int main(void)
DEFINE(KEXEC_KRELOC_KERN_ARG1, offsetof(struct kern_reloc_arg, kern_arg1));
DEFINE(KEXEC_KRELOC_KERN_ARG2, offsetof(struct kern_reloc_arg, kern_arg2));
DEFINE(KEXEC_KRELOC_KERN_ARG3, offsetof(struct kern_reloc_arg, kern_arg3));
+ DEFINE(KEXEC_KRELOC_EL2_VECTOR, offsetof(struct kern_reloc_arg, el2_vector));
#endif
return 0;
}
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 361a4d082093..41d1e3ca13f8 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -75,19 +75,26 @@ int machine_kexec_post_load(struct kimage *kimage)
{
void *reloc_code = page_to_virt(kimage->control_code_page);
struct kern_reloc_arg *kern_reloc_arg = kexec_page_alloc(kimage);
- long func_offset, reloc_size;
+ long func_offset, vector_offset, reloc_size;

if (!kern_reloc_arg)
return -ENOMEM;

func_offset = arm64_relocate_new_kernel - __relocate_new_kernel_start;
reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start;
+ vector_offset = arm64_kexec_el2_vectors - __relocate_new_kernel_start;
+
memcpy(reloc_code, __relocate_new_kernel_start, reloc_size);
kimage->arch.kern_reloc = __pa(reloc_code) + func_offset;
kimage->arch.kern_reloc_arg = __pa(kern_reloc_arg);
kern_reloc_arg->head = kimage->head;
kern_reloc_arg->entry_addr = kimage->start;
kern_reloc_arg->kern_arg0 = kimage->arch.dtb_mem;
+
+ /* Setup vector table only when EL2 is available, but no VHE */
+ if (is_hyp_mode_available() && !is_kernel_in_hyp_mode())
+ kern_reloc_arg->el2_vector = __pa(reloc_code) + vector_offset;
+
kexec_image_info(kimage);

/* Flush the reloc_code in preparation for its execution. */
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index d2a4a0b0d76b..c6178b1a4e60 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -14,6 +14,17 @@
#include <asm/page.h>
#include <asm/sysreg.h>

+.macro el1_sync_64
+ .align 7
+ br x4 /* Jump to new world from el2 */
+.endm
+
+.macro invalid_vector label
+\label:
+ .align 7
+ b \label
+.endm
+
.pushsection ".kexec_relocate.text", "ax"
/*
* arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
@@ -76,4 +87,28 @@ SYM_CODE_START(arm64_relocate_new_kernel)
ldr x0, [x0, #KEXEC_KRELOC_KERN_ARG0] /* x0 = dtb address */
br x4
SYM_CODE_END(arm64_relocate_new_kernel)
+
+/* el2 vectors - switch el2 here while we restore the memory image. */
+ .align 11
+SYM_CODE_START(arm64_kexec_el2_vectors)
+ invalid_vector el2_sync_invalid_sp0 /* Synchronous EL2t */
+ invalid_vector el2_irq_invalid_sp0 /* IRQ EL2t */
+ invalid_vector el2_fiq_invalid_sp0 /* FIQ EL2t */
+ invalid_vector el2_error_invalid_sp0 /* Error EL2t */
+
+ invalid_vector el2_sync_invalid_spx /* Synchronous EL2h */
+ invalid_vector el2_irq_invalid_spx /* IRQ EL2h */
+ invalid_vector el2_fiq_invalid_spx /* FIQ EL2h */
+ invalid_vector el2_error_invalid_spx /* Error EL2h */
+
+ el1_sync_64 /* Synchronous 64-bit EL1 */
+ invalid_vector el1_irq_invalid_64 /* IRQ 64-bit EL1 */
+ invalid_vector el1_fiq_invalid_64 /* FIQ 64-bit EL1 */
+ invalid_vector el1_error_invalid_64 /* Error 64-bit EL1 */
+
+ invalid_vector el1_sync_invalid_32 /* Synchronous 32-bit EL1 */
+ invalid_vector el1_irq_invalid_32 /* IRQ 32-bit EL1 */
+ invalid_vector el1_fiq_invalid_32 /* FIQ 32-bit EL1 */
+ invalid_vector el1_error_invalid_32 /* Error 32-bit EL1 */
+SYM_CODE_END(arm64_kexec_el2_vectors)
.popsection
--
2.25.1

2021-02-01 18:41:51

by James Morse

Subject: Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

Hi Pavel,

On 27/01/2021 17:27, Pavel Tatashin wrote:
> Enable MMU during kexec relocation in order to improve reboot performance.
>
> If kexec functionality is used for a fast system update, with a minimal
> downtime, the relocation of kernel + initramfs takes a significant portion
> of reboot.
>
> The reason for slow relocation is because it is done without MMU, and thus
> not benefiting from D-Cache.
>
> Performance data
> ----------------
> For this experiment, the size of kernel plus initramfs is small, only 25M.
> If initramfs was larger, than the improvements would be greater, as time
> spent in relocation is proportional to the size of relocation.
>
> Previously:
> kernel shutdown 0.022131328s
> relocation 0.440510736s
> kernel startup 0.294706768s
>
> Relocation was taking: 58.2% of reboot time
>
> Now:
> kernel shutdown 0.032066576s
> relocation 0.022158152s
> kernel startup 0.296055880s
>
> Now: Relocation takes 6.3% of reboot time
>
> Total reboot is x2.16 times faster.
>
> With bigger userland (fitImage 380M), the reboot time is improved by 3.57s,
> and is reduced from 3.9s down to 0.33s

> Previous approaches and discussions
> -----------------------------------

The problem I see with this is rewriting the relocation code. It needs to work whether the
machine has enough memory to enable the MMU during kexec, or not.

In off-list mail to Pavel I proposed an alternative implementation here:
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0

By using a copy of the linear map, and passing the phys_to_virt offset into
arm64_relocate_new_kernel() it's possible to use the same code when we fail to allocate the
page tables, and run with the MMU off as it does today.
I'm convinced someone will crawl out of the woodwork screaming 'regression' if we
substantially increase the amount of memory needed to kexec at all.

From that discussion: this didn't meet Pavel's timing needs.
If you depend on having all the src/dst pages lined up in a single line, it sounds like
you've over-tuned this to depend on the CPU's streaming mode. What causes the CPU to
start/stop that stuff is very implementation specific (and firmware configurable).
I don't think we should let this rule out systems that can kexec today, but don't have
enough extra memory for the page tables.
Having two copies of the relocation code is obviously a bad idea.


(as before: ) Instead of trying to make the relocations run quickly, can we reduce them?
This would benefit other architectures too.

Can the kexec core code allocate higher order pages, instead of doing everything page at
a time?

If you have a crash kernel reservation, can we use that to eliminate the relocations
completely?
(I think this suggestion has been lost in translation each time I make it.
I mean like this:
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
Runes to test it:
| sudo ./kexec -p -u
| sudo cat /proc/iomem | grep Crash
| b0200000-f01fffff : Crash kernel
| sudo ./kexec --mem-min=0xb0200000 --mem-max=0xf01ffffff -l ~/Image --reuse-cmdline

I bet its even faster!)


I think 'as fast as possible' and 'memory constrained' are mutually exclusive
requirements. We need to make the page tables optional with a single implementation.


Thanks,

James

2021-02-01 20:08:02

by Pasha Tatashin

Subject: Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

Hi James,

> The problem I see with this is rewriting the relocation code. It needs to work whether the
> machine has enough memory to enable the MMU during kexec, or not.
>
> In off-list mail to Pavel I proposed an alternative implementation here:
> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0
>
> By using a copy of the linear map, and passing the phys_to_virt offset into
> arm64_relocate_new_kernel() its possible to use the same code when we fail to allocate the
> page tables, and run with the MMU off as it does today.
> I'm convinced someone will crawl out of the woodwork screaming 'regression' if we
> substantially increase the amount of memory needed to kexec at all.
>
> From that discussion: this didn't meet Pavel's timing needs.
> If you depend on having all the src/dst pages lined up in a single line, it sounds like
> you've over-tuned this to depend on the CPU's streaming mode. What causes the CPU to
> start/stop that stuff is very implementation specific (and firmware configurable).
> I don't think we should let this rule out systems that can kexec today, but don't have
> enough extra memory for the page tables.
> Having two copies of the relocation code is obviously a bad idea.

I understand that having an extra set of page tables could potentially
waste memory, especially if VAs are sparse, but in this case we use
page tables exclusively for contiguous VA space (copy [src, src +
size]). Therefore, the extra memory usage is tiny. The ratio for
kernels with 4K page_size is (size of relocated memory) / 512. A
normal initrd + kernel is usually under 64M, an extra space which
means ~128K for the page table. Even with a huge relocation, where
initrd is ~512M the extra memory usage in the worst case is just ~1M.
I really doubt we will have any problem from users because of such
small overhead in comparison to the total kexec-load size.
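
As a rough back-of-the-envelope sketch of that ratio (plain userspace C,
just to illustrate the arithmetic; it assumes 4K pages with 8-byte
descriptors and ignores the handful of upper-level table pages):

#include <stdio.h>

/*
 * One 8-byte descriptor maps one 4K page, so the leaf tables cost about
 * (size of relocated memory) / 512 bytes.
 */
int main(void)
{
	unsigned long sizes_mb[] = { 64, 512 };	/* typical and huge cases */
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long bytes = sizes_mb[i] << 20;
		unsigned long table_bytes = bytes / 512;

		printf("%4luM relocated -> ~%luK of page tables\n",
		       sizes_mb[i], table_bytes >> 10);
	}
	return 0;
}

which prints ~128K for 64M and ~1024K (~1M) for 512M, matching the
figures above.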

>
>
> (as before: ) Instead of trying to make the relocations run quickly, can we reduce them?
> This would benefit other architectures too.

This was exactly my first approach [1] where I tried to pre-reserve
memory similar to how it is done for a crash kernel, but I was asked
to go away [2] as this is an ARM64 specific problem, where current
relocation performance is prohibitively slow. I have tested on x86,
and it does not suffer from this problem, relocation performance is
just as fast as with MMU enabled ARM64.

>
> Can the kexec core code allocate higher order pages, instead of doing everything page at
> at time?

Yes, however, failures during kexec-load due to failure to coalesce
huge pages can add extra hassle to users, and therefore this should be
only an optimization with fallback to base pages.

>
> If you have a crash kernel reservation, can we use that to eliminate the relocations
> completely?
> (I think this suggestion has been lost in translation each time I make it.
> I mean like this:
> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
> Runes to test it:
> | sudo ./kexec -p -u
> | sudo cat /proc/iomem | grep Crash
> | b0200000-f01fffff : Crash kernel
> | sudo ./kexec --mem-min=0xb0200000 --mem-max=0xf01ffffff -l ~/Image --reuse-cmdline
>
> I bet its even faster!)

There is a problem with this approach. While, with kexec_load() call
it is possible to specify physical destinations for each segment, with
kexec_file_load() it is not possible. The secure systems that do IMA
checks during kexec load require kexec_file_load(), and we cannot
ahead of time specify destinations for these segments (at least
without substantially changing common kexec code which is not going to
happen as this arm64 specific problem).

>
>
> I think 'as fast as possible' and 'memory constrained' are mutually exclusive
> requirements. We need to make the page tables optional with a single implementation.

In my opinion having two different types of relocations will only add
extra corner cases, confusion about different performance, and bugs.
It is better to have two types: 1. crash kernel type without
relocation, 2. fast relocation where MMU is enabled.

[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]/

Thank you,
Pasha

2021-02-04 01:17:30

by Eric W. Biederman

Subject: Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

Pavel Tatashin <[email protected]> writes:

> Hi James,
>
>> The problem I see with this is rewriting the relocation code. It needs to work whether the
>> machine has enough memory to enable the MMU during kexec, or not.
>>
>> In off-list mail to Pavel I proposed an alternative implementation here:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0
>>
>> By using a copy of the linear map, and passing the phys_to_virt offset into
>> arm64_relocate_new_kernel() its possible to use the same code when we fail to allocate the
>> page tables, and run with the MMU off as it does today.
>> I'm convinced someone will crawl out of the woodwork screaming 'regression' if we
>> substantially increase the amount of memory needed to kexec at all.
>>
>> From that discussion: this didn't meet Pavel's timing needs.
>> If you depend on having all the src/dst pages lined up in a single line, it sounds like
>> you've over-tuned this to depend on the CPU's streaming mode. What causes the CPU to
>> start/stop that stuff is very implementation specific (and firmware configurable).
>> I don't think we should let this rule out systems that can kexec today, but don't have
>> enough extra memory for the page tables.
>> Having two copies of the relocation code is obviously a bad idea.
>
> I understand that having an extra set of page tables could potentially
> waste memory, especially if VAs are sparse, but in this case we use
> page tables exclusively for contiguous VA space (copy [src, src +
> size]). Therefore, the extra memory usage is tiny. The ratio for
> kernels with 4K page_size is (size of relocated memory) / 512. A
> normal initrd + kernel is usually under 64M, an extra space which
> means ~128K for the page table. Even with a huge relocation, where
> initrd is ~512M the extra memory usage in the worst case is just ~1M.
> I really doubt we will have any problem from users because of such
> small overhead in comparison to the total kexec-load size.

Foolish question.

Does arm64 have something like 2M pages that it can use for the
linear map?

On x86_64 we always generate page tables, because they are necessary to
be in 64bit mode. As I recall on x86_64 we always use 2M pages which
means for each 4K of page tables we map 1GiB of memory. Which is very
tiny.
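(512 eight-byte entries per 4K table, times 2M per entry, is 1GiB; so the
table overhead is roughly 4K per GiB mapped.)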

If you do as well as x86_64 for arm64 I suspect that will be good enough
for people to not claim regression.

Would a variation on the x86_64 implementation that allocates page
tables work for arm64?

>> (as before: ) Instead of trying to make the relocations run quickly, can we reduce them?
>> This would benefit other architectures too.
>
> This was exactly my first approach [1] where I tried to pre-reserve
> memory similar to how it is done for a crash kernel, but I was asked
> to go away [2] as this is an ARM64 specific problem, where current
> relocation performance is prohibitively slow. I have tested on x86,
> and it does not suffer from this problem, relocation performance is
> just as fast as with MMU enabled ARM64.
>
>>
>> Can the kexec core code allocate higher order pages, instead of doing everything page at
>> at time?
>
> Yes, however, failures during kexec-load due to failure to coalesce
> huge pages can add extra hassle to users, and therefore this should be
> only an optimization with fallback to base pages.
>
>>
>> If you have a crash kernel reservation, can we use that to eliminate the relocations
>> completely?
>> (I think this suggestion has been lost in translation each time I make it.
>> I mean like this:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
>> Runes to test it:
>> | sudo ./kexec -p -u
>> | sudo cat /proc/iomem | grep Crash
>> | b0200000-f01fffff : Crash kernel
>> | sudo ./kexec --mem-min=0xb0200000 --mem-max=0xf01ffffff -l ~/Image --reuse-cmdline
>>
>> I bet its even faster!)
>
> There is a problem with this approach. While, with kexec_load() call
> it is possible to specify physical destinations for each segment, with
> kexec_file_load() it is not possible. The secure systems that do IMA
> checks during kexec load require kexec_file_load(), and we cannot
> ahead of time specify destinations for these segments (at least
> without substantially changing common kexec code which is not going to
> happen as this arm64 specific problem).


>> I think 'as fast as possible' and 'memory constrained' are mutually exclusive
>> requirements. We need to make the page tables optional with a single implementation.

In my experience the slowdown with disabling a CPU's cache (which
apparently happens on arm64 when the MMU is disabled) is freakishly
huge.

Enabling the cache shouldn't be 'as fast as possible' but simply
disengaging the parking brake.

> In my opinion having two different types of relocations will only add
> extra corner cases, confusion about different performance, and bugs.
> It is better to have two types: 1. crash kernel type without
> relocation, 2. fast relocation where MMU is enabled.
>
> [1] https://lore.kernel.org/lkml/[email protected]
> [2] https://lore.kernel.org/lkml/[email protected]/

As long as the page table provided is a linear mapping of physical
memory (aka it looks like paging is disabled), the code that
relocates memory should be pretty much the same.

My experience with other architectures suggests only a couple of
instructions need to be different to deal with a MMU being enabled.

Eric

2021-02-04 22:07:44

by Eric W. Biederman

Subject: Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

Pavel Tatashin <[email protected]> writes:

>> > I understand that having an extra set of page tables could potentially
>> > waste memory, especially if VAs are sparse, but in this case we use
>> > page tables exclusively for a contiguous VA space (copy [src, src +
>> > size]). Therefore, the extra memory usage is tiny. The ratio for
>> > kernels with 4K page_size is (size of relocated memory) / 512. A
>> > normal initrd + kernel is usually under 64M, which means ~128K of
>> > extra space for the page table. Even with a huge relocation, where
>> > the initrd is ~512M, the extra memory usage in the worst case is just
>> > ~1M. I really doubt we will have any problem from users because of
>> > such a small overhead in comparison to the total kexec-load size.
>
> Hi Eric,
>
>>
>> Foolish question.
>
> Thank you for your e-mail; you gave some interesting insights.
>
>>
>> Does arm64 have something like 2M pages that it can use for the
>> linear map?
>
> Yes, with 4K pages arm64 also has 2M pages, but arm64 additionally
> offers 16K and 64K granules, and the second-level blocks are bigger
> there.

>> On x86_64 we always generate page tables, because they are necessary to
>> be in 64bit mode. As I recall on x86_64 we always use 2M pages which
>> means for each 4K of page tables we map 1GiB of memory. Which is very
>> tiny.
>>
>> If you do as well as x86_64 for arm64 I suspect that will be good enough
>> for people to not claim regression.
>>
>> Would a variation on the x86_64 implementation that allocates page
>> tables work for arm64?
> ...
>>
>> As long as the page table provided is a linear mapping of physical
>> memory (aka it looks like paging is disabled), the code that
>> relocates memory should be pretty much the same.
>>
>> My experience with other architectures suggests only a couple of
>> instructions need to be different to deal with a MMU being enabled.
>
> I think what you are proposing is similar to what James proposed. Yes,
> with a linear map the relocation should be pretty much the same as the
> relocation we do with the MMU disabled.
>
> A linear map still uses memory, because the page tables must be outside
> of the destination addresses of the next kernel's segments. Therefore,
> we must allocate a page table for the linear map. It might be a little
> smaller, but in reality the difference is small with 4K pages, and
> insignificant with 64K pages. The benefit of my approach is that the
> assembly copy loop is simpler and allows hardware prefetching to
> work.
>
> The regular relocation loop works like this:
>
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
> 	addr = __va(entry & PAGE_MASK);
> 
> 	switch (entry & IND_FLAGS) {
> 	case IND_DESTINATION:
> 		dest = addr;
> 		break;
> 	case IND_INDIRECTION:
> 		ptr = addr;
> 		break;
> 	case IND_SOURCE:
> 		copy_page(dest, addr);
> 		dest += PAGE_SIZE;
> 	}
> }
>
> The entry for the next relocation page always has to be fetched first,
> and therefore prefetching cannot help with the actual copy loop.

True.

In the common case the loop looks like:
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
> 	addr = __va(entry & PAGE_MASK);
> 
> 	switch (entry & IND_FLAGS) {
> 	case IND_SOURCE:
> 		copy_page(dest, addr);
> 		dest += PAGE_SIZE;
> 	}
> }

Which is a read of the source address followed by the copy_page.
I suspect the overhead of that loop is small enough that it is swamped
by the cost of the copy_page.

If not, and a better data structure can be proposed, we can look at that.

> In comparison, the loop that I am proposing is like this:
>
> for (addr = head; addr < end; addr += PAGE_SIZE, dest += PAGE_SIZE)
> 	copy_page(dest, addr);
>
> Here is assembly code for my loop:
>
> 1:	copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
> 	sub	x11, x11, #PAGE_SIZE
> 	cbnz	x11, 1b

I think you may be hiding the cost of that loop in the page table
fetches themselves.

It is possible, though unlikely, that a page table with huge pages
(and thus lower translation costs) combined with the original loop is
actually cheaper.

> That said, if James and you agree that the linear map is the way to go
> forward, I am OK with that as well, as it is still much better than
> having no caching at all.

The big advantage of a linear map is that the kexec'd code can continue
to use it until it sets up its own page tables.

I probably did not document it well enough, but a linear map (the
equivalent of not having virtual addresses at all) was always my
intention for the hand-off state of kexec between kernels.

So please try the linear map. If it is noticeably slower than your
optimized page table, share the numbers and we can see if there is a way
to improve the generic kexec data structures.

Eric

2021-02-05 00:47:11

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v11 0/6] arm64: MMU enabled kexec relocation

> > I understand that having an extra set of page tables could potentially
> > waste memory, especially if VAs are sparse, but in this case we use
> > page tables exclusively for a contiguous VA space (copy [src, src +
> > size]). Therefore, the extra memory usage is tiny. The ratio for
> > kernels with 4K page_size is (size of relocated memory) / 512. A
> > normal initrd + kernel is usually under 64M, which means ~128K of
> > extra space for the page table. Even with a huge relocation, where
> > the initrd is ~512M, the extra memory usage in the worst case is just
> > ~1M. I really doubt we will have any problem from users because of
> > such a small overhead in comparison to the total kexec-load size.

Hi Eric,

>
> Foolish question.

Thank you for your e-mail; you gave some interesting insights.

>
> Does arm64 have something like 2M pages that it can use for the
> linear map?

Yes, with 4K pages arm64 also has 2M pages, but arm64 additionally
offers 16K and 64K granules, and the second-level blocks are bigger
there.
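
(For reference, with the standard arm64 translation granules the
second-level block sizes are:

	4K granule  -> 2M blocks
	16K granule -> 32M blocks
	64K granule -> 512M blocks)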

> On x86_64 we always generate page tables, because they are necessary to
> be in 64bit mode. As I recall on x86_64 we always use 2M pages which
> means for each 4K of page tables we map 1GiB of memory. Which is very
> tiny.
>
> If you do as well as x86_64 for arm64 I suspect that will be good enough
> for people to not claim regression.
>
> Would a variation on the x86_64 implementation that allocates page
> tables work for arm64?
...
>
> As long as the page table provided is a linear mapping of physical
> memory (aka it looks like paging is disabled), the code that
> relocates memory should be pretty much the same.
>
> My experience with other architectures suggests only a couple of
> instructions need to be different to deal with a MMU being enabled.

I think what you are proposing is similar to what James proposed. Yes,
with a linear map the relocation should be pretty much the same as the
relocation we do with the MMU disabled.

A linear map still uses memory, because the page tables must be outside
of the destination addresses of the next kernel's segments. Therefore,
we must allocate a page table for the linear map. It might be a little
smaller, but in reality the difference is small with 4K pages, and
insignificant with 64K pages. The benefit of my approach is that the
assembly copy loop is simpler and allows hardware prefetching to
work.
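
As a rough sketch of that constraint (the helper names below are made
up for illustration and are not the real trans_pgd/kexec interfaces),
building the linear map could look something like this, with every
page-table page coming from memory known not to overlap the next
kernel's segment destinations:

/* Made-up prototypes, for the sketch only: */
typedef void *(*pgtable_alloc_fn)(void);
int map_linear_block(pgd_t *pgd, phys_addr_t pa, pgtable_alloc_fn alloc);
void *alloc_pgtable_page_outside_dest(void);

/* Needs <linux/memblock.h> for for_each_mem_range(); alignment handling
 * of region boundaries is omitted to keep the sketch short. */
static int build_kexec_linear_map(pgd_t *trans_pgd)
{
	phys_addr_t start, end;
	u64 i;

	for_each_mem_range(i, &start, &end) {
		phys_addr_t addr;

		/* Map addr -> addr with 2M blocks (4K granule assumed);
		 * larger granules would use larger blocks. */
		for (addr = start; addr < end; addr += PMD_SIZE) {
			int rc = map_linear_block(trans_pgd, addr,
						  alloc_pgtable_page_outside_dest);
			if (rc)
				return rc;
		}
	}
	return 0;
}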

The regular relocation loop works like this:

for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
	addr = __va(entry & PAGE_MASK);

	switch (entry & IND_FLAGS) {
	case IND_DESTINATION:	/* set the current destination page */
		dest = addr;
		break;
	case IND_INDIRECTION:	/* continue walking the next indirection page */
		ptr = addr;
		break;
	case IND_SOURCE:	/* copy one source page and advance */
		copy_page(dest, addr);
		dest += PAGE_SIZE;
	}
}

The entry for the next relocation page always has to be fetched first,
and therefore prefetching cannot help with the actual copy loop.

In comparison, the loop that I am proposing is like this:

for (addr = head; addr < end; addr += PAGE_SIZE, dest += PAGE_SIZE)
	copy_page(dest, addr);

Here is assembly code for my loop:

/* copy_page copies one page from [x2] to [x1] and advances both pointers;
 * x11 holds the number of bytes left to relocate. */
1:	copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
	sub	x11, x11, #PAGE_SIZE
	cbnz	x11, 1b

That said, if James and you agree that the linear map is the way to go
forward, I am OK with that as well, as it is still much better than
having no caching at all.

Thank you,
Pasha